A unique hash function for a string of 76 characters

Here is my problem (I program in C):

I have huge text files containing DNA sequences (each file has approximately 65 million lines and a size of about 4 ~ 5 GB). There are many duplicates in these files (I don’t know how many more, but there should be many millions), and I want to return only individual values ​​to the output file. Each row has a quality-related value, so if I have 5 identical rows with different quality values, I will keep the best and discard the remaining 4.

Reducing memory requirements and increasing speed efficiency as far as I can - VITAL. My idea was to create a JudyHS array using a hash function to convert a string string String (76 letters long and have 7 possible characters) to an integer to reduce memory usage (4 or 8 bytes instead of 76 bytes by many millions records should be achievable). That way, I could use an integer as an index and store only the best quality for that index. The problem is that I cannot find a hash function that UNIVOCALLY defines such a long string and creates a value that can be stored in integer or even long long!

My first idea for a hash function was something like the default hash function in Java: s [0] * 31 ^ (n-1) + s [1] * 31 ^ (n-2) + ... + s [n-1], but I could get the maximum value of 8.52 * 10 ^ 59 .. way tooo big. How about doing the same thing and storing it twice? Will the calculations be much slower? Please note that I would like UNIVOCALLY to define the string, avoiding collisions (or at least they should be extremely rare, because I would have to access the disk at every collision, a rather expensive operation ...)

+3
source share
2 answers

Do you have 7 ^ 76 possible DNA sequences and want to map them to 2 ^ 32 no-collision hashes? Impossible.

log2 (7 ^ 76) = 214 , 27 .

, CRC32 md5, .

+3

"" - N - (, -) , N 2. 65 - 26 (2 26 65 ), 52 " ".

-, "", . MD4, MD5, SHA-1... ( ) 64 , 64- . , 65 ; , .

C -, lookup sphlib. sph_dec64le() "" 8 64- .

+1

All Articles