Here is my problem (I program in C):
I have huge text files containing DNA sequences (each file has approximately 65 million lines and a size of about 4 ~ 5 GB). There are many duplicates in these files (I don’t know how many more, but there should be many millions), and I want to return only individual values to the output file. Each row has a quality-related value, so if I have 5 identical rows with different quality values, I will keep the best and discard the remaining 4.
Reducing memory requirements and increasing speed efficiency as far as I can - VITAL. My idea was to create a JudyHS array using a hash function to convert a string string String (76 letters long and have 7 possible characters) to an integer to reduce memory usage (4 or 8 bytes instead of 76 bytes by many millions records should be achievable). That way, I could use an integer as an index and store only the best quality for that index. The problem is that I cannot find a hash function that UNIVOCALLY defines such a long string and creates a value that can be stored in integer or even long long!
My first idea for a hash function was something like the default hash function in Java: s [0] * 31 ^ (n-1) + s [1] * 31 ^ (n-2) + ... + s [n-1], but I could get the maximum value of 8.52 * 10 ^ 59 .. way tooo big. How about doing the same thing and storing it twice? Will the calculations be much slower? Please note that I would like UNIVOCALLY to define the string, avoiding collisions (or at least they should be extremely rare, because I would have to access the disk at every collision, a rather expensive operation ...)
source
share