Choosing a data structure for high-speed and efficient memory for duplicate row detection

Question

Choosing a data structure for high-speed and efficient memory for duplicate row detection

I have an interesting problem that can be solved in several ways:

I have a function that takes a string.
If this function has never encountered this line before, it should do some processing.
If the function saw the line earlier, it needs to skip processing.
After a certain time, the function should accept duplicate lines.
This function can be called thousands of times per second, and the string data can be very large.

This is a very abstract explanation of a real application, just trying to get to the basic concept for the purpose of the question.

The function must maintain state to detect duplicates. He will also need to keep the associated timestamp in order to expire duplicates.

There is no need to store the strings, the unique hash of the string will be in order if there are no false positives due to collisions (use the perfect hash?), And the hash function was quite high.

A naive implementation will be simple (in C #):

 Dictionary<String,DateTime>

although in the interest of reducing memory and potentially increasing performance, I am evaluating custom data structures to handle this instead of the underlying hash table.

So, given these limitations, what would you use?

EDIT, additional information that may change the proposed implementations:

99% of the lines will not be duplicated.
Almost all duplicates will come back almost sequentially.
, .

+3

c# hash deduplication

Jonathan Holland 14 . '12 4:51

4

Alexei Levenkov · Answer 1 · 2012-04-14T05:03:23+0000

, " ", ( # int ). , .

, - , . , , . :

stringLocation.fileName="file13.txt";
stringLocation.fromOffset=100;
stringLocation.toOffset=345;
expiration= "2012-09-09T1100";
hashCode = 123456;

cutomom hashCode , , .

Guffa · Answer 2 · 2012-04-14T05:04:28+0000

, -

, , - .

- , , .

- , . , , .

Jim Mischel · Answer 3 · 2012-04-14T05:13:49+0000

, Hashset<string> Tuple<DateTime, String>. - :

Hashset<string> Strings = new HashSet<string>();
Queue<Tuple<DateTime, String>> Expirations = new Queue<Tuple<DateTime, String>>();

, :

if (Strings.Add(s))
{
    // string is new. process it.
    // and add it to the expiration queue
    Expirations.Enqueue(new Tuple<DateTime, String>(DateTime.Now + ExpireTime, s));
}

- . , , , :

while (Expirations.Count > 0 && Expirations.Peek().Item1 < DateTime.Now)
{
    var e = Expirations.Dequeue();
    Strings.Remove(e.Item2);
}

Hashset. , , .

, DateTime.Now. a Stopwatch, , ElapsedMilliseconds. , , ( NTP) .

, , .

" ":

, ConcurrentDictionary, Hashset BlockingCollection, Queue. lock .

, 99% , , .

Ňuf · Answer 4 · 2012-04-14T05:44:37+0000

, :

1) , - ( ). - (MD5, SHA1 ..) , , .

2) Use some kind of lossless compression. Strings usually have a good compression ratio (about 10%), and some algorithms, such as ZIP, allow you to choose between fast (and less efficient) and slow (high compression) compression. Another way to compress strings is to convert them to UTF8, which is fast and easy and has an almost 50% compression ratio for strings other than unicode.

No matter which one you choose, it always combines memory with memory and hash / compression speed. You will probably need to conduct a comparative analysis to select the best solution.

Choosing a data structure for high-speed and efficient memory for duplicate row detection

More articles: