I have an interesting problem that can be solved in several ways:
- I have a function that takes a string.
- If the function has never seen this string before, it should do some processing.
- If the function has seen the string before, it should skip the processing.
- After a certain amount of time, the function should accept a previously seen string again.
- This function can be called thousands of times per second, and the string data can be very large.
This is a very abstract description of a real application; I am just trying to get down to the basic concept for the purposes of the question.
The function must maintain state to detect duplicates. It will also need to keep an associated timestamp in order to expire duplicates.
There is no need to store the strings themselves: a hash of each string would be fine, provided there are no false positives due to collisions (use a perfect hash?) and the hash function is fast enough.
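To make the "store a hash, not the string" idea concrete, here is a minimal sketch. It assumes .NET 5 or later (for `SHA256.HashData` and `Convert.ToHexString`), and `LineKey` is a hypothetical helper name. Note that SHA-256 is not a perfect hash: collisions are astronomically unlikely rather than impossible, and a cryptographic digest may be slower than the application can afford.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: reduces a string of any length to a fixed-size key.
public static class LineKey
{
    // Returns the SHA-256 digest of the UTF-8 bytes as 64 uppercase hex characters.
    public static string Of(string line)
    {
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(line));
        return Convert.ToHexString(digest);
    }
}
```

The key is always 32 bytes (64 hex characters) regardless of how large the input string is, so the memory cost per entry becomes constant.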
A naive implementation would be simple (in C#):
Dictionary<String,DateTime>
In the interest of reducing memory use and potentially improving performance, though, I am evaluating custom data structures to handle this instead of a plain hash table.
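For reference, the naive approach could look roughly like the sketch below. The names `DuplicateFilter`, `ShouldProcess` and `Purge` are made up for illustration, the clock is passed in as a parameter so the expiry logic is easy to exercise, and the class is not thread-safe. It is a baseline to compare against, not a tuned implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical baseline: remembers when each string was last accepted.
public sealed class DuplicateFilter
{
    private readonly Dictionary<string, DateTime> _seen = new Dictionary<string, DateTime>();
    private readonly TimeSpan _ttl;

    public DuplicateFilter(TimeSpan ttl)
    {
        _ttl = ttl;
    }

    // Returns true if the caller should process the string,
    // false if it is a duplicate that has not expired yet.
    public bool ShouldProcess(string line, DateTime now)
    {
        if (_seen.TryGetValue(line, out DateTime acceptedAt) && now - acceptedAt < _ttl)
        {
            return false;
        }
        _seen[line] = now; // first sighting, or the previous one has expired
        return true;
    }

    // Removes expired entries so the dictionary does not grow without bound.
    // Returns the number of entries removed.
    public int Purge(DateTime now)
    {
        var expired = new List<string>();
        foreach (KeyValuePair<string, DateTime> pair in _seen)
        {
            if (now - pair.Value >= _ttl)
            {
                expired.Add(pair.Key);
            }
        }
        foreach (string key in expired)
        {
            _seen.Remove(key);
        }
        return expired.Count;
    }
}
```

The weak points are visible here: every distinct string is kept in full as a key, and `Purge` has to walk the whole dictionary, which is what motivates looking at other structures.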
So, given these limitations, what would you use?
EDIT, additional information that may change the proposed implementations:
- 99% of the strings will not be duplicated.
- Almost all duplicates will arrive almost back to back.