BST or Hash Table?

I have large text files on which I need to perform all kinds of operations, mainly related to checking line by line. The data, as a rule, has the character of sales / transactions and, as a rule, contain a huge amount of redundant information in different lines, such as customer names. Iterating and manipulating this data has become such a common task that I am writing a library in C that I hope to make available as a Python module.

In one test, I found that out of 1.3 million column values, only ~ 300,000 were unique. The memory overhead is because our Python-based web application can handle simultaneous requests for large datasets.

My first attempt was to read in a file and insert each column value into a binary search tree. If the value has never been noticed before, memory is allocated for storing the string; otherwise, a pointer to the existing storage for this value is returned. This works well for datasets of ~ 100,000 rows. Much more, and everything stops, and memory consumption is growing rapidly. I assume that the overhead of all of these node pointers in the tree does not help, and using strcmp for binary search becomes very painful.

This unsatisfactory performance makes me think that I should invest in using a hash table instead. This, however, raises another point - I do not know in advance how many records there are. It could be 10 or ten million. How can I apply the right balance of time / space to prevent the hash table from resizing again and again?

What are the best candidates for data structure in such a situation?

Thank you for your time.

+3
source share
3 answers

, , . (, 50%), O(1). , n ( n ) , n, ( , , ). , , , , 1000000 1500000 , , 500000 , . , -.

+1

-. -, , , . - ( ) , . , n - -, :

n=n*2;
bucket=realloc(bucket, sizeof(bucket)*n);
for (i=0,j=n/2; j<n; i++,j++) {
  bucket[j]=bucket[i];
}
0

C, Python

Python -, . / Python. . , , , , Cython.

:

shared_table = {}
string_sharer = shared_table.setdefault

:

for i, field in enumerate(fields):
    fields[i] = string_sharer(field, field)

You can, of course, find after examining each column that some columns are badly compressed and should be excluded from "scrunching".

0
source

All Articles