I have a huge text file (more than the available RAM). I need to count the frequency of all words and output the word and frequency to a new file. The result should be sorted in decreasing order of frequency.
My approach:
- Sorting this file - appearance
- Read the frequency of each word sequentially, save the score in another file (along with the word)
- Sort output file by frequency - external sorting.
I want to know if there are better approaches for this. Have I heard of drive-based hash tables? or B +, but have never tried them before.
Note. I saw similar questions asked in SO, but none of them should solve the problem with data larger than memory.
Edit: Based on the comments, I agreed that, in practice, a dictionary should fit into the memory of today's computers. But let's take a hypothetical dictionary of words, which is huge enough not to fit into memory.
source
share