Count the word frequency of a huge text file

I have a huge text file (more than the available RAM). I need to count the frequency of all words and output the word and frequency to a new file. The result should be sorted in decreasing order of frequency.

My approach:

  • Sorting this file - appearance
  • Read the frequency of each word sequentially, save the score in another file (along with the word)
  • Sort output file by frequency - external sorting.

I want to know if there are better approaches for this. Have I heard of drive-based hash tables? or B +, but have never tried them before.

Note. I saw similar questions asked in SO, but none of them should solve the problem with data larger than memory.

Edit: Based on the comments, I agreed that, in practice, a dictionary should fit into the memory of today's computers. But let's take a hypothetical dictionary of words, which is huge enough not to fit into memory.

+5
source share
4 answers

I would go with the approach map reduce:

  • Distribute your text file on the nodes if each text in the node can fit in RAM.
  • Calculate each word frequency in node. (using hash tables)
  • Collect each result in the master node and combine them all.
+13
source

All the unique words probably fit into memory, so I would use this approach:

  • Create a dictionary ( HashMap<string, int>).
  • Read the huge text file line by line.
  • 1.
  • 1 .

, :

  • .
  • .

, .

+5

- Multimap (, Guava). , (, Berkeley JE DB MapDB). , . MapDB , .

+2

( ), - ( ).

- .

+1

All Articles