Count the word frequency of a huge text file

Question

Count the word frequency of a huge text file

I have a huge text file (more than the available RAM). I need to count the frequency of all words and output the word and frequency to a new file. The result should be sorted in decreasing order of frequency.

My approach:

Sorting this file - appearance
Read the frequency of each word sequentially, save the score in another file (along with the word)
Sort output file by frequency - external sorting.

I want to know if there are better approaches for this. Have I heard of drive-based hash tables? or B +, but have never tried them before.

Note. I saw similar questions asked in SO, but none of them should solve the problem with data larger than memory.

Edit: Based on the comments, I agreed that, in practice, a dictionary should fit into the memory of today's computers. But let's take a hypothetical dictionary of words, which is huge enough not to fit into memory.

+5

java algorithm data-structures text-files word-count

vikky.rk Feb 07 '13 at 8:09

source share

4 answers

All the unique words probably fit into memory, so I would use this approach:

Create a dictionary ( HashMap<string, int>).
Read the huge text file line by line.
1.
1 .

, :

.
.

, .

+5

Sani Singh Huttunen 07 . '13 8:16

- Multimap (, Guava). , (, Berkeley JE DB MapDB). , . MapDB , .

+2

pushy 07 . '13 8:24

( ), - ( ).

- .

+1

Matteo 07 . '13 8:15

ogzd · Accepted Answer · 2013-02-07T08:15:39+0000

I would go with the approach map reduce:

Distribute your text file on the nodes if each text in the node can fit in RAM.
Calculate each word frequency in node. (using hash tables)
Collect each result in the master node and combine them all.

Count the word frequency of a huge text file

More articles: