Build a system to support the top k frequent words in real time

Suppose we want a system that keeps the top k most frequent words appearing in tweets over the last hour. How would you design it?

I can think of a hash map, a heap, a log, or MapReduce, but I cannot find a really efficient way to do this.

Actually, this was an interview question.
First, I used a hash map to count the frequency of each word. I also kept a log so that, as time passes, I can count down the oldest word occurrences. Then I kept an array of K entries (the top K) and a number N, the smallest count in that array.
Every time a new word arrives, I update its count in the hash map and read the new count. If it is greater than N, I check whether the word is already in the array. If so, I update that entry. If not, I remove the smallest entry from the array and insert the new word (updating N accordingly).

Here is the problem: my approach cannot handle deletion. When counts expire, I may need to iterate over the entire hash map to find the new top K.
Also, as the interviewer said, the system should return the result very quickly. I thought of having several machines work together, each handling a subset of the words. However, combining the results then becomes a problem too.
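
For reference, the baseline described in the question (a hash map of counts plus a time-ordered log) might look like the sketch below. It is only an illustration: the class name `WindowedCounts`, its methods, and the use of plain numeric timestamps are assumptions, not part of the question. The top k is recomputed with a full scan of the hash map, which is exactly the costly step the question wants to avoid.

```python
import heapq
from collections import Counter, deque


class WindowedCounts:
    """Hash map of counts plus a log of (timestamp, word) occurrences."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.counts = Counter()
        self.log = deque()  # oldest occurrence on the left

    def add(self, word, now):
        self.expire(now)
        self.counts[word] += 1
        self.log.append((now, word))

    def expire(self, now):
        # Count down every occurrence that has fallen out of the window.
        while self.log and self.log[0][0] <= now - self.window:
            _, old = self.log.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top_k(self, k, now):
        self.expire(now)
        # Full scan over all distinct words: O(N log k).
        return heapq.nlargest(k, self.counts.items(), key=lambda kv: kv[1])


wc = WindowedCounts(window_seconds=60)
events = [(0, "a"), (1, "b"), (2, "a"), (3, "a"), (50, "c"), (61, "b"), (62, "b")]
for t, w in events:
    wc.add(w, t)
print(wc.top_k(1, 62))
```

Expiry itself is cheap here, because the log is ordered by time; the expensive part is that nothing keeps the counts sorted between queries.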

+3
4 answers

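If the words are not weighted (other than weights 0 and 1), a simple data structure can keep the word counts in order using O(N) auxiliary storage, where N is the number of unique words encountered in the sliding window (one hour, in the example). All operations (add a word, expire a word, look up the most frequent word) can be performed in O(1) time. Since any exact solution has to retain all the unique words in the window, this solution is not asymptotically worse, although the constant factor per word is not small.

The key observation is that the count of any word can only change by 1, and all counts are integers. So we can maintain a doubly linked list of counts (in sorted order), where each node in the list points to a doubly linked list of the words that have that count. In addition, each node in a word list points back to its count node. Finally, we keep a hash map, so that we can find the node for a given word.

To expire words at the end of their lifetime, we also have to keep the whole data stream for the window, which takes O(N') space, where N' is the total number of words seen during the window. This can be stored as a singly linked list in which each node holds a timestamp and a pointer to the word's node in its word list.

When a word is seen or expires, its count must be adjusted. Because the count changes by exactly 1, the adjustment always moves the word to the adjacent count node (which may or may not exist yet); since the count nodes form a sorted linked list, the adjacent node can be found or created in O(1). The most popular words (and their counts) can then always be read off by walking the count list backwards from the maximum.

In case that is not obvious, here is a rough ASCII drawing of the data structure at some point in time: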

Count list      word lists (each node points back to the count node)

  17            a <--> the <--> for
  ^
  |
  v
  12            Wilbur <--> drawing
  ^
  |
  v
  11            feature

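Now suppose we see a Wilbur. That raises its count to 13; since the successor of 12 is not 13, a 13 count node has to be created and inserted into the count list. After that, we remove Wilbur from its current word list, put it into the new, empty word list of the new count node, and change Wilbur's count pointer to point to the new node.

Then suppose one use of drawing expires, so its new count is 11. Since the predecessor of 12 is 11, no new count node is needed; we simply remove drawing from its word list and attach it to the word list of 11, fixing its count pointer as we do so. Now the word list of 12 is empty, so the 12 count node can be removed from the count list and deleted.

When a word's count reaches 0, rather than attaching it to a 0 count node, which does not exist, we simply delete the word node. And when a new word is seen, we add it to the 1 count node, creating that node if it does not exist.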

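In the worst case, every word has a distinct count, so the count list can never be longer than the number of unique words. The word lists together hold exactly the unique words, because every word is in exactly one word list, and fully expired words do not appear in any word list at all.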
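
The count-list structure drawn above can be sketched in a few dozen lines. This is a minimal illustration under some assumptions: the names `CountNode` and `TopWords` are made up, each word list is a Python `dict` used as an ordered set (it gives the same O(1) insert and remove as a doubly linked list), and the expiry log is a `deque` of `(timestamp, word)` pairs. The demo rebuilds the state from the drawing, then sees one more Wilbur and expires one use of drawing.

```python
from collections import deque


class CountNode:
    def __init__(self, count):
        self.count = count
        self.words = {}    # dict used as an insertion-ordered set of words
        self.prev = None   # neighbour with the next lower count
        self.next = None   # neighbour with the next higher count


class TopWords:
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.node_of = {}                  # word -> its CountNode
        self.lowest = self.highest = None
        self.log = deque()                 # (timestamp, word), oldest first

    def _link(self, node, prev, nxt):
        node.prev, node.next = prev, nxt
        if prev:
            prev.next = node
        else:
            self.lowest = node
        if nxt:
            nxt.prev = node
        else:
            self.highest = node

    def _unlink(self, node):
        if node.prev:
            node.prev.next = node.next
        else:
            self.lowest = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.highest = node.prev

    def _shift(self, word, delta):
        # Move word to the adjacent count node; delta is +1 or -1.
        old = self.node_of.get(word)
        target = (old.count if old else 0) + delta
        if old is None:
            prev, nxt = None, self.lowest
        elif delta > 0:
            prev, nxt = old, old.next
        else:
            prev, nxt = old.prev, old
        if old:
            del old.words[word]
        if target == 0:
            del self.node_of[word]         # no 0 count node: drop the word
        else:
            near = nxt if delta > 0 else prev
            if near and near.count == target:
                dest = near                # adjacent count node already exists
            else:
                dest = CountNode(target)   # create it in sorted position
                self._link(dest, prev, nxt)
            dest.words[word] = None
            self.node_of[word] = dest
        if old and not old.words:
            self._unlink(old)              # remove an emptied count node

    def add(self, word, now):
        self.expire(now)
        self._shift(word, +1)
        self.log.append((now, word))

    def expire(self, now):
        while self.log and self.log[0][0] <= now - self.window:
            _, word = self.log.popleft()
            self._shift(word, -1)

    def top_k(self, k):
        out, node = [], self.highest       # walk down from the maximum
        while node and len(out) < k:
            for word in node.words:
                if len(out) == k:
                    break
                out.append((word, node.count))
            node = node.prev
        return out


tw = TopWords(window_seconds=3600)
tw.add("drawing", 0)                       # the one use that will expire
for word, n in [("a", 17), ("the", 17), ("for", 17),
                ("Wilbur", 12), ("drawing", 11), ("feature", 11)]:
    for _ in range(n):
        tw.add(word, 100)
tw.add("Wilbur", 200)                      # Wilbur moves to a new 13 node
tw.expire(3600)                            # drawing drops to 11; 12 node goes
print(tw.top_k(4))
```

Every update touches only the word's current count node and one neighbour, so `add` and each single expiry are O(1), and `top_k` does O(k) work.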

---

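This algorithm is a bit RAM-hungry, but it really should have no trouble holding an hour's worth of tweets. Or even a day's worth. And the number of unique words will not change much after a few days, even allowing for abbreviations and misspellings. Even so, it is worth thinking about ways to reduce the memory footprint and/or make the algorithm parallel.

To reduce the memory footprint, the easiest thing is to drop words that are still unique after a few minutes. That dramatically cuts the number of unique words without changing the counts of the popular ones. Indeed, you could prune a lot more aggressively without changing the final result.

To run the algorithm in parallel, individual words can be assigned to different machines by using a hash function to produce a machine number. (Not the same hash function as the one used for the hash tables.) The top k words are then found by merging the top k words from each machine; allocation by hash guarantees that the sets of words on the machines are disjoint.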
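
One way to split the work across machines, and to deal with the merging concern raised in the question, is to assign each word to a shard by hash and then merge the per-shard top k lists. The sketch below simulates the shards as in-process `Counter` objects; `shard_of` and `merged_top_k` are hypothetical helper names, and SHA-256 is used only as an example of a stable hash that is independent of the one inside the hash tables.

```python
import hashlib
import heapq
from collections import Counter


def shard_of(word, num_shards):
    # Stable hash, so a given word always lands on the same shard.
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merged_top_k(shard_counts, k):
    # Each shard contributes its own top k. Because every word lives on
    # exactly one shard, its local count is also its global count.
    candidates = []
    for counts in shard_counts:
        candidates.extend(
            heapq.nlargest(k, counts.items(), key=lambda kv: kv[1]))
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])


words = "a b a c b a d d d d".split()
shards = [Counter() for _ in range(3)]
for w in words:
    shards[shard_of(w, 3)][w] += 1

top2 = merged_top_k(shards, 2)
print(top2)
```

The merge only ever looks at k entries per shard, so its cost does not grow with the number of distinct words.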

+5

. , - "Lossy Counting" "Sticky Sampling", , , . .
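
For readers who have not met it, here is a minimal sketch of Lossy Counting, one of the two algorithms named above, in its usual textbook form. It is an assumption that this is the variant the answer has in mind, and the class name `LossyCounter` and its methods are made up for illustration. Note that it approximates counts over the whole stream seen so far and does not expire old items, so it would still have to be adapted to a one-hour window.

```python
import math


class LossyCounter:
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {w: e for w, e in self.entries.items()
                            if e[0] + e[1] > bucket}

    def frequent(self, support):
        # Items whose true frequency may be at least support * n.
        threshold = (support - self.epsilon) * self.n
        return {w: e[0] for w, e in self.entries.items()
                if e[0] >= threshold}


lc = LossyCounter(epsilon=0.25)
for item in ["x", "y", "x", "z", "x", "w", "x", "v"]:
    lc.add(item)
print(lc.frequent(0.5))
```

The stored count of an item undercounts its true count by at most epsilon times the number of items seen, which is the price paid for keeping only a small table.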

: ( , )

, algos per se, , , 60 , , . , . 1 .

, , , , , , Rici algo , - , . % , .

, . Rici, , . - 100/ → 100 / 1h → 7d, .

Hastables, Rici algo, , .

0

, : -

  • , hashmap, , .
  • hashmap .
  • , k .
  • , (-1, ).
  • , , , heapify , .
  • , .

: -

top k: - O(logk) heapify, insert, delete

: O(|W|) where |W| is length of word

: O(k)

, HashMap, : - O(N) N -

0

You can use a TreeMap, which is basically a sorted map. In Java, you can make the TreeMap list its entries in descending order (by supplying a reversed Comparator, or by overriding compareTo in the Comparable key class). In that case, the first k entries after the specified time period give you the result.

-1