The most efficient way / library to detect predefined keywords in billions of lines?

Say I have several billion lines of text and several million keywords. The task is to go through these lines and determine which lines contain which keywords. In other words, given the maps (K1 -> V1) and (K2 -> V2), create a map (K2 -> K1), where K1 = lineID, V1 = text, K2 = keywordID and V2 = keyword. Please also note that:

  • All text and keywords are in English.
  • Text (V1) may contain spelling errors.
  • Most keywords (V2) are single words, but some keywords may consist of several English words (for example, “clean towel”).

My initial idea for solving this problem is:

1) Split all keywords into single words and
   build a large set of single words (K3)
2) Construct a BK-tree from these single words,
   using Levenshtein distance
3) For each line of data (V1),
    3.1) Split the text (V1) into words
    3.2) For each of these words,
        3.2.1) Retrieve the words (K3) from the BK-tree that
               are close enough to it
    3.3) Since at this point we still have false positives
         (e.g. we would have matched "clean" from "clean water" against
         the keyword "clean towel"), check all possible combinations
         using a trie of keywords (V2) to filter such false
         positives out. The trie is constructed so that at the
         end of a successful match, the keywordID (K2) can be retrieved.
    3.4) Return the correct set of keywordIDs (K2) for this line (V1)!
4) Profit!
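
Steps 2 and 3.2.1 could be sketched roughly as follows. This is only a minimal in-memory illustration, not a tuned implementation: the class name `BkTree` and its methods `add` and `search` are hypothetical names chosen for this example, and the distance threshold is an assumption that would need tuning against real data.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative BK-tree over words, using Levenshtein distance as the metric.
// All names here are hypothetical; this is a sketch, not a library API.
class BkTree {
    private static final class Node {
        final String word;
        // Child nodes keyed by their distance to this node's word.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    // Two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1], prev[j]) + 1, prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Step 2: insert one single word (K3) into the tree.
    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return; // word is already in the tree
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    // Step 3.2.1: all stored words within maxDistance of the query word.
    List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        if (root == null) return hits;
        Deque<Node> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            int d = levenshtein(query, node.word);
            if (d <= maxDistance) hits.add(node.word);
            // Triangle inequality: only subtrees whose edge distance lies in
            // [d - maxDistance, d + maxDistance] can contain matches.
            for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
                if (Math.abs(e.getKey() - d) <= maxDistance) pending.push(e.getValue());
            }
        }
        return hits;
    }
}
```

With the words "clean", "towel", "water" and "clear" added, `search("towle", 2)` returns only "towel", because plain Levenshtein distance counts a transposition as two edits. The phrase check of step 3.3 is not covered by this sketch.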

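  • Is this a sound strategy? Is it efficient enough? Is there a better way?
  • Are there any libraries I could use, preferably for Java?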

Thanks!


/2D-. . . , Hadoop map/reduce?


What you are building is, in effect, an inverted index (K2 -> K1); see http://en.wikipedia.org/wiki/Inverted_index.
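
To make the inverted-index idea concrete, here is a small in-memory sketch. It is an illustration under simplifying assumptions, not how Lucene implements it: the class `InvertedIndexSketch` and its methods are hypothetical names, matching is exact (no tolerance for spelling errors), and keeping every tokenized line in memory would not scale to billions of lines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory inverted index: word -> IDs of lines containing it.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Tokenized lines, kept only so that multi-word keywords can be verified.
    private final Map<Integer, List<String>> tokensByLine = new HashMap<>();

    // Lowercases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Indexes one line (K1 -> V1).
    void addLine(int lineId, String text) {
        List<String> tokens = tokenize(text);
        tokensByLine.put(lineId, tokens);
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(lineId);
        }
    }

    // IDs of the lines that contain the keyword as a contiguous phrase.
    Set<Integer> linesContaining(String keyword) {
        List<String> words = tokenize(keyword);
        Set<Integer> result = new TreeSet<>();
        if (words.isEmpty()) return result;
        // Intersect the posting lists of all words in the keyword.
        Set<Integer> candidates =
            new TreeSet<>(postings.getOrDefault(words.get(0), Collections.emptySet()));
        for (String w : words.subList(1, words.size())) {
            candidates.retainAll(postings.getOrDefault(w, Collections.emptySet()));
        }
        // Keep only lines where the words appear next to each other, in order.
        for (int id : candidates) {
            if (Collections.indexOfSubList(tokensByLine.get(id), words) >= 0) result.add(id);
        }
        return result;
    }

    // Builds the requested map K2 -> K1 for a map of keywords (K2 -> V2).
    Map<Integer, Set<Integer>> match(Map<Integer, String> keywords) {
        Map<Integer, Set<Integer>> out = new TreeMap<>();
        for (Map.Entry<Integer, String> e : keywords.entrySet()) {
            out.put(e.getKey(), linesContaining(e.getValue()));
        }
        return out;
    }
}
```

The intersection step removes the "clean water" versus "clean towel" false positive cheaply, and the contiguity check then confirms the phrase itself. A real index would typically store word positions in the posting lists instead of keeping the tokenized lines.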

Lucene/Solr (open source/free) builds this kind of index, and it can be read through the Lucene API (see the "IndexReader" javadoc in Lucene).

Lucene, : 1) 2) - (), K2 -> K1, , .

, K2 -> K1, , , Lucene.

SOLR , .

EDIT: LUKE is a tool for inspecting Lucene indexes (https://code.google.com/p/luke/)
