Parsing large amounts of text based on a constant set of search terms

Question

I have a set of search terms such as [+dog -"jack russels" +"fox terrier"], [+cat +persian -tabby]. They can be quite long, with perhaps 30 sub-terms making up each term.

Now I have online news articles such as ["My Fox Terrier is the cutest dog in the world ..."] and ["Has anyone seen my lost Persian cat? Missing ..."]. They are not too long, possibly no more than 500 characters.

Traditional search engines expect a huge number of articles that are pre-processed into indexes, which speeds up the search for a given search term by using set theory / Boolean logic to narrow the articles down to only those that match. In my situation, however, the number of search terms is on the order of 10^5, and I would like to process one article at a time to find ALL the search terms that the article matches (i.e. all of a term's + sub-terms are in the text and none of its - sub-terms are).

I have a possible solution using two maps (one for positive sub-terms, one for negative sub-terms), but I don't think it will be very efficient.

First prize would be a library that solves this problem; second prize would be a push in the right direction to solve it.

Yours faithfully,

+5

java algorithm

Noxville May 18 '12 at 10:05

2 answers

, , , , , .

-, ( +, -), , ( ). , "", ! , . "", . : " ?". , , , " ", , "" . , "Loo ooo ooo ooo ooo ong" - "" .

0

Helium 18 '12 10:50

Stefan Haustein · Accepted Answer · 2012-05-18T11:26:55+0000

Assuming that all positive sub-terms are required for matching:

Put all the sub-terms from your search terms into a hash table. The sub-term is the key; the value is a pointer to the full search-term data structure (which should include a unique ID and a map from each sub-term to a boolean).

In addition, while processing a news item, create a "candidate" map indexed by term ID. Each candidate structure has a pointer to the term definition, a set containing the sub-terms seen so far, and a "rejected" flag.
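The two structures described above might look roughly like this in Java. This is only a sketch: the names `SearchTerm`, `Candidate` and `TermIndex` are made up for illustration, the boolean is assumed to mean "positive sub-term", and the hash table value is assumed to be a list of terms, since several search terms may share one sub-term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** One search term, e.g. [+dog -"jack russels" +"fox terrier"] (hypothetical class). */
final class SearchTerm {
    final int id;                                               // unique identifier
    final Map<String, Boolean> subTerms = new HashMap<>();      // sub-term -> true if "+", false if "-"
    int positiveCount = 0;                                      // number of "+" sub-terms

    SearchTerm(int id) { this.id = id; }

    SearchTerm add(String subTerm, boolean positive) {
        Boolean previous = subTerms.put(subTerm.toLowerCase(Locale.ROOT), positive);
        if (Boolean.TRUE.equals(previous)) positiveCount--;     // overwrote an existing "+" entry
        if (positive) positiveCount++;
        return this;
    }
}

/** Per-article state for one search term that has had at least one sub-term hit. */
final class Candidate {
    final SearchTerm term;                        // pointer to the term definition
    final Set<String> seen = new HashSet<>();     // positive sub-terms seen so far
    boolean rejected = false;                     // set once a negative sub-term is seen

    Candidate(SearchTerm term) { this.term = term; }
}

final class TermIndex {
    /** Builds the hash table once, up front: sub-term -> every term that contains it. */
    static Map<String, List<SearchTerm>> build(List<SearchTerm> terms) {
        Map<String, List<SearchTerm>> index = new HashMap<>();
        for (SearchTerm term : terms) {
            for (String subTerm : term.subTerms.keySet()) {
                index.computeIfAbsent(subTerm, k -> new ArrayList<>()).add(term);
            }
        }
        return index;
    }
}
```

The index is built once for the whole set of search terms, while a fresh `Map<Integer, Candidate>` would be created for each article.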

Iterate over the news article.

. , .

, .

. , . , .

, . , , .
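A rough Java sketch of the per-article scan, assuming the matching rule from the question (every + sub-term of a term is present and none of its - sub-terms is). To keep it short it uses plain maps and integer IDs (list positions) rather than dedicated classes. `ArticleMatcher`, `buildIndex`, `match` and `maxPhraseWords` are illustrative names, not part of any library. It also assumes sub-terms are stored in lower case with single spaces, and that splitting on non-alphanumeric ASCII characters is good enough for tokenizing.

```java
import java.util.*;

final class ArticleMatcher {
    /** Sub-term -> IDs of all terms containing it. Built once for the whole term set. */
    static Map<String, List<Integer>> buildIndex(List<Map<String, Boolean>> terms) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < terms.size(); id++) {
            for (String subTerm : terms.get(id).keySet()) {
                index.computeIfAbsent(subTerm, k -> new ArrayList<>()).add(id);
            }
        }
        return index;
    }

    /** Returns the IDs of all terms matched by the article. terms.get(id) maps sub-term -> true (+) / false (-). */
    static Set<Integer> match(List<Map<String, Boolean>> terms,
                              Map<String, List<Integer>> index,
                              String article, int maxPhraseWords) {
        String[] words = article.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        Map<Integer, Set<String>> seenPositives = new HashMap<>(); // the "candidate" map
        Set<Integer> rejected = new HashSet<>();                   // the "rejected" flags

        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            // Try phrases of 1..maxPhraseWords words so that "fox terrier" can be found.
            for (int len = 1; len <= maxPhraseWords && start + len <= words.length; len++) {
                if (len > 1) phrase.append(' ');
                phrase.append(words[start + len - 1]);
                String key = phrase.toString();
                for (int id : index.getOrDefault(key, Collections.emptyList())) {
                    if (rejected.contains(id)) continue;           // already ruled out
                    if (terms.get(id).get(key)) {
                        seenPositives.computeIfAbsent(id, k -> new HashSet<>()).add(key);
                    } else {
                        rejected.add(id);                          // a negative sub-term was hit
                    }
                }
            }
        }

        // A term is a hit if it was not rejected and all of its positive sub-terms were seen.
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<Integer, Set<String>> e : seenPositives.entrySet()) {
            int id = e.getKey();
            long required = terms.get(id).values().stream().filter(b -> b).count();
            if (!rejected.contains(id) && e.getValue().size() == required) hits.add(id);
        }
        return hits;
    }
}
```

Only terms that share at least one sub-term with the article are ever touched, so most of the ~10^5 terms cost nothing for a given article. Note that in this sketch a term consisting solely of negative sub-terms is never reported as a hit.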

: https://docs.google.com/document/d/1boieLJboLTy7X2NH1Grybik4ERTpDtFVggjZeEDQH74/edit

O (n * m), n - , m - , - ( ).
