Solr: using the regex fragmenter to extract paragraphs

Question

I sent this message to the Solr mailing list, but I'm also asking here in case there are any Solr experts around.

I'm trying to use the regex fragmenter and I'm having a hard time getting the results I want. I want fragments that start on a word character and end on punctuation, but for some reason the fragments being returned seem very inflexible, even though I've provided a large slop. Here are the relevant parameters I'm using; maybe someone can point out where I've gone wrong:

<str name="hl.fragsize">500</str>
<str name="hl.fragmenter">regex</str>
<str name="hl.regex.slop">0.8</str>
<str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
<str name="hl">true</str>
<str name="q">chinese</str>

This should match between 400 and 600 characters, beginning with a word character and ending with one of . ! or ?. Here is an example of a typical result:

. Check out these pictures. Nine panda kids on display for the first time Thursday in southwestern China. They are less than a year old. They just recently stopped breastfeeding. There are only 1,600 of these guys who went into the mountain forests of central China, another 120 people in Chinese breeding facilities and zoos. And they are about 20 who live outside of China in zoos. They almost completely exist on bamboo. They can live up to 30 years. And also these little guys will end up with a lot more. They will grow

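As the example shows, the fragment begins with a period and ends in the middle of a sentence, which is not what the pattern is supposed to allow.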
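
For reference, the Solr documentation describes hl.regex.slop as the factor by which the regex fragmenter may stray from hl.fragsize; its own example is a slop of 0.2 with a fragsize of 100 yielding fragments of 80 to 120 characters. A minimal Python sketch of that arithmetic for the values above (slop_window is a hypothetical helper for illustration, not a Solr API):

```python
def slop_window(fragsize, slop):
    """Return the (shortest, longest) fragment length the slop factor allows.

    Hypothetical helper: it only mirrors the documented arithmetic,
    fragsize * (1 - slop) up to fragsize * (1 + slop).
    """
    return round(fragsize * (1 - slop)), round(fragsize * (1 + slop))


# Values taken from the parameters in the question.
low, high = slop_window(500, 0.8)
print(low, high)  # 100 900
```

If that reading is right, the 402 to 602 characters the pattern asks for sit well inside the allowed window, so the slop value itself does not look like the limiting factor.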

+2

highlighting regex solr

Markus Dec 12 '08 at 22:01

3

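I'm not familiar with the software in question (Solr), but I can help with the regex. To match between 402 and 602 characters, starting with a word character and ending with a punctuation mark, try: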

\w.{400,600}[.!?]

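In your pattern, the * in front of {400,600} should not be there. The brackets around \w are not needed either.

The quantifier is greedy, so this matches as many characters as it can (up to 602), which means a fragment can run across several sentences as long as it ends on punctuation.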

If you want it to match as few characters as possible, make the quantifier lazy:

\w.{400,600}?[.!?]

If you want each fragment to stay within a single sentence, use a negated character class instead:

\w[^.!?]{400,600}[.!?]

All of this assumes that Solr uses a Perl-style regex flavor. If it doesn't, \w and {400,600} may need to be written differently.
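
To see how the three patterns above differ, here is a small sketch using Python's re module as a stand-in (Solr itself is a Java application, so engine details may differ, and this exercises only the regex, not Solr's fragmenter). The sample text is made up, and the bounds are scaled down from {400,600} to {10,50} so the matches are short enough to read:

```python
import re

# Hypothetical sample text, not taken from any Solr index.
text = "Pandas eat bamboo. They can live up to 30 years. They sleep a lot!"

# Same shapes as the three patterns above, with smaller bounds.
greedy = re.compile(r"\w.{10,50}[.!?]")             # as many characters as possible
lazy = re.compile(r"\w.{10,50}?[.!?]")              # as few characters as possible
one_sentence = re.compile(r"\w[^.!?]{10,50}[.!?]")  # no punctuation inside

print(greedy.search(text).group())
# Pandas eat bamboo. They can live up to 30 years.

print(lazy.search(text).group())
# Pandas eat bamboo.

print(one_sentence.findall(text))
# ['Pandas eat bamboo.', 'They can live up to 30 years.', 'They sleep a lot!']
```

The greedy form stops at the last punctuation mark it can reach within the upper bound, the lazy form at the first one past the lower bound, and the negated class never crosses a punctuation mark at all.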

+1

Jan Goyvaerts Dec 13 '08 at 12:55

I had a similar problem, and it appears to be related to WordDelimiterFilterFactory: http://www.mail-archive.com/solr-user@lucene.apache.org/msg30631.html

As described in the link above, one solution might be to add preserveOriginal="1" to your WordDelimiterFilterFactory. I tried this and it worked for me. However, being new to Solr, I don't know whether this approach has any drawbacks (other than increasing the size of the index).

0

raymi Jun 28 '11 at 8:58

VonC · Accepted Answer · 2008-12-12T22:15:42+0000

Try:

\w[^\.!\?]{400,600}[\.!\?]

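\w does not need to be wrapped in brackets.

Also, .* followed by a quantifier ({400,600}) does not do what you intend; what you want is .{400,600}.

The negated class [^\.!\?] keeps the match from running past the end of a sentence, so each fragment stops at the first ., ! or ? it reaches.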
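
One thing worth checking before adopting the negated-class form is whether the text actually contains a single sentence of 402 to 602 characters. The sketch below runs the pattern against the sample fragment from the question, again using Python's re module as a stand-in for Solr's Java regex engine and testing only the regex, not the fragmenter:

```python
import re

# The sample fragment from the question, without its leading ". ".
sample = (
    "Check out these pictures. Nine panda kids on display for the first "
    "time Thursday in southwestern China. They are less than a year old. "
    "They just recently stopped breastfeeding. There are only 1,600 of "
    "these guys who went into the mountain forests of central China, "
    "another 120 people in Chinese breeding facilities and zoos. And they "
    "are about 20 who live outside of China in zoos. They almost completely "
    "exist on bamboo. They can live up to 30 years. And also these little "
    "guys will end up with a lot more. They will grow"
)

escaped = re.compile(r"\w[^\.!\?]{400,600}[\.!\?]")  # pattern from this answer
plain = re.compile(r"\w[^.!?]{400,600}[.!?]")        # same classes, no escapes

# Inside [...] the characters . and ? are literal, so both spellings agree.
print(re.findall(r"[\.!\?]", "a.b!c?") == re.findall(r"[.!?]", "a.b!c?"))  # True

# No sentence in the sample comes close to 402 characters, so neither matches.
print(escaped.search(sample), plain.search(sample))  # None None

# Allowing punctuation inside the fragment does produce a match here.
spanning = re.search(r"\w.{400,600}[.!?]", sample)
print(spanning.group().endswith("a lot more."))  # True
```

On text made of short sentences like this, the negated class only matches with much smaller bounds, while the .{400,600} form matches by spanning several sentences.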
