Removing commas when using the whitespace tokenizer

Question

When using the whitespace tokenizer, text like "there, he is." is split into "there,", "he", and "is." with the punctuation still attached. Naturally, I would like to remove these punctuation marks, which the standard tokenizer would strip automatically.

My questions:

How can I strip these punctuation marks? (In the Elasticsearch settings, for example by adding another token filter or char filter.)
I need to use the whitespace tokenizer mainly because I don't want hyphenated words to be broken up. Is there a way I can achieve this while still using the standard tokenizer?

elasticsearch

Dionysian Feb 23 '14 at 12:39

3 answers

user3340677 · Answer 1 · 2014-02-23T13:14:00+0000

Use a char filter to strip the "," characters.
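As a rough illustration of the char filter idea, here is a plain Java sketch (not Elasticsearch code) that removes a comma or period only when it is followed by whitespace or the end of the input, and then splits on whitespace. The class name, the tokenize helper, and the regular expression are assumptions made for this example; a regex-based char filter placed in front of the whitespace tokenizer could apply a similar pattern, but that should be checked against your Elasticsearch version.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PunctuationStripSketch {
    // Hypothetical helper: mimics "char filter first, whitespace tokenizer second".
    static List<String> tokenize(String text) {
        // Drop a comma or period only when whitespace or the end of input follows it,
        // so tokens such as "3.14" and "well-known" are left untouched.
        String cleaned = text.replaceAll("[.,](?=\\s|$)", "");
        // Split on runs of whitespace and skip any empty fragments.
        return Arrays.stream(cleaned.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("there, he is."));
        System.out.println(tokenize("a well-known value, 3.14."));
    }
}
```

Because the pattern only targets punctuation at the end of a word, hyphenated words survive intact, which is the reason for using the whitespace tokenizer in the first place.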

Thomas Decaux · Answer 2 · 2015-01-15T15:32:03+0000

Use the word delimiter token filter:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

Link struck out in the original answer: http://es.subitolabs.com/#/testr/m6mfb4ahim86w29

Rishi dwivedi · Answer 3 · 2014-02-23T12:44:15+0000

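You can use split() in Java to remove the punctuation:

public class SplitDemo {
    public static void main(String[] args) {
        String str = "there, he is.";
        // Split on runs of spaces, commas, and periods; the "+" avoids empty tokens.
        String[] tokens = str.split("[ ,.]+");
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}

Try this; it should help.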
