Removing commas when using the whitespace tokenizer

When using the whitespace tokenizer, text like "there, he is." is tokenized into "there,", "he", and "is.". Naturally, I would like to remove these punctuation marks, which the standard tokenizer would strip automatically.
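To make the problem concrete, here is a minimal plain-Java sketch of whitespace-only splitting. It is not Elasticsearch code; the class name `WhitespaceSplitDemo` and the helper `tokenize` are made up for illustration and only approximate what a whitespace tokenizer does.

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceSplitDemo {
    // Hypothetical helper: splits on runs of whitespace only,
    // so punctuation stays attached to the neighbouring word.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The comma and the period remain part of their tokens.
        for (String token : tokenize("there, he is.")) {
            System.out.println(token);
        }
    }
}
```

The tokens still carry the comma and the period, which is exactly what the question wants to avoid.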

My questions:

  • How can I strip these punctuation marks (in the Elasticsearch setup, for example by adding another token filter or char filter)?
  • I need the whitespace tokenizer mainly because I don't want hyphenated words to be broken apart. Is there a way to achieve this while still using the standard tokenizer?
+3
3 answers

Use a char filter to remove characters such as ",". Char filters run before the tokenizer.
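The char-filter idea (delete the punctuation from the raw text first, then split on whitespace) can be sketched in plain Java. This is only a simulation under assumptions: `CharFilterSketch`, `stripPunctuation` and `tokenize` are hypothetical names, the sample sentence is invented, and it assumes that commas and periods are the only characters to drop and that hyphenated words are the ones to keep whole. In Elasticsearch itself the equivalent step would be configured as a char filter (for example a `pattern_replace` one) in front of the whitespace tokenizer.

```java
import java.util.Arrays;
import java.util.List;

class CharFilterSketch {
    // Stand-in for a character filter: deletes commas and periods
    // from the raw text. Hyphens are left untouched.
    static String stripPunctuation(String text) {
        return text.replaceAll("[,.]", "");
    }

    // Stand-in for the whitespace tokenizer, applied after the filter.
    static List<String> tokenize(String text) {
        return Arrays.asList(stripPunctuation(text).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        for (String token : tokenize("there, he is well-known.")) {
            System.out.println(token);
        }
    }
}
```

Note that deleting every period this way would also alter tokens such as "3.14" or "e.g.", so the pattern may need to be narrower in a real analyzer.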

+1

Take a look at the word delimiter token filter:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

=> http://es.subitolabs.com/#/testr/m6mfb4ahim86w29 (link struck through by the answer's author)

+1

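You can use split() to remove all the punctuation:

String str = "there, he is.";
// The "+" treats a run such as ", " as one separator, so no empty strings are produced.
String[] tokens = str.split("[ ,.]+");
for (String token : tokens) {
    System.out.println(token);
}

Try this; it should help.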

-1