When using the whitespace tokenizer, text like "there, he is." is split into the tokens "there,", "he", and "is.". Naturally, I would like to strip this punctuation, which the standard tokenizer removes automatically.
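To make the behavior concrete, here is a rough Python sketch of what I am seeing and what I would like to get instead. This is only an illustration, not Elasticsearch code: the function names `whitespace_tokenize` and `trim_punctuation` are made up for this example, and the regex is just my assumption of what "trimming punctuation" should mean (strip non-word characters from both ends of each token, leave the inside alone).

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace only; punctuation stays attached to tokens.
    return text.split()

def trim_punctuation(token):
    # Remove non-word characters from the start and end of the token,
    # leaving any characters inside the token untouched.
    return re.sub(r"^\W+|\W+$", "", token)

tokens = whitespace_tokenize("there, he is.")
print(tokens)   # ['there,', 'he', 'is.']

# Trim each token and drop any that become empty (e.g. a lone "...").
trimmed = [t for t in (trim_punctuation(tok) for tok in tokens) if t]
print(trimmed)  # ['there', 'he', 'is']
```

The second list is the output I am after at the analyzer level.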
My questions:
- How can I trim this punctuation? (Within the Elasticsearch settings, for example by adding another token filter or char filter.)
- I need to use the whitespace tokenizer mainly because I don't want hyphenated words to be broken up. Is there a way I can achieve this while still using the standard tokenizer?