Stanford POS Tagger Does Not Tag Chinese Text

I am using the Stanford POS Tagger for the first time. It tags English correctly, but it does not seem to recognize (Simplified) Chinese, even when I change the model parameter. Am I missing something?

I downloaded and unpacked the latest full version from here: http://nlp.stanford.edu/software/tagger.shtml

Then I put some sample text into "sample-input.txt":

这 是 一个 测试 的 句子. 这 是 另一个 句子.

Then I simply ran:

./stanford-postagger.sh models/chinese-distsim.tagger sample-input.txt

The expected output is each word tagged with its part of speech. Instead, the tagger treats the entire line of text as a single word:

Loading default properties from tagger models/chinese-distsim.tagger

Reading POS tagger model from models/chinese-distsim.tagger ... done [3.5 sec].

這 是 一個 測試 的 句子. 這 是 另一個 句子.#NR

Tagged 1 words at 30.30 words per second.

I appreciate any help.

1 answer

I finally realized that tokenization/segmentation is not included in this POS tagger. The words apparently have to be separated by spaces before the text is passed to the tagger. For those who are interested in maximum entropy segmentation of Chinese words, there is a separate package:

http://nlp.stanford.edu/software/segmenter.shtml
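
To illustrate why an unsegmented line ends up as one "word", here is a minimal Python sketch. It assumes the tagger simply treats each whitespace-separated chunk as a token, which is my reading of the behavior above and not something taken from the tagger's source. The helper `whitespace_tokens` and the sample strings are hypothetical and not part of the Stanford tools.

```python
def whitespace_tokens(line: str) -> list:
    """Split a line on whitespace, mimicking how pre-tokenized input is read.

    Assumption: this only imitates whitespace splitting; it is not the
    tagger's actual tokenizer.
    """
    return line.split()


# Hypothetical sample strings: the same sentence before and after segmentation.
unsegmented = "这是一个测试的句子。"
segmented = "这 是 一个 测试 的 句子 。"

# Without spaces the whole sentence is a single token.
print(len(whitespace_tokens(unsegmented)))  # 1

# After segmentation each word is its own token and can get its own tag.
print(len(whitespace_tokens(segmented)))  # 7
```

So the text has to go through a segmenter first, and only the space-separated result should be passed to the tagger.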

Thanks to everyone.
