Stanford POS Tagger Does Not Tag Chinese Text

Question

I am using the Stanford POS Tagger for the first time. It tags English correctly, but it does not seem to recognize (simplified) Chinese, even when I change the model parameter. Am I missing something?

I downloaded and unpacked the latest full version from here: http://nlp.stanford.edu/software/tagger.shtml

Then I put some sample text in "sample-input.txt":

这是一个测试的句子. 这是另一个句子.

Then I simply run:

./stanford-postagger.sh models/chinese-distsim.tagger sample-input.txt

The expected output is each word tagged with its part of speech, but instead the tagger treats the entire line of text as a single word:

Loading default properties from tagger models/chinese-distsim.tagger
Reading POS tagger model from models/chinese-distsim.tagger ... done [3.5 sec].
这是一个测试的句子. 这是另一个句子.#NR
Tagged 1 words at 30.30 words per second.

I appreciate any help.

+5

linux nlp stanford-nlp pos-tagger

Ryan rapp Apr 18 '13 at 4:00

1 answer

Ryan rapp · Accepted Answer · 2013-04-18T21:14:31+0000

I eventually realized that tokenization/segmentation is not included in this POS tagger. It seems that the words must already be separated by spaces before the text is passed to the tagger. For those interested in maximum entropy word segmentation of Chinese, there is a separate package:

http://nlp.stanford.edu/software/segmenter.shtml
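
To illustrate why the whole line came back as one word, here is a minimal Python sketch. It assumes, as the behavior above suggests, that the tagger splits its input on whitespace only. The helper `whitespace_tokens` and the hand-made segmentation are hypothetical and only for illustration; they are not part of the tagger or the segmenter, and a real segmenter may split the sentence differently.

```python
# Sketch only: assumes the tagger treats whitespace as the sole token boundary.

def whitespace_tokens(text):
    """Split text the way a whitespace-only tokenizer would (hypothetical helper)."""
    return text.split()

# Unsegmented sentence: it contains no spaces, so it counts as one "word".
raw = "这是一个测试的句子。"

# The same sentence with word boundaries inserted by hand. A segmenter would
# normally produce these; this particular split is an assumption.
segmented = "这 是 一 个 测试 的 句子 。"

raw_tokens = whitespace_tokens(raw)
segmented_tokens = whitespace_tokens(segmented)

print(len(raw_tokens), raw_tokens)
print(len(segmented_tokens), segmented_tokens)
```

Under that assumption, the raw sentence yields a single token while the pre-segmented one yields one token per word, which is the form the tagger appears to expect in its input file.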

Thanks, everyone.
