Apache Tika can extract Unicode text from supported file formats. As long as the file format can store Unicode text (for example, Chinese or Japanese characters), Apache Tika can extract it.
For example, using the Tika app on an Outlook MSG file containing Chinese text:
$ java -jar tika-app-1.4.jar --text testMSG_chinese.msg | head
Alfresco MSG format testing ( MSG 格式測試 )
From
Tests Chang@FT (張毓倫)
To
Tests Chang@FT (張毓倫)
Recipients
tests.chang@fengttt.com
Likewise, for an RTF file containing Japanese text:
$ java -jar tika-app-1.4.jar --text testRTFJapanese.rtf | head -2
ゾルゲの処刑記録、
ゾルゲと尾崎、淡々と最期
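As long as the terminal encoding can display the characters (for example, utf8), the extracted text comes out correctly!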
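When the extracted text is handled from Java code rather than piped to a terminal, the output encoding still has to be chosen somewhere. The following is a minimal sketch using only the Java standard library, not Tika itself: the class and method names (`Utf8OutputSketch`, `utf8Stream`, `printAsUtf8`) are hypothetical, and the sample string is simply a stand-in for text that Tika has already returned.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8OutputSketch {

    // Wraps any output stream in a PrintStream that always encodes as UTF-8,
    // instead of relying on the platform default encoding.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Writes the text through a UTF-8 PrintStream and returns the raw bytes.
    static byte[] printAsUtf8(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8Stream(buffer);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for text already extracted by Tika (copied from the output above).
        String extracted = "ゾルゲと尾崎、淡々と最期";

        // Decode the bytes again and report whether the text survived unchanged.
        byte[] bytes = printAsUtf8(extracted);
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(extracted));

        // Print to the real standard output as UTF-8.
        utf8Stream(System.out).println(extracted);
    }
}
```

Whether the last line displays correctly still depends on the terminal being set to UTF-8; the sketch only guarantees that the bytes written are UTF-8.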