How to tag all CJK text in a document?

Question

I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text according to its language, except for English, and write the result to a new file. For example, here is a sample line:

The 恐龙 ate 鱼.

Since it contains Chinese characters, it should be marked up like this:

The \language[cn]{恐龙} ate \language[cn]{鱼}.

The document is saved as UTF-8.
Chinese text should be marked \language[cn]{*}.
Japanese text should be marked \language[ja]{*}.
Korean text should be marked \language[ko]{*}.
Content never continues from one line to another.
If the code cannot tell whether something is Chinese, Japanese, or Korean, it should default to Chinese.

How can I mark up the text according to the language of each region?

unicode character-properties cjk multilingual

Village May 7, '12 at 13:23

3 answers

daxim · Answer 1 · 2012-05-07T14:08:07+0000

Rough algorithm:

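use 5.014;
use utf8;                             # source and DATA section are UTF-8
binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings

while (<DATA>) {
    # Hangul runs are Korean.
    s
        {(\p{Hangul}+)}
        {\\language[ko]{$1}}g;
    # Han runs are tagged as Chinese (zh here; the question asks for cn).
    s
        {(\p{Hani}+)}
        {\\language[zh]{$1}}g;
    # Kana runs are Japanese.
    s
        {(\p{Hiragana}+|\p{Katakana}+)}
        {\\language[ja]{$1}}g;
    print;                            # $_ still ends with its own newline
}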

__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.

(Also see: Chinese character detection with perl?)

There are problems with this. As Daenyth comments, 恐竜 is misidentified as Chinese, for example. I find it unlikely that you are really working with mixed English and CJK text; I suspect you just gave a bad example. Perform a lexical analysis first to distinguish Chinese from Japanese.
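One rough way to act on this advice, short of a full lexical analysis, is to treat kana as line-level evidence: if a line contains any hiragana or katakana, its Han runs are tagged as Japanese, otherwise they fall back to Chinese, the default the question asks for. The following is a minimal sketch under stated assumptions: the code point ranges are a simplified subset of each script, and `wrap` and `tag_line` are hypothetical helper names, not part of the answer above.

```python
import re

# Simplified code point ranges (an assumption; they cover the main blocks only).
HAN = r"\u3400-\u4dbf\u4e00-\u9fff"
KANA = r"\u3040-\u309f\u30a0-\u30ff"
HANGUL = r"\u1100-\u11ff\uac00-\ud7a3"

KANA_RE = re.compile("[" + KANA + "]")
HANGUL_RE = re.compile("[" + HANGUL + "]+")
HAN_RE = re.compile("[" + HAN + "]+")
JA_RE = re.compile("[" + HAN + KANA + "]+")


def wrap(lang):
    """Return a replacement function that wraps a match in a language tag."""
    return lambda m: "\\language[%s]{%s}" % (lang, m.group(0))


def tag_line(line):
    """Tag CJK runs in one line; Han counts as Japanese only if the line has kana."""
    line = HANGUL_RE.sub(wrap("ko"), line)
    if KANA_RE.search(line):
        # Kana present: treat kana and Han runs together as Japanese.
        return JA_RE.sub(wrap("ja"), line)
    # No kana: fall back to Chinese.
    return HAN_RE.sub(wrap("cn"), line)


for sample in ["The 恐龙 ate 鱼.", "The 恐竜 ate うお.", "The 공룡 ate 물고기."]:
    print(tag_line(sample))
```

A Japanese line written only in kanji, such as "The 恐竜 ate 魚.", still comes out as Chinese with this heuristic, so it narrows the problem rather than solving it.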

wuliang · Answer 2 · 2012-05-07T20:55:39+0000

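Here is a Python version. It relies on the Unicode Script property, which is part of the Unicode Character Database (UCD). Perl exposes this UCD property directly; Python does not.
Python's standard "unicodedata" module has no Script property, so the code below uses an extended module: https://gist.github.com/2204527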

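    # coding=utf8
    # unicodedata2 (with script_cat) is assumed to be the module from the
    # gist linked above; it is not part of the standard library.
    import unicodedata2
    text=u"""The恐龙ate鱼.
    The 恐竜ate 魚.
    Theキョウリュウ ate うお.
    The공룡 ate 물고기. """

    # Map Unicode script names to ConTeXt language tags.
    langs = {
    'Han':'cn',
    'Katakana':'ja',
    'Hiragana':'ja',
    'Hangul':'ko'
    }

    # Pair every character with its script name.
    alist = [(x,unicodedata2.script_cat(x)[0]) for x in text]
    # Append a sentinel so the last run is flushed.
    alist.append(("",""))
    newlist = []
    langlist = []
    prevlang = ""
    for raw, lang in alist:
        # The script changed: close the run collected so far.
        if prevlang in langs and prevlang != lang:
            newlist.append("\\language[%s]{" % langs[prevlang] +"".join(langlist) + "}")
            langlist = []

        if lang not in langs:
            newlist.append(raw)
        else:
            langlist.append(raw)
        prevlang = lang

    newtext = "".join(newlist)
    print(newtext)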

Output:

    $ python test.py 
    The\language[cn]{恐龙}ate\language[cn]{鱼}.
    The \language[cn]{恐竜}ate \language[cn]{魚}.
    The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
    The\language[ko]{공룡} ate \language[ko]{물고기}.
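For readers who cannot install the gist module, a similar run-grouping approach can be sketched with the standard library alone. This is a sketch under an assumption: it infers the script from the start of each character's Unicode name, which is a heuristic and not the UCD Script property the answer relies on. `PREFIXES`, `lang_of`, and `tag_runs` are hypothetical names introduced for illustration.

```python
import unicodedata
from itertools import groupby

# Map the leading words of a character's Unicode name to a language tag.
# Inferring the script from the name is a heuristic, not the Script property.
PREFIXES = [
    ("CJK UNIFIED IDEOGRAPH", "cn"),
    ("HIRAGANA", "ja"),
    ("KATAKANA", "ja"),
    ("HANGUL", "ko"),
]


def lang_of(ch):
    """Return 'cn', 'ja', 'ko', or None for any other character."""
    name = unicodedata.name(ch, "")
    for prefix, lang in PREFIXES:
        if name.startswith(prefix):
            return lang
    return None


def tag_runs(text):
    """Wrap each maximal run of same-language CJK characters in a tag."""
    out = []
    for lang, run in groupby(text, key=lang_of):
        chunk = "".join(run)
        out.append(chunk if lang is None else "\\language[%s]{%s}" % (lang, chunk))
    return "".join(out)


print(tag_runs("The 恐龙 ate 鱼."))
print(tag_runs("The キョウリュウ ate うお."))
print(tag_runs("The 공룡 ate 물고기."))
```

Because the grouping key is the language tag and not the script, adjacent hiragana and katakana end up in a single `\language[ja]{...}` group. Han characters are still always tagged `cn`, so 恐竜 has the same problem as in the output above.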

dda · Answer 3 · 2012-05-19T17:04:59+0000

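Han characters [漢字/Kanji] are shared between Chinese and Japanese, so the script alone cannot tell the two languages apart. Some forms, such as 竜, are used mainly in Japanese. A run that mixes hiragana/katakana + kanji is almost certainly Japanese.

For the characters themselves, the Unihan database has fields that relate variant forms of a character, such as kZVariant and kSpecializedSemanticVariant. 内 and 內 are one pair of variant forms.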

EDIT:

http://pastebin.com/e276zn6y

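The data comes from Unicode.org; Unihan is its database for CJK characters, and its per-character fields are named kXXX. A simpler heuristic is to count the script of each character in a sentence: "Hiragana" + "Katakana" (possibly with "Han") points to Japanese, "Hangul" (possibly with "Han") points to Korean, and "Han" alone points to Chinese.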

QUICK TEST

Some code to use with the function linked above, which supplies scriptName.

function guessLanguage(x) {
  // Count how many characters of x belong to each script.
  var results = {};
  var s = '';
  var i, j = x.length;
  for (i = 0; i < j; i++) {
    s = scriptName(x.charAt(i));
    if (results.hasOwnProperty(s)) {
      results[s] += 1;
    } else {
      results[s] = 1;
    }
  }
  console.log(results);
  // Pick the script with the highest count.
  var mostCount = 0;
  var mostName = '';
  var name;
  for (name in results) {
    if (results.hasOwnProperty(name)) {
      if (results[name] > mostCount) {
        mostCount = results[name];
        mostName = name;
      }
    }
  }
  return mostName;
}

Some tests:

r=guessLanguage("外人だけど、日本語をペラペラしゃべるよ！");
Object
  Common: 2
  Han: 5
  Hiragana: 9
  Katakana: 4
  __proto__: Object
"Hiragana"

The logged object contains the number of occurrences of each script, and r holds the most frequent one. Hiragana is the most frequent, and Hiragana + Katakana together make up about 2/3 of the sentence.

r=guessLanguage("我唔知道,佢講乜話.")
Object
  Common: 2
  Han: 8
  __proto__: Object
"Han"

An obvious case of Chinese (Cantonese in this case).

r=guessLanguage("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다...");
Object
  Common: 11
  Han: 4
  Hangul: 19
  __proto__: Object
"Hangul"

Some Han characters and a lot of Hangul. A Korean sentence, no doubt.
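The same counting idea can be sketched in Python, going one step further and mapping the counts to a language tag. This is a sketch under assumptions: the script is inferred from the start of each character's Unicode name instead of a real Script lookup, and the decision rule (any kana or Hangul outweighs Han, Han alone means Chinese) is one possible reading of the tests above. `script_of`, `count_scripts`, and `guess_language` are hypothetical names.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Approximate a character's script from the start of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    for script in ("Hiragana", "Katakana", "Hangul"):
        if name.startswith(script.upper()):
            return script
    return "Other"


def count_scripts(text):
    """Count how many characters of text fall into each script."""
    return Counter(script_of(ch) for ch in text)


def guess_language(text):
    """Return 'ja', 'ko', 'cn', or None if the text has no CJK characters."""
    counts = count_scripts(text)
    kana = counts["Hiragana"] + counts["Katakana"]
    hangul = counts["Hangul"]
    if kana == 0 and hangul == 0:
        # Only Han (or nothing CJK at all): default to Chinese.
        return "cn" if counts["Han"] else None
    # Kana or Hangul outweigh Han, which all three languages can contain.
    return "ja" if kana >= hangul else "ko"


print(guess_language("外人だけど、日本語をペラペラしゃべるよ！"))
print(guess_language("我唔知道,佢講乜話."))
print(guess_language("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다..."))
```

Unlike the JavaScript version, this lumps punctuation and spaces into a single "Other" bucket, so the counts for Han, Hiragana, Katakana, and Hangul are comparable but the "Common" figure is not reproduced.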
