How to tag all CJK text in a document?

I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of the text according to its language, except for English, and write the result to a new file. For example, take this line:

The 恐龙 ate 鱼.

Since it contains Chinese characters, the output should look like this:

The \language[cn]{恐龙} ate \language[cn]{鱼}.
  • The document is saved as UTF-8.
  • Chinese text should be marked \language[cn]{*}.
  • Japanese text should be marked \language[ja]{*}.
  • Korean text should be marked \language[ko]{*}.
  • Content never continues from one line to another.
  • If the code cannot decide whether something is Chinese, Japanese or Korean, it should default to Chinese.

How can I mark the text according to the language present?

+5
3 answers

Rough algorithm:

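use 5.014;
use utf8;                            # the source, including __DATA__, is UTF-8
binmode STDOUT, ':encoding(UTF-8)';  # avoid "Wide character" warnings on output
while (<DATA>) {
    chomp;                           # say adds the newline back
    # Hangul runs are Korean.
    s
        {(\p{Hangul}+)}
        {\\language[ko]{$1}}g;
    # Han runs are tagged cn, the tag and default the question asks for.
    s
        {(\p{Hani}+)}
        {\\language[cn]{$1}}g;
    # Kana runs are Japanese.
    s
        {(\p{Hiragana}+|\p{Katakana}+)}
        {\\language[ja]{$1}}g;
    say;
}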

__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.

(See also "Chinese character detection with perl?")

There are problems with this. Daenyth comments, for example, that 恐竜 is misidentified as Chinese. I find it unlikely that you are really working with mixed English and CJK text; I suspect the example text is just poorly chosen. Perform a lexical analysis first to distinguish Chinese from Japanese.
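Real lexical analysis is beyond a short example, but a much cruder stand-in can be sketched: treat every Han run on a line as Japanese when the same line also contains kana, and fall back to Chinese otherwise. This is a line-level heuristic, not lexical analysis, and it is written in Python rather than Perl. The code point ranges and the name `tag_line` are assumptions made for illustration, and a kanji-only line such as the 恐竜 example is still tagged as Chinese.

```python
import re

# Assumption: these BMP ranges are a good enough stand-in for the
# Unicode scripts; rarer blocks and extensions are left out.
KANA = "\u3040-\u309f\u30a0-\u30ff"
HAN = "\u3400-\u4dbf\u4e00-\u9fff"
HANGUL = "\u1100-\u11ff\u3130-\u318f\uac00-\ud7a3"

KANA_CHAR = re.compile("[" + KANA + "]")
HANGUL_RUN = re.compile("[" + HANGUL + "]+")
HAN_RUN = re.compile("[" + HAN + "]+")
JA_RUN = re.compile("[" + KANA + HAN + "]+")


def wrap(tag):
    """Return a replacement callback that wraps a match in \\language[tag]{...}."""
    return lambda m: "\\language[%s]{%s}" % (tag, m.group(0))


def tag_line(line):
    """Tag the CJK runs of one line (hypothetical helper)."""
    has_kana = KANA_CHAR.search(line) is not None
    line = HANGUL_RUN.sub(wrap("ko"), line)
    if has_kana:
        # Kana on the line: mixed kana/Han runs are taken to be Japanese.
        return JA_RUN.sub(wrap("ja"), line)
    # No kana: Han runs default to Chinese, as the question requests.
    return HAN_RUN.sub(wrap("cn"), line)


if __name__ == "__main__":
    for sample in ("The 恐龙 ate 鱼.", "恐竜がうおを食べた", "The 공룡 ate 물고기."):
        print(tag_line(sample))
```

Because the question states that content never continues from one line to the next, deciding per line is a natural unit for this guess.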

+6

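In Python. The approach relies on the Unicode Script property from the Unicode Character Database (UCD). Perl supports this part of the UCD directly; Python does not.
Python's standard "unicodedata" module has no Script property. The module at https://gist.github.com/2204527 adds one, and the code below uses it.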

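    # coding=utf8
    # Requires the unicodedata2 module from the gist linked above.
    import unicodedata2

    text = u"""The恐龙ate鱼.
    The 恐竜ate 魚.
    Theキョウリュウ ate うお.
    The공룡 ate 물고기. """

    # Unicode script name -> ConTeXt language tag.
    langs = {
        'Han': 'cn',
        'Katakana': 'ja',
        'Hiragana': 'ja',
        'Hangul': 'ko',
    }

    # Pair every character with its script name.
    alist = [(x, unicodedata2.script_cat(x)[0]) for x in text]
    # Sentinel so that a run at the very end of the text is flushed too.
    alist.append(("", ""))
    newlist = []
    langlist = []   # characters of the CJK run being collected
    prevlang = ""
    for raw, lang in alist:
        # The script changed: close the CJK run collected so far.
        if prevlang in langs and prevlang != lang:
            newlist.append("\\language[%s]{" % langs[prevlang] + "".join(langlist) + "}")
            langlist = []

        if lang not in langs:
            newlist.append(raw)
        else:
            langlist.append(raw)
        prevlang = lang

    newtext = "".join(newlist)
    print(newtext)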

Output:

    $ python test.py 
    The\language[cn]{恐龙}ate\language[cn]{鱼}.
    The \language[cn]{恐竜}ate \language[cn]{魚}.
    The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
    The\language[ko]{공룡} ate \language[ko]{물고기}.
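If installing the unicodedata2 module from the gist is not an option, the same run-grouping idea can be approximated with the standard library alone. The sketch below assumes that the leading words of a character's Unicode name (as returned by `unicodedata.name`) are a usable stand-in for its script; the names `tag_of` and `tag_text` are made up for this example. Unlike the code above, it groups by target tag rather than by script, so adjacent hiragana and katakana end up in a single `ja` run.

```python
import itertools
import unicodedata

# Assumption: the start of the Unicode character name approximates the script.
NAME_PREFIX_TO_TAG = (
    ("CJK UNIFIED IDEOGRAPH", "cn"),  # Han defaults to Chinese
    ("HIRAGANA", "ja"),
    ("KATAKANA", "ja"),
    ("HANGUL", "ko"),
)


def tag_of(ch):
    """Return the language tag for one character, or None for untagged text."""
    name = unicodedata.name(ch, "")
    for prefix, tag in NAME_PREFIX_TO_TAG:
        if name.startswith(prefix):
            return tag
    return None


def tag_text(text):
    """Wrap each maximal run of same-tag characters in \\language[tag]{...}."""
    parts = []
    for tag, group in itertools.groupby(text, key=tag_of):
        run = "".join(group)
        if tag is None:
            parts.append(run)
        else:
            parts.append("\\language[%s]{%s}" % (tag, run))
    return "".join(parts)


if __name__ == "__main__":
    print(tag_text("The恐龙ate鱼."))
    print(tag_text("The공룡 ate 물고기."))
```

Like the original, this still tags 恐竜 as Chinese, because the decision is made per character and Han characters alone do not reveal the language.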
+5

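Han characters [漢字/Kanji] are the hard part, because Chinese and Japanese share them. Some forms, such as 竜, are used in Japanese rather than in Chinese text. A run that mixes hiragana/katakana with kanji is most likely Japanese.

The Unihan database may help with the remaining cases: fields such as kZVariant and kSpecializedSemanticVariant relate variant forms of a character to each other (内 and 內 are one pair of variants).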

EDIT:

http://pastebin.com/e276zn6y

For CJK characters, Unicode.org provides the Unihan database, whose fields are named kXXX (kZVariant, for example).

QUICK TEST

Some code to use with the function linked above.

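function guessLanguage(x) {
  // Count how many characters of x belong to each script.
  // scriptName() is the function linked earlier in this answer.
  var results = {};
  var s, i, name;
  for (i = 0; i < x.length; i++) {
    s = scriptName(x.charAt(i));
    if (results.hasOwnProperty(s)) {
      results[s] += 1;
    } else {
      results[s] = 1;
    }
  }
  console.log(results);
  // Return the name of the most frequent script.
  var mostCount = 0;
  var mostName = '';
  for (name in results) {
    if (results.hasOwnProperty(name)) {
      if (results[name] > mostCount) {
        mostCount = results[name];
        mostName = name;
      }
    }
  }
  return mostName;
}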

Some tests:

r=guessLanguage("外人だけど、日本語をペラペラしゃべるよ!");
Object
  Common: 2
  Han: 5
  Hiragana: 9
  Katakana: 4
  __proto__: Object
"Hiragana"

The logged object contains the number of occurrences of each script. Hiragana is the most frequent, and Hiragana + Katakana together make up about 2/3 of the sentence.

r=guessLanguage("我唔知道,佢講乜話.")
Object
  Common: 2
  Han: 8
  __proto__: Object
"Han"

An obvious case of Chinese (Cantonese in this case).

r=guessLanguage("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다...");
Object
  Common: 11
  Han: 4
  Hangul: 19
  __proto__: Object
"Hangul"

A few Han characters and a lot of Hangul. A Korean sentence, no doubt.
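The three tests suggest a way to turn script counts into one of the tags from the question. The Python sketch below is an assumption layered on top of the counting idea, not part of the answer's code: it pools Hiragana and Katakana, lets kana or Hangul outweigh any number of Han characters, and falls back to `cn` for Han-only text. The names `script_of` and `guess_tag` are hypothetical, and scripts are approximated from Unicode character names rather than taken from the Script property.

```python
import collections
import unicodedata


def script_of(ch):
    """Approximate the script of a character from its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    for script in ("Hiragana", "Katakana", "Hangul"):
        if name.startswith(script.upper()):
            return script
    return "Other"


def guess_tag(text):
    """Return 'ja', 'ko', 'cn', or None when the text has no CJK characters."""
    counts = collections.Counter(script_of(ch) for ch in text)
    kana = counts["Hiragana"] + counts["Katakana"]
    hangul = counts["Hangul"]
    if kana or hangul:
        # Assumption: kana or Hangul decide the language even if
        # Han characters are more numerous.
        return "ja" if kana >= hangul else "ko"
    if counts["Han"]:
        return "cn"  # Han only: default to Chinese, as the question requests
    return None


if __name__ == "__main__":
    print(guess_tag("外人だけど、日本語をペラペラしゃべるよ!"))
    print(guess_tag("我唔知道,佢講乜話."))
    print(guess_tag("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다..."))
```

Pooling the two kana scripts avoids the case where a Japanese sentence is split between Hiragana and Katakana and a plain majority vote would pick Han.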

+3