Python MapReduce simple wordcount on Cyrillic text

Question

I am trying to implement a very simple wordcount example with MRJob. Everything works fine with ASCII input, but when I mix Cyrillic words into the input, I get output like this:

"\u043c\u0438\u0440"    1
"again!"    1
"hello" 2
"world" 1

As I understand it, the first line above is the escaped form of the single occurrence of the Cyrillic word "мир" ("world"), so the count itself is correct for my sample input. Here is the MR code:

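from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, key, line):
        # Input lines are cp1251-encoded bytes; decode them to unicode first.
        line = line.decode('cp1251').strip()
        words = line.split()
        for term in words:
            yield term, 1

    def reducer(self, term, howmany):
        # Sum the per-occurrence counts emitted by the mapper.
        yield term, sum(howmany)


if __name__ == '__main__':
    MRWordCount.run()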

I am using Python 2.7 and mrjob 0.4.2 on Windows. My questions:

a) How can I get readable Cyrillic output for Cyrillic input?
b) What is the reason for this behavior: is it caused by the Python / mrjob version, or should I expect it to work differently on non-Windows systems? Any hints?

python -c "print u'мир'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp866.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

python cyrillic mrjob

Anton, 22 Feb '14 14:49

jonrsharpe · Answer 1 · 2014-02-22T15:10:54+0000

In Python 2.x, you need to mark the string literal as Unicode with a leading u:

>>> print(u"\u043c\u0438\u0440") # note leading u

If you already have the escaped text as a byte string, convert it to Unicode with unicode:

>>> print(unicode("\u043c\u0438\u0440", "unicode_escape"))
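The quoted, escaped key in the question's output looks like JSON string escaping, which would be consistent with mrjob serializing keys and values as JSON by default (an assumption here, not something confirmed in this thread). A minimal sketch of where such escapes come from and how to reverse them with the standard json module follows. The helper decode_output_line is hypothetical, and the "key, tab, value" line layout is assumed.

```python
import json

word = u"\u043c\u0438\u0440"  # the Cyrillic word from the question's output

# json.dumps escapes non-ASCII characters unless ensure_ascii=False is passed.
escaped = json.dumps(word)
readable = json.dumps(word, ensure_ascii=False)


def decode_output_line(line):
    """Hypothetical helper: split 'key<TAB>value' and JSON-decode both parts."""
    key, value = line.rstrip("\r\n").split("\t", 1)
    return json.loads(key), json.loads(value)


# Assumed line layout: JSON-encoded key, a tab, JSON-encoded value.
term, count = decode_output_line(escaped + "\t1")
```

Here term is the original three-character Unicode string again, so the escaped output carries the same data as the readable form; only the serialization differs.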

Max Noel · Answer 2 · 2014-02-22T17:01:18+0000

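Also, when printing, you need to encode the string explicitly in the encoding your terminal expects. For UTF-8: print u"\u043c\u0438\u0440".encode("utf-8"), or use whatever encoding Windows uses (cp1251, perhaps?).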
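The explicit encoding step can be sketched as a small helper. The function name encode_for_console is hypothetical, and the fallback to UTF-8 when the stream reports no encoding is an assumption rather than anything stated in the thread. Unencodable characters are replaced with "?" instead of raising UnicodeEncodeError as in the question's traceback.

```python
import sys


def encode_for_console(text, encoding=None):
    """Hypothetical helper: encode text for a console, replacing unencodable characters."""
    # Fall back to UTF-8 when the stream does not report an encoding (e.g. when piped).
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, "replace")


word = u"\u043c\u0438\u0440"
utf8_bytes = encode_for_console(word, "utf-8")
cp1251_bytes = encode_for_console(word, "cp1251")
cp866_bytes = encode_for_console(word, "cp866")
ascii_bytes = encode_for_console(word, "ascii")  # every character becomes "?"
```

The same word yields different bytes under cp1251 and cp866, so the bytes have to match the code page the console actually uses; otherwise the text shows up as the wrong characters.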
