Unicode endian puzzled me

Question

I used gedit to save three files, named ok1, ok2 and ok3, that all contain the same character "你" ("you" in English) in three different encodings: GBK, UTF-8 and UCS-2.

>>> f1 = open('ok1', 'rb').read()
>>> f2 = open('ok2', 'rb').read()
>>> f3 = open('ok3', 'rb').read()
>>> f1
'\xc4\xe3\n'
>>> f2
'\xe4\xbd\xa0\n'
>>> f3
'`O\n\x00'
>>> hex(ord("`"))
'0x60'
>>> hex(ord("O")) 
'0x4f'

So f3 actually starts with '\x60\x4f', but the following results confuse me:

>>> '\xe4\xbd\xa0'.decode("utf-8")
u'\u4f60'
>>> '\xc4\xe3'.decode("gbk")
u'\u4f60'
>>>

Why is there an endianness problem only in UCS-2 (or "Unicode", as some call it), and not in UTF-8 or GBK?

+5

python encoding endianness utf-8 ucs2

Dd pp Sep 08 '12 at 6:59

2 answers

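Endianness applies only to multi-byte words, and UTF-8 encodes information in units of 8 bits (that is what the 8 in the name stands for). With single-byte units there is no byte order to choose.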

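Sometimes it needs several units to encode a character, sometimes only one. The character A, for example, needs just one unit, 0x41. For characters that need more units, the order of those units is fixed by the encoding itself, not by the machine.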

GBK uses a similar scheme; its units are 1 byte each, as in UTF-8, so byte order is not an issue there either.
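To illustrate the point about byte-oriented encodings, here is a minimal sketch. It uses Python 3 syntax rather than the Python 2 sessions in the question, and the variable names are only illustrative. The expected bytes are the ones shown in the question.

```python
# Python 3 sketch: byte-oriented encodings yield the same bytes on any machine.
text = "你"  # U+4F60

utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")
ascii_a = "A".encode("utf-8")

print(utf8_bytes)  # b'\xe4\xbd\xa0'
print(gbk_bytes)   # b'\xc4\xe3'
print(ascii_a)     # b'A', the single byte 0x41
```

Neither codec has a little-endian or big-endian variant, because the order of the bytes is part of the encoding's definition.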

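UCS-2 (and its successor, UTF-16), on the other hand, encodes information in 2-byte units. UCS-2 uses 16 bits per character, so each unit is 16 bits wide. The 2 bytes of a unit belong together, but computer architectures differ in which of the two they store first. That choice is the endianness, and it only arises because each unit spans 2 bytes. Your machine uses little-endianness, which stores the least significant byte first. That is why the bytes 0x4F and 0x60 appear swapped in your file.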
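As a rough sketch of the two possible byte orders for one 16-bit unit (Python 3 syntax, names chosen here for illustration only):

```python
# Python 3 sketch: the same 16-bit unit written in both byte orders.
text = "你"  # U+4F60

little = text.encode("utf-16-le")  # least significant byte first
big = text.encode("utf-16-be")     # most significant byte first

print(little.hex())  # 604f
print(big.hex())     # 4f60

# Both byte strings give back the same character when the right order is named.
round_trip_ok = little.decode("utf-16-le") == big.decode("utf-16-be") == text
print(round_trip_ok)  # True
```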

Note that Python can decode UTF-16 in either byte order; when there is no BOM (byte order mark), you can name the endianness explicitly:

>>> '`O\n\x00'.decode('utf-16')
u'\u4f60\n'
>>> '`O\n\x00'.decode('utf-16-le')
u'\u4f60\n'
>>> 'O`\x00\n'.decode('utf-16-be')
u'\u4f60\n'

In the last example I reversed the bytes to show the big-endian order.
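For completeness, a small sketch of what a byte order mark does, assuming Python 3 and the standard `codecs` constants. The file in the question has no BOM, so this is an illustration rather than a description of that file.

```python
import codecs

# Python 3 sketch: a BOM in front lets the plain 'utf-16' codec pick the order.
text = "你\n"

le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The decoder reads the BOM, chooses the byte order, and drops the BOM.
decoded_le = le_with_bom.decode("utf-16")
decoded_be = be_with_bom.decode("utf-16")

print(decoded_le == decoded_be == text)  # True
```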

+3

Martijn Pieters Sep 08 '12 at 7:21

Tugrul Ates · Accepted Answer · 2012-09-08T07:17:04+0000

UTF-8 and GBK store data as a sequence of bytes. These encodings strictly define which byte value follows which. That byte order does not change with the architecture used for encoding, transmission, or decoding.

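UCS-2 and UTF-16, on the other hand, store data as a sequence of 2-byte words. The order of the bytes within each word is not fixed by the encoding itself. So, when decoding, you need to know the byte order in which the UCS-2 data was written.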

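The Unicode code point U+4F60 is encoded in UCS-2 as the single 2-byte word 0x4F60. On a little-endian system, that word is stored as the bytes ('0x60', '0x4F'). On a big-endian system, the two bytes are in the reverse order.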
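The same thing can be shown without any text codec at all. This is only a sketch, using the standard `struct` module in Python 3 to write the 16-bit number 0x4F60 in each byte order:

```python
import struct

code_point = 0x4F60  # U+4F60, "你"

# "<H" is an unsigned 16-bit integer, little-endian; ">H" is big-endian.
little = struct.pack("<H", code_point)
big = struct.pack(">H", code_point)

print(little)  # b'`O'  (0x60, 0x4F)
print(big)     # b'O`'  (0x4F, 0x60)
```

The little-endian result matches the first two bytes of f3 in the question.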

When there is no BOM, Python's 'utf-16' decoder uses your machine's native byte order for the 2-byte words:

>>> '`O\n\x00'.decode('utf-16')
u'\u4f60\n'
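A hedged sketch of how to check this on your own machine, in Python 3 syntax. It assumes CPython's behaviour of falling back to the native byte order when the data carries no BOM; `native_codec` is just a name made up for this example.

```python
import sys

data = b"`O\n\x00"  # the bytes read from ok3: little-endian, no BOM

# Pick the explicit codec that matches this machine's byte order.
native_codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# Without a BOM, 'utf-16' is expected to behave like the native codec.
same_as_native = data.decode("utf-16") == data.decode(native_codec)

# Naming the byte order explicitly gives the same result on any machine.
explicit = data.decode("utf-16-le")

print(sys.byteorder, same_as_native)
print(repr(explicit))
```

If the data may travel between machines, naming 'utf-16-le' or 'utf-16-be' explicitly (or writing a BOM) avoids depending on the native order.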
