Unicode endianness puzzled me

Using gedit, I saved three files named ok1, ok2 and ok3. Each contains the same character "你" ("you" in English), encoded in three different ways: GBK, UTF-8 and UCS-2.

>>> f1 = open('ok1', 'rb').read()
>>> f2 = open('ok2', 'rb').read()
>>> f3 = open('ok3', 'rb').read()
>>> f1
'\xc4\xe3\n'
>>> f2
'\xe4\xbd\xa0\n'
>>> f3
'`O\n\x00'
>>> hex(ord("`"))
'0x60'
>>> hex(ord("O")) 
'0x4f'

So f3 actually starts with the bytes '\x60\x4f', but the following results confused me:

>>> '\xe4\xbd\xa0'.decode("utf-8")
u'\u4f60'
>>> '\xc4\xe3'.decode("gbk")
u'\u4f60'
>>> 

Why is there an endianness problem only with UCS-2 (or "Unicode", as some call it), and not with UTF-8 or GBK?

+5
2 answers

UTF-8 and GBK store data as a sequence of bytes. These encodings strictly define which byte value follows which. This byte order does not change with the architecture used for encoding, transmission, or decoding.
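As a quick check of the point above, here is a minimal sketch that encodes the character from the question with both byte-oriented codecs. It is written in Python 3 syntax (an assumption; the session in the question is Python 2), and the variable names are only illustrative.

```python
# Python 3 syntax; the session in the question uses Python 2.
text = "\u4f60"  # the character from the question

# Byte-oriented encodings: the byte sequence is fixed by the encoding
# itself, so it is the same on every machine.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(utf8_bytes)  # b'\xe4\xbd\xa0'
print(gbk_bytes)   # b'\xc4\xe3'
```

These are the same bytes that appear in ok2 and ok1 in the question, minus the trailing newline.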

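UCS-2 and UTF-16, on the other hand, store data in 2-byte words. The order of the bytes within such a word depends on the underlying architecture. So, to decode the data correctly, you have to know which byte order was used when it was written as UCS-2.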

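The Unicode code point U+4F60 is encoded in UCS-2 as the single 2-byte word 0x4F60. On a little-endian machine, that word is stored as the byte pair ('0x60', '0x4F'), least significant byte first. On a big-endian machine, the order is reversed.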
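To see the two possible byte orders of the 2-byte word 0x4F60 side by side, here is a small sketch using the standard struct module. It uses Python 3 syntax (an assumption; the thread itself is Python 2), and the names code_unit, little and big are illustrative.

```python
import struct

code_unit = 0x4F60  # U+4F60 as one 16-bit value

# "<H" packs an unsigned 16-bit integer little-endian,
# ">H" packs the same integer big-endian.
little = struct.pack("<H", code_unit)  # bytes 0x60, 0x4F
big = struct.pack(">H", code_unit)     # bytes 0x4F, 0x60

print(little)  # b'`O'
print(big)     # b'O`'

# The explicit-endianness codecs produce the same two byte orders.
assert little == "\u4f60".encode("utf-16-le")
assert big == "\u4f60".encode("utf-16-be")
```

The little-endian result matches the first two bytes of ok3 in the question.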

Python decodes the data correctly once it is told that it consists of 2-byte words:

>>> '`O\n\x00'.decode('utf-16')
u'\u4f60\n'
+5

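Endianness only applies to multi-byte words, but UTF-8 uses units of 8 bits (sometimes several 8-bit units are needed to encode one character). The order of those units is fixed by the encoding.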

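Put differently, a single byte has no byte order to speak of. The letter A, for example, is encoded in ASCII and in UTF-8 as the single byte 0x41. However the data is stored, transmitted, or read, that byte stays the same. There is nothing to swap.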

GBK is similar; it also uses units of 1 byte, just like UTF-8, so endianness does not apply to it either.

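UCS-2 (and its successor, UTF-16), however, uses 2-byte units. Each unit is 16 bits, and those 16 bits together form one value. Because a unit spans 2 bytes, the bytes have to be written in some order, and different architectures choose different orders. That choice is the endianness: which of the 2 bytes comes first. Your machine uses little-endian order, so the least significant byte comes first. That is why 0x4F and 0x60 appear swapped.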

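Note that Python lets you choose the endianness explicitly when decoding UTF-16; if you do not specify the endianness, a default is used (the machine's native byte order when the data carries no byte order mark):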

>>> '`O\n\x00'.decode('utf-16')
u'\u4f60\n'
>>> '`O\n\x00'.decode('utf-16-le')
u'\u4f60\n'
>>> 'O`\x00\n'.decode('utf-16-be')
u'\u4f60\n'

The last example uses big-endian byte order, so the two bytes of each unit are swapped.
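The plain 'utf-16' codec can also pick the byte order from a byte order mark (BOM) at the start of the data. The following is a minimal sketch of that behaviour in Python 3 syntax (an assumption; the session above is Python 2). The files in the question have no BOM, so the prefixes here are added by hand purely for illustration.

```python
import codecs

text = "\u4f60\n"

# Explicit byte orders, no BOM written.
le = text.encode("utf-16-le")  # b'`O\n\x00'
be = text.encode("utf-16-be")  # b'O`\x00\n'

# Prefix each with the matching BOM so the generic 'utf-16'
# decoder can detect the byte order on its own.
with_le_bom = codecs.BOM_UTF16_LE + le
with_be_bom = codecs.BOM_UTF16_BE + be

decoded_le = with_le_bom.decode("utf-16")
decoded_be = with_be_bom.decode("utf-16")

# Both decode to the same text; the BOM itself is consumed.
print(decoded_le == text, decoded_be == text)  # True True
```

Without a BOM, the generic codec has to fall back to a default, which is why naming 'utf-16-le' or 'utf-16-be' explicitly is the safer choice when the byte order is known.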

+3
