What character string should a source send to disambiguate the byte encoding it uses?

Question

I am decoding bytestreams into Unicode characters without knowing the encoding used by each of hundreds of different senders.

Many of the senders are not technically astute and will not be able to tell me which encoding they use. It will be determined by whatever toolchain they happen to use to generate the data.

The senders are currently all British / English-language, but they use a variety of operating systems.

Can I ask all senders to send me a specific character string that will unambiguously demonstrate which encoding each sender uses?

I understand that there are libraries that use heuristics to guess the encoding. I will also pursue that as a fallback at runtime, but first I would like to determine definitively which encodings are in use, if I can.

(I do not think this is relevant, but I am working in Python.)

encoding unicode decoding

Jonathan Hartley Aug 21 '12 at 17:24

1 answer

Jim DeLaHunt · Accepted Answer · 2012-11-18T06:18:32+0000

The full answer to this question depends on many factors, such as the range of encodings used by the various sending systems, how closely your users will follow instructions for entering a sequence of characters into text fields, and how skilled they are at the obscure key combinations needed to type a magic character sequence.

, . "" (), , UTF-8, UTF-16, iso8859_5 koi8_r. , , , , .

Suppose the encodings in play are ISO-8859-15, Mac_Roman, UTF-8, UTF-16LE and UTF-16BE. Then the Euro sign, '€', U+20AC, is encoded as follows:

byte ['\xa4'] for iso-8859-15
bytes ['\xe2', '\x82', '\xac'] for utf-8
bytes ['\x20', '\xac'] for utf-16be
bytes ['\xac', '\x20'] for utf-16le
byte ['\x80'] for cp1252 ("Windows ANSI")
byte ['\xdb'] for macroman
iso-8859-1 cannot encode the Euro sign at all. iso-8859-15 is the successor to iso-8859-1 that adds it.
U.S. , , , . (, , 3% .)

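A single byte is not enough to settle the question, because the same byte means different things in different encodings. For example, the byte "\xa4" is the Euro sign in iso-8859-15, but it is "¤" in iso-8859-1 and cp1252 (and in UTF-16le when followed by a zero byte), "§" in Mac Roman, and in UTF-16 it could be half of a U+A4xx Yi syllable or of a character such as U+01A4 LATIN CAPITAL LETTER P WITH HOOK. On its own it is not valid UTF-8.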
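The ambiguity of a lone byte can be checked directly. This is a minimal sketch, assuming Python 3; the `CANDIDATES` list and the variable names are illustrative only and not part of the original answer.

```python
# Decode the single byte 0xA4 under several single-byte encodings,
# then check whether UTF-8 accepts it on its own.
CANDIDATES = ['iso-8859-15', 'iso-8859-1', 'cp1252', 'macroman']

byte = b'\xa4'
decoded = {enc: byte.decode(enc) for enc in CANDIDATES}
for enc, ch in decoded.items():
    print(enc, repr(ch))

try:
    byte.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    # 0xA4 is a continuation byte, so it cannot start a UTF-8 sequence.
    valid_utf8 = False
print('valid UTF-8:', valid_utf8)
```

Each single-byte encoding accepts the byte without complaint, which is why a successful decode alone says nothing about which encoding the sender used.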

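The Python 3.x documentation, section 7.2.3, lists the encodings that Python supports. The table above can be generated with:

>>> euro = '\u20ac'  # the Euro sign
>>> for e in ['iso-8859-1', 'iso-8859-15', 'utf-8', 'utf-16be', 'utf-16le',
...           'cp1252', 'macroman']:
...     # 'backslashreplace' avoids an exception where the codec has no Euro sign
...     print(e, euro.encode(e, 'backslashreplace'))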
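Going the other way, the bytes a sender actually delivers for the Euro sign can be matched against the candidate encodings. The following is a rough sketch, assuming Python 3 and assuming the sender sent exactly one '€' and nothing else; `matching_encodings`, `CANDIDATES` and `PROBE` are hypothetical names introduced here for illustration.

```python
# Candidate encodings are an assumption taken from the table above.
CANDIDATES = ['iso-8859-15', 'utf-8', 'utf-16be', 'utf-16le',
              'cp1252', 'macroman']
PROBE = '\u20ac'  # the Euro sign the sender was asked to type


def matching_encodings(received, probe=PROBE, candidates=CANDIDATES):
    """Return the candidate encodings under which `received` decodes to `probe`."""
    matches = []
    for enc in candidates:
        try:
            if received.decode(enc) == probe:
                matches.append(enc)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding at all; skip it.
            pass
    return matches


print(matching_encodings(b'\xe2\x82\xac'))
print(matching_encodings(b'\x80'))
```

With this candidate list each Euro byte sequence from the table matches exactly one encoding. A larger candidate list may produce several matches, in which case a longer probe string would be needed.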

, , , , "€" , - . . , '€ €'; , .
