The correct way to load Unicode text in Python 2.7 is something like this:
content = open('filename').read().decode('encoding')
for line in content.splitlines():
    process(line)
(Update: No, it is not. See the answers.)
However, if the file is very large, I may want to read, decode and process it one line at a time, so that the entire file is never loaded into memory at once. Something like:
for line in open('filename'):
    process(line.decode('encoding'))
Iterating over an open file object with a for loop behaves like a generator that reads one line at a time.
This does not work, though, because if the file is encoded in, for example, utf32, then the bytes in the file (in hexadecimal) look something like this:
hello\n = 68000000(h) 65000000(e) 6c000000(l) 6c000000(l) 6f000000(o) 0a000000(\n)
The for loop splits the file into lines on the 0a byte of the \n character, resulting in (in hexadecimal):
lines[0] = 0x 68000000 65000000 6c000000 6c000000 6f000000 0a
lines[1] = 0x 000000
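So part of the \n character is left at the end of line 1, and the remaining three bytes end up in line 2 (followed by whatever text is actually in line 2). Calling decode on either of these lines understandably raises a UnicodeDecodeError.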
UnicodeDecodeError: 'utf32' codec can't decode byte 0x0a in position 24: truncated data
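The split can be reproduced without a real file. This is only a minimal sketch: it assumes little-endian UTF-32 without a BOM (the 'utf-32-le' codec), which matches the byte layout shown above, and it uses io.BytesIO as a stand-in for a file opened in byte mode. The helper name decodes_cleanly is made up for illustration.

```python
import io

# Assumption: little-endian UTF-32 without a BOM, as in the hex dump above.
data = u'hello\n'.encode('utf-32-le')

# Iterating over a byte-mode file object splits after every 0x0a byte.
# io.BytesIO stands in for open('filename', 'rb') here.
chunks = list(io.BytesIO(data))


def decodes_cleanly(raw):
    """Return True if raw is valid UTF-32-LE, False otherwise (hypothetical helper)."""
    try:
        raw.decode('utf-32-le')
        return True
    except UnicodeDecodeError:
        return False


for chunk in chunks:
    # The first chunk ends in a lone 0a byte, the second holds the
    # three leftover zero bytes, so neither is a whole number of code units.
    print('%r decodes cleanly: %s' % (chunk, decodes_cleanly(chunk)))
```

Neither chunk has a length that is a multiple of four bytes, so neither can be decoded on its own, while the unsplit data decodes fine.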
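So, obviously, splitting a Unicode byte stream on 0a bytes is not the correct way to split it into lines. Instead, I should split on occurrences of the full four-byte newline character (0x0a000000). However, I think the correct way to detect these characters is to decode the byte stream into a Unicode string and look for \n characters, and decoding the full stream is exactly the operation I am trying to avoid.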
What is the correct way to handle this?
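For reference, one approach that seems to avoid the problem is to let the file object do the decoding. The sketch below assumes io.open (available in Python 2.6+ and Python 3), whose text mode decodes the file incrementally and splits lines only after decoding. The function name iter_decoded_lines and the temporary demo file are made up for illustration, and the encoding is assumed to be known in advance.

```python
import io
import os
import tempfile


def iter_decoded_lines(path, encoding):
    """Yield unicode lines from path one at a time (hypothetical helper).

    io.open in text mode decodes the file incrementally, so the whole
    file is not decoded up front.
    """
    with io.open(path, 'r', encoding=encoding) as handle:
        for line in handle:
            yield line


# Usage sketch with a throwaway file written as UTF-32 (with a BOM).
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-32', newline='\n') as out:
    out.write(u'hello\nworld\n')

decoded = list(iter_decoded_lines(demo_path, 'utf-32'))
os.remove(demo_path)
```

In real use, the list(...) call would be replaced by a for loop that calls process(line) on each line, so that only one line is held at a time.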