How to decode unicode one line at a time in Python 2.7?

The correct way to load unicode text in Python 2.7 is something like this:

content = open('filename').read().decode('encoding')
for line in content.splitlines():
    process(line)

(Update: No, it is not. See the answers.)

However, if the file is very large, I might want to read, decode and process it one line at a time, so that the entire file is never loaded into memory at once. Something like:

for line in open('filename'):
    process(line.decode('encoding'))        

The for loop's iteration over the open file handle acts as a generator that reads one line at a time.

This does not work, though: if the file is encoded in utf32, for example, the bytes in the file (in hexadecimal) look something like this:

hello\n = 68000000(h) 65000000(e) 6c000000(l) 6c000000(l) 6f000000(o) 0a000000(\n)

The split into lines done by the for loop splits on the 0a byte of the \n character, resulting in (in hexadecimal):

lines[0] = 0x 68000000 65000000 6c000000 6c000000 6f000000 0a
lines[1] = 0x 000000

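So part of the \n character is left at the end of line 1, and the remaining three bytes end up at the start of line 2 (followed by whatever text is actually in line 2). Calling decode on either of these lines understandably raises a UnicodeDecodeError.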

UnicodeDecodeError: 'utf32' codec can't decode byte 0x0a in position 24: truncated data
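
The failure can be reproduced without a file on disk. The following is a minimal sketch, assuming little-endian UTF-32 without a byte order mark (which matches the hex dump above) and using an in-memory `io.BytesIO` stream as a stand-in for the file opened in binary mode; the sample text is made up for illustration.

```python
import io

# Assumption: little-endian UTF-32 with no BOM, as in the hex dump above.
data = u'hello\nworld\n'.encode('utf-32-le')

# Iterating a binary stream splits after every 0x0a byte,
# regardless of where that byte sits inside a four-byte code unit.
raw_lines = list(io.BytesIO(data))

# Try to decode each chunk on its own and record which ones fail.
errors = []
for raw in raw_lines:
    try:
        raw.decode('utf-32-le')
    except UnicodeDecodeError as exc:
        errors.append(exc.reason)
```

The first chunk ends in a lone 0a byte, and every later chunk starts with the three zero bytes left over from the previous newline, so none of the chunks decodes cleanly.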

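So, obviously, splitting a unicode byte stream on 0a bytes is not the correct way to split it into lines. Instead I should split on occurrences of the full four-byte newline character (0x0a000000). However, I think the correct way to detect these characters is to decode the byte stream into a unicode string and look for \n characters, and decoding the full stream is exactly the operation I am trying to avoid.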

This cannot be an uncommon requirement. What is the correct way to handle it?


codecs.open works, but io.open is generally the better choice.

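io.open is available in Python 2.6 and later and, on Py3, is the same function as the built-in open, so code written against it ports forward more easily than code using codecs.open. codecs.open is only needed if you must support Python 2.5 or earlier, where io.open does not exist.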

import io

# Use with statement for guaranteed, predictable cleanup
with io.open('filename', encoding='utf-32') as f:
    for line in f:
        process(line)
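
To see the approach end to end, here is a small round-trip sketch. The helper name `read_lines`, the temporary file, and the sample text are all assumptions made for illustration; only `io.open` and its `encoding` argument come from the answer above.

```python
import io
import os
import tempfile


def read_lines(path, encoding):
    """Yield decoded lines one at a time, without loading the whole file."""
    with io.open(path, encoding=encoding) as f:
        for line in f:
            yield line


# Write a small UTF-32 sample file (hypothetical content).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'w', encoding='utf-32', newline='\n') as f:
    f.write(u'hello\nworld\n')

# Each item is already a unicode string; no manual decode() call is needed.
lines = list(read_lines(path, 'utf-32'))
os.remove(path)
```

Because the decoding happens inside the file object, line splitting is done on decoded characters rather than on raw bytes, which is what avoids the truncated-data error.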

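If you are handed a file object that is already open in binary mode (because something else opened it), you can wrap it in io.TextIOWrapper instead. On Python 2 this should only be expected to work for file objects created by the io module, not for the built-in file type: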

def process_file(f):
    if 'b' in f.mode:  # Or some better test...
        f = io.TextIOWrapper(f, encoding='utf-32')
    for line in f:
        process(line)
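
A quick way to try the wrapping idea without touching the file system is to use an in-memory binary stream. This is only a sketch: `io.BytesIO` stands in for a file opened in binary mode, and the sample text is made up.

```python
import io

# A binary stream holding UTF-32 data (BOM included by the 'utf-32' codec).
raw = io.BytesIO(u'hello\nworld\n'.encode('utf-32'))

# Wrap the binary stream so that iteration yields decoded text lines.
text = io.TextIOWrapper(raw, encoding='utf-32')
decoded = [line for line in text]
```

Note that `io.BytesIO` has no `mode` attribute, so the `'b' in f.mode` check from the answer would need a different test for streams like this one.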

How about trying something like this:

import codecs
for line in codecs.open("filename", "r", "utf32"):
    print line

I think this should work.

The codecs module should do the decoding for you.


Try using the codecs module:

import codecs
for line in codecs.open(filename, encoding='utf32'):
    do_something(line)

Use codecs.open instead of the built-in open:

import codecs
for line in codecs.open('filename', encoding='encoding'):
    print repr(line)

http://docs.python.org/library/codecs.html#codecs.open

Of course, I discovered this a few moments after finishing my elaborate Stack Overflow question.
