What is the best way to process a very large (over 30 GB) text file and show progress?

[newbie question]

Hi,

I am working with a huge text file, larger than 30 GB.

I need to do some processing on each line and then write it to a database in JSON format. When I read the file and loop over it with a "for" loop, the computer crashes and displays a blue screen after about 10% of the data has been processed.

I am currently using this:

f = open(file_path,'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()

Also, how can I show overall progress, that is, how much of the data has been crunched so far?

Thank you very much.

+3
3 answers

File handles are iterable, so you can loop over the file directly, and you should probably use a context manager. Try the following:

with open(file_path, 'r') as fh:
  for line in fh:
    process(line)

That may be enough.
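
As a rough sketch of how this could fit the task in the question (process each line, then produce JSON), here is one way to stream records lazily. The helper `line_to_json` and the record layout (`n`, `text`) are assumptions made for illustration and are not from the original post; the database write is left out because the post does not say which database is used.

```python
import json

def line_to_json(line, line_number):
    """Hypothetical processing step: strip the newline and wrap the text in a JSON record."""
    return json.dumps({"n": line_number, "text": line.rstrip("\n")})

def iter_json_records(file_path):
    """Yield one JSON string per line; only the current line is held in memory."""
    with open(file_path, "r") as fh:
        # Iterating over fh reads one line at a time instead of loading the whole file.
        for line_number, line in enumerate(fh, start=1):
            yield line_to_json(line, line_number)
```

Each yielded string could then be passed to whatever database client you use, one record (or one small batch) at a time.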

+4

You can wrap any iterable in a generator that reports progress. Change this:

for one_line in f.readlines():

to this:

# don't use readlines, it creates a big list of all data in memory rather than
# iterating one line at a time.
for one_line in progress_meter(f, 10000):

where progress_meter is defined as follows:

import time

def progress_meter(iterable, chunksize):
    """Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            # Report the item count, then average and instantaneous rates (items per second).
            print(idx)
            print('avg rate', idx / (time.time() - scan_start))
            print('inst rate', chunksize / (time.time() - since_last))
            since_last = time.time()
            print()
        yield val
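
Putting the pieces together might look like the sketch below. It repeats a condensed copy of the generator so the block runs on its own. `process_file` is a hypothetical wrapper, and the body of `do_some_processing` is a placeholder for the per-line work from the question.

```python
import time

def progress_meter(iterable, chunksize):
    """Condensed copy of the generator above so this block runs on its own."""
    start = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print(idx, 'lines,', round(time.time() - start, 1), 'seconds elapsed')
        yield val

def do_some_processing(line):
    """Placeholder for the per-line work from the question (hypothetical body)."""
    return len(line)

def process_file(file_path, chunksize=10000):
    """Stream the file line by line with progress output; returns the number of lines processed."""
    count = 0
    with open(file_path, 'r') as fh:
        # The file object is passed straight to the generator, so no list of lines is built.
        for one_line in progress_meter(fh, chunksize):
            do_some_processing(one_line)
            count += 1
    return count
```

A larger chunksize prints less often, which keeps the status output from slowing down the loop.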
+1

Using readline requires finding the end of each line in your file. If some lines are very long, this may cause your interpreter to crash (not enough memory to buffer the full line).

To show progress, you can check the file size, for example:

import os
f = open(file_path, 'r')
# os.fstat expects a file descriptor, not a file object
fsize = os.fstat(f.fileno()).st_size

The progress of your task can then be computed as the number of bytes processed divided by the file size, multiplied by 100 to get a percentage.
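
A minimal sketch of that calculation is shown below. The function name `iter_with_percent` is hypothetical. The file is opened in binary mode (an assumption, so that `len(line)` counts bytes and matches `st_size`), which means each line is a `bytes` object that you decode yourself if you need text.

```python
import os

def iter_with_percent(file_path):
    """Yield (line, percent_done) pairs, where percent_done = bytes read / file size * 100."""
    with open(file_path, 'rb') as f:  # binary mode so len(line) is a byte count
        fsize = os.fstat(f.fileno()).st_size
        bytes_done = 0
        for line in f:
            bytes_done += len(line)
            # An empty file never enters the loop, so fsize is non-zero here.
            yield line, 100.0 * bytes_done / fsize
```

In the processing loop you could print the percentage only every few thousand lines, so that the output itself does not become the bottleneck.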

0