Processing large amounts of data in Python

Question

I am trying to process a fairly large amount of data (several GB), but my personal computer cannot do it in a reasonable amount of time, so I am wondering what my options are. I used Python's csv.reader, but it was very slow even to extract 200,000 lines. I then moved the data into an sqlite database, which returned results somewhat faster and used less memory, but slowness remained a serious problem.

So again: what options do I have for processing this data? I am interested in Amazon spot instances, which seem useful for this kind of work, but maybe there are other solutions to explore.

Assuming spot instances are a good option, and given that I have never used them before, what can I expect from them? Does anyone have experience using them for this kind of thing? If so, what is your workflow? I expected to find several blog posts describing workflows for scientific computing, image processing, or similar tasks, but I didn't find anything, so if you can explain this a bit or provide some links, I would appreciate it.

Thanks in advance.

+5

python amazon-ec2 csv machine-learning scientific-computing

Robert Smith Sep 22 '12 at 18:45

2 answers

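If you want to stay with Python, you can try dumbo, which lets you write Hadoop programs in Python and process your data with Hadoop. See its short tutorial: https://github.com/klbostee/dumbo/wiki/Short-tutorial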

A similar project from Yelp is mrjob: https://github.com/Yelp/mrjob
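To give a feel for the programming model behind tools like dumbo and mrjob, here is a minimal local sketch of the map/reduce pattern in plain Python. It uses neither library and no Hadoop; the names `mapper`, `reducer`, and `run_local` and the sample rows are made up for illustration, so treat it as a picture of the idea rather than as either project's API.

```python
import csv
import io
from collections import defaultdict


def mapper(row):
    """Emit (key, value) pairs for one CSV row: here, (category, amount)."""
    category, amount = row
    yield category, float(amount)


def reducer(key, values):
    """Combine all values seen for one key: here, sum them."""
    yield key, sum(values)


def run_local(lines):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for row in csv.reader(lines):
        for key, value in mapper(row):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# Hypothetical sample data standing in for a multi-GB file.
sample = io.StringIO("a,1.5\nb,2.0\na,3.5\n")
totals = run_local(sample)
print(totals)
```

On a real cluster the map and reduce steps run on many machines and the framework does the grouping, which is why this style scales to data that does not fit on one computer.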

+1

greeness 19 . '12 7:10

bmu · Accepted Answer · 2012-10-10T09:18:46+0000

I would try using numpy to work with your large datasets locally. Numeric arrays should use less memory than the lists of strings produced by csv.reader, and computation should be much faster when you use vectorized numpy functions.
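As a small sketch of the difference, the snippet below computes a sum of squares once with a Python loop over values parsed by csv.reader and once with a vectorized numpy expression. The five-value sample column is hypothetical; on real data the array would hold millions of values.

```python
import csv
import io

import numpy as np

# Hypothetical sample data: one numeric column with five values.
text = "1.0\n2.0\n3.0\n4.0\n5.0\n"

# Row-by-row approach: every value becomes a separate Python float object.
rows = [float(row[0]) for row in csv.reader(io.StringIO(text))]
loop_total = 0.0
for value in rows:
    loop_total += value * value

# numpy approach: the values live in one contiguous float64 buffer, and the
# arithmetic runs in compiled code rather than in a Python loop.
arr = np.array(rows, dtype=np.float64)
vector_total = float(np.sum(arr * arr))

print(loop_total, vector_total, arr.nbytes)
```

Each float64 element takes 8 bytes in the array, which is where the memory saving over per-row Python objects comes from.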

However, memory can become a problem while reading the file: numpy.loadtxt and numpy.genfromtxt also consume a lot of memory when reading files. If this is a problem, some (new) alternative parsers are compared here. According to this post, the new parser in pandas (a library built on top of numpy) appears to be an option.
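One way to keep memory bounded while reading, which the answer itself does not mention, is to let pandas read the file in chunks. The sketch below assumes the `chunksize` argument of `pandas.read_csv` and uses a tiny in-memory CSV with made-up column names in place of the real file.

```python
import io

import pandas as pd

# Hypothetical sample data; in practice pass the path to the large file.
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

total = 0
n_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["value"].sum())
    n_rows += len(chunk)

print(n_rows, total)
```

This only helps when the computation can be done piece by piece (sums, counts, filters); anything that needs the whole table at once still has to fit in memory.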

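Once you have read the files, I would also suggest storing the data in a binary format such as HDF5. Loading data from an HDF5 file is fast (it would be interesting to see how it compares with sqlite in your case). A simple way to save your data as HDF5 is with pandas: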

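import pandas as pd

# filename is the path to your CSV file; add any read_csv options you need
data = pd.read_csv(filename)
# Write the DataFrame to an HDF5 file (HDFStore requires the PyTables package)
store = pd.HDFStore('data.h5')
store['mydata'] = data
store.close()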

To load the data again later:

import pandas as pd

# Reopen the HDF5 file and read the stored DataFrame back into memory
store = pd.HDFStore('data.h5')
data = store['mydata']
store.close()
