Python - get column iterator from file (without reading the whole file)

My main goal is to calculate the median (by column) of a huge float matrix. Example:

a = numpy.array(([1,1,3,2,7],[4,5,8,2,3],[1,6,9,3,2]))

numpy.median(a, axis=0)

Out[38]: array([ 1.,  5.,  8.,  2.,  3.])

The matrix is too large to fit in memory (~5 terabytes), so I store it in a CSV file. I want to iterate over each column and calculate its median.

Is there any way to get a column iterator without reading the whole file?

Any other ideas for calculating the median of the matrix would be welcome too. Thanks!

+5
4 answers

If each column fits in memory (which you seem to imply it does), then this should work:

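import csv

def columns(file_name):
    # Read only the first row to find out how many columns there are.
    with open(file_name, newline='') as file:
        n_columns = len(next(csv.reader(file)))
    # Re-read the file once per column, so only one column is held in memory at a time.
    for column in range(n_columns):
        with open(file_name, newline='') as file:
            data = csv.reader(file)
            # Convert to float so the values can go straight to a median function.
            yield [float(row[column]) for row in data]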
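As a usage sketch, the generator above can feed a per-column median directly. This assumes every field parses as a float, the file has no header row, and a single column fits in memory. `column_medians` is a hypothetical helper name, and `statistics.median` stands in for `numpy.median` to keep the example dependency-free.

```python
import csv
import statistics


def columns(file_name):
    """Yield each column of a CSV file as a list of floats (one file pass per column)."""
    with open(file_name, newline="") as f:
        n_columns = len(next(csv.reader(f)))
    for index in range(n_columns):
        with open(file_name, newline="") as f:
            yield [float(row[index]) for row in csv.reader(f)]


def column_medians(file_name):
    """Return the median of every column, holding one column in memory at a time."""
    return [statistics.median(column) for column in columns(file_name)]
```

Note that the file is read once per column, so for a wide matrix the total I/O is the file size multiplied by the number of columns.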


+3


+1

A CSV file is stored row by row, so the values of a single column are spread throughout the whole file. To see this, consider:

>>> import csv
>>> with open('foo.csv', 'w', newline='') as f:
...     writer = csv.writer(f)
...     for i in range(0, 100, 10):
...         writer.writerow(range(i, i + 10))
... 
>>> with open('foo.csv', 'r', newline='') as f:
...     f.read()
... 
'0,1,2,3,4,5,6,7,8,9\r\n10,11,12,13,14,15,16,17,18,19\r\n20..(output truncated)..

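To read every value of column 2, for example, you have to scan the whole file, because a CSV file keeps no index of where each row or field begins.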
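Because `csv.reader` hands back one row at a time, the values of a single column can still be streamed without holding the file in memory, at the cost of scanning every row. A minimal sketch follows. `iter_column` is a hypothetical helper name, and it assumes every field parses as a float and the file has no header row.

```python
import csv


def iter_column(file_name, index):
    """Lazily yield the values of one column as floats, one row at a time."""
    with open(file_name, newline="") as f:
        for row in csv.reader(f):
            yield float(row[index])
```

A streaming iterator like this is enough for a sum, minimum, or maximum, for instance `max(iter_column('foo.csv', 2))`. A median still needs either the whole column or several passes over it.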

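If you need to process the data more than once, consider storing it in a format other than CSV, for example with PyTables, which works with numpy.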

+1

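You could use bucketsort to sort each column on disk without reading it all into memory, and then pick the middle value.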
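One way to apply the bucketing idea without fully sorting is to histogram a column over a few passes and then sort only the bucket that contains the median. The sketch below is an illustration under assumptions, not the answerer's code. `bucket_median` and `n_buckets` are hypothetical names, the values are assumed finite (no NaN or infinity), the column must be re-readable (hence the factory argument), and the one or two buckets holding the middle values are assumed to fit in memory.

```python
def bucket_median(make_iter, n_buckets=1024):
    """Exact median of a re-readable stream of numbers.

    make_iter is a zero-argument callable returning a fresh iterator
    over the values, so the data can be scanned three times.
    """
    # Pass 1: count the values and find their range.
    count, lo, hi = 0, None, None
    for x in make_iter():
        if count == 0:
            lo = hi = x
        else:
            lo, hi = min(lo, x), max(hi, x)
        count += 1
    if count == 0:
        raise ValueError("no data")
    if lo == hi:
        return lo

    width = (hi - lo) / n_buckets

    def bucket_of(x):
        # Monotone in x; the maximum value is clamped into the last bucket.
        return min(int((x - lo) / width), n_buckets - 1)

    # Pass 2: histogram of bucket occupancy.
    counts = [0] * n_buckets
    for x in make_iter():
        counts[bucket_of(x)] += 1

    def locate(rank):
        # Return (bucket holding this 0-based rank, number of values below that bucket).
        seen = 0
        for b, c in enumerate(counts):
            if rank < seen + c:
                return b, seen
            seen += c

    # The two middle ranks coincide when count is odd.
    r1, r2 = (count - 1) // 2, count // 2
    b1, below = locate(r1)
    b2, _ = locate(r2)

    # Pass 3: keep and sort only the values in the bucket(s) holding the middle ranks.
    # If b1 != b2, every bucket between them is empty because the ranks are adjacent.
    wanted = {b1, b2}
    middle = sorted(x for x in make_iter() if bucket_of(x) in wanted)
    return (middle[r1 - below] + middle[r2 - below]) / 2
```

For a CSV column this could be called as `bucket_median(lambda: iter_column(path, i))`, where `iter_column` is any function that streams column `i` as floats. If the data is heavily skewed, the median bucket may still be too large, in which case the same procedure could be repeated on that bucket's range.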

You could also use the UNIX awk and sort commands to split out and sort the columns before selecting the median.
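A Python analogue of the splitting step might look like the sketch below: one pass over the CSV writes each column to its own file, which an external sort (such as the UNIX sort command) can then order on disk. `split_columns`, `median_of_sorted_file`, and the `col_<i>.txt` naming are hypothetical. The sketch assumes every row has the same number of fields and keeps one open file handle per column, so very wide matrices would need to be split in batches.

```python
import csv
import itertools
import os
from contextlib import ExitStack


def split_columns(file_name, out_dir):
    """Write column i to out_dir/col_i.txt, one value per line, in a single pass."""
    os.makedirs(out_dir, exist_ok=True)
    with open(file_name, newline="") as src, ExitStack() as stack:
        reader = csv.reader(src)
        first = next(reader)
        paths = [os.path.join(out_dir, f"col_{i}.txt") for i in range(len(first))]
        outs = [stack.enter_context(open(p, "w")) for p in paths]
        for row in itertools.chain([first], reader):
            for out, value in zip(outs, row):
                out.write(value + "\n")
    return paths


def median_of_sorted_file(path):
    """Median of a file with one number per line, already sorted in ascending order."""
    with open(path) as f:
        count = sum(1 for _ in f)
    if count == 0:
        raise ValueError("empty file")
    lo, hi = (count - 1) // 2, count // 2
    with open(path) as f:
        # Read only the one or two middle lines.
        middle = [float(line) for line in itertools.islice(f, lo, hi + 1)]
    return (middle[0] + middle[-1]) / 2
```

The sorting itself is deliberately left to an external tool here, since sorting a multi-gigabyte column in memory is exactly what the question is trying to avoid.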

0
