Fastest way to split a large file based on text inside the file in Linux

I have a large file that contains data for 10 years. I want to split it into files, each of which contains 1 year of data.

The data in the file is in the following format:

GBPUSD,20100201,000200,1.5969,1.5969,1.5967,1.5967,4
GBPUSD,20100201,000300,1.5967,1.5967,1.5960,1.5962,4

Characters 8-11 of each line contain the year. I would like to use the year as the file name, with .txt at the end: 2011.txt, 2012.txt, and so on.

The file contains about 4 million lines.

I am using Ubuntu Linux.

3 answers

Here is one way using awk:

awk '{ print > (substr($0,8,4) ".txt") }' file

If the length of the first field may vary, you may prefer:

awk -F, '{ print > (substr($2,1,4) ".txt") }' file
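
For comparison, the same field-based idea can be sketched in Python. This is only an illustration: the helper name year_of is made up here, and it assumes the date is the second comma-separated field in YYYYMMDD form, as in the sample data from the question.

```python
def year_of(line):
    """Return the 4-character year from the second comma-separated field."""
    # Split at most twice: only the second field is needed.
    date_field = line.split(",", 2)[1]
    # strip() tolerates a stray space after the comma.
    return date_field.strip()[:4]


# Sample line from the question.
print(year_of("GBPUSD,20100201,000200,1.5969,1.5969,1.5967,1.5967,4"))
# Hypothetical line with a shorter first field, to show the length no longer matters.
print(year_of("ABC,20110315,000300,1.0,1.0,1.0,1.0,2"))
```

Unlike a fixed character offset, this keeps working when the symbol in the first field is shorter or longer than six characters.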

Another way, using sed and grep:

YEARS=$(sed -e 's/^.......//' -e 's/\(....\).*$/\1/' FILE | sort -u)
for Y in $YEARS; do
    echo "Processing $Y..."
    grep "^.......$Y" FILE > "$Y.txt"
done
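
The first step of that pipeline, collecting the distinct years, could be sketched in Python roughly as follows. The function name distinct_years is hypothetical, and the fixed character positions 8-11 are taken from the question.

```python
def distinct_years(lines):
    """Return the sorted list of distinct years found at characters 8-11."""
    # line[7:11] is the 0-based slice for characters 8-11; shorter lines are skipped.
    return sorted({line[7:11] for line in lines if len(line) >= 11})


# The two sample lines from the question.
sample = [
    "GBPUSD,20100201,000200,1.5969,1.5969,1.5967,1.5967,4\n",
    "GBPUSD,20100201,000300,1.5967,1.5967,1.5960,1.5962,4\n",
]
print(distinct_years(sample))  # ['2010']
```

Any iterable of lines works as the argument, including an open file object.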


The AWK answer from @steve does the whole job in a single pass over the file.

With grep, the pattern ^.......2010 matches the lines for 2010. A script can run grep once per year, something like this:

for year in 2010 2011 2012; do
    grep "^.......$year" datafile > $year.txt
done

This is simple, but it reads the whole file once for every year.

Here is a Python version of the AWK approach.

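import sys

def next_line():
    """Yield input lines from stdin, or from the files named on the command line."""
    if len(sys.argv) == 1:
        for line in sys.stdin:
            yield line
    else:
        for name in sys.argv[1:]:
            with open(name) as f:
                for line in f:
                    yield line


# Cache of open output files, keyed by file name.
_open_files = {}
def output(fname, line):
    """Write line to fname, opening the file the first time it is seen."""
    if fname not in _open_files:
        _open_files[fname] = open(fname, "w")
    _open_files[fname].write(line)


for line in next_line():
    year = line[7:11]  # characters 8-11 hold the year
    fname = year + ".txt"
    output(fname, line)

# Close the output files so everything is flushed to disk.
for f in _open_files.values():
    f.close()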

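Like the AWK version, this reads the input only once. next_line() yields lines from standard input, or from each file named on the command line; AWK does this for you automatically. output() opens each output file the first time it is needed and keeps it open, which AWK also does for you.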
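
As a rough sketch, the same one-pass idea can be packaged as a reusable function that keeps one open handle per year and closes them all when it finishes. The name split_by_year, the out_dir parameter, and the returned line counts are additions for illustration, not part of the script above.

```python
import os
from contextlib import ExitStack


def split_by_year(lines, out_dir="."):
    """Write each line to <out_dir>/<year>.txt; return lines written per year."""
    counts = {}
    handles = {}
    # ExitStack closes every file opened below, even if an error occurs.
    with ExitStack() as stack:
        for line in lines:
            year = line[7:11]  # characters 8-11, as in the question
            if year not in handles:
                path = os.path.join(out_dir, year + ".txt")
                handles[year] = stack.enter_context(open(path, "w"))
                counts[year] = 0
            handles[year].write(line)
            counts[year] += 1
    return counts
```

It could be called on an open file object, for example split_by_year(f) inside a with open("datafile") as f: block, where datafile stands for the input file.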

If you know AWK, the one-liner is the shorter option; the Python version does the same job with more code.
