I have several CSV files, for example:
site,run,id,payload,dir
1,1,1,528,1
1,1,1,540,2
1,1,3,532,1
(In my actual case there are three files with a total of 1,408,378 lines.) For plotting, I want to reshape them into this format:
label,stream,dir,i,payload
A,1,1,1,586
A,1,1,2,586
A,1,1,3,586
where "label" is derived from the CSV file name; "stream" is a serial number assigned to each combination of "site", "run" and "id" within one file (so it is unique only within a "label"); "i" is the row number within each "stream"; and "dir" and "payload" are taken directly from the source file. I also want to drop everything except the first 20 rows of each stream. I know in advance that every cell in the CSV files (except the header) is a positive integer, and that "dir" only ever takes the values 1 and 2.
plyr, R 6 . foreach parallelism plyr : 10 , , , , .
I wrote a preprocessing script in Python:
import sys

def processOne(fname):
    # Maps "site,run,id" to its stream number and per-direction row counters.
    clusters = {}
    nextCluster = 1
    with open(fname + ".csv", "r") as f:
        for line in f:
            line = line.strip()
            if line == "site,run,id,payload,dir": continue
            (site, run, id, payload, dir) = line.split(',')
            clind = ",".join((site,run,id))
            clust = clusters.setdefault(clind,
                                        { "i":nextCluster, "1":0, "2":0 })
            # A freshly created entry carries the current number, so advance it.
            if clust["i"] == nextCluster:
                nextCluster += 1
            clust[dir] += 1
            # Keep only the first 20 rows per stream and direction.
            if clust[dir] > 20: continue
            sys.stdout.write("{label},{i},{dir},{j},{payload}\n"
                             .format(label=fname,
                                     i=clust["i"],
                                     dir=dir,
                                     j=clust[dir],
                                     payload=payload))

sys.stdout.write("label,stream,dir,i,payload\n")
for fn in sys.argv[1:]: processOne(fn)
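To check the numbering on a small sample without touching any files, the same transformation can be sketched as a generator over in-memory lines. This is only an illustration, not a replacement for the script: the names `reshape`, `limit`, `HEADER`, `ident` and `direction` are made up here, and the row counter is assumed to run per stream and direction, as it does in the script above.

```python
import csv
import io
from collections import defaultdict

HEADER = ["site", "run", "id", "payload", "dir"]

def reshape(lines, label, limit=20):
    """Yield (label, stream, dir, i, payload) tuples from CSV text lines."""
    streams = {}               # (site, run, id) -> stream number
    counts = defaultdict(int)  # (stream, dir) -> rows seen so far
    for row in csv.reader(lines):
        if not row or row == HEADER:
            continue
        site, run, ident, payload, direction = row
        # Streams are numbered 1, 2, 3, ... in order of first appearance.
        stream = streams.setdefault((site, run, ident), len(streams) + 1)
        counts[(stream, direction)] += 1
        i = counts[(stream, direction)]
        # Assumption: the 20-row cap applies per stream and direction.
        if i <= limit:
            yield (label, stream, int(direction), i, int(payload))

sample = io.StringIO(
    "site,run,id,payload,dir\n"
    "1,1,1,528,1\n"
    "1,1,1,540,2\n"
    "1,1,3,532,1\n"
)
rows = list(reshape(sample, "A"))
for r in rows:
    print(",".join(map(str, r)))
# prints:
# A,1,1,1,528
# A,1,2,1,540
# A,2,1,1,532
```

On the three sample rows, the first two share stream 1 (one row in each direction) and the third starts stream 2.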
R script:
all <- read.csv(pipe("python preprocess.py A B C", open="r"))
, : R? , . , , - , . , R ggplot2, , , , matplotlib.