I have about 3,500 files consisting of lines of strings. The files vary in size (from 200 to 1 MB). I am trying to compare each file with every other file and find a common subsequence 20 characters long between the two files. Note that the subsequence only needs to be common to the two files in a given comparison, not to all files.
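For illustration, a single comparison amounts to the check sketched below. This is only a sketch under stated assumptions: it treats "subsequence" as a contiguous run of characters, it uses a plain set of 20-character windows rather than a suffix tree, and the names `windows` and `share_window` are hypothetical.

```python
def windows(text, k=20):
    """Return the set of all contiguous k-character windows of text."""
    # If text is shorter than k, the range is empty and so is the set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def share_window(text_a, text_b, k=20):
    """Return True if both strings contain the same k-character run."""
    # Index the shorter string so the set stays as small as possible.
    if len(text_a) > len(text_b):
        text_a, text_b = text_b, text_a
    seen = windows(text_a, k)
    # Slide over the longer string and stop at the first hit.
    return any(text_b[i:i + k] in seen for i in range(len(text_b) - k + 1))


if __name__ == "__main__":
    common = "ABCDEFGHIJKLMNOPQRST"  # exactly 20 characters
    print(share_window("xxxxx" + common + "yyyyy", "zz" + common + "ww"))
    print(share_window("a" * 30, "b" * 30))
```

The set for a 1 MB file can hold up to about a million 20-character strings, so this trades memory for simplicity.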
I got a little stuck on this problem, and since I'm not an expert, I ended up with a rather ad hoc solution. I use itertools.combinations in Python to generate every pair of files, which comes to 6,239,278 combinations. I then pass the files, one pair at a time, to a Perl script that acts as a wrapper for libstree, a suffix tree library written in C. I tried to avoid this kind of solution, but the only comparable suffix tree wrapper in Python suffers from a memory leak.
So here is my problem. I timed it, and on my machine the solution processes about 500 comparisons in 25 seconds, which means the whole task would take roughly 3.6 days of continuous processing. And then I have to do it all again, looking at 25 characters instead of 20. Please note that I am well outside my comfort zone and not a very good programmer, so I'm sure there is a much more elegant way to do this. I thought I would ask here and post my code to see if anyone has a suggestion for completing this task faster.
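The running-time estimate can be reproduced with a few lines of arithmetic. This is a rough sketch: the figure of 3,533 files is an assumption inferred from the quoted pair count (it is the file count that gives exactly 6,239,278 pairs), and `estimate_days` is a hypothetical helper name.

```python
from math import comb


def estimate_days(n_pairs, batch_size=500, batch_seconds=25):
    """Estimate total running time in days from one timed batch."""
    rate = batch_size / batch_seconds  # comparisons per second
    return n_pairs / rate / 86_400     # 86,400 seconds in a day


if __name__ == "__main__":
    # Assumption: 3,533 files, which matches the quoted number of pairs.
    n_pairs = comb(3533, 2)
    print(n_pairs)
    print(f"{estimate_days(n_pairs):.2f} days")
```

At 20 comparisons per second this comes to a little over three and a half days per pass, and the 25-character run would cost about the same again.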
Python Code:
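from itertools import combinations
import glob, subprocess

glist = glob.glob("Data/*.g")
# Compare every unordered pair of files exactly once.
for i, (a, b) in enumerate(combinations(glist, 2), 1):
    # Run the Perl suffix tree wrapper on this pair, passing 20 as the length.
    p = subprocess.Popen(["perl", "suffix_tree.pl", a, b, "20"], shell=False, stdout=subprocess.PIPE)
    out = p.communicate()[0]
    # Strip the directory and extension: "Data/name.g" -> "name".
    name_a = a.split("/")[1].split(".")[0]
    name_b = b.split("/")[1].split(".")[0]
    print(str(i) + ":" + name_a + " --- " + name_b)
    # Record the pair if the script printed a 20-character match.
    if len(out) == 20:
        with open("tmp.list", "a") as openf:
            openf.write(name_a + " " + name_b + "\n")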
Perl Code:
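use strict;
use warnings;
use Tree::Suffix;

# Usage: perl suffix_tree.pl FILE_A FILE_B LENGTH
# Slurp each file into a single string.
open my $fh_a, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my $text_a = do { local $/; <$fh_a> };
close $fh_a;
open my $fh_b, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
my $text_b = do { local $/; <$fh_b> };
close $fh_b;

# Build one suffix tree over both strings and query it for a common substring.
my $st = Tree::Suffix->new($text_a, $text_b);
my ($c) = $st->lcs($ARGV[2], -1);
print $c if defined $c;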