Optimization: Python, Perl, and C suffix tree library

I have about 3,500 files consisting of string lines. The files vary in size (from 200 to 1 MB). I am trying to compare each file with every other file and find a common subsequence of 20 characters between the two files. Note that the subsequence only needs to be shared by the two files in a given comparison; it does not have to be common to all files.
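To make the goal concrete, here is a minimal sketch of the check for one pair of strings. It assumes "common subsequence" means a contiguous run of characters (a common substring), and it uses a plain set of fixed-length windows rather than a suffix tree, so it is not the libstree approach described in this post. The name `shared_window` is made up for illustration.

```python
def shared_window(s, t, k=20):
    """Return one length-k substring found in both s and t, or None.

    Hypothetical helper: builds a set of every length-k window of the
    shorter string, then scans the longer string for a match.
    """
    if len(s) > len(t):
        s, t = t, s  # index the shorter string to keep the set small
    windows = {s[i:i + k] for i in range(len(s) - k + 1)}
    for j in range(len(t) - k + 1):
        candidate = t[j:j + k]
        if candidate in windows:
            return candidate
    return None  # also reached when either string is shorter than k
```

The set holds roughly one entry per character of the shorter string, so a 1 MB file costs a fair amount of memory, and this only answers "is there a shared run of length k", not "what is the longest one".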

I got a little confused about this problem, and since I'm not an expert, I ended up with a somewhat ad hoc solution. I use itertools.combinations in Python to generate the pairs of files, about 6,239,278 combinations. I then pass the files two at a time to a Perl script, which acts as a wrapper for a suffix tree library written in C called libstree. I tried to avoid this kind of solution, but the only comparable suffix tree wrapper in Python suffers from a memory leak.

So here is my problem. I timed it, and on my machine the solution processes about 500 comparisons in 25 seconds, so it will take about 3 days of continuous processing to complete the task. And then I have to do it all again to look at 25 characters instead of 20. Please note that I am well outside my comfort zone and not a very good programmer, so I'm sure there is a much more elegant way to do this. I thought I would ask here and post my code to see if anyone has a suggestion for completing this task faster.
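The estimate can be checked with a few lines of arithmetic. The pair count and the timing are the figures quoted in this post; the file count of 3,533 is only an inference from the pair count and is not stated anywhere in the post.

```python
from math import comb

PAIRS = 6_239_278        # number of file pairs quoted in the post
RATE = 500 / 25          # comparisons per second, as timed

# Inference, not from the post: this pair count equals C(3533, 2),
# so the exact number of files is presumably 3,533.
assert comb(3533, 2) == PAIRS

seconds = PAIRS / RATE
days = seconds / 86_400  # seconds in one day
print(f"{seconds:,.0f} s, about {days:.1f} days")
```

At that rate one full pass comes out closer to three and a half days, and the 25-character run would take the same again.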

Python Code:

from itertools import combinations
import glob, subprocess

glist = glob.glob("Data/*.g")
i = 0

for a,b in combinations(glist, 2):
    i += 1
    p = subprocess.Popen(["perl", "suffix_tree.pl", a, b, "20"], shell=False, stdout=subprocess.PIPE)
    p = p.stdout.read()
    a = a.split("/")
    b = b.split("/")
    a = a[1].split(".")
    b = b[1].split(".")
    print str(i) + ":" + str(a[0]) + " --- " + str(b[0])
    if p != "" and len(p) == 20:
        with open("tmp.list", "a") as openf:
            openf.write(a[0] + " " + b[0] + "\n")

Perl Code:

use strict;
use Tree::Suffix;

open FILE, "<$ARGV[0]";
my $a = do { local $/; <FILE> };

open FILE, "<$ARGV[1]";
my $b = do { local $/; <FILE> };

my @g = ($a,$b);

my $st  = Tree::Suffix->new(@g);
my ($c) = $st->lcs($ARGV[2],-1);

print "$c";
1 answer

Rather than writing Python that calls Perl that calls C, I’m sure you would be better off dropping the Python code and writing all of this in Perl.

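If each file contains exactly one line, both files of a pair can be read at once by putting their names in @ARGV and writing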

my @g = <>;

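The program below is intended to do the work of your Python and Perl code together, in a single Perl process that calls libstree through Tree::Suffix.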

use strict;
use warnings;

use Math::Combinatorics;
use Tree::Suffix;

my @glist = glob "Data/*.g";
my $iterator = Math::Combinatorics->new(count => 2, data => \@glist);

open my $fh, '>', 'tmp.list' or die $!;

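my $n = 0;
while (my @pair = $iterator->next_combination) {
  $n++;
  # Read both files through <>; two strings result only if each file is one line
  @ARGV = @pair;
  my @g = <>;
  my $tree  = Tree::Suffix->new(@g);
  # Take the first result in list context, as the question's script does
  my ($lcs) = $tree->lcs;
  # Reduce each path to its base name, without directory or extension
  @pair = map m|/(.+?)\.|, @pair;
  print "$n: $pair[0] --- $pair[1]\n";
  # No comma after the filehandle, otherwise the line goes to STDOUT
  print $fh "@pair\n" if $lcs and length $lcs >= 20;
}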