How to quickly combine the fields of two sorted files when one is a subset of the other

I have two sorted files and want to combine them into a third, and the output needs to be sorted as well. The key column of the second file is a subset of the first, and any place where the second file has no match for the first should be filled with NA. The files are large (~20,000,000 records), so loading everything into memory is difficult, and speed matters.

File 1 is as follows:

1 a
2 b
3 c
4 d
5 e

File 2 is as follows:

1 aa
2 bb
4 dd
5 ee

And the output should look like this

1 a aa
2 b bb
3 c NA
4 d dd
5 e ee
4 answers

join is your friend here.

join -a 1 file1 file2

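should do the trick. The only difference from your example output is that the unpairable lines are printed as they are in file1, i.e. without the NA.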

Here is a version that also fills in the NAs:

join -a 1 -e NA -o 1.1 1.2 2.2 file1 file2

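If I understand you correctly:

  • File #1 and file #2 are both sorted on the same key.
  • File #2 may be missing lines that are in file #1.
  • Every key in file #2 also appears in file #1.

That means that if I take a line from file #2 and keep reading through file #1, I will eventually reach the matching line. So read a line from file #2, advance through file #1 until the keys match, and then print both values.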

The algorithm would look something like this:

Read first line from file #2
While read line from file #1
    if line from file #2 > line from file #1
        write line from file #1 and "NA"
    else
        write line from file #1 and file #2
        Read another line from file #2
    fi
done

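There also needs to be some error checking (what if the key from file #1 is greater than the key from file #2? That means file #1 is missing that line) and some boundary checking (what if file #2 runs out of lines before file #1 does?).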
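For what it's worth, the pseudocode with both of those checks added could be sketched like this in Python. The function name `merge_sorted` and the `missing` parameter are made up for this illustration, and the sketch assumes that every line has a key and a value, that both inputs are sorted on the key, and that keys compare as strings:

```python
def merge_sorted(lines1, lines2, missing="NA"):
    """Yield merged lines; keys absent from lines2 get the `missing` marker.

    Assumes both inputs are sorted on their first field and that every
    key in lines2 also appears in lines1 (assumptions of this sketch).
    """
    it2 = iter(lines2)
    line2 = next(it2, None)              # None means file #2 is exhausted
    for line1 in lines1:
        key1, value1 = line1.split(None, 1)
        value1 = value1.rstrip()
        if line2 is None:
            # Boundary check: file #2 ran out before file #1.
            yield f"{key1} {value1} {missing}"
            continue
        key2, value2 = line2.split(None, 1)
        value2 = value2.rstrip()
        if key1 == key2:
            yield f"{key1} {value1} {value2}"
            line2 = next(it2, None)
        elif key1 < key2:                # string comparison; compare int(key) for numeric keys
            yield f"{key1} {value1} {missing}"
        else:
            # Error check: file #2 holds a key that file #1 has already passed.
            raise ValueError(f"key {key2!r} from file #2 not found in file #1")
    if line2 is not None:
        # File #1 ended while file #2 still had an unmatched line.
        raise ValueError(f"unmatched line left in file #2: {line2!r}")
```

Since it accepts any iterable of lines, two open file objects can be passed in directly, so only one line per file is held in memory at a time.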

Here is how this could be done in Perl:


#! /usr/bin/env perl

use warnings;
use strict;
use feature qw(say);

use constant {
    TEXT1 =>        "foo1.txt",
    TEXT2 =>        "foo2.txt",
};


open (FILE1, "<", TEXT1) or die qq(Can't open file ) . TEXT1 . qq( for reading: $!\n);
open (FILE2, "<", TEXT2) or die qq(Can't open file ) . TEXT2 . qq( for reading: $!\n);

my $line2 = <FILE2>;
my ($lineNum2, $value2);
if (defined $line2) {                               # File 2 may be empty
    chomp $line2;
    ($lineNum2, $value2) = split(/\s+/, $line2, 2);
}
while (my $line1 = <FILE1>) {
    chomp $line1;
    my ($lineNum1, $value1) = split(/\s+/, $line1, 2);
    if (not defined $line2) {                       # File 2 is exhausted
        say "$lineNum1 $value1 NA";
    }
    elsif  ($lineNum1 lt $lineNum2) {               # String comparison; use "<" if the keys are sorted numerically
        say "$lineNum1 $value1 NA";
    }
    elsif ($lineNum1 eq $lineNum2) {                # Likewise, use "==" for numeric keys
        say "$lineNum1 $value1 $value2";
        $line2 = <FILE2>;
        if (defined $line2) {
            chomp $line2;
            ($lineNum2, $value2) = split(/\s+/, $line2, 2);
        }
    }
    else {
        die qq(Something went wrong: Line 1 = "$line1" Line 2 = "$line2"\n);
    }
}
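The note about `lt` versus `<` matters more than it might seem: the sample keys are numbers, but a string comparison does not order them the way a numeric comparison does once they have more than one digit. A tiny illustration in Python (the key list is invented for the example):

```python
keys = ["1", "2", "9", "10", "11"]      # numerically sorted keys, stored as strings

# Strings are compared character by character, so "10" sorts before "9".
string_less = "9" < "10"                # False
numeric_less = int("9") < int("10")     # True

lexical_order = sorted(keys)            # ['1', '10', '11', '2', '9']
numeric_order = sorted(keys, key=int)   # ['1', '2', '9', '10', '11']

print(string_less, numeric_less)
print(lexical_order)
print(numeric_order)
```

So whichever comparison the merge uses has to match the order in which the files were actually sorted.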


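Sort both files, then join them. The -a 1 option keeps the lines of file 1 that have no match, and -e NA only takes effect together with -o:

sort file.1 > file.1.sorted
sort file.2 > file.2.sorted
join -a 1 -e NA -o 1.1,1.2,2.2 file.1.sorted file.2.sorted > file.joined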

In Python:

def merge_files(file1, file2, merge_file):
    """Merge two key-sorted files; keys missing from file2 get 'NA'."""
    with open(file1) as f1, open(file2) as f2, open(merge_file, 'w') as merge:
        for line2 in f2:
            index2, value2 = line2.rstrip('\n').split(' ', 1)
            # Advance through file1 until its key matches the key from file2.
            for line1 in f1:
                index1, value1 = line1.rstrip('\n').split(' ', 1)
                if index1 != index2:
                    merge.write("%s %s NA\n" % (index1, value1))
                    continue
                merge.write("%s %s %s\n" % (index1, value1, value2))
                break
        # Any lines left in file1 have no partner in file2.
        for line1 in f1:
            merge.write(line1.rstrip('\n') + " NA\n")

if __name__ == '__main__':
    merge_files('test1.txt','test2.txt','test3.txt')