Basic regular expressions and string manipulation for DNA analysis using Perl

I am new to Perl and would like to do what I think is a basic string manipulation on DNA sequences stored in an rtf file.

Essentially, this is the file I am reading (it is in FASTA format):

>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT

What I would like to do is read in my file, print the header (the header is >LM1), then match the DNA sequence GTGCCAGCAGCCGC and print the DNA sequence that precedes it.
So my output should look like this:

>LM1 
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC

I wrote the following program:

#!/usr/bin/perl

use strict; use warnings;

open(FASTA, "<seq_V3_V6_130227.rtf") or die "The file could not be found.\n";

while(<FASTA>) {
    chomp($_);
    if ($_ =~  m/^>/ ) {
        my $header = $_;
        print "$header\n";
    }

    my $dna = <FASTA>;
    if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
        print "$dna";
    }

}
close(FASTA);

The problem is that my program reads the file line by line, and the output I get is the following:

>LM1
GACGGTATCTAACCAGAAAGCCACGGCTAACTAC

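I think all of the sequence lines need to end up in $dna before matching, but I am not sure how to do that. I also get this warning: Use of uninitialized value $dna in pattern match (m//) at stacked.pl line 14, <FASTA> line 1113.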


You can use the pos function, which returns the offset at which the last /g match ended:

use strict;
use warnings;

my $dna = "";
my $seq = "GTGCCAGCAGCCGC";
while (<DATA>) {
  if (/^>/) {
    print;
  } else {
    if (/^[AGCT]/) {
      $dna .= $_;
    }
  }

}

if ($dna =~ /$seq/g) {
  print substr($dna, 0, pos($dna) - length($seq)), "\n";
}

__DATA__
>LM1

AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT

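If the file contains more than one sequence, check the accumulated sequence each time a new header is found: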

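while (<DATA>) {
  if (/^>/) {
    # A new header: report on the record collected so far
    if ($dna =~ /$seq/g) {
      print substr($dna, 0, pos($dna) - length($seq)), "\n";
    }
    $dna = "";  # reset even when the previous record did not match
    print;
  } elsif (/^[AGCT]/) {
    $dna .= $_;
  }
}

# Check the last record
if ($dna && $dna =~ /$seq/g) {
  print substr($dna, 0, pos($dna) - length($seq)), "\n";
}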
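
One caveat with the loops above: each line is appended without chomp, so $dna still contains the newlines. That preserves the line layout in the output, but a motif that happens to be split across two lines of the file will not match. A minimal sketch of the difference, using a made-up two-line record and a hypothetical helper prefix_before (neither comes from the question):

```perl
use strict;
use warnings;

# Hypothetical helper: return everything before the first occurrence
# of $motif in $dna, or nothing if the motif is absent.
sub prefix_before {
    my ($dna, $motif) = @_;
    if ($dna =~ /\Q$motif\E/g) {
        # pos() is the offset just past the end of the match
        return substr($dna, 0, pos($dna) - length($motif));
    }
    return;
}

# Made-up record in which the motif GTGCC is split across a line break.
my @lines = ("AACTACGTG\n", "CCAGCAGG\n");

my $raw = join "", @lines;    # newlines kept
my $chomped = join "", map { my $l = $_; chomp $l; $l } @lines;    # newlines removed

my $raw_hit     = prefix_before($raw, "GTGCC");
my $chomped_hit = prefix_before($chomped, "GTGCC");

print "raw: ",     (defined $raw_hit     ? $raw_hit     : "not found"), "\n";
print "chomped: ", (defined $chomped_hit ? $chomped_hit : "not found"), "\n";
```

With this input the raw string reports "not found" and the chomped string yields AACTAC. Whether to chomp depends on whether the output should keep the original line breaks.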

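The problem is in your while loop. Each iteration already reads one line into $_ from <FASTA>. The statement $dna = <FASTA> inside the loop then reads a second line, so the regex is only ever tested against every other line, one line at a time. When the file runs out, that second read returns undef, which is where the warning comes from.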

while(<FASTA>) { #Reads a line here
  chomp($_);
  if ($_ =~  m/^>/ ) {
    my $header = $_;
    print "$header\n";
  }
  my $dna = <FASTA>; # reads another line here, so every other line is skipped
}

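Instead, build up $dna inside the while loop using an else branch. If the line is a header, print it; otherwise, append it to $dna.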

while(<FASTA>) {
  chomp($_);
  if ($_ =~  m/^>/ ) {
    # It is a header line, so print it
    my $header = $_;
    print "$header\n";
  } else {
    # if it is not a header line, add to your dna sequence.
    $dna .= $_;
  }
}

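After the loop, $dna holds the whole sequence and can be tested against the regex.

That works for a single record. However, a FASTA file can contain more than 1 sequence, so $dna has to be tested and reset each time a new header is reached.

Putting it together: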

my $dna = "";
while(<FASTA>) {
  chomp($_);
  if ($_ =~  m/^>/ ) {

    # Does $dna match the regex?
    if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
      print "$1\n";
    }

    # Reset the sequence
    $dna = "";

    # It is a header line, so print it
    my $header = $_;
    print "$header\n";

  } else {
    # if it is not a header line, add to your dna sequence.
    $dna .= $_;
  }
}

# Check the last sequence
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
  print "$1\n";
}
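
The regex check in the code above appears twice, once at each header and once after the loop. One way to avoid the duplication is to put the check and reset in a small closure. The following is only a sketch under assumptions: the helper name fasta_prefixes, the short made-up records, and the motif CCGT are all illustrative, and an in-memory filehandle stands in for the real file so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from $fh and return the output
# lines (each header, followed by the text before $motif when it matches).
sub fasta_prefixes {
    my ($fh, $motif) = @_;
    my @out;
    my $dna = "";

    # Shared check, used at every header and once more at end of input.
    my $flush = sub {
        push @out, $1 if $dna =~ /(.*?)\Q$motif\E/;
        $dna = "";
    };

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) {
            $flush->();          # finish the previous record
            push @out, $line;    # then emit the new header
        } else {
            $dna .= $line;       # sequence line: accumulate
        }
    }
    $flush->();                  # finish the last record
    return @out;
}

# Made-up two-record input, read through an in-memory filehandle.
my $text = ">rec1\nAAAC\nCGTG\nTT\n>rec2\nGGGG\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print "$_\n" for fasta_prefixes($fh, 'CCGT');
close $fh;
```

Here the motif spans the first two lines of rec1, which is fine because the lines are chomped before being joined. The sketch prints >rec1, AAA, and >rec2.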

I came up with a solution using Bio::SeqIO (and the trunc method from Bio::Seq) from BioPerl. I also used index to find the subsequence, rather than using a regex.

This solution does not print the id (the line that begins with >) if the subsequence was not found, or if the subsequence begins at the first position (and therefore has no preceding characters).

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "fasta_junk.fasta" ,
                           -format => 'fasta');

my $out = Bio::SeqIO->new( -file   => '>test.dat',
                           -format => 'fasta');

my $lookup = 'GTGCCAGCAGCCGC';

while ( my $seq = $in->next_seq() ) {
    my $pos = index $seq->seq, $lookup;

    # $pos == -1 means $lookup was not found;
    # $pos == 0 means $lookup starts at the first position, so
    #   there are no preceding characters. Skip both cases.
    if ($pos > 0) {
        my $trunc = $seq->trunc(1,$pos);
        $out->write_seq($trunc);
    }
}

__END__
*** fasta_junk.fasta
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAA
AGTACTGTCCGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTT
GACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGG
TAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGC
GCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAG
GGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCC
ACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAG
GCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCA
AACAGGATTAGATACCCTGGTAGTCCACGCCGT

*** contents of test.dat
>LM1
AAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAAAGTACTGTCC
GTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTTGACGGTATCTAACCAGAAAG
CCACGGCTAACTAC
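
The `$pos > 0` test above relies on how index reports its result: a 0-based offset, or -1 when the substring is absent. Because the offset is 0-based, it also equals the number of characters before the hit, which is why trunc(1, $pos) returns exactly the preceding part. A minimal core-Perl sketch of the three cases, with a hypothetical helper describe_hit and made-up sequences:

```perl
use strict;
use warnings;

# Hypothetical helper: describe what index() finds for one sequence.
sub describe_hit {
    my ($seq, $lookup) = @_;
    my $pos = index $seq, $lookup;    # 0-based offset, or -1 if absent

    return "not found"               if $pos == -1;
    return "no preceding characters" if $pos == 0;
    return substr($seq, 0, $pos);     # the $pos characters before the hit
}

my $lookup = 'GTGCCAGCAGCCGC';

# Made-up sequences, one per case.
for my $seq ("AACTAC${lookup}GG", "${lookup}GGTAAT", "AACTACGG") {
    print describe_hit($seq, $lookup), "\n";
}
```

The loop prints AACTAC, then "no preceding characters", then "not found".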

Read the entire file into memory, then apply the regex:

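my $dna = "";
while (<FASTA>) {
    chomp($_);
    if ($_ =~ m/^>/) {
        my $header = $_;
        print "$header\n";
    } else {
        # Not a header: append the line to the sequence
        $dna .= $_;
    }
}
# Non-greedy capture of everything before the first occurrence of the motif
if ($dna =~ /(.*?)GTGCCAGCAGCCGC/) {
    print "$1\n";
}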