Perl 6 - Regular Expressions

Reading: Deitel 8

advanced: "man perlre"


The match operator m//

One of the main reasons for using perl is for it's powerful text manipulation capabilities. A lot of bioinformatics is about data munging - essentially taking data input often as text, doing transformations on the data and outputting it.

One of the most powerful ways of searching, manipulating and transforming text data is through regular expressions. Perl has regexps (as they are often called) built into the language.

There is a lot to regular expressions. The perl regexp engine is a powerful toolkit of which you may need to utilise a few tools.

We saw our first regular expression in the previous lecture, when we were trying to check our sequence IDs for the pattern

  Sequence ID.3prime

using the following code
program 1
 if ($seq_id =~ m/3prime$/) {...}
download

Another common way to use regular expressions is for finding patterns in the sequences themselves:


program 2
  while ( <> ) {
     chomp;
     if ($_ =~ m/GAATTC/){
        print "Found an EcoRI site!\n";
        $sites++;
     }
   }
   print "I counted $sites EcoRI sites total.\n"
download

This above bit of code loops through user input. Finds all lines containing an EcoRI site, and bumps up a counter.

The m at the beginning of the match operator is optional; perl treats anything between two slashes as a match operator. Perl will also operate on the special $_ variable by default, so the above code could be rewritten as:


program 3
   while ( <> ) {
     chomp;
     if (/GAATTC/){
        print "Found an EcoRI site!\n";
        $sites++;
     }
   }
   print "I counted $sites EcoRI sites total.\n"
download

In fact, any nonalphanumeric, nonwhitespace character can serve as a delimiter; we could have written any of these:

m/GAATTC/
m@GAATTC@
m[GAATTC]
m#GAATTC#

Here is a more advanced regexp. This counts instances of the methylation site (Pu-C-X-G):


program 4
   while (<>) {
     chomp;
     if (/[GA]C.?G/) {  
        print "Found a methylation site!\n";
        $sites++;
     }
   }
   print "$sites methylation sites total.\n"
download

The . matches any character

The ? says "match zero or one occurences"


Subpatterns


program 5
# primer.pl

$primer1 = "AATTGG"; $primer2 = "CCCTTT";

$seq = "ATATATATAAATTGGAAAAAAAAAAAAAAAAAAAACCCTTTATATATATA"; if ($seq =~ /$primer1(.*)$primer2/) { print "Sequence between primers: $1\n"; } else { print "Couldn't find primer sequences in dna sequence"; }

download

If we run this:

Sequence between primers: AAAAAAAAAAAAAAAAAAAA

The . matches any character

The * says "match any number of occurences"

Patterns between parentheses go into special variables $1, $2, $3, ...

If we want to extract information from fasta headers
>gi|159608|gb|M59776.1|MSQPUPT Anopheles farauti DNA, clone 1/1


program 6
# get_ids_from_fasta.pl

while ( <> ) { if ( /^\>gi\|(\d+)\|gb\|(\S+)\|(\S+)/ ) { $gi = $1; $accession = $2; $name = $3; print "GI=$gi ACCESSION=$accession NAME=$name"; } }

download


Extracting upstream/downstream sequence

Extract 50 base pairs upstream and 25 base pairs downstream of the TATTAT consensus transcription start site:


program 7
# extract.pl
while ( <> ) {
    chomp;
    next unless /(.{50})TATTAT(.{25})/;
    my $upstream = $1;
    my $downstream = $2;
    print "Upstream: $upstream\n";
    print "Downstream: $downstream\n";
}
download

% cat > test.seq
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTATTATGGG

% extract.pl test.seq Upstream: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC Downstream: GGGGGGGGGGGGGGGGGGGGGGGGG

The . matches any character

The {50} says "match exactly 50 occurences"

So .{50} matches 50 occurences of any character

The characters between the first set of parentheses goes into the special variable $1; The characters inside the second goes into the special variable $2


Extracting Subpatterns Using Arrays

If you assign a regular expression match to an array, it will return a list of all the subpatterns that matched. Alternative implementation of previous example:


program 8
while ( <> ) {
    chomp;
    my ($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
}
download

If the regular expression doesn't match at all, then it returns an empty list. Since an empty list is FALSE, you can use it in a logical test:


program 9
while ( <> ) {
    chomp;
    next unless my($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
    print "upstream = $upstream\n";
    print "downstream = $downstream\n";  
}
download


Substititution operator s///

As well as searching text, regexps can be used to make substitutions in text.


program 10
# translate_codon1.pl
$seq = "ATG TCA TTT TCA";

$seq =~ s/ATG/M/; # substitutes occurences of ATG with M $seq =~ s/TCA/S/; # substitutes occurences of TCA with S $seq =~ s/TTT/F/; # substitutes occurences of TTT with F

print $seq; # prints M S F TCA

download

If you run this, this is the output you will get:
M S F TCA

Notice the last TCA codon is not subsitituted to the corresponding S amino acid. To do this we need to specify the global modifier


program 11
# translate_codon2.pl
$seq = "ATG TCA TTT TCA";

$seq =~ s/ATG/M/g; # substitutes ALL occurences of ATG with M $seq =~ s/TCA/S/g; # substitutes ALL occurences of TCA with S $seq =~ s/TTT/F/g; # substitutes ALL occurences of TTT with F

print $seq; # prints M S F S

download

To extend this to do full translation, we would have to specify 64 substitution operators. Alternately we could do this by pattern matching:


program 12
# translate_codon3.pl
$seq = "ATG TCA TTT TCA";

$seq =~ s/ATG/M/g; $seq =~ s/TC./S/g; $seq =~ s/TT[TC]/F/g; $seq =~ s/TT[AG]/L/g;

print $seq; # prints M S F S

download


Example - extracting sequence from genbank records

Just say we wanted to transform genbank-style sequence output

ORIGIN
        1 gtgacagttg gctgtcagac atacaatgat tgtttagaag aggagaagat tgatccggag
       61 taccgtgata gtattttaaa aactatgaaa gcgggaatac ttaatggtaa actagttaga
      121 ttatgtgacg tgccaagggg tgtagatgta gaaattgaaa caactggtct aaccgattca
      181 gaaggagaaa gtgaatcaaa agaagaagag tgatgatgaa tagccaccat tactgcatac
      241 tgtagccctt acccttgtcg caccattagc cattaataaa aataaaaaat tatataaaaa
      301 ttacacccat

into a string containing only a, t, g, c

One way to do it would be first of all get rid of all the numbers, spaces and newlines:


program 13
# gb1.pl
while (<>) {
    s/ //g;
    s/\d//g;
    s/\n//g;
    s/ORIGIN//g;
    print
}
download

Let's go through this line by line:

    s/ //g;
the first operator substitutes spaces with an empty string; the g at the end means substitute ALL instances in this line.

the value of $_ after this is:

ORIGIN
1gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag
61taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga
121ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca
181gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac
241tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa
301ttacacccat

    s/\d//g;
the next operator introduces \d which matches any number.

the value of $_ after this is:

ORIGIN
gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag
taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga
ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca
gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac
tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa
ttacacccat

    s/\n//g;
the third operator gets rid of all newlines

the value of $_ after this is:

ORIGINgtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgata

    s/ORIGIN//g;
the final operator gets rid of the ORIGIN string

the value of $_ after this is:

gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgatagtatttta

this won't work on an entire genbank record, but if we put the above piece of text through, we end up only tcags

the above code could also be written as

program 14
# gb2.pl
while (<>) {
    s/ORIGIN| |\d|\n//g;
    print;
}
download

the pipe symbol | means OR - as in match this OR match this OR match this


Substitution Operator tr///

we have already encountered this operator briefly, for reverse complementing sequences (doing the complementation part):


program 15
        $rev = reverse $seq;
        $rev =~ tr/GATC/CTAG/;
download

Translates all occurrences of the characters found in the search list with the corresponding character in the replacement list; for instance, G with C; A with T; T with A; C with G

we could have written our genbank to sequence script above as


program 16
while ( <> ) {
    tr/atcg//cd;
}
download

the c means complement the searchlist

the d means delete found but unreplaced characters


Advanced - Genbank records to fasta


program 17
#!/usr/local/bin/perl
# genbank2fasta.pl

$/ = "//"; # break up records on genbank // delimiter while (<>) { next unless /LOCUS[ ]*(\S+)[ ]*(\d+) bp[ ]*(\w+)/; $locus = $1; $bp = $2; $type = $3; /ACCESSION[ ]*(\S+)/; $accession = $1; /ORIGIN(.*)$/s; $seq = $1; $seq =~ tr/tcag//cd; $seq =~ s/(.{60}.)/$1\n/g; print ">gb|$accession $locus $type\n$seq\n"; }

download

A full discussion of this program is outside the scope of this lecture; here is a brief outline

    /LOCUS[ ]*(\S+)[ ]*(\d+) bp[ ]*(\w+)/;

matches genbank LOCUS line like this:
LOCUS       DDU63596      310 bp    DNA             INV       14-MAY-1999

The locus name matches the first subpattern $1, the bp count matches the second subpattern $2, and the type matches the third $3

    /ORIGIN(.*)$/s;

the subpattern here matches everything from the end of the genbank ORIGIN keyword right to the end of the record. normally the regular expression would only match as far as the first newline \n; the s qualifier forces the regexp engine to treat this as a single line.

Thankfully you rarely need to write programs such as the above. If you ever need to parse genbank (or other common bioinformatic file format) records, the BioPerl project has modules for this very purpose.


Exercises

Extend the previous restriction enzyme program to find enzymes like

'EcoNI' => 'CCTNNNNNAGG', 'EcoRII' => 'CCWGG',

Using the re_hash.pl program from lecture 3, write a program that asks a user for a dna sequence, and searches that sequence for each of the restriction enzymes. see here for answer.

From the sequence file ~cmungall/data/... search for sequences of (TA)n repeats, where n >= 5 and "mask" them by replacing them with N's. From the same sequence file, split each sequence into three-letter "codons" separated by spaces:

   % unwrap example1.fasta | codons
   M43911 GAT TCC GAT CCC CCC CCC AGT TTG ACC AAA GTT CAG AGG AAA...
   L54931 GGG TGG TGG TGA GAG AGA GCG ATT GAA AGC TAT ATA TAT GAC...
   L54932 TAG TTG ATT CAG TCC GAT TTC AAT TGA TTT CCC GTA TAT CCT...

Pipe the output of this program to a codon translation program named "ribosome":

% unwrap example1.fasta | codons | ribosome
M43911 PGGULLMLGGGYL....

Use the following code to help you


Appendix


program 18
# codontable.pl
%CODON_TABLE = (
   TCA => 'S',TCG => 'S',TCC => 'S',TCT => 'S',
   TTT => 'F',TTC => 'F',TTA => 'L',TTG => 'L',
   TAT => 'Y',TAC => 'Y',TAA => '*',TAG => '*',
   TGT => 'C',TGC => 'C',TGA => '*',TGG => 'W',
   CTA => 'L',CTG => 'L',CTC => 'L',CTT => 'L',
   CCA => 'P',CCG => 'P',CCC => 'P',CCT => 'P',
   CAT => 'H',CAC => 'H',CAA => 'Q',CAG => 'Q',
   CGA => 'R',CGG => 'R',CGC => 'R',CGT => 'R',
   ATT => 'I',ATC => 'I',ATA => 'I',ATG => 'M',
   ACA => 'T',ACG => 'T',ACC => 'T',ACT => 'T',
   AAT => 'N',AAC => 'N',AAA => 'K',AAG => 'K',
   AGT => 'S',AGC => 'S',AGA => 'R',AGG => 'R',
   GTA => 'V',GTG => 'V',GTC => 'V',GTT => 'V',
   GCA => 'A',GCG => 'A',GCC => 'A',GCT => 'A',
   GAT => 'D',GAC => 'D',GAA => 'E',GAG => 'E',
   GGA => 'G',GGG => 'G',GGC => 'G',GGT => 'G');
download


Chris Mungall cjm@fruitfly.org
Berkeley Drosophila Genome Project