Bioinfo Helpdesk & On-line Training

Perl 6 - Regular Expressions

Reading: Deitel 8

advanced: "man perlre"

The match operator m//

One of the main reasons for using perl is for it's powerful text manipulation capabilities. A lot of bioinformatics is about data munging - essentially taking data input often as text, doing transformations on the data and outputting it.

One of the most powerful ways of searching, manipulating and transforming text data is through regular expressions. Perl has regexps (as they are often called) built into the language.

There is a lot to regular expressions. The perl regexp engine is a powerful toolkit of which you may need to utilise a few tools.

We saw our first regular expression in the previous lecture, when we were trying to check our sequence IDs for the pattern

Sequence ID.3prime

using the following code
program 1

if ($seq_id =~ m/3prime$/) {...}

download

Another common way to use regular expressions is for finding patterns in the sequences themselves:

program 2

while ( <> ) { chomp; if ($_ =~ m/GAATTC/){ print "Found an EcoRI site!\n"; $sites++; } } print "I counted $sites EcoRI sites total.\n"

download

This above bit of code loops through user input. Finds all lines containing an EcoRI site, and bumps up a counter.

The m at the beginning of the match operator is optional; perl treats anything between two slashes as a match operator. Perl will also operate on the special $_ variable by default, so the above code could be rewritten as:

program 3

while ( <> ) { chomp; if (/GAATTC/){ print "Found an EcoRI site!\n"; $sites++; } } print "I counted $sites EcoRI sites total.\n"

download

In fact, any nonalphanumeric, nonwhitespace character can serve as a delimiter; we could have written any of these:

m/GAATTC/ m@GAATTC@ m[GAATTC] m#GAATTC#

Here is a more advanced regexp. This counts instances of the methylation site (Pu-C-X-G):

program 4

while (<>) { chomp; if (/[GA]C.?G/) { print "Found a methylation site!\n"; $sites++; } } print "$sites methylation sites total.\n"

download

The . matches any character

The ? says "match zero or one occurences"

Subpatterns

program 5

# primer.pl
$primer1 = "AATTGG"; $primer2 = "CCCTTT";
$seq = "ATATATATAAATTGGAAAAAAAAAAAAAAAAAAAACCCTTTATATATATA"; if ($seq =~ /$primer1(.*)$primer2/) { print "Sequence between primers: $1\n"; } else { print "Couldn't find primer sequences in dna sequence"; }

download

If we run this:

Sequence between primers: AAAAAAAAAAAAAAAAAAAA

The . matches any character

The * says "match any number of occurences"

Patterns between parentheses go into special variables $1, $2, $3, ...

If we want to extract information from fasta headers

>gi|159608|gb|M59776.1|MSQPUPT Anopheles farauti DNA, clone 1/1

program 6

# get_ids_from_fasta.pl
while ( <> ) { if ( /^\>gi\|(\d+)\|gb\|(\S+)\|(\S+)/ ) { $gi = $1; $accession = $2; $name = $3; print "GI=$gi ACCESSION=$accession NAME=$name"; } }

download

Extracting upstream/downstream sequence

Extract 50 base pairs upstream and 25 base pairs downstream of the TATTAT consensus transcription start site:

program 7

# extract.pl while ( <> ) { chomp; next unless /(.{50})TATTAT(.{25})/; my $upstream = $1; my $downstream = $2; print "Upstream: $upstream\n"; print "Downstream: $downstream\n"; }

download

% cat > test.seq CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTATTATGGG
% extract.pl test.seq Upstream: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC Downstream: GGGGGGGGGGGGGGGGGGGGGGGGG

The . matches any character

The {50} says "match exactly 50 occurences"

So .{50} matches 50 occurences of any character

The characters between the first set of parentheses goes into the special variable $1; The characters inside the second goes into the special variable $2

Extracting Subpatterns Using Arrays

If you assign a regular expression match to an array, it will return a list of all the subpatterns that matched. Alternative implementation of previous example:

program 8

while ( <> ) { chomp; my ($upstream,$downstream) = /(.{50})TATTAT(.{25})/; }

download

If the regular expression doesn't match at all, then it returns an empty list. Since an empty list is FALSE, you can use it in a logical test:

program 9

while ( <> ) { chomp; next unless my($upstream,$downstream) = /(.{50})TATTAT(.{25})/; print "upstream = $upstream\n"; print "downstream = $downstream\n"; }

download

Substititution operator s///

As well as searching text, regexps can be used to make substitutions in text.

program 10

# translate_codon1.pl $seq = "ATG TCA TTT TCA";
$seq =~ s/ATG/M/; # substitutes occurences of ATG with M $seq =~ s/TCA/S/; # substitutes occurences of TCA with S $seq =~ s/TTT/F/; # substitutes occurences of TTT with F
print $seq; # prints M S F TCA

download

If you run this, this is the output you will get:

M S F TCA

Notice the last TCA codon is not subsitituted to the corresponding S amino acid. To do this we need to specify the global modifier

program 11

# translate_codon2.pl $seq = "ATG TCA TTT TCA";
$seq =~ s/ATG/M/g; # substitutes ALL occurences of ATG with M $seq =~ s/TCA/S/g; # substitutes ALL occurences of TCA with S $seq =~ s/TTT/F/g; # substitutes ALL occurences of TTT with F
print $seq; # prints M S F S

download

To extend this to do full translation, we would have to specify 64 substitution operators. Alternately we could do this by pattern matching:

program 12

# translate_codon3.pl $seq = "ATG TCA TTT TCA";
$seq =~ s/ATG/M/g; $seq =~ s/TC./S/g; $seq =~ s/TT[TC]/F/g; $seq =~ s/TT[AG]/L/g;
print $seq; # prints M S F S

download

Example - extracting sequence from genbank records

Just say we wanted to transform genbank-style sequence output

ORIGIN 1 gtgacagttg gctgtcagac atacaatgat tgtttagaag aggagaagat tgatccggag 61 taccgtgata gtattttaaa aactatgaaa gcgggaatac ttaatggtaa actagttaga 121 ttatgtgacg tgccaagggg tgtagatgta gaaattgaaa caactggtct aaccgattca 181 gaaggagaaa gtgaatcaaa agaagaagag tgatgatgaa tagccaccat tactgcatac 241 tgtagccctt acccttgtcg caccattagc cattaataaa aataaaaaat tatataaaaa 301 ttacacccat

into a string containing only a, t, g, c

One way to do it would be first of all get rid of all the numbers, spaces and newlines:

program 13

# gb1.pl while (<>) { s/ //g; s/\d//g; s/\n//g; s/ORIGIN//g; print }

download

Let's go through this line by line:

s/ //g;

the first operator substitutes spaces with an empty string; the g at the end means substitute ALL instances in this line.

the value of $_ after this is:

ORIGIN 1gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag 61taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga 121ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca 181gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac 241tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa 301ttacacccat

s/\d//g;

the next operator introduces \d which matches any number.

the value of $_ after this is:

ORIGIN gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa ttacacccat

s/\n//g;

the third operator gets rid of all newlines

the value of $_ after this is:

ORIGINgtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgata

s/ORIGIN//g;

the final operator gets rid of the ORIGIN string

the value of $_ after this is:

gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgatagtatttta

this won't work on an entire genbank record, but if we put the above piece of text through, we end up only tcags

the above code could also be written as

program 14

# gb2.pl while (<>) { s/ORIGIN| |\d|\n//g; print; }

download

the pipe symbol | means OR - as in match this OR match this OR match this

Substitution Operator tr///

we have already encountered this operator briefly, for reverse complementing sequences (doing the complementation part):

program 15

$rev = reverse $seq; $rev =~ tr/GATC/CTAG/;

download

Translates all occurrences of the characters found in the search list with the corresponding character in the replacement list; for instance, G with C; A with T; T with A; C with G

we could have written our genbank to sequence script above as

program 16

while ( <> ) { tr/atcg//cd; }

download

the c means complement the searchlist

the d means delete found but unreplaced characters

Advanced - Genbank records to fasta

program 17

#!/usr/local/bin/perl # genbank2fasta.pl
$/ = "//"; # break up records on genbank // delimiter while (<>) { next unless /LOCUS[ ]*(\S+)[ ]*(\d+) bp[ ]*(\w+)/; $locus = $1; $bp = $2; $type = $3; /ACCESSION[ ]*(\S+)/; $accession = $1; /ORIGIN(.*)$/s; $seq = $1; $seq =~ tr/tcag//cd; $seq =~ s/(.{60}.)/$1\n/g; print ">gb|$accession $locus $type\n$seq\n"; }

download

A full discussion of this program is outside the scope of this lecture; here is a brief outline

/LOCUS[ ]*(\S+)[ ]*(\d+) bp[ ]*(\w+)/;

matches genbank LOCUS line like this:

LOCUS DDU63596 310 bp DNA INV 14-MAY-1999

The locus name matches the first subpattern $1, the bp count matches the second subpattern $2, and the type matches the third $3

/ORIGIN(.*)$/s;

the subpattern here matches everything from the end of the genbank ORIGIN keyword right to the end of the record. normally the regular expression would only match as far as the first newline \n; the s qualifier forces the regexp engine to treat this as a single line.

Thankfully you rarely need to write programs such as the above. If you ever need to parse genbank (or other common bioinformatic file format) records, the BioPerl project has modules for this very purpose.

Exercises

Extend the previous restriction enzyme program to find enzymes like

'EcoNI' => 'CCTNNNNNAGG', 'EcoRII' => 'CCWGG',

Using the re_hash.pl program from lecture 3, write a program that asks a user for a dna sequence, and searches that sequence for each of the restriction enzymes. see here for answer.

From the sequence file ~cmungall/data/... search for sequences of (TA)n repeats, where n >= 5 and "mask" them by replacing them with N's. From the same sequence file, split each sequence into three-letter "codons" separated by spaces:

% unwrap example1.fasta | codons M43911 GAT TCC GAT CCC CCC CCC AGT TTG ACC AAA GTT CAG AGG AAA... L54931 GGG TGG TGG TGA GAG AGA GCG ATT GAA AGC TAT ATA TAT GAC... L54932 TAG TTG ATT CAG TCC GAT TTC AAT TGA TTT CCC GTA TAT CCT...

Pipe the output of this program to a codon translation program named "ribosome":

% unwrap example1.fasta | codons | ribosome M43911 PGGULLMLGGGYL....

Use the following code to help you

Appendix

program 18

# codontable.pl %CODON_TABLE = ( TCA => 'S',TCG => 'S',TCC => 'S',TCT => 'S', TTT => 'F',TTC => 'F',TTA => 'L',TTG => 'L', TAT => 'Y',TAC => 'Y',TAA => '*',TAG => '*', TGT => 'C',TGC => 'C',TGA => '*',TGG => 'W', CTA => 'L',CTG => 'L',CTC => 'L',CTT => 'L', CCA => 'P',CCG => 'P',CCC => 'P',CCT => 'P', CAT => 'H',CAC => 'H',CAA => 'Q',CAG => 'Q', CGA => 'R',CGG => 'R',CGC => 'R',CGT => 'R', ATT => 'I',ATC => 'I',ATA => 'I',ATG => 'M', ACA => 'T',ACG => 'T',ACC => 'T',ACT => 'T', AAT => 'N',AAC => 'N',AAA => 'K',AAG => 'K', AGT => 'S',AGC => 'S',AGA => 'R',AGG => 'R', GTA => 'V',GTG => 'V',GTC => 'V',GTT => 'V', GCA => 'A',GCG => 'A',GCC => 'A',GCT => 'A', GAT => 'D',GAC => 'D',GAA => 'E',GAG => 'E', GGA => 'G',GGG => 'G',GGC => 'G',GGT => 'G');

download

Chris Mungall cjm@fruitfly.org
Berkeley Drosophila Genome Project