Reading:
Deitel 8
advanced: "man
perlre"
The
match operator m//
One of the main
reasons for using perl is for it's powerful text manipulation capabilities.
A lot of bioinformatics is about data munging - essentially
taking data input often as text, doing transformations on the data
and outputting it.
One of the most
powerful ways of searching, manipulating and transforming text data
is through regular expressions. Perl has regexps (as they are
often called) built into the language.
There is a lot
to regular expressions. The perl regexp engine is a powerful toolkit
of which you may need to utilise a few tools.
We saw our first
regular expression in the previous lecture, when we were trying to
check our sequence IDs for the pattern
using the following
code
program 1
| if ($seq_id =~ m/3prime$/) {...}
|
download
Another common
way to use regular expressions is for finding patterns in the sequences
themselves:
program 2
| while ( <> ) {
chomp;
if ($_ =~ m/GAATTC/){
print "Found an EcoRI site!\n";
$sites++;
}
}
print "I counted $sites EcoRI sites total.\n"
|
download
This above bit
of code loops through user input. Finds all lines containing an EcoRI
site, and bumps up a counter.
The m at
the beginning of the match operator is optional; perl treats anything
between two slashes as a match operator. Perl will also operate on
the special $_ variable by default, so the above code could
be rewritten as:
program 3
| while ( <> ) {
chomp;
if (/GAATTC/){
print "Found an EcoRI site!\n";
$sites++;
}
}
print "I counted $sites EcoRI sites total.\n"
|
download
In fact, any nonalphanumeric,
nonwhitespace character can serve as a delimiter; we could have written
any of these:
| m/GAATTC/
m@GAATTC@
m[GAATTC]
m#GAATTC#
|
Here is a more
advanced regexp. This counts instances of the methylation site (Pu-C-X-G):
program 4
| while (<>) {
chomp;
if (/[GA]C.?G/) {
print "Found a methylation site!\n";
$sites++;
}
}
print "$sites methylation sites total.\n"
|
download
The . matches
any character
The ? says "match
zero or one occurences"
Subpatterns
program 5
| # primer.pl
$primer1 = "AATTGG";
$primer2 = "CCCTTT";
$seq = "ATATATATAAATTGGAAAAAAAAAAAAAAAAAAAACCCTTTATATATATA";
if ($seq =~ /$primer1(.*)$primer2/) {
print "Sequence between primers: $1\n";
}
else {
print "Couldn't find primer sequences in dna sequence";
}
|
download
If we run this:
| Sequence between primers: AAAAAAAAAAAAAAAAAAAA
|
The . matches
any character
The * says "match
any number of occurences"
Patterns between
parentheses go into special variables $1, $2, $3, ...
If we want to
extract information from fasta headers
| >gi|159608|gb|M59776.1|MSQPUPT Anopheles farauti DNA, clone 1/1
|
program 6
| # get_ids_from_fasta.pl
while ( <> ) {
if ( /^\>gi\|(\d+)\|gb\|(\S+)\|(\S+)/ ) {
$gi = $1;
$accession = $2;
$name = $3;
print "GI=$gi ACCESSION=$accession NAME=$name";
}
}
|
download
Extracting upstream/downstream
sequence
Extract 50 base
pairs upstream and 25 base pairs downstream of the TATTAT consensus
transcription start site:
program 7
| # extract.pl
while ( <> ) {
chomp;
next unless /(.{50})TATTAT(.{25})/;
my $upstream = $1;
my $downstream = $2;
print "Upstream: $upstream\n";
print "Downstream: $downstream\n";
}
|
download
| % cat > test.seq
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTATTATGGG
% extract.pl test.seq
Upstream: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
Downstream: GGGGGGGGGGGGGGGGGGGGGGGGG
|
The . matches
any character
The {50} says
"match exactly 50 occurences"
So .{50} matches
50 occurences of any character
The characters
between the first set of parentheses goes into the special variable
$1; The characters inside the second goes into the special variable
$2
Extracting
Subpatterns Using Arrays
If you assign
a regular expression match to an array, it will return a list
of all the subpatterns that matched. Alternative implementation of
previous example:
program 8
| while ( <> ) {
chomp;
my ($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
}
|
download
If the regular
expression doesn't match at all, then it returns an empty list. Since
an empty list is FALSE, you can use it in a logical test:
program 9
| while ( <> ) {
chomp;
next unless my($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
print "upstream = $upstream\n";
print "downstream = $downstream\n";
}
|
download
Substititution
operator s///
As well as searching
text, regexps can be used to make substitutions in text.
program 10
| # translate_codon1.pl
$seq = "ATG TCA TTT TCA";
$seq =~ s/ATG/M/; # substitutes occurences of ATG with M
$seq =~ s/TCA/S/; # substitutes occurences of TCA with S
$seq =~ s/TTT/F/; # substitutes occurences of TTT with F
print $seq; # prints M S F TCA
|
download
If you run this,
this is the output you will get:
Notice the last
TCA codon is not subsitituted to the corresponding S amino acid. To
do this we need to specify the global modifier
program 11
| # translate_codon2.pl
$seq = "ATG TCA TTT TCA";
$seq =~ s/ATG/M/g; # substitutes ALL occurences of ATG with M
$seq =~ s/TCA/S/g; # substitutes ALL occurences of TCA with S
$seq =~ s/TTT/F/g; # substitutes ALL occurences of TTT with F
print $seq; # prints M S F S
|
download
To extend this
to do full translation, we would have to specify 64 substitution operators.
Alternately we could do this by pattern matching:
program 12
| # translate_codon3.pl
$seq = "ATG TCA TTT TCA";
$seq =~ s/ATG/M/g;
$seq =~ s/TC./S/g;
$seq =~ s/TT[TC]/F/g;
$seq =~ s/TT[AG]/L/g;
print $seq; # prints M S F S
|
download
Example
- extracting sequence from genbank records
Just say we wanted
to transform genbank-style sequence output
| ORIGIN
1 gtgacagttg gctgtcagac atacaatgat tgtttagaag aggagaagat tgatccggag
61 taccgtgata gtattttaaa aactatgaaa gcgggaatac ttaatggtaa actagttaga
121 ttatgtgacg tgccaagggg tgtagatgta gaaattgaaa caactggtct aaccgattca
181 gaaggagaaa gtgaatcaaa agaagaagag tgatgatgaa tagccaccat tactgcatac
241 tgtagccctt acccttgtcg caccattagc cattaataaa aataaaaaat tatataaaaa
301 ttacacccat
|
into a string
containing only a, t, g, c
One way to do
it would be first of all get rid of all the numbers, spaces and newlines:
program 13
| # gb1.pl
while (<>) {
s/ //g;
s/\d//g;
s/\n//g;
s/ORIGIN//g;
print
}
|
download
Let's go through
this line by line:
the first operator
substitutes spaces with an empty string; the g at the end means
substitute ALL instances in this line.
the value of $_
after this is:
| ORIGIN
1gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag
61taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga
121ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca
181gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac
241tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa
301ttacacccat
|
the next operator
introduces \d which matches any number.
the value of $_
after this is:
| ORIGIN
gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag
taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga
ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca
gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac
tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa
ttacacccat
|
the third operator
gets rid of all newlines
the value of $_
after this is:
| ORIGINgtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgata
|
the final operator
gets rid of the ORIGIN string
the value of $_
after this is:
| gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgatagtatttta
|
this won't work
on an entire genbank record, but if we put the above piece of text
through, we end up only tcags
the above code
could also be written as
program 14
| # gb2.pl
while (<>) {
s/ORIGIN| |\d|\n//g;
print;
}
|
download
the pipe symbol
| means OR - as in match this OR match this OR match this
Substitution
Operator tr///
we have already
encountered this operator briefly, for reverse complementing sequences
(doing the complementation part):
program 15
| $rev = reverse $seq;
$rev =~ tr/GATC/CTAG/;
|
download
Translates all
occurrences of the characters found in the search list with the corresponding
character in the replacement list; for instance, G with C; A with
T; T with A; C with G
we could have
written our genbank to sequence script above as
program 16
| while ( <> ) {
tr/atcg//cd;
}
|
download
the c means
complement the searchlist
the d means
delete found but unreplaced characters
Advanced
- Genbank records to fasta
program 17
| #!/usr/local/bin/perl
# genbank2fasta.pl
$/ = "//"; # break up records on genbank // delimiter
while (<>) {
next unless /LOCUS[ ]*(\S+)[ ]*(\d+) bp[ ]*(\w+)/;
$locus = $1;
$bp = $2;
$type = $3;
/ACCESSION[ ]*(\S+)/;
$accession = $1;
/ORIGIN(.*)$/s;
$seq = $1;
$seq =~ tr/tcag//cd;
$seq =~ s/(.{60}.)/$1\n/g;
print ">gb|$accession $locus $type\n$seq\n";
}
|
download
A full discussion
of this program is outside the scope of this lecture; here is a brief
outline
| /LOCUS[ ]*(\S+)[ ]*(\d+) bp[ ]*(\w+)/;
|
matches genbank
LOCUS line like this:
| LOCUS DDU63596 310 bp DNA INV 14-MAY-1999
|
The locus name
matches the first subpattern $1, the bp count matches the second subpattern
$2, and the type matches the third $3
the subpattern
here matches everything from the end of the genbank ORIGIN keyword
right to the end of the record. normally the regular expression would
only match as far as the first newline \n; the s qualifier
forces the regexp engine to treat this as a single line.
Thankfully you
rarely need to write programs such as the above. If you ever need
to parse genbank (or other common bioinformatic file format) records,
the BioPerl project has modules for this very purpose.
Exercises
Extend the previous
restriction enzyme program to find enzymes like
'EcoNI' =>
'CCTNNNNNAGG', 'EcoRII' => 'CCWGG',
Using the re_hash.pl
program from lecture 3, write a program that asks a user for a dna
sequence, and searches that sequence for each of the restriction enzymes.
see here
for answer.
From the sequence
file ~cmungall/data/... search for sequences of (TA)n repeats, where
n >= 5 and "mask" them by replacing them with N's. From the same
sequence file, split each sequence into three-letter "codons" separated
by spaces:
| % unwrap example1.fasta | codons
M43911 GAT TCC GAT CCC CCC CCC AGT TTG ACC AAA GTT CAG AGG AAA...
L54931 GGG TGG TGG TGA GAG AGA GCG ATT GAA AGC TAT ATA TAT GAC...
L54932 TAG TTG ATT CAG TCC GAT TTC AAT TGA TTT CCC GTA TAT CCT...
|
Pipe the output
of this program to a codon translation program named "ribosome":
| % unwrap example1.fasta | codons | ribosome
M43911 PGGULLMLGGGYL....
|
Use the following
code to help you
Appendix
program 18
| # codontable.pl
%CODON_TABLE = (
TCA => 'S',TCG => 'S',TCC => 'S',TCT => 'S',
TTT => 'F',TTC => 'F',TTA => 'L',TTG => 'L',
TAT => 'Y',TAC => 'Y',TAA => '*',TAG => '*',
TGT => 'C',TGC => 'C',TGA => '*',TGG => 'W',
CTA => 'L',CTG => 'L',CTC => 'L',CTT => 'L',
CCA => 'P',CCG => 'P',CCC => 'P',CCT => 'P',
CAT => 'H',CAC => 'H',CAA => 'Q',CAG => 'Q',
CGA => 'R',CGG => 'R',CGC => 'R',CGT => 'R',
ATT => 'I',ATC => 'I',ATA => 'I',ATG => 'M',
ACA => 'T',ACG => 'T',ACC => 'T',ACT => 'T',
AAT => 'N',AAC => 'N',AAA => 'K',AAG => 'K',
AGT => 'S',AGC => 'S',AGA => 'R',AGG => 'R',
GTA => 'V',GTG => 'V',GTC => 'V',GTT => 'V',
GCA => 'A',GCG => 'A',GCC => 'A',GCT => 'A',
GAT => 'D',GAC => 'D',GAA => 'E',GAG => 'E',
GGA => 'G',GGG => 'G',GGC => 'G',GGT => 'G');
|
download
Chris Mungall cjm@fruitfly.org
Berkeley
Drosophila Genome Project