Reading: Deitel 8
advanced: "man perlre"
Sequence ID.3prime |
using the following code
program 1
if ($seq_id =~ m/3prime$/) {...}
|
Another common way to use regular expressions is for finding patterns in the sequences themselves:
program 2
while ( <> ) {
chomp;
if ($_ =~ m/GAATTC/){
print "Found an EcoRI site!\n";
$sites++;
}
}
print "I counted $sites EcoRI sites total.\n"
|
This above bit of code loops through user input. Finds all lines containing an EcoRI site, and bumps up a counter.
The m at the beginning of the match operator is optional; perl treats anything between two slashes as a match operator. Perl will also operate on the special $_ variable by default, so the above code could be rewritten as:
program 3
while ( <> ) {
chomp;
if (/GAATTC/){
print "Found an EcoRI site!\n";
$sites++;
}
}
print "I counted $sites EcoRI sites total.\n"
|
In fact, any nonalphanumeric, nonwhitespace character can serve as a delimiter; we could have written any of these:
m/GAATTC/ m@GAATTC@ m[GAATTC] m#GAATTC# |
Here is a more advanced regexp. This counts instances of the methylation site (Pu-C-X-G):
program 4
while (<>) {
chomp;
if (/[GA]C.?G/) {
print "Found a methylation site!\n";
$sites++;
}
}
print "$sites methylation sites total.\n"
|
The . matches any character
The ? says "match zero or one occurences"
# primer.pl |
If we run this:
Sequence between primers: AAAAAAAAAAAAAAAAAAAA |
The . matches any character
The * says "match any number of occurences"
Patterns between parentheses go into special variables $1, $2, $3, ...
If we want to extract information from fasta headers
>gi|159608|gb|M59776.1|MSQPUPT Anopheles farauti DNA, clone 1/1 |
program 6
# get_ids_from_fasta.pl |
# extract.pl
while ( <> ) {
chomp;
next unless /(.{50})TATTAT(.{25})/;
my $upstream = $1;
my $downstream = $2;
print "Upstream: $upstream\n";
print "Downstream: $downstream\n";
}
|
% cat > test.seq CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTATTATGGGGGGGGGGGGGGGGGGGGGGGGG |
The . matches any character
The {50} says "match exactly 50 occurences"
So .{50} matches 50 occurences of any character
The characters between the first set of parentheses goes into the special variable $1; The characters inside the second goes into the special variable $2
while ( <> ) {
chomp;
my ($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
}
|
If the regular expression doesn't match at all, then it returns an empty list. Since an empty list is FALSE, you can use it in a logical test:
program 9
while ( <> ) {
chomp;
next unless my($upstream,$downstream) = /(.{50})TATTAT(.{25})/;
print "upstream = $upstream\n";
print "downstream = $downstream\n";
}
|
# translate_codon1.pl $seq = "ATG TCA TTT TCA"; |
If you run this, this is the output you will get:
M S F TCA |
Notice the last TCA codon is not subsitituted to the corresponding S amino acid. To do this we need to specify the global modifier
program 11
# translate_codon2.pl $seq = "ATG TCA TTT TCA"; |
To extend this to do full translation, we would have to specify 64 substitution operators. Alternately we could do this by pattern matching:
program 12
# translate_codon3.pl $seq = "ATG TCA TTT TCA"; |
ORIGIN
1 gtgacagttg gctgtcagac atacaatgat tgtttagaag aggagaagat tgatccggag
61 taccgtgata gtattttaaa aactatgaaa gcgggaatac ttaatggtaa actagttaga
121 ttatgtgacg tgccaagggg tgtagatgta gaaattgaaa caactggtct aaccgattca
181 gaaggagaaa gtgaatcaaa agaagaagag tgatgatgaa tagccaccat tactgcatac
241 tgtagccctt acccttgtcg caccattagc cattaataaa aataaaaaat tatataaaaa
301 ttacacccat
|
into a string containing only a, t, g, c
One way to do it would be first of all get rid of all the numbers, spaces and newlines:
program 13
# gb1.pl
while (<>) {
s/ //g;
s/\d//g;
s/\n//g;
s/ORIGIN//g;
print
}
|
Let's go through this line by line:
s/ //g;
|
the value of $_ after this is:
ORIGIN 1gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag 61taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga 121ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca 181gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac 241tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa 301ttacacccat |
s/\d//g;
|
the value of $_ after this is:
ORIGIN gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggag taccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttaga ttatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattca gaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatac tgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaa ttacacccat |
s/\n//g;
|
the value of $_ after this is:
ORIGINgtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttagattatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattcagaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatactgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaattacacccat |
s/ORIGIN//g;
|
the value of $_ after this is:
gtgacagttggctgtcagacatacaatgattgtttagaagaggagaagattgatccggagtaccgtgatagtattttaaaaactatgaaagcgggaatacttaatggtaaactagttagattatgtgacgtgccaaggggtgtagatgtagaaattgaaacaactggtctaaccgattcagaaggagaaagtgaatcaaaagaagaagagtgatgatgaatagccaccattactgcatactgtagcccttacccttgtcgcaccattagccattaataaaaataaaaaattatataaaaattacacccat |
this won't work on an entire genbank record, but if we put the above piece of text through, we end up only tcags
the above code could also be written as
program 14
# gb2.pl
while (<>) {
s/ORIGIN| |\d|\n//g;
print;
}
|
the pipe symbol | means OR - as in match this OR match this OR match this
$rev = reverse $seq;
$rev =~ tr/GATC/CTAG/;
|
Translates all occurrences of the characters found in the search list with the corresponding character in the replacement list; for instance, G with C; A with T; T with A; C with G
we could have written our genbank to sequence script above as
program 16
while ( <> ) {
tr/atcg//cd;
}
|
the c means complement the searchlist
the d means delete found but unreplaced characters
Advanced - Genbank records to fasta
program 17
#!/usr/local/bin/perl # genbank2fasta.pl |
A full discussion of this program is outside the scope of this lecture; here is a brief outline
|
matches genbank LOCUS line like this:
LOCUS DDU63596 310 bp DNA INV 14-MAY-1999 |
The locus name matches the first subpattern $1, the bp count matches the second subpattern $2, and the type matches the third $3
|
the subpattern here matches everything from the end of the genbank ORIGIN keyword right to the end of the record. normally the regular expression would only match as far as the first newline \n; the s qualifier forces the regexp engine to treat this as a single line.
Thankfully you rarely need to write programs such as the above. If you ever need to parse genbank (or other common bioinformatic file format) records, the BioPerl project has modules for this very purpose.
Extend the previous restriction enzyme program to find enzymes like
'EcoNI' => 'CCTNNNNNAGG',
'EcoRII' => 'CCWGG',
Using the re_hash.pl program from lecture 3, write a program
that asks a user for a dna sequence, and searches that sequence
for each of the restriction enzymes. see here for
answer.
From the sequence file ~cmungall/data/...
search for sequences of (TA)n repeats, where n >= 5 and
"mask" them by replacing them with N's.
From the same sequence file, split each sequence into three-letter
"codons" separated by spaces:
Exercises
% unwrap example1.fasta | codons M43911 GAT TCC GAT CCC CCC CCC AGT TTG ACC AAA GTT CAG AGG AAA... L54931 GGG TGG TGG TGA GAG AGA GCG ATT GAA AGC TAT ATA TAT GAC... L54932 TAG TTG ATT CAG TCC GAT TTC AAT TGA TTT CCC GTA TAT CCT... |
Pipe the output of this program to a codon translation program named "ribosome":
% unwrap example1.fasta | codons | ribosome M43911 PGGULLMLGGGYL.... |
Use the following code to help you
# codontable.pl %CODON_TABLE = ( TCA => 'S',TCG => 'S',TCC => 'S',TCT => 'S', TTT => 'F',TTC => 'F',TTA => 'L',TTG => 'L', TAT => 'Y',TAC => 'Y',TAA => '*',TAG => '*', TGT => 'C',TGC => 'C',TGA => '*',TGG => 'W', CTA => 'L',CTG => 'L',CTC => 'L',CTT => 'L', CCA => 'P',CCG => 'P',CCC => 'P',CCT => 'P', CAT => 'H',CAC => 'H',CAA => 'Q',CAG => 'Q', CGA => 'R',CGG => 'R',CGC => 'R',CGT => 'R', ATT => 'I',ATC => 'I',ATA => 'I',ATG => 'M', ACA => 'T',ACG => 'T',ACC => 'T',ACT => 'T', AAT => 'N',AAC => 'N',AAA => 'K',AAG => 'K', AGT => 'S',AGC => 'S',AGA => 'R',AGG => 'R', GTA => 'V',GTG => 'V',GTC => 'V',GTT => 'V', GCA => 'A',GCG => 'A',GCC => 'A',GCT => 'A', GAT => 'D',GAC => 'D',GAA => 'E',GAG => 'E', GGA => 'G',GGG => 'G',GGC => 'G',GGT => 'G'); |
Berkeley Drosophila Genome Project
Date: Tue May 29 10:31:25 2001