Reading: Deitel ch6; 6.1-6.9, 6.13, 6.14 (basic); 6.10-6.13 advanced
Our first subroutine

So far we've been using the built-in subroutines (functions) provided by perl, for instance: print, split, chomp, sqrt, abs.
Now we'll go into the details of how to build our own subroutines.
Imagine we want to define our own subroutine for printing greetings to people as they arrive at certain destinations, so that one short call in our program prints a complete greeting.
Here is one way of defining such a subroutine:
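A minimal sketch of such a definition, together with two calls and the output they produce. The subroutine name greet, the argument order (name, then place) and the wording of the greeting are assumptions made for illustration, not the original listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subroutine: print a greeting for a person arriving at a place.
sub greet {
    my ($name, $place) = @_;    # copy the arguments into private variables
    print "Hello $name, welcome to $place!\n";
}

greet("Maria", "Rio de Janeiro");
greet("Carlos", "Salvador");

# Output:
# Hello Maria, welcome to Rio de Janeiro!
# Hello Carlos, welcome to Salvador!
```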
How does this work?

Declaring the subroutine

First of all we declare the subroutine and give it a name. You can call your own subroutines what you like, but give them meaningful names, to help other people understand your programs (and to help you, when you come back to them 6 months later!). The convention in perl is to_name_your_subroutines_like_this, although some people prefer JavaStyleSubroutineNames. The body of the subroutine is the block of code enclosed by curly braces {}. The general form for a subroutine definition is the keyword sub, then the name, then the body in curly braces: sub subroutine_name { statements }
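As a concrete instance of that general form, here is a small sketch of a subroutine that takes no arguments. The name print_separator and what it prints are invented for illustration.

```perl
use strict;
use warnings;

# General shape of a definition:
#
#   sub subroutine_name {
#       statements;
#   }

# A concrete instance with a hypothetical name, taking no arguments.
sub print_separator {
    print "-" x 20, "\n";    # twenty dashes and a newline
}

print_separator();           # call it by name, followed by parentheses
```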
Subroutine arguments
Subroutines take zero or more arguments. Arguments are passed to the subroutine as a list and arrive in the special array variable @_. We can use the shift function to grab the arguments from the array one at a time. Note that inside a subroutine, shift with no arguments implicitly takes its value from @_. This means we can also write our first subroutine as:
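A sketch of the shift-based form, using a hypothetical greet subroutine (its name, arguments and greeting text are assumptions for illustration). Each call to shift removes the first remaining element of @_ and returns it.

```perl
use strict;
use warnings;

sub greet {
    my $name  = shift;    # first argument, removed from the front of @_
    my $place = shift;    # second argument, which is now at the front of @_
    print "Hello $name, welcome to $place!\n";
}

greet("Maria", "Rio de Janeiro");
# Output:
# Hello Maria, welcome to Rio de Janeiro!
```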
compare this with: program 4
Suppose we want a subroutine that takes a DNA sequence and gives back its reverse complement, so that the calling code can store the result in a variable and print it. We use the return keyword to pass back the value of a variable to the caller.
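A minimal sketch of a subroutine that returns a value, assuming the task is the reverse complement discussed in the text. The name reverse_complement and the test sequence are assumptions; the variable $rev follows the text. The sketch handles uppercase G, A, T and C only.

```perl
use strict;
use warnings;

# Hypothetical name; returns the reverse complement of a DNA string.
sub reverse_complement {
    my $seq = shift;
    my $rev = reverse $seq;     # reverse the order of the bases
    $rev =~ tr/GATC/CTAG/;      # complement: G->C, A->T, T->A, C->G
    return $rev;                # hand the result back to the caller
}

my $dna     = "GATTACA";
my $revcomp = reverse_complement($dna);
print "$dna -> $revcomp\n";
# Output:
# GATTACA -> TGTAATC
```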
We use a new operator here: tr. This is the transliteration operator, which replaces characters one for one; it is explained in detail in a future lecture on regular expressions. For now, read the tr expression applied to $rev as meaning: for every character in the string $rev, substitute G with C, A with T, T with A, and C with G. This is the complementation part of the reverse complement.

The my keyword

Notice the declaration of the subroutine variable with my. This ensures that the variable is private to the subroutine. Otherwise the subroutine could have unintended side-effects, such as altering variables which are already in use elsewhere in the program. To illustrate, consider a program whose main body and subroutine both declare variables called $n and @list.
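A small sketch of this scoping behaviour. The variable names $n and @list follow the text; the subroutine name count_items and all the values are assumptions for illustration.

```perl
use strict;
use warnings;

my $n    = 10;
my @list = (1, 2, 3);

# Hypothetical subroutine with its own private $n and @list.
sub count_items {
    my @list = @_;              # a new @list, visible only inside this block
    my $n    = scalar @list;    # likewise a new $n
    return $n;
}

my $counted = count_items("a", "b", "c", "d");
print "subroutine counted: $counted\n";
print "main \$n is still: $n\n";
print "main \@list is still: @list\n";
# Output:
# subroutine counted: 4
# main $n is still: 10
# main @list is still: 1 2 3
```

Without the two my declarations inside the subroutine, the assignments would overwrite the $n and @list of the main program.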
Notice that the variables $n and @list in the subroutine are private, and represent different variables from the $n and @list in the body of the program. It is good practice to use my all the time in your programs; see Deitel 6.13 (for now you can ignore uses of the our keyword).
Using string handling functions
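The subroutine originally shown at this point is not reproduced in the text. As a stand-in, here is a sketch with an invented subroutine, abbreviate, that uses both length and substr, followed by the negative-index and replacement forms of substr.

```perl
use strict;
use warnings;

# Hypothetical example: shorten a sequence to its first few bases
# and report its total length.
sub abbreviate {
    my ($seq, $keep) = @_;
    my $len = length($seq);                # number of characters in $seq
    return $seq if $len <= $keep;          # short enough already
    my $start = substr($seq, 0, $keep);    # $keep characters from index 0
    return "$start... ($len bases)";
}

print abbreviate("GATTACAGATTACA", 5), "\n";    # GATTA... (14 bases)

my $seq = "GATTACA";
print substr($seq, -3), "\n";    # negative index, counts from the end: ACA
substr($seq, 0, 2, "NN");        # four-argument form changes $seq itself
print "$seq\n";                  # NNTTACA
```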
This subroutine introduces two new built-in perl functions:

length( string )
This returns the length of the string fed in.

substr( string , index , length , replacement string )
This takes a substring of the string starting at index and extending up to length characters, replacing them with replacement string. The last two arguments are optional. You can also use a negative index, which counts from the right-hand end of the string.

Recursion (Optional)

See Deitel 6.10, 6.11, 6.12
Modules

Deitel 6.14 (or man perlmod for a very detailed explanation)

So now you should be able to create your own subroutines for carrying out useful tasks and data manipulations. What happens if you want to use the same subroutines in different programs? Or if you want to share your subroutines with your fellow programmers? Of course, you could copy the relevant piece of your first program into your other programs, but then if you want to make modifications to your original subroutine, you have to modify multiple programs! This rapidly becomes unworkable. A key concept in software engineering is that of re-usability.
The easiest way to facilitate reuse in perl is through modules. Let's say we want to create a module containing various useful biological sequence manipulation subroutines, for your own use and for sharing with others.
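A minimal sketch of what such a module file might contain. The package name MySeqTools comes from the text; the subroutine name reverse_complement and its body are assumptions for illustration.

```perl
# File: MySeqTools.pm
package MySeqTools;

use strict;
use warnings;

# Return the reverse complement of an uppercase DNA string.
sub reverse_complement {
    my $seq = shift;
    my $rev = reverse $seq;
    $rev =~ tr/GATC/CTAG/;
    return $rev;
}

1;    # a module file finishes by returning a true value
```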
The package keyword tells perl that we are defining a new namespace. The above code resides in a file called MySeqTools.pm; the .pm suffix indicates that this is a module.
Now let's say your colleague comes along and wants to write a program to reverse complement all 3prime ESTs in a FASTA file. Being a sensible programmer, she doesn't want to reinvent the wheel, so she writes a program to use your MySeqTools module. Because this seems like a difficult task, she gets warmed up by first of all writing a smaller program that uses your MySeqTools module to reverse complement a fixed DNA sequence:
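A sketch of such a warm-up program. In a real program the first line after the pragmas would be use MySeqTools; with MySeqTools.pm in a directory where perl looks for modules. So that the sketch runs without that separate file, the package is repeated inline here; the subroutine name reverse_complement and the test sequence are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for:   use MySeqTools;
# (the package normally lives in its own file, MySeqTools.pm)
{
    package MySeqTools;

    sub reverse_complement {
        my $seq = shift;
        my $rev = reverse $seq;
        $rev =~ tr/GATC/CTAG/;
        return $rev;
    }
}

my $dna = "ATGGCCTTAA";
# The subroutine name is qualified with the name of its package.
my $rc = MySeqTools::reverse_complement($dna);
print "$rc\n";
# Output:
# TTAAGGCCAT
```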
At the beginning of the program, the use keyword specifies that an external module should be used. We specify the subroutine name by first qualifying it with the name of the package it belongs to.

Now for the full program. For the sake of simplicity, let's assume that the input files hold one sequence per line, with lines separated by newlines. (We'll learn how to parse the popular FASTA format later.)
let's try it out:
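A sketch of the full program under stated assumptions: each line holds a sequence ID and a sequence separated by whitespace, the IDs and sequences are invented, every record is printed (3prime ones reverse complemented), and an in-memory string stands in for the input file. As before, the inline package replaces use MySeqTools; so that the sketch runs without a separate file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for "use MySeqTools;"; reverse_complement is a hypothetical name.
{
    package MySeqTools;

    sub reverse_complement {
        my $seq = shift;
        my $rev = reverse $seq;
        $rev =~ tr/GATC/CTAG/;
        return $rev;
    }
}

# Assumed format: "ID sequence", one record per line.
my $data = "est001.5prime ATGGCC\n"
         . "est001.3prime GGCCAT\n"
         . "est002.3prime TTAACG\n";

my @output;
open(my $in, '<', \$data) or die "Cannot open input: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($seq_id, $seq) = split ' ', $line;
    next unless defined $seq;                 # skip blank or malformed lines
    if ($seq_id =~ /3prime/) {                # 3 prime EST: flip it
        $seq = MySeqTools::reverse_complement($seq);
    }
    push @output, "$seq_id $seq";
}
close $in;

print "$_\n" for @output;
# Output:
# est001.5prime ATGGCC
# est001.3prime ATGGCC
# est002.3prime CGTTAA
```

To read a real file instead, the open call would take a filename in place of \$data.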
Explanation

We loop through the lines in the input file, breaking each one up into a sequence ID and the actual cDNA sequence. In order to check if the sequence is a 3prime EST, we rely on a certain naming convention. (You should always beware of relying on naming conventions, as they have a tendency to change over time.) We are assuming that the ID of every 3 prime sequence carries a fixed suffix. To test whether an EST is 5 or 3 prime, I've introduced a new construct that will be covered more fully in the next lecture on regular expressions, specifically the match operator:
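A short sketch of the match operator in use. The expression $seq_id =~ /3prime/ is reconstructed from the description in the text; the helper name is_3prime and the example IDs are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the ID contains the text "3prime", else 0.
sub is_3prime {
    my $seq_id = shift;
    return $seq_id =~ /3prime/ ? 1 : 0;
}

foreach my $seq_id ("est001.5prime", "est001.3prime") {
    if (is_3prime($seq_id)) {
        print "$seq_id: 3 prime\n";
    }
    else {
        print "$seq_id: not 3 prime\n";
    }
}
# Output:
# est001.5prime: not 3 prime
# est001.3prime: 3 prime
```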
This expression returns TRUE if the string value of the $seq_id variable contains the string "3prime". If the sequence is at the 3 prime end, we reverse complement it by calling the subroutine from the MySeqTools module.
Notice that unlike the subroutine calls we have encountered so far, this one has the name of the package and a double colon (::) preceding the subroutine name. This is because MySeqTools exists in a different namespace. While it may seem annoying to have to include this with all your subroutine calls, it does prevent clashing subroutine names: it acts as an address to the subroutine. In fact, many modules go further and have a two part address; for instance, you may want to call your package Fiocruz::MySeqTools, so you could collect other useful modules under Fiocruz:: and have them organized together. Some modules allow you to import the subroutines into your own namespace.
One of the advantages of using perl is that there are lots of other people out there using it too, people who like to share software! This means that for any easily generalised problem, there will quite possibly be a module or set of modules for you to use. Most of these modules are centralised in one place, the Comprehensive Perl Archive Network (http://www.cpan.org/). Here you will find modules for everything from statistical analysis to writing games. Generally all the modules at CPAN are very well documented, with good examples of how to use them in your own programs. You should never have to go in and look at the actual code within the module yourself (although this can be a good way of learning advanced perl tricks).

A few of the modules at CPAN you may find useful:

CGI.pm
This is a module to help you write CGI (Common Gateway Interface) programs. A CGI program is a special kind of program that runs behind a web server, allowing users to interact with your data across the web. Whenever you use a web interface to a biological database or analysis tool such as the NCBI blast server, this usually has CGI or some variant behind it. If you are interested in writing webserver-based programs, Deitel ch7 has an introduction and links to sites where you can learn more. You can get CGI.pm (if you don't already have it installed) from Lincoln Stein's website http://stein.cshl.org/

GD.pm
GD is a useful module for drawing simple graphics in GIF/PNG format for displaying as part of a web page or web service.

Ace.pm
This is a module for interacting with ACEDB databases (we will return to databases in a future lecture).
BioPerl
This is a set of object oriented modules that are extremely helpful for managing and analysing biological data. They are focused towards sequence oriented data, but not exclusively so. BioPerl isn't the work of one person; it's an open collaboration between multiple programmers all over the world. They have a website at http://www.bioperl.org/ ; if you end up writing biological programs that you think may be useful to other people, you should think about hosting these at bioperl.

If you have root access on your computer, installing third party modules is generally a painless process. Start the CPAN shell by typing "perl -MCPAN -e shell" at the command line.
If this is your first time, you will have to answer some questions about your setup; generally you can just hit return and go with the defaults. Once this process is complete, just type "install MODULENAME", for instance "install CGI". If you do not have root access on your machine, you can still install modules in your own personal space quite easily, although this is outside the scope of this lecture. Go to http://www.cpan.org/ for more details.
Deitel ex 6.4, 6.5

Extend the MySeqTools module to have functions that
1. take a hash as an argument, and print out a file of sequences in the format used in the examples above
2. take a filename as argument, parse the file (assumed to be in the above format), and return a hash of id=>seq