Perl 5 - Subroutines and Modules

Reading: Deitel ch6; 6.1-6.9, 6.13, 6.14 (basic); 6.10-6.13 advanced

Our first subroutine

So far we've been using the built in subroutines (functions) provided by perl;

for instance: print, split, chomp, sqrt, abs

Now we'll go into the details of how to build our own subroutines.

Imagine we want to define our own subroutine for printing greetings to people as they arrive at certain destinations, such that we could write code like this:

program 1

greet( "Chris", "Rio" );

download

that would produce output like this:

Hello, Chris. Welcome to Rio!

Here is one way of defining such a subroutine:

program 2

# greet.pl


# friendly subroutine 
sub greet {
        my ($name, $location) = @_;
        print "Hello, $name. ";
        print "Welcome to $location!\n";
}

# call the subroutine with 2 arguments
greet("Chris", "Rio");

download

Hello, Chris. Welcome to Rio!

How does this work?

Declaring the subroutine

First of all we declare the subroutine, and give it a name. You can call your own subroutines what you like, but make sure you give your own subroutines meaningful names, to help other people understand your programs (and to help you, when you come back to it 6 months later!). Also the convention in perl is to_name_your_subroutines_like_this altough some people prefer JavaStyleSubroutineNames.

The body of the subroutine is contained within the block of code enclosed by curly braces {}.

The general form for a subroutine definition is:

sub subroutineName 
{
      statement₁;
      statement₂;
      statement₃;
      .
      .
      .
      statement_n;
}

Subroutine arguments

Subroutines take zero or more arguments. Arguments are passed as an array to the subroutine and go into the special array variable @_

We can use the shift function to grab the arguments from the array.

Note that shift with no arguments implicitly pulls variables from @_

This means we can also write our first subroutine as:

program 3

sub greet {
        my $name     =  shift;   # get first argument from @_
        my $location =  shift;   # get second argument from @_
        print "Hello, $name. Welcome to $location!\n";
}

download

compare this with:

program 4

sub greet {
        my ($name, $location) = @_;
        print "Hello, $name. ";
        print "Welcome to $location!\n";
}

download

Returning values

Just say we wanted to write a subroutine to perform the following function:

examplefunction:

 f( x, y ) = 3x² + 5xy - 12y

such that we could write code like this:

program 5

print " f( 7, 9 ) = ", examplefunction( 7, 9 ), "\n";
print " f( -3, 18.5 ) = ", examplefunction( -3, 18.5 ), "\n";

download

program 6

#!/usr/local/bin/perl
# examplefunction.pl

print " f( 7, 9 ) = ", examplefunction( 7, 9 ), "\n";
print " f( -3, 18.5 ) = ", examplefunction( -3, 18.5 ), "\n";

sub examplefunction {
    my $x = shift @_;
    my $y = shift @_;
    return ( $x ** 2 ) + ( 5 * $x * $y ) - ( 12 * $y );
}

download

We use the return keyword to pass back the value of a variable to the caller.

sub subroutineName 
{
      statement₁;
      statement₂;
      statement₃;
      .
      .
      .
      statement_n-1;
      return expression;
}

A subroutine to do reverse complementation

program 7

# revcomp1.pl

# subroutine for working out the
# reverse complement of a sequence
sub reverse_complement {
        my $seq = shift @_;
        my $rev = reverse $seq;   # reverses string
        $rev =~ tr/GATC/CTAG/;    # complement nts
        return $rev;
}


print reverse_complement("AAAACCCC") . "\n";
print reverse_complement("TTGGGC")   . "\n";

download

this will output:

GGGGTTTT
GCCCAA

We use a new operator here - tr

This is a substitution operator, explained in detail in a future lecture on regular expressions.

For now, read the following code

program 8

    $rev =~ tr/GATC/CTAG/;

download

as meaning - for every character in the string $rev, substitute G with C; A with T; T with A; C with G

this is the complementation part of the reverse complement

The my keyword

Notice the declaration of the subroutine variable with my

This ensures that the variable declaration is private to the subroutine. Otherwise the subroutine could have unintended side-effects such as altering variables which are already in use.

To illustrate:

program 9

# mean.pl
sub mean {
    my @list = @_;
    my $sum  = 0;
    foreach my $n (@list) {
        $sum += $n;
    }
    return $sum / scalar(@list);
}

$n    = 8;
@list = (1, 2, 3);
$mean = mean($n, @list);

print "n    = $n\n";
print "list = @list\n";
print "mean = $mean\n";

download

This program outputs:

n    = 8
list = 1 2 3
mean = 3.5

Notice that the variables $n and @list in the subroutine are private, and represent different variables from the $n and @list in the body of the program.

It is good practice to use my all the time in your programs

see Deitel 6.13 (you can ignore for now uses of the our keyword)

More useful subroutines

Using string handling functions

program 10

#!/usr/local/bin/perl
# find_start_codons.pl

# find the position in a sequence of all 
# possible translation initiation (ATG) sites

# define a subroutine
sub find_start_codons {
    my $seq = shift;
    
    # keep list of ATG positions
    my @atg_positions = ();

    # loop through all the bases in the sequence
    # - think of a sliding "window" 3 bases long
    my $window_position = 0;
    while ( $window_position <= length($seq) - 3 ) {

        my $next_3_bases = substr($seq, $window_position, 3);

        print "looking at window: $next_3_bases\n";
        if ($next_3_bases eq "ATG") {
            
            # found a possible translation-start!
            # report this (counting the first base
            # as position 1)
            print "found ATG at position " . ($window_position+1) ."\n";
	    push( @atg_positions, $window_position + 1 );
        }

	# increment
	$window_position++;
    }

    # return an array of start positions to called
    return @atg_positions;
}

# call the subroutine:
@atgs = find_start_codons("TTGGATTATGCCGGATGCATTTCAGTATGCCAAAAA");
print "ATG positions = @atgs\n";

download

looking at window: TTG
looking at window: TGG
looking at window: GGA
looking at window: GAT
looking at window: ATT
looking at window: TTA
looking at window: TAT
looking at window: ATG
found ATG at position 8
looking at window: TGC
looking at window: GCC
looking at window: CCG
looking at window: CGG
looking at window: GGA
looking at window: GAT
looking at window: ATG
found ATG at position 15
looking at window: TGC
looking at window: GCA
looking at window: CAT
looking at window: ATT
looking at window: TTT
looking at window: TTC
looking at window: TCA
looking at window: CAG
looking at window: AGT
looking at window: GTA
looking at window: TAT
looking at window: ATG
found ATG at position 27
looking at window: TGC
looking at window: GCC
looking at window: CCA
looking at window: CAA
looking at window: AAA
looking at window: AAA
looking at window: AAA
ATG positions = 8 15 27

This subroutine introduces two new builtin perl functions

**length( string )**

this returns the length of the string fed in

**substr( string , index , length , replacement string)**

this takes a substring of the string starting at index, extending up to length characters, replacing them with replacement string

The last two arguments are optional.

You can also use a negative index to indicate the substring should be taken from the right.

Recursion (Optional)

See Deitel 6.10, 6.11, 6.12

Modules

Deitel 6.14

(or man perlmod for a very detailed explanation)

So now you should be able to create your own subroutines for carrying out useful tasks and data manipulations.

What happens if you want to use the same subroutines in different programs? Or if you want to share your subroutines with your fellow programmers?

Of course, you could copy the relevant piece of your first program into your other programs, but then if you want to make modifications to your original subroutine, you have to modify multiple programs! This rapidly becomes unworkable.

Modules and reusability

A key concept in software engineering is that of re-usability

The easiest way to facilitate reuse in perl is through modules

Let's say we want create a module containing various useful biological sequence manipulation subroutines, for your own use and for sharing with others.

program 11

# MySeqTools.pm - my collection of useful sequence utilities
package MySeqTools;

# subroutine for working out the
# reverse complement of a sequence
sub reverse_complement {
        my $seq = shift @_;
        my $rev = reverse $seq;
        $rev =~ tr/GATC/CTAG/;
        return $rev;
}

# we could include more subroutine definitions here...

return 1;  # end of package

download

the package keyword tells perl that we are defining a new namespace

the above code resides in a file called MySeqTools.pm

the .pm suffix indicates that this a module.

Now lets say your colleague comes along and wants to write a program to reverse complement all 3prime ESTs in a fasta file. Being a sensible programmer, she doesn't want to reinvent the wheel, so she writes a program to use your MySeqTools module.

Because this seems like a difficult task, she gets warmed up by first of all writing a smaller program that uses your MySeqTools module to reverse complement a fixed dna sequence:

program 12

#!/usr/local/bin/perl
# test_seqmodule.pl

# use the sequence tools module
use MySeqTools;   

$dnaseq = "AGTCT";
$dnaseq_rc = MySeqTools::reverse_complement($dnaseq);
print $dnaseq_rc;           # prints AGACT

download

At the beginning of the program, the use keyword specifies that an external module should be used.

We specify the subroutine name by first qualifying it with the name of the package it belongs to.

This is how the full program would be written:

For the sake of simplicity, let's assume that the files are in the following format:

LD09278.3prime AGTCTTCCCCTGGCGACTGCTGCCACTCAACGGCGCTGACCGCCTGCTGCGCCGTGG
LD09278.5prime AGCAAGTGCAGTTAGTTTCTAACTAATCGTTTCAGGGATTTGGGATGGCACGCAAAG
LD09751.3prime  CACAACACTACCGCAGTCCACATTGAGACCAAAGCAGCTTTCGCTACTATTTGGACG

with one line per sequence, seperated by newlines. (We'll learn how to parse the popular FASTA format later)

program 13

#!/usr/local/bin/perl
# rev3prime.pl

# use the sequence tools module
use MySeqTools;   

while ( <> ) {
      chomp;     # get rid of trailing newline

      # the variable $_ contains the sequence ID and cDNA sequence;
      # split up the line into two variables:
      my ( $seq_id, $residues ) = split( ' ', $_ );

      # check to see if this is a 3 prime end EST
      if ( $seq_id =~ m/3prime$/ ) {

           # revcomp the sequence
           $residues = MySeqTools::reverse_complement($residues);
           
           # modify the ID so we know that this is revcomped
           $seq_id .= "-REVCOMP";
      }
      
      # output the data after any transformations
      print "$seq_id $residues\n";
}

download

let's try it out:

% rev3prime.pl EST.seq > EST.rev.seq
% cat EST.rev.fa
LD09278.3prime-REVCOMP TCCACCGGGAAAGGAGCCTGCAGCGGATTCCTGCTCTTCAGCTC
LD09278.5prime AGCAAGTGCAGTTAGTTTCTAACTAATCGTTTCAGGGATTTGGGATGGCACG
LD09751.3prime-REVCOMP TACCCAACAGCATTCGCAGCAACAGACTCGCGTCGGAATGGATC

Explanation

We loop through the line in the input file, breaking them up into sequence ID and actual cDNA sequence.

In order to check if the sequence is a 3prime EST, we rely on a certain naming convention. (You should always beware of relying on naming conventions, as they have a tendency to change over time). We are assuming all 3 prime sequences end with the suffix. To test whether an EST is 5 or 3 prime, I've introduced a new construct that will be covered more fully in the next lecture, regular expressions, specifically the match operator:

 $seq_id =~ m/3prime$/

This expression returns TRUE if the string value of the $seq_id variable contains the string "3prime"

If the sequence is at the 3 prime end, we reverse complement it:

$residues = MySeqTools::reverse_complement($residues);

notice that unlike the subroutine calls we have encountered so far, this one has the name of the package preceeding the subroutine name, followed by double colon symbols. This is because MySeqTools exists in a different namespace

While it may seem annoying to have to include this with all your subroutine calls it does prevent clashing subroutine names - it acts as an address to the subroutine. In fact, many modules go further and have a two part address, for instance you may want to call your package Fiocruz::MySeqTools, so you could collect other useful modules under Fiocroz:: and have them organized together.

Some modules allow you to import the subroutines into your own namespace.

For the program above to work, you have to download MySeqTools.pm into your current unix directory. At a later date you may want to explore ways of telling perl where to find your modules; type "man lib" on the unix command line for more details.

Existing Modules: CPAN

One of the advantages of using perl is that there are lots of other people out there using it too, people who like to share software! This means that for any easily generalised problem, there will quite possibly be a module or set of modules for you to use.

Most of these modules are centralised in one place, the Comprehensive Perl Archive Network (http://www.cpan.org/). Here you will find modules for everything from statistical analysis to writing games.

Generally all the modules at CPAN are very well documented. They have good examples of how to use them in your own programs, you should never have to go in and look at the actual code within the module yourself (although this can be a good way of learning advanced perl tricks)

A few of the modules at CPAN you may find useful:

CGI.pm

This is a module to help you write CGI (Common Gateway Interface) programs. A CGI program is a special kind of program that runs behind a web server, allowing users to interact with your data across the web. Whenever you use a web interface to a biological database or analysis tool such as the NCBI blast server, this usually has CGI or some variant behind it.

If you are interested in writing webserver-based programs, Deitel ch7 has an introduction and links to sites you can learn more. You can get CGI.pm (if you don't already have it installed) from Lincoln Stein's website http://stein.cshl.org/

GD.pm

GD is a useful module for drawing simple graphics in GIF/PNG format for displaying as part of a web page or web service.

Ace.pm

This is a module for interacting with ACEDB databases; (we will return to databases in a future lecture).

BioPerl

This is a set of object oriented modules that are extremely helpful for managing and analysing biological data. They are focused towards sequence oriented data, but not exclusively so.

BioPerl isn't the work of one person, it's an open collaboration between multiple programmers all over the world. They have a website at http://www.bioperl.org/ ; if you end up writing biological programs that you think may be useful to other people, you should think about hosting these at bioperl.

Installing modules from CPAN

If you have root access on your computer, installing third party modules is generally a painless process.

Type:

% perl -MCPAN -e shell

If this is your first time, you will have to answer some questions about your setup; generally you can just hit return and go with the defaults. Once this process is complete, just type "install MODULENAME", for instance "install CGI"

If you do not have root access on your machine, you can still install modules in your own personal space quite easily, although this is outside the scope of this lecture. Go to http://www.cpan.org/ for more details.

Exercises

Deitel ex 6.4, 6.5

Extend the MySeqTools module to have functions that

1. take a hash as an argument, and print out a file of sequences in the format used in the examples above

2. take a filename as argument, and parse the file (assumed to be in the above format) and return a hash of id=>seq

Chris Mungall cjm@fruitfly.org
Berkeley Drosophila Genome Project