Perl 3 - Arrays and Hashes

Reading: Deitel ch4

(leave 4.6 until next lecture)

Advanced: man perldata (perl data structures manual)

So far we have been looking at scalar variables. Recall that a scalar is a single-valued variable.

Limitations of scalar variables

Imagine we want to find the average of a list of numbers. We could do it like this:


program 1
$number1 = 5.4;
$number2 = 7.3;
$number3 = 4.1;
$average = ( $number1 + $number2 + $number3 ) / 3;

But this is obviously extremely limited: every extra number needs another variable, and the divisor has to be changed by hand.

Lists


program 2
( 5.6, 8.22, 14.9 );            # list of floating point numbers

( "hello", "brazil" ); # list of strings

( "hello", $country );

( "blah", 18, 22, "x", 3.14 ); # mixed list

( 0 .. 5 );                     # list of integers between 0 and 5

( 'a' .. 'z' );                 # list of strings a, b, c, d, ...


Array variables


program 3
@numbers = (5.6, 8.22, 14.9);    # list of floating point numbers

@words = ("Hello", "Brazil!"); # list of strings

@qual = (100, 100, 100, 75, 75, 75);

@greeting = ("hello", $country);

@list = ("blah", 18, 22, "x", 3.14); # mixed list

@range = (0..5); # list of integers between 0 and 5


As we can see, we use the special character @ to denote an array variable. An array variable holds a list.
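
Going back to the averaging problem, here is a minimal sketch of how an array removes the limitation of one variable per number. The data values are just the ones from program 1, and the names $total and $n are made up for this illustration. The sketch also uses use strict and my declarations, which these notes have not introduced, and the scalar function, which is covered further down.

```perl
use strict;
use warnings;

# Same numbers as program 1, but held in one array (illustrative data).
my @numbers = (5.4, 7.3, 4.1);

# Add up every element, however many there are.
my $total = 0;
foreach my $n (@numbers) {
    $total += $n;
}

# scalar(@numbers) is the number of elements in the array.
my $average = $total / scalar(@numbers);
printf "average = %.2f\n", $average;    # average = 5.60
```

Adding a fourth number now only means adding it to the list; the loop and the division stay the same.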

Accessing array elements


program 4
# alphabet_index.pl
print "Enter an index number between 0 and 25\n";
$index = <STDIN>;
chomp $index;

@letters = ('A'..'Z');
print "letter index $index = $letters[$index] \n";


 

The number in square brackets is the index.

Arrays are indexed from zero, not one.

The arrays above are lists of scalars. We use the scalar sign $ to indicate that the element we are accessing is a scalar.

Setting the values in an array

program 5
# array_set.pl
@words = ('The', 'quick', 'brown', 'fox');
$words[1] = 'small';
$words[2] = 'furry';
print "@words \n";    # The small furry fox

Let's go through this line by line and see what is happening


program 6
@words = ('The', 'quick', 'brown', 'fox');

This constructs an array that looks like this:

          +------+
$words[0] |The   |
          +------+
$words[1] |quick |
          +------+
$words[2] |brown |
          +------+
$words[3] |fox   |
          +------+

program 7
$words[1] = 'small';

This sets element 1 (counting from zero) of the array, so our array now looks like this:

          +------+
$words[0] |The   |
          +------+
$words[1] |small |
          +------+
$words[2] |brown |
          +------+
$words[3] |fox   |
          +------+

program 8
$words[2] = 'furry';

This sets element 2 (counting from zero) of the array, so our array ends up like this:

          +------+
$words[0] |The   |
          +------+
$words[1] |small |
          +------+
$words[2] |furry |
          +------+
$words[3] |fox   |
          +------+

Indexing arrays with negative numbers

You can index from the end of an array backwards by using negative numbers to index the array:


program 9
# negative_index.pl
@letters = ('A'..'Z');
print "       last letter = $letters[-1] \n";  # Z
print "penultimate letter = $letters[-2] \n";  # Y

Getting the length of an array

You can use the function scalar to evaluate an array as a single scalar value; that value is the length of the array (the number of elements it holds).


program 10
@numbers = (0..100);
print scalar(@numbers);    # prints 101
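
The scalar function is not the only way to get the length. As a rough sketch, anywhere Perl expects a single value an array gives its length; the variable names $count and @empty below are made up for the illustration.

```perl
use strict;
use warnings;

my @numbers = (0 .. 100);

# Assigning an array to a scalar variable stores the element count.
my $count = @numbers;
print "count = $count\n";               # count = 101

# A numeric comparison also sees the array as its length.
if (@numbers > 100) {
    print "more than 100 elements\n";
}

# An empty array has length 0, which counts as false.
my @empty = ();
print "nothing in the empty array\n" unless @empty;
```

Writing scalar(@numbers) explicitly, as in the program above, is still the clearest choice inside a print statement, where a bare @numbers would print the elements instead.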

The index count $#

You can also get the value of the last index by preceding the array variable name with $#


program 11
# index2.pl
@numbers = (0..100);
@numbers = reverse @numbers;
print "index = $#numbers \n";     # prints 100
print "$numbers[$#numbers] \n";   # prints 0
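
A short sketch of how $# relates to the length and to negative indexing, and how it can drive a loop over every index. The array contents and the names $last_index and $length are chosen just for this example.

```perl
use strict;
use warnings;

my @letters = ('A' .. 'E');

# The last index is always one less than the length.
my $last_index = $#letters;
my $length     = scalar(@letters);
print "last index = $last_index, length = $length\n";   # last index = 4, length = 5

# Index -1 and index $#letters refer to the same element.
print "same element\n" if $letters[-1] eq $letters[$#letters];

# Visit every element by index, from 0 up to the last index.
foreach my $i (0 .. $#letters) {
    print "index $i holds $letters[$i]\n";
}
```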

Taking a "slice" of an array

program 12
@words = ('the', 'quick', 'brown', 'fox');
@cut = @words[1,3];               # same as ($words[1], $words[3])
print "@cut \n";                  # quick fox
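
The index list of a slice can be any list, so ranges and negative indexes work too, and a slice can also appear on the left of an assignment. A small sketch (the names @middle and @ends are invented for the example):

```perl
use strict;
use warnings;

my @words = ('the', 'quick', 'brown', 'fox');

# A range as the index list.
my @middle = @words[1 .. 2];
print "@middle\n";                  # quick brown

# Negative indexes count from the end, as with single elements.
my @ends = @words[0, -1];
print "@ends\n";                    # the fox

# Assigning to a slice changes several elements at once (here: a swap).
@words[1, 2] = @words[2, 1];
print "@words\n";                   # the brown quick fox
```

Note that a slice is written with @, because the result is a list, while a single element is written with $.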

Functions that act on arrays

Push

program 13
# push_example.pl
@numbers = (1, 2, 3);
push(@numbers, 4, 5);
print "@numbers \n";    # prints 1 2 3 4 5

Pop

program 14
# pop_example.pl
@words = ('the', 'quick', 'brown', 'fox');

print pop(@words);    # fox
print pop(@words);    # brown
print pop(@words);    # quick
print pop(@words);    # the


Shift

program 15
# shift_example.pl
@words = ('the', 'quick', 'brown', 'fox');

print shift(@words);    # the
print shift(@words);    # quick
print shift(@words);    # brown
print shift(@words);    # fox


Unshift

program 16
# unshift_example.pl
@words = ('quick', 'brown', 'fox');
unshift(@words, 'the');
print "@words\n";   # the quick brown fox
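
Push and pop work on the right-hand end of an array; shift and unshift work on the left-hand end. One common use, sketched below with made-up variable names, is to treat an array as a stack (last in, first out) or as a queue (first in, first out).

```perl
use strict;
use warnings;

# Stack: add and remove at the same (right-hand) end.
my @stack = ();
push(@stack, 'first', 'second', 'third');
my $from_stack = pop(@stack);        # 'third': last in, first out

# Queue: add at the right-hand end, remove from the left-hand end.
my @queue = ();
push(@queue, 'first', 'second', 'third');
my $from_queue = shift(@queue);      # 'first': first in, first out

print "stack gave $from_stack, leaving @stack\n";   # stack gave third, leaving first second
print "queue gave $from_queue, leaving @queue\n";   # queue gave first, leaving second third
```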

Reverse

program 17
# reverse_example.pl
@words = ('the', 'quick', 'brown', 'fox');
print reverse(@words), "\n";   # foxbrownquickthe

An array in double quotes is interpolated, i.e. spaces are placed
between the elements. If we print an array without the quotes, the
elements are all squashed together.
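
To get the reversed words printed with separators, one option is to store the result in an array first and then interpolate it, or to use join (described later in these notes). A minimal sketch, with @backwards as an invented name:

```perl
use strict;
use warnings;

my @words = ('the', 'quick', 'brown', 'fox');

# reverse returns a new list; store it so it can be interpolated.
my @backwards = reverse(@words);
print "@backwards\n";                   # fox brown quick the

# join lets us choose the separator ourselves.
print join(", ", @backwards), "\n";     # fox, brown, quick, the

# The original array is left unchanged.
print "@words\n";                       # the quick brown fox
```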

Sort

program 18
# sort_example.pl
@words = ('The', 'quick', 'brown', 'fox', 'jumped');
@sorted = sort(@words);
print "sorted words = @sorted\n"; # The brown fox jumped quick

You can optionally specify a code block to use as the sort comparison. Inside the block, the special variables $a and $b hold the two elements being compared.


program 19
# sort_example2.pl
@words = ('The', 'quick', 'brown', 'fox', 'jumped');
@sorted = sort { lc($a) cmp lc($b) }  @words;
print "sorted words = @sorted\n"; # brown fox jumped quick The

lc is a function that takes a string as an argument and returns the string in lower case.

cmp is a new operator. It compares its two sides as strings and returns

-1 if the left side is less than (alphabetically before) the right side

0 if the left side is the same as the right side

+1 if the left side is greater than (alphabetically after) the right side

What do you think the output of the following program is?
program 20
# sort_example3.pl
@numbers = (100, 101, 102, 10, 11, 12, 1, 2, 3);
@sorted = sort @numbers;
print "sorted numbers = @sorted\n"; 

The default sort comparison is alphabetical (string) order, so "10" sorts before "2":
sorted numbers = 1 10 100 101 102 11 12 2 3

To sort in numeric order:
program 21
# sort_example4.pl
@numbers = (100, 101, 102, 10, 11, 12, 1, 2, 3);
@sorted = sort { $a <=> $b  } @numbers;
print "sorted numbers = @sorted\n"; # 1 2 3 10 11 12 100 101 102

<=> is a new operator. It compares its two sides as numbers and returns

-1 if the left side is less than the right side

0 if the left side is equal to the right side

+1 if the left side is greater than the right side
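
The two comparison operators can be tried directly, which shows why the default (string) sort put 10 before 2. Swapping $a and $b in the sort block is one simple way to reverse the order. This is only a sketch; the names $numeric, $stringy and @descending are invented for it.

```perl
use strict;
use warnings;

# The same pair of values compared as numbers and as strings.
my $numeric = (2 <=> 10);       # -1: 2 is less than 10
my $stringy = ('2' cmp '10');   #  1: the character '2' comes after '1'
print "2 <=> 10 gives $numeric\n";
print "'2' cmp '10' gives $stringy\n";

# Swapping $a and $b gives a descending numeric sort.
my @numbers    = (100, 101, 102, 10, 11, 12, 1, 2, 3);
my @descending = sort { $b <=> $a } @numbers;
print "@descending\n";          # 102 101 100 12 11 10 3 2 1
```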

Splice


program 22
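# splice_example.pl
@words = ('The', 'quick', 'brown', 'fox', 'jumped');
# remove 2 elements starting at index 1 and put 'happy', 'red' in their place;
# splice returns the elements it removed
@spliced = splice(@words, 1, 2, 'happy', 'red');
print "spliced words = @spliced\n";   # quick brown
print "        words = @words\n";     # The happy red fox jumped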
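
The arguments to splice are the array, the starting index, the number of elements to remove, and an optional list of replacements. As a sketch of two further uses: leaving out the replacement list simply deletes elements, and a length of 0 inserts without deleting. The word 'lazy' and the name @removed are just example choices.

```perl
use strict;
use warnings;

my @words = ('The', 'quick', 'brown', 'fox', 'jumped');

# Remove 2 elements starting at index 1, with no replacement.
my @removed = splice(@words, 1, 2);
print "removed = @removed\n";       # removed = quick brown
print "words   = @words\n";         # words   = The fox jumped

# A length of 0 removes nothing, so this only inserts at index 1.
splice(@words, 1, 0, 'lazy');
print "words   = @words\n";         # words   = The lazy fox jumped
```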

Join

program 23
# join_example.pl
@words = ('The', 'quick', 'brown', 'fox', 'jumped');
print join("+", @words), "\n";   # The+quick+brown+fox+jumped

Split

program 24
# split_example.pl
$sentence = "The+++quick+++brown+++fox+++jumped";
@words = split(/\+\+\+/, $sentence);
print "@words \n";                 # The quick brown fox jumped

Often we want to break up a sentence separated by spaces into an array of words:


program 25
# split_example2.pl
$sentence = "The    quick    brown   fox    jumped";
@words = split(/ /, $sentence);
print "word0 = '$words[0]'\n";      # 'The'
print "word1 = '$words[1]'\n";      # ''
print "word2 = '$words[2]'\n";      # ''
print "word3 = '$words[3]'\n";      # ''
print "word4 = '$words[4]'\n";      # 'quick'

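What has happened here is that the split function is splitting on each individual space character, so every run of spaces produces empty strings. To remedy this, use the string ' ' (a single space in quotes), which split treats as a special case meaning "split on any run of whitespace":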


program 26
# split_example3.pl
$sentence = "The    quick    brown   fox    jumped";
@words = split(' ', $sentence);
print "word0 = '$words[0]'\n";      # 'The'
print "word1 = '$words[1]'\n";      # 'quick'
print "word2 = '$words[2]'\n";      # 'brown'
print "word3 = '$words[3]'\n";      # 'fox'
print "word4 = '$words[4]'\n";      # 'jumped'
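
The ' ' special case differs from the pattern /\s+/ in one respect worth knowing: it also skips whitespace at the start of the string. A small sketch, using a made-up sentence with leading spaces:

```perl
use strict;
use warnings;

my $sentence = "   The  quick brown";

# The pattern /\s+/ treats the leading spaces as a separator,
# so the first field is an empty string.
my @by_pattern = split(/\s+/, $sentence);
print scalar(@by_pattern), " fields, first is '$by_pattern[0]'\n";   # 4 fields, first is ''

# The special case ' ' ignores leading whitespace.
my @by_space = split(' ', $sentence);
print scalar(@by_space), " fields, first is '$by_space[0]'\n";       # 3 fields, first is 'The'
```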

Specifying an empty pattern will break a string into individual characters:


program 27
# split_example4.pl
$alphabet = "ABCDEF";
@words = split(//, $alphabet); # @words = ('A', 'B', 'C', 'D', 'E', 'F')

The qw operator

This is an operator, not a function.

It is used purely for convenience when specifying a list of words.


program 28
# using_qw.pl
@words = qw(The quick brown fox     jumped);
printf "The number of words is = %d\n", scalar(@words);   # 5
print "@words\n";                       # The quick brown fox jumped

Hashes

Hashes (also known as hash tables, dictionaries or associative arrays) are a common data structure in programming. They are built into the Perl language.

What is a hash?

With an array, you index values by a numeric index. With hashes, you can use a symbolic index. (Think of a telephone directory for hashes, and a row of numbered houses for arrays.)

The symbolic index is known as the key.

The result is known as the value.

A hash is a mapping from a set of keys to values.


program 29
# re_hash.pl

# initialize the lookup table
%re_lookup = (
    'Eco47III' => 'AGCGCT',
    'EcoNI'    => 'CCTNNNNNAGG',
    'EcoRI'    => 'GAATTC',
    'EcoRII'   => 'CCWGG',
    'HincII'   => 'GTYRAC',
    'HindII'   => 'GTYRAC',
    'HindIII'  => 'AAGCTT',
    'HinfI'    => 'GANTC'
);

print "Enter restriction enzyme name\n";
$re = <STDIN>;
chomp $re;

$seq = $re_lookup{$re};
if (defined($seq)) {
    print "RE sequence for $re is: $seq\n";
}
else {
    print "Sorry, I don't know about \"$re\"";
}


The symbol to indicate a hash table is %

Hashes can be specified in a similar way to arrays; use the parentheses () to construct a hash.

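The construct for looking up a value in a hash is:

$value = $hashvariable{key}

As with array elements, we use $ because the value we are accessing is a scalar; the curly brackets {} show that we are indexing a hash rather than an array.

When constructing a hash, we use => between each key and its value.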
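
A minimal sketch of the lookup construct on a cut-down version of the table. It also shows exists, which tests whether a key is present at all, and that => behaves like a comma (so a hash is really built from a flat list of key, value, key, value). The enzyme BamHI is used only as an example of a key that is not in the table, and %same is an invented name.

```perl
use strict;
use warnings;

my %re_lookup = ('EcoRI' => 'GAATTC', 'HindIII' => 'AAGCTT');

# Look up a value by its key.
my $seq = $re_lookup{'EcoRI'};
print "EcoRI cuts at $seq\n";               # EcoRI cuts at GAATTC

# exists tells us whether a key is in the hash.
if (exists $re_lookup{'BamHI'}) {
    print "BamHI is known\n";
}
else {
    print "no entry for BamHI\n";
}

# The same hash written with plain commas instead of =>.
my %same = ('EcoRI', 'GAATTC', 'HindIII', 'AAGCTT');
print "same value\n" if $same{'HindIII'} eq $re_lookup{'HindIII'};
```

The => form is usually preferred because it makes the key and value pairs easy to see.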

The keys and values functions

The keys function takes a hash as argument and returns a list of keys in that hash

The values function takes a hash as argument and returns a list of values in that hash

program 30
# keys_example.pl

# create a lookup table of GenBank accessions
# keyed by Clone ID
%accession_hash = (
    "BACR01A01" => "AC005555",
    "BACR48E02" => "AC005577",
    "BACR24K17" => "AC005101",
);

# get all the keys in the hash (hash is keyed by clone ID)
@clones = keys %accession_hash;
print "Clone IDs: @clones\n";
# prints BACR01A01 BACR48E02 BACR24K17 (in no particular order)

# get all the values in the hash (hash is a lookup for accessions)
@accs = values %accession_hash;
print "GenBank Accessions: @accs\n";
# prints AC005555 AC005577 AC005101 (in the same order as the keys)

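
Because keys come back in no particular order, a common approach is to sort them before looping over the hash. A sketch using the same table (the names @sorted_clones and $size are invented here):

```perl
use strict;
use warnings;

my %accession_hash = (
    "BACR01A01" => "AC005555",
    "BACR48E02" => "AC005577",
    "BACR24K17" => "AC005101",
);

# Sort the keys so the output order is predictable.
my @sorted_clones = sort keys %accession_hash;
foreach my $clone (@sorted_clones) {
    print "$clone => $accession_hash{$clone}\n";
}
# BACR01A01 => AC005555
# BACR24K17 => AC005101
# BACR48E02 => AC005577

# Assigning keys to a scalar gives the number of entries.
my $size = keys %accession_hash;
print "entries: $size\n";               # entries: 3
```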

Reverse on hashes

You can use the reverse function to reverse a hash. Unlike with arrays, this has nothing to do with order; hashes are inherently unordered.

Instead, reverse swaps the roles of keys and values: each value becomes a key, and each key becomes its value.

program 31
%re_lookup_by_seq = reverse %re_lookup;
print "I recognise GAATTC as being $re_lookup_by_seq{GAATTC}\n";
# the above should print EcoRI

For this to work, there must be a one to one mapping between keys and values. If two keys share the same value, only one of them survives in the reversed hash.
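
The lookup table in re_hash.pl is in fact not one to one: HincII and HindII both map to GTYRAC. A small sketch of what reverse does in that case, using three of the entries (the name %by_seq is invented):

```perl
use strict;
use warnings;

my %re_lookup = (
    'HincII' => 'GTYRAC',
    'HindII' => 'GTYRAC',
    'EcoRI'  => 'GAATTC',
);

my %by_seq = reverse %re_lookup;

print scalar(keys %re_lookup), " enzymes\n";      # 3 enzymes
print scalar(keys %by_seq), " sequences\n";       # 2 sequences

# One of the two enzymes has been lost; which one survives
# depends on the internal order of the hash.
print "GTYRAC is recognised as $by_seq{'GTYRAC'}\n";
```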

Removing elements from a hash

program 32
delete $re_lookup{"EcoRI"};
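
A short sketch of delete in context, on a two-entry table. delete returns the value it removed, and exists can be used afterwards to confirm that the key has gone. The name $removed is invented for the example.

```perl
use strict;
use warnings;

my %re_lookup = ('EcoRI' => 'GAATTC', 'HinfI' => 'GANTC');

# delete removes the key and its value, and returns the value.
my $removed = delete $re_lookup{'EcoRI'};
print "removed $removed\n";                         # removed GAATTC

# The key is no longer in the hash.
print "EcoRI is gone\n" unless exists $re_lookup{'EcoRI'};

my $left = keys %re_lookup;
print "entries left: $left\n";                      # entries left: 1
```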

We can also set the value in a hash:


program 33
# translate1.pl

%translate = ();    # initialize the hash
$translate{'atg'} = 'M';
$translate{'taa'} = '*';
$translate{'ctt'} = 'L';    # ctt codes for leucine (L)
print $translate{'atg'};    # prints M

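
Assigning to a key that is not yet in the hash creates the entry; assigning to an existing key replaces its value. One common use of this, sketched here with a made-up DNA string, is counting how often each item occurs. The names %count and $base are chosen for the illustration.

```perl
use strict;
use warnings;

my %count = ();

# split(//, ...) gives the individual characters of the string.
foreach my $base (split(//, 'AACGTTTA')) {
    $count{$base}++;        # the entry is created the first time a base is seen
}

# Setting a value directly, for a key that was not there before.
$count{'N'} = 0;

foreach my $base (sort keys %count) {
    print "$base: $count{$base}\n";
}
# A: 3
# C: 1
# G: 1
# N: 0
# T: 3
```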

Exercises:

Deitel exercises 4.5, 4.6

Write a program that sorts DNA sequences by length. The output should be one sequence per line, like this:

% perl sort_by_seqsize.pl AAA TCCAAAGGGT  ATTGG
Sorted seqs:
AAA
ATTGG
TCCAAAGGGT


you may want to use the *length* function; see Deitel p290

Write an English to Portuguese translator, or a program that translates between any two languages. Keep the vocabulary down to ten words or so, and forget about grammar/context altogether.

Extend it so that you can go either way.

EITHER

Deitel 4.7 (do two kinds of shuffling; the kind that is mentioned in the book, and the shuffle you would get from cutting the cards)

OR

Write a molecular evolution simulation program(!) Do it for only one generation. (Choose artificially exaggerated probabilities for testing).

1. Define a parameter: chance of single point mutation

This should either be hardcoded or come from the user

% perl recomb.pl 0.5 AAAAATTTTTTT
after one generation:
AAAAATGTTTTT

Assume that, when a point mutation occurs, each base is equally likely to be the outcome.

2. Add other mutation/recombination events; feel free to be biologically unrealistic in order to make a fun simulation.

If you've made it this far you're doing extremely well. The rest of the exercises are optional.

Write a mini-medline system. Create a hash of journal article titles by medline ID. Just populate it with 3 or so entries, you can make them up if you like, e.g.


program 34
1000 => "Made up article title",

1. Write a program to allow people to look up journal article titles by ID

2. Extend the program to allow people to get the medline ID if they give the exact title. You should ask the user the question : search by ID/title?

3. (Extra) Extend the "database" (e.g. add other hashtables) such that titles can be looked up by author and/or journal name. What happens when an author has written more than one article? Discuss the limitations of the system.

Deitel


Chris Mungall cjm@fruitfly.org
Berkeley Drosophila Genome Project