PROTDIST

version 3.5c PROTDIST -- Program to compute distance matrix from protein sequences

(c) Copyright 1993 by Joseph Felsenstein.  Permission is granted to  copy  this
document  provided that no fee is charged for it and that this copyright notice
is not removed.

     This program uses protein sequences to compute a  distance  matrix,  under
three  different  models of amino acid replacement.  The distance for each pair
of species estimates the total branch length between the two species,  and  can
be  used in the distance matrix programs FITCH, KITSCH or NEIGHBOR.  This is an
alternative to use of  the  sequence  data  itself  in  the  parsimony  program
PROTPARS.

     The  program  reads  in  protein  sequences  and  writes  an  output  file
containing  the  distance  matrix.  The three models of amino acid substitution
are one which is based on the PAM matrixes of  Margaret  Dayhoff,  one  due  to
Kimura  (1983)  which  approximates  it based simply on the fraction of similar
amino acids, and one based on a model in which the amino acids are  divided  up
into  groups,  with change occurring based on the genetic code but with greater
difficulty of changing  between  groups.   The  program  correctly  takes  into
account a variety of sequence ambiguities.

     The three methods are:

(1) The Dayhoff PAM matrix.  This uses Dayhoff's PAM 001  matrix  from  Dayhoff
(1979),  page 348.  The PAM model is an empirical one that scales probabilities
of change from one amino acid to another  in  terms  of  a  unit  which  is  an
expected  1%  change  between  two amino acid sequences.  The PAM 001 matrix is
used to make a transition probability matrix which  allows  prediction  of  the
probability of changing from any one amino acid to any other, and also predicts
equilibrium  amino  acid  composition.   The   program   assumes   that   these
probabilities  are correct and bases its computations of distance on them.  The
distance that is computed is scaled in units  of  expected  fraction  of  amino
acids changed.

(2)  Kimura's  distance.   This  is  a  rough-and-ready  distance  formula  for
approximating  PAM distance by simply measuring the fraction of amino acids, p,
that differs between two sequences and computing the distance as (Kimura, 1983)

                                2
  D   =   - log  ( 1 - p - 0.2 p  ).
               e

This is very quick to do but has some obvious limitations.  It  does  not  take
into  account  which  amino acids differ or to what amino acids they change, so
some information is lost.  The units of the distance measure  are  fraction  of
amino  acids  differing,  as  also  in  the  case  of the PAM distance.  If the
fraction of amino acids differing gets larger than 0.8541 the distance  becomes
infinite.

(3) The Categories  distance.   This  is  my  own  concoction.   I  imagined  a
nucleotide  sequence changing according to Kimura's 2-parameter model, with the
exception that some changes of amino acids are less likely  than  others.   The
amino acids are grouped into a series of categories.  Any base change that does
not change which category the amino acid is in is allowed, but if an amino acid
changes  category  this  is  allowed  only a certain fraction of the time.  The
fraction is called the "ease" and there is a parameter for  it,  which  is  1.0
when  all  changes are allowed and near 0.0 when changes between categories are



nearly impossible.

In this option I have allowed the user to  select  the  Transition/Transversion
ratio, which of several genetic codes to use, and which categorization of amino
acids to use.  There are three of them, a somewhat random sample:

   (a) The George-Hunt-Barker (1988) classification of amino acids,

   (b) A classification provided by my colleague Ben Hall when I asked him  for
   one,

   (c) One I found in an old "baby biochemistry" book (Conn and Stumpf,  1963),
   which  contains  most  of the biochemistry I was ever taught, and all that I
   ever learned.
Interestingly enough, all of them are consisten with the same, linear, ordering
of  amino  acids,  which  they divide up in different ways.  For the Categories
model I have set as default  the  George/Hunt/Barker  classification  with  the
"ease"  parameter  set to 0.457 which is approximately the value implied by the
empirical rates in the Dayhoff PAM matrix.

     The method uses, as I have noted, Kimura's (1980) 2-parameter model of DNA
change.   The  Kimura  "2-parameter"  model  allows  for  a  difference between
transition and transversion rates.  Its transition  probability  matrix  for  a
short interval of time is:

              To:     A        G        C        T
                   ---------------------------------
               A  | 1-a-2b     a         b       b
       From:   G  |   a      1-a-2b      b       b
               C  |   b        b       1-a-2b    a
               T  |   b        b         a     1-a-2b

where a is u dt, the product of the rate of transitions per unit time and dt is
the length dt of the time interval, and b is v dt, the product of half the rate
of transversions (i.e., the rate of a specific transversion) and the length  dt
of the time interval.

     Each distance that is calculated is an estimate, from that particular pair
of  species,  of  the  divergence  time  between those two species.  The Kimura
distance is straightforward to compute.  The other two are considerably slower,
and  they  look  at  all  positions,  and  find  that  distance which makes the
likelihood highest.  This likelihood is in effect the length  of  the  internal
branch  in  a two-species tree that connects these two species.  Its likelihood
is just the product, under the model, of the  probabilities  of  each  position
having  the  (one  or) two amino acids that are actually found.  This is fairly
slow to compute.

     The computation proceeds from an eigenanalysis (spectral decomposition) of
the  transition  probability  matrix.   In  the  case of the PAM 001 matrix the
eigenvalues and eigenvectors  are  precomputed  and  are  hard-coded  into  the
program  in  over  400  statements.   In  the  case of the Categories model the
program computes the eigenvalues and eigenvectors  itself,  which  will  add  a
delay.   But  the  delay  is  independent  of  the  number  of  species  as the
calculation is done only once, at the outset.

     The actual algorithm for estimating  the  distance  is  in  both  cases  a
bisection  algorithm  which  tries to find the point at which the derivative os
the likelihood is zero.  Some of the kinds of ambiguous amino acids like  "glx"
are  correctly  taken  into  account.  However, gaps are treated as if they are
unkown  nucleotides,  which  means  those  positions  get  dropped  from   that
particular  comparison.  However, they are not dropped from the whole analysis.



You need not eliminate regions containing gaps, as long as you  are  reasonably
sure of the alignment there.

     Note that there is an assumption that we are  looking  at  all  positions,
including  those that have not changed at all.  It is important not to restrict
attention to some positions based on whether or not they  have  changed;  doing
that  would bias the distances by making them too large, and that in turn would
cause the distances to misinterpret the meaning of  those  positions  that  had
changed.


                           INPUT FORMAT AND OPTIONS

     Input is fairly standard, with one addition.  As usual the first  line  of
the  file  gives  the number of species and the number of sites.  There follows
the character W if the Weights option is being used.

     Next come the species data.  Each sequence starts on a  new  line,  has  a
ten-character  species  name  that  must  be blank-filled to be of that length,
followed immediately by the species data in the one-letter code.  The sequences
must  either  be  in the "interleaved" or "sequential" formats described in the
Molecular Sequence Programs document.  The I option selects between them.   The
sequences  can  have internal blanks in the sequence but there must be no extra
blanks at the end of the terminated line.  Note that a blank  is  not  a  valid
symbol for a deletion.

     After that are the lines (if any) containing the  information  for  the  W
option, as described below.

     The options are selected using an interactive menu.  The menu  looks  like
this:


Protein distance algorithm, version 3.5c

Settings for this run:
  P  Use PAM, Kimura or categories model?  Dayhoff PAM matrix
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes

Are these settings correct? (type Y or the letter for one to change)

The user either types "Y" (followed, of course, by a  carriage-return)  if  the
settings  shown  are to be accepted, or the letter or digit corresponding to an
option that is to be changed.

     The options M and 0 are the usual ones.  They are described  in  the  main
documentation file of this package.  Option I is the same as in other molecular
sequence programs and is described in the documentation file for  the  sequence
programs.

     The P option selects one of the three distance methods.  It toggles  among
the three methods. The default method, if none is specified, is the Dayhoff PAM
matrix model.  If the Categories distance is selected another menu  option,  T,
will  appear allowing the user to supply the Transition/Transversion ratio that
should be assumed at the underlying DNA level, and another one, C, which allows
the  user to select among various nuclear and mitochondrial genetic codes.i The
transition/transversion ratio can be any number from 0.5 upwards.



     The W (Weights) option is invoked in the usual way, with  only  weights  0
and  1 allowed.  It selects a set of sites to be analyzed, ignoring the others.
The sites selected are those with weight 1.  If the W option  is  not  invoked,
all sites are analyzed.


                                 OUTPUT FORMAT

     As the distances are computed,  the  program  prints  on  your  screen  or
terminal  the  names of the species in turn, followed by one dot (".") for each
other species for which the distance to that species has been  computed.   Thus
if  there  are  ten species, the first species name is printed out, followed by
one dot, then on the next line the next species name is printed out followed by
two dots, then the next followed by three dots, and so on.  The pattern of dots
should form a triangle.  When the distance matrix has been written out  to  the
output file, the user is notified of that.

     The output file contains on its first line  the  number  of  species.  The
distance matrix is then printed in standard form, with each species starting on
a new line with the species name, followed by the distances to the  species  in
order.   These  continue  onto  a  new  line  after  every nine distances.  The
distance matrix is square with zero distances on the diagonal.  In general  the
format  of the distance matrix is such that it can serve as input to any of the
distance matrix programs.

     If the option to print out the data is  selected,  the  output  file  will
precede  the  data  by  more  complete  information  on  the input and the menu
selections.  The output file begins by giving the number  of  species  and  the
number  of  characters,  and the identity of the distance measure that is being
used.

     In the Categories model of substitution, the  distances  printed  out  are
scaled in terms of expected numbers of substitutions, counting both transitions
and transversions but not replacements of a base by itself, and scaled so  that
the  average  rate  of  change  is  set to 1.0.  For the Dayhoff PAM and Kimura
models the distance are scaled in terms of the expected numbers of  amino  acid
substitutions  per  site.   Of course, when a branch is twice as long this does
not mean that there will be twice as much net change expected along  it,  since
some of the changes may occur in the same site and overlie or even reverse each
other.  The branch  lengths  estimates  here  are  in  terms  of  the  expected
underlying  numbers  of changes.  That means that a branch of length 0.26 is 26
times as long as one which would show a 1% difference between the  protein  (or
nucleotide) sequences at the beginning and end of the branch.  But we would not
expect the sequences at  the  beginning  and  end  of  the  branch  to  be  26%
different, as there would be some overlaying of changes.

     One problem that can arise is that two or more of the species  can  be  so
dissimilar  that  the  distance  between them would have to be infinite, as the
likelihood rises indefinitely as the estimated divergence time increases.   For
example,  with  the Kimura model, if the two sequences differ in 85.41% or more
of their positions then the estimate of  divergence  time  would  be  infinite.
Since there is no way to represent an infinite distance in the output file, the
program regards this as an error, issues a  warning  message  indicating  which
pair of species are causing the problem, and computes a distance of -1.0.










                               PROGRAM CONSTANTS

     The constants that are  available  to  be  changed  by  the  user  at  the
beginning  of the program include The other constants include "namelength", the
length of species  names  in  characters,  and  "epsilon",  a  parameter  which
controls  the  accuracy  of  the  results  of the iterations which estimate the
distances.  Making "epsilon" smaller will increase run times but result in more
decimal places of accuracy.  This should not be necessary.

     The program spends most of its time doing real arithmetic.   Any  software
or  hardware changes that speed up that arithmetic will speed it up by a nearly
proportional amount.  For example,  microcomputers  that  have  a  numeric  co-
processor  (such  as  an 8087, 80287, or 80387 chip) will run this program much
faster than ones that do not, if the software calls it.

--------------------------------TEST DATA SET--------------------------

   5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC

------ CONTENTS OF OUTPUT FILE (with all numerical options on ) -----------
Name            Sequences
----            ---------

Alpha        AACGTGGCCA CAT
Beta         ..G..C.... ..C
Gamma        C.GT.C.... ..A
Delta        G.GA.TT..G .C.
Epsilon      G.GA.CT..G .CC



    5
Alpha          0.00000  0.47285  0.88304  1.29841  2.12269
Beta           0.47285  0.00000  0.45192  1.34185  0.84009
Gamma          0.88304  0.45192  0.00000  1.30693  1.21582
Delta          1.29841  1.34185  1.30693  0.00000  0.27536
Epsilon        2.12269  0.84009  1.21582  0.27536  0.00000