RESTML

version 3.5c RESTML -- Restriction sites Maximum Likelihood program

(c) Copyright  1986-1993  by  Joseph  Felsenstein  and  by  the  University  of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy this
document provided that no fee is charged for it and that this copyright  notice
is not removed.

     This program implements a maximum likelihood method for restriction  sites
data  (not  restriction  fragment  data).   This program is perhaps the slowest
program in this package, and can be  very  tedious  to  run.   Although  it  is
possible  to  have  the program search for the maximum likelihood tree, it will
probably only be practical for most users (those that do not have  workstation-
class  machines)  to  use  the U (User Tree) option, which takes less run time,
optimizing  branch  lengths  and  computing  likelihoods  for  particular  tree
topologies suggested by the user.  The model used here is essentially identical
to that used by  Smouse  and  Li  (1987)  who  give  explicit  expressions  for
computing  the  likelihood  for  three-species  trees.  It does not place prior
probabilities on trees as they do.  The present program extends their  approach
to  multiple  species  by  a  technique  which, while it does not give explicit
expressions for likelihoods, does enable their computation  and  the  iterative
improvement  of  branch  lengths.   It  also  allows  for  multiple restriction
enzymes.  The algorithm is described in a recent  paper  (Felsenstein,  1992b).
Another relevant paper is that of DeBry and Slade (1985).

     The assumptions of the present model are:

     1. Each restriction site evolves independently.

     2. Different lineages evolve independently.

     3. Each site undergoes substitution at an expected rate which we specify.

     4. Substitutions consist of replacement of a  nucleotide  by  one  of  the
other three nucleotides, chosen at random.

     Note that if the existing base is, say, an  A,  the  chance  of  it  being
replaced  by a G is 1/3, and so is the chance that it is replaced by a T.  This
means that there can be no difference in the (expected) rate of transitions and
transversions.   Users  who  are  upset  at  this  might ponder the fact that a
version allowing different rates of transitions and transversions would run  an
estimated  16  times slower.  If it also allowed for unequal frequencies of the
four bases, it would run about 300,000 times slower!  For the moment,  until  a
better method is available, I guess I'll stick with this one!


                           INPUT FORMAT AND OPTIONS

     Subject to these assumptions, the  program  is  an  approximately  correct
maximum  likelihood  method.   The input is fairly standard, with one addition.
As usual the first line of the file gives the number of species and the  number
of  sites,  but  there is also a third number, which is the number of different
restriction enzymes that were used to detect the  restriction  sites.   Thus  a
data  set with 10 species and 35 different sites, representing digestion with 4
different enzymes, would have the first line of the data file look like this:

   10   35    4

The first line of the data file will also contain a letter  W  following  these



numbers  (and  separated  from  them by a space) if the Weights option is being
used.  As with all programs using the weights option, a line or lines must then
follow, before the data, with the weights for each site.

     The site data are in standard form.  Each species starts  with  a  species
name  whose  maximum  length is given by the constant "nmlngth" (whose value in
the program as distributed is 10 characters).  The name should,  as  usual,  be
padded  out  to  that  length  with  blanks  if necessary.  The sites data then
follows, one character per site (any blanks will be skipped and ignored).  Like
the  DNA and protein sequence data, the restriction sites data may be either in
the "interleaved" form  or  the  "sequential"  form.   Note  that  if  you  are
analyzing  restriction  sites  data  with  the  programs DOLLOP or MIX or other
discrete character programs, at the  moment  those  programs  do  not  use  the
"aligned"  or  "interleaved" data format.  Therefore you may want to avoid that
format when you have restriction sites data that you will  want  to  feed  into
those programs.

     The presence of a site is indicated by a "+" and the absence by a "-".   I
have  also  allowed  the  use  of  "1" and "0" as synonyms for "+" and "-", for
compatibility with MIX and DOLLOP which do not  allow  "+"  and  "-".   If  the
presence of the site is unknown (for example, if the DNA containing it has been
deleted so that one does not know whether it would  have  contained  the  site)
then  the  state  "?"  can  be  used to indicate that the state of this site is
unknown.

     User-defined trees may follow the data in the usual way.  The  trees  must
be unrooted, which means that at their base they must have a trifurcation.

     The options are selected by a menu, which looks like this:



Restriction site Maximum Likelihood method, version 3.5c

Settings for this run:
  U                 Search for best tree?  Yes
  A               Are all sites detected?  No
  G                Global rearrangements?  No
  J   Randomize input order of sequences?  No. Use input order
  L                          Site length?  6
  O                        Outgroup root?  No, use as outgroup species  1
  E                  Extrapolation factor  100.0
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)


     The U, J, O, M, and 0 options are the usual ones, described  in  the  main
documentation  file.   The  I option selects between Interleaved and Sequential
input data formats,  and  is  described  in  the  documentation  file  for  the
molecular sequences programs.

     The G (global search) option causes, after the last species  is  added  to
the  tree,  each  possible group to be removed and re-added.  This improves the
result, since the position of every species is reconsidered.  It  approximately



triples the run-time of the program.

     The three options specific to this program are the A, L,  and  E  options.
The  L  (Length)  option  allows the user to specify the length in bases of the
restriction sites.  Allowed  values  are  1  to  8  (the  constant  "maxcutter"
controls  the  maximum  allowed value).  At the moment the program assumes that
all sites have the same length (for  example,  that  all  enzymes  are  6-base-
cutters).  The default value for this parameter is 6, which will be used if the
L option is not invoked.  A desirable future development for the package  would
be  allowing  the L parameter to be different for every site.  It would also be
desirable to allow for ambiguities in the recognition site, since some  enzymes
recognize  2  or  4  sequences.  Both of these would require fairly complicated
programming or else slower execution times.

     The A (All) option specifies that all sites are detected, even  those  for
which  all of the species have the recognition sequence absent (character state
"-").  The default condition is that it is assumed that  such  sites  will  not
occur  in  the  data.  The likelihood computed when the A option is not used is
the probability of the pattern of sites given that tree and conditional on  the
pattern  not  being all absences.  This will be realistic for most data, except
for cases in which the data are extracted from sites data for a  larger  number
of species, in which case some of the site positions could have all absences in
the subset of species.  In such cases an effective way of  analyzing  the  data
would  be to omit those sites and not use the A option, as such positions, even
if not absolutely excluded, are nevertheless less likely than  random  to  have
been incorporated in the data set.

     The E option allows the user to reset the  extrapolation  factor  used  in
iterating  branch  lengths.  This is initially 100.  You may want to drop it to
10 or raise it to 1000.  You can test  whether  that  improves  the  result  by
comparing  the resulting likelihoods.  In particular, if too many of the branch
lengths on the tree are zero  or  nearly  zero,  this  may  indicate  that  the
extrapolation factor is too large.

     The W (Weights) option, which is invoked in the input file rather than  in
the  menu,  allows  the user to select a subset of sites to be analyzed.  It is
invoked in the usual way, except that only weights 0 and 1 are allowed.  If the
W  option  is  not  used, all sites will be analyzed.  If the Weights option is
used, there must be a W in the first line of the input file.


                                 OUTPUT FORMAT

     The output starts by giving the number  of  species,  and  the  number  of
sites.   If  the  default condition is used instead of the A option the program
states that it is assuming that sites absent in all species have been  omitted.
The value of the site length (6 bases, for example) is also given.

     If option 2 (print out data) has been  selected,  there  then  follow  the
restriction  site  sequences,  printed in groups of ten sites.  The trees found
are printed as an unrooted tree topology (possibly rooted  by  outgroup  if  so
requested).   The  internal  nodes  are  numbered  arbitrarily  for the sake of
identification.  The number of trees evaluated so far and the log likelihood of
the tree are also given.

     A table is printed showing the length of each tree  segment,  as  well  as
(very)  rough  confidence limits on the length.  As with DNAML, if a confidence
limit is negative, this indicates that rearrangement of the tree in that region
is  not excluded, while if both limits are positive, rearrangement is still not
necessarily excluded because the variance calculation on which  the  confidence
limits are based results in an underestimate, which makes the confidence limits



too narrow.

     In addition to  the  confidence  limits,  the  program  performs  a  crude
Likelihood  Ratio Test (LRT) for each branch of the tree.  The program computes
the ratio of likelihoods with and without this branch  length  forced  to  zero
length.   This  done  by  comparing  the  likelihoods changing only that branch
length.  A truly correct LRT would force that branch length to  zero  and  also
allow  the  other  branch  lengths  to  adjust  to that.  The result would be a
likelihood ratio closer to 1.  Therefore the present LRT will err on  the  side
of being too significant.

     One should also realize that if you are looking not at a previously-chosen
branch  but at all branches, that you are seeing the results of multiple tests.
With 20 tests, one is expected to reach significance  at  the  P  =  .05  level
purely   by  chance.   You  should  therefore  use  a  much  more  conservative
significance  level,  such  as  .05  divided  by  the  number  of  tests.   The
significance  of  these  tests  is  shown  by  printing  asterisks  next to the
confidence interval on each branch length.  It is important  to  keep  in  mind
that  both  the confidence limits and the tests are very rough and approximate,
and probably  indicate  more  significance  than  they  should.   Nevertheless,
maximum  likelihood  is one of the few methods that can give you any indication
of its own error; most other methods simply fail to warn the user that there is
any  error!   (In  fact, whole philosophical schools of taxonomists exist whose
main point seems to be that there isn't any error, that the "most parsimonious"
tree is the best tree by definition and that's that).

     The log likelihood printed out with the final tree can be used to  perform
various  likelihood  ratio  tests.   Remember  that  testing  one tree topology
against another is not a simple matter, because two different  tree  topologies
are  not  hypotheses that are nested one within the other.  If the trees differ
by only one branch swap, it seems to be conservative  to  test  the  difference
between  their  likelihoods  with  one  degree  of freedom, but other than that
little is known and more work on this is needed.

     If the U (User Tree) option is used and more than one  tree  is  supplied,
the program also performs a statistical test of each of these trees against the
one with highest likelihood.  This test, invented Kishino and  Hasegawa  (1989)
uses  the  mean and variance of log-likelihood differences between trees, taken
across sites.  If the mean is more than 1.96 standard deviations different then
the  trees  are  declared  significantly  different.  This use of the empirical
variance of log-likelihood differences is more robust  and  nonparametric  than
the  classical likelihood ratio test, and may to some extent compensate for the
any lack of realism in the model underlying this program.   The program  prints
out  a  table of the log-likelihoods of each tree, the differences of each from
the highest one, the variance of  that  quantity  as  determined  by  the  log-
likelihood differences at individual sites, and a conclusion as to whether that
tree is or is not significantly worse than the best one.   The  maximum  number
of  user trees that can be analyzed is given by the constant "maxtrees" (set to
10 in the distribution copy to save storage space).

     The branch lengths printed out are scaled in terms of expected numbers  of
base  substitutions, not counting replacements of a base by itself.  Of course,
when a branch is twice as long this does not mean that there will be  twice  as
much  net change expected along it, since some of the changes occur in the same
site and overlie  or even reverse each other.  Confidence limits on the  branch
lengths  are  also  given.   Of course a negative value of the branch length is
meaningless, and a confidence limit overlapping  zero  simply  means  that  the
branch length is not necessarily significantly different from zero.  Because of
limitations of the numerical algorithm, branch length estimates  of  zero  will
often  print  out as small numbers such as 0.00001.  If you see a branch length
that small, it is really estimated to be of zero length.



     Another possible source of confusion is the existence of  negative  values
for  the  log  likelihood.  This is not really a problem; the log likelihood is
not a probability but the logarithm of a probability, and  since  probabilities
never exceed 1.0 this logarithm will typically be negative.  The log likelihood
is maximized by being made more positive: -30.23 is worse than -29.14.  The log
likelihood  will not always be negative since a combinatorial constant has been
left out of the expression for the likelihood.  This does not affect  the  tree
found or the likelihood ratios (or log likelihood differences) between trees.


                                 THE ALGORITHM

     The program uses an EM-algorithm to update one branch length at a time.  I
have  described  this method recently (Felsenstein, 1992).  The likelihood that
is being maximized is the same one used by Smouse and Li  (1987)  extended  for
multiple  species.   Especially when the A option is not used, the EM algorithm
is quite slow by itself.  I have therefore resorted to two ways of speeding  it
up.    The   first   involves  the  constant  "extrapol0".   This  involves  an
extrapolation.  For example, if the  EM  algorith  would  increase  the  branch
length  by  0.0001  in a single cycle, this change is multiplied by 100 (or the
value of extrapol) so that the change made would be 0.01.  This carries with it
the  risk of overshoot and moving down on the likelihood surface.  You may have
to "tune" the value of extrapol to suit your data.

     Even this change leaves the algorithm far too  slow.   I  have  therefore,
every  three  cycles  of  the  EM  iteration,  put  in  a  step  using Aitken's
acceleration method (Aitken's delta-squared  method  found  in  most  numerical
analysis  texts).   This  is  a  risky  method that can also go downhill on the
likelihood surface.  You could disable it by changing  the  value  of  constant
"iterations"  to  less than 4, but I think that you would then find the program
unacceptably slow.


                               PROGRAM CONSTANTS

     The constant "maxtrees" is the maximum number of user trees  that  can  be
processed.   It  is  small  (10) at present to save some further memory but the
cost of  increasing  it  is  not  very  great.   The  other  constants  include
"maxcutter," the maximum length of an enzyme recognition site.  The memory used
by the program will be approximately proportional to this value, which is 8  in
the  distribution  copy.   The  constant  "namelength" is the length of species
names in characters.   The  constants  "smoothings",  "iterations",  "epsilon",
"extrap0",  and  "initialv" help "tune" the algorithm and define the compromise
between execution speed  and  the  quality  of  the  branch  lengths  found  by
iteratively maximizing the likelihood.  Reducing "iterations" and "smoothings",
and decreasing "epsilon", will result in faster execution but a  worse  result.
These values will not usually have to be changed.

     The program spends most of its time doing real arithmetic.   Any  software
or  hardware changes that speed up that arithmetic will speed it up by a nearly
proportional amount.  For example, microcomputers having  a  math  co-processor
should  run it much faster, if the executable program calls it.  The algorithm,
with separate and independent computations occurring at each site, lends itself
readily to parallel processing.

     A feature of the algorithm is that it saves time by recognizing  sites  at
which  the  pattern  of presence/absence is the same, and does that computation
only once.  Thus if we have only four species but  a  large  number  of  sites,
there  are  only  about  (ignoring  ambiguous  bases)  16 different patterns of
presence/absence (2 x 2 x 2 x 2) that can  occur.   The  program  automatically
counts occurrences of each and does the computation for each pattern only once,



so that it only needs to do as much computation as would be needed with at most
16  sites,  even  though the number of sites is actually much larger.  Thus the
program will run very effectively with few species and many sites.


                        PAST AND FUTURE OF THE PROGRAM

     This program was developed by modifying DNAML version 3.1 and also  adding
some  of  the modifications that were added to DNAML version 3.2, with which it
shares many of its data structures and much of its strategy.

     There are a number of obvious directions in which the program needs to  be
modified  in  the future.  Extension to allow for different rates of transition
and  transversion  is  straightforward,  but  would  slow  down   the   program
considerably,  as  I  have mentioned above.  I have not included in the program
any provision for saving and printing  out  multiple  trees  tied  for  highest
likelihood, in part because an exact tie is unlikely.

     Given that I have had to do all my own  programming,  these  changes  will
take  place  gradually over future versions of PHYLIP.  Users who get impatient
for them are invited to discuss with me the possibility that  they  could  make
the  required  changes  themselves.   Of course I would particularly appreciate
hearing about any problems users have with this program.

----------------------------TEST DATA SET--------------------------

   5   13   2
Alpha     ++-+-++--+++-
Beta      ++++--+--+++-
Gamma     -+--+-++-+-++
Delta     ++-+----++---
Epsilon   ++++----++---

----- CONTENTS OF OUTPUT FILE (if all numerical options are on) -----

Restriction site Maximum Likelihood method, version 3.5c


  Recognition sequences all 6 bases long

Sites absent from all species are assumed to have been omitted


Name            Sites
----            -----

Alpha        ++-+-++--+ ++-
Beta         ++++--+--+ ++-
Gamma        -+--+-++-+ -++
Delta        ++-+----++ ---
Epsilon      ++++----++ ---





  +-Gamma
  !
  !     +Epsilon
  !  +--3
--1--2  +Delta



  !  !
  !  +Beta
  !
  +Alpha


remember: this is an unrooted tree!

Ln Likelihood =   -40.36177

Examined   15 trees

 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------
   1          Gamma           0.11490     (  0.11237,     0.11743) **
   1             2            0.00014     (  0.00006,     0.00022)
   2             3            0.05680     (  0.05508,     0.05852) **
   3          Epsilon         0.00010     (  0.00003,     0.00017)
   3          Delta           0.01517     (  0.01434,     0.01601) **
   2          Beta            0.00010     (  0.00003,     0.00017)
   1          Alpha           0.02470     (  0.02359,     0.02580) **

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01