DNA Maximum

version 3.5c

DNAML -- DNA Maximum Likelihood program
(c) Copyright  1986-1993  by  Joseph  Felsenstein  and  by  the  University  of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy this
document provided that no fee is charged for it and that this copyright  notice
is not removed.

     This program implements the maximum likelihood method for  DNA  sequences.
This  program is fairly slow, and can be expensive to run.  The present version
is, however, faster than earlier versions of DNAML.  Details of  the  algorithm
of  a  previous  version  of  DNAML  are  described  in  my paper in Journal of
Molecular Evolution (1981a).   For  some  uses  of  this  program  and  further
developments,   the   papers  of  Masami  Hasegawa  and  Hirohisa  Kishino  are
particularly instructive (Hasegawa and Yano, 1984a,  1984b;  Hasegawa  et.  al.
1985a, 1985b; Kishino and Hasegawa, 1989).  My 1981 algorithm is similar to the
present one, except that it did not allow for different  rates  of  transitions
and transversions and for different rates of evolution at different sites.  The
assumptions of the present model are:

     1.  Each site in the sequence evolves independently.

     2.  Different lineages evolve independently.

     3.  Each site undergoes substitution at an expected rate which  is  chosen
from  a  series  of  rates  (each  with  a  probability of occurrence) which we
specify.

     4.  All relevant sites are included in the sequence, not just  those  that
have changed or those that are "phylogenetically informative".

     5.  A substitution consists of one of two sorts of events:

           a.  The first kind of event  consists  of  the  replacement  of  the
              existing base by a base drawn from a pool of purines or a pool of
              pyrimidines (depending on whether the base being replaced  was  a
              purine or a pyrimidine).  It can lead either to no change or to a
              transition.

           b.  The second kind of event consists  of  the  replacement  of  the
              existing  base  by a base drawn at random from a pool of bases at
              known frequencies, independently of  the  identity  of  the  base
              which  is being replaced.  This could lead either to a no change,
              to a transition or to a transversion.

     The ratio of the two purines in the purine replacement pool is the same as
their ratio in the overall pool, and similarly for the pyrimidines.

     The ratios of transitions to transversions can be set by  the  user.   The
substitution  process  can be diagrammed as follows: Suppose that you specified
A, C, G, and T base frequencies of 0.24, 0.28, 0.27, and 0.21.











First kind of event:

   1.  Determine whether the existing base is a purine or a pyrimidine.

   2.  Draw from the proper pool:

      Purine pool:                Pyrimidine pool:

     I               I            I               I
     I   0.4706 A    I            I   0.5714 C    I
     I   0.5294 G    I            I   0.4286 T    I
     I (ratio is     I            I (ratio is     I
     I  0.24 : 0.27) I            I  0.28 : 0.21) I
     -----------------            -----------------


Second kind of event:

   Draw from the overall pool:

              I                  I
              I      0.24 A      I
              I      0.28 C      I
              I      0.27 G      I
              I      0.21 T      I
              I                  I
               ------------------

Note that if the existing base is, say, an A, the first kind  of  event  has  a
0.4706  probability  of  "replacing" it by another A.  The second kind of event
has a 0.24 chance of replacing it by  another  A.   This  rather  disconcerting
model  is used because it has nice mathematical properties that make likelihood
calculations far easier.  A closely similar, but not precisely identical  model
having  different  rates  of  transitions  and  transversions  has been used by
Hasegawa et. al. (1985b).  The transition probability formulas for the  current
model were given (with my permission) by Kishino and Hasegawa (1989).

     Note the assumption that we are looking at all sites, including those that
have  not  changed  at  all.  It is important not to restrict attention to some
sites based on whether or not they have changed; doing that would  bias  branch
lengths  by  making  them  too long, and that in turn would cause the method to
misinterpret the meaning of those sites that had changed.

     An important new development in the 3.5 release is the Hidden Markov Chain
method of inferring different rates of evolution at different sites.  This will
be described in a future paper by me and Gary Churchill, of Cornell University.
It  allowsus to specify to the program that there will be a number of different
possible evolutionary rates, what the prior probabilities of occurrence of each
is,  and  what the average length of a patch of sites all having the same rate.
The program then computes the  likelihood  by  summing  it  over  all  possible
assignments  of  rates  to  sites,  weighting  each by its prior probability of
occurrence.

     For example, if we have used the C and  R  options  (described  below)  to
specify  that  there are three possible rates of evolution,  1.0, 2.4, and 0.0,
that the prior probabilities of a site having these rates  are  0.4,  0.3,  and
0.3,  and  that  the average patch length (number of consecutive sites with the
same rate) is 2.0, the program will sum the likelihood over all  possibilities,
but  giving  less  weight  to those that (say) assign all sites to rate 2.4, or
that fail to have consecutive sites that have the same rate.




     This feature effectively removes the artificial assumption that all  sites
have  the  same  rate,  and  also  means  that  we need not know in advance the
identities of the sites that have a particular rate of evolution.


                           INPUT FORMAT AND OPTIONS

     Subject to these assumptions, the program is a correct maximum  likelihood
method.   The  input is fairly standard, with one addition.  As usual the first
line of the file gives the number of species and the number  of  sites.   There
follows the character W if the Weights options is being used.

     Next come the species data.  Each sequence starts on a  new  line,  has  a
ten-character  species  name  that  must  be blank-filled to be of that length,
followed immediately by the species data in the one-letter code.  The sequences
must  either  be  in the "interleaved" or "sequential" formats described in the
Molecular Sequence Programs document.  The I option selects between them.   The
sequences  can  have internal blanks in the sequence but there must be no extra
blanks at the end of the terminated line.  Note that a blank  is  not  a  valid
symbol for a deletion.

     After that are the lines (if any) containing the information  for  the  C,
and W options, as described below.

     The options are selected using an interactive menu.  The menu  looks  like
this:


Nucleic acid sequence Maximum Likelihood method, version 3.5c

Settings for this run:
  U                 Search for best tree?  Yes
  T        Transition/transversion ratio:  2.0000
  F       Use empirical base frequencies?  Yes
  C   One category of substitution rates?  Yes
  G                Global rearrangements?  No
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

The user either types "Y" (followed, of course, by a  carriage-return)  if  the
settings  shown  are to be accepted, or the letter or digit corresponding to an
option that is to be changed.

     The options U, J, O, M, and 0 are the usual ones.  They are  described  in
the  main documentation file of this package.  Option I is the same as in other
molecular sequence programs and is described in the documentation file for  the
sequence programs.

     The T option in this program does not stand for Threshold, but instead  is
the  Transition/transversion  option.   The  user is prompted for a real number
greater than 0.0, as the expected ratio of transitions to transversions.   Note
that  this is not the ratio of the first to the second kinds of events, but the



resulting  expected  ratio  of  transitions  to   transversions.    The   exact
relationship  between  these  two  quantities depends on the frequencies in the
base pools.  The default value of the T parameter if  you  do  not  use  the  T
option is 2.0.

     The F (Frequencies) option is one which may save users much time.  If  you
want  to  use  the  empirical  frequencies  of the bases, observed in the input
sequences, as the base frequencies, you simply use the default setting of the F
option.   These  empirical  frequencies  are  not really the maximum likelihood
estimates of the base frequencies, but they will often be close to those values
(what  they  are  is maximum likelihood estimates under a "star" or "explosion"
phylogeny).  If you change the setting of the F option you will be prompted for
the  frequencies of the four bases.  These must add to 1 and are to be typed on
one line separated by blanks, not commas.

     The C (categories) option allows the user to specify how  many  categories
of  substitution  rates there will be, and what are the rates and probabilities
for each.  The user first is asked how many categories there will be  (for  the
moment  there  is  an upper limit of 9, which should not be restrictive).  Then
the program asks for the  rates  for  each  category.   These  rates  are  only
meaningful  relative  to  each  other, so that rates 1.0, 2.0, and 2.4 have the
exact same effect as rates 2.0, 4.0, and 4.8.  Note that a  category  can  have
rate of change 0, so that this allows us to take into account that there may be
a category of sites that are invariant.  Note that the run time of the  program
will be proportional to the number of rate categories: twice as many categories
means twice as long a run.  Finally the program will ask for the  probabilities
of  a  random  site falling into each of these categories.  These probabilities
must be nonnegative and sum to 1.  Default for the  program  is  one  category,
with  rate  1.0  and probability 1.0 (actually the rate does not matter in that
case).

     If more than one category is specified, then another  option,  R,  becomes
visible  in  the  menu.   This allows us to specify that we want to assume that
sites that have the same rate category  are  expected  to  be  clustered.   The
program  asks  for  the value of the average patch length.  This is an expected
length of patches that have the same rate.  If it is 1, the rates of successive
sites  will be independent.  If it is, say, 10.25, then the chance of change to
a new rate will be 1/10.25  after  every  site.   However  the  "new  rate"  is
randomly drawn from the mix of rates, and hence could even be the same.  So the
actual observed length of patches with the same rate will  be  somewhat  larger
than  10.25.   Note below that if you choose multiple patches, there will be an
estimate in the  output  file  as  to  which  combination  of  rate  categories
contributed most to the likelihood.

     With the current options C and R the program has  gained  greatly  in  its
ability  to  infer  different rates at different sites and estimate phylogenies
under a more realistic model.  Note that Likelihood Ratio Tests can be used  to
test whether one combination of rates is significantly better than another.

     The G (global search) option causes, after the last species  is  added  to
the  tree,  each  possible group to be removed and re-added.  This improves the
result, since the position of every species is reconsidered.  It  approximately
triples the run-time of the program.

     If the U (user tree) option is chosen another option appears in the  menu,
the  L  option.   If it is selected, it signals the program that it should take
any branch lengths that are in the user tree and simply evaluate the likelihood
of  that tree, without further altering those branch lengths.   This means that
if some branches have lengths and others do not, the program will estimate  the
lengths  of  those  that do not have lengths given in the user tree.  Note that
the program RETREE can be used to add and remove lengths  from  a  tree.   This



means  that  we  can  test  hypothesis  such  as that a certain branch has zero
length, and by doing a series of runs with different specified  lengths  for  a
branch  we can plot a likelihood curve for its branch length while allowing all
other branches to adjust their lengths to it.  If  all  branches  have  lengths
specified,  none  of  them  will  be  iterated.  This is useful to allow a tree
produced by another method to have its likelihood evaluated.  The L option  has
no effect and does not appear in the menu if the U option is not used.

     The W (Weights) option is invoked in the usual way, with  only  weights  0
and  1 allowed.  It selects a set of sites to be analyzed, ignoring the others.
The sites selected are those with weight 1.  If the W option  is  not  invoked,
all sites are analyzed.


                                 OUTPUT FORMAT

     The output starts by giving the number of species, the  number  of  sites,
and  the base frequencies for A, C, G, and T that have been specified.  It then
prints out the transition/transversion ratio that  was  specified  or  used  by
default.    It   also   uses   the  base  frequencies  to  compute  the  actual
transition/transversion ratio implied by the parameter.

     If the C (Categories) option is used a table  of  the  relative  rates  of
expected  substitution  at  each  category  of sites is printed, as well as the
probabilities of each of those rates.

     There then follow the data sequences, with the base sequences  printed  in
groups of ten bases along the lines of the Genbank and EMBL formats.  The trees
found are printed as an unrooted tree topology (possibly rooted by outgroup  if
so  requested).   The  internal  nodes are numbered arbitrarily for the sake of
identification.  The number of trees evaluated so far and the log likelihood of
the  tree  are also given.  Note that the trees printed out have a trifurcation
at the base.  The branch lengths in the diagram are roughly proportional to the
estimated  branch  lengths,  except that very short branches are printed out at
least three characters in length so that the connections can be seen.

     A table is printed showing the length of each tree segment  (in  units  of
expected nucleotide substitutions per site), as well as (very) rough confidence
limits on their lengths.  As with CONTML, if a confidence  limit  is  negative,
this  indicates  that rearrangement of the tree in that region is not excluded,
while if both limits are  positive,  rearrangement  is  still  not  necessarily
excluded  because  the  variance calculation on which the confidence limits are
based results in an  underestimate,  which  makes  the  confidence  limits  too
narrow.

     In addition to  the  confidence  limits,  the  program  performs  a  crude
Likelihood  Ratio Test (LRT) for each branch of the tree.  The program computes
the ratio of likelihoods with and without this branch  length  forced  to  zero
length.   This  done  by  comparing  the  likelihoods changing only that branch
length.  A truly correct LRT would force that branch length to  zero  and  also
allow  the  other  branch  lengths  to  adjust  to that.  The result would be a
likelihood ratio closer to 1.  Therefore the present LRT will err on  the  side
of  being too significant.  YOU ARE WARNED AGAINST TAKING IT TOO SERIOUSLY.  If
you want to get a better likelihood curve  for  a  branch  length  you  can  do
multiple runs with different prespecified lengths for that branch, as discussed
above in the discussion of the L option.

     One should also realize that if you are looking not at a previously-chosen
branch  but at all branches, that you are seeing the results of multiple tests.
With 20 tests, one is expected to reach significance  at  the  P  =  .05  level
purely   by  chance.   You  should  therefore  use  a  much  more  conservative



significance  level,  such  as  .05  divided  by  the  number  of  tests.   The
significance  of  these  tests  is  shown  by  printing  asterisks  next to the
confidence interval on each branch length.  It is important  to  keep  in  mind
that  both  the confidence limits and the tests are very rough and approximate,
and probably  indicate  more  significance  than  they  should.   Nevertheless,
maximum  likelihood  is one of the few methods that can give you any indication
of its own error; most other methods simply fail to warn the user that there is
any  error!   (In  fact, whole philosophical schools of taxonomists exist whose
main point seems to be that there isn't any error, that the "most parsimonious"
tree is the best tree by definition and that's that).

     The log likelihood printed out with the final tree can be used to  perform
various  likelihood  ratio  tests.   One  can,  for  example, compare runs with
different values of the expected  transition/transversion  ratio  to  determine
which value is the maximum likelihood estimate, and what is the allowable range
of values (using a likelihood ratio test, which  you  will  find  described  in
mathematical  statistics  books).  One could also estimate the base frequencies
in the same way.  Both of these, particularly the latter, require multiple runs
of  the  program  to  evaluate  different  possible  values, and this might get
expensive.

     If the U (User Tree) option is used and more than one  tree  is  supplied,
and  the  program  is  not  told to assume autocorrelation between the rates at
different sites, the program also performs a statistical test of each of  these
trees  against  the one with highest likelihood.  This test, due to Kishino and
Hasegawa (1989), uses the  mean  and  variance  of  log-likelihood  differences
between  trees,  taken  across  sites.   If the mean is more than 1.96 standard
deviations different then the trees are declared significantly different.  This
use  of the empirical variance of log-likelihood differences is more robust and
nonparametric than the classical likelihood ratio test, and may to some  extent
compensate  for  the  any lack of realism in the model underlying this program.
The program prints out a  table  of  the  log-likelihoods  of  each  tree,  the
differences  of  each  from  the  highest one, the variance of that quantity as
determined by  the  log-likelihood  differences  at  individual  sites,  and  a
conclusion  as  to  whether that tree is or is not significantly worse than the
best one.  However the Kishino-Hasegawa-Templeton test is not available  if  we
assume  that  there is autocorrelation of rates at neighboring sites (option R)
and is not done in those cases.

     The branch lengths printed out are scaled in terms of expected numbers  of
substitutions, counting both transitions and transversions but not replacements
of a base by itself, and scaled so that the average rate  of  change,  averaged
over  all  sites  analyzed,  is  set to 1.0 if there are multiple categories of
sites.  This means that whether or not there are multiple categories of  sites,
the  expected fraction of change for very small branches is equal to the branch
length.  Of course, when a branch is twice as long  this  does  not  mean  that
there  will  be  twice  as much net change expected along it, since some of the
changes occur in the same site and overlie or even  reverse  each  other.   The
branch  lengths  estimates here are in terms of the expected underlying numbers
of changes.  That means that a branch of length 0.26 is 26 times as long as one
which  would  show  a  1%  difference  between  the nucleotide sequences at the
beginning and end of the branch.  But we would not expect the sequences at  the
beginning  and  end  of  the branch to be 26% different, as there would be some
overlaying of changes.

     Confidence limits on the branch lengths  are  also  given.   Of  course  a
negative  value  of  the  branch  length is meaningless, and a confidence limit
overlapping zero simply  means  that  the  branch  length  is  not  necessarily
significantly  different  from  zero.   Because of limitations of the numerical
algorithm, branch length estimates of  zero  will  often  print  out  as  small
numbers  such  as 0.00001.  If you see a branch length that small, it is really



estimated to be of zero length.  Note that versions 2.7  and  earlier  of  this
program  printed  out  the  branch  lengths in terms of expected probability of
change, so that they were scaled differently.

     Another possible source of confusion is the existence of  negative  values
for  the  log  likelihood.  This is not really a problem; the log likelihood is
not a probability but the logarithm of a probability.  When it is  negative  it
simply  means that the corresponding probability is less than one (since we are
seeing its logarithm).  The log likelihood is  maximized  by  being  made  more
positive: -30.23 is worse than -29.14.

     At the end of the output, if the C option is in effect with multiple  rate
categories,  the  program will print a list of what site categories contributed
the most to the final likelihood.  This combination of rate categories need not
have  contributed  a  majority  of the likelihood, just a plurality.  Still, it
will be helpful as a view of where the program infers that the higher and lower
rates  are.   Note that the use in this calculations of the prior probabilities
of different rates, and the  average  patch  length,  gives  this  inference  a
"smoothed"  appearance:  some  other  combination of rates might make a greater
contribution to the likelihood, but be discounted  because  it  conflicts  with
this prior information.  See the example output below to see what this printout
of rate categories looks like.


                               PROGRAM CONSTANTS

     The constants defined at the beginning of the program include  "maxtrees",
the  maximum  number  of user trees that can be processed.  It is small (10) at
present to same some further memory but the cost of increasing it is  not  very
great.   Other  constants  include  "maxcategories", the maximum number of site
categories, "namelength", the length of species names in characters, and  three
others,  "smoothings",  "iterations",  and  "epsilon",  that  help  "tune"  the
algorithm and define the compromise between execution speed and the quality  of
the  branch  lengths  found by iteratively maximizing the likelihood.  Reducing
iterations and smoothings,  and  increasing  epsilon,  will  result  in  faster
execution  but  a  worse  result.   These  values  will  not usually have to be
changed.

     The program spends most of its time doing real arithmetic.   Any  software
or  hardware changes that speed up that arithmetic will speed it up by a nearly
proportional amount.  For example, microcomputers having a numeric co-processor
will  run  the  program  much  faster,  if  the  compiled program uses it.  The
algorithm, with  separate  and  independent  computations  occurring  for  each
pattern, lends itself readily to parallel processing.


                        PAST AND FUTURE OF THE PROGRAM

     This program, which in version 2.6 replaced the old version of  DNAML,  is
not  derived  directly  from  it but instead was developed by modifying CONTML,
with which it shares many of its data structures and much of its strategy.   It
was  speeded  up  by  two major developments, the use of aliasing of nucleotide
sites (version 3.1) and pretabulation of  some  exponentials  (added  by  Akiko
Fuseki  in version 3.4).  In version 3.5 the Hidden Markov Chain code was added
and the method of iterating branch lengths was changed from an EM algorithm  to
direct  search.   The Hidden Markov Chain code slows things down, especially if
there is autocorrelation between sites, so this version is slower than  version
3.4.  Nevertheless we hope that the sacrifice is worth it.

     Two changes that are needed in the future  are  to  put  in  some  way  of
allowing for base composition of nucleotide sequences in different parts of the



phylogeny, and to re-introduce the code that  allowed  us  to  assign  relative
rates  of  change to individual sites.  By allowing prespecified rates and also
allowing the Hidden Markov Chain method of inferring rates at different  sites,
we could then deal with cases in which (say) first, second, and third positions
in a coding sequence had rates of (say) 1.0 : 0.8 : 2.7 but the  Hidden  Markov
Chain  also inferred rates that varied along the length of the coding sequence.
This would be much more realistic.

----------------------------TEST DATA SET--------------------------

   5   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Epsilon   GGGATCTCGGCCC

(it was run with two categories with rates 1.0 and 3.2, and with
probabilities 0.4 and 0.6 for these rates, and with patch length
parameter = 1.5)

---- CONTENTS OF OUTPUT FILE (with all numerical options on) -------



Nucleic acid sequence Maximum Likelihood method, version 3.5c


Site category   Rate of change    Probability

           1        1.000            0.400
           2        3.200            0.600


Expected length of a patch of sites having the same rate =    1.500

Name            Sequences
----            ---------

Alpha        AACGTGGCCA AAT
Beta         ..G..C.... ..C
Gamma        C.TT.C.T.. C.A
Delta        GGTA.TT.GG CC.
Epsilon      GGGA.CT.GG CCC


Empirical Base Frequencies:

   A       0.24615
   C       0.29231
   G       0.24615
  T(U)     0.21538

Transition/transversion ratio =   2.000000

(Transition/transversion parameter =   1.523077)


                                                       +Epsilon
     +-------------------------------------------------3
  +--2                                                 +----Delta



  !  !
  !  +Beta
  !
--1---------------------Gamma
  !
  +---Alpha


remember: this is an unrooted tree!

Ln Likelihood =   -72.22208

Examined   15 trees

 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------

   1             2              0.06639     (     zero,     0.46657) **
   2             3              3.32192     (     zero,    infinity)
   3          Epsilon           0.00006     (     zero,     0.30554) **
   3          Delta             0.31809     (     zero,     0.78094) **
   2          Beta              0.00003     (     zero,     0.41027) **
   1          Gamma             1.47499     (     zero,     3.68232)
   1          Alpha             0.27845     (     zero,     0.72666) **

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01

Combination of categories that contributes the most to the likelihood:

             2222111111 222