CONTML - Gene Frequencies

version 3.5c

CONTML - Gene Frequencies and Continuous Characters Maximum Likelihood method
(c) Copyright  1986-1993  by  Joseph  Felsenstein  and  by  the  University  of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy this
document provided that no fee is charged for it and that this copyright  notice
is not removed.

     This program estimates phylogenies by the  restricted  maximum  likelihood
method based on the Brownian motion model.  It is based on the model of Edwards
and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967).   Gomberg  (1966),
Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work
leading to efficient algorithms.  CONTML  uses  restricted  maximum  likelihood
estimation  (REML),  which  is  the criterion used by Felsenstein (1973b).  The
actual algorithm is an iterative EM  Algorithm  (Dempster,  Laird,  and  Rubin,
1977) which is guaranteed to always give increasing likelihoods.  The algorithm
is described in detail in a paper  or  mine  (Felsenstein,  1981c),  which  you
should  definitely  consult  if  you  are  going  to  use  this  program.  Some
simulation tests of it are given by Rohlf and Wooten (1988) and Kim and Burgman
(1988).

     The default (gene frequency) mode treats the input as gene frequencies  at
a   series   of   loci,   and  square-root-transforms  the  allele  frequencies
(constructing the frequency of the missing allele at each locus  first).   This
enables us to use the Brownian motion model on the resulting coordinates, in an
approximation equivalent to using Cavalli-Sforza  and  Edwards's  (1967)  chord
measure  of genetic distance and taking that to give distance between particles
undergoing  pure  Brownian  motion.   It  assumes  that  each   locus   evolves
independently by pure genetic drift.

     The alternative continuous characters mode  (menu  option  C)  treats  the
input  as  a series of coordinates of each species in N dimensions.  It assumes
that  we  have  transformed  the  characters  to  remove  correlations  and  to
standardize their variances.

     The input file is as described in the continuous characters  documentation
file above.  Options are selected using a menu:






Continuous character Maximum Likelihood method version 3.5c

Settings for this run:
  U                       Search for best tree?  Yes
  C  Gene frequencies or continuous characters?  Gene frequencies
  A   Input file has all alleles at each locus?  No, one allele missing at each
  O                              Outgroup root?  No, use as outgroup species 1
  G                      Global rearrangements?  No
  J           Randomize input order of species?  No. Use input order
  M                 Analyze multiple data sets?  No
  0         Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1          Print out the data at start of run  No
  2        Print indications of progress of run  Yes
  3                              Print out tree  Yes
  4             Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

Option U is the usual User Tree option.  Options C (Continuous Characters)  and
A  (All  alleles  present)  have  been  described  in  the Gene Frequencies and
Continuous Characters Programs documentation file.  The options G, J, O  and  M
are  the  usual  Global Rearrangements, Jumble order of species, Outgroup root,
and Multiple Data Sets options.

     The G and J options have no effect if the User Tree  option  is  selected.
User  trees  are given with a trifurcation (three-way split) at the base.  They
can start from any interior node.  Thus the tree:

.     A
.     !
.     *--B
.     !
.     *-----C
.     !
.     *--D
.     !
.     E

can be represented by any of the following:

(A,B,(C,(D,E)));
((A,B),C,(D,E));
(((A,B),C),D,E);

(there are of course 69 other representations as well obtained  from  these  by
swapping the order of branches at an interior node).

     The output has a standard appearance.  The topology of the tree  is  given
by  an  unrooted  tree diagram.  The lengths (in time or in expected amounts of
variance) are given in a table below  the  topology,  and  a  rough  confidence
interval  given for each length.  Negative lower bounds on length indicate that
rearrangements may be acceptable.

     The units of length are amounts  of  expected  accumulated  variance  (not
time).   The log likelihood (natural log) of each tree is also given, and it is
indicated how many topologies have been tried.  The tree does  not  necessarily
have  all  tips  contemporary, and the log likelihood may be either positive or
negative (this simply corresponds to whether the density function does or  does
not  exceed  1) and a negative log likelihood does not indicate any error.  The
log likelihood allows various formal likelihood ratio  hypothesis  tests.   The
description  of the tree includes approximate standard errors on the lengths of



segments of the tree.  These are calculated by considering only  the  curvature
of  the  likelihood surface as the length of the segment is varied, holding all
other lengths constant.  As such they are most probably underestimates  of  the
variance, and hence may give too much confidence in the given tree.

     One should use caution in interpreting the likelihoods  that  are  printed
out.   If the model is wrong, it will not be possible to use the likelihoods to
make formal statistical  statements.   Thus,  if  gene  frequencies  are  being
analyzed,  but  the gene frequencies change not only by genetic drift, but also
by mutation, the model is not correct.  It would be as well-justified  in  this
case  to  use  GENDIST  to compute the Nei (1972) genetic distance and then use
FITCH, KITSCH or NEIGHBOR to make a tree.  If continuous characters  are  being
analyzed,  but  if  the characters have not been transformed to new coordinates
that evolve independently and at equal rates, then the model is  also  violated
and no statistical analysis is possible.

     The program makes another kind of statistical  analysis  if  the  U  (User
Tree)  option  is  invoked  and  if  multiple  user  trees are input.  Then the
pairwise statistical test of Templeton (1983) as modified  for  likelihoods  by
Kishino  and  Hasegawa  (1989)  is carried out.  This is a relative of the test
that is called by Allan Wilson (see Holmquist, Miyamoto, and Goodman, 1988) the
"winning  sites"  test.   The  test  forms  the  differences  between  the log-
likelihoods of the best tree and each given tree, locus by locus (or coordinate
by  coordinate  in  the  continuous  characters  case).   Then it carries out a
pairwise t-test (actually a z-test as it uses the normal distribution, which is
a  bit  rougher)  between  the  two  trees  to  see  whether  one  of  them  is
significantly more strongly  supported  by  the  data.   We  can  consider  the
confidence  interval of trees that are allowed by the data to be all those that
pass the test (i.e. all those  that  have  the  sum  of  their  log  likelihood
differences  more  than  1.96  standard deviations from zero).  This test makes
fewer assumptions than does the standard likelihood  ratio  test,  and  can  be
applied  when  the  LRT  is  not  valid,  as  when  we  compare  different tree
topologies.  It may be valid even when there are different rates  of  evolution
at different loci, for example.

     One problem which sometimes arises is that the program is fed two  species
(or  populations)  with identical transformed gene frequencies: this can happen
if sample sizes are small and/or many loci are monomorphic.  In this  case  the
program  "gets its knickers in a twist" and can divide by zero, usually causing
a crash.  If you suspect that this has happened, check  for  two  species  with
identical  coordinates.   If you find them, eliminate one from the problem: the
two must always show up as being at the same point on the tree anyway.

     The constants available for modification at the beginning of  the  program
include  "epsilon1", a small quantity used in the iterations of branch lengths,
"epsilon2", another not quite so small quantity  used  to  check  whether  gene
frequencies  that were fed in for all alleles do not add up to 1, "smoothings",
the number  of  passes  through  a  given  tree  in  the  iterative  likelihood
maximization for a given topology, "maxtrees", the maximum number of user trees
that will be used for the Kishino-Hasegawa-Templeton  test,  and  "namelength",
the length of species names.   There is no provision in this program for saving
multiple trees that are tied for having the highest likelihood, mostly  because
an exact tie is unlikely anyway.

     The algorithm does not run as quickly as the  discrete  character  methods
but is not enormously slower.  Like them, its execution time should rise as the
cube of the number of species.  In one paper Astolfi, Kidd, and  Cavalli-Sforza
(1981) say that "ML requires a prohibitive amount of computer time."  It should
be realized that they were using a FORTRAN program of mine nearly ten years old
at  that  time,  which  did  not take advantage of the EM algorithm to speed up
convergence to the best likelihood for a given topology.  The  current  program



should be much more practical than the one they had to use.


---------------------------------TEST DATA SET--------------------------

.   5    10
2 2 2 2 2 2 2 2 2 2
European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976

-------- TEST SET OUTPUT (WITH ALL NUMERICAL OPTIONS TURNED ON) --------

   5 Populations,   10 Loci

Continuous character Maximum Likelihood method version 3.5c


Numbers of alleles at the loci:
------- -- ------- -- --- -----

   2   2   2   2   2   2   2   2   2   2

Name                 Gene Frequencies
----                 ---- -----------

  locus:         1         2         3         4         5         6
                 7         8         9        10

European     0.28680   0.56840   0.44220   0.42860   0.38280   0.72850
.            0.63860   0.02050   0.80550   0.50430
African      0.13560   0.48400   0.06020   0.03970   0.59770   0.96750
.            0.95110   0.06000   0.75820   0.62070
Chinese      0.16280   0.59580   0.72980   1.00000   0.38110   0.79860
.            0.77820   0.07260   0.74820   0.73340
American     0.01440   0.69900   0.32800   0.74210   0.66060   0.86030
.            0.79240   0.00000   0.80860   0.86360
Australian   0.12110   0.22740   0.58210   1.00000   0.20180   0.90000
.            0.98370   0.03960   0.90970   0.29760


  +----------------------------------African
  !
  !              +--------American
--1--------------2
  !              !                    +-----------------------Australian
  !              +--------------------3
  !                                   +Chinese
  !
  +--European


remember: this is an unrooted tree!




Ln Likelihood =    33.29060

examined   15 trees

Between     And             Length      Approx. Confidence Limits
-------     ---             ------      ------- ---------- ------
  1       African           0.08464   (     0.02351,     0.17917)
  1          2              0.03569   (    -0.00262,     0.09493)
  2       American          0.02094   (    -0.00904,     0.06731)
  2          3              0.05098   (     0.00555,     0.12124)
  3       Australian        0.05959   (     0.01775,     0.12430)
  3       Chinese           0.00221   (    -0.02034,     0.03710)
  1       European          0.00624   (    -0.01948,     0.04601)