version 3.5c
CONTML - Gene Frequencies and Continuous Characters Maximum Likelihood method
(c) Copyright 1986-1993 by Joseph Felsenstein and by the University of
Washington. Written by Joseph Felsenstein. Permission is granted to copy this
document provided that no fee is charged for it and that this copyright notice
is not removed.
This program estimates phylogenies by the restricted maximum likelihood
method based on the Brownian motion model. It is based on the model of Edwards
and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966),
Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work
leading to efficient algorithms. CONTML uses restricted maximum likelihood
estimation (REML), which is the criterion used by Felsenstein (1973b). The
actual algorithm is an iterative EM Algorithm (Dempster, Laird, and Rubin,
1977) which is guaranteed to always give increasing likelihoods. The algorithm
is described in detail in a paper or mine (Felsenstein, 1981c), which you
should definitely consult if you are going to use this program. Some
simulation tests of it are given by Rohlf and Wooten (1988) and Kim and Burgman
(1988).
The default (gene frequency) mode treats the input as gene frequencies at
a series of loci, and square-root-transforms the allele frequencies
(constructing the frequency of the missing allele at each locus first). This
enables us to use the Brownian motion model on the resulting coordinates, in an
approximation equivalent to using Cavalli-Sforza and Edwards's (1967) chord
measure of genetic distance and taking that to give distance between particles
undergoing pure Brownian motion. It assumes that each locus evolves
independently by pure genetic drift.
The alternative continuous characters mode (menu option C) treats the
input as a series of coordinates of each species in N dimensions. It assumes
that we have transformed the characters to remove correlations and to
standardize their variances.
The input file is as described in the continuous characters documentation
file above. Options are selected using a menu:
Continuous character Maximum Likelihood method version 3.5c
Settings for this run:
U Search for best tree? Yes
C Gene frequencies or continuous characters? Gene frequencies
A Input file has all alleles at each locus? No, one allele missing at each
O Outgroup root? No, use as outgroup species 1
G Global rearrangements? No
J Randomize input order of species? No. Use input order
M Analyze multiple data sets? No
0 Terminal type (IBM PC, VT52, ANSI)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Are these settings correct? (type Y or the letter for one to change)
Option U is the usual User Tree option. Options C (Continuous Characters) and
A (All alleles present) have been described in the Gene Frequencies and
Continuous Characters Programs documentation file. The options G, J, O and M
are the usual Global Rearrangements, Jumble order of species, Outgroup root,
and Multiple Data Sets options.
The G and J options have no effect if the User Tree option is selected.
User trees are given with a trifurcation (three-way split) at the base. They
can start from any interior node. Thus the tree:
. A
. !
. *--B
. !
. *-----C
. !
. *--D
. !
. E
can be represented by any of the following:
(A,B,(C,(D,E)));
((A,B),C,(D,E));
(((A,B),C),D,E);
(there are of course 69 other representations as well obtained from these by
swapping the order of branches at an interior node).
The output has a standard appearance. The topology of the tree is given
by an unrooted tree diagram. The lengths (in time or in expected amounts of
variance) are given in a table below the topology, and a rough confidence
interval given for each length. Negative lower bounds on length indicate that
rearrangements may be acceptable.
The units of length are amounts of expected accumulated variance (not
time). The log likelihood (natural log) of each tree is also given, and it is
indicated how many topologies have been tried. The tree does not necessarily
have all tips contemporary, and the log likelihood may be either positive or
negative (this simply corresponds to whether the density function does or does
not exceed 1) and a negative log likelihood does not indicate any error. The
log likelihood allows various formal likelihood ratio hypothesis tests. The
description of the tree includes approximate standard errors on the lengths of
segments of the tree. These are calculated by considering only the curvature
of the likelihood surface as the length of the segment is varied, holding all
other lengths constant. As such they are most probably underestimates of the
variance, and hence may give too much confidence in the given tree.
One should use caution in interpreting the likelihoods that are printed
out. If the model is wrong, it will not be possible to use the likelihoods to
make formal statistical statements. Thus, if gene frequencies are being
analyzed, but the gene frequencies change not only by genetic drift, but also
by mutation, the model is not correct. It would be as well-justified in this
case to use GENDIST to compute the Nei (1972) genetic distance and then use
FITCH, KITSCH or NEIGHBOR to make a tree. If continuous characters are being
analyzed, but if the characters have not been transformed to new coordinates
that evolve independently and at equal rates, then the model is also violated
and no statistical analysis is possible.
The program makes another kind of statistical analysis if the U (User
Tree) option is invoked and if multiple user trees are input. Then the
pairwise statistical test of Templeton (1983) as modified for likelihoods by
Kishino and Hasegawa (1989) is carried out. This is a relative of the test
that is called by Allan Wilson (see Holmquist, Miyamoto, and Goodman, 1988) the
"winning sites" test. The test forms the differences between the log-
likelihoods of the best tree and each given tree, locus by locus (or coordinate
by coordinate in the continuous characters case). Then it carries out a
pairwise t-test (actually a z-test as it uses the normal distribution, which is
a bit rougher) between the two trees to see whether one of them is
significantly more strongly supported by the data. We can consider the
confidence interval of trees that are allowed by the data to be all those that
pass the test (i.e. all those that have the sum of their log likelihood
differences more than 1.96 standard deviations from zero). This test makes
fewer assumptions than does the standard likelihood ratio test, and can be
applied when the LRT is not valid, as when we compare different tree
topologies. It may be valid even when there are different rates of evolution
at different loci, for example.
One problem which sometimes arises is that the program is fed two species
(or populations) with identical transformed gene frequencies: this can happen
if sample sizes are small and/or many loci are monomorphic. In this case the
program "gets its knickers in a twist" and can divide by zero, usually causing
a crash. If you suspect that this has happened, check for two species with
identical coordinates. If you find them, eliminate one from the problem: the
two must always show up as being at the same point on the tree anyway.
The constants available for modification at the beginning of the program
include "epsilon1", a small quantity used in the iterations of branch lengths,
"epsilon2", another not quite so small quantity used to check whether gene
frequencies that were fed in for all alleles do not add up to 1, "smoothings",
the number of passes through a given tree in the iterative likelihood
maximization for a given topology, "maxtrees", the maximum number of user trees
that will be used for the Kishino-Hasegawa-Templeton test, and "namelength",
the length of species names. There is no provision in this program for saving
multiple trees that are tied for having the highest likelihood, mostly because
an exact tie is unlikely anyway.
The algorithm does not run as quickly as the discrete character methods
but is not enormously slower. Like them, its execution time should rise as the
cube of the number of species. In one paper Astolfi, Kidd, and Cavalli-Sforza
(1981) say that "ML requires a prohibitive amount of computer time." It should
be realized that they were using a FORTRAN program of mine nearly ten years old
at that time, which did not take advantage of the EM algorithm to speed up
convergence to the best likelihood for a given topology. The current program
should be much more practical than the one they had to use.
---------------------------------TEST DATA SET--------------------------
. 5 10
2 2 2 2 2 2 2 2 2 2
European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976
-------- TEST SET OUTPUT (WITH ALL NUMERICAL OPTIONS TURNED ON) --------
5 Populations, 10 Loci
Continuous character Maximum Likelihood method version 3.5c
Numbers of alleles at the loci:
------- -- ------- -- --- -----
2 2 2 2 2 2 2 2 2 2
Name Gene Frequencies
---- ---- -----------
locus: 1 2 3 4 5 6
7 8 9 10
European 0.28680 0.56840 0.44220 0.42860 0.38280 0.72850
. 0.63860 0.02050 0.80550 0.50430
African 0.13560 0.48400 0.06020 0.03970 0.59770 0.96750
. 0.95110 0.06000 0.75820 0.62070
Chinese 0.16280 0.59580 0.72980 1.00000 0.38110 0.79860
. 0.77820 0.07260 0.74820 0.73340
American 0.01440 0.69900 0.32800 0.74210 0.66060 0.86030
. 0.79240 0.00000 0.80860 0.86360
Australian 0.12110 0.22740 0.58210 1.00000 0.20180 0.90000
. 0.98370 0.03960 0.90970 0.29760
+----------------------------------African
!
! +--------American
--1--------------2
! ! +-----------------------Australian
! +--------------------3
! +Chinese
!
+--European
remember: this is an unrooted tree!
Ln Likelihood = 33.29060
examined 15 trees
Between And Length Approx. Confidence Limits
------- --- ------ ------- ---------- ------
1 African 0.08464 ( 0.02351, 0.17917)
1 2 0.03569 ( -0.00262, 0.09493)
2 American 0.02094 ( -0.00904, 0.06731)
2 3 0.05098 ( 0.00555, 0.12124)
3 Australian 0.05959 ( 0.01775, 0.12430)
3 Chinese 0.00221 ( -0.02034, 0.03710)
1 European 0.00624 ( -0.01948, 0.04601)