version 3.5c
KITSCH -- Fitch-Margoliash and Least Squares Methods with
Evolutionary Clock
(c) Copyright 1986-1993 by Joseph Felsenstein and by the University of
Washington. Written by Joseph Felsenstein. Permission is granted to copy this
document provided that no fee is charged for it and that this copyright notice
is not removed.
This program carries out the Fitch-Margoliash and Least Squares methods,
plus a variety of others of the same family, with the assumption that all tip
species are contemporaneous, and that there is an evolutionary clock (in
effect, a molecular clock). This means that branches of the tree cannot be of
arbitrary length, but are constrained so that the total length from the root of
the tree to any species is the same. The quantity minimized is the same
weighted sum of squares described in the Distance Matrix Methods documentation
file.
The options are set using the menu:
Fitch-Margoliash method with contemporary tips, version 3.5c
Settings for this run:
U Search for best tree? Yes
P Power? 2.00000
- Negative branch lengths allowed? No
L Lower-triangular data matrix? no
R Upper-triangular data matrix? No
S Subreplicates? No
J Randomize input order of species? No. Use input order
M Analyze multiple data sets? No
0 Terminal type (IBM PC, VT52, ANSI)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Are these settings correct? (type Y or the letter for one to change)
The options are as described in the Distance Matrix Programs documentation
file. Note, of course, that the User Trees (used by option U) must be rooted
trees (with a bifurcation at their base). If you take a user tree from FITCH
and try to evaluate it in KITSCH, it must first be rooted. This can be done
using RETREE. Of the options available in FITCH, the O option is not
available, as KITSCH estimates a rooted tree which cannot be rerooted, and the
G option is not available, as global rearrangement is the default condition
anyway. It is also not possible to specify that specific branch lengths of a
user tree be retained when it is read into KITSCH, unless all of them are
present. In that case the tree should be properly clocklike. Readers who
wonder why we have not proovided the feature of holding some of the user tree
branch lengths constant while iterating others are invited to tell us how they
would do it. As you consider particular possible patterns of branch lengths
you will find that the matter is not at all simple.
The input is exactly the same as described in the Distance Matrix Methods
documentation file. The output is a rooted tree, together with the sum of
squares, the number of tree topologies searched, and, if the power P is at its
default value of 2.0, the Average Percent Standard Deviation is also supplied.
The lengths of the branches of the tree are given in a table, that also shows
for each branch the time at the upper end of the branch. "Time" here really
means cumulative branch length from the root, going upwards (on the printed
diagram, rightwards). For each branch, the "time" given is for the node at the
right (upper) end of the branch. It is important to realize that the branch
lengths are not exactly proportional to the lengths drawn on the printed tree
diagram! In particular, short branches are exaggerated in the length on that
diagram so that they are more visible.
The method may be considered as providing an estimate of the phylogeny.
Alternatively, it can be considered as a phenetic clustering of the tip
species. This method minimizes an objective function, the sum of squares, not
only setting the levels of the clusters so as to do so, but rearranging the
hierarchy of clusters to try to find alternative clusterings that give a lower
overall sum of squares. When the power option P is set to a value of P = 0.0,
so that we are minimizing a simple sum of squares of the differences between
the observed distance matrix and the expected one, the method is very close in
spirit to Unweighted Pair Group Arithmetic Average Clustering (UPGMA), also
called Average-Linkage Clustering. If the topology of the tree is fixed and
there turn out to be no branches of negative length, its result should be the
same as UPGMA in that case. But since it tries alternative topologies and
(unless the N option is set) it combines nodes that otherwise could result in a
reversal of levels, it is possible for it to give a different, and better,
result than simple sequential clustering. Of course UPGMA itself is available
as an option in program NEIGHBOR.
The U (User Tree) option requires a bifurcating tree, unlike FITCH, which
requires an unrooted tree with a trifurcation at its base. Thus the tree shown
below would be written:
((D,E),(C,(A,B)));
If a tree with a trifurcation at the base is by mistake fed into the U option
of KITSCH then some of its species (the entire rightmost furc, in fact) will be
ignored and too small a tree read in. This should result in an error message
and the program should stop. It is important to understand the difference
between the User Tree formats for KITSCH and FITCH. You may want to use RETREE
to convert a user tree that is suitable for FITCH into one suitable for KITSCH
or vice versa.
An important use of this method will be to do a formal statistical test of
the evolutionary clock hypothesis. This can be done by comparing the sums of
squares achieved by FITCH and by KITSCH, BUT SOME CAVEATS ARE NECESSARY.
First, the assumption is that the observed distances are truly independent,
that no original data item contributes to more than one of them (not counting
the two reciprocal distances from i to j and from j to i). THIS WILL NOT HOLD
IF THE DISTANCES ARE OBTAINED FROM GENE FREQUENCIES, FROM MORPHOLOGICAL
CHARACTERS, OR FROM MOLECULAR SEQUENCES. It may be invalid even for
immunological distances and levels of DNA hybridization, provided that the use
of common standard for all members of a row or column allows an error in the
measurement of the standard to affect all these distances simultaneously. It
will also be invalid if the numbers have been collected in experimental groups,
each measured by taking differences from a common standard which itself is
measured with error. Only if the numbers in different cells are measured from
independent standards can we depend on the statistical model. The details of
the test and the assumptions are discussed in my review paper on distance
methods (Felsenstein, 1984a). For further and sometimes irrelevant controversy
on these matters see the papers by Farris (1981, 1985, 1986) and myself
(Felsenstein, 1986, 1988b).
A second caveat is that the distances must be expected to rise linearly
with time, not according to any other curve. Thus it may be necessary to
transform the distances to achieve an expected linearity. If the distances
have an upper limit beyond which they could not go, this is a signal that
linearity may not hold. It is also VERY important to choose the power P at a
value that results in the standard deviation of the variation of the observed
from the expected distances being the P/2-th power of the expected distance.
To carry out the test, fit the same data with both FITCH and KITSCH, and
record the two sums of squares. If the topology has turned out the same, we
have N = n(n-1)/2 distances which have been fit with 2n-3 parameters in FITCH,
and with n-1 parameters in KITSCH. Then the difference between S(K) and S(F)
has d1 = n-2 degrees of freedom. It is statistically independent of the value
of S(F), which has d2 = N-(2n-3) degrees of freedom. The ratio of mean squares
([S(K)-S(F)]/d1)/(S(F)/d2) should, under the evolutionary clock, have an F
distribution with n-2 and N-(2n-3) degrees of freedom respectively. The test
desired is that the F ratio is in the upper tail (say the upper 5%) of its
distribution. If the S (subreplication) option is in effect, the above degrees
of freedom must be modified by noting that N is not n(n-1)/2 but is the sum of
the numbers of replicates of all cells in the distance matrix read in, which
may be either square or triangular. A further explanation of the statistical
test of the clock is given in my more recent paper (Felsenstein, 1986).
The program uses a similar tree construction method to the other programs
in the package and, like them, is not guaranteed to give the best-fitting tree.
The assignment of the branch lengths for a given topology is a least squares
fit, subject to the constraints against negative branch lengths, and should not
be able to be improved upon. KITSCH runs more quickly than FITCH.
The constants available for modification at the beginning of the program
are "namelength", the length of a species name, and "epsilon", which defines a
small quantity needed in some of the calculations. There is no feature saving
multiply trees tied for best, because exact ties are not expected, except in
cases where it should be obvious from the tree printed out what is the nature
of the tie (as when an interior branch is of length zero).
----------------------------TEST DATA SET-------------------------------
5
Alpha 0.000 1.000 2.000 3.000 3.000
Beta 1.000 0.000 2.000 3.000 3.000
Gamma 2.000 2.000 0.000 3.000 3.000
Delta 3.000 3.000 3.000 0.000 1.000
Epsilon 3.000 3.000 3.000 1.000 0.000
----- TEST SET OUTPUT FILE (with all numerical options on) -------------
5 Populations
Fitch-Margoliash method with contemporary tips, version 3.5c
__ __ 2
\ \ (Obs - Exp)
Sum of squares = /_ /_ ------------
2
i j Obs
negative branch lengths not allowed
Name Distances
---- ---------
Alpha 0.00000 1.00000 2.00000 3.00000 3.00000
Beta 1.00000 0.00000 2.00000 3.00000 3.00000
Gamma 2.00000 2.00000 0.00000 3.00000 3.00000
Delta 3.00000 3.00000 3.00000 0.00000 1.00000
Epsilon 3.00000 3.00000 3.00000 1.00000 0.00000
+--------------Epsilon
+-----------------------------4
! +--------------Delta
--3
! +--------------Beta
! +--------------1
+--------------2 +--------------Alpha
!
+-----------------------------Gamma
Sum of squares = 0.000
Average percent standard deviation = 0.00000
examined 76 trees
From To Length Time
---- -- ------ ----
4 Epsilon 0.50000 1.50000
3 4 1.00000 1.00000
4 Delta 0.50000 1.50000
1 Beta 0.50000 1.50000
2 1 0.50000 1.00000
1 Alpha 0.50000 1.50000
3 2 0.50000 0.50000
2 Gamma 1.00000 1.50000