KITSCH

version 3.5c

KITSCH -- Fitch-Margoliash and Least Squares Methods with Evolutionary Clock
(c) Copyright  1986-1993  by  Joseph  Felsenstein  and  by  the  University  of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy this
document provided that no fee is charged for it and that this copyright  notice
is not removed.

     This program carries out the Fitch-Margoliash and Least  Squares  methods,
plus  a  variety of others of the same family, with the assumption that all tip
species are contemporaneous, and  that  there  is  an  evolutionary  clock  (in
effect,  a molecular clock).  This means that branches of the tree cannot be of
arbitrary length, but are constrained so that the total length from the root of
the  tree  to  any  species  is  the  same.  The quantity minimized is the same
weighted sum of squares described in the Distance Matrix Methods  documentation
file.

     The options are set using the menu:


Fitch-Margoliash method with contemporary tips, version 3.5c

Settings for this run:
  U                 Search for best tree?  Yes
  P                                Power?  2.00000
  -      Negative branch lengths allowed?  No
  L         Lower-triangular data matrix?  no
  R         Upper-triangular data matrix?  No
  S                        Subreplicates?  No
  J     Randomize input order of species?  No. Use input order
  M           Analyze multiple data sets?  No
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

The options are as described in  the  Distance  Matrix  Programs  documentation
file.   Note,  of course, that the User Trees (used by option U) must be rooted
trees (with a bifurcation at their base).  If you take a user tree  from  FITCH
and  try  to  evaluate it in KITSCH, it must first be rooted.  This can be done
using RETREE.  Of  the  options  available  in  FITCH,  the  O  option  is  not
available,  as KITSCH estimates a rooted tree which cannot be rerooted, and the
G option is not available, as global rearrangement  is  the  default  condition
anyway.   It  is also not possible to specify that specific branch lengths of a
user tree be retained when it is read into  KITSCH,  unless  all  of  them  are
present.   In  that  case  the  tree should be properly clocklike.  Readers who
wonder why we have not proovided the feature of holding some of the  user  tree
branch  lengths constant while iterating others are invited to tell us how they
would do it.  As you consider particular possible patterns  of  branch  lengths
you will find that the matter is not at all simple.

     The input is exactly the same as described in the Distance Matrix  Methods
documentation  file.   The  output  is  a rooted tree, together with the sum of



squares, the number of tree topologies searched, and, if the power P is at  its
default  value of 2.0, the Average Percent Standard Deviation is also supplied.
The lengths of the branches of the tree are given in a table, that  also  shows
for  each  branch  the time at the upper end of the branch.  "Time" here really
means cumulative branch length from the root, going  upwards  (on  the  printed
diagram, rightwards).  For each branch, the "time" given is for the node at the
right (upper) end of the branch.   It is important to realize that  the  branch
lengths  are  not exactly proportional to the lengths drawn on the printed tree
diagram!  In particular, short branches are exaggerated in the length  on  that
diagram so that they are more visible.

     The method may be considered as providing an estimate  of  the  phylogeny.
Alternatively,  it  can  be  considered  as  a  phenetic  clustering of the tip
species.  This method minimizes an objective function, the sum of squares,  not
only  setting  the  levels  of the clusters so as to do so, but rearranging the
hierarchy of clusters to try to find alternative clusterings that give a  lower
overall  sum of squares.  When the power option P is set to a value of P = 0.0,
so that we are minimizing a simple sum of squares of  the  differences  between
the  observed distance matrix and the expected one, the method is very close in
spirit to Unweighted Pair Group Arithmetic  Average  Clustering  (UPGMA),  also
called  Average-Linkage  Clustering.   If the topology of the tree is fixed and
there turn out to be no branches of negative length, its result should  be  the
same  as  UPGMA  in  that  case.  But since it tries alternative topologies and
(unless the N option is set) it combines nodes that otherwise could result in a
reversal  of  levels,  it  is  possible for it to give a different, and better,
result than simple sequential clustering.  Of course UPGMA itself is  available
as an option in program NEIGHBOR.

     The U (User Tree) option requires a bifurcating tree, unlike FITCH,  which
requires an unrooted tree with a trifurcation at its base.  Thus the tree shown
below would be written:

    ((D,E),(C,(A,B)));

If a tree with a trifurcation at the base is by mistake fed into the  U  option
of KITSCH then some of its species (the entire rightmost furc, in fact) will be
ignored and too small a tree read in.  This should result in an  error  message
and  the  program  should  stop.   It is important to understand the difference
between the User Tree formats for KITSCH and FITCH.  You may want to use RETREE
to  convert a user tree that is suitable for FITCH into one suitable for KITSCH
or vice versa.

     An important use of this method will be to do a formal statistical test of
the  evolutionary  clock hypothesis.  This can be done by comparing the sums of
squares achieved by FITCH and  by  KITSCH,  BUT  SOME  CAVEATS  ARE  NECESSARY.
First,  the  assumption  is  that the observed distances are truly independent,
that no original data item contributes to more than one of them  (not  counting
the  two reciprocal distances from i to j and from j to i).  THIS WILL NOT HOLD
IF THE  DISTANCES  ARE  OBTAINED  FROM  GENE  FREQUENCIES,  FROM  MORPHOLOGICAL
CHARACTERS,   OR  FROM  MOLECULAR  SEQUENCES.   It  may  be  invalid  even  for
immunological distances and levels of DNA hybridization, provided that the  use
of  common  standard  for all members of a row or column allows an error in the
measurement of the standard to affect all these distances  simultaneously.   It
will also be invalid if the numbers have been collected in experimental groups,
each measured by taking differences from a  common  standard  which  itself  is
measured  with error.  Only if the numbers in different cells are measured from
independent standards can we depend on the statistical model.  The  details  of
the  test  and  the  assumptions  are  discussed in my review paper on distance
methods (Felsenstein, 1984a).  For further and sometimes irrelevant controversy
on  these  matters  see  the  papers  by  Farris  (1981, 1985, 1986) and myself
(Felsenstein, 1986, 1988b).



     A second caveat is that the distances must be expected  to  rise  linearly
with  time,  not  according  to  any  other curve.  Thus it may be necessary to
transform the distances to achieve an expected  linearity.   If  the  distances
have  an  upper  limit  beyond  which  they could not go, this is a signal that
linearity may not hold.  It is also VERY important to choose the power P  at  a
value  that  results in the standard deviation of the variation of the observed
from the expected distances being the P/2-th power of the expected distance.

     To carry out the test, fit the same data with both FITCH and  KITSCH,  and
record  the  two  sums of squares.  If the topology has turned out the same, we
have N = n(n-1)/2 distances which have been fit with 2n-3 parameters in  FITCH,
and  with  n-1 parameters in KITSCH.  Then the difference between S(K) and S(F)
has d1 = n-2 degrees of freedom.  It is statistically independent of the  value
of S(F), which has d2 = N-(2n-3) degrees of freedom.  The ratio of mean squares
([S(K)-S(F)]/d1)/(S(F)/d2) should, under the  evolutionary  clock,  have  an  F
distribution  with  n-2 and N-(2n-3) degrees of freedom respectively.  The test
desired is that the F ratio is in the upper tail (say  the  upper  5%)  of  its
distribution.  If the S (subreplication) option is in effect, the above degrees
of freedom must be modified by noting that N is not n(n-1)/2 but is the sum  of
the  numbers  of  replicates of all cells in the distance matrix read in, which
may be either square or triangular.  A further explanation of  the  statistical
test of the clock is given in my more recent paper (Felsenstein, 1986).

     The program uses a similar tree construction method to the other  programs
in the package and, like them, is not guaranteed to give the best-fitting tree.
The assignment of the branch lengths for a given topology is  a  least  squares
fit, subject to the constraints against negative branch lengths, and should not
be able to be improved upon.  KITSCH runs more quickly than FITCH.

     The constants available for modification at the beginning of  the  program
are  "namelength", the length of a species name, and "epsilon", which defines a
small quantity needed in some of the calculations.  There is no feature  saving
multiply  trees  tied  for best, because exact ties are not expected, except in
cases where it should be obvious from the tree printed out what is  the  nature
of the tie (as when an interior branch is of length zero).


----------------------------TEST DATA SET-------------------------------

    5
Alpha      0.000 1.000 2.000 3.000 3.000
Beta       1.000 0.000 2.000 3.000 3.000
Gamma      2.000 2.000 0.000 3.000 3.000
Delta      3.000 3.000 3.000 0.000 1.000
Epsilon    3.000 3.000 3.000 1.000 0.000

----- TEST SET OUTPUT FILE (with all numerical options on) -------------

   5 Populations

Fitch-Margoliash method with contemporary tips, version 3.5c

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  ------------
                                2
                   i  j      Obs

negative branch lengths not allowed





Name                       Distances
----                       ---------

Alpha         0.00000   1.00000   2.00000   3.00000   3.00000
Beta          1.00000   0.00000   2.00000   3.00000   3.00000
Gamma         2.00000   2.00000   0.00000   3.00000   3.00000
Delta         3.00000   3.00000   3.00000   0.00000   1.00000
Epsilon       3.00000   3.00000   3.00000   1.00000   0.00000




                                +--------------Epsilon
  +-----------------------------4
  !                             +--------------Delta
--3
  !                             +--------------Beta
  !              +--------------1
  +--------------2              +--------------Alpha
                 !
                 +-----------------------------Gamma


Sum of squares =      0.000

Average percent standard deviation =   0.00000

examined   76 trees

From    To           Length          Time
----    --           ------          ----

   4  Epsilon         0.50000        1.50000
   3     4            1.00000        1.00000
   4  Delta           0.50000        1.50000
   1  Beta            0.50000        1.50000
   2     1            0.50000        1.00000
   1  Alpha           0.50000        1.50000
   3     2            0.50000        0.50000
   2  Gamma           1.00000        1.50000