version 3.5c
Molecular Sequence Programs

(c) Copyright  1986-1993  by  Joseph  Felsenstein  and  by  the  University  of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy this
document provided that no fee is charged for it and that this copyright  notice
is not removed.

     These programs estimate phylogenies from protein sequence or nucleic  acid
sequence  data.   PROTPARS uses a parsimony method intermediate between Eck and
Dayhoff's method (1966) of allowing transitions between  all  amino  acids  and
counting  those, and Fitch's (1971) method of counting the number of nucleotide
changes that would be needed to evolve the protein sequence.  DNAPARS uses  the
parsimony  method allowing changes between all bases and counting the number of
those.  DNAMOVE is an  interactive  parsimony  program  allowing  the  user  to
rearrange  trees by hand and see where characters states change.  DNAPENNY uses
the branch-and-bound method to search for all most parsimonious  trees  in  the
nucleic  acid  sequence  case.   DNACOMP  adapts  to  nucleotide  sequences the
compatibility (largest clique) approach.  DNAINVAR does not directly estimate a
phylogeny, but computes Lake's (1987) and Cavender's (Cavender and Felsenstein,
1987) phylogenetic invariants, which are quantities whose values depend on  the
phylogeny.    DNAML  does  a  maximum  likelihood  estimate  of  the  phylogeny
(Felsenstein, 1981a).  DNAMLK is similar  to  DNAML  but  assumes  a  molecular
clock.   DNADIST  computes  distance  measures  between  pairs  of species from
nucleotide sequences, distances that can then be used by  the  distance  matrix
programs  FITCH  and  KITSCH.  RESTML  does  a maximum likelihood estimate from
restriction sites data.   SEQBOOT allows you to read in a  data  set  and  then
produce  multiple  data sets from it by bootstrapping, delete-half jackknifing,
or by permuting within sites.  This then allows most of  these  methods  to  be
bootstrapped  or  jackknifed,  and for the Permutation Tail Probability Test of
Archie (1989) and Faith and Cranston (1991) to be carried out.

     The input and output format for RESTML is described in its document files.
In general its input format is similar to those described here, except that the
one-letter codes for restriction sites is  specific  to  that  program  and  is
described  in  that  document  file.  Since the input formats for the eight DNA
sequence and two protein sequence programs apply to more than one program, they
are  described here.  Their input formats are standard, making use of the IUPAC
standards.


                      INTERLEAVED AND SEQUENTIAL FORMATS

     The sequences can continue over multiple lines;  when  this  is  done  the
sequences  must  be  either  in  "interleaved" format, similar to the output of
alignment programs, or "sequential" format.  These are described  in  the  main
document  file.  In sequential format all of one sequence is given, possibly on
multiple lines, before the next starts.  In interleaved format the  first  part
of  the  file  should  contain  the  first  part of each of the sequences, then
possibly a line containing nothing but a carriage-return  character,  then  the
second part of each sequence, and so on.  Only the first parts of the sequences
should be preceded by names.  Here is a  hypothetical  example  of  interleaved
format:









  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

while in sequential format the same sequences would be:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA

Note, of course, that a portion of a sequence like this:

   300   AAGCGTGAAC GTTGTACTAA TRCAG

is perfectly legal, assuming that the species name  has  gone  before,  and  is
filled  out  to  full  length  by  blanks.  The above digits and blanks will be
ignored, the sequence being taken as starting at the first base symbol (in this
case  an  A).  This should enable you to use output from many multiple-sequence
alignment programs with only minimal editing.

     In interleaved format the present versions of the programs  may  sometimes
have  difficulties  with the blank lines between groups of lines, and if so you
might want to retype those lines, making sure that they have only  a  carriage-
return  and  no  blank characters on them, or you may perhaps have to eliminate
them.  The symptoms of this problem are that the  programs  complain  that  the
sequences  are  not  properly aligned, and you can find no other cause for this
complaint.


                      INPUT FOR THE DNA SEQUENCE PROGRAMS

     The input format for the DNA sequence programs is standard: the data  have
A's,  G's, C's and T's (or U's).  The first line of the input file contains the
number of species and the number of sites.  As with the other programs, options
information  may  follow  this.   Following  this, each species starts on a new
line.  The first 10 characters of that line are the species name.   There  then
follows  the  base  sequence  of  that species, each character being one of the
letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period
was  also  previously allowed but it is no longer allowed, because it sometimes
is used in different senses in other programs).  Blanks will be ignored, and so
will  numerical  digits.   This  allows GENBANK and EMBL sequence entries to be
read with minimum editing.





     These characters can be  either  upper  or  lower  case.   The  algorithms
convert  all  input  characters  to upper case (which is how they are treated).
The characters constitute the IUPAC (IUB) nucleic acid code  plus  some  slight
extensions.  They enable input of nucleic acid sequences taking full account of
any ambiguities in the sequence.

          Symbol   Meaning
          ------   -------
            A       Adenine
            G       Guanine
            C       Cytosine
            T       Thymine
            U       Uracil
            Y       pYrimidine  (C or T)
            R       puRine      (A or G)
            W       "Weak"      (A or T)
            S       "Strong"    (C or G)
            K       "Keto"      (T or G)
            M       "aMino"     (C or A)
            B       not A       (C or G or T)
            D       not C       (A or G or T)
            H       not G       (A or C or T)
            V       not T       (A or C or G)
          X,N,?     unknown     (A or C or G or T)
            O       deletion
            -       deletion



                    INPUT FOR THE PROTEIN SEQUENCE PROGRAMS

     The input for the protein sequence programs is fairly standard.  The first
line  contains  the  number  of  species and the number of amino acid positions
(counting any stop codons that you want to include).  These are followed on the
same line by the options.  The only options which need information in the input
file are U (User Tree) and W (Weights).  They are  as  described  in  the  main
documentation file.  If the W (Weights) option is used there must be a W in the
first line of the input file.

     Next come the species data.  Each sequence starts on a  new  line,  has  a
ten-character  species  name  that  must  be blank-filled to be of that length,
followed immediately by the species data in the one-letter code.  The sequences
must  either  be  in  the  "interleaved" or "sequential" formats.  The I option
selects between them.  The sequences can have internal blanks in  the  sequence
but there must be no extra blanks at the end of the terminated line.  Note that
a blank is not a valid symbol for a deletion.

     The protein sequences are given by the one-letter code used  by  the  late
Margaret Dayhoff's group in the Atlas of Protein Sequences, and consistent with
the IUB standard abbreviations.  In the present version it is:














               Symbol               Stands for
               ------               ----------

                 A                     ala
                 B                     asx
                 C                     cys
                 D                     asp
                 E                     glu
                 F                     phe
                 G                     gly
                 H                     his
                 I                     ileu
                 J                  (not used)
                 K                     lys
                 L                     leu
                 M                     met
                 N                     asn
                 O                  (not used)
                 P                     pro
                 Q                     gln
                 R                     arg
                 S                     ser
                 T                     thr
                 U                  (not used)
                 V                     val
                 W                     trp
                 X             unknown amino acid
                 Y                     tyr
                 Z                     glx
                 *                nonsense (stop)
                 ?        unknown amino acid or deletion
                 -                   deletion

where  "nonsense",  and  "unknown"  mean   respectively   a   nonsense   (chain
termination)  codon  and  an amino acid whose identity has not been determined.
The state "asx" means "either asn or asp", and the state  "glx"  means  "either
gln  or  glu"  and the state "deletion" means that alignment studies indicate a
deletion has happened in the ancestry of this position, so that it is no longer
present.   Note  that  if  two  polypeptide  chains  are being used that are of
different length owing to one terminating before the other, they can  be  coded
as (say)

             HIINMA*????
             HIPNMGVWABT

since after the stop codon we do not definitely know  that  there  has  been  a
deletion,  and  do  not  know  what  amino  acid would have been there.  If DNA
studies tell us that there is DNA sequence in that region, then  we  could  use
"X" rather than "?".  Note that "X" means an unknown amino acid, but definitely
an amino acid, while "?" could mean either that or a deletion.   Otherwise  one
will  usually  want  to  use  "?" after a stop codon, if one does not know what
amino acid is there.  If the DNA sequence has been observed there, one probably
ought  to  resist  putting in the amino acids that this DNA would code for, and
one should use "X" instead, because under  the  assumptions  implicit  in  this
either the parsimony or the distance methods, changes to any noncoding sequence
are much easier than changes in a coding region that change the amino acid

     Here are the same one-letter codes tabulated the other way 'round:






              Amino acid               One-letter code
              ----------               ---------------

                ala                           A
                arg                           R
                asn                           N
                asp                           D
                asx                           B
                cys                           C
                gln                           Q
                glu                           E
                gly                           G
                glx                           Z
                his                           H
                ileu                          I
                leu                           L
                lys                           K
                met                           M
                phe                           F
                pro                           P
                ser                           S
                thr                           T
                trp                           W
                tyr                           Y
                val                           V
                deletion                      -
                nonsense (stop)               *
                unknown amino acid            X
                unknown (incl. deletion)      ?


                                  THE OPTIONS

     The programs allow options chosen from their menus.  Many of these are  as
described  in the main documentation file, particularly the options J, O, U, T,
W, and Y.  (Although T has a  different  meaning  in  the  programs  DNAML  and
DNADIST than in the others).

     The U option indicates that user-defined trees are provided at the end  of
the  input  file.   This  happens  in  the usual way, except that for PROTPARS,
DNAPARS,  DNACOMP,  and  DNAMLK,  the  trees  must  be  strictly   bifurcating,
containing  only  two-way  splits,  e.  g.:  ((A,B),(C,(D,E)));.  For DNAML and
RESTML it must have a trifurcation at its base, e. g.:  ((A,B),C,(D,E));.   The
root  of  the  tree  may  in those cases be placed arbitrarily, since the trees
needed are actually unrooted, though they look different when printed out.  The
program  RETREE  should  enable you to reroot the trees without having to hand-
edit or retype them.  For DNAMOVE the U option is not available (although there
is an equivalent feature which uses rooted user trees).

     A feature of the nucleotide sequence programs other than DNAMOVE  is  that
they  save  time  and  computer  memory space by recognizing sites at which the
pattern of bases is the same, and doing their computation only once.   Thus  if
we  have  only  four  species  but a large number of sites, there are (ignoring
ambiguous  bases)  only   about   256   different   patterns   of   nucleotides
(4 x 4 x 4 x 4)  that  can  occur.   The  programs automatically count how many
occurrences there are of each and then only needs to do as much computation  as
would  be  needed  with  256 sites, even though the number of sites is actually
much larger.  If there are ambiguities (such as Y or R nucleotides), these  are
also  handled correctly, and do not cause trouble.  The programs store the full
sequences but reserve  other  space  for  bookkeeping  only  for  the  distinct
patterns.   This saves space.  Thus the programs will run very effectively with



few species and many  sites.   On  larger  numbers  of  species,  if  rates  of
evolution  are  small,  many of the sites will be invariant (such as having all
A's) and thus will mostly have one of four patterns.  The programs will in this
way automatically avoid doing duplicate computations for such sites.