Text Navigation System

PUZZLE

Maximum likelihood analysis for nucleotide, amino acid, and two-state data

Version 4.0.2
April 1999
Copyright 1995-99 by Korbinian Strimmer and Arndt von Haeseler

Korbinian Strimmer, please check http://members.tripod.de/korbi/ for up-to-date email and postal addresses. Current place: email: strimmer@mips.biochem.mpg.de, MIPS, Max-Planck-Institute for Biochemistry, Am Klopferspitz 18a, D-82152 Martinsried/München, Germany.

Arndt von Haeseler, email: haeseler@eva.mpg.de, Max-Planck-Institute for Evolutionary Anthropology, Inselstr. 22, D-04103 Leipzig, Germany.

PUZZLE is a computer program to reconstruct phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimations of support to each internal branch. PUZZLE also computes pairwise maximum likelihood distances as well as branch lengths for user specified trees. Branch lengths can also be calculated under the clock-assumption. In addition, PUZZLE offers a novel method, likelihood mapping, to investigate the support of a hypothesized internal branch without computing an overall tree and to visualize the phylogenetic content of a sequence alignment. PUZZLE also conducts a number of statistical tests on the data set (chi-square test for homogeneity of base composition, likelihood ratio to test the clock hypothesis, Kishino-Hasegawa test). The models of substitution provided by PUZZLE are TN, HKY, F84, SH for nucleotides, Dayhoff, JTT, mtREV24, BLOSUM 62 for amino acids, and F81 for two-state data. Rate heterogeneity is modelled by a discrete Gamma distribution and by allowing invariable sites. The corresponding parameters can be inferred from the data set.

PUZZLE is available free of charge from

http://members.tripod.de/korbi/puzzle/ (KS home page)

http://iubio.bio.indiana.edu/soft/molbio/evolve (IUBio archive www, USA)

ftp://iubio.bio.indiana.edu/molbio/evolve (IUBio archive ftp, USA)

ftp://ftp.ebi.ac.uk/pub/software (European Bioinformatics Institute, UK)

ftp://ftp.pasteur.fr/pub/GenSoft (Institut Pasteur, France)

PUZZLE is written in ANSI C. It will run on most personal computers and workstations if compiled by an appropriate C compiler. Please read the installation section for more details.

We suggest that this documentation should be read before using PUZZLE the first time. If you do not have the time to read this manual completely please do read at least the sections Input/Output Conventions and Quick Start below. Then you should be able to use the PUZZLE program, especially if you have some experience with the PHYLIP programs. The other sections should then be read at a later time.

To find out what's new in version 4.0.2 please read the Version History.

Legal Stuff
Installation

UNIX
MacOS
Windows 95/98/NT
VMS

Introduction
Input/Output Conventions

Sequence Input
General Output
Distance Output
Tree Output
Tree Input
Likelihood Mapping Output

Quick Start
Models of Sequence Evolution

Models of Substitution
Models of Rate Heterogeneity

Available Options
Other Features
Interpretation and Hints

Quartet Puzzling Support Values
Percentage of Unresolved Quartets
Automatic Parameter Estimation
Likelihood Mapping
Batch Mode

Limits and Error Messages
Are Quartets Reliable?
Other Programs
Acknowledgements
References
Version History

Legal Stuff

PUZZLE 4.0.2 is (c) 1995-1999 by Korbinian Strimmer and Arndt von Haeseler. The software and its accompanying documentation are provided as is, without guarantee of support or maintenance. The whole package is licenced under the GNU public licence, except for the parts indicated in the sources where the copyright of the authors does not apply. Please see http://www.opensource.org/gpl-license.html for details.

Installation

The source code of the PUZZLE software is 100% identical across platforms. However, installation procedures differ.

UNIX

Get the file puzzle-402.tar. If you received a compressed tar file (puzzle-402.tar.Z or puzzle-402.tar.gz) you have to decompress it first (using the "uncompress" or "gunzip" command). Then untar the file with

        tar xvf puzzle-402.tar

The newly created directory "puzzle-4.0.2" contains four subdirectories called "doc", "data", "bin", and "src". The "doc" directory contains this manual in HTML format. The "data" directory contains example input files. The "src" folder contains the ANSI C sources of PUZZLE. Switch to this directory by typing

        cd puzzle-4.0.2/src

To compile we recommed the GNU gcc (or GNU egcs) compiler. If gcc is installed just type

        make install

and the executable "puzzle" is compiled and put into the "bin" directory. Then type

        make clean

and everything will be nicely cleaned up. If your compiler is not the GNU gcc compiler you will have to modify the makefile. Usually it is sufficient to change the name of the compiler (CC = cc instead of CC = gcc in the first lines of the makefile). If you still can´t compile properly then your compiler or its runtime library is most probably not ANSI compliant (e.g., old SUN compilers). In most cases,however, you will succeed to compile by changing some parameters in the "makefile". Ask your local Unix expert for help.

If the compilation of the PUZZLE program completed correctly but running the "puzzle" executable does not run the PUZZLE maximum likelihood program rename the executable (e.g., to "puzzle402") as some Unix systems come with a preinstalled game that is also called "puzzle".

MacOS

Get the file puzzle-402.hqx. After decoding this BinHex file (this is done automatically on a properly installed system, otherwise use programs like "StuffIt Expander" or ask your local Mac expert) you will find a folder called "puzzle-4.0.2" on your hard disk. This folder contains the four subdirectories "doc", "data", "bin", and "src". The "doc" folder contains this manual in HTML format. The "data" folder contains example input files. The "bin" folder contains a Macintosh PPC executable with a default memory partition of 3000K. There is no 68k executable. If you get a memory allocation error while running PUZZLE you have to increase PUZZLE´s memory partition with the "Get Info" command of the Macintosh Finder. The "src" folder contains the ANSI C sources of PUZZLE.

The MacOS executables have been compiled for the PowerMac by Florian Burckhardt using Metrowerks CodeWarrior. PUZZLE runs successfully both on System 7 and System 8.

Note: It´s probably a good idea to install PPC Linux (or MkLinux) on your Macintosh. PUZZLE (as any other program) runs 20-50% faster under Linux compared to the same program under MacOS (on the same machine!), and the Mac does not freeze during execution because of Linux´s multitasking capabilities (maybe this changes in MacOS X).

Windows 95/98/NT

Get the file puzzle-402.zip. After uncompressing (using, e.g., WinZip or a similar tool) a folder "puzzle-4.0.2" is created containing four subdirectories called "doc", "data", "bin", and "src". The "doc" directory contains this manual in HTML format. The "data" directory contains example input files. The "src" folder contains the ANSI C sources of PUZZLE. The "bin" directory contains the exectutable "puzzle.exe". To use PUZZLE the system path to the executable needs to be set correctly. Ask your local Windows expert for help.

The executable has been compiled by Norman J. Pieniazek using Microsoft Visual C++ and the "makefile.w32" (contained in "src").

If you have Linux partition on your PC as well we recommended to install and use PUZZLE under Linux (see Unix section) because it runs PUZZLE significantly faster than Windows.

VMS

Get the Unix sources and install the package on your computer (ask your local VMS expert for help). Go to the subfolder "src" and compile PUZZLE using the command file "makefile.com".

Introduction

PUZZLE is an ANSI C application to reconstruct phylogenetic trees from molecular sequence data by maximum likelihood. It implements a fast tree search algorithm, quartet puzzling, that allows analysis of large data sets and automatically assigns estimations of support to each internal branch. Rate heterogeneity (invariable sites plus Gamma distributed rates) is incorporated in all models of substitution available (nucleotides: SH, TN, HKY, F84, and submodels; amino acids: Dayhoff, JTT, mtREV24, BLOSUM 62; two-state data: F81). All parameters including rate heterogeneity can be estimated from the data by maximum likelihood approaches. PUZZLE also computes pairwise maximum likelihood distances as well as branch lengths for user specified trees. In addition, PUZZLE offers a novel method, likelihood mapping, to investigate the support of internal branches without computing an overall tree.

Input/Output Conventions

Sequence Input

PUZZLE requests sequence input in PHYLIP INTERLEAVED format (sometimes also called PHYLIP 3.4 format). Many sequence editors and alignment programs (e.g., CLUSTAL W) output data in this format. In the "data" folder there are four example input files ("globin.a", "marswolf.n", "atp6.a", "primates.b") that can be used as templates for own data files. The default name of the sequence input file is "infile". If an "infile" is not present PUZZLE will prompt the user for an alternative file name. Sequences names in the "infile" are allowed to contain blanks but all blanks will internally be converted to underscores "_". Sequences can be in upper or lower case, any spaces or control characters are ignored. The dot "." is recognized as matching character, it can be used in all sequences except in the first sequence. Valid symbols for nucleotides are A, C, G, T and U, and for amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. All other visible characters (including gaps, question marks etc.) are treated as N (DNA/RNA) or X (amino acids). For two-state data the symbols 0 and 1 are used. The first sequence in the data set is considered the default outgroup.

General Output

All results are written to the file "outfile". If the option "List all unresolved quartets" is invoked a file called "outqlist" is created showing all these quartets. If the option "List puzzling step trees" the file "outpstep" is generated.

Distance Output

PUZZLE automatically computes pairwise maximum likelihood distances for all the sequences in the data file. They are written in the "outfile" and in the separate file "outdist". The format of "outdist" is PHYLIP compatible (i.e. it can directly be used as input for PHYLIP distance-based programs such as "neighbor").

Tree Output

The quartet puzzling tree with its support values for the internal branches and with maximum likelihood branch lengths is displayed as ASCII drawing in the "outfile". The same tree is written into the "outtree" file in CLUSTAL W format. If clock-like maximum-likelihood branch lengths are computed there will be both an unrooted and a rooted tree in the "outfile". The tree convention follows the one adopted by the CLUSTAL W team: the tree topology is described by the usual round brackets (a,b,(c,d)); where branch lengths are written after the colon a:0.22,b:0.33. To be able to display support values for each branch simultaneosly with branch lengths they are displayed as internal node labels, i.e. they follow directly after each node before the branch length to each node. Here is an example:

(Gibbon:0.1393, ((Human:0.0414, Chimpanzee:0.0538)99:0.0175, Gorilla:0.0577)98:0.0531, Orangutan:0.1003);

With the programs TreeView and TreeTool it is possible to view a tree both with its branch lengths and simultaneously with the support values for the internal branches (here 98% and 99%). Note, the PHYLIP programs DRAWTREE and DRAWGRAM may also be used with the CLUSTAL W treefile format. However, in the current version (3.5) they ignore the internal labels and simply print the tree topology along with branch lengths.

Tree Input

PUZZLE optionally also reads input trees. The default name for the file containing the input tree is "intree" but if you choose the input tree option and there is no "intree" present you will be prompted for an alternative name. The format of the input trees is identical to the trees in the "outtree" file. However, it is sufficient to provide the tree topology only, you don't need to specify branch lengths (that are ignored anyway) or internal labels (that are read, stored, and written back to the "outtree" file). The input trees needs not to be unrooted, they can also be rooted. The corresponding the basal bifurcation will automatically be removed. It is important that sequence names in the input tree file do not contain blanks (use underscores!). The tree needs not to be completely resolved, it can be multifurcating. The format of the "intree" file is easy: just put the trees into the file. PUZZLE counts the ';' at the end of each tree description to determine how many input trees there are. Any header (e.g., with the number of trees) is ignored (this is useful in conjunction with programs like MOLPHY that need this header). If there is more than one tree PUZZLE performs the Kishino-Hasegawa test to check whether each tree is significantly worse than the best tree.

Likelihood Mapping Output

PUZZLE also offers likelihood mapping analysis, a method to investigate support for internal branches of a tree without computing an overall tree and to graphically visualize phylogenetic content of a sequence alignment. The results of likelihood mapping are written in ASCII to the "outfile" as well as to a file called "outlm.eps". This file contains in encapsulated Postscript format (EPSF) a picture of the triangle that forms the basis of the likelihood mapping analysis. You may print it out on a Postscript capable printer or view it with a suitable program. The "outlm.eps" file can be edited by hand (it is plain ASCII text!) or by drawing programs that understand the Postcript language (e.g., Adobe Ilustrator).

Quick Start

Prepare your sequence input file "infile" and, optionally, your tree input file "intree". Then start the PUZZLE program. PUZZLE will choose automatically the nucleotide or the amino acid mode. If more than 85% of the characters (not counting the - and ?) in the sequences are A, C, G, T, U or N, it will be assumed that the sequences consists of nucleotides. If your data set contains amino acids PUZZLE suggests whether you have amino acids encoded on mtDNA or on nuclear DNA, and selects the appropriate model of amino acid evolution. If your data set contains nucleotides the default model of sequence evolution chosen is the HKY model. Parameters need not to be specified, they will be estimated by a maximum likelihood procedure from the data. If PUZZLE detects a file called "intree" it automatically switches to the input tree mode.

Then, a menu (PHYLIP "look and feel") appears with default options set. It is possible to change all available options. For example, if you want to incorporate rate heterogeneity you have to select option "w" as rate heterogeneity is switched off by default. Then type "y" at the input prompt and start the analysis. You will see a number of status messages on the screen during computation. When the analysis is finished all output files (e.g., "outfile", "outtree", "outdist", "outqlist", "outlm.eps", "outpstep") will be in the same directory as the input files.

To obtain a high quality picture of the output tree (including node labels) you might want to use use the TreeView program by Roderic Page. It is available free of charge and runs on MacOS and MS-Windows. It can be retrieved from http://taxonomy.zoology.gla.ac.uk/rod/treeview.html. TreeView understands the CLUSTAL W treefile conventions, reads multifurcating trees and is able to simultaneosly display branch lengths and support values for each branch. Open the "outtree" file with TreeView, choose "Phylogram" to draw branch lengths, and select "Show internal edge labels".

On a Unix you can use the TreeTool program to display and manipulate PUZZLE trees (See ftp://rdp.life.uiuc.edu/rdp/programs/TreeTool for precompiled Sun executables. A version that runs on Linux has been prepared by Anders Holmberg from the Dept. of Biochemistry at the Royal Institute of Technology, Stockholm).

Models of Sequence Evolution

Here we give a brief overview over the models implemented in PUZZLE. Formulas are written in TeX style.

Models of Substitution

The substitution process is modelled as reversible time homogeneous stationary Markov process. If the corresponding stationary nucleotide (amino acid) frequencies are denoted pi_i the most general rate matrix for the transition from nucleotide (amino acid) i to j can be written as

                |   Q_{ij} pi_j               for i != j
       R_{ij} = |
                | - Sum_m Q_{im} pi_m         for i == j

The matrix Q_{ij} is symmetric with Q_{ii} == 0 (digonals are zero). For nucleotides the most general model built into PUZZLE is the Tamura-Nei (TN) model. The matrix Q_{ij}for this model equals

                | 4*t*gamma/(gamma+1)         for i -> j pyrimidine transition
                |
       Q_{ij} = | 4*t/(gamma+1)               for i -> j purine transition
                |
                | 1                           for i -> j transversion

The parameter gamma is called the "Y/R transition parameter" whereas t is the "Transition/transversion parameter". If gamma is equal to 1 we get the HKY model (1985). Note, the ratio of the transition and transversion rates (without frequencies) is 2*t = kappa. There is a subtle but important difference between the transition-transversion parameter, the expected transition-transversion ratio, and the observed transition transversion ratio. The transition-transversion parameter simply is a parameter in the rate matrix. The expected transition-transversion ratio is the ratio of actually occuring transitions to actually occuring transversions taking into account nucleotide frequencies in the alignment. Due to saturation and multiple hits not all substitutions are observable. Thus, the observed transition-transversion ratio counts observable transitions and transversions only. If the base frequencies in the HKY model are homogeneous (pi_i = 0.25) HKY further reduces to the Kimura model. In this case t is identical to the expected transition/transversion ratio. If t is set to 0.5 the Jukes-Cantor model is obtained. The F84 model (as implemented in the various PHYLIP programs) is a special case of the Tamura-Nei model.

For amino acids the matrix Q_{ij} is fixed and does not contain any free parameters. Depending on the type of input data four different Q_{ij} matrices are available in PUZZLE. The Dayhoff and JTT matrices are for use with proteins encoded on nuclear DNA, the mtREV24 matrix is for use with proteins encoded on mtDNA, and the BLOSUM 62 model is for more distantly related amino acid sequences (Henikoff and Henikoff 1992).

For doublets (pairs of dependent nucleotides) the SH model is implemented in PUZZLE. The corresponding matrix Q_{ij} reads

                | 2*t         for i -> j transition substitution
                |
       Q_{ij} = | 1           for i -> j transversion substitution
                |
                | 0           for i -> j two substitutions

The SH model basically is a F81 model for single substitutions in doublets.

Models of Rate Heterogeneity

Rate heterogeneity is taken into account by considering invariable sites and by introducing Gamma-distributed rates for the variable sites.

For invariable sites the parameter theta ("Fraction of invariable sites") determines the probability of a given site to be invariable. If a site is invariable the probability for the constant site patterns is pi_i, the frequency of each nucleotide (amino acid).

The rates r for variable sites are determined by a discrete Gamma distribution that approximates the continous Gamma distribution

                    alpha     alpha-1
               alpha         r
       g(r) = ------------------------
                alpha r
               e        Gamma(alpha)

where the parameter alpha ranges from alpha = infinity (no rate heterogeneity) to alpha < 1 (strong heterogeneity). The mean expection of r under this distribution is 1.

In previous versions of PUZZLE instead of alpha the related parameter eta = 1/(1+alpha) was used.

The total rate heterogeneity rho (Gu et al. 1995) of the model of rate heterogeneity combining invariable sites and a Gamma distribution is rho = (1+theta alpha)/(1+alpha).

Available Options

All options can be selected and changed after PUZZLE has read the input file. Depending on the input files options are preselected and displayed in a menu ("PHYLIP look and feel"):

GENERAL OPTIONS
 b                     Type of analysis?  Tree reconstruction
 k                Tree search procedure?  Quartet puzzling
 v       Approximate quartet likelihood?  No
 u             List unresolved quartets?  No
 n             Number of puzzling steps?  1000
 j             List puzzling step trees?  No
 o                  Display as outgroup?  Gibbon
 z     Compute clocklike branch lengths?  No
 e                  Parameter estimates?  Approximate (faster)
 x            Parameter estimation uses?  Neighbor-joining tree
SUBSTITUTION PROCESS
 d          Type of sequence input data?  Nucleotides
 m                Model of substitution?  HKY (Hasegawa et al. 1985)
 t    Transition/transversion parameter?  Estimate from data set
 f               Nucleotide frequencies?  Estimate from data set
RATE HETEROGENEITY
 w          Model of rate heterogeneity?  Uniform rate

Quit [q], confirm [y], or change [menu] settings:

By typing the letters shown in the menu you can either change settings or enter new parameters. Some options (for example "m" and "w") can be invoked several times to switch through a number of different settings. The parameters of the models of sequence evolution can be estimated from the data by a variety of procedures based on maximum likelihood. The analysis is started by typing "y" at the input prompt. To quit the program type "q".

The following table lists in alphabetical order all PUZZLE options. Be aware, however, not all of them are accessible at the same time:

Option Description

a Gamma rate heterogeneity parameter alpha. This is the so-called shape parameter of the Gamma distribution.

b Type of analysis. Allows to switch between tree reconstruction by maximum likelihood and likelihood mapping.

c Number of rate categories (4-16) for the discrete Gamma distribution (rate heterogeneity).

d Data type. Specifies whether nucleotide, amino acid sequences, or two-state data serve as input. The default is automatically set by inspection of the input data.

e Approximation option. Determines whether an approximate or the exact likelihood function is used to estimate parameters of the models of sequence evolution. The approximate likelihood function is in most cases sufficient and is faster.

f Base frequencies. The maximum likelihood calculation needs the frequency of each nucleotide (amino acid, doublet) as input. PUZZLE estimates these values from the sequence input data. This option allows specification of other values.

g Group sequences in clusters. Allows to define clusters of sequences as needed for the likelihood mapping analysis. Only available when likelihood mapping is selected ("b" option).

h Codon positions or definition of doublets. For nucleotide data only. If the TN or HKY model of substitution is used and the number of sites in the alignment is a multiple of three the analysis can be restricted to each of the three codon positions and to the 1st and 2nd positions. If the SH model is used this options allows to specify that the 1st and 2nd codon positions in the alignment define a doublet.

i Fraction of invariable sites. Probability of a site to be invariable. This parameter can be estimated from the data by PUZZLE (only if the approximation option for the likelihood function is turned off).

j List puzzling steps trees. Writes all intermediate trees (puzzling step trees) used to compute the quartet puzzling tree into a file.

k Tree search. Determines how the overall tree is obtained. The topology is either computed with the quartet puzzling algorithm or is defined by the user. Maximum likelihood branch lengths will be computed for this tree. Alternatively, a maximum likelihood distance matrix only can also be computed (no overall tree).

l Location of root. Only for computation of clock-like maximum likelihood branch lengths. Allows to specify the branch where the root should be placed in an unrooted tree topology. For example, in the tree (a,b,(c,d)) l = 1 places the root at the branch leading to sequence a whereas l=5 places the root at the internal branch.

m Model of substitution. The following models are implemented for nucleotides: the Tamura-Nei (TN) model, the Hasegawa et al. (HKY) model, and the Schöniger & von Haeseler (SH) model. The SH model describes the evolution of pairs of dependent nucleotides (pairs are the first and the second nucleotide, the third and the fourth nucleotide and so on). It allows for specification of the transition-transversion ratio. The originally proposed model (Schöniger & von Haeseler 1994) is obtained by setting the transition-transversion parameter to 0.5. The Jukes-Cantor (1969), the Felsenstein (1981), and the Kimura (1980) model are all special cases of the HKY model. For amino acid sequence data the Dayhoff et al. (Dayhoff) model, the Jones et al. (JTT) model, the Adachi and Hasegawa (mtREV24) model, and the Henikoff and Henikoff (BLOSUM 62) substitution model are implemented in PUZZLE. The mtREV24 model describes the evolution of amino acids encoded on mtDNA, and BLOSUM 62 is for distantly related amino acid sequences. For more information please read the section in this manual about models of sequence evolution. See also option "w" (model of rate heterogeneity).

n If tree reconstruction is selected: number of puzzling steps. Parameter of the quartet puzzling tree search (meaning comparable to the number of bootstrap replicates). Generally, the more sequences are used the more puzzling steps are advised. The default value varies depending on the number of sequences (at least 1000).
If likelihood mapping is selected: number of quartets in a likelihood mapping analysis. Equal to the number of dots in the likelihood mapping diagram. By default 10000 dots/quartets are assumed. To use all possible quartets in clustered likelihood mapping you have to specify a value of n=0.

o Outgroup. For displaying purposes of the unrooted quartet puzzling tree only. The default outgroup is the first sequence of the data set.

p Constrain the TN model to the F84 model. This option is only available for the Tamura-Nei model. With this option the expected (!) transition-transversion ratio for the F84 model have to be entered and PUZZLE computes the corresponding parameters of the TN model (this depends on base frequencies of the data). This allows to compare the results of PUZZLE and the PHYLIP maximum likelihood programs which use the F84 model.

q Quits analysis.

r Y/R transition parameter. This option is only available for the TN model. This parameter is the ratio of the rates for pyrimidine transitions and purine transitions. You don't need to specify this parameter as PUZZLE estimates it from the data. For precise definition please read the section in this manual about models of sequence evolution.

s Symmetrize doublet frequencies. This option is only available for the SH model. With this option the doublet frequencies are symmetrized. For example, the frequencies of "AT" and "TA" are then set to the average of both frequencies.

t Transition/transversion parameter. For nucleotide data only. You don't need to specify this parameter as PUZZLE estimates it from the data. The precise definition of this parameter is given in the section on models of sequence evolution in this manual.

u Show unresolved quartets. During the quartet puzzling tree search PUZZLE counts the number of unresolved quartet trees. An unresolved quartet is a quartet where the maximum likelihood values for each of the three possible quartet topologies are so similar that it is not possible to prefer one of them (Strimmer, Goldman, and von Haeseler 1996). If this option is selected you'll get a detailed list of all starlike quartets. Note, for some data sets there may be a lot of unresolved quartets. In this case a list of all unresolved quartets is probably not very useful and also needs a lot of disk space.

v Approximate quartet likelihood. For the quartet puzzling tree search only. Only for very small data sets it is necessary to compute an exact maximum likelihood. For larger data sets this option should always be turned on. This option was hidden in some previous versions of PUZZLE.

w Model of rate heterogeneity. PUZZLE provides several different models of rate heterogeneity: uniform rate over all sites (rate homogeneity), Gamma distributed rates, two rates (1 invariable + 1 variable), and a mixed model (1 invariable rate + Gamma distributed rates). All necessary parameters can be estimated by PUZZLE. Note that whenever invariable sites are taken into account the parameter estimation will invoke the "e" option to use an exact likelihood function. For more detailed information please read the section in this manual about models of sequence evolution. See also option "m" (model of substitution).

x Selects the methods used in the estimation of the model parameters. Neighbor-joining tree means that a NJ tree is used to estimate the parameters. Quartet sampling means that a number of random sets of four sequences are selected to estimate parameters.

y Starts analysis.

z Computation of clock-like maximum likelihood branch lengths. This option also invokes the likelihood ratio clock test.

Other Features

For nucleotide data PUZZLE computes the expected transition/transversion ratio and the expected pyrimidine transition/purine transition ratio corresponding to the selected model. Base frequencies play an important role in the calculation of both numbers.

PUZZLE also tests with a 5% level chi-square-test whether the base composition of each sequence is identical to the average base composition of the whole alignment. All sequences with deviating composition are listed in the output file. It is desired that no sequence (possibly except for the outgroup) has a deviating base composition. Otherwise a basic assumption implicit in the maximum likelihood calculation is violated.

A hidden feature of PUZZLE (since version 2.5) is the employment of a weighting scheme of quartets (Strimmer, Goldman, and von Haeseler 1997) in the quartet puzzling tree search.

PUZZLE also computes the average distance between all pairs of sequences (maximum likelihood distances). The average distances can be viewed as a rough measure for the overall sequence divergence.

If more than one input tree is provided PUZZLE uses the Kishino-Hasegawa test (1989) to check which trees are significantly worse than the best tree.

If clock-like maximum-likelihood branch lengths are computed PUZZLE checks with the help of a likelihood-ratio test (Felsenstein 1988) whether the data set is clock-like.

PUZZLE also detects sequences that occur more than once in the data and that therefore can be removed from the data set to speed up analysis.

If rate heterogeneity is taken into account in the analysis PUZZLE also computes the most probable assignment of rate categories to sequence positions, according Felsenstein and Churchill (1996).

Interpretation and Hints

Quartet Puzzling Support Values

The quartet puzzling (QP) tree search estimates support values for each internal branch. They can be interpreted in much the same way as bootstrap values (though they should not confused with them). Branches showing a QP reliability from 90% to 100% can be considered very strongly supported. Branches with lower reliability (> 70%) can in principle be also trusted but in this case it is advisable to check how well the respective internal branch does in comparison to other branches in the tree (i.e. check relative reliability). If you are interested in a branch with a low confidence it is also important to check the alternative groupings that are not included in the QP tree (they are listed in the outfile in *.** format). There should be a significant gap between the lowest reliability value of the QP tree and the most frequent grouping that is not included in the QP tree.

Percentage of Unresolved Quartets

PUZZLE computes the number and the percentage of completely unresolved maximum likelihood quartets. An unresolved quartet is a quartet where the maximum likelihood values for each of the three possible quartet topologies are so similar that it is not possible to prefer one of them (Strimmer, Goldman, and von Haeseler 1996). The percentage of the unresolved quartets among all possible quartets is an indicator of the suitability of the data for phylogenetic analysis. A high percentage usually results in a highly multifurcating quartet puzzling tree. If you only have a few unresolved quartets we recommend to invoke option "u" to get a list of all these quartets. In a likelihood mapping analysis the percentage of completely unresolved quartets is shown in the central region of the triangle diagram.

Automatic Parameter Estimation

PUZZLE estimates both the parameters of the models of substitution (TN, HKY) and of the model of rate variation (Gamma distribution, fraction of invariable sites) without prior knowledge of an overall tree by a number of different strategies based on maximum likelihood. For all estimated parameters a corresponding standard error (S.E.) is computed. If you have good arguments to choose a different set of parameters than the values obtained by PUZZLE don't hesitate to use them. If sequences are extremly similar it is very hard for every algorithm to extract information about the model of substitution from the data set. Also, be careful if the estimated parameter values are very close to the internal upper and lower bounds:

Parameter (Symbol)	Minimal Value	Maximal Value
Transition/transversion parameter (t)	0.20	30.00
Y/R transition parameter (gamma)	0.10	6.00
Fraction of invariable sites (theta)	0.00	0.99
Gamma rate heterogeneity parameter (alpha)	0.01	99

Likelihood Mapping

Likelihood mapping (Strimmer and von Haeseler 1997) is a method to analyse the support for internal branches in a tree without having to compute an overall tree. Every internal branch in an a completely resolved tree defines up to four clusters of sequences. Sometimes only the relationship of these groups are of interest and not details of the structure of the clusters themselves. Then a likelihood mapping analysis is sufficient. The corresponding likelihood mapping triangle diagrams (as contained in various output files generated by PUZZLE) will illucidate the possible relationships in detail.

Batch Mode

Running PUZZLE from a Unix batch file is straighforward despite the lack of command switches. For eaxmple, to run PUZZLE with a the transition/transversion parameter equal to 10 the follwing lines in a batch file are sufficient:

puzzle << !
t
10
y
!

All other parameters can also be accessed the same way. There is currently no plan to support command switches in future versions of PUZZLE.

Limits and Error Messages

PUZZLE has a built-in limit to allow data sets only up to 257 sequences in order to avoid overflow of internal integer variables. At least 32767 sites should be possible depending on the compiler used. Computation time will be the largest constraint even if sufficient computer memory is available. If rate heterogeneity is taken into account every additional category slows down the overall computation by the amount of time needed for one complete run assuming rate homogeneity.

If problems are encountered PUZZLE terminates program execution and returns a plain text error message. Depending on the severity errors can be classified into three groups:

"HALT " errors: Very severe. You should never ever see one of these messages. If so, please contact the developers!

"Unable to proceed" errors: Harmless but annoying. Mostly memory errors (not enough RAM) or problems with the format of the input files.

Other errors: Completely uncritical. Occur mostly when options of PUZZLE are being set.

A standard machine (1996 Unix workstation) with 32 to 64 MB RAM PUZZLE can easily do maximum likelihood tree searches including estimation of support values for data sets with 50-100 sequences. As likelihood mapping is not memory consuming and computationally quite fast it can be applied to large data sets as well.

Are Quartets Reliable?

Quartets may be intrinsically one of the most difficult phylogenies to resolve accurately (cf. Hillis 1996). It has been asked whether this is a problem for quartet puzzling because it works with quartets.

However, this is not true. According to Hillis findings (Hillis, 1996), quartets can be hard, but extra information helps. That is, if all you have are data on species (A, B, C, D) then it might be relatively difficult to find the correct tree for them. But if you have additional data (species E, F, G, ...) and try to find a tree for all the species, then that part of the tree relating (A, B, C, D) will more likely be correct than if you had just the data for (A, B, C, D). In Hillis's big 'model' tree, there are many examples of subsets of 4 species which in themselves might be hard to resolve correctly, but which are correctly resolved thanks to the (...large amount of...) additional data. PUZZLE (quartet puzzling) also gains advantage from extra data in the same way. It's 'understanding' or resolution of the quartet (A, B, C, D) might be incorrect, but the information on the relationships of (A, B, C, D) implicit in its treatment of (A, B, C, E), (A, B, E, D), (A, E, C, D), (E, B, C, D), (A, B, C, F), (A, B, F, D), (A, F, C, D), (F, B, C, D), (A, B, C, G), etc. etc. should overcome this problem.

The facts about how well PUZZLE actually works have been investigated in the Strimmer and von Haeseler (1996) and Strimmer, Goldman, and von Haeseler (1997) papers. Their results cannot be altered by Hillis's findings. Considered as a heuristic search for maximum likelihood trees, quartet puzzling works very well.

(This section follows N. Goldman, personal communication).

Other Programs

There are a number of other very useful and widespread programs to reconstruct phylogenetic relationships and to analyse molecular sequence data that are available free of charge. Here are the URLS of some web pages that provide links to most of them (including the PHYLIP package and the MOLPHY and PAML maximum likelihood programs):

Joe Felsenstein's list of programs (well-organized and pretty exhaustive):

http://evolution.genetics.washington.edu/phylip/software.html

"Tree of Life" software page:

http://phylogeny.arizona.edu/tree/programs/programs.html

European Bioinformatics Institute:

http://www.ebi.ac.uk/biocat/biocat.html

Acknowledgements

The maximum likelihood kernel of PUZZLE is an offspring of the program NucML/ProtML version 2.2 by Jun Adachi and Masami Hasegawa (ftp://sunmh.ism.ac.jp/pub/molphy). We thank them for generously allowing us to use the source code of their program. The maze as icon for PUZZLE was suggested by Joe Felsenstein. We thank Lars S. Jermiin and Ross Crozier for submitting a very detailed error report on PUZZLE 4.0 that helped to update PUZZLE to version 4.0.1. We thank David R. Bell, José Castresana, B. L. Cohen, Ross Crozier, John Harshman, Axel Janke, Bailey D. Kessing, Catherine Letondal, Sonja Meyer, Norman J. Pieniazek, Marta Riutort, Andrew J. Roger, Mika Salminen, Richard H. Thomas, Steven M. Thompson, and Andrew Wang for valuable comments and beta testing. Norman J. Pieniazek also prepared the Windows 95/98/NT version while Florian Burckhardt compiled the Macintosh executable. We would also like to thank the European Bioinformatics Institute (EBI), the Institut Pasteur, and the University of Indiana (i.e. Don Gilbert) for kindly distributing the PUZZLE program and the Deutsche Forschungsgemeinschaft for financial support.

References

Adachi, J., and M. Hasegawa. 1996. MOLPHY: programs for molecular phylogenetics, version 2.3. Institute of Statistical Mathematics, Tokyo.

Adachi, J., and M. Hasegawa. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42: 459-468.

Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. In: Dayhoff, M. O. (ed.) Atlas of Protein Sequence Structur, Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington DC, pp. 345-352.

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17: 368-76.

Felsenstein, J. 1988. Phylogenies from molecular sequences: Inference and reliability. Annu. Rev. Genet. 22: 521-565.

Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.

Felsenstein, J., and G.A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol.13: 93-104.

Gu, X., Y.-X. Fu, and W.-H. Li. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12: 546-557.

Hasegawa, M., H. Kishino, and K. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol.22: 160-174.

Henikoff, S., J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. PNAS (USA) 89:10915-10919.

Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. In: Munro, H. N. (ed.) Mammalian Protein Metabolism, New York: Academic Press, pp. 21-132.

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. CABIOS 8: 275-282.

Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.

Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29: 170-179.

Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.

Tamura K. 1994. Model selection in the estimation of the number of nucleotide substitutions. Mol. Biol. Evol. 11: 154-157.

Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22: 4673-4680.

Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 1406-425.

Schöniger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol.3: 240-247.

Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13: 964-969.

Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14: 210-211.

Strimmer, K., and A. von Haeseler. 1997. Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. PNAS (USA). 94:6815-6819.

Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol.39:306-314.

Version History

The PUZZLE program has first been distributed in 1995. Since then it has been continually improved. Here is a list of the most important changes.

4.0.2	Update to provide precompiled Windows 95/98/NT executables. In addition: Internal rearrangement of rate matrices. Improved BLOSUM 62 matrix. Endless input loop for input files restricted to 10 trials. Source code clean up to remove compile time warnings. Explicit quit option in menu. Changes in NJ tree code. Updates of documentation (address changes, correction of errors).
4.0.1	Maintainance release. Correction of mtREV matrix. Fix of the "intree bug". Removal of stringent runtime-compatibility check to allow out-of-the-box compile on alpha. More accurate gamma distribution allowing 16 instead of 8 categories and ensuring a better alpha > 1.0. Update of documentation (mainly address changes). More Unix-like file layout, and change of licence to GPL.
4.0	Executables for Windows 95/NT and OS/2 instead of MS-DOS. Computation of clock-like branch lengths (also for amino acids and for non-binary trees). Automatic likelihood ratio clock test. Model for two-state sequences data (0,1) included. Display of most probable assignment of rates to sites. Identification of groups of identical sequences. Possibility to read multiple input trees. Kishino-Hasegawa test to check whether trees are significantly different. BLOSUM 62 model of amino acid substitution (Henikoff-Henikoff 1992). Use of parameter alpha instead of eta (for rate heterogeneity). Improvements to user interface. SH model can be applied to 1st and 2nd codon positions. Automatic check for compatible compiler settings. Workaround for severe runtime problem when the gcc compiler was used.
3.1	Much improved user interface to rate heterogeneity (less confusing menu, rearranged outfile, additional out-of-range check). Possibility to read rooted input trees (automatic removal of basal bifurcation). Computation of average distance between all pairs of sequences. Fix of a bug that caused PUZZLE 3.0 to crash on some systems (DEC Alpha). Cosmetic changes in program and documentation.
3.0	Rate heterogeneity included in all models of substitution (Gamma distribution plus invariable sites). Likelihood mapping analysis with Postscript output added. Much more sophisticated maximum likelihood parameter estimation for all model parameters including those of rate heterogeneity. Codon positions selectable. Update to mtREV24. New icon. Less verbose runtime messages. HTML documentation. Better internal error classification. More information in outfile (number of constant postions etc.).
2.5.1	Fix of a bug (present only in version 2.5) related to computation of the variance of the maximum likelihood branch lengths that caused occasional crashs of PUZZLE on some systems when applied to data sets containing many very similar sequences. Drop of support for non-FPU Macintosh version. Corrections in manual.
2.5	Improved QP algorithm (Strimmer, Goldman, and von Haeseler 1997). Bug fixes in ML engine, computation of ML distances and ML branch lengths, optional input of a user tree, F84 model added, estimation of all TN model parameters and corresponding standard errors, CLUSTAL W treefile convention adopted to allowe to show branch lengths and QP support values simultaneously, display of unresolved quartets, update of mtREV matrix, source code more compatible with some almost-ANSI compilers, more safety checks in the code.
2.4	Automatic data type recognition, chi-square-test on base composition, automatic selection of best amino acid model, estimation of transition-transversion parameter, ASCII plot of quartet puzzling tree into the outfile.
2.3	More models, many usability improvements, built-in consensus tree routines, more supported systems, bug fixes, no more dependencies of input order. First EBI distributed version.
2.2	Optimized internal data structure requiring much less computer memory. Bug fixes.
2.1	Bug fixes concerning algorithm and transition/transversion parameter.
2.0	Complete revision merging the maximum likelihood and the quartet puzzling routines into one user friendly program. First electronic distribution.
1.0	First public release, presented at the 1995 phylogenetic workshop (15-17 June 1995) at the University of Bielefeld, Germany.

Option	Description
a	Gamma rate heterogeneity parameter alpha. This is the so-called shape parameter of the Gamma distribution.
b	Type of analysis. Allows to switch between tree reconstruction by maximum likelihood and likelihood mapping.
c	Number of rate categories (4-16) for the discrete Gamma distribution (rate heterogeneity).
d	Data type. Specifies whether nucleotide, amino acid sequences, or two-state data serve as input. The default is automatically set by inspection of the input data.
e	Approximation option. Determines whether an approximate or the exact likelihood function is used to estimate parameters of the models of sequence evolution. The approximate likelihood function is in most cases sufficient and is faster.
f	Base frequencies. The maximum likelihood calculation needs the frequency of each nucleotide (amino acid, doublet) as input. PUZZLE estimates these values from the sequence input data. This option allows specification of other values.
g	Group sequences in clusters. Allows to define clusters of sequences as needed for the likelihood mapping analysis. Only available when likelihood mapping is selected ("b" option).
h	Codon positions or definition of doublets. For nucleotide data only. If the TN or HKY model of substitution is used and the number of sites in the alignment is a multiple of three the analysis can be restricted to each of the three codon positions and to the 1st and 2nd positions. If the SH model is used this options allows to specify that the 1st and 2nd codon positions in the alignment define a doublet.
i	Fraction of invariable sites. Probability of a site to be invariable. This parameter can be estimated from the data by PUZZLE (only if the approximation option for the likelihood function is turned off).
j	List puzzling steps trees. Writes all intermediate trees (puzzling step trees) used to compute the quartet puzzling tree into a file.
k	Tree search. Determines how the overall tree is obtained. The topology is either computed with the quartet puzzling algorithm or is defined by the user. Maximum likelihood branch lengths will be computed for this tree. Alternatively, a maximum likelihood distance matrix only can also be computed (no overall tree).
l	Location of root. Only for computation of clock-like maximum likelihood branch lengths. Allows to specify the branch where the root should be placed in an unrooted tree topology. For example, in the tree (a,b,(c,d)) l = 1 places the root at the branch leading to sequence a whereas l=5 places the root at the internal branch.
m	Model of substitution. The following models are implemented for nucleotides: the Tamura-Nei (TN) model, the Hasegawa et al. (HKY) model, and the Schöniger & von Haeseler (SH) model. The SH model describes the evolution of pairs of dependent nucleotides (pairs are the first and the second nucleotide, the third and the fourth nucleotide and so on). It allows for specification of the transition-transversion ratio. The originally proposed model (Schöniger & von Haeseler 1994) is obtained by setting the transition-transversion parameter to 0.5. The Jukes-Cantor (1969), the Felsenstein (1981), and the Kimura (1980) model are all special cases of the HKY model. For amino acid sequence data the Dayhoff et al. (Dayhoff) model, the Jones et al. (JTT) model, the Adachi and Hasegawa (mtREV24) model, and the Henikoff and Henikoff (BLOSUM 62) substitution model are implemented in PUZZLE. The mtREV24 model describes the evolution of amino acids encoded on mtDNA, and BLOSUM 62 is for distantly related amino acid sequences. For more information please read the section in this manual about models of sequence evolution. See also option "w" (model of rate heterogeneity).
n	If tree reconstruction is selected: number of puzzling steps. Parameter of the quartet puzzling tree search (meaning comparable to the number of bootstrap replicates). Generally, the more sequences are used the more puzzling steps are advised. The default value varies depending on the number of sequences (at least 1000). If likelihood mapping is selected: number of quartets in a likelihood mapping analysis. Equal to the number of dots in the likelihood mapping diagram. By default 10000 dots/quartets are assumed. To use all possible quartets in clustered likelihood mapping you have to specify a value of n=0.
o	Outgroup. For displaying purposes of the unrooted quartet puzzling tree only. The default outgroup is the first sequence of the data set.
p	Constrain the TN model to the F84 model. This option is only available for the Tamura-Nei model. With this option the expected (!) transition-transversion ratio for the F84 model have to be entered and PUZZLE computes the corresponding parameters of the TN model (this depends on base frequencies of the data). This allows to compare the results of PUZZLE and the PHYLIP maximum likelihood programs which use the F84 model.
q	Quits analysis.
r	Y/R transition parameter. This option is only available for the TN model. This parameter is the ratio of the rates for pyrimidine transitions and purine transitions. You don't need to specify this parameter as PUZZLE estimates it from the data. For precise definition please read the section in this manual about models of sequence evolution.
s	Symmetrize doublet frequencies. This option is only available for the SH model. With this option the doublet frequencies are symmetrized. For example, the frequencies of "AT" and "TA" are then set to the average of both frequencies.
t	Transition/transversion parameter. For nucleotide data only. You don't need to specify this parameter as PUZZLE estimates it from the data. The precise definition of this parameter is given in the section on models of sequence evolution in this manual.
u	Show unresolved quartets. During the quartet puzzling tree search PUZZLE counts the number of unresolved quartet trees. An unresolved quartet is a quartet where the maximum likelihood values for each of the three possible quartet topologies are so similar that it is not possible to prefer one of them (Strimmer, Goldman, and von Haeseler 1996). If this option is selected you'll get a detailed list of all starlike quartets. Note, for some data sets there may be a lot of unresolved quartets. In this case a list of all unresolved quartets is probably not very useful and also needs a lot of disk space.
v	Approximate quartet likelihood. For the quartet puzzling tree search only. Only for very small data sets it is necessary to compute an exact maximum likelihood. For larger data sets this option should always be turned on. This option was hidden in some previous versions of PUZZLE.
w	Model of rate heterogeneity. PUZZLE provides several different models of rate heterogeneity: uniform rate over all sites (rate homogeneity), Gamma distributed rates, two rates (1 invariable + 1 variable), and a mixed model (1 invariable rate + Gamma distributed rates). All necessary parameters can be estimated by PUZZLE. Note that whenever invariable sites are taken into account the parameter estimation will invoke the "e" option to use an exact likelihood function. For more detailed information please read the section in this manual about models of sequence evolution. See also option "m" (model of substitution).
x	Selects the methods used in the estimation of the model parameters. Neighbor-joining tree means that a NJ tree is used to estimate the parameters. Quartet sampling means that a number of random sets of four sequences are selected to estimate parameters.
y	Starts analysis.
z	Computation of clock-like maximum likelihood branch lengths. This option also invokes the likelihood ratio clock test.

"HALT " errors:	Very severe. You should never ever see one of these messages. If so, please contact the developers!
"Unable to proceed" errors:	Harmless but annoying. Mostly memory errors (not enough RAM) or problems with the format of the input files.
Other errors:	Completely uncritical. Occur mostly when options of PUZZLE are being set.