index pagewebsites.jpg

banner

PROTEIN PHYLOGENETICS

PAUP4.0* can be used for an impressive range of analytical methods involving DNA alignments.

This, unfortunately is not the case for estimating protein phylogenies. Only protein parsimony analyses can be performed in PAUP (a very simple distance method based on straight similarity can be performed as well but it is not recommended, particularly for divergent sequences).

For sophisticated distance and maximum likelihood analyses we need to use alternative programs. The following practical aims at making you familiar with three sets of programs that can be used for protein distance and protein maximum likelihood analyses, they include:
Several programs fromPHYLIP3.5 http://evolution.genetics.washington.edu/phylip/versions.html
PUZZLE 4.0 http://members.tripod.de/korbi/puzzle/READMEpuzzleboot.txt
PROTML 2.2 http://evolution.genetics.washington.edu/phylip/software.etc1.html#MOLPHY
All three programs can be used free of charge - see:

http://evolution.genetics.washington.edu/phylip/software.xref.html

For more information on these programs and many others, this is an excellent web page written and maintained by Joe Felsenstein, packed with useful information on programs for phylogeny. All three programs can read the PHYLIP format simplifying their use in parallel.

PHYLIP and PUZZLE have similar and straight forward menu-driven options whereas PROTML makes use of a command line interface.

PHYLIP is one of the most comprehensive packages for phylogenetic analyses, it is developed by Joe Felsenstein. It contains a broad range of methods for DNA (parsimony, distance and maximum likelihood) and protein analyses including protein distances and a protein parsimony method.

You will be using during these exercises several programs available in the PHYLIP package in an effort to infer protein distance trees and perform bootstrapping. For protein maximum likelihood analyses we shall use the program PUZZLE4.0 and PROTML2.2, the later being from the MOLPHY package.

In addition, PUZZLE can also be used in conjunction with PHYLIP to perform more complex distance estimates. Of particular interest in PUZZLE is the broad range of protein evolutionary models with several matrices (e.g. PAM, JTT and BLOSUM62) and the possibility to incorporate a correction for rate heterogeneity between sites using a discreet gamma shape parameter with the possibility of assuming a fraction of constant sites.

The dataset you will be using has 6 taxa and 760 aligned amino acids positions. It is a subset of the alignment of the largest subunit of the RNA polymerase II that we have analyzed in detail to investigate the phylogeny of Microsporidia (as discussed during the lectures - see also Hirt et al. 1999 for more details). It is in the PHYLIP format which can be used by all the programs you shall be using during these practicals. There are also two additional files to be used for user-defined tree comparisons (i.e. the Kishino-Hasegawa test) and a command line file for the use of PUZZLEBOOT. You will perform all these analyses using telnet. Connect to gene.dbbm.fiocruz.br and open your account.



1) PHYLIP programs.

PHYLIP3.52 programs can be run on several platforms including UNIX and Macintosh machines. Programs can be used to perform parsimony, distance and maximum likelihood analyses of both DNA and protein datasets. There are also two programs for bootstrapping and calculating majority-rule consensus trees. For DNA analyses PAUP has essentially superceded PHYLIP in terms of the diversity and complexity of DNA evolution models. PHYLIP however is still very useful for protein analyses. During these practicals you will be using the following programs from PHYLIP:

PROTDIST: calculate pairwise distances from protein alignments, it allows the calculation of distances with several mutational data matrices including the PAM matrix.

NEIGHBOR: calculate a neighbor-joining (NJ) tree from a distance matrix. It does not assume a molecular clock. NJ trees are heuristic estimates of minimum evolution trees but no alternative trees are compared (NJ is an algorithm and does not have an objective function), a unique tree is always produced without any idea of the quality of the tree. Its main advantage is its speed of execution, which might be considered to be a significant advantage when numerous taxa have to analyzed.

Bootstrapping is recommended in an effort to obtain some information on support for nodes within the tree. If you analyze few taxa, say less than 40, it is advisable to use the FITCH program described below. If the distances are additive (or nearly additive), the NJ method can identify the same tree as FITCH (see below).

FITCH: calculate a least-square tree from a distance matrix. It does not assume a molecular clock. If the distance are additive or close to be additive the program should find the best tree. The program conducts a search of tree-space and the best tree which optimizes the difference between calculated distances and the inferred distances on the tree is selected.

SEQBOOT: produces bootstrap replicates from DNA or protein alignments. I is simply a method of producing many resampled datasets from the original one.  The bootstrapped data is then used by one of the distance (or parsimony or likelihood) calculation programs.

CONSENSE: calculate a majority-rule consensus tree, is used in conjunction with SEQBOOT, PROTDIST and any tree inference program such as FITCH or NEIGHBOR.

SEQBOOT and CONSENSE can also be used in conjunction with PUZZLE, (see PUZZLEBOOT below), to allow more complex models to be used to estimate protein pairwise distances.

2) Puzzle

GENERAL OPTIONS
 b                     Type of analysis?  Tree reconstruction
 k                Tree search procedure?  Quartet puzzling
 v       Approximate quartet likelihood?  No
 u             List unresolved quartets?  No
 n             Number of puzzling steps?  1000
 j             List puzzling step trees?  No
 o                  Display as outgroup?  Gibbon
 z     Compute clocklike branch lengths?  No
 e                  Parameter estimates?  Approximate (faster)
 x            Parameter estimation uses?  Neighbor-joining tree
SUBSTITUTION PROCESS
 d          Type of sequence input data?  Nucleotides
 m                Model of substitution?  HKY (Hasegawa et al. 1985)
 t    Transition/transversion parameter?  Estimate from data set
 f               Nucleotide frequencies?  Estimate from data set
RATE HETEROGENEITY
 w          Model of rate heterogeneity?  Uniform rate

Quit [q], confirm [y], or change [menu] settings:

First you have to download the files to be used during the exercises

- inf6.760 a PHYLIP file containing the protein alignment (6 taxa, 760 aligned amino acids)
- utrees.5 a PHYLIP file containing 5 trees that were obtained with PROTML. Will be used to perform Kishino-Hasegawa tests with PROTML and PUZZLE (see later exercises).
- puzzle.cmds a command line file to instruct PUZZLE (using PUZZLEBOOT) to perform protein distance bootstrapping with the models available in PUZZLE.

<< Prev | Next >>