PILEUP⁽⁺⁾

PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

DESCRIPTION

PileUp creates a multiple sequence alignment using a simplification of the progressive alignment method of Feng and Doolittle (Journal of Molecular Evolution 25; 351-360 (1987)). The method used is similar to the method described by Higgins and Sharp (CABIOS 5; 151-153 (1989)).

The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences can be aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments that include increasingly dissimilar sequences and clusters, until all sequences have been included in the final pairwise alignment.

Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree representation of clustering relationships. It is this dendrogram that directs the order of the subsequent pairwise alignments. PileUp can plot this dendrogram so that you can see the order of the pairwise alignments that created the final alignment.

As a general rule, PileUp can align up to 500 sequences, with any single sequence in the final alignment restricted to a maximum length of 7,000 characters (including gap characters inserted into the sequence by PileUp to create the alignment). However, if you include long sequences in the alignment, the number of sequences PileUp can align decreases. See the RESTRICTIONS topic, below, for a more complete discussion of sequence number and size limitations.

EXAMPLE

Here is a session using PileUp to create a multiple sequence alignment of an unaligned group of 70 kd heat shock and heat shock cognate protein sequences:


% pileup

 PileUp of what sequences ?  @hsp70.list

   1      Hs70_Brelc   676 aa
   2      Hs70_Chick   634 aa

  ///////////////////////////

  27      Hs74_Yeast   641 aa
  28      Dnak_Ecoli   637 aa

 What is the gap creation penalty (* 12 *) ?

 What is the gap extension penalty (* 4 *) ?

 This program can display the clustering relationships graphically.
 Do you want to:

     A) Plot to a FIGURE file called "pileup.figure"
     B) Plot graphics on LaserWriter attached to /dev/tty10
     C) Suppress the plot

 Please choose one (* A *):

 The minimum density for a one-page plot is 20.0 sequences/100 platen units.
 What density do you want (* 20.0 *) ?

 What should I call the output file name (* hsp70.msf *) ?

 Determining pairwise similarity scores...

   1   x     2       3.66
   1   x     3       3.69

 ////////////////////////

  26   x    28       2.18
  27   x    28       2.05

 Aligning...

   1     ................................-....
   2     ................................-.
         ................................-....

 /////////////////////////////////////////////////////////////

  26     ...............................-....
  27     .................................-.
         .................................-....

  FIGURE instructions are now being written into pileup.figure

        Total sequences:         28
       Alignment length:        720
               CPU time:   01:29.25

            Output file: hsp70.msf

%

SCREEN MONITOR

PileUp names each sequence to be aligned as it is read in. It then displays the message Determining pairwise similarity scores... and shows a quality ratio for every pairwise alignment. This ratio is the alignment's quality divided by the length of the shorter sequence. If x is the number of sequences to be aligned, there are (x(x-1))/2 pairwise alignments whose ratio must be calculated.

Next, PileUp displays the message Aligning... as it performs each of the pairwise alignments that together create the final multiple sequence alignment. There are x-1 alignments in this part of the program.
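The counts described above can be checked with a short Python sketch. The function names are illustrative only and are not part of PileUp; the quality value passed to quality_ratio is a made-up number chosen to show the arithmetic.

```python
def pairwise_comparisons(x: int) -> int:
    """Number of pairwise similarity scores for x sequences: x(x-1)/2."""
    return x * (x - 1) // 2


def progressive_alignments(x: int) -> int:
    """Number of alignment steps needed to merge x sequences into one cluster."""
    return x - 1


def quality_ratio(quality: float, len_a: int, len_b: int) -> float:
    """Alignment quality divided by the length of the shorter sequence."""
    return quality / min(len_a, len_b)


n = 28  # number of sequences in the example session
print(pairwise_comparisons(n))    # 378
print(progressive_alignments(n))  # 27

# Hypothetical quality of 1268.0 for sequences of 676 and 634 residues.
print(quality_ratio(1268.0, 676, 634))  # 2.0
```

For the 28 sequences in the example session, this gives 378 similarity scores in the first phase and 27 alignments in the second.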

OUTPUT

Below is some of the output file containing the multiple sequence alignment. By default, similar sequences are positioned close to each other in the output file, but if you put -NOSORt on the command line, the aligned sequences are listed in the same order as they were presented to the program.


!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @hsp70.list

 Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 6430

                   GapWeight: 12
             GapLengthWeight: 4

 hsp70.msf  MSF: 718  Type: P  October 2, 1996 09:56  Check: 1236 ..

 Name: HS70_PLAFA       Len:   718  Check: 1012  Weight:  1.00
 Name: HS70_THEAN       Len:   718  Check: 8201  Weight:  1.00

 /////////////////////////////////////////////////////////////

 Name: DNAK_ECOLI       Len:   718  Check: 7946  Weight:  1.00

//

            1                                                   50
HS70_PLAFA  ~~~~~~~~~~ ~~~~~MASAK GSKPNLPESN IAIGIDLGTT YSCVGVWRNE
HS70_THEAN  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~MTG PAIGIDLGTT YSCVAVYKDN
HS70_LEIDO  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
HS70_LEIMA  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTFD GAIGIDLGTT YSCVGVWQNE
HS74_TRYBB  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTYE GAIGIDLGTT YSCVGVWQNE
HS70_TRYCR  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MTYE GAIGIDLGTT YSCVGVWQNE
HS71_YEAST  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~S KAVGIDLGTT YSCVAHFAND
HS72_YEAST  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~S KAVGIDLGTT YSCVAHFSND
HS74_YEAST  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~S KAVGIDLGTT YSCVAHFAND
HS70_MAIZE  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MAKSEG PAIGIDLGTT YSCVGLWQHD
HS7C_PETHY  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAGKGEG PAIGIDLGTT YSCVGVWQHD
HS7C_HUMAN  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKG PAVGIDLGTT YSCVGVFQHG
  HS7C_RAT  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKG PAVGIDLGTT YSCVGVFQHG
HS7C_MOUSE  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKG PAVGIDLGTT YSCVGVFQHG
HS70_CHICK  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MSGKG PAIGIDLGTT YSCVGVFQHG
HS72_MOUSE  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MSARG PAIGIDLGTT YSCVGVFQHG
HS71_HUMAN  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MAKA AAIGIDLGTT YSCVGVFQHG
HS71_MOUSE  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MAKN TAIGIDLGTT YSCVGVFQHG
HS7T_MOUSE  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MAANKG MAIGIDLGTT YSCVGVFQHG
HS70_XENLA  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~MATKG VAVGIDLGTT YSCVGVFQHG
HS7A_CAEEL  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~MSKH NAVGIDLGTT YSCVGVFMHG
HS76_HUMAN  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~MQAPRE LAVGIDLGTT YSCVGVFQQG
HS72_DROME  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~M PAIGIDLGTT YSCVGVYQHG
HS70_BRELC  ~~~~~~~~~~ ~~~~~~~~~~ ~~~MAQSVSG YSVGIDLGTT YSCVGVWQND
GR78_YEAST  ~~~~~~~~~~ ~~~~~~~~~~ ~~ADDVENYG TVIGIDLGTT YSCVAVMKNG
HS75_YEAST  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~AEGVFQ GAIGIDLGTT YSCVATYESS
HS77_YEAST  MLAAKNILNR SSLSSSFRIA TRLQSTKVQG SVIGIDLGTT NSAVAIMEGK
DNAK_ECOLI  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~G KIIGIDLGTT NSCVAIMDGT

//////////////////////////////////////////////////////////////////

The gaps at the ends of each sequence are written as tildes (~) which may represent differences in input sequence lengths rather than missing characters or significant differences in the alignment. Internal gaps in each sequence are written as periods (.). When you create an end-weighted alignment in PileUp by adding -ENDWeight to the command line, gaps at the ends of each sequence are written as periods since those gaps may represent missing characters or significant differences in the alignment. See Appendix III for more information about the two different gap characters.

DENDROGRAM

PileUp can plot a dendrogram like the one below that shows the clustering relationships used to determine the order of the pairwise alignments that together create the final multiple sequence alignment. Distance along the vertical axis is proportional to the difference between sequences; distance along the horizontal axis has no significance at all. The interpretation of the dendrogram is discussed in the ALGORITHM topic below.

INPUT FILES

PileUp accepts multiple (two or more) nucleotide sequences or multiple (two or more) protein sequences as input. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of PileUp depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

If the input sequences are named in a list file, you can specify the reverse complement strand of any particular nucleotide sequence in the list as input by using the strand:- sequence attribute. You can restrict the range of interest for any particular sequence with appropriate sequence attributes like Begin:43 and End:682. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for more information about sequence attributes in list files.) For example:


This is part of a list file suitable for input to PILEUP.

                   July 24, 1994  ..

SW:Hs77_Yeast
SW:Gr78_Yeast        Begin:43 End:682
SW:HS74_Yeast

///////////////////////////////////////

You can limit the range of interest for all of the sequences in the alignment by including expressions like -BEGin=20 and -END=70 on the command line. The command-line range limiters take precedence over the range limiters for sequences in a list file when both are used. If no range limitation is specified, the entire length of each sequence is aligned.

You can force the program to align the forward strand of all nucleotide sequences by including -NOREVerse on the command line. Conversely, you can force the program to align the reverse complement strand for all nucleotide sequences by including -REVerse on the command line. The command-line strand specification takes precedence over the strand specifications for sequences in a list file when both are used. If no strands are specified, the forward strands of all nucleotide sequences are aligned.

RELATED PROGRAMS

LineUp is a screen editor for editing multiple sequence alignments. You can edit up to 30 sequences simultaneously. New sequences can be typed in by hand or added from existing sequence files. A consensus sequence identifies places where the sequences are in conflict.

Pretty displays multiple sequence alignments and calculates a consensus sequence. It does not create the alignment; it simply displays it.

PlotSimilarity plots the running average of the similarity among the sequences in a multiple sequence alignment.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for new sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between a sequence and a group of aligned sequences represented as a profile.

The Wisconsin Package includes several programs for evolutionary analysis of multiple sequence alignments. Distances creates a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions.

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps.

RESTRICTIONS

As shipped, PileUp restricts each sequence in the final alignment to a maximum length of 7,000 characters. This maximum length includes the input sequence length plus the total length of all gap characters inserted into the sequence to create the final alignment. By default, each input sequence is restricted to a maximum length of 5,000. Also by default, PileUp can add a maximum of 2,000 gap characters for each sequence in the final alignment.

If you wish to align longer sequences, then you can specify a maximum sequence length of up to 7,000 with the -MAXSeg command-line parameter (e.g. -MAXSeg=6000). If you increase the maximum sequence length in this way, then the maximum amount of allowed gapping is automatically reduced so that the final aligned sequence length cannot exceed 7,000 for any sequence.

If you wish to allow for more gapping in the final alignment, then you can specify a maximum number of gap characters for each sequence with the -MAXGap command-line parameter (e.g. -MAXGap=3000). If you increase the maximum amount of gapping permitted for each sequence in this way, the maximum sequence length is automatically decreased so that the final aligned sequence length cannot exceed 7,000 for any sequence.

As shipped, the total length of all of the sequences read into PileUp (including the gap allowance for each sequence) cannot be greater than 2,000,000. By reducing the gap allowance for each sequence using the -MAXGap command-line parameter, you can increase the number of sequences that can be read into the program up to the maximum of 500 sequences.
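The interaction between these limits can be sketched in Python. This is only an arithmetic illustration of the rules stated above, not PileUp's own code. It assumes that every input sequence has the same length and that the total is counted as the sum of each sequence length plus its gap allowance; the function names are hypothetical.

```python
MAX_ALIGNED_LENGTH = 7000     # per-sequence limit in the final alignment
DEFAULT_MAX_SEG = 5000        # default -MAXSeg
DEFAULT_MAX_GAP = 2000        # default -MAXGap
MAX_TOTAL_LENGTH = 2_000_000  # combined length of all sequences plus gap allowances
MAX_SEQUENCES = 500


def limits_for_maxseg(max_seg: int) -> tuple[int, int]:
    """Return (max_seg, max_gap) when only -MAXSeg is given."""
    if not 0 < max_seg <= MAX_ALIGNED_LENGTH:
        raise ValueError("maximum segment length must be between 1 and 7000")
    # The gap allowance shrinks so that max_seg + max_gap never exceeds 7000.
    return max_seg, min(DEFAULT_MAX_GAP, MAX_ALIGNED_LENGTH - max_seg)


def limits_for_maxgap(max_gap: int) -> tuple[int, int]:
    """Return (max_seg, max_gap) when only -MAXGap is given."""
    if not 0 <= max_gap < MAX_ALIGNED_LENGTH:
        raise ValueError("maximum gap length must be between 0 and 6999")
    # The segment length shrinks so that max_seg + max_gap never exceeds 7000.
    return min(DEFAULT_MAX_SEG, MAX_ALIGNED_LENGTH - max_gap), max_gap


def max_sequences(seq_length: int, max_gap: int) -> int:
    """Upper bound on sequence count, assuming all sequences share seq_length."""
    per_sequence = seq_length + max_gap
    return min(MAX_SEQUENCES, MAX_TOTAL_LENGTH // per_sequence)


print(limits_for_maxseg(6000))    # (6000, 1000)
print(limits_for_maxgap(3000))    # (4000, 3000)
print(max_sequences(5000, 2000))  # 285
print(max_sequences(5000, 500))   # 363
```

Under these assumptions, reducing the gap allowance from 2,000 to 500 for sequences of length 5,000 raises the bound from 285 to 363 sequences.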

The surface of comparison (see the CONSIDERATIONS topic for an explanation) is limited to 2,250,000.

All of these limits are adjustable (see the CONSIDERATIONS topic below).

ALGORITHM

A rigorously optimal alignment of even a small number of short sequences would be intractable, both in terms of memory and time. Therefore, PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final alignment. A cluster consists of two or more already-aligned sequences.

PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).

The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment. For example:

PileUp uses this clustering order and first aligns the two most-related sequences to each other in order to produce the first cluster. It then aligns the next most related sequence to this cluster or the next two most-related sequences to each other in order to produce another cluster. A series of such pairwise alignments that includes increasingly dissimilar sequences and clusters of sequences at each iteration produces the final alignment.

In the above example, Seq1 and Seq2 are aligned first. Next, Seq3 and Seq4 are aligned. The cluster of Seq1-aligned-to-Seq2 is then aligned to the cluster of Seq3-aligned-to-Seq4. Finally, Seq5 is aligned to the cluster that now contains Seq1 through Seq4 to generate the final alignment of Seq1 through Seq5.
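The clustering order described here can be illustrated with a minimal Python sketch of average-linkage (UPGMA-style) clustering on similarity scores. This is not PileUp's implementation. The similarity values are invented so that the merge order matches the five-sequence example, and the function name upgma_order is hypothetical.

```python
from itertools import combinations


def upgma_order(names, sim):
    """Return the sequence of merges as (cluster_a, cluster_b) tuples.

    sim maps frozenset({a, b}) to the similarity score of sequences a and b.
    At each step the two clusters with the highest average pairwise
    similarity between their members are merged.
    """
    clusters = [(name,) for name in names]
    order = []

    def avg_sim(c1, c2):
        total = sum(sim[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg_sim(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order


# Invented similarity scores (higher means more similar).
scores = {
    ("Seq1", "Seq2"): 3.6, ("Seq3", "Seq4"): 3.2,
    ("Seq1", "Seq3"): 2.5, ("Seq1", "Seq4"): 2.4,
    ("Seq2", "Seq3"): 2.6, ("Seq2", "Seq4"): 2.5,
    ("Seq1", "Seq5"): 1.5, ("Seq2", "Seq5"): 1.4,
    ("Seq3", "Seq5"): 1.6, ("Seq4", "Seq5"): 1.5,
}
sim = {frozenset(pair): value for pair, value in scores.items()}
names = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]

order = upgma_order(names, sim)
for left, right in order:
    print("+".join(left), "aligned to", "+".join(right))
```

With these scores the merges are Seq1 with Seq2, then Seq3 with Seq4, then the two clusters with each other, and finally Seq5 with the four-sequence cluster. Each merge corresponds to one pairwise alignment in the second phase of the program.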

Each pairwise alignment in PileUp uses the method of Needleman and Wunsch (Journal of Molecular Biology 48; 443-453 (1970)), which is extended for use with clusters of aligned sequences rather than only individual sequences. For a pairwise alignment of individual sequences, the comparison score between any two sequence symbols is found in a scoring matrix (see the LOCAL DATA FILES topic below). For a pairwise alignment of clusters of sequences, the comparison score between any two positions in those clusters is simply the arithmetic average of the scores for all possible symbol comparisons at those positions. When gaps are inserted into a cluster to produce an alignment, they are inserted at the same position in all of the sequences of the cluster.

CONSIDERATIONS

Because a rigorously optimal alignment of even a small number of short sequences would be intractable, PileUp uses an approach that may not produce the optimal multiple sequence alignment. (See the ALGORITHM topic above for a description of this approach.)

Clustering

The approach used by PileUp is sensitive to the order in which sequences are aligned. A clustering algorithm determines this order from the pairwise similarities calculated before the final alignments are done. The goal of the clustering is to see that very similar sequences are aligned to each other before they are aligned to more distantly related sequences. There is, at present, no way for you to modify the order of these alignments.

While PileUp calculates the similarity between every pair of sequences, this information is not used by the program to weight the sequences. That is, if there are several very similar sequences, the final alignment may be constrained to minimize the disruption of these sequences.

The dendrogram is not a phylogenetic reconstruction, although the vertical branch lengths are proportional to the distance between the sequences. Its purpose is to represent the clustering order used to create the final alignment. This order is the only information from the dendrogram used by PileUp. See the RELATED PROGRAMS topic for a description of programs in the Wisconsin Package that you can use to create phylogenetic reconstructions from multiple sequence alignments.

Global Alignment

If you know the difference between Gap and BestFit, consider PileUp an extension of the Gap program for more than two sequences, rather than an extension of the BestFit program. PileUp, like Gap, tries to find a global optimal alignment, while BestFit finds a local optimal alignment.

Because PileUp aligns sequences along their entire lengths, it is not ideally suited to finding the best local region of similarity (such as a shared motif) among all of the sequences. However, PileUp has been used successfully for this purpose.

By default, PileUp does not penalize gaps occurring at the ends of sequences. Therefore, related sequences that differ in the extent of their sequencing can be reasonably aligned by PileUp. You can override this default by placing -ENDWeight on the command line, in which case length differences among the sequences become significant.

Piling Up Unrelated Sequences

PileUp always aligns all of the sequences you specify, even if they are not related. The alignment can be degraded if some of the sequences are not similar to one another.

Arbitrary Gap Placement

In any pairwise alignment, the position of the inserted gaps may be arbitrary; equally optimal alignments can be generated by inserting the gaps differently. PileUp can exaggerate these arbitrary differences if you select either the -LOWroad or -HIGhroad parameters. This selection usually affects the final alignment. For the most part, however, the difference between the high road and low road alignments should not be very significant, although you may want to check.

Here is an example showing the difference between high and low road for the alignment of three short sequences. The first pairwise alignment creates an aligned cluster of the two most closely related sequences; the second alignment aligns this cluster to the third sequence creating the final multiple sequence alignment. Although the qualities after the first round alignments are the same, the quality of the final low-road alignment is higher than the high-road one.

             For:       Match = 10       Gap weight = 10
                     Mismatch =  0    Length weight =  0

                HighRoad                          LowRoad

                GACCAT                            GACCAT
Alignment  1    GAG.AT    Quality = 30            GA.GAT    Quality = 30

                GACC.AT                           GAC.CAT
Alignment  2    GAG..AT   Quality = 25            GA..GAT   Quality = 30
                AACGGAT                           AACGGAT

High road alignments shift all of the arbitrary gaps in the second sequence or cluster of aligned sequences to the right and all of the arbitrary gaps in the first sequence or cluster of aligned sequences to the left. Low road alignments do the opposite. When neither high road nor low road is selected, the program tries not to insert a gap whenever that is possible and uses the high road when that is not possible.
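The quality values in the example above can be reproduced with a short Python sketch. It is not PileUp's scoring code and rests on assumptions chosen to match the example: a gap already present in a cluster scores 0 against any symbol, a column that is all gaps on one side counts as a newly inserted gap, and each run of such columns is charged the gap weight once plus the length weight per column.

```python
MATCH, MISMATCH = 10, 0
GAP_WEIGHT, LENGTH_WEIGHT = 10, 0
GAP = "."


def symbol_score(a: str, b: str) -> int:
    """Score one symbol comparison; existing gaps score 0 (an assumption)."""
    if a == GAP or b == GAP:
        return 0
    return MATCH if a == b else MISMATCH


def quality(cluster_a: list[str], cluster_b: list[str]) -> float:
    """Quality of an alignment between two clusters of equal-length rows."""
    total = 0.0
    open_gap = None  # side ('a' or 'b') of the gap run in progress, if any
    for i in range(len(cluster_a[0])):
        col_a = [row[i] for row in cluster_a]
        col_b = [row[i] for row in cluster_b]
        side = None
        if all(c == GAP for c in col_a):
            side = "a"
        elif all(c == GAP for c in col_b):
            side = "b"
        if side is not None:
            if side != open_gap:
                total -= GAP_WEIGHT  # charged once per gap run
            total -= LENGTH_WEIGHT   # charged per gap column
            open_gap = side
            continue
        open_gap = None
        # Average over all symbol comparisons between the two columns.
        pairs = [symbol_score(x, y) for x in col_a for y in col_b]
        total += sum(pairs) / len(pairs)
    return total


high_1 = quality(["GACCAT"], ["GAG.AT"])
low_1 = quality(["GACCAT"], ["GA.GAT"])
high_2 = quality(["GACC.AT", "GAG..AT"], ["AACGGAT"])
low_2 = quality(["GAC.CAT", "GA..GAT"], ["AACGGAT"])
print(high_1, low_1, high_2, low_2)  # 30.0 30.0 25.0 30.0
```

The difference in the second round comes from the third and fifth columns: on the low road both contribute an averaged score of 5, while on the high road only the third column does.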

Scoring Matrices

The default scoring matrices are not necessarily appropriate for all alignments. (See Chapter 4, Using Data Files in the User's Guide for more information.) We provide several alternative scoring matrices suitable for multiple sequence alignments. These matrices are listed in Appendix VII. PileUp chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can respond to the program prompts or use -GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values.

Surface of Comparison

PileUp performs a series of pairwise alignments between clusters of sequences to create the final multiple sequence alignment. Each pairwise alignment requires enough computer memory for a surface of comparison proportional to the product of the lengths of the two clusters being aligned. Since all sequences in an aligned cluster have the same length, the length of a cluster is simply the length of any sequence within that cluster.

PileUp allows you to align sequences, the product of whose lengths is greater than the surface of comparison. In this case, the program limits the total length of gaps that can be inserted into each sequence and calculates the best alignment within this incomplete, or limited, surface of comparison. The program then performs a calculation to determine whether the alignment could possibly be improved if there were no restriction on the total length of gaps in each sequence. If the program cannot rule out this possibility, it displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, a limited alignment often may be optimal even if this message is displayed. In any event, the program continues to completion.
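A rough Python check of whether a given alignment step fits within the default surface of comparison might look like the sketch below. It assumes the limit is compared directly with the product of the two cluster lengths, which the text implies but does not state in detail; the function names are hypothetical.

```python
MAX_SURFACE = 2_250_000  # default maximum surface of comparison


def surface(len_a: int, len_b: int) -> int:
    """Product of the lengths of the two clusters being aligned."""
    return len_a * len_b


def fits_full_surface(len_a: int, len_b: int) -> bool:
    """True if the step can use the complete surface of comparison."""
    return surface(len_a, len_b) <= MAX_SURFACE


print(fits_full_surface(676, 634))    # True
print(fits_full_surface(1500, 1500))  # True
print(fits_full_surface(5000, 5000))  # False
```

Under this assumption, two clusters of length 1,500 just fit, while longer pairs fall into the limited-surface case described above.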

Memory Requirement

PileUp is shipped with the Wisconsin Package so that you can run the program if you have access to 17.5 MB of virtual memory. If you do not have this much, you can ask your system manager to further reduce the parameters that define the program's capacity as described below. If you have more than 17.5 MB of virtual memory, you can ask your system manager to increase these same parameters so that PileUp is able to align more and longer sequences.

Adjustable parameters are set in the file GenInclude:pileupconstant.inc. Here are their current values.

 Parameter MAXSEQNUM  = 500         ! maximum number of sequences
 Parameter MAXSURFACE = 2 250 000   ! maximum surface of comparison
 Parameter SEGDEF     = 5000        ! default maximum length of
                                    ! segment to be aligned
                                    ! (per sequence)
 Parameter PADDINGDEF = 2000        ! default maximum length of all
                                    ! gaps combined (per sequence)
 Parameter MAXSTRBUFF = 2 000 000   ! maximum combined length of
                                    ! all sequences after
                                    ! alignment

SUGGESTIONS

Figure Files

By default, PileUp writes instructions for plotting the dendrogram into a figure file named pileup.figure. Such files can be plotted on any supported graphics device using the Figure program.

Batch Queue

PileUp can take more than a few minutes to run, depending upon the length and number of sequences being aligned. Most alignments should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using the command-line parameter -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide. Very large alignments may exceed the CPU limit set by some systems.

When PileUp is run in batch using -BATch, instructions for plotting the dendrogram are written to a figure file named pileup.figure unless the plot has been directed to a specific file or graphics device from the command line, or has been suppressed with the -NOPLOt command-line parameter.

Editing Multiple Sequence Alignments

PileUp writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. You can edit the alignments created by PileUp with LineUp using the command % lineup -MSF, but LineUp cannot be used to edit more than 30 sequences simultaneously.

You can also edit the alignment created by PileUp with a regular text editor. Any PileUp alignment that has been modified with a text editor can be put back into GCG's multiple sequence format (MSF) using the command % reformat -MSF.

The Pretty program can calculate a consensus for the multiple sequence alignment and can display the alignment several different ways.

Using the Output from PileUp

PileUp writes the alignment into a multiple sequence format (MSF) file that interleaves the sequences to show their alignment. Any or all of the sequences in this file can be used by any other GCG sequence analysis program. For instance, you could generate a profile from the sequences in an MSF file with a command like % profilemake hsp70.msf{*} and then use that profile to search the database for sequences similar to the sequences in the alignment. (See "Specifying MSF Sequences" in Chapter 2, Using Sequence Files and Databases in the User's Guide for help specifying sequences in MSF files.)

GRAPHICS

The Wisconsin Package must be configured for graphics before you run any program with graphics output! If the % setplot command is available in your installation, this is the easiest way to establish your graphics configuration, but you can also use commands like % postscript that correspond to the graphics languages the Wisconsin Package supports. See Chapter 5, Using Graphics in the User's Guide for more information about configuring your process for graphics.

<CTRL>C

If you need to stop this program, use <Ctrl>C to reset your terminal and session as gracefully as possible. Searches and comparisons write out the results from the part of the search that is complete when you use <Ctrl>C. The graphics device should stop plotting the current page and start plotting the next page. If the current page is the last page, plotters should put the pen away and graphic terminals should return to interactive mode.

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % pileup -[INfile=]@Hsp70.List -Default

Prompted Parameters:

-GAPweight=12           gap creation penalty
-LENgthweight=4         gap extension penalty
-DENsity=20.0           number of sequences per 100 pu in the dendrogram
[-OUTfile1=]hsp70.msf   output file for multiple sequence alignment

Local Data Files:

-MATRix=blosum62.cmp    scoring matrix for peptides
-MATRix=pileupdna.cmp   scoring matrix for nucleic acids

Optional Parameters:

-BEGin=1     sets beginning position for every sequence to be aligned
-END=100     sets ending position for every sequence to be aligned
-REVerse     uses the reverse strand for each input sequence
-ENDWeight   penalizes end gaps like other gaps
-INSitu      realign a portion of an existing alignment
-HIGhroad    selects "top" alignment path for equally optimal gaps
-LOWroad     selects "bottom" alignment path for equally optimal gaps
-MAXSeg=5000 sets maximum segment length for every input sequence
-MAXGap=2000 sets maximum combined length of all gaps added to a sequence
-NOSORt      presents output sequences in the same order as input
-LINesize=50       sets the number of sequence symbols per line
-BLOcksize=10      sets the number of sequence symbols per block
-DEGap       removes gap characters ('.' and '~') from the input sequences
-NOPLOt      suppresses plot of clustering relationships
-NOMONitor   suppresses screen trace of each alignment
-NOSUMmary   suppresses screen summary at the end of the program
-BATch       submits program to the batch queue

All GCG graphics programs accept these and other switches. See the Using
Graphics chapter of the USERS GUIDE for descriptions.

-FIGure[=FileName]  stores plot in a file for later input to FIGURE
-FONT=3             draws all text on the plot using font 3
-COLor=1            draws entire plot with pen in stall 1
-SCAle=1.2          enlarges the plot by 20 percent (zoom in)
-XPAN=10.0          moves plot to the right 10 platen units (pan right)
-YPAN=10.0          moves plot up 10 platen units (pan up)
-PORtrait           rotates plot 90 degrees

ACKNOWLEDGEMENT

PileUp was written by Irv Edelman.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program default scoring matrix file in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.

PileUp reads a scoring matrix from your local directory or the public database with the values for every possible symbol comparison. The file pileupdna.cmp has a 10 at every place where the sets of bases implied by the alphabetic IUB ambiguity codes (see Appendix III) overlap. All of the other locations have zeros. The file blosum62.cmp is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff. The scores in this matrix for pairwise amino acid comparisons range from -4 to +11. You can use the Fetch program to copy these files and then modify them to suit your own needs. (See the CONSIDERATIONS topic for more information about scoring matrices.)

OPTIONAL PARAMETERS

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see the Local Scoring Matrices topic above.

-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, PileUp ignores beginning positions specified for individual sequences in a list file.

-END=100

sets the ending position for all input sequences. When the ending position is set from the command line, PileUp ignores ending positions specified for sequences in a list file.

-REVerse

sets the program to use the reverse strand for each input sequence. When -REVerse or -NOREVerse is on the command line, PileUp ignores any strand designation for individual sequences in a list file.
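The precedence of command-line values over list-file values, and the effect of a range and strand on one sequence, can be sketched as follows. The sketch assumes that positions are 1-based and inclusive, that the range is taken before the strand is reversed, and that the reverse strand of a nucleotide sequence is its reverse complement. The function names are invented for the example.

```python
from typing import Optional

# Complement table for unambiguous bases only (a simplification).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def effective(command_line: Optional[int], list_file: Optional[int], default: int) -> int:
    """A command-line value overrides a list-file value, which overrides the default."""
    if command_line is not None:
        return command_line
    if list_file is not None:
        return list_file
    return default


def select_range(seq: str, begin: int = 1, end: Optional[int] = None,
                 reverse: bool = False) -> str:
    """Return positions begin..end (1-based, inclusive), optionally reversed."""
    end = len(seq) if end is None else end
    if not 1 <= begin <= end <= len(seq):
        raise ValueError("range lies outside the sequence")
    piece = seq[begin - 1:end]
    if reverse:
        # Reverse complement: complement each base, then reverse the order.
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece
```

For example, with a list-file beginning of 50 and -BEGin=10 on the command line, effective(10, 50, 1) yields 10 for every sequence in the list.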

-ENDWeight

causes gaps at the ends of sequences to be penalized in the same way as all other gaps. (The default is not to penalize gaps at the ends of sequences.)

-INSitu

allows you to realign a portion of an existing alignment without changing the remainder of the alignment. You specify the portion to realign with the -BEGin and -END command-line parameters. The program removes all gaps (. and ~) from this portion of the alignment, then realigns only this portion, and finally replaces the specified part of the original alignment with the newly realigned part.
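The three steps of an in situ realignment (remove gaps from the chosen columns, realign them, splice the result back) can be sketched as follows. The sketch assumes 1-based, inclusive column positions. The realign argument stands in for the real alignment step, and pad_right is only a placeholder that pads rows to equal length; it is not an aligner.

```python
from typing import Callable, List

GAP_CHARS = ".~"
_STRIP_GAPS = str.maketrans("", "", GAP_CHARS)


def pad_right(rows: List[str]) -> List[str]:
    """Placeholder for a real aligner: pad every row with '.' to equal length."""
    width = max(len(row) for row in rows)
    return [row.ljust(width, ".") for row in rows]


def realign_in_situ(alignment: List[str], begin: int, end: int,
                    realign: Callable[[List[str]], List[str]]) -> List[str]:
    """Realign columns begin..end and leave the flanking columns untouched."""
    left = [row[:begin - 1] for row in alignment]
    middle = [row[begin - 1:end] for row in alignment]
    right = [row[end:] for row in alignment]
    degapped = [part.translate(_STRIP_GAPS) for part in middle]  # drop . and ~
    realigned = realign(degapped)
    return [l + m + r for l, m, r in zip(left, realigned, right)]
```

Because only the middle portion is rebuilt, the realigned rows may differ in total length from the original alignment, but all rows keep the same length as one another.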

-HIGhroad and -LOWroad

exaggerate the arbitrary insertion of gaps. (See the CONSIDERATIONS topic for a description of high and low road alignments.)

-MAXSeg=5000

sets the maximum length for each individual input sequence. Setting a higher limit (up to a maximum of 7,000) allows you to align longer sequences while setting a lower limit allows you to add more and longer gaps to each sequence. (See the RESTRICTIONS topic for a more detailed description.)

-MAXGap=2000

sets the maximum combined length of all gaps that can be added to each sequence. Setting a higher limit allows you to add more and longer gaps to each sequence while setting a lower limit allows you to align a greater number of sequences. (See the RESTRICTIONS topic for a more detailed description.)
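The way -MAXSeg and -MAXGap bound a single sequence can be sketched as a simple check. The defaults shown are the values used in the examples above. The 7,000-character overall limit is taken from the general rule stated in the program description, and treating it as a bound on sequence length plus added gaps is an assumption of this sketch; see the RESTRICTIONS topic for the authoritative rules. The function name is invented for the example.

```python
from typing import List


def check_limits(raw_length: int, gaps_added: int, maxseg: int = 5000,
                 maxgap: int = 2000, hard_limit: int = 7000) -> List[str]:
    """Return a list of limit violations for one sequence (empty if none)."""
    problems = []
    if raw_length > maxseg:
        problems.append("input sequence is longer than -MAXSeg")
    if gaps_added > maxgap:
        problems.append("gaps added exceed -MAXGap")
    if raw_length + gaps_added > hard_limit:
        # Assumption: the aligned length (symbols plus gaps) may not
        # exceed the overall 7,000-character limit.
        problems.append("aligned sequence exceeds the overall length limit")
    return problems
```

With the example values, a 4,000-symbol sequence that receives 500 gap characters passes every check, while a 6,000-symbol sequence fails the -MAXSeg check unless the limit is raised.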

-NOSORt

writes the aligned sequences in the same order as they were presented to the program, rather than presenting closely aligned sequences close together in the output.

-LINesize=50

specifies the number of sequence symbols to display on each line of the output MSF (multiple sequence format) file.

-BLOcksize=10

specifies the number of sequence symbols to place in each block of the output MSF (multiple sequence format) file.
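The combined effect of -LINesize and -BLOcksize on one aligned sequence can be sketched as follows. The sketch shows only how the symbols are grouped; a real MSF file also carries sequence names and header lines, which are not reproduced here, and the function name is invented for the example.

```python
from typing import List


def format_row(seq: str, linesize: int = 50, blocksize: int = 10) -> List[str]:
    """Split `seq` into lines of `linesize` symbols, in blocks of `blocksize`."""
    lines = []
    for start in range(0, len(seq), linesize):
        line = seq[start:start + linesize]
        blocks = [line[i:i + blocksize] for i in range(0, len(line), blocksize)]
        lines.append(" ".join(blocks))  # one space between blocks
    return lines
```

With the values shown for the two parameters, each full line holds five blocks of ten symbols.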

-DEGap

removes gap characters (. and ~) from the input sequences before aligning.

-NOPLOt

suppresses the plot of clustering relationships used to create the multiple sequence alignment.

-MONitor=1,1

shows the progress of PileUp on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it by including -NOMONitor on the command line.

The screen monitor is updated every time the program determines a pairwise similarity between two sequences (in the first part of the program) and every time the program aligns two clusters of sequences (in the second part of the program). You can append two optional values to -MONitor to set these two monitoring intervals to some other numbers. For example, -MONitor=20,10 outputs a line to the screen after every 20th pairwise comparison and every 10th alignment.
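The two monitoring intervals can be sketched as follows. The function and the event labels are invented for the example, and the counts of comparisons and alignments are simply passed in rather than derived from the number of sequences.

```python
from typing import List, Tuple


def monitor_events(n_comparisons: int, n_alignments: int,
                   compare_every: int = 1,
                   align_every: int = 1) -> List[Tuple[str, int]]:
    """List the points at which a monitor line would be written."""
    events = []
    # First part of the program: pairwise similarity determinations.
    for i in range(1, n_comparisons + 1):
        if i % compare_every == 0:
            events.append(("compare", i))
    # Second part of the program: alignments of clusters.
    for j in range(1, n_alignments + 1):
        if j % align_every == 0:
            events.append(("align", j))
    return events
```

With intervals of 20 and 10, a run of 45 comparisons and 25 alignments would write lines after comparisons 20 and 40 and after alignments 10 and 20.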

-SUMmary

writes a summary of the program's work to the screen when you've used the -Default parameter to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

The parameters below apply to all GCG graphics programs. These and many others are described in detail in Chapter 5, Using Graphics in the User's Guide.

-FIGure=programname.figure

writes the plot as a text file of plotting instructions suitable for input to the Figure program instead of drawing the plot on your plotter.

-FONT=3

draws all text characters on the plot using Font 3 (see Appendix I).

-COLor=1

draws the entire plot with the pen in stall 1.

The parameters below let you expand or reduce the plot (zoom), move it in either direction (pan), or rotate it 90 degrees (rotate).

-SCAle=1.2

expands the plot by 20 percent by resetting the scaling factor (normally 1.0) to 1.2 (zoom in). You can expand the axes independently with -XSCAle and -YSCAle. Numbers less than 1.0 contract the plot (zoom out).

-XPAN=30.0

moves the plot to the right by 30 platen units (pan right).

-YPAN=30.0

moves the plot up by 30 platen units (pan up).

-PORtrait

rotates the plot 90 degrees. Usually, plots are displayed with the horizontal axis longer than the vertical (landscape). Note that plots are reduced or enlarged, depending on the platen size, to fill the page.

Printed: November 18, 1996 13:04 (1162)

Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks

Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.
