FASTA*

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
CONSIDERATIONS
SUGGESTIONS
ACKNOWLEDGEMENT
COMMAND-LINE SUMMARY
LOCAL DATA FILES
OPTIONAL PARAMETERS

FUNCTION

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

DESCRIPTION

FastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence. In the first step of this search, the comparison can be viewed as a set of dot plots, with the query as the vertical sequence and the group of sequences to which the query is being compared as the different horizontal sequences. This first step finds the registers of comparison (diagonals) having the largest number of short perfect matches (words) for each comparison. In the second step, these "best" regions are rescored using a scoring matrix that allows conservative replacements, ambiguity symbols, and runs of identities shorter than the size of a word. In the third step, the program checks to see if some of these initial highest-scoring diagonals can be joined together. Finally, the search set sequences with the highest scores are aligned to the query sequence for display.

What is a Word?

A word is any short sequence (n-mer or k-tuple) where you have set n to some small integer less than or equal to six. The word GGATGG is one of the 4,096 possible words of length six that can be created from an alphabet consisting of the four letters G, A, T, and C. The word QL is one of the 400 possible words of length two that you can make with the 20 letters of the amino acid alphabet.
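
The word counts above follow from raising the alphabet size to the word length. As a small illustration (the function name all_words is hypothetical and not part of FastA), the following Python sketch enumerates every word and confirms both totals:

```python
from itertools import product

def all_words(alphabet, n):
    """Return every possible word of length n over the given alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]

# Four nucleotide letters, word length six: 4**6 words.
dna_words = all_words("GATC", 6)
# Twenty amino acid letters, word length two: 20**2 words.
protein_words = all_words("ACDEFGHIKLMNPQRSTVWY", 2)

print(len(dna_words), len(protein_words))
print("GGATGG" in dna_words, "QL" in protein_words)
```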

EXAMPLE

Here is a session using FastA to identify sequences in the SwissProt protein sequence database that are similar to a human globin protein sequence:


% fasta

 FASTA with what query sequence ?  ggamma.pep

                  Begin (* 1 *) ?
                End (*   148 *) ?

 Search for query in what sequence(s) (* SwissProt:* *) ?

 What word size (* 2 *) ?

 Don't show scores whose E() value exceeds: (* 10.0 *):

 What should I call the output file (* ggamma.fasta *) ?

          1 Sequences         924 aa searched    SW:104KTHEPA
        101 Sequences      33,651 aa searched    SW:1A1DPSESP

 ///////////////////////////////////////////////////////////////

 CPU time used:
       Database scan:  0:00:34.7
Post-scan processing:  0:00: 2.1
      Total CPU time:  0:00:37.1
 Output File: ggamma.fasta

%

OUTPUT

The output from FastA is a list file, and is suitable for input to any GCG program that allows indirect file specifications. (For information about indirect file specification, see Chapter 2, Using Sequence Files and Databases of the User's Guide.)

Here is some of the output file:


!!SEQUENCE_LIST 1.0

(Peptide) FASTA of: ggamma.pep  from: 1 to: 148  September 17, 1996 16:21

TRANSLATE of: gamma.seq check: 6474 from: 2179 to: 2270
      and of: gamma.seq check: 6474 from: 2393 to: 2615
      and of: gamma.seq check: 6474 from: 3502 to: 3630
generated symbols 1 to: 148.
Human fetal beta globins G and A gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203. . . .

 TO: SwissProt:*  Sequences:     52,205  Symbols: 18,531,385  Word Size: 2

 Databases searched:
   SWISS-PROT, Release 33.0, Released on 22Mar96, Formatted on 22Jul1996

 Scoring matrix: GenRunData:blosum50.cmp
 Variable pamfactor used
 Gap creation penalty: 12      Gap extension penalty: 2

Histogram Key:
 Each histogram symbol represents 70 search set sequences
 Each inset symbol represents 13 search set sequences
 z-scores computed from opt scores

z-score obs    exp
        (=)    (*)

< 20    159      0 :*==
  22      1      0 :*
  24      3      0 :*
  26     10      1 :*
  28     71     12 :*=
  30    243     71 :=*==
  32    607    274 :===*=====
  34   1351    742 :==========*=========
  36   2072   1524 :=====================*========
  38   3079   2519 :===================================*========
  40   3949   3514 :==================================================*======
  42   4149   4295 :===========================================================*
  44   4037   4738 :========================================================== *
  46   3849   4826 :=======================================================    *
  48   3646   4620 :=====================================================      *
  50   3405   4216 :=================================================          *
  52   3247   3707 :===============================================     *
  54   2841   3166 :=========================================    *
  56   2427   2645 :===================================  *
  58   2101   2171 :===============================*
  60   1826   1759 :=========================*=
  62   1493   1410 :====================*=
  64   1289   1121 :================*==
  66   1122    886 :============*====
  68    869    697 :=========*===
  70    904    546 :=======*=====
  72    573    427 :======*==
  74    425    333 :====*==
  76    401    259 :===*==
  78    293    201 :==*==
  80    232    156 :==*=
  82    174    120 :=*=
  84    129     95 :=*
  86    125     73 :=*
  88     81     57 :*=
  90     74     44 :*=
  92     54     34 :*         :==*==
  94     39     26 :*         :=*=
  96     45     20 :*         :=*==
  98     36     16 :*         :=*=
 100     20     12 :*         :*=
 102     15      9 :*         :*=
 104     10      7 :*         :*
 106     18      6 :*         :*=
 108     10      4 :*         :*
 110     12      3 :*         :*
 112      5      3 :*         :*
 114      6      2 :*         :*
 116      3      2 :*         :*
 118      2      1 :*         :*
>120    639      1 :*=========:*=======================================

 Results sorted and z-values calculated from opt score
 1854 scores saved that exceeded 78
 38871 optimizations performed
 Joining threshold: 36, optimization threshold: 24, opt. width: 16

The best scores are:                    init1 initn   opt    z-sc E(51375)..

SW:HBG_HUMAN    Begin: 1  End:  146
! P02096 homo sapiens (human), and pa...  956   956   956  1325.6       0
SW:HBG1_PONPY    Begin: 1  End:  146
! P18995 pongo pygmaeus (orangutan). ...  952   952   952  1320.1       0
SW:HBG_MACMU    Begin: 1  End:  146
! P02098 macaca mulatta (rhesus macaq...  947   947   947  1313.2       0

/////////////////////////////////////////////////////////////////////////

\\End of List

ggamma.pep
SW:HBG_HUMAN

ID   HBG_HUMAN      STANDARD;      PRT;   146 AA.
AC   P02096;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-FEB-1996 (REL. 33, LAST ANNOTATION UPDATE)
DE   HEMOGLOBIN GAMMA-A AND GAMMA-G CHAINS. . . .

SCORES      Init1:   956  Initn:   956  Opt:   956 z-score: 1325.6 E():      0
Smith-Waterman score: 956;    99.3% identity in 146 aa overlap

                     10        20        30        40        50        60
ggamma.pep   MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HBG_HUMAN     GHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK
                      10        20        30        40        50

                     70        80        90       100       110       120
ggamma.pep   VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HBG_HUMAN    VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG
            60        70        80        90       100       110

                    130       140
ggamma.pep   KEFTPEVQASWQKMVTGVASALSSRYHX
             ||||||||||||||||:||||||||||
HBG_HUMAN    KEFTPEVQASWQKMVTAVASALSSRYH
           120       130       140

///////////////////////////////////////////////////////////////////

! CPU time used:
!        Database scan:  0:00:34.7
! Post-scan processing:  0:00:02.1
!       Total CPU time:  0:00:37.1
! Output File: ggamma.fasta

What is the Output?

The first part of the output file contains a histogram showing the distribution of the z-scores between the query and search set sequences. (See the ALGORITHM topic for an explanation of z-score.) The histogram is composed of bins of size 2 that are labeled according to the higher score for that bin (the leftmost column of the histogram). For example, the bin labeled 24 stores the number of sequence pairs that had scores of 23 or 24.

The next two columns of the histogram list the number of z-scores that fell within each bin. The second column lists the number of z-scores observed in the search and the third column lists the number of z-scores that were expected.

The body of the histogram displays a graphical representation of the score distributions. Equal signs (=) indicate the number of scores of that magnitude that were observed during the search, while asterisks (*) plot the number of scores of that magnitude that were expected.
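
The binning rule described above can be sketched in a few lines of Python. This is only an illustration of the labeling convention, not the program's own code. The function names are hypothetical, and the handling of the two open-ended bins (scores of 20 or less go to the first bin, scores above 120 to the last) is an assumption based on the sample histogram.

```python
import math
from collections import Counter

def bin_label(z, low=20, high=120, width=2):
    """Label of the histogram bin that holds z-score z.

    Each bin is labeled with its higher score, so the bin labeled 24
    holds scores greater than 22 and up to 24.
    """
    if z <= low:            # assumed rule for the first, open-ended bin
        return "<%d" % low
    if z > high:            # assumed rule for the last, open-ended bin
        return ">%d" % high
    return str(math.ceil(z / width) * width)

def histogram(z_scores):
    """Count how many z-scores fall into each bin."""
    return Counter(bin_label(z) for z in z_scores)

counts = histogram([23, 24, 24.5, 19.0, 1325.6])
print(counts)
```

In this sketch the scores 23 and 24 both land in the bin labeled 24, matching the example given above.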

At the bottom of the histogram is a list of some of the parameters pertaining to the search. These are displayed even if the histogram itself has been suppressed by -NOHIStogram.

Below the histogram, FastA displays a listing of the best scores. /rev or Strand: - after the sequence name in this list indicates that the match was found between the search set sequence and the bottom (reverse-complement) strand of the query sequence.

Following the list of best scores, FastA displays the alignments of the regions of best overlap between the query and search sequences. /rev following the query sequence name indicates that the search sequence is aligned with the bottom strand of the query sequence.

This program displays only the region of overlap between the two aligned sequences (plus some residues on either side of the region to provide context for the alignment) unless you put -SHOWall on the command line. The display of identities and conservative replacements between the aligned sequences depends on the value of the -MARKx command-line parameter. By default ( -MARKx=3), the pipe character (|) is used to denote identities and the colon (:) to denote conservative replacements.

INPUT FILES

FastA accepts a single protein sequence or a single nucleic acid sequence as the query sequence. The search set is either a single sequence or multiple sequences of the same type as the query. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of FastA depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

Segments aligns and displays the segments of similarity found by WordSearch.

If you run Compare with the command-line parameter -WORd, the program calculates the points for a dot-plot that show where common words between two sequences occur.

ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake.

BLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. BLAST can search databases on your own computer or databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

TFastA does a Pearson and Lipman search for similarity between a query peptide sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied peptide sequences in a nucleotide sequence database are similar to my peptide sequence?"

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

FindPatterns, StringSearch, LookUp and Names are other programs for identifying sequences.

RESTRICTIONS

The query sequence cannot be longer than 32,000 symbols. You cannot select a list size of more than 1,000 best scores nor view more than 1,000 alignments. The word size must be from 1 to 6 for nucleic acid queries, and from 1 to 2 for protein queries. The sequence type (nucleic acid or protein) of the query sequence and the search set sequences must match.

For the estimates of statistical significance to be valid, the search set must contain a large sample of unrelated sequences. The statistical estimates will not be calculated at all if there are fewer than 20 scores saved (equivalent to 10 sequences in the search set when both strands are searched, or 20 sequences if only one strand is searched).

If -NOOPTall is specified on the command line, the estimates of statistical significance will not be accurate.

ALGORITHM

FastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci. USA 85; 2444-2448 (1988)) to search for similarities between one sequence (the query) and any group of sequences.

Hashing Step

The first step in the search is to locate regions of the query sequence and the search set sequence that have high densities of exact word matches. The algorithm for this step of the search is a modification of the algorithm of Wilbur and Lipman (Proc. Natl. Acad. Sci. USA 80; 726-730 (1983)) and may be referred to as a hash-table look-up search or hashing. Wilbur and Lipman searches (including FastA) belong to a class of comparisons that use direct addressing or k-tuple preprocessing to increase the speed of the search at the expense of some sensitivity.

The hashing process works as follows. After you give FastA a word size, it makes up a dictionary of all of the possible words of that size in the query sequence. A second dictionary is created for the opposite strand if the query is a nucleic acid sequence. Each word, such as GGATGG, is converted to a unique base-4 number that serves as an index to the corresponding dictionary entry. Each entry contains a list of numbers telling the location (coordinates) of the word in the query sequence. If the word does not occur, the entry contains only the number zero. So for each word in the searched sequences, FastA only has to look up the word in the dictionary to find out if it occurs in the query sequence.
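
The dictionary described above can be pictured with a short Python sketch. It is a minimal illustration, not FastA's implementation: the function names and the letter-to-digit assignment (A=0, C=1, G=2, T=3) are assumptions, and an empty list stands in for the zero entry that marks a word that does not occur.

```python
BASE4 = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed digit assignment

def word_index(word):
    """Convert a nucleotide word to a unique base-4 number."""
    index = 0
    for base in word:
        index = index * 4 + BASE4[base]
    return index

def build_dictionary(query, wordsize):
    """One entry per possible word; each entry lists the 1-based
    coordinates at which that word starts in the query."""
    table = [[] for _ in range(4 ** wordsize)]
    for pos in range(len(query) - wordsize + 1):
        table[word_index(query[pos:pos + wordsize])].append(pos + 1)
    return table

def lookup(table, word):
    """Direct addressing: one index computation, no searching."""
    return table[word_index(word)]

table = build_dictionary("GGATGGATGG", 6)
print(word_index("GGATGG"), lookup(table, "GGATGG"), lookup(table, "AAAAAA"))
```

Because the index is computed directly from the word, looking up a word from a search set sequence takes the same time no matter how long the query is.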

It is important to realize that the hashing process cannot deal with ambiguity! To partially compensate for this limitation, FastA converts an ambiguous base in a nucleotide sequence to its most common nonambiguous constituent before calculating the index number of the word that contains the ambiguity. For example, A is the most common nucleotide in the sequence databases, so N is converted to A during the hashing step. For protein sequences, the ambiguous amino acids B, Z, and X are not converted to unambiguous amino acids, but are treated as extra amino acids. This means that an X in a protein query sequence will match only to an X in the search set sequence during the hashing step.
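
The conversion of ambiguous bases can be sketched as follows. The IUPAC ambiguity codes are standard, but the frequency table is purely hypothetical (chosen only so that A is the most common base, as stated above), and the function name is an assumption.

```python
# Standard IUPAC nucleotide ambiguity codes and their constituents.
IUPAC = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Hypothetical base frequencies; only their ordering matters here.
ASSUMED_FREQ = {"A": 0.30, "T": 0.29, "G": 0.21, "C": 0.20}

def resolve_ambiguity(sequence, freq=ASSUMED_FREQ):
    """Replace each ambiguous base with its most common constituent."""
    resolved = []
    for symbol in sequence.upper():
        if symbol in IUPAC:
            symbol = max(IUPAC[symbol], key=lambda base: freq[base])
        resolved.append(symbol)
    return "".join(resolved)

print(resolve_ambiguity("GGNTGG"))
```

With these assumed frequencies, N becomes A, so a word containing N is hashed as if it contained A at that position.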

If a word from a search set sequence occurs in the query sequence, FastA computes a score for the word equal to the sum of the scoring factors (see next paragraph) for each symbol in the word. It then adds this score to the score of the diagonal on which the word occurs. If a word match overlaps another word on the same diagonal, only the scoring factor(s) for the non-overlapping symbol(s) is added to the score of the diagonal. If there are intervening mismatches between matching words on a diagonal, a constant penalty for each mismatching residue is subtracted from the score.
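
The diagonal scoring rules in the preceding paragraph can be sketched in Python. This is a simplified illustration under stated assumptions: a constant scoring factor per symbol (as with -NOPAMfactor), a hypothetical mismatch penalty of 1, and a single running total per diagonal. The function and parameter names are not taken from FastA.

```python
from collections import defaultdict

def diagonal_scores(query, subject, wordsize, factor=1, mismatch_penalty=1):
    """Score each diagonal from exact word matches (0-based positions)."""
    # Dictionary of word positions in the query.
    positions = defaultdict(list)
    for i in range(len(query) - wordsize + 1):
        positions[query[i:i + wordsize]].append(i)

    score = defaultdict(int)   # running score for each diagonal
    last_end = {}              # subject index just past the last matched symbol
    for j in range(len(subject) - wordsize + 1):
        for i in positions.get(subject[j:j + wordsize], ()):
            diagonal = i - j
            end = j + wordsize
            previous = last_end.get(diagonal)
            if previous is None:
                new_symbols = wordsize
            elif j < previous:
                # Overlapping word: count only the non-overlapping symbols.
                new_symbols = end - previous
            else:
                # Intervening mismatches cost a constant penalty each.
                score[diagonal] -= mismatch_penalty * (j - previous)
                new_symbols = wordsize
            score[diagonal] += factor * new_symbols
            last_end[diagonal] = end
    return dict(score)

print(diagonal_scores("ACGTTTGCA", "ACGATTGCA", 3))
```

In the example, the two sequences share eight identical positions on the main diagonal, separated by one mismatch, so the main diagonal scores 8 - 1 = 7.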

When -PAMfactor is in effect (the default for protein query sequences), the scoring factors used to score a word are the identical match scores of the scoring matrix used. Thus a word that contains relatively immutable amino acids will add a larger score to the diagonal than a word which contains amino acids which can exchange readily. The default for a nucleic acid query sequence is -NOPAMfactor. In this case, a single constant value is used for all symbol matches, so all words contribute the same score. The program defaults can be overridden by placing -PAMfactor or -NOPAMfactor on the command line.

Scoring Step

At the end of the hashing step, the ten highest-scoring regions for the sequence pair (the regions with the highest density of exact word matches) are rescored using a scoring matrix that allows conservative replacements and runs of identities shorter than the size of a word. The ends of each region are trimmed to include only those residues that contribute to the highest score for the region, resulting in ten partial alignments without gaps. These are referred to as the initial regions. The score of the highest scoring initial region is saved as the init1 score.
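
Rescoring and trimming a region amounts to finding the highest-scoring ungapped run along a diagonal. The Python sketch below illustrates the idea with a running-sum scan. The toy scoring function (+5 for a match, -4 for a mismatch) is an assumption standing in for a real scoring matrix, and the names are hypothetical.

```python
def toy_score(a, b):
    """Stand-in for a scoring matrix lookup (assumed values)."""
    return 5 if a == b else -4

def best_ungapped_segment(query, subject, diagonal, score=toy_score):
    """Return (score, first, last) of the best run on one diagonal.

    diagonal is the query index minus the subject index; first and last
    are 0-based subject positions of the trimmed region.
    """
    best = (0, None, None)
    run, run_start = 0, None
    j = max(0, -diagonal)
    while j < len(subject) and j + diagonal < len(query):
        if run <= 0:
            # A run that has dropped to zero cannot help; start a new one.
            run, run_start = 0, j
        run += score(query[j + diagonal], subject[j])
        if run > best[0]:
            best = (run, run_start, j)
        j += 1
    return best

print(best_ungapped_segment("TTACGTTT", "GGACGTGG", 0))
```

The residues on either side of the matching ACGT block would lower the score, so they are trimmed away, leaving a region that scores 20. The highest such score over the initial regions corresponds to init1.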

Next, FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined. The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap. The score of the highest scoring region at the end of this step is saved as the initn score.
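
The joining step can be pictured as choosing the best chain of non-overlapping regions. The following Python sketch is one way to do this under stated assumptions: each region is a hypothetical tuple (query start, query end, search start, search end, score), and the joining penalty is a parameter you supply rather than FastA's actual value.

```python
def initn_score(regions, join_penalty):
    """Best total score from joining compatible, non-overlapping regions."""
    # Sort so that any region that could precede another comes first.
    regions = sorted(regions, key=lambda r: (r[0], r[2]))
    best = []   # best[k]: best chain score that ends with region k
    for k, (q_start, q_end, s_start, s_end, score) in enumerate(regions):
        top = score
        for m in range(k):
            _, prev_q_end, _, prev_s_end, _ = regions[m]
            # A region can follow another only if it starts after the
            # earlier one ends in both sequences.
            if prev_q_end < q_start and prev_s_end < s_start:
                top = max(top, best[m] + score - join_penalty)
        best.append(top)
    return max(best, default=0)

regions = [(0, 9, 0, 9, 50), (15, 24, 12, 21, 40), (5, 12, 30, 40, 30)]
print(initn_score(regions, join_penalty=20))
```

Here the first two regions can be joined (50 + 40 - 20 = 70), while the third cannot follow either of them in both sequences. If the joining penalty exceeds the gain, the best single region is kept instead.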

Aligning Step

After computing the initial scores, FastA determines the best segment of similarity between the query sequence and the search set sequence, using a variation of the Smith-Waterman algorithm. This "local alignment in a band" procedure is described in Chao, Pearson, and Miller (CABIOS 8; 481-487 (1992)). The score for this alignment is reported as the opt score.
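
To illustrate what restricting a local alignment to a band means, here is a minimal Python sketch of a banded Smith-Waterman score calculation. It is not the Chao, Pearson, and Miller procedure itself. The toy scoring function, the parameter names, and the gap cost model (creation penalty plus extension penalty times gap length) are assumptions.

```python
NEG = float("-inf")

def toy_score(a, b):
    """Stand-in for a scoring matrix lookup (assumed values)."""
    return 5 if a == b else -4

def banded_local_score(query, subject, center, width,
                       score=toy_score, gap_open=12, gap_extend=2):
    """Best local alignment score using only cells (i, j), 1-based,
    with |(i - j) - center| <= width."""
    best = 0
    prev_h, prev_f = {}, {}
    for i in range(1, len(query) + 1):
        cur_h, cur_f = {}, {}
        e = NEG                       # gap that skips search set residues
        lo = max(1, i - center - width)
        hi = min(len(subject), i - center + width)
        for j in range(lo, hi + 1):
            e = max(cur_h.get(j - 1, NEG) - gap_open - gap_extend,
                    e - gap_extend)
            # Gap that skips query residues; cells outside the band
            # are treated as unreachable.
            f = max(prev_h.get(j, NEG) - gap_open - gap_extend,
                    prev_f.get(j, NEG) - gap_extend)
            diag = prev_h.get(j - 1, 0) if i > 1 and j > 1 else 0
            h = max(0, diag + score(query[i - 1], subject[j - 1]), e, f)
            cur_h[j], cur_f[j] = h, f
            best = max(best, h)
        prev_h, prev_f = cur_h, cur_f
    return best

q = "ACGTACGTAC" + "TTGCATTGCA"
s = "ACGTACGTAC" + "T" + "TTGCATTGCA"
print(banded_local_score(q, s, center=0, width=2))
print(banded_local_score(q, s, center=0, width=0))
```

With a band wide enough to hold the one-residue gap, all 20 query residues align (100 - 14 = 86). With a width of zero, the calculation is confined to a single diagonal and no gap is possible, so the score falls to 60. This shows why a narrow band is faster but can miss alignments that need longer gaps.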

By default, FastA determines the opt score immediately if the initn score is greater than a given threshold. The opt scores are then used as the basis for keeping a list of the best matches found. The program calculates the default threshold from the length of the query sequence and the ktup setting. You can override this threshold by adding a positive, nonzero number after the -OPTall command-line parameter, for example: -OPTall=20. A threshold of 1 is the most sensitive setting. Setting the threshold higher than this will speed up the search, at the risk of missing some matches.

Alternatively, you can use the -NOOPTall command-line parameter to direct the program to use the initn scores as the basis for retaining the best matches. In this case, the opt scores are calculated for the matches with the best initn scores only after all of the search set sequences have been scanned. This speeds up the search, but at the cost of sensitivity, and the statistical estimates for such a search will not be valid. When -NOOPTall is specified, the best scores are sorted and reported in order of their initn scores, even though the opt score is calculated.

Lastly, FastA uses a simple linear regression against the natural log of the search set sequence length to calculate a normalized z-score for the sequence pair. (See William R. Pearson, Protein Science 4; 1145-1160 (1995) for an explanation of how this z-score is calculated.) By default, the z-score is calculated from the opt score; if -NOOPTall is on the command line, the z-score is calculated from the initn score instead.
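
The normalization can be sketched as an ordinary least-squares fit of score against the natural log of sequence length, followed by rescaling of the residuals. This is only an outline of the idea. The rescaling to a mean of 50 and a standard deviation of 10 is an assumption, the function names are hypothetical, and the sketch omits any refinement (such as excluding high-scoring related sequences from the fit) that the published method may apply.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def z_scores(scores, lengths, mean=50.0, sd=10.0):
    """Length-normalized scores, rescaled to the assumed mean and sd."""
    logs = [math.log(n) for n in lengths]
    a, b = fit_line(logs, scores)
    residuals = [s - (a + b * x) for s, x in zip(scores, logs)]
    spread = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [mean + sd * r / spread for r in residuals]

z = z_scores([30, 34, 39, 41, 120], [100, 200, 400, 800, 1600])
print([round(value, 1) for value in z])
```

The effect is that a raw score is judged against what is typical for a search set sequence of that length, so long sequences do not rank highly simply because they are long.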

The distribution of the z-scores tends to closely approximate an extreme-value distribution; using this distribution, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() score.
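
As a rough illustration of how an expectation can be derived from a z-score, the Python sketch below uses the standard extreme-value (Gumbel) distribution scaled to a mean of 0 and a variance of 1. The exact formula and constants used by the program are not given here, so treat this form, the assumed z-score scaling (mean 50, standard deviation 10), and the function name as assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329

def expect(z, n_sequences, mean=50.0, sd=10.0):
    """Expected number of sequences reaching z-score z by chance."""
    standardized = (z - mean) / sd
    x = math.exp(-math.pi / math.sqrt(6.0) * standardized - EULER_GAMMA)
    # P(Z >= z) = 1 - exp(-x); expm1 keeps precision when x is tiny.
    probability = -math.expm1(-x)
    return n_sequences * probability

for z in (50.0, 90.0, 120.0, 1325.6):
    print(z, expect(z, 51375))
```

Under these assumptions, an E() value is simply the chance probability for one comparison multiplied by the number of sequences searched, which is why a very high z-score such as 1325.6 is reported with an E() of 0.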

When all of the search set sequences have been compared to the query, the list of best scores is printed. If alignments were requested, the alignments are also printed. For searches with a protein query sequence against a protein search set, a full Smith-Waterman local alignment (not restricted to a band, and therefore allowing unlimited gap lengths) is performed, and a Smith-Waterman score is reported along with the other scores and the alignment itself. By default, the alignment for nucleic acid searches and TFastA is the same local alignment in a band that was performed to calculate the opt score. By means of the -SWalign command-line parameter, you can make the program perform the full Smith-Waterman alignment at the cost of increased computation time.

In evaluating the E() scores, the following rules of thumb can be used: for searches of a protein database of 10,000 sequences, sequences with E() less than 0.01 are almost always found to be homologous. Sequences with E() between 1 and 10 frequently turn out to be related as well. Optimization is important: if -NOOPTall is specified on the command line, E() overestimates the significance of the match, so that unrelated nucleic acid sequences frequently have scores less than 0.0005.

A detailed description of the FastA algorithm is William R. Pearson, "Rapid and Sensitive Sequence Comparison with FASTP and FASTA," in Methods in Enzymology, 183; 63-98, Academic Press, San Diego, California, USA, 1990.

CONSIDERATIONS

The E() scores are affected by similarities in sequence composition between the query sequence and the search set sequence. Unrelated sequences may have "significant" scores because of composition bias.

If there is a database entry that overlaps your query in several places, only the best overlap appears in the alignment display.

The Wisconsin Package(TM) version of FastA searches both strands of nucleic acid queries unless you put -ONEstrand on the command line. Dr. Pearson's FastA searches only the top strand.

There are two ways to control the size of the list of best scores. By default FastA will list scores until a specific E() score is reached. You may set this value by typing it in at the prompt or by using the -EXPect parameter; otherwise the program uses 10.0 for protein searches, 2.0 for nucleic acid searches. (If you are running the program interactively, it will show no more than 40 scores initially, and ask if you want to see more scores if there are any more that are less than the value of the -EXPect parameter.)

If -NOOPTall is on the command line or if the list size is specified on the command line (for example, -LIStsize=40), the E() value is ignored, and the program will list either the number of scores you requested or 40 scores if -NOOPTall is specified alone. If you are running the program interactively, it will then ask if you want to see more scores, up to the maximum of 1000 scores.

You can control the number of alignments using the -NOALIgn and -ALIgn= command-line parameters. The program behaves differently depending on whether it is being run noninteractively (in batch or with -Default on the command line) or interactively. In the noninteractive case, the program displays the number of alignments set by the -ALIgn parameter. (If this is not present, it shows 40 alignments or the number of scores that were listed, whichever is smaller.) If you run the program interactively, it displays the list of best scores, then asks you how many alignments you want to see. This allows you to override the -ALIgn command-line parameter if you see that you need more (or fewer) alignments than you had anticipated. (This prompt does not appear if -NOALIgn is on the command line.)

SUGGESTIONS

Word Size

By default, FastA uses the maximum word size permitted. Use of smaller word sizes increases the sensitivity at the expense of increasing the amount of CPU time required to run the program. A smaller word size (1 or 2) should be used if the query sequence is a short oligonucleotide or short peptide. Because of the way ambiguous residues are treated during the hashing stage of the search, you should not use a word size larger than the longest run of nonambiguous residues in your query sequence.

Gap Creation and Extension Penalties

FastA chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) The histogram display gives a qualitative view of the quality of fit between the actual distribution of scores and the expected distribution of scores. This information may indicate whether or not suitable gap creation and extension penalties were used for the search. When the histogram shows poor agreement between the actual distribution and the theoretical distribution, you might consider using -GAPweight and -LENgthweight to specify higher gap creation and extension penalties, respectively. For example, you might increase the gap creation penalty from 12 to 16 and the gap extension penalty from 2 to 4.
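
To see how much the suggested increase changes the cost of a gap, consider this small Python sketch. It assumes that a gap of a given length costs the creation penalty plus the extension penalty for every residue in the gap. That cost model and the function name are assumptions, so the totals are illustrative only.

```python
def gap_cost(length, creation=12, extension=2):
    """Total penalty for one gap of the given length (assumed model)."""
    if length <= 0:
        return 0
    return creation + extension * length

for length in (1, 3, 10):
    print(length, gap_cost(length), gap_cost(length, creation=16, extension=4))
```

Under this model a three-residue gap costs 18 with the protein defaults and 28 with the higher penalties, so raising the penalties discourages long gaps more strongly than short ones.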

Identifying the Search Set

If you want to search a single database division instead of an entire database, see the "Using Database Sequences" topic of Chapter 2, Using Sequence Files and Databases of the User's Guide for a list of the logical names used for the databases and the divisions of each database. The search set can also consist of a group of sequence files that are not in a database. Use a multiple sequence specification to name these. For information about naming groups of sequences for the search set, see the topics "Specifying Files" and "Using Wildcards" in Chapter 1, Getting Started, and "Using Database Sequences," "Using Multiple Sequence Format (MSF) Files", "Using Rich Sequence Format (RSF) Files", and "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide.

Batch Queue

FastA is one of the few programs in the Wisconsin Package that can take more than a few minutes to run. Most comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using the command-line parameter -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems.

Interrupting a Search: <Ctrl>C

You can type <Ctrl>C to interrupt a search and see the results from the part of the search that has already been completed.

ACKNOWLEDGEMENT

FastA and TFastA were written by Professor William Pearson of the University of Virginia Department of Biochemistry (Pearson and Lipman, Proc. Natl. Acad. Sci., USA 85; 2444-2448 (1988)). In collaboration with Professor Pearson, they were modified and documented for distribution with GCG Version 6.1 by Mary Schultz and Irv Edelman, and for Versions 8 and 9 by Sue Olson.

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % fasta [-INfile1=]ggamma.pep -Default

Prompted Parameters:

[-INfile2=]SwissProt:*         search set (all of SwissProt)
[-OUTfile=]ggamma.fasta        output file name
-BEGin=1 -END=148              range of interest
-WORdsize=2                    word size
-EXPect=2.0                    lists scores until E() value reaches 2.0

Local Data Files:

-MATRix=fastadna.cmp           scoring matrix for nucleic acids
-MATRix=blosum50.cmp           scoring matrix for peptides

Optional Parameters:

-GAPweight=16      gap creation penalty   (12 is protein default)
-LENgthweight=4    gap extension penalty   (2 is protein default)
-SINce=6.90        limits search to sequences dated on or after June 1990
-ONEstrand         searches only the top strand of nucleotide sequences
-PAMfactor         uses scoring matrix to calculate initial diagonal scores
-LIStsize=40       shows the best 40 scores (overrides EXPect)
-NOATTRibutes      suppresses writing the Begin, End, and Strand
                     list attributes to the list of best scores
-ALIgn=20          shows the best 20 alignments
-NOALIgn           suppresses sequence alignments
-OPTall=20         immediately computes opt score when the initn score is 20
                     or higher; sorts on opt score
-NOOPTall          doesn't compute opt score during search; sorts on initn
-SWalign           does final alignment as Smith-Waterman for nuc searches
-SHOWall           shows complete sequences in alignment, not just overlaps
-MARKx=3           determines the alignment display mode
-NOHIStogram       suppresses printing the histogram
-LINesize=60       number of sequence symbols per line of the alignment
-NODOCLines        suppresses sequence documentation in the alignment
-NOMONitor         suppresses the screen trace for each search set sequence
-BATch             submits the program to run in the batch queue
-MINLength=1000    searches only sequences of 1000 or more residues
-MAXLength=5000    searches only sequences of 5000 or fewer residues

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program default scoring matrix file in a public data directory unless you 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
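
The search order for a scoring matrix named without a directory can be pictured with a short Python sketch. It only mimics the order of precedence described above. The function name is hypothetical, and temporary directories stand in for the working directory and the directories behind the logical names MyData, GenMoreData, and GenRunData.

```python
import tempfile
from pathlib import Path

def find_matrix(filename, search_dirs):
    """Return the first copy of filename found, searching in order."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None

# Demonstration with stand-in directories, listed in search order.
with tempfile.TemporaryDirectory() as root:
    names = ("work", "MyData", "GenMoreData", "GenRunData")
    dirs = [Path(root) / name for name in names]
    for d in dirs:
        d.mkdir()
    (dirs[1] / "blosum50.cmp").write_text("personal copy")
    (dirs[3] / "blosum50.cmp").write_text("public copy")
    found_dir = find_matrix("blosum50.cmp", dirs).parent.name
    missing = find_matrix("nosuch.cmp", dirs)

print(found_dir, missing)
```

In the demonstration the copy in the MyData stand-in is found before the public copy, which is how a personal, modified matrix takes precedence over the default.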

FastA reads a scoring matrix containing the values for every possible match from your working directory or the public data directory. The files fastadna.cmp (for nucleic acid sequences) and blosum50.cmp (for protein sequences) contain the default match values; blosum50.cmp is a BLOSUM50 matrix. To modify these files to suit your own needs, use the Fetch program to obtain copies of them.

OPTIONAL PARAMETERS

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see the Local Scoring Matrices topic above.

-SINce=6.90

limits the search to sequences that have been entered into the database or modified since June 1990. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases support this parameter.
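
As an illustration of how a value such as 6.90 maps to June 1990, here is a small Python sketch. The function name parse_since and the century argument are hypothetical; the sketch assumes the value is always written as month.two-digit-year, which the text implies but does not state.

```python
def parse_since(value, century=1900):
    """Split a month.year value such as "6.90" into (year, month).

    The century default is an assumption made for this sketch; the manual
    only gives the single example 6.90 = June 1990.
    """
    month_text, year_text = value.split(".")
    month = int(month_text)
    year = century + int(year_text)
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12: %r" % value)
    return year, month
```

Tuples of (year, month) compare in date order, so an entry dated (1991, 2) passes a test such as `(1991, 2) >= parse_since("6.90")`. Whether the cutoff month itself is included is not stated in the text.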

-ONEstrand

searches only the top strand of nucleotide sequences.

-PAMfactor

uses a scoring matrix for the calculation of initial diagonal scores. Instead of using a constant factor for each match in a word, the identical match scores from the scoring matrix are used. This is the default for protein sequences, while -NOPAMfactor is the default for nucleic acid sequences.
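
The difference between the two modes can be sketched in Python as follows. This is only an illustration of the idea: the function name word_score, the match_factor argument, and the dictionary layout of the matrix are assumptions, and the matrix values used in the usage note are toy numbers rather than values from any distributed matrix file.

```python
def word_score(word, matrix=None, match_factor=1):
    """Score one perfectly matching word on a diagonal.

    matrix=None models -NOPAMfactor: every matched symbol contributes the
    same constant (match_factor is a hypothetical value).
    With a matrix, each symbol contributes its identical-match score,
    looked up as matrix[(symbol, symbol)].
    """
    if matrix is None:
        return match_factor * len(word)
    return sum(matrix[(symbol, symbol)] for symbol in word)
```

With a toy matrix `{("Q", "Q"): 5, ("L", "L"): 4}`, the word QL scores 9 in the matrix mode, while the constant mode gives it the same score as any other word of length two.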

-GAPweight=12

specifies the gap creation penalty that is subtracted from the alignment score whenever a gap is created.

-LENgthweight=2

specifies the gap extension penalty that is subtracted from the alignment score for each residue added to an existing gap.
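
Taken together, the two gap parameters give a penalty that grows linearly with gap length. The Python sketch below shows the arithmetic under a stated assumption: the text does not say whether the first residue of a gap is charged the extension penalty in addition to the creation penalty, so the hypothetical charge_first_residue flag exposes both readings.

```python
def gap_penalty(gap_length, gap_weight=12, length_weight=2,
                charge_first_residue=True):
    """Total penalty subtracted from the alignment score for one gap.

    gap_weight and length_weight mirror -GAPweight and -LENgthweight.
    charge_first_residue is an assumption of this sketch, not a documented
    program option.
    """
    if gap_length <= 0:
        return 0
    extended = gap_length if charge_first_residue else gap_length - 1
    return gap_weight + length_weight * extended
```

With the example values 12 and 2, a three-residue gap costs 12 + 2 * 3 = 18 under the first reading and 12 + 2 * 2 = 16 under the second.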

-EXPect=2.0

shows all scores whose E() value is less than 2.0. Ignored if -LIStsize or -NOOPTall is on the command line.

-LIStsize=40

shows the best 40 scores. Overrides -EXPect.
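
The interaction between -EXPect and -LIStsize can be sketched in Python as below. The function name select_hits and the (name, score, e_value) tuple layout are hypothetical, and the effect of -NOOPTall is not modeled.

```python
def select_hits(hits, expect=2.0, listsize=None):
    """Choose which hits to list.

    hits is a list of (name, score, e_value) tuples. If listsize is given
    it overrides expect, as the text states; otherwise only hits whose
    E() value is below expect are kept.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    if listsize is not None:
        return ranked[:listsize]
    return [hit for hit in ranked if hit[2] < expect]
```

For example, a hit with an E() value of 5.0 is dropped when only the default -EXPect threshold applies, but it is listed under -LIStsize=40 if it is among the 40 best scores.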

-NOATTRibutes

omits from the list of best scores the Begin, End, and Strand attributes, which indicate the region of the search set sequence that was aligned with the query sequence.

-ALIgn=10

limits the number of alignments to display in the output file to the 10 best-scoring regions in the list.

-NOALIgn

suppresses the sequence alignments in the output file. The resulting output file can be used as a list file for input to other Wisconsin Package programs.

-OPTall=20

immediately performs an alignment and calculates the opt score when the initn score is greater than the specified threshold score. This parameter allows you to override the default threshold calculated by the program. Scores are sorted and saved by opt score during the search.

-NOOPTall

doesn't compute the opt score until the search is complete. Scores are sorted and saved by initn score instead of by opt score.

-SWalign

does an unlimited Smith-Waterman alignment as the final alignment for nucleotide searches and TFastA searches, instead of the "alignment in a band" version of Smith-Waterman. (Note: this can be very slow.)

-SHOWall

shows entire sequences in the alignment display, instead of just the best region of overlap and its surroundings.

-MARKx=3

determines the alignment display mode -- especially the symbols that identify matches and mismatches. The default value, 3, uses a pipe character (|) to show identities and a colon (:) to show conservative replacements. -MARKx=0 uses a colon to show identities and a period (.) to show conservative replacements. -MARKx=1 will not mark identities; instead, conservative replacements are connected with a lowercase x, and non-conservative substitutions are connected with an uppercase X. If -MARKx=2, the residues in the second sequence are shown only if they differ from the first sequence.

Use -MARKx=10 to get aligned sequences in the FastA "parsable" output format. A document describing this format appears after FastA in the Program Manual.
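
The marker symbols for modes 3, 0, and 1 can be illustrated with a short Python sketch. It is not the program's display code. The function name marker_line, the is_conservative callback, and the gap character are hypothetical, and the sketch assumes that modes 3 and 0 leave non-conservative mismatches unmarked, which the text does not state. Modes 2 and 10 are not modeled.

```python
def marker_line(seq1, seq2, markx=3, is_conservative=lambda a, b: False,
                gap="-"):
    """Build the line of symbols drawn between two aligned sequences.

    is_conservative(a, b) decides what counts as a conservative
    replacement; gap is the assumed gap character. Both are
    illustrative assumptions.
    """
    # (identity, conservative replacement, other mismatch) for each mode
    symbols = {
        3: ("|", ":", " "),
        0: (":", ".", " "),
        1: (" ", "x", "X"),
    }
    if markx not in symbols:
        raise ValueError("this sketch only models -MARKx=0, 1, and 3")
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    identity, conservative, other = symbols[markx]
    line = []
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            line.append(" ")  # positions opposite a gap are left unmarked
        elif a == b:
            line.append(identity)
        elif is_conservative(a, b):
            line.append(conservative)
        else:
            line.append(other)
    return "".join(line)
```

If D and E are treated as a conservative pair, aligning ACDE with ACEF gives `||: ` in mode 3, `::. ` in mode 0, and two blanks followed by `xX` in mode 1.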

-NOHIStogram

suppresses printing the histogram.

-LINesize=60

lets you set the number of sequence symbols in each line of the alignment to any number between 60 and 200.
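
A rough Python sketch of the wrapping that this parameter controls follows. The function name wrap_alignment is hypothetical, and rejecting values outside 60 to 200 with an error is an assumption; the text gives the permitted range but not what the program does with other values.

```python
def wrap_alignment(symbols, linesize=60):
    """Cut a string of aligned sequence symbols into display lines."""
    if not 60 <= linesize <= 200:
        raise ValueError("linesize must be between 60 and 200")
    return [symbols[start:start + linesize]
            for start in range(0, len(symbols), linesize)]
```

A 130-symbol alignment wrapped at the default of 60 yields lines of 60, 60, and 10 symbols.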

-NODOCLines

suppresses the documentation from the search set sequence accompanying the alignment in the output file. Use -DOCLines=5 to copy only five non-blank lines of documentation.
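
The effect of a setting such as -DOCLines=5 can be pictured with the Python sketch below. The function name doc_lines is hypothetical, and the sketch assumes that the lines copied are the first non-blank ones.

```python
def doc_lines(documentation, limit=5):
    """Return up to limit non-blank lines of sequence documentation."""
    non_blank = [line for line in documentation.splitlines() if line.strip()]
    return non_blank[:limit]
```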

-MINLength=1000

restricts the search to search set sequences that are equal to or longer than 1000 residues.

-MAXLength=5000

restricts the search to search set sequences that are equal to or shorter than 5000 residues.
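
Both length limits are inclusive, as the short Python sketch below shows. The function name within_length is hypothetical; a limit of None stands for a parameter that was not given on the command line.

```python
def within_length(length, min_length=None, max_length=None):
    """Return True if a sequence of this length would be searched."""
    if min_length is not None and length < min_length:
        return False
    if max_length is not None and length > max_length:
        return False
    return True
```

With -MINLength=1000 and -MAXLength=5000, sequences of exactly 1000 or exactly 5000 residues are still searched, while those of 999 or 5001 residues are skipped.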

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-MONitor=100

monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 100 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

Printed: November 18, 1996 13:05 (1162)

Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com