BLAST

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INTERPRETING OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
CHOOSING SEARCH SETS
THEORY
ALGORITHM
CONSIDERATIONS
SUGGESTIONS
FILTERING OUT LOW COMPLEXITY SEQUENCES
AMINO ACID SCORING
NUCLEOTIDE SCORING
ALTERNATIVE GENETIC CODES
NETWORK CONSIDERATIONS
COMMAND-LINE SUMMARY
CITING BLAST
ACKNOWLEDGEMENT
LOCAL DATA FILES
OPTIONAL PARAMETERS

FUNCTION

BLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. BLAST can search databases on your own computer or databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

DESCRIPTION

BLAST, or Basic Local Alignment Search Tool, uses the method of Altschul et al. (J. Mol. Biol. 215; 403-410 (1990)) to search for similarities between a query sequence and all the sequences in a database. The query sequence and the database you want to search can be either protein or nucleic acid in any combination. The GCG BLAST program supports five different programs in the BLAST family:

BLASTP, Protein Query Searching a Protein Database

Each database sequence is compared to the query in a separate protein-protein pairwise comparison.

BLASTX, Nucleotide Query Searching a Protein Database

The query is translated, and each of the six products is compared to each database sequence in a separate protein-protein pairwise comparison.

BLASTN, Nucleotide Query Searching a Nucleotide Database

Each database sequence is compared to the query in a separate nucleotide-nucleotide pairwise comparison.

TBLASTN, Protein Query Searching a Nucleotide Database

Each nucleotide database sequence is translated, and each of the six products is compared to the query in a separate protein-protein pairwise comparison.

TBLASTX, Nucleotide Query Searching a Nucleotide Database

The query and each database sequence are both translated in six frames, and each of the six query products is compared to each of the six database products, for 36 pairwise comparisons per database sequence. Because this program involves more computation than the others, it is limited to searches of Alu, STS, EST, and GSS databases when doing remote searches of the NCBI databases.
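The six translation products mentioned for BLASTX, TBLASTN, and TBLASTX can be sketched in Python as follows. This is a minimal illustration that assumes the standard genetic code and a sequence containing only A, C, G, and T; the function names and the frame labels (+1 to +3, -1 to -3) are hypothetical and are not taken from BLAST itself.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame label (+1..+3 forward, -1..-3 reverse) to its translation."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For example, six_frame_translations("ATGGCC") gives "MA" for frame +1 and "GH" for frame -1. A TBLASTX-style comparison would pair each of the six query products with each of the six database products.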

Normally, BLAST decides which BLAST program you want to use simply by looking at the type (protein or nucleic acid) of your query sequence and the database you have selected. In the case of nucleotide-nucleotide searches, there are two programs that can do the search. By default, BLASTN is used. To search using TBLASTX instead, add -TBLASTX to the command line.
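The selection rule just described can be written out as a small lookup. The Python sketch below only illustrates the rule as stated in this section; the function name and the "P"/"N" type codes are assumptions chosen for the example and do not describe how the GCG program is implemented.

```python
# (query type, database type) -> program chosen by default.
DEFAULT_PROGRAM = {
    ("P", "P"): "BLASTP",
    ("N", "P"): "BLASTX",
    ("N", "N"): "BLASTN",
    ("P", "N"): "TBLASTN",
}


def choose_program(query_type, database_type, tblastx=False):
    """Pick a BLAST program from the sequence types ('P' or 'N').

    tblastx mirrors the effect of adding -TBLASTX to the command line,
    which only makes sense for a nucleotide query and nucleotide database.
    """
    key = (query_type.upper(), database_type.upper())
    if tblastx:
        if key != ("N", "N"):
            raise ValueError("TBLASTX needs a nucleotide query and database")
        return "TBLASTX"
    return DEFAULT_PROGRAM[key]
```

With these assumptions, choose_program("N", "N") returns "BLASTN" and choose_program("N", "N", tblastx=True) returns "TBLASTX".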

BLAST can search either databases maintained at your institution (a local search) or, if you are connected to the Internet, databases maintained by NCBI (a remote search). Remote searches require almost no resources from your own computer. More importantly, the databases at NCBI are updated daily and may be more current than those maintained locally.

BLAST is a statistically driven search method that finds regions of similarity between your query and database sequences. These are called segment pairs, and consist of gapless alignments of any part of two sequences. Within these aligned regions, the sum of the scoring matrix values of their constituent symbol pairs is higher than some level that you would expect to occur by chance alone.

You are prompted to set an expectation level for the entire search. By default this level is 10.0, which means that hits are reported only if they have a score that would be expected to occur purely by chance fewer than 10 times in this particular search.

EXAMPLE

Here is a session using BLAST to find the sequences in SWISS-PROT with similarities to a zein gene:


% blast

 BLAST search with what query sequence ?  SW:zea2_maize

 Search for query in what sequence database:

 REMOTE
   1) nr          p Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
   2)   pdb       p PDB protein sequences
   3)   swissprot p SwissProt sequences
   4) yeast       p Saccharomyces cerevisiae protein sequences
   5) kabat       p Kabat Sequences of Proteins of Immunological Interest
   6) alu         p Translations of Select Alu Repeats from REPBASE
   7) month       p All new or revised GenBank CDS translation+PDB+SwissProt+PI
   8) nr          n Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's
   9)   pdb       n PDB nucleotide sequences
  10)   vector    n Vector subset of GenBank
  11) yeast       n Saccharomyces cerevisiae genomic nucleotide sequences
  12) est         n Non-redundant Database of GenBank+EMBL+DDBJ EST Division
  13) sts         n Non-redundant Database of GenBank+EMBL+DDBJ STS Division
  14) gss         n Genome Survey Sequences
  15) mito        n Database of mitochondrial sequences, Rel. 1.0, July 1995
  16) kabat       n Kabat Sequences of Nucleic Acid of Immunological Interest
  17) epd         n Eukaryotic Promotor Database
  18) alu         n Select Alu Repeats from REPBASE
  19) month       n All new or revised GenBank+EMBL+DDBJ+PDB sequences released
 LOCAL
  20) swissprot   p SWISS-PROT
  21) pir         p Protein Information Resource
  22) genembl     n GenBank+EMBL
  23) est         n Expressed Sequence Tags
  24) sts         n Sequence Tagged Sites
  25) gss         n Genome Survey Sequences
  26) nuc         n Test nucleotide data.
  27) prot        p Test protein data.

 Please choose one (* 1 *):  3

 Ignore hits expected to occur by chance more than (* 10.0 *) times?

 Limit the number of sequences in my output to (* 250 *) ?

 What should I call the output file (* zea2_maize.blastp *) ?

 Trying cruncher.nlm.nih.gov (130.14.25.175)

 Connected to cruncher.nlm.nih.gov

 Search in progress on the network server.

 .....

 Retrieving results.

 .....................................................

 Done!

%

OUTPUT

Below is part of the output from the search in the example session:


///////////////////////////////////////////////////////////////////////////////

BLASTP 1.4.9MP [26-March-1996] [Build 14:27:01 Apr  1 1996]

Reference:  Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol.
215:403-10.

Query=  TITLE sw:zea2_maize
        (235 letters)

Database:  Non-redundant SwissProt sequences
           52,724 sequences; 18,538,780 total letters.
Searching..................................................done

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

sp|P04704|ZEA2_MAIZE ZEIN-ALPHA PRECURSOR (19 KD) (CLONE ...  1164  3.3e-160  1
sp|P24449|ZEAC_MAIZE ZEIN-ALPHA PRECURSOR (19 KD) (PMS1).     1103  1.3e-151  1
sp|P02859|ZEA1_MAIZE ZEIN-ALPHA PRECURSOR (19 KD) (CLONE ...   644  5.6e-149  2

///////////////////////////////////////////////////////////////////////////////

sp|P00856|ATP8_YEAST ATP SYNTHASE PROTEIN 8 (ATPASE-ASSOC...    50  0.9998    1
sp|P48882|ATP8_HANWI ATP SYNTHASE PROTEIN 8 (A6L).              50  0.9998    1
sp|P05084|HUNB_DROME HUNCHBACK PROTEIN.                         43  0.9999    3

>sp|P04704|ZEA2_MAIZE ZEIN-ALPHA PRECURSOR (19 KD) (CLONE ZG99).
            Length = 235

 Score = 1164 (544.2 bits), Expect = 3.3e-160, P = 3.3e-160
 Identities = 235/235 (100%), Positives = 235/235 (100%)

Query:     1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
             MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ
Sbjct:     1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60

Query:    61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120
             AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN
Sbjct:    61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120

Query:   121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180
             QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS
Sbjct:   121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180

Query:   181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235
             PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF
Sbjct:   181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235

///////////////////////////////////////////////////////////////////////////////

>sp|P05084|HUNB_DROME HUNCHBACK PROTEIN.
            Length = 758

 Score = 43 (20.1 bits), Expect = 9.0, Sum P(3) = 1.0
 Identities = 10/20 (50%), Positives = 11/20 (55%)

Query:    68 PLSPLFLQQSSALLQQLPLV 87
             P  P+   Q SA LQ  PLV
Sbjct:   410 PAQPVATSQLSAALQGFPLV 429

///////////////////////////////////////////////////////////////////////////////

Parameters:
  V=250
  B=100
  P=4

  -ctxfactor=1.00
  E=10

  Query                        -----  As Used  -----    -----  Computed  ----
  Frame  MatID Matrix name     Lambda    K       H      Lambda    K       H
   +0      0   BLOSUM62        0.324   0.134   0.375    same    same    same

  Query
  Frame  MatID  Length  Eff.Length   E    S W   T  X     E2  S2
   +0      0      235       235      10. 57 3  10 22    0.24 32

Statistics:
  Query          Expected         Observed           HSPs       HSPs
  Frame  MatID  High Score       High Score       Reportable  Reported
   +0      0    61 (28.5 bits)  1164 (544.2 bits)     654        654

  Query         Neighborhd  Word      Excluded    Failed   Successful  Overlaps
  Frame  MatID   Words      Hits        Hits    Extensions Extensions  Excluded
   +0      0      6728    12891134     3153359     9718104    19671       755

  Database:  Non-redundant SwissProt sequences
    Release date:  September 23, 1996
    Posted date:  3:43 AM EDT Sep 23, 1996
  # of letters in database:  18,538,780
  # of sequences in database:  52,724
  # of database sequences satisfying E:  62
  No. of states in DFA:  530 (52 KB)
  Total size of DFA:  120 KB (128 KB)
  Time to generate neighborhood:  0.01u 0.00s 0.01t  Real: 00:00:00
  No. of processors used:  4
  Time to search database:  21.89u 0.14s 22.03t  Real: 00:00:05
  Total cpu time:  22.01u 0.17s 22.18t  Real: 00:00:06

The output has four parts: 1) an introduction that tells where the search occurred and what database and query were compared; 2) a list of the sequences in the database containing segment pairs whose scores were least likely to have occurred by chance; 3) a display of the alignments of the high-scoring segment pairs showing identical and similar residues; and 4) a complete list of the parameter settings used for the search.

Since BLAST only looks for alignments that do not contain gaps, there will often be more than one segment pair associated with each database sequence.

If you searched a local database, the BLAST output is a list file that is suitable for input to any GCG program that allows indirect file specifications. (For information about indirect file specification, see Chapter 2, Using Sequence Files and Databases of the User's Guide.)

INTERPRETING OUTPUT

Scores

In the list of sequences, the column labeled High Score contains the score of the highest-scoring segment pair for the pairwise comparison of that database sequence and your query. In the segment pair alignment display, the Score is the sum of the scoring matrix values in the segment pair being displayed.

Probabilities

There is a probability, for instance 3.3e-160 in the example, associated with each pairwise comparison in the list and with each segment pair alignment. In the list, this number (which means 3.3 x 10^-160) is the probability that you would observe a score or group of scores as high as the observed high score (or scores) purely by chance when you do a search of this size. More than one HSP may contribute to the probability in the list, which is why the P(N) column has the superheading "Smallest Sum Probability." In the alignment displays, the probability P is the probability that a segment pair score as high as that particular segment pair's score would be observed in a search of this size. This probability may be higher than the probability for the whole pairwise comparison if more than one segment pair contributed to the probability for the whole comparison.

An ideal search would find hits that go from extremely unlikely to ones whose best scores should have occurred by chance alone (that is, with probabilities approaching 1.0).
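The Expect and P values shown in the alignment displays are related. If the number of chance hits is treated as Poisson distributed (the usual assumption in Karlin-Altschul statistics), the probability of seeing at least one chance hit is 1 - exp(-E). The Python sketch below shows that relationship; the function name is hypothetical, and this document does not state that the program computes P exactly this way.

```python
import math


def probability_from_expect(expect):
    """Probability of at least one chance hit, given the expected number.

    Uses expm1 so that very small expectations (such as 3.3e-160)
    are not rounded away to zero.
    """
    return -math.expm1(-expect)
```

For a very small expectation the two numbers are nearly identical, which matches the first alignment in the example (Expect = 3.3e-160, P = 3.3e-160). An expectation of 9.0 gives a probability of about 0.9999.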

N Score

This column beneath the Smallest Sum Probability heading in the list of sequences indicates how many HSPs were involved in computing P(N). If P(N) arose from combining more than one HSP score from the same pairwise comparison, this number will be greater than 1. There is more about the concept of combining the scores of more than one HSP under the THEORY topic.

Bit Score

Each aligned segment pair has a normalized score expressed in bits that lets you estimate the magnitude of the search space you would have to look through before you would expect to find an HSP score as good as or better than this one by chance. If the bit score is 30, you would have to score, on average, about 1 billion independent segment pairs (2^30) to find a score this good by chance. Each additional bit doubles the size of the search space. This bit score represents a probability; one over two raised to this power is the probability of finding such a segment by chance. Bit scores represent a probability level for sequence comparisons that is independent of the size of the search.
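The bit scores printed in the sample output appear consistent with multiplying the raw score by Lambda and dividing by the natural logarithm of 2. The Python sketch below checks that reading against figures shown under OUTPUT. The formula is inferred from those figures and is an assumption, not something this document states, and the function name is hypothetical.

```python
import math


def bits_from_raw_score(raw_score, lam):
    """Convert a raw segment pair score to bits, assuming bits = lambda*S/ln 2."""
    return lam * raw_score / math.log(2)


# Lambda = 0.324 is the value printed for BLOSUM62 in the sample output.
for raw in (43, 61, 1164):
    print(raw, round(bits_from_raw_score(raw, 0.324), 1))
```

With Lambda = 0.324 this gives 20.1 bits for a raw score of 43 and 28.5 bits for 61, as printed in the sample output. For 1164 it gives about 544.1 against the printed 544.2, a difference that may simply reflect Lambda being displayed to three decimal places.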

The size of the search space is proportional to the product of the query sequence length times the sum of the lengths of the sequences in the database. This product, referred to as N in Altschul's publications, is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is about 0.13. Using the database size printed in the example output, this product for the search in the example session is 235 x 18,538,780 x 0.13 or about 0.57 billion, so a bit score of 30 (corresponding to a search space of 1 billion) could easily have occurred by chance alone.

BLAST Parameters

At the end of the output is a complete listing of parameter settings along with some trace information about the search. Some of these parameters are described in this document, but to get complete documentation on these parameters, look at the BLAST document on the World Wide Web at http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html.

INPUT FILES

BLAST accepts a single protein sequence or a single nucleic acid sequence as the query sequence. The search set is a specially formatted database. See the GCGToBLAST entry in the Program Manual for information on how to create a local database that BLAST can search from a set of sequences in GCG format. The function of BLAST depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

GCGToBLAST combines any set of GCG sequences into a database that you can search with BLAST.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST. TFastA does a Pearson and Lipman search for similarity between a query peptide sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied peptide sequences in a nucleotide sequence database are similar to my peptide sequence?"

ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake.

FindPatterns, StringSearch, and Names are other sequence identification programs.

RESTRICTIONS

Searching remote databases opens up the possibility of unauthorized access to your query sequence. You should not use confidential query sequences for remote searches.

BLAST does not accept a conventional GCG sequence specification for the search set. You can search only databases known to BLAST and each only in its entirety. You cannot restrict the range of the sequence used as the query.

The remote search may find hits on sequences that cannot be retrieved with Fetch. To do remote searching, your computer must be connected to the Internet and your machine's IP number must be registered with NCBI. (See your system manager for more information or have your system manager contact GCG for help by phone at (608) 231-5200 or by e-mail at Help@GCG.Com.)

You cannot select a list size or expectation threshold of more than 1,000. The output file can be quite large. If you run out of disk space you may have to delete one or more files before you can continue working.

CHOOSING SEARCH SETS

BLAST can search only a specially compressed form of the data. Therefore, you can search only those databases that are available in this form, and you must search them in their entirety. If you want to restrict the search to a specific set of sequences, use the program GCGToBLAST to create a specially compressed database consisting of just those sequences.

To name a searchable database interactively, choose the number of the database of interest from the menu. On the command line, use a parameter like -INfile2=genbank to choose the name of the database you want to search.

Unfortunately, because there are both nucleotide and protein databases called nr (and pdb) in the menu, BLAST cannot be sure which nr you mean if you use a parameter like -INfile2=nr on the command line. Therefore, if the database you want to search cannot be named unambiguously with the -INfile2 parameter, add either -DBNucleotideonly or -DBProteinonly to the command line.

THEORY

This section is for advanced users.

Definition of a Maximal-scoring Segment Pair (MSP)

In this discussion a segment pair refers to a gapless alignment of any part of two sequences. The score for any particular segment pair is the sum of the scoring matrix values for the symbol pairs in the segment. A maximal-scoring segment pair or MSP is the segment pair with the highest score in a pairwise comparison. MSP scores cannot be improved by extending or shortening the alignment at either end. There will be at least one MSP for every pairwise comparison.
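To make the definition concrete, the Python sketch below finds the maximal-scoring segment pair for two short sequences by exhaustively scanning every diagonal. This brute-force scan is only an illustration of the definition and is not how BLAST searches. The scoring values (+5 for a match, -4 for a mismatch) and the function names are hypothetical stand-ins for a real scoring matrix.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy scoring scheme standing in for a real scoring matrix."""
    return match if a == b else mismatch


def maximal_segment_pair(query, subject, score=pair_score):
    """Return (score, query_start, subject_start, length) of the best
    gapless local alignment, scanning each diagonal once."""
    best = (0, 0, 0, 0)
    for diag in range(-(len(query) - 1), len(subject)):
        q, s = max(0, -diag), max(0, diag)
        run_score, run_q, run_s = 0, q, s
        while q < len(query) and s < len(subject):
            if run_score <= 0:
                # A run that has dropped to zero cannot help; start afresh.
                run_score, run_q, run_s = 0, q, s
            run_score += score(query[q], subject[s])
            if run_score > best[0]:
                best = (run_score, run_q, run_s, q - run_q + 1)
            q += 1
            s += 1
    return best
```

For instance, maximal_segment_pair("QQHELLOWW", "RRRHELLOTT") returns (25, 2, 3, 5): the five matching symbols starting at query offset 2 and subject offset 3 (counting from zero). Extending or shortening that segment at either end lowers the score.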

Highest MSP Score Expected by Chance Can Be Estimated

Karlin and Altschul have shown that for a set of sequence comparisons of known size and composition, the highest segment pair score S expected to occur by chance can be estimated if you can assume that 1) the frequencies of substitution of each residue to every other residue at some level of divergence are known and 2) the probability of finding each amino acid at each position in a protein is simply proportional to its frequency in the database (rather than being heavily constrained by neighboring residues) (Proc. Natl. Acad. Sci. USA 87; 2264-68 (1990)). If you use the highest expected score S as a cutoff, segment pairs scoring above it would not be expected to occur by chance. The ability to discriminate between expected and unexpected segment pair scores is the core concept in BLAST.

In later work, Karlin and Altschul showed that you can find statistically significant groups of high-scoring segment pairs (HSPs) at locations consistent with their being joined together in an alignment even when each HSP score taken by itself would not have been significant (Proc. Natl. Acad. Sci. USA 90; 5873-5877 (1993)).

PAM Matrices Defined

Dayhoff measured the rate at which any residue is substituted for any other residue in a set of alignments where there were 10 or fewer substitutions per 100 residues. (With few substitutions she could be sure of the alignments and reasonably sure that each observed residue difference arose from a single substitution event.) Given a particular starting residue, the expected frequency of each kind of residue observed at its position depends on how long the sequences being compared have been diverging from each other. (Different protein families diverge at different rates.) Dayhoff invented a scale based on the number of point mutations incorporated into proteins. The scale is referred to as PAM, a rearranged acronym for Accepted Point Mutation. The distance that two protein sequences have diverged from one another is often expressed in PAM units and referred to as a PAM distance. Two homologous proteins at a PAM distance of 10 should have about 10 substitutions per 100 residues (Atlas of Protein Sequence and Structure, M.O. Dayhoff, ed., Volume 5, Supplement 3, pp. 353-358, National Biomedical Research Foundation, Washington, D.C., USA (1979)).

Dayhoff scored substitutions in sequences whose PAM distances were small, 10 or less. This measurement let her infer the rate of substitution per PAM distance unit of each amino acid into every other. Knowing the frequency of each amino acid, the frequency of each amino acid pair at any PAM distance could then be calculated. These frequencies are referred to as target frequencies in the literature. They are the expected frequency that you would observe each amino-acid pair in correctly aligned homologs that have diverged by a known amount.

The ratio of each of these target frequencies to the frequency for that same amino acid pair in the background is the information that makes homologous sequences detectable. PAM matrices contain, for a given PAM distance, the logarithm of this ratio for each amino acid pair, so pairs seen more often in true homologs than in the background score positive and pairs seen less often score negative. A segment pair score, in the context of BLAST, is simply the sum of the corresponding PAM matrix values for each amino acid pair in the alignment.
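The relationship between target frequencies, background frequencies, and matrix values can be sketched in Python. The frequencies used in the usage note are invented for illustration and do not come from any PAM matrix; the function name and the choice of base-2 logarithms (which give values in bits) are also assumptions.

```python
import math


def log_odds_score(target_freq, freq_a, freq_b):
    """Log-odds value in bits for one residue pair.

    target_freq: frequency of the pair in correctly aligned homologs.
    freq_a, freq_b: background frequencies of the two residues; their
    product is the frequency of the pair expected by chance.
    """
    return math.log2(target_freq / (freq_a * freq_b))
```

If two residues each have a background frequency of 0.1, the pair is expected by chance at a frequency of 0.01. A target frequency of 0.02 then scores about +1 bit (twice as common in homologs as by chance), and a target frequency of 0.005 scores about -1 bit. Published matrices scale and round such values to integers.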

Interesting Alignments Probably Contain HSPs

One assumption of BLAST is that an optimal alignment between any two diverged, but still observably similar sequences will contain one or more segment pairs with scores high enough not to be expected by chance. If this is true, then you have a statistic that can distinguish pairs of sequences that are related from pairs that are not. If you could discover a fast way to find all of the segment pairs in a pairwise comparison that score above some reasonable cutoff score, you could build a tool to search a sequence database and report only sequences whose similarity to your query is significant. BLAST uses a heuristic algorithm that tries to do this.

ALGORITHM

If you have understood the THEORY topic you are ready to learn more about how BLAST is implemented.

Most HSPs Will Have Regions of Near Identity

BLAST assumes that any high-scoring segment pair will contain two similar (but not necessarily identical) words within it.

Similar Words Can Be Treated As Synonyms

Consider every possible word pair in the language of words of fixed length W as tiny segment pairs, each with some score t. Like a segment pair score, this t-score is the sum of the scoring matrix values for the symbol pairs in the word pair. There is a second user-settable parameter T, at or above which the t-score from any word pair indicates that the words in that pair should be considered synonymous. Altschul and company refer to any group of synonyms of length W and score T or greater as a neighborhood. For a given scoring matrix and value of W, increasing T reduces the size of the neighborhood, while decreasing T increases the size of the neighborhood.
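The effect of T on the size of a neighborhood can be seen with a small example. The Python sketch below enumerates every word of length W over a four-letter toy alphabet and keeps those whose t-score against a query word is T or greater. The alphabet, the pair scores, and the function names are hypothetical; a real search would use the full scoring matrix, and BLAST does not necessarily build neighborhoods by exhaustive enumeration.

```python
from itertools import product

# Toy alphabet and symmetric pair scores, for illustration only.
ALPHABET = "ILVK"
PAIR_SCORES = {
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4, ("K", "K"): 5,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
    ("I", "K"): -3, ("L", "K"): -2, ("V", "K"): -2,
}


def score(a, b):
    """Look up a pair score regardless of the order of the two symbols."""
    return PAIR_SCORES[(a, b)] if (a, b) in PAIR_SCORES else PAIR_SCORES[(b, a)]


def word_score(word1, word2):
    """t-score of a word pair: the sum of its symbol pair scores."""
    return sum(score(a, b) for a, b in zip(word1, word2))


def neighborhood(word, threshold):
    """All words of the same length scoring at least threshold against word."""
    candidates = map("".join, product(ALPHABET, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) >= threshold)
```

With these toy scores, neighborhood("ILK", 12) contains ILK and VLK, while lowering T to 11 adds IIK and LLK. Raising T to 13 leaves only the word itself.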

Finding Word Matches

BLAST belongs to a class of comparisons that use k-tuple preprocessing. To implement this, BLAST uses what is known in our trade as a deterministic finite automaton (DFA) that, for any word in a database sequence, returns the positions where that word or its synonyms occur in the query sequence. (The details of the DFA's implementation have not, to our knowledge, been published.) The search consists of taking each word in each of the database sequences and finding out if it or any word synonymous with it occurs in the query sequence.
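The lookup step can be illustrated without a DFA. The Python sketch below substitutes an ordinary dictionary that maps each word to the query positions where it (or, optionally, one of its synonyms) occurs, then scans a database sequence word by word. This is a stand-in chosen for readability and is not BLAST's implementation; the function names and the optional neighbors argument are hypothetical.

```python
from collections import defaultdict


def build_query_index(query, word_size, neighbors=None):
    """Map each word to the query positions it should be matched against.

    neighbors, if given, is a function returning the synonyms of a word
    (including the word itself); without it only exact words are indexed.
    """
    index = defaultdict(list)
    for pos in range(len(query) - word_size + 1):
        word = query[pos:pos + word_size]
        for synonym in (neighbors(word) if neighbors else [word]):
            index[synonym].append(pos)
    return index


def find_word_hits(index, subject, word_size):
    """Return (query_pos, subject_pos) for every word hit in subject."""
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        for q_pos in index.get(word, ()):
            hits.append((q_pos, s_pos))
    return hits
```

For example, indexing the query "QQHELLOWW" with a word size of 3 and scanning "RRRHELLOTT" yields the hits (2, 3), (3, 4), and (4, 5), all on the same diagonal. Each hit is a candidate starting point for extension.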

Scoring the Segment Pair Associated With Each Word Match

If a word or synonym from a searched sequence occurs in the query, BLAST extends the alignment where the word pair occurs in both directions to find the highest score for the segment pair that contains that word pair.

To minimize the time spent extending the alignment, BLAST needs to decide if the best score has been found without scoring all the symbol pairs on the diagonal. In practice, when the segment pair score falls to zero or decreases by an amount equal to or greater than a parameter X, extension is terminated. If the score ever reaches S, the whole diagonal is scored. The segment extension drop-off score X is normally calculated for the user to be an amount equal to 10 bits of information representing a fall-off in the significance of the score by a factor of about 1,000. For nucleotide database searches with nucleotide queries, X is calculated to be equivalent to a drop-off of 20 bits of information.
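The termination rule can be sketched as follows. This Python example extends a word hit in both directions and stops in each direction when the running score falls to zero or drops by X or more below the best score seen so far. It is a simplified reading of the paragraph above, not BLAST's code: it omits the rule about scoring the whole diagonal once S is reached, and the function names and toy scoring values are hypothetical.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy scoring scheme standing in for a real scoring matrix."""
    return match if a == b else mismatch


def extend_one_way(query, subject, q, s, step, start_score, x_drop, score):
    """Extend from (q, s) in direction step; return (best gain, its length)."""
    total = best = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        total += score(query[q], subject[s])
        length += 1
        if total > best:
            best, best_length = total, length
        elif start_score + total <= 0 or best - total >= x_drop:
            break  # score fell to zero, or dropped by X from the best
        q += step
        s += step
    return best, best_length


def extend_hit(query, subject, q_pos, s_pos, word_size, x_drop, score=pair_score):
    """Return (score, query_start, subject_start, length) for one word hit."""
    word = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(word_size))
    right, r_len = extend_one_way(query, subject, q_pos + word_size,
                                  s_pos + word_size, 1, word, x_drop, score)
    left, l_len = extend_one_way(query, subject, q_pos - 1, s_pos - 1,
                                 -1, word, x_drop, score)
    return (word + left + right, q_pos - l_len, s_pos - l_len,
            word_size + l_len + r_len)
```

With query "AAABBAAAA", subject "AAACCAAAA", and a hit at (0, 0) of word size 3, a drop-off of 10 lets the extension cross the two mismatches and returns (27, 0, 0, 9). A drop-off of 8 stops at the second mismatch and returns only the word itself, (15, 0, 0, 3). A smaller X saves time but can miss the better segment.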

HSPs and MSPs Revisited

If the score for the segment pair is higher than some threshold, the score and end points of the segment pair are stored. These high-scoring segment pairs are referred to as HSPs. The highest scoring segment pair for the whole pairwise comparison is referred to as the maximal-scoring segment pair or MSP.

Determining What to Report

When each pairwise comparison is complete, all the HSPs are analyzed to see if some could be combined into an alignment (that is, they do not overlap in either dimension by more than a certain percentage of their lengths). If the MSP score or the best combined HSP score is above a cutoff score, then the sequence is listed and the high-scoring segment pairs in it are displayed as alignments in the output. The MSP score is reported next to each sequence in the output list in the column labeled High Score.

Displaying the Best Comparisons and Their Segment Pairs

When the search is complete, BLAST sorts the pairwise comparisons with MSPs above the cutoff by probability and makes a list with one sequence per line showing the most significant hits. You must set a list size to limit the number of sequences displayed in this list. Following the list are alignments of the high-scoring segment pairs from the 100 highest-scoring sequences displayed in the list. You can display more alignments of segment pairs with the optional parameter -SEGments if you can tolerate the large amount of output involved.

CONSIDERATIONS

Bit Scores and the Size of the Search

Altschul has shown that for sequences that have diverged by a certain amount, there is an informativeness (or ability to discriminate between chance scores and significant scores) associated with each residue pair in the segment pair. This informativeness is the amount of information obtainable from each residue pair in a real alignment that can be used to distinguish the real alignment from a random one. This informativeness can be expressed in bits. The sum of the information available from each residue pair in a segment is the segment pair's score in bits. Such scores are intuitively understandable as the significance of a segment pair score. To express such a score as a fraction, divide 1 by 2 raised to the number of bits in the score. For example, if a segment pair has a bit score of 16, then the appropriate fraction (1/2^16 = 1/65,536) would suggest that you should see a score this high by chance about once for every 65,000 independent segment pairs you examine.

For nucleotide sequences that have not diverged, there should be an informativeness of about 2 bits per nucleotide pair. For protein sequences that have not diverged, the informativeness should be slightly over 4 bits per amino acid pair. (The informativeness per pair goes down as the sequences diverge and a segment pair score is maximally informative only when a scoring matrix appropriate to the extent of divergence between the sequences is used to calculate the score.)

The bit scores are absolute, but the expectation of finding any particular score depends on the size of the search space. The number of places where a segment pair might originate is proportional to the product of the length of the query times the sum of the lengths of all the sequences searched. This product is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is approximately 0.13.

For a query sequence of length 300 aa searching a database of 12 million residues, the size of the search space would be 300 x 12,000,000 x 0.13 or 468,000,000. For a search this size, a score that only occurs once in every 65,000 potential segment pairs (that is, with a bit score of 16) would be expected to occur about 7,200 times by chance alone.
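The arithmetic in this example can be reproduced in a few lines of Python. The function names are hypothetical, and K = 0.13 is the approximate protein-protein value quoted above rather than a value computed for any particular search.

```python
def search_space(query_length, database_letters, k=0.13):
    """Approximate search space: K x query length x total database length."""
    return k * query_length * database_letters


def expected_chance_hits(bit_score, space):
    """Number of segment pairs expected to reach bit_score by chance."""
    return space / 2 ** bit_score


space = search_space(300, 12_000_000)
print(round(space))                            # size of the search space
print(round(expected_chance_hits(16, space)))  # chance hits at 16 bits
```

This gives a search space of 468,000,000 and about 7,100 chance hits at 16 bits when the exact value 2^16 = 65,536 is used; the figure of about 7,200 quoted above comes from rounding 65,536 down to 65,000. Each additional bit in the score halves the expected number of chance hits.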

If the database being searched is highly redundant (as it might be if it contained several hundred homologous cytochromes), then the size of the search space calculated by these methods will overestimate the size of the real search space.

Using BLAST for Nucleotide Searches

By default BLAST ignores HSPs that do not contain a perfect match of at least 11 nucleotides (22 bits). This is stringent enough that many obviously significant relationships are not found.

The detection of distant relationships between proteins is easier than between nucleotide sequences, even if the nucleotide sequences have to be translated in all six frames to make the amino acid comparison. To give a rough magnitude to this generalization, it is possible to detect similarities in proteins that have diverged by 250 substitutions per 100 residues (250 PAM units) while nucleotide similarities become obscure at distances much greater than 50 substitutions per 100 nucleotides (50 DNA PAM units). Nonetheless, when the nucleotide sequences being compared do not code for proteins, you have no alternative but to search at the nucleotide level. We suggest you consider either reducing the word size for BLAST from its default of 11 to perhaps six or seven, or using the FastA program when looking for nucleotide homologs.

Sensitivity

Quoting from the man pages for BLAST, "At some point accumulated mutations and errors completely obscure the presence of a relationship between two sequences; the BLAST programs' focus on ungapped alignments will sometimes cause this point to be reached sooner than for other alignment methods."

SUGGESTIONS

List Size Limit

A list size that is too small to display all the significant hits is a common problem. Both the screen and the output file will show a warning giving the number of significant hits that are not shown in the list. To see the unlisted hits you must run the search again with the list size limit set high enough to include everything significant. The output can get very large if you set the list size to 1,000, which is the maximum. If you cannot display everything of interest with a list size limit of 1,000, see the topic FILTERING OUT LOW COMPLEXITY SEQUENCES below.

Segment Pair Alignment Limit

BLAST displays alignments of segment pairs from the top 100 sequences in the list. You can adjust this limit with the -SEGments parameter.

Sensitivity

For nucleotide sequence comparisons, the word size defaults to 11 -- no segment pair can be scored unless it contains a perfect match of at least 11 consecutive bases. If sensitivity is much more important than selectivity, and your search cannot be done at the amino acid level, you might want to reduce the word size to seven or even six. NCBI has stated that there is only a marginal increase in sensitivity for settings smaller than this.
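To see why the default word size can hide an obvious relationship, here is a minimal Python sketch. It only tests the word-match precondition described above on two sequences that are already aligned without gaps; it is not how BLAST itself locates words, and the function names and test sequences are made up for illustration.

```python
def longest_exact_run(seq_a, seq_b):
    """Length of the longest run of consecutive identical positions
    in two ungapped, equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    best = run = 0
    for a, b in zip(seq_a, seq_b):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def has_word_match(seq_a, seq_b, word_size=11):
    """True if the pair contains a perfect match of at least word_size bases."""
    return longest_exact_run(seq_a, seq_b) >= word_size


# Illustrative data: a 40-base sequence and a copy with every tenth base
# changed (90 percent identity, but no exact run longer than nine bases).
SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}
original = "ACGT" * 10
diverged = "".join(SWAP[b] if i % 10 == 9 else b for i, b in enumerate(original))

print(longest_exact_run(original, diverged))    # 9
print(has_word_match(original, diverged, 11))   # False: missed at the default
print(has_word_match(original, diverged, 7))    # True: found with a word size of 7
```

In this toy case the two sequences are 90 percent identical, yet the pair could never be scored with a word size of 11 because no run of identical bases is longer than nine.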

BLAST uses a word size of three for proteins, which is appropriate for a wide range of searches, but you can adjust the synonym threshold T downwards to increase sensitivity at the price of speed. Read the OPTIONAL PARAMETERS topic for the parameters -SYNonym and -EXPect.

Batch Queue

Using BLAST to search a large local database can take a long time. You may want to run comparisons either remotely or in the batch queue. You can specify that this program run at a later time in the batch queue by using the command-line parameter -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide.

Relationship to FastA

For protein database searches, BLAST and FastA have similar sensitivity, although the different algorithms employed make it possible, at least in principle, for FastA to find things that BLAST misses and vice versa. For nucleotide database searches with nucleotide query sequences, FastA may be more sensitive, since by default BLAST ignores segment pairs that do not contain a perfect match of at least 11 adjacent nucleotides (22 bits). This default misses many obviously significant relationships. If you are looking for nucleotide sequence homologs that do not code for proteins (that is, if your search cannot be done at the amino acid level), we suggest you either reduce the word size to seven or use the FastA program instead of BLAST.

Search Orientation

BLAST searches both strands of nucleic acid query sequences and database entries. Read the OPTIONAL PARAMETERS topic for the parameters -TOPstrand, -BOTtomstrand, -DBTOPstrand, and -DBBOTtomstrand for information about how to limit the search.
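The top strand is the sequence as it appears in the file, and the bottom strand is its reverse complement. As a minimal Python sketch of that relationship (the function name is illustrative, and only the four unambiguous bases are handled; other symbols pass through unchanged):

```python
# Complement each base, then reverse the result to read the other strand 5' to 3'.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def bottom_strand(top_strand):
    """Return the reverse complement of a nucleotide sequence."""
    return top_strand.translate(COMPLEMENT)[::-1]


print(bottom_strand("ATGCC"))   # GGCAT
```

Taking the bottom strand of the bottom strand returns the original sequence, which is a quick check that the operation is its own inverse.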

FILTERING OUT LOW COMPLEXITY SEQUENCES

Short repeats and low complexity sequences, such as glutamine-rich regions, confound most database searching methods. For BLAST, the random model against which the significance of segment pair scores is evaluated assumes that at each position, each residue has a probability of occurring which is proportional to its composition in the database as a whole. Low complexity or highly repetitive sequences are inconsistent with this assumption. Suspect this problem when the number of significant segment pair scores is much higher than you would expect: either the output is enormous, or the output size limits cut off your output long before all the segments are displayed.

You can filter out repeats and low complexity regions from protein query sequences by adding -FILter=xs to the command line. The x filter (Claverie and States, Computers Chem. 17; 191-201 (1993)) masks short repeats; the s filter (Wootton and Federhen, Computers Chem. 17; 149-163 (1993)) masks low complexity sequences.

When you run BLAST with filtering on, masked regions are excluded from the search. These regions are replaced with X's in the output to let you identify the regions that were excluded. Here is the query sequence from the example session aligned to a filtered copy of itself to show which parts of the original sequence were filtered out:


  1 MAAKIFCLIMXXXXXXXXXXXXIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
  1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60

 61 AIAAGIXXXXXXXXXXXXXXXXXXXXXXXXXXNIRXXXXXXXXXXXXXXYSQQQQFLPFN 120
 61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120

121 QXXXXXXXXXXXXXXXXPFSQLAAAYPRQFLPFNQLAALNSHAYVXXXXXXPFSQLAAVS 180
121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180

181 PAAFLTQQQLLPFYLHTAPNVGTXXXXXXXXXXXXXXXTNPAAFYQQPIIGGALF 235
181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235
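If you want to list the masked regions rather than read them off the alignment by eye, a short script can find the runs of X's. The Python sketch below is only an illustration of reading the filtered sequence; it does not perform the filtering itself, the function names are made up, and a genuine X (unknown residue) in a query would be reported as masked too.

```python
import re


def masked_regions(filtered):
    """Return 1-based, inclusive (start, end) pairs for each run of X's."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"X+", filtered)]


def masked_fraction(filtered):
    """Fraction of positions that were replaced with X."""
    return filtered.count("X") / len(filtered)


# Residues 1-60 of the filtered query shown above.
filtered_line = "MAAKIFCLIM" + "X" * 12 + "IFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ"

print(masked_regions(filtered_line))    # [(11, 22)]
print(masked_fraction(filtered_line))   # one fifth of these 60 residues
```

The reported region, residues 11 through 22, corresponds to LLGLSASAATAS in the unfiltered sequence.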

AMINO ACID SCORING

BLAST normally uses the BLOSUM62 scoring matrix from Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)) whenever the sequences being compared are proteins (including cases where nucleotide databases or query sequences are translated into protein sequences before comparison). You can use the more traditional PAM40, PAM120 and PAM250 scoring matrices with command-line parameters like -MATrix=PAM40 etc. Each matrix is most sensitive for finding homologs at the corresponding PAM distance. The seminal paper on this subject is Stephen Altschul's "Amino acid substitution matrices from an information theoretic perspective" (J. Mol. Biol. 219; 555-565 (1991)). If you are new to this literature, an easier place to start reading might be Altschul et al., "Issues in searching molecular sequence databases" (Nature Genetics, 6; 119-129 (1994)).

Here are the values in the default scoring matrix, BLOSUM62. The values for perfect amino acid matches lie on the main diagonal.


   BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
   Blocks Database = /data/blocks5.0/blocks.dat
   Cluster Percentage: >= 62
   Entropy =   0.6979, Expected =  -0.5209

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

NUCLEOTIDE SCORING

There is no external scoring matrix for nucleotide-nucleotide searches (that is, searches where both the query and the database are nucleotide sequences and where -TRANSlate is not on the command line). But as is explained below, you can specify a nucleotide-nucleotide scoring matrix for any PAM distance by changing the match/mismatch ratio. The default ratio is +5/-4 (equivalent to PAM 47). You can change the ratio by placing a parameter like -MATCH=4 on the command line to specify a new value for the numerator.

The explanation of nucleotide substitution scoring below is derived from a personal communication from Dr. Altschul, elaborating on the information provided in States, Gish, and Altschul, METHODS: A Companion to Methods in Enzymology, 3; 66-70 (1991).

If all nucleotides were equally frequent and all substitutions were equally likely, it would take only two scores, match and mismatch, to represent a complete scoring matrix at any PAM distance. Fixing the score for a mismatch at -4 lets you select PAM matrices simply by setting the match score as summarized in the following table.


   Nucleotide Scoring Matrix Properties for Mismatch of -4

  Match     PAM      Percent    Bits/Unit   Average information
 Setting  distance  conserved     score     per position (bits)

    1       0.3       99.7        1.992            1.97
    2       5.3       94.9        0.968            1.63
    3      16.0       85.6        0.595            1.18
    4      30.2       75.0        0.396            0.79
    5      47.0       65.1        0.275            0.51
    6      65.0       56.5        0.196            0.32
    7      86.0       48.8        0.138            0.19
    8     109.0       42.5        0.096            0.11
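
The columns of this table can be approximated from the match and mismatch scores alone. The Python sketch below is an assumption about how such values arise, not a description of how the table was actually produced: it takes all four bases as equally frequent, solves for the scale factor that turns the scores into log-odds, and converts the implied percent conserved to a PAM distance with a simple equal-rates substitution model. The function names are illustrative, and the results agree with the table only approximately.

```python
import math

MISMATCH = -4  # fixed mismatch score, as in the table


def scale_lambda(match, mismatch=MISMATCH):
    """Solve 0.25*exp(l*match) + 0.75*exp(l*mismatch) = 1 for l > 0
    (nats per score unit) by bisection."""
    def f(lam):
        return 0.25 * math.exp(lam * match) + 0.75 * math.exp(lam * mismatch) - 1.0

    lo, hi = 1e-6, 5.0  # f(lo) < 0 < f(hi) for match settings 1 through 8
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def matrix_properties(match, mismatch=MISMATCH):
    """Return (PAM distance, percent conserved, bits per score unit,
    average information per position in bits)."""
    lam = scale_lambda(match, mismatch)
    conserved = 0.25 * math.exp(lam * match)         # implied fraction of matches
    pam = -75.0 * math.log((4 * conserved - 1) / 3)  # equal-rates model (assumed)
    bits_per_unit = lam / math.log(2)
    info = (conserved * math.log2(conserved / 0.25)
            + (1 - conserved) * math.log2((1 - conserved) / 0.75))
    return pam, 100 * conserved, bits_per_unit, info


for match in range(1, 9):
    pam, pct, bits, info = matrix_properties(match)
    print(f"{match:5d} {pam:9.1f} {pct:9.1f} {bits:9.3f} {info:9.2f}")
```

For the default match score of 5, this sketch gives a PAM distance near 47, about 65 percent conservation, and about half a bit of information per position.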

The default match score of +5 corresponds to a PAM distance of 47. Such a scoring matrix would be maximally informative when used to compare sequences that, when back mutations are considered, are about 65 percent conserved. At this distance, there is about half a bit of information for each position in an alignment of a homologous segment pair. Searching a database of 64,000,000 nucleotides with a 1,000-base query would require about 36 bits of information to achieve significance. A segment of this significance (36 bits) at this level of divergence would have to be about 72 nucleotides long. By varying the match value as shown above, you can, in effect, select other PAM matrices, which would be more efficient for other levels of sequence conservation. Running BLASTN three different times with match settings of 3, 5, and 7 (PAM matrices 16, 47, and 86), would give at least 90 percent efficiency over the range of PAM distances 1 to 108.
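The arithmetic in this paragraph can be checked with a few lines of Python. This is a sketch under the assumption that the number of bits needed is roughly the base-2 logarithm of the search space (query length times database length); the function names are illustrative.

```python
import math


def bits_for_significance(query_len, db_len):
    """Approximate bits needed to stand out in a search space of
    query_len * db_len possible alignment start points."""
    return math.log2(query_len * db_len)


def min_segment_length(bits_needed, bits_per_position):
    """Shortest segment that can accumulate bits_needed at the given
    average information per aligned position."""
    return math.ceil(bits_needed / bits_per_position)


bits = bits_for_significance(1000, 64_000_000)
print(round(bits, 1))                  # 35.9, that is, about 36 bits
print(min_segment_length(36, 0.5))     # 72 nucleotides at half a bit per position
print(min_segment_length(36, 1.18))    # 31 nucleotides at a match setting of 3
```

The last line uses the table's value for a match setting of 3 (1.18 bits per position) to show how much shorter a significant segment can be when the sequences are well conserved.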

ALTERNATIVE GENETIC CODES

BLAST normally uses the standard genetic code if either the query or the database sequences require translation. If your query comes from a system where this genetic code is inappropriate, you can select any of these alternative codes by number:


     1 Standard or Universal
     2 Vertebrate Mitochondrial
     3 Yeast Mitochondrial
     4 Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma
     5 Invertebrate Mitochondrial
     6 Ciliate Macronuclear
     7 Do not use this index
     8 Do not use this index
     9 Echinoderm Mitochondrial
    10 Alternative Ciliate Macronuclear
    11 Eubacterial
    12 Alternative Yeast
    13 Ascidian Mitochondrial
    14 Flatworm Mitochondrial

You can specify the genetic code for the query and the database independently. Use the command-line parameter -TRANSlate=2 to tell BLAST to use the vertebrate mitochondrial code to translate the query. The parameter -DBTRANSlate=3 tells BLAST to use the yeast mitochondrial code to translate the database. (Note that most of the genes in GenBank will be translated inappropriately if you select a nonstandard genetic code for database translation.)
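To illustrate why the choice of genetic code matters, the Python sketch below translates the same nucleotide sequence with codes 1 and 2 from the list above. The two tables are the standard and vertebrate mitochondrial codes as they are commonly tabulated (codons in TCAG order); how BLAST stores its codes internally is not described here, and the function names are illustrative.

```python
BASES = "TCAG"

# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
CODE_TABLES = {
    # 1: Standard or Universal
    1: "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR" "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG",
    # 2: Vertebrate Mitochondrial (TGA = W, ATA = M, AGA and AGG = stop)
    2: "FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR" "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG",
}


def codon_map(code_number):
    """Build a codon -> amino acid dictionary for one genetic code."""
    amino_acids = CODE_TABLES[code_number]
    codons = (a + b + c for a in BASES for b in BASES for c in BASES)
    return dict(zip(codons, amino_acids))


def translate(dna, code_number=1):
    """Translate the first reading frame; unrecognized codons become X."""
    table = codon_map(code_number)
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


print(translate("ATGTGAAGA", 1))   # M*R
print(translate("ATGTGAAGA", 2))   # MW*
```

The same nine bases give a premature stop after the first residue under the standard code but read through TGA as tryptophan under the vertebrate mitochondrial code, which is why an inappropriate code can hide or fragment a real similarity.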

The numbering for each of these codes has changed from version 1.3 of BLAST (used in release 8.0 of the Wisconsin Package) and you can no longer use the numbers zero, seven, and eight!

NETWORK CONSIDERATIONS

There are a number of possible problems with client/server applications running over the Internet. You should try to find out if you are being charged for network communications and you should certainly worry about the security and integrity of your sequences. There is always a possibility that a server will become overloaded and that your search will take much longer than normal or that your output will be lost altogether because of a network or server computer glitch. If you are working in Europe there may be services available through EMBnet that are more appropriate or robust than the NCBI BLAST server. Nonetheless, we continue to be impressed with the speed and reliability of the NCBI BLAST server.

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % blast [-INfile1=]SW:Zea2_Maize  -Default

Prompted Parameters:

[-INfile2=]swissprot     database to search
-EXPect=10.0             ignores scores that would occur by chance
                           more than 10 times
-LIStsize=250            maximum number of sequences listed in the output
[-OUTfile=]zea2.blastp   output file name

Local Data Files:

[-DATa1=BLAST.rDBs]      list of available remote databases
[-DATa2=BLAST.lDBs]      list of available local databases
[-DATa3=BLAST.sDBs]      list of available site-specific databases

Optional Parameters:

-TOPstrand                   searches only the top strand of nuc. query
-BOTtomstrand                searches only the bottom strand of nuc. query
-DBTOPstrand                 translates only top strand of nuc. database
-DBBOTtomstrand              translates only bottom strand of nuc. database
-FILter=xs                   filters repeats and low complexity segments
                               out of protein query sequences
-MATrix=PAM120[,PAM250...]   specifies one [or more] protein scoring matrix
-TBLASTX                     if query and database are both nucleotide,
                               translates both and does protein comparisons
-TRANSlate=1                 genetic code for translating query
-DBTRANSlate=1               genetic code for translating database
-REMoteonly                  searches only remote databases (simplifies menu)
-LOCalonly                   searches only local databases   (   "        " )
-DBNucleotideonly            searches only nucleic databases (   "        " )
-DBProteinonly               searches only protein databases (   "        " )
-SERver=cruncher.nlm.nih.gov runs remote search on this particular host
-WORdsize=7                  sets word size (primarily for nuc-nuc searches)
-SYNonym=10                  sets minimum score for equivalent words
-MATCH=5                     scoring matrix value for nucleotide matches
-SEGments=100                displays segment pairs from top 100 sequences
-HIStogram                   shows histogram of hits versus expectation level
-APPend="string"             appends "string" to pass-through command line
-BATch                       submits program to batch queue

CITING BLAST

If you use BLAST remotely to search the NCBI databases, NCBI requests that you please mention that the computation was performed at NCBI using the BLAST network service. The original paper describing BLAST is Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. (1990). Basic local alignment search tool. J. Mol. Biol. 215; 403-410. This paper remains a good way to learn more about BLAST.

ACKNOWLEDGEMENT

Both the local and remote versions of BLAST were written by Warren Gish of the National Center for Biotechnology Information (NCBI) in collaboration with Stephen Altschul, Webb Miller, Eugene Myers, David Lipman, and David States. The "client" program that communicates between GCG and the BLAST server at NCBI was written by Mike Cherry while he was working at the Massachusetts General Hospital. The public domain programs for BLAST were modified by Scott Rose in collaboration with NCBI for distribution with Version 8 of the Wisconsin Package. The document you are now reading was written by John Devereux. We are extremely grateful to Stephen Altschul and Warren Gish for their careful and original work on BLAST and for their critical comments on GCG's BLAST documentation and we are very grateful to NCBI for making these programs and services available to the molecular biology community.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

BLAST reads three files, blast.rdbs (remote databases), blast.ldbs (local databases), and blast.sdbs (site-specific databases). These together list the search sets in the menu. We update blast.rdbs and blast.ldbs when we send database updates to your institution. If you have sequences of local interest that you would like to search with BLAST, read the documentation for GCGToBLAST to see how to create local BLAST-searchable databases, then fetch the file blast.sdbs, and add the name of the local search set so that it appears in the menu.

OPTIONAL PARAMETERS

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Following some of the optional parameters described below is a letter or short expression in parentheses. These are the names of the corresponding parameters at the bottom of your BLAST output.

-TOPstrand

searches only with the top strand of your nucleotide query sequence. The top strand is the sequence represented in the file.

-BOTtomstrand

searches only with the bottom strand of your nucleotide query sequence. The bottom strand is the reverse complement of the sequence represented in the file.

-DBTOPstrand

In searches that translate nucleotide database sequences (TBLASTX, TBLASTN), this parameter restricts the translation to the top strand of each database sequence. (The top strand is the sequence represented in the database.)

-DBBOTtomstrand

Like -DBTOPstrand, this parameter restricts translated nucleotide database searches to the bottom strand of each database sequence. (The bottom strand is the reverse-complement of the sequence represented in the database.)

-FILter=xs

filters out repeats and low complexity regions from protein query sequences. The x filter removes short repeats; the s filter removes low complexity sequences (Wootton and Federhen, Computers Chem. 17(2); 149-163 (1993)). You can use either filter alone or use both filters in any order.

When you run BLAST with filtering on, masked regions are excluded from the search. These regions are replaced with X's in the output to let you identify the regions that were excluded.

-MATrix=PAM120,PAM250...

BLAST normally uses the BLOSUM62 amino acid substitution matrix from Henikoff and Henikoff for protein sequence comparisons (including all cases where nucleotide database or query sequences are translated before comparison). You can select the more traditional PAM40, PAM120, and PAM250 scoring matrices with this command-line parameter.

You can use more than one protein scoring matrix when searching a protein sequence database with a protein query. If you use this parameter, each segment is scored with every matrix, and the score from the matrix which scores highest is retained. For more information, see the topic AMINO ACID SCORING.

-MATCH=4 (M)

You can change the match/mismatch ratio for nucleotide pair scoring (which defaults to +5/-4) by setting the numerator of this ratio to a value other than 5 with this command-line parameter. See the topic NUCLEOTIDE SCORING.

-TRANSlate=2

When BLAST must translate a nucleotide query sequence, it uses the standard ("universal") genetic code. If your query comes from a system where this is inappropriate, you can select any of the alternative codes listed under the topic ALTERNATIVE GENETIC CODES.

-DBTRANSlate=2

If BLAST has to translate each nucleotide sequence in the database, it will use the standard genetic code to do the translation. If you are searching for proteins from a system where this code is inappropriate you might select a code listed under the topic ALTERNATIVE GENETIC CODES to search for homologs from that system. Note that most of the genes in the nucleotide databases will be translated inappropriately if you select a nonstandard genetic code. See -TRANSlate.

-TBLASTX

You can use this parameter on the command line when searching a nucleotide sequence database with a nucleotide query sequence. BLAST will then translate the query and every sequence in the database and examine all pairwise combinations to find similarities at the amino acid level.

Because such doubly translated searches require a lot of computing, NCBI currently restricts such searches to the STS, EST, Alu, and GSS databases.

The search set menu can scroll off your screen if it contains all of the searchable databases at NCBI as well as those supported locally (on your computer). The next four parameters can reduce the size of that menu.

-REMoteonly

confines the menu to those search sets that are available from remote BLAST servers.

-LOCalonly

confines the menu to those search sets that are available locally.

-DBNucleotideonly

confines the menu to search sets containing nucleotide sequences.

-DBProteinonly

confines the menu to search sets containing protein sequences.

-SERver=cruncher.nlm.nih.gov

If BLAST is able to search remotely, it finds a list of the servers available for that purpose in a data file called GenRunData:blast.servers. This file lists the servers in the order in which they will be tried. If a server fails to respond, BLAST will try the next one on the list. If you know the name of a particular server, this parameter lets you specify it on the command line, causing BLAST to ignore the servers named in the file. (Names of servers look like addresses without the user name and the @ symbol.)

The next three parameters allow advanced users of BLAST to adjust the sensitivity or speed of the search. You should be familiar with the algorithm before trying to use these parameters.

-EXPect=10 (E)

This parameter, for which there is a prompt if you don't set it on the command line, lets you increase the number of hits in your output with scores that would be expected to have occurred by chance alone. Your setting for the number of random hits you will tolerate is used to calculate a cutoff score (S) below which MSPs or multiple HSPs (S2) are ignored. BLAST calculates this cutoff score for you. There is nothing to prevent many biologically significant but statistically insignificant segment pairs from being screened out, so you may sometimes want to increase this parameter in order to have an opportunity to see them.
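As a rough guide to how much a change in -EXPect moves the cutoff, the Python sketch below assumes the usual relation in which the expected number of chance hits falls by half for every additional bit of score. That relation is not stated in this topic, so treat it as an assumption; the function name is illustrative, and BLAST calculates the actual cutoff score itself.

```python
import math


def cutoff_shift_bits(old_expect, new_expect):
    """Change in the cutoff score, in bits, when the expectation setting
    changes from old_expect to new_expect (negative means a lower cutoff)."""
    return -math.log2(new_expect / old_expect)


# Raising the expectation from the default of 10 to 1,000 lowers the
# cutoff by a little under seven bits.
print(round(cutoff_shift_bits(10, 1000), 2))   # -6.64
```

Under this assumption, each tenfold increase in the expectation lowers the cutoff by about 3.3 bits, whatever the size of the query or the database.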

-WORdsize=7 (W)

To score a nucleotide-nucleotide segment pair, BLAST normally must first see a word match of 11 consecutive bases. If you set the word size lower than 11, BLAST will find segment pairs that were skipped when the database was searched with the default value. NCBI asks that you not go below six or seven when searching the NCBI (remote) databases as there is a substantial loss of efficiency with each reduction in word size and the increase in sensitivity below this level is marginal at best.

The default word size of three for proteins seems to be appropriate for a wide range of protein search requirements and NCBI suggests that you not change it. Use the -SYNonym parameter if you are more interested in sensitivity than selectivity.

-SYNonym=10 (T)

For protein comparisons, BLAST only scores HSPs when two identical or very similar words of length three are seen. The score at or above which two words are considered equivalent is set by this parameter. The value in question is the sum of the scoring matrix values when any two words of the same length are aligned. Satisfactory values for -SYNonym are calculated automatically by the program. The value is calculated by BLAST for each run and is shown under the parameters section at the end of your output as T, which for the example OUTPUT above is 10.

SUGGESTION: The value of T is a function of the scoring matrix used, so if you use a matrix other than BLOSUM62, you should run BLAST once with the default settings to see what value it calculates for T and then adjust it downwards slightly from that value.
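To make the word comparison concrete, the Python sketch below sums BLOSUM62 values for two aligned words of length three and compares the sum with T. Only the handful of matrix entries needed for the example are included (copied from the matrix under AMINO ACID SCORING); the function names and the example words are illustrative, and this is not the procedure BLAST uses to build its word lists.

```python
# A few BLOSUM62 entries, keyed by the residue pair in alphabetical order.
BLOSUM62_SUBSET = {
    ("L", "L"): 4,
    ("I", "L"): 2,
    ("G", "G"): 6,
    ("L", "S"): -2,
    ("A", "G"): 0,
}


def pair_score(a, b):
    """Look up the score for one aligned residue pair (the matrix is symmetric)."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def word_score(word_a, word_b):
    """Sum of the scoring matrix values for two aligned words of equal length."""
    if len(word_a) != len(word_b):
        raise ValueError("words must be the same length")
    return sum(pair_score(a, b) for a, b in zip(word_a, word_b))


def is_equivalent(word_a, word_b, threshold=10):
    """True if the two words score at or above the synonym threshold T."""
    return word_score(word_a, word_b) >= threshold


print(word_score("LLG", "LLG"), is_equivalent("LLG", "LLG"))   # 14 True
print(word_score("LLG", "ILG"), is_equivalent("LLG", "ILG"))   # 12 True
print(word_score("LLG", "LSA"), is_equivalent("LLG", "LSA"))   # 2 False
```

Lowering the threshold from 10 to, say, 2 would make the third pair equivalent as well, which shows how a smaller T admits more word pairs and so increases both sensitivity and search time.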

-LIStsize=250 (V=250)

-SEGments=100 (B=100)

By default, BLAST lists no more than 250 pairwise comparisons in your output file even if more than 250 sequences had scores above the cutoff score. The list is sorted in order of increasing probability, that is, with the most significant sequences first. Then BLAST displays the alignments of high-scoring segment pairs from the best 100 sequences in the list. These list-size and segment-display (alignment) limits are often too small to display everything significant. A warning appears in your output below the list and the alignments of segments if these parameters were too low to include everything found. This truncation of your output is by intention, since the output from BLAST is usually very large and the first level of inference from most searches can be made from the most significant hits. Nonetheless, you can adjust either of these limits from 0 to 1,000.

-HIStogram (H=1)

This parameter will make BLAST print a histogram near the beginning of your output that shows the number of hits found at each level of expectation.

-APPend="string"

The GCG implementation of BLAST is what is known as a shell program. After collecting your input parameters, the shell calls either the original BLAST programs or the NCBI BLAST server. If you are familiar with the interface to the BLAST programs as they were originally written, you can pass command-line parameters to them directly using this parameter. Please call us if there are additional parameters you want to use with BLAST that you would like to look more like native GCG parameters.

You can read the current version of the BLAST documentation on the World Wide Web at http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html.

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: November 18, 1996 13:05 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc. a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com