Consensus calculates a consensus sequence for a set of pre-aligned short nucleic acid sequences by tabulating the percent of G, A, T, and C for each position in the set. FitConsensus uses the Consensus output table as a probe to search for the best examples of the derived consensus in other nucleotide sequences.
Consensus reads a file of aligned nucleotide sequences for which you want to know the consensus pattern. Consensus constructs a consensus table with the percent of each nucleotide at each position. The total number of nucleotides contributing to each position in the sequence shown in the table is also reported. Below the table, Consensus writes the least ambiguous expression of the consensus sequence for a confidence level that you request.
Here is a session using Consensus to find the consensus of the intervening sequence acceptor splice sites from the file acceptor.dat:
    % consensus

     CONSENSUS on sequences in what file ?  acceptor.dat

     Find consensus to what percent certainty (* 75.0 *) ?

     What should I call the output file (* acceptor.csn *) ?

     ................

    %
    CONSENSUS of: acceptor.dat

    IVS Acceptor Splice Site Sequences from Stephen Mount
    NAR 10(2); 459-472 figure 1 page 460

    Acceptor

    *****

    %G         15   22   10   10   10    6    7    9    7    5    5   24    1    0
    %A         15   10   10   15    6   15   11   19   12    3   10   25    4  100
    %T         52   44   50   54   60   49   48   45   45   57   58   30   31    0
    %C         18   25   30   21   24   30   34   28   36   35   27   21   64    0
    Total     114  114  115  127  127  127  128  128  128  130  131  131  131  131

    %G        100   52   24   19
    %A          0   22   17   20
    %T          0    8   37   29
    %C          0   18   22   32
    Total     131  131  131  131

    *****

    CONSENSUS sequence to a certainty level of 75.0 percent at each position:

    Length: 18  July 27, 1994 10:06  Type: N  Check: 3343  ..

           1  BBYHYYYHYY YDYAGVBH
Consensus does not use one of the standard GCG file formats as its input file, but instead requires a file in a specific format that you must create with a text editor. This file has a heading of indefinite length, followed by a line containing two adjacent periods (..). The sequences follow with one sequence per line, each sequence starting in the first column. There must be no space characters within the sequence. Gaps must be represented with periods. All sequences must be the same length, up to a maximum of 130 bases. Consensus assumes that the sequences are already in alignment.
Here is part of the input file for the example above:
    IVS Acceptor Splice Site Sequences from Stephen Mount
    NAR 10(2); 459-472 figure 1 page 460

    Acceptor / ..

    .........AAATAGGAT
    .........TTGTAGGTG
    ..........TGTAGGTG
    TTTATTTATTTCAAGATT
    //////////////////
    GTCACTTGTCACTAGGTA
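The input rules described above (a free-form heading, a divider line containing two adjacent periods, then one sequence per line) can be checked before running the program. The following Python sketch is illustrative only and is not part of the Wisconsin Package. The function name `read_consensus_input` is hypothetical, and it assumes that the first line containing `..` is the divider and that blank lines after the divider can be skipped.

```python
def read_consensus_input(text, max_len=130):
    """Split Consensus-style input text into (heading_lines, sequences).

    Assumptions made for this sketch, not taken from the manual:
    the first line containing '..' ends the heading, and blank
    lines after that divider are ignored.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if ".." in line:
            break
    else:
        raise ValueError("no line with two adjacent periods (..) found")

    heading = lines[: i + 1]
    seqs = []
    for line in lines[i + 1:]:
        line = line.rstrip()
        if not line:
            continue
        # Sequences start in the first column and contain no spaces.
        if " " in line or "\t" in line:
            raise ValueError("space inside sequence line: %r" % line)
        seqs.append(line.upper())

    if not seqs:
        raise ValueError("no sequences after the divider line")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences are not all the same length")
    if len(seqs[0]) > max_len:
        raise ValueError("sequences are longer than %d bases" % max_len)
    return heading, seqs
```

A file that passes this check still has to be aligned by hand, since the check looks only at layout and not at the alignment itself.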
FitConsensus uses the file written by Consensus to search for the best places in a nucleotide sequence where the consensus table fits. The mapping programs can be run with the command-line parameter -ALL to search for all potential restriction sites in an ambiguous sequence.
ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).
Consensus makes no attempt to align the sequences in the input file, so you should be sure that they are optimally aligned before running the program. (The input file format is described above.) The ambiguous representation of the sequence may be arbitrary if there are equal numbers of observations of some nucleotides.
Consensus counts the number of G's, A's, T's, and C's in each position of the prealigned sequences. G, A, T, and C each have a value of one. The value of an ambiguous nucleotide code is divided among the nucleotides it represents: R, for instance, represents A or G and therefore contributes 0.5 to G and 0.5 to A. Periods (gaps) have no value. When the count is complete, the counts of each nucleotide at each position are totaled, normalized to 100, and rounded to the nearest integer. The normalized integers are reported as the %G, %A, etc., at each position. The total number of observations used to generate the percent figures is also shown. An observation is any IUPAC-IUB code (see Appendix III); periods do not count as observations.
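The counting procedure can be sketched in Python as follows. This is a minimal illustration, not the program's actual source. The names `IUPAC` and `tabulate` are hypothetical, and the sketch assumes that every ambiguity code is split equally among the nucleotides it represents (the text gives only the R example) and that a position with no observations is reported as all zeros.

```python
# Nucleotides represented by each IUPAC-IUB code.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def tabulate(seqs):
    """Return one (percents, total) pair per alignment position.

    percents maps G, A, T and C to integer percentages; total is the
    number of observations (non-gap symbols) at that position.
    """
    table = []
    for pos in range(len(seqs[0])):
        counts = dict.fromkeys("GATC", 0.0)
        total = 0
        for seq in seqs:
            code = seq[pos].upper()
            if code == ".":          # gaps carry no value
                continue
            bases = IUPAC[code]      # KeyError for an unknown symbol
            total += 1
            for base in bases:       # assumed: equal split, e.g. R -> 0.5 + 0.5
                counts[base] += 1.0 / len(bases)
        percents = {
            base: round(100.0 * counts[base] / total) if total else 0
            for base in "GATC"
        }
        table.append((percents, total))
    return table
```

Python's `round` sends exact halves to the nearest even integer. The text does not say how Consensus treats such ties, so a value such as 62.5 may be rounded differently by the program.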
For the certainty level that you set, Consensus writes the least ambiguous expression of the sequence in the table using the IUPAC-IUB ambiguity codes. For each column (position) in the table, the program starts with the most frequent nucleotide (G, A, T, or C) and adds successively less frequent nucleotides until the sum is equal to or greater than the certainty level. If two nucleotides have the same score, Consensus arbitrarily picks one of them to add to the consensus, so the ambiguity code reported at a position with tied scores can be misleading.
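The selection rule for a single column can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program's own code. The names `CODE_FOR` and `consensus_code` are hypothetical, and ties are broken here by the order in which the nucleotides appear in the input mapping, which is only one of the arbitrary choices the program might make.

```python
# IUPAC-IUB code for each possible set of nucleotides.
CODE_FOR = {
    frozenset("G"): "G", frozenset("A"): "A",
    frozenset("T"): "T", frozenset("C"): "C",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
    frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("ACT"): "H", frozenset("CGT"): "B",
    frozenset("ACG"): "V", frozenset("AGT"): "D",
    frozenset("ACGT"): "N",
}


def consensus_code(percents, certainty=75.0):
    """Return the least ambiguous code whose members reach the certainty level.

    percents maps G, A, T and C to their percentages at one position.
    """
    # sorted() is stable, so tied nucleotides keep their input order.
    ranked = sorted(percents.items(), key=lambda item: -item[1])
    chosen = set()
    running = 0.0
    for base, pct in ranked:
        chosen.add(base)
        running += pct
        if running >= certainty:
            break
    return CODE_FOR[frozenset(chosen)]
```

For the column in the example output with 52 %G, 22 %A, 8 %T, and 18 %C, G plus A gives 74, which is below 75.0, so C is added and the result is V, as in the reported consensus. Working from the rounded table values does not always reproduce the program's output: the fourth column of the example has 54 %T and 21 %C, which sum to exactly 75, yet the reported code is H. This suggests that the program compares unrounded values, although the text does not say so.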
All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
    Minimal Syntax: % consensus [-INfile=]acceptor.dat -Default

    Prompted Parameters:

    [-OUTfile=]acceptor.csn   output file name
    -CERtainty=75.0           percent certainty at which to find consensus

    Local Data Files: None

    Optional Parameters: None
None.
None.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc. a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.