FRAMEALIGN

[ Program Manual | User's Guide | Data Files | Databases ]

Table of Contents

FUNCTION

DESCRIPTION

FUNCTION [ Top | Next ]

FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames of a nucleotide sequence. Optimal alignments may include reading frame shifts.

DESCRIPTION [ Previous | Top | Next ]

FrameAlign inserts gaps to obtain the optimal local alignment of the best region of similarity between a protein sequence and the codons in a nucleotide sequence. Because FrameAlign can align the protein to codons in different reading frames of the nucleotide sequence, it can identify sequence similarity even when the nucleotide sequence contains reading frame shifts.

In standard sequence alignment programs, you routinely specify gap creation and extension penalties. In addition to these penalties, FrameAlign also allows you to specify a separate frameshift penalty for the creation of gaps that result in reading frame shifts in the nucleotide sequence. (See the ALGORITHM topic for a more detailed explanation of how gaps are penalized.)

By default, FrameAlign creates a local alignment between the nucleotide and protein sequences. If you specify the -GLObal command-line parameter, FrameAlign creates a global alignment where gaps are inserted to optimize the alignment between the entire nucleotide sequence and the entire protein sequence.

EXAMPLE [ Previous | Top | Next ]

Here is a session using FrameAlign to align the codons in the cDNA sequence EST:Atts0012 with the protein sequence SW:G3pc_Arath.


% framealign

 Local alignment of what sequence 1 ? EST:atts0012

                  Begin (* 1 *) ?
                End (*   286 *) ?
               Reverse (* No *) ?

 to what protein sequence ? SW:g3pc_arath

                  Begin (* 1 *) ?
                End (*   338 *) ?

 What is the gap creation penalty (* 12 *) ?

 What is the gap extension penalty (* 4 *) ?

 What is the frameshift penalty (* 0 *) ?

 What should I call the paired output display file (* atts0012.pair *) ?

 Aligning ................-....

          Gaps:     2
       Quality:   343
 Quality Ratio: 4.397
  % Similarity: 98.718
        Length:   240

%

OUTPUT [ Previous | Top | Next ]

Here is the output file:


 Local alignment of: Atts0012  check: 2422  from: 1  to: 286

LOCUS       ATTS0012      286 bp    RNA             EST       31-OCT-1992
DEFINITION  A. thaliana transcribed sequence; clone TAT1B11, 5' end; similar to
            GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE.
ACCESSION   Z17438
NID         g16580
KEYWORDS    expressed sequence tag; partial cDNA sequence. . . .

 to: G3pc_Arath  check: 7459  from: 1  to: 338

ID   G3PC_ARATH     STANDARD;      PRT;   338 AA.
AC   P25858;
DT   01-MAY-1992 (REL. 22, CREATED)
DT   01-MAY-1992 (REL. 22, LAST SEQUENCE UPDATE)
DT   01-MAY-1992 (REL. 22, LAST ANNOTATION UPDATE)
DE   GLYCERALDEHYDE 3-PHOSPHATE DEHYDROGENASE, CYTOSOLIC (EC 1.2.1.12). . . .

 Scoring matrix: /package/share/9.0/gcgcore/data/rundata/blosum62.cmp
  CompCheck: 6430
 Translation table: /package/share/9.0/gcgcore/data/rundata/translate.txt

         Gap Weight:     12      Average Match:  2.912
      Length Weight:      4   Average Mismatch: -2.003
  Frameshift Weight:      0

            Quality:    343             Length:    240
              Ratio:  4.397               Gaps:      2
 Percent Similarity: 98.718   Percent Identity: 97.436

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   2
                    . =   1

 Atts0012 x G3pc_Arath      September 19, 1996 14:32  ..

                  .         .         .         .         .
       3 GAAATCAAGAAGGCCATCAAGGAGGAATCTGAAGGCAAAATGAAGGGAAT 52
         |||||||||||||||||||||||||||||||||||||||:::||||||||
     261 GluIleLysLysAlaIleLysGluGluSerGluGlyLysLeuLysGlyIl 277
                  .         .         .         .         .
      53 TTTGGGATACTCTGAGGATGATGTTGTGTCTACCGACTTTGTTGGTGACA 102
         ||||||||||...|||||||||||||||||||||||||||||||||||||
     278 eLeuGlyTyrThrGluAspAspValValSerThrAspPheValGlyAspA 294
                  .         .         .         .         .
     103 ACAGGTCAAGCATTTTCGATGCCAAGGCTGGATTGCATTGCATTGAGCGA 152
         ||||||||||||||||||||||||||||||||    ||||||||||||||
     295 snArgSerSerIlePheAspAlaLysAlaGly....IleAlaLeuSerAs 309
                  .         .         .         .         .
     153 CAAGTTTGTGAAGTTGGTGTCATGGTACGACAACGAATGGGGTTACACAG 202
         ||||||||||||||||||||||||||||||||||||||||||||||  ||
     310 pLysPheValLysLeuValSerTrpTyrAspAsnGluTrpGlyTyr..Se 325
                  .         .         .         .
     203 TTCTCGTGTCGTTGACCTTATCGTTCACATGTCAAAGGCC 242
         ||||||||||||||||||||||||||||||||||||||||
     326 rSerArgValValAspLeuIleValHisMetSerLysAla 338

The alignment output displays sequence similarity by printing one of three characters between a codon and an amino acid: a pipe character (|), a colon (:), or a period (.). Normally, a pipe character is put between a codon and an amino acid when the translated codon is identical to the amino acid. A colon is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to the average positive non-identical comparison value in the amino acid substitution matrix. A period is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to 1. You can change these match display thresholds from the command line by specifying the -PAIr command-line parameter. (See Appendix VII for more information about comparison values in scoring matrices.)

INPUT FILES [ Previous | Top | Next ]

The input to FrameAlign is a nucleotide sequence and a protein sequence. You can specify the sequences in any order as input to the program.

RELATED PROGRAMS [ Previous | Top | Next ]

Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman. Both Gap and BestFit align two sequences of the same type (i.e. both nucleotide sequences or both protein sequences).

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

ALGORITHM [ Previous | Top | Next ]

FrameAlign aligns a nucleotide sequence with a protein sequence. The alignment procedure is an extension of the local alignment algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) that is modified to determine the score of the best segment of similarity between a protein sequence and the codons in a nucleotide sequence.

Scoring Matrix

To create the alignments, FrameAlign requires a scoring matrix that contains values for matches between all possible amino acids and codons. FrameAlign derives this amino acid - codon scoring matrix on the fly from a translation table and an amino acid substitution matrix. The translation table contains a list of all possible codons for each amino acid. The amino acid substitution matrix contains match values for the comparison of all possible amino acids.

In the derived amino acid - codon scoring matrix, the value of a match between any amino acid and any codon is the value of the match between the amino acid and the translated codon in the amino acid substitution matrix. If a codon contains IUB nucleotide ambiguity symbols (described in Appendix III), and all possible unambiguous representations of the codon translate to the same amino acid (e.g. MGR always translates to arginine in the standard genetic code), then the value of a match between that codon and any amino acid can be similarly determined. If all possible unambiguous representations of the codon do not translate to the same amino acid, then the value of a match between that codon and any amino acid is 0.
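The derivation above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not FrameAlign's own code: the names IUB, CODE, SUBST, and codon_score are hypothetical, and SUBST holds a small illustrative set of values rather than a complete substitution matrix.

```python
from itertools import product

# IUB nucleotide ambiguity symbols and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Illustrative substitution values only (assumption: a tiny subset,
# keyed by the alphabetically sorted pair of amino acids).
SUBST = {("K", "K"): 5, ("K", "R"): 2, ("R", "R"): 5}


def codon_score(codon, amino_acid, subst=SUBST):
    """Score a (possibly ambiguous) codon against an amino acid."""
    # Expand every ambiguity symbol into the bases it can represent.
    expansions = product(*(IUB[base] for base in codon.upper()))
    translations = {CODE["".join(c)] for c in expansions}
    # Codons that do not translate to a single amino acid score 0.
    if len(translations) != 1:
        return 0
    translated = translations.pop()
    return subst[tuple(sorted((translated, amino_acid)))]
```

With these values, codon_score("MGR", "R") looks up the arginine-arginine entry because all four expansions of MGR translate to arginine, while codon_score("AAN", "K") returns 0 because AAN can translate to either asparagine or lysine.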

FrameAlign chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can respond to the program prompts or use -GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values.

Protein-Nucleotide Alignment

FrameAlign uses the values in the amino acid - codon scoring matrix to determine the score of the best alignment between the protein and nucleotide sequences. If you consider a graph, or path matrix, with the nucleotide sequence placed on the X axis and the protein sequence placed on the Y axis, then every point on the path matrix represents the best alignment between the sequences that ends at that point. For any point on the path matrix, the X coordinate is the first nucleotide of the final codon in the alignment, and the Y coordinate is the final amino acid in the alignment. Each possible alignment end point is associated with a path, which is a series of steps (insertions, deletions, matches) through the path matrix required to create the alignment. Each step has its own score, and the scores for all the steps in an alignment path determine the quality score for the alignment. The quality score for an alignment is equal to the sum of the scoring matrix values of the matches in the alignment, minus the gap creation penalty multiplied by the number of gaps in the alignment, minus the frameshift penalty multiplied by the number of gaps in the alignment that change the reading frame, minus the gap extension penalty multiplied by the total length of all gaps in the alignment. (You can set the value for each of the penalties.)


quality = SUM(scoring matrix values of the matches in the alignment) -
          gap creation penalty  x  number of gaps in the alignment -
          frameshift penalty    x  number of gaps in the alignment
                                   that change the reading frame -
          gap extension penalty x  total length of all gaps
                                   in the alignment

For example, the following protein-nucleotide alignment consists of six steps:


       1 UGUUGUAUUCG....UGGUGG 17
         ||||||:::      ||||||
       1 CysCysValGlnIleTrpTrp 7

The first two steps are UGU-Cys matches. The third step is an AUU-Val match. The fourth step is a four nucleotide deletion. The last two steps are UGG-Trp matches. The quality score for this alignment is the sum of the scoring matrix values for two UGU-Cys matches, one AUU-Val match, and two UGG-Trp matches, minus one gap creation penalty, minus four gap extension penalties, minus one frameshift penalty.
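As a rough check of the arithmetic, the quality formula can be written as a short Python function. The function name and argument names are hypothetical, the default penalties are the ones offered at the prompts in the EXAMPLE session, and the match values (9 for Cys-Cys, 3 for the AUU-Val match, 11 for Trp-Trp) are assumed for illustration.

```python
def quality(match_values, gaps, frameshift_gaps, total_gap_length,
            gap_creation=12, gap_extension=4, frameshift=0):
    """Quality score as defined by the formula in this topic."""
    return (sum(match_values)
            - gap_creation * gaps
            - frameshift * frameshift_gaps
            - gap_extension * total_gap_length)


# The six-step example: five matches and one four-nucleotide gap
# that changes the reading frame. Match values are assumptions.
example_matches = [9, 9, 3, 11, 11]
example_quality = quality(example_matches, gaps=1, frameshift_gaps=1,
                          total_gap_length=4)
```

Under these assumptions the matches sum to 43, and the single gap costs 12 for creation plus 16 for extension, so example_quality is 15. Raising the frameshift penalty from 0 to 5 would lower it to 10.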

Matches between an amino acid and a partial codon, like

         CG.
         Gln

in the above example, do not add any match value to the alignment score. By convention, all gap characters in partial codons are placed at the end of the codon. For example, the partial codon CG. in the above example will never be written as C.G.

If the best alignment ending at any point has a negative value, a zero is put at that position of the path matrix; otherwise, the quality score for the alignment is put at that position. After the path matrix is completely filled, the highest value in the matrix represents the score of the best region of similarity between the sequences (optimal local alignment). This highest value is reported as the comparison score between the nucleotide and protein sequences. The alignment itself can be reconstructed for display by following the best path from this point of highest value backward to the point where the path matrix has a value of zero.
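The zero floor and the search for the highest matrix value can be illustrated with a plain Smith-Waterman fill. The sketch below is deliberately simplified and is not FrameAlign's procedure: it compares two sequences of the same type, uses a single linear gap penalty instead of separate creation and extension penalties, and knows nothing about codons or frameshifts. The function and argument names are hypothetical.

```python
def local_score(seq_a, seq_b, score, gap):
    """Fill a local-alignment path matrix; return (best value, end point)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = max(
                0,  # negative alignments are replaced by zero
                matrix[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                matrix[i - 1][j] - gap,  # gap in seq_b
                matrix[i][j - 1] - gap,  # gap in seq_a
            )
            if matrix[i][j] > best:
                best, best_pos = matrix[i][j], (i, j)
    return best, best_pos


def simple(x, y):
    """Toy comparison values: 2 for identity, -1 otherwise."""
    return 2 if x == y else -1
```

For example, local_score("XXCCWW", "CCWW", simple, gap=2) finds the four consecutive matches and reports a best value of 8 ending at matrix position (6, 4). A traceback from that point to the nearest zero would recover the aligned segment.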

ALIGNMENT METRICS [ Previous | Top | Next ]

Four figures of merit are displayed along with the optimal alignment between the protein and nucleotide sequences: Quality, Ratio, Identity, and Similarity.

The Quality score (described above in the ALGORITHM topic) is the measure that is maximized in order to align the sequences. Ratio is the Quality divided by the smaller of one-third the number of bases in the alignment and the number of amino acids in the alignment. Gap symbols are ignored in the calculation of Ratio. Identity is the percent of identical matches between amino acids and codons in the alignment (i.e. the amino acid is identical to the translated codon). Similarity is the percent of matches between amino acids and codons in the alignment whose comparison values are greater than or equal to the similarity threshold. By default, this threshold is the average positive non-identical comparison value in the scoring matrix. FrameAlign uses this same threshold to decide when to put a colon (:) between an aligned codon and amino acid in the alignment display. You can reset this threshold with the -PAIr command-line parameter.
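The figures reported in the OUTPUT topic can be reproduced with a short Python sketch. The function name is hypothetical. The counts of 76 identical and 77 similar positions are not printed by the program; they are inferred here from the reported percentages, and using the same denominator for the percentages as for Ratio is an assumption.

```python
def alignment_metrics(quality, bases, amino_acids, identical, similar):
    """Ratio, percent identity, and percent similarity (gaps excluded)."""
    # Smaller of one-third the bases and the number of amino acids.
    compared = min(bases / 3, amino_acids)
    return {
        "ratio": round(quality / compared, 3),
        "identity": round(100 * identical / compared, 3),
        "similarity": round(100 * similar / compared, 3),
    }


# Example alignment: bases 3-242 (240 bases), amino acids 261-338 (78).
metrics = alignment_metrics(quality=343, bases=240, amino_acids=78,
                            identical=76, similar=77)
```

Here one-third of the 240 bases is 80, so the 78 amino acids set the denominator, and the result matches the reported Ratio of 4.397, Percent Identity of 97.436, and Percent Similarity of 98.718.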

CONSIDERATIONS [ Previous | Top | Next ]

FrameAlign Always Finds Something

FrameAlign always finds an alignment for any protein and nucleotide sequences you compare, even if there is no significant similarity between them. You must evaluate the results critically to decide whether the segment shown is more than a random region of relative similarity.

FrameAlign Shows Only a Single Segment of Similarity

FrameAlign shows only one optimal alignment between a protein sequence and a nucleotide sequence. There are reasons why you might want to evaluate several optimal and suboptimal alignments.

- If there are several disjoint segments of similarity, the selection of only a single segment for display does not provide a comprehensive view of the relationship between the nucleotide and protein sequences.

- The alignments displayed by FrameAlign are sensitive to your choices for the scoring matrix and gap penalties. If you vary these choices even slightly, FrameAlign may calculate different optimal alignments for the same segment of similarity between the sequences. If FrameAlign were able to display multiple and suboptimal alignments of the same region, you would be able to use the variation among the different alignments to determine which portions of the alignments were reliably determined.

SUGGESTIONS [ Previous | Top | Next ]

Aligning Long Sequences

If FrameAlign cannot gain access to enough computer memory to create the alignment, the program stops. You can force the program to use less computer memory by specifying gap shift limits for each sequence with the -LIMit1 and -LIMit2 command-line parameters. See the OPTIONAL PARAMETERS topic for a description of these parameters and the potential drawbacks of their use.

Nucleotide Sequences Using Nonstandard Genetic Codes

If the nucleotide sequence is from an organism or organelle that uses a nonstandard genetic code, then you should specify an appropriate translation table using the -TRANSlate command-line parameter. Different translation tables are discussed in Appendix VII.

Aligning a Protein Sequence with a Genomic Sequence Containing Introns

If you align a genomic sequence containing long introns to its corresponding protein sequence, FrameAlign will often display the local alignment of only one of the exons to its corresponding portion of the protein. To align the entire protein sequence to the entire genomic sequence, use the -GLObal command-line parameter and reduce the gap extension penalty in response to the program prompt.

COMMAND-LINE SUMMARY [ Previous | Top | Next ]

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % framealign [-INfile1=]EST:Atts0012 \
                  [-INfile2=]SW:G3pc_Arath -Default

Prompted Parameters:

-BEGin1=1 -END1=286      range of interest for first sequence
-BEGin2=1 -END2=338      range of interest for second sequence
-REVerse                 strand for nucleotide sequence
-GAPweight=12            gap creation penalty
-LENgthweight=4          gap extension penalty
-FRAmeweight=0           frameshift gap penalty
[-OUTfile1=]atts0012.pair output file for alignment

Local Data Files: -MATRix=blosum62.cmp      amino acid substitution matrix
                  -TRANSlate=translate.txt  contains the genetic code

Optional Parameters:

-GLObal                    creates global alignment (default is local)
  -ENDWeight               penalizes end gaps in global alignments like
                             other gaps
-LIMit1=337                gap shift limit for nucleotide sequence
-LIMit2=285                gap shift limit for protein sequence
-HIGhroad                  among equally optimal alignments, shows one
                             with maximum gaps in protein sequence
-LOWroad                   among equally optimal alignments, shows one
                             with maximum gaps in nucleotide sequence
-PAIr=x,2,1                thresholds for displaying '|', ':', and '.'
-WIDth=50                  the number of sequence symbols per line
-PAGe=60                   adds a line with a form feed every 60 lines
-NOBIGGaps                 suppresses abbreviation of large gaps with '.'s
-OUTfile2[=atts0012.gap]   new file for nucleotide sequence with gaps added
-OUTfile3[=g3pc_arath.gap] new file for protein sequence with gaps added
-BATch                     submits program to the batch queue
-NOMonitor                 suppresses the screen trace of program progress
-NOSUMmary                 suppresses the screen summary

ACKNOWLEDGEMENTS [ Previous | Top | Next ]

FrameAlign was written by Irv Edelman.

LOCAL DATA FILES [ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program default scoring matrix file in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.

FrameAlign creates a scoring matrix on the fly that contains values for matches between all possible amino acids and all possible codons. (See the ALGORITHM topic for details.) FrameAlign creates this amino acid - codon scoring matrix from a translation table and an amino acid substitution matrix. The translation table, containing a list of all possible codons for each amino acid, is defined in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file with exactly the same name in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. The amino acid substitution matrix, containing match values for the comparison of all possible amino acids, is defined in the file blosum62.cmp. This matrix is a copy of the BLOSUM62 scoring matrix described by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)). You can use the Fetch program to copy this file to your local directory and modify the match values to suit your own needs. (See Appendix VII for more information about translation tables and scoring matrices.)

OPTIONAL PARAMETERS [ Previous | Top | Next ]

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see the Local Scoring Matrices topic above.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)

-GLObal

aligns the entire lengths of the nucleotide and protein sequences (global alignment). By default, FrameAlign determines a local alignment of the best region of similarity between the protein sequence and the codons in the nucleotide sequence.

-ENDWeight

penalizes gaps placed before the beginning of a sequence and after the end of a sequence the same as gaps inserted within a sequence. By default, gaps placed at the very ends of sequences in global alignments are not penalized at all.

-LIMit1=337

sets the maximum allowable register shift, or gap shift limit, for any base in the nucleotide sequence being aligned to a protein sequence. For example, with a nucleotide gap shift limit of 100, a base at position x (where x is any position) in the nucleotide sequence can align with any amino acid at position 1 through position x + 100 in the protein sequence. By default, the gap shift limit for the nucleotide sequence is the entire length of the protein sequence, minus 1. If you specify a smaller gap shift limit, the alignment will proceed more rapidly, but the program may not find the optimal alignment if that alignment requires a larger gap shift limit.

If you add -LIMit to the command line without a value, FrameAlign prompts you to enter gap shift limits for each sequence.

-LIMit2=285

sets the maximum allowable register shift, or gap shift limit, for any amino acid in the protein sequence being aligned to a nucleotide sequence. For example, with a protein gap shift limit of 150, an amino acid at position x (where x is any position) in the protein sequence can align with any base at position 1 through position x + 150 in the nucleotide sequence. By default, the gap shift limit for the protein sequence is the entire length of the nucleotide sequence, minus 1. If you specify a smaller gap shift limit, the alignment will proceed more rapidly, but the program may not find the optimal alignment if that alignment requires a larger gap shift limit.

If you add -LIMit to the command line without a value, FrameAlign prompts you to enter gap shift limits for each sequence.
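The default limits and the constraint they impose can be sketched in Python as follows. The function names are hypothetical, and combining the two limits into a single test is one reading of the descriptions above, not a statement of how FrameAlign applies them internally.

```python
def default_limits(nucleotide_length, protein_length):
    """Default (-LIMit1, -LIMit2) values for a pair of sequences."""
    return protein_length - 1, nucleotide_length - 1


def within_limits(base_pos, aa_pos, limit1, limit2):
    """True if a base and an amino acid position may be aligned.

    limit1 bounds how far past base_pos the amino acid may lie;
    limit2 bounds how far past aa_pos the base may lie.
    """
    return aa_pos <= base_pos + limit1 and base_pos <= aa_pos + limit2
```

For the sequences in the EXAMPLE session (286 bases and 338 amino acids), default_limits(286, 338) gives (337, 285), the values shown in the COMMAND-LINE SUMMARY. With limits of 100 and 150, within_limits(10, 110, 100, 150) is true, but within_limits(10, 111, 100, 150) is false.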

-HIGhroad

displays the optimal alignment with the maximal number of gaps in the protein sequence when several equally optimal alignments are possible.

-LOWroad

displays the optimal alignment with the maximal number of gaps in the nucleotide sequence when several equally optimal alignments are possible.

-PAIr=4,2,1

changes the thresholds for the display of sequence similarity in the alignment output.

In the program output, the paired alignment displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character (|), a colon (:), or a period (.). Normally, a pipe character is put between a codon and an amino acid when the translated codon is identical to the amino acid. A colon is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to the average positive non-identical comparison value in the amino acid substitution matrix. A period is put between a codon and an amino acid when the comparison value between the translated codon and the amino acid is greater than or equal to 1.

The three parameter values for -PAIr are the display thresholds for the pipe character, colon, and period, respectively. By default, a pipe character is inserted between identical sequence symbols. If you specify a numerical threshold as the first value, a pipe character will no longer be inserted between identical symbols unless their comparison value is greater than or equal to this threshold. If you want to specify a threshold for the display of colons and periods, but you still want a pipe character to connect identical symbols, use x instead of a number as the first value. (See Appendix VII for more information about comparison values in scoring matrices.)
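The threshold logic can be sketched as a small Python function. The name match_symbol is hypothetical, and the default thresholds are the ones shown in the sample output (x,2,1). When the first value is numeric, this sketch assumes the threshold is applied to the comparison value regardless of identity, which the description above does not spell out.

```python
def match_symbol(translated, amino_acid, value, thresholds=("x", 2, 1)):
    """Pick the display character for one codon/amino acid pair."""
    pipe, colon, period = thresholds
    if pipe == "x":
        # Default behavior: a pipe marks identity.
        if translated == amino_acid:
            return "|"
    elif value >= pipe:
        # Assumption: a numeric first threshold tests the value only.
        return "|"
    if value >= colon:
        return ":"
    if value >= period:
        return "."
    return " "
```

With the defaults, a Met codon aligned to Leu with a comparison value of 2 is shown with a colon, and a Ser codon aligned to Thr with a value of 1 is shown with a period. With -PAIr=4,2,1, an identical pair whose comparison value is 3 would be shown with a colon rather than a pipe.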

-WIDth=50

sets the number of sequence symbols on each line of the alignment display.

-PAGe=60

adds form feeds to the output file so that each alignment begins at the top of a new page. Also, a form feed is added after every 60 lines of each alignment output. You can change the number of lines per page for each alignment display by specifying a number after the -PAGe parameter.

-NOBIGGaps

Normally, if one of the sequences is aligned opposite gap characters for one or more complete lines of the alignment, then that portion of the alignment is abbreviated with three dots arranged in a vertical line. -NOBIGGaps displays the entire alignment without abbreviation.

-OUTfile2=atts0012.gap

writes the nucleotide sequence, with gaps added for alignment to the protein sequence, into a separate output file. You can use the output sequence as input to other GCG programs expecting nucleotide sequence input.

-OUTfile3=g3pc_arath.gap

writes the protein sequence, with gaps added for alignment to the nucleotide sequence, into a separate output file. You can use the output sequence file as input to other GCG programs expecting protein sequence input.

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used the -Default parameter to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: November 18, 1996 13:04 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.