SEGMENTS

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
CONSIDERATIONS
COMMAND-LINE SUMMARY
LOCAL DATA FILES
OPTIONAL PARAMETERS

FUNCTION

Segments aligns and displays the segments of similarity found by WordSearch.

DESCRIPTION

WordSearch uses word comparison, which is very fast, to identify regions of possible similarity between a query sequence and a set of sequences. Segments uses optimal alignment, which is slow but precise, to display the best segment of similarity in each region identified by WordSearch. WordSearch finds the regions of possible similarity with a method similar to that of Wilbur and Lipman (Proc. Natl. Acad. Sci. (USA) 80; 726-730 (1983)). Segments searches for the segments with the alignment procedure of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)).

Segments uses a scoring matrix, a gap creation penalty, and a gap extension penalty to find the best region of similarity between two sequences. The best region has the highest quality, where quality is the sum of the matches minus the sum of the mismatches minus the sum of the gap creation and extension penalties for the gaps added. The best region must fall within a band around the peak diagonal found by WordSearch; the band extends width + 1 diagonals on either side of that diagonal (see the ALGORITHM topic below).
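
The quality calculation can be illustrated with a minimal Python sketch. It is not the program's own code. The function name, the use of '.' as the gap symbol, and the rule that each gap costs the gap creation penalty once plus the gap extension penalty per gap symbol are assumptions made for illustration. The match and mismatch values (+10 and -6) and the penalties (50 and 3) are the nucleotide defaults quoted elsewhere in this document.

```python
GAP = "."  # assumed gap symbol for this sketch


def alignment_quality(top, bottom, match=10, mismatch=-6,
                      gap_weight=50, length_weight=3):
    """Score two already-aligned, equal-length strings."""
    quality = 0
    prev_side = 0  # 0 = no gap, 1 = gap in top, 2 = gap in bottom
    for x, y in zip(top, bottom):
        side = 1 if x == GAP else 2 if y == GAP else 0
        if side:
            if side != prev_side:
                quality -= gap_weight      # charged once per gap
            quality -= length_weight       # charged per gap symbol (assumed)
        else:
            quality += match if x == y else mismatch
        prev_side = side
    return quality


# 444 identical symbols and no gaps, as in the first alignment
# of the example session (Quality: 4440 / Length: 444).
print(alignment_quality("A" * 444, "A" * 444))
```

Under these assumptions an ungapped alignment of 444 identical symbols scores 444 x 10 = 4440, which agrees with the quality reported for the first sequence in the EXAMPLE topic.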

EXAMPLE

Here is a session using Segments to align a human globin coding sequence with the sequences from the GenEMBL nucleotide sequence database that were found in the example session for WordSearch:


% segments

 (BestFit) SEGMENTS from what WORDSEARCH file ?  ggammacod.word

 What should I call the output file (* ggammacod.pairs *) ?

 Aligning ......................-...
 Gb_Pr:Humhbgg    545 bp  Gaps:  0  Quality:   4440 / Length: 444
 Aligning .....................-..
 Gb_Pr:Hsgggphg    521 bp  Gaps:  0  Quality:   3814 / Length: 383
 Aligning ......................-...
 Gb_Om:Ocbgl2    589 bp  Gaps:  0  Quality:   2984 / Length: 444

 /////////////////////////////////////////////////////////////////

%

OUTPUT

Here is part of the output file:


 (BestFit) SEGMENTS from: ggammacod.word  October 11, 1996 09:59

 (Masked) (Nucleotide) WORDSEARCH of: GenDocData:ggammacod.seq  check: 2906
 from: 1  to: 444
 ASSEMBLE    July 27, 1994 11:40
Symbols:     1 to: 92    from: gamma.seq  ck: 6474,  2179 to: 2270
Symbols:    93 to: 315   from: gamma.seq  ck: 6474,  2393 to: 2615
Symbols:   316 to: 444   from: gamma.seq  ck: 6474,  3502 to: 3630
Human fetal beta globins G and A gamma . . .

 AvMatch: 3.84  AvMisMatch: -6.00  GapWeight: 50  LengthWeight: 3   ..

        Match display thresholds for the alignment(s):
                    | = IDENTITY
                    : =   3
                    . =   1

ggammacod.seq             check: 2906  from: 1      to: 444
Gb_Pr:Humhbgg             check: 7917  from: 17     to: 545
     M15386 Human glycine-gamma-globin, 3' end. 11/94
 Gaps: 0  Quality: 4440  Ratio: 10.000  Score: 442  Width: 3  Limits: +/-4
                  .         .         .         .         .
       1 ATGGGTCATTTCACAGAGGAGGACAAGGCTACTATCACAAGCCTGTGGGG 50
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      18 ATGGGTCATTTCACAGAGGAGGACAAGGCTACTATCACAAGCCTGTGGGG 67
                  .         .         .         .         .
      51 CAAGGTGAATGTGGAAGATGCTGGAGGAGAAACCCTGGGAAGGCTCCTGG 100
         ||||||||||||||||||||||||||||||||||||||||||||||||||
      68 CAAGGTGAATGTGGAAGATGCTGGAGGAGAAACCCTGGGAAGGCTCCTGG 117

/////////////////////////////////////////////////////////////////////////

ggammacod.seq             check: 2906  from: 1      to: 444
Gb_Pr:Gibhbggl            check: 6379  from: 2338   to: 11493
     J05174 Gibbon gamma-1 and gamma-2 globin genes, complete cds. 6/90
 Gaps: 0  Quality: 2142  Ratio: 9.436  Score: 205  Width: 3  Limits: +/-4
                  .         .         .         .         .
      91 AGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGG 140
         ||||||||||||||||||||||||||||||||||||||||||||||||||
    2429 AGGCTCCTGGTTGTCTACCCATGGACCCAGAGGTTCTTTGACAGCTTTGG 2478
                  .         .         .         .         .
     141 CAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCCAAAGTCAAGGCAC 190
         |||||||||||||||||||||||||||||||||||| |||||||||||||
    2479 CAACCTGTCCTCTGCCTCTGCCATCATGGGCAACCCAAAAGTCAAGGCAC 2528

/////////////////////////////////////////////////////////////////////////

INPUT FILES

Segments accepts the output file of WordSearch as input. If any of the search set sequences listed in this file have been changed or deleted, Segments acts as if they do not exist. If the WordSearch query sequence listed in this file no longer exists, Segments complains and stops. Segments also reads the beginning and ending positions of the query sequence in the output file from WordSearch. If Segments cannot read this range, the entire query sequence is used.

RELATED PROGRAMS

Segments is an automated version of the BestFit program run with the command-line parameter -LIMit, with the limits set to plus and minus width+1. The output file of WordSearch is the input file for Segments. Compare/DotPlot and BestFit are more flexible tools for examining the relationship between two sequences when automation is not desired.

BLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. BLAST can search databases on your own computer or databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

TFastA does a Pearson and Lipman search for similarity between a query peptide sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied peptide sequences in a nucleotide sequence database are similar to my peptide sequence?"

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

RESTRICTIONS

The diagonal of comparison cannot be longer than 30,000 and the surface of comparison may not be larger than one million. The surface of comparison can be estimated by multiplying the average length of the two sequences being compared by the sum of the two gap shift limits. (See the ALGORITHM topic below for more information about gap shift limits.) Segments truncates sequences that exceed 30,000 symbols and squeezes the gap shift limits to keep the surface within the one-million limit.
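
As a rough illustration of the estimate described above, the following Python sketch computes the surface of comparison for one pair of sequences. The function and constant names are hypothetical. The sketch assumes that both gap shift limits equal width + 1, as described under ALGORITHM, and it only reports whether the estimate is within the limit, because this document does not say how the program squeezes the limits.

```python
MAX_DIAGONAL = 30_000     # longest diagonal of comparison
MAX_SURFACE = 1_000_000   # largest surface of comparison


def estimate_surface(len_a, len_b, width):
    """Return (estimated surface, True if within the one-million limit)."""
    # Sequences longer than the diagonal limit are truncated.
    len_a = min(len_a, MAX_DIAGONAL)
    len_b = min(len_b, MAX_DIAGONAL)
    limit = width + 1                     # each gap shift limit
    average_length = (len_a + len_b) / 2
    surface = average_length * (2 * limit)
    return surface, surface <= MAX_SURFACE


# Query of 444 symbols against a 545-symbol entry with Width: 3
print(estimate_surface(444, 545, 3))
```

For the first alignment in the example output (444 and 545 symbols, width 3), the estimate is 494.5 x 8 = 3956, far below the limit.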

ALGORITHM

Segments reads the query sequence and the set of sequences and diagonals in the output list from WordSearch and then executes a limited BestFit on each pair of sequences to make an alignment near that diagonal. For a detailed description, see BestFit (-LIMit), and imagine that the gap shift limits are both set to width + 1. Width is defined as the width of a structure in the histogram from a word comparison (see the WordSearch program). Width is the fifth column of data in the WordSearch output file.
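
The idea of a limited alignment near one diagonal can be sketched in Python as a banded Smith-Waterman score. This is an illustration under stated assumptions, not the BestFit implementation: the function name is hypothetical, the diagonal is expressed as the offset j - i between positions in the two sequences, identical symbols score +10 and all others -6 in place of a full scoring matrix, and a gap of n symbols is assumed to cost gap_weight + n x length_weight. Only the best quality is returned; no alignment is traced back.

```python
NEG = float("-inf")


def banded_local_quality(a, b, diagonal, limit, match=10, mismatch=-6,
                         gap_weight=50, length_weight=3):
    """Best local alignment quality using only cells (i, j) with
    abs((j - i) - diagonal) <= limit."""
    best = 0
    m = len(b)
    prev_h = [0] * (m + 1)      # row 0: an alignment may start anywhere
    prev_f = [NEG] * (m + 1)    # best score ending in a vertical gap
    for i in range(1, len(a) + 1):
        cur_h = [NEG] * (m + 1)  # cells outside the band stay unreachable
        cur_f = [NEG] * (m + 1)
        cur_h[0] = 0
        e = NEG                  # best score ending in a horizontal gap
        lo = max(1, i + diagonal - limit)
        hi = min(m, i + diagonal + limit)
        for j in range(lo, hi + 1):
            e = max(e - length_weight,
                    cur_h[j - 1] - gap_weight - length_weight)
            f = max(prev_f[j] - length_weight,
                    prev_h[j] - gap_weight - length_weight)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(0, prev_h[j - 1] + s, e, f)
            cur_h[j], cur_f[j] = h, f
            best = max(best, h)
        prev_h, prev_f = cur_h, cur_f
    return best


x, y = "ACGTACGTAC", "GATCGATCGA"
# With limit 1 the alignment can step across the inserted T;
# with limit 0 it is confined to a single diagonal.
print(banded_local_quality(x + y, x + "T" + y, diagonal=0, limit=1))
print(banded_local_quality(x + y, x + "T" + y, diagonal=0, limit=0))
```

With a WordSearch width of 3, a call in this sketch would use limit=4, which corresponds to the "Width: 3  Limits: +/-4" line in the example output.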

CONSIDERATIONS

There is strong reason to believe that the BestFit algorithm used by Segments is the best way to search for segments of similarity (Lipman and Pearson, "Rapid and Sensitive Protein Similarity Searches," Science 227; 1435-1441 (1985)), but the best parameters to use for Segments are not yet clear. Like any alignment program, Segments produces alignments that are very different depending on the values assigned for match, mismatch, gap creation penalty, and gap extension penalty. Segments chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) Similarly, if you have done a simplified word search and adjust the match and mismatch comparison values with the -MATch and -MISmatch command-line parameters, the program will adjust the default gap penalties accordingly. You can use -GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values.

The Public Scoring Matrix is Quite Stringent

The public scoring matrix file segdna.cmp scores matches as +10 and mismatches as -6, which means that the segment shown is cut off if there is any significant region where mismatches outnumber matches by about a 2:1 ratio. If the words scored by WordSearch were dispersed along the diagonal, then some of them may not appear in the alignment for that diagonal.
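
The cut-off effect can be seen in a small Python sketch that finds the best-scoring ungapped segment along a single diagonal. The function name is hypothetical and gaps are ignored to keep the example short. With +10 and -6, a run of mismatches lowers the score faster than the matches beyond it can recover once mismatches outnumber matches by more than 10:6, which is the "about 2:1" ratio mentioned above.

```python
def best_ungapped_segment(a, b, match=10, mismatch=-6):
    """Return (score, start, end) of the best segment, end exclusive."""
    best = (0, 0, 0)
    run, start = 0, 0
    for k, (x, y) in enumerate(zip(a, b)):
        if run <= 0:
            run, start = 0, k   # a non-positive prefix never helps
        run += match if x == y else mismatch
        if run > best[0]:
            best = (run, start, k + 1)
    return best


# Two mismatches between matching stretches: the segment spans both.
print(best_ungapped_segment("AAAAACCAAA", "AAAAAGGAAA"))
# Ten mismatches: the segment is cut off after the first stretch.
print(best_ungapped_segment("AAAAA" + "C" * 10 + "AAA",
                            "AAAAA" + "G" * 10 + "AAA"))
```

In the second call the three matching symbols after the mismatched region are left out of the segment, which is how words scored by WordSearch can fail to appear in the displayed alignment.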

The Alignments Miss Some Words

Segments often fails to display every word scored for the peak diagonal if the words were not tightly grouped along the diagonal. You can use the command-line parameter -WHOle to get Needleman-Wunsch alignments that traverse the entire length of the diagonal. If you run Compare with the -WORd parameter and plot the output with DotPlot, you see the exact pattern of word identities between two sequences.

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax:  % segments [-INfile=]ggammacod.word -Default

Prompted Parameters:

[-OUTfile=]ggammacod.pairs  output file

Local Data Files:

-MATRix=segdna.cmp      scoring matrix for nucleic acids
-MATRix=blosum62.cmp    scoring matrix for peptide sequences

Optional Parameters:

-GAPweight=50           gap creation penalty
-LENgthweight=3         gap extension penalty
-PAIr=x,5,1             thresholds for displaying '|', ':', and '.'
-WIDth=50               the number of sequence symbols per line
-PAGe=60                adds a line with a form feed every 60 lines
-NOBIGGaps              suppresses abbreviation of large gaps with '.'s
-MATch=+10              symbol match value for simplified word searches
-MISmatch=-5            symbol mismatch value for simplified word searches
-WHOle                  aligns whole diagonal, not just the best segment
-NOMONitor              suppresses the screen monitor

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program default scoring matrix file in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
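
The search order for a matrix named without a directory specification can be pictured with the following Python sketch. It is only an illustration: the function name is hypothetical, and the caller must supply the actual directories that the logical names MyData, GenMoreData, and GenRunData stand for on a given system, since this document does not say how they are resolved.

```python
from pathlib import Path


def find_matrix(name, search_dirs):
    """Return the first existing file called name, or None.

    search_dirs should list, in order: the local directory, then the
    directories behind MyData, GenMoreData, and GenRunData.
    """
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None
```

Because the first hit wins, a file in your local directory hides a file with the same name in any of the later directories.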

Segments reads comparison values from the scoring matrix file segdna.cmp (nucleic acids) or blosum62.cmp (peptides). If the WordSearch sequences were simplified, Segments uses the same simplification table used by WordSearch to construct a scoring matrix.

Segments run with the command-line parameter -WHOle uses the scoring matrix file seggapdna.cmp for nucleotide sequence comparison instead of segdna.cmp. The scoring matrix for protein sequence comparisons, blosum62.cmp, is unchanged.

OPTIONAL PARAMETERS

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see the Local Scoring Matrices topic above.

-GAPweight=50

lets you designate a gap creation penalty if you don't want the default penalty. (See the ALGORITHM topic in BestFit for a description of gap creation penalties.)

-LENgthweight=3

lets you select a gap extension penalty if you don't want the default penalty. (See the ALGORITHM topic in BestFit for a description of gap extension penalties.)

-WHOle

causes this program to make alignments using the method of Needleman and Wunsch instead of the default method of Smith and Waterman. The difference between these two methods is the same as the difference between the programs Gap and BestFit. The Needleman and Wunsch method displays the whole length of both sequences after alignment, while the Smith and Waterman method shows only the best segment of similarity from each sequence.

The -WHOle parameter causes Segments to read the local data file seggapdna.cmp for nucleotide sequence comparisons.

-PAIr=4,2,1

The paired output file from this program displays sequence similarity by printing one of three characters between similar sequence symbols: a pipe character (|), a colon (:), or a period (.). Normally a pipe character is put between symbols that are the same, a colon is put between symbols whose comparison value is greater than or equal to the average positive non-identical comparison value in the scoring matrix, and a period is put between symbols whose comparison value is greater than or equal to 1. You can change these match display thresholds from the command line. The three values associated with -PAIr are the display thresholds for the pipe character, colon, and period. The match display criterion for a pipe character changes from symbolic identity (the default) to the quantitative threshold you have set in the first parameter. A pipe character will no longer be inserted between identical symbols unless their comparison values are greater than or equal to this threshold. If you still want a pipe character to connect identical symbols, use x instead of a number as the first value. (See Appendix VII for more information about scoring matrices.)
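
The threshold rules can be summarized in a short Python sketch. The function name and argument layout are hypothetical. The sketch assumes the comparison value for the two symbols has already been looked up in the scoring matrix, and that a blank is printed when no threshold is met.

```python
def match_symbol(sym_a, sym_b, value, pair=("x", 5, 1)):
    """Pick the character shown between two aligned symbols.

    pair holds the thresholds for '|', ':', and '.'; an "x" in the
    first position means '|' marks symbolic identity.
    """
    pipe, colon, period = pair
    if pipe == "x":
        if sym_a == sym_b:
            return "|"
    elif value >= pipe:
        return "|"
    if value >= colon:
        return ":"
    if value >= period:
        return "."
    return " "


print(match_symbol("A", "A", 10))             # identity
print(match_symbol("A", "A", 3, (4, 2, 1)))   # identical, but below 4
```

The second call shows the effect described above: once the first value is a number, identical symbols whose comparison value falls below it are no longer joined by a pipe character.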

-PAGe=60

Printed output from this program may cross from one page to another in an annoying way. Use this parameter to add form feeds to the output file in order to try to keep clusters of related information together. You can set the number of lines per page by supplying a number after -PAGe.

-WIDth=50

puts 50 sequence symbols on each line of the output file. You can set the width to anything from 10 to 150 symbols.

-NOBIGGaps

suppresses large gap abbreviations, showing all the sequence characters across from large gaps. Usually, gaps that extend one sequence by more than one complete line of output are abbreviated with three dots arranged in a vertical line.

-MATch=10

If you have done a simplified word search, Segments must make up a scoring matrix that looks like your simplification scheme. The matrix normally assigns 10 for all the symbol comparisons you treated as equivalent and -20/Alphabet size for all other symbol comparisons. The -MATch and -MISmatch parameters allow you to set values other than 10 for matches and -20/Alphabet size for mismatches.
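
A matrix of this kind might be built as in the Python sketch below. The function name and the grouping used in the example are hypothetical, and the sketch assumes that "Alphabet size" means the number of distinct symbols before simplification; the actual simplification table comes from WordSearch and is not shown in this document.

```python
def simplified_matrix(groups, match=10, mismatch=None):
    """Build {(a, b): value} from groups of symbols treated as equivalent."""
    symbols = [s for group in groups for s in group]
    if mismatch is None:
        mismatch = -20 / len(symbols)   # default: -20 / alphabet size
    group_of = {s: k for k, group in enumerate(groups) for s in group}
    return {
        (a, b): match if group_of[a] == group_of[b] else mismatch
        for a in symbols
        for b in symbols
    }


# Hypothetical simplification: purines (A, G) versus pyrimidines (C, T)
matrix = simplified_matrix(["AG", "CT"])
print(matrix[("A", "G")], matrix[("A", "C")])
```

For a four-symbol nucleotide alphabet the default mismatch value works out to -20/4 = -5, which agrees with the -MISmatch=-5 shown in the command-line summary.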

-MISmatch=-5

See the -MATch parameter for a description of -MISmatch.

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

Printed: November 18, 1996 13:05 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com