The programs listed below are new to Version 9 of the Wisconsin Package(TM).
Manipulation
Sequence Exchange
BreakUp
Graphics program for the Macintosh
GCGFigure
GCGFigure is freely available to all GCG users. You can download it by anonymous FTP from ftp://alanine.gcg.com in the /pub/mac directory; it is also available on the software CD in the /gcgunsupported directory. A README file is also available. See your system manager for assistance.
SeqLab(TM), a graphical user interface based on OSF/Motif(TM), is new to Version 9 of the Wisconsin Package. SeqLab combines the best of the Wisconsin Package Interface (WPI), released in Version 8.0 of the Wisconsin Package, and the Genetic Data Environment (GDE). GDE was originally developed in the Department of Microbiology, University of Illinois, at Urbana-Champaign, Illinois, USA (Smith et al., CABIOS, 10(6):671-675 (1994)).
SeqLab lets you view and edit sequence features represented by color highlighting or schematic figures. SeqLab is also a powerful sequence selector, allowing you to select multiple ranges based on features. For example, you can select and delete all introns from a DNA sequence before translating it, or select all the promoters from an alignment of similar genes. You can add features to sequences as the result of a sequence analysis, such as secondary structure predictions or pattern searching. You can display the results from these analyses over a multiple sequence alignment as color highlighting.
With the addition of SeqLab, the Wisconsin Package now offers fully integrated editing, analysis, and annotation capabilities.
Name Change for WPI Resource Files
The graphical user interface to the Wisconsin Package is no longer called WPI; the new name is SeqLab. Therefore, if you have modified WPI resource files in your own login directory to customize fonts and colors, you will need to create SeqLab resource files to do the equivalent. To create a new version of the resource file, copy the old WPI resource file to the new SeqLab equivalent, and replace any instance of Wpi within the file with SeqLab. (Note that case sensitivity is important).
Old WPI Filename    New SeqLab Filename
----------------    -------------------
OpenVMS
Wpi.Dat             SeqLab.Dat
WpiSmall.Dat        SeqLabSmall.Dat
WpiLarge.Dat        SeqLabLarge.Dat
----------------    -------------------
UNIX
Wpi                 SeqLab
WpiSmall            SeqLabSmall
WpiLarge            SeqLabLarge
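The copy-and-replace step described above can be sketched in Python. This is only an illustration, assuming the UNIX filenames from the table and a resource file stored in a directory you choose; the function names are hypothetical and are not part of the Wisconsin Package.

```python
from pathlib import Path

# UNIX filename mapping taken from the table above.
RESOURCE_NAMES = {
    "Wpi": "SeqLab",
    "WpiSmall": "SeqLabSmall",
    "WpiLarge": "SeqLabLarge",
}


def convert_resource_text(text: str) -> str:
    """Replace every case-sensitive instance of 'Wpi' with 'SeqLab'."""
    return text.replace("Wpi", "SeqLab")


def convert_resource_file(directory: Path, old_name: str) -> Path:
    """Copy an old WPI resource file to its SeqLab equivalent (hypothetical helper)."""
    new_path = directory / RESOURCE_NAMES[old_name]
    old_text = (directory / old_name).read_text()
    new_path.write_text(convert_resource_text(old_text))
    return new_path
```

Because `str.replace` is case sensitive, an entry spelled `wpi` is left untouched, which matches the note about case sensitivity.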
RSF Files New to Version 9
SeqLab creates Rich Sequence Format (RSF) files, which can contain one or more sequences as well as their sequence feature annotations. For more information, see "New File Format: RSF Files" in the Package-Wide Enhancements section of these release notes.
Menu Changes
In the Main List, the Sequences menu has been replaced by the Edit menu. In addition, you may find some of the functions originally located under the Sequences menu in the File menu.
The Wisconsin Package programs available from within the Functions menu have been reorganized. You may find programs grouped under different and multiple functional headings.
Adding Sequences to Your Main List
The function that enables you to add sequences to the list file loaded in the Main List is now found under the File menu. Previously this function was available from the Sequences menu.
User Preference Options Moved
A new User Preferences dialog box is available from the Options menu of the SeqLab Main Window. From this dialog box you can customize General, Output, and Editor Properties options. Some of the preferences that were available elsewhere in WPI have been moved to this dialog box.
General. The Working Directory option was moved from the Main Window to the General option in the User Preferences dialog box.
Output. Options that determine how and when the output from a program is displayed on your screen were moved from the Output Manager to the Output option in the User Preferences dialog box. In addition, the Global Qualifiers options now appear in the Output option.
General Changes to Running Programs
When you click the Run button in a program window, the window now automatically closes. SeqLab maintains the state of the selected parameters in the window. That is, the next time you open the program window during the session, the parameter values appear as they were the last time you ran the program.
In addition, you must now choose the input sequences for an application before you open a program window. The Change/Select button for input sequences is no longer available.
Programs Removed from SeqLab
Known Bugs
MacX
The MacX local window manager with Macintosh style window decorations is not recommended. This window manager ignores the window stacking order imposed by Motif and can cause serious difficulties in using SeqLab. Use the Motif style window manager or use a "rooted" session with the Motif window manager (mwm) running on the server.
FastA File Compatibility
On a similar note, setting the global switch for the default sequence format with the SeqFormat program is not supported by SeqLab. Import any sequences in a similar manner as above.
You will find the following information in this section:
- Program Enhancements
- Package-Wide Enhancements
- Documentation Enhancements
Program Enhancements
Editing
Fragment Assembly
Enhancement: The maximum size for a consensus sequence created by GelMerge is now 100,000 bases; the previous maximum was 75,000 bases.
Enhancement: The program can now accept contigs containing a maximum of 1,650 fragments; the previous maximum was 1000 fragments.
Change: <Ctrl>H now deletes the character to the left of the cursor; previously it moved the cursor to the beginning of the line.
Mapping
Enhancement: This program now supports the following optional parameters:
-VERtical displays enzyme names vertically over the cut points, as in previous versions of the Wisconsin Package. In Version 9, enzyme names are displayed horizontally over the cut sites by default.
-BOTtom displays cut sites on both the forward and reverse strands of nucleotide sequences.
-NOCUTline suppresses the line of pipe (|) symbols that indicates the cut positions on the sequence in the output file.
-TABle writes a table of cut positions sorted by position along the sequence. If you specify a nucleotide sequence as input, then all cut positions on both strands are written in the table. You can use this table as input to other programs, such as spreadsheets.
-SORtbyenzyme, when used with -TABle, writes a table of cut positions that is sorted alphabetically by enzyme name rather than by cut position along the sequence.
-MINSitelen=6 selects enzymes with at least six bases in their recognition sites. You can specify any minimum length for the recognition site with this parameter.
-OVErhang=0 selects only those restriction endonucleases that leave blunt-end cuts. You can select enzymes that leave either 5' or 3' overhangs by using 5 or 3, respectively, with this parameter. You can also select enzymes that leave more than one type of overhang; for instance -OVErhang=5,3 selects restriction endonucleases that leave either 5' or 3' overhangs but not blunt ends.
-CUTters writes an enzyme data file containing those enzymes that cut the input sequence. You can then use this enzyme data file as input to other mapping programs.
-NONCUTters writes an enzyme data file containing those enzymes that did not cut the input sequence. You can then use this enzyme data file as input to other mapping programs.
-EXCUTters writes an enzyme data file containing those enzymes that did cut the input sequence but were not displayed because they failed to meet the criteria specified with the -MINCuts, -MAXCuts, or -EXCLude command-line parameters. You can then use this enzyme data file as input to other mapping programs.
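The selection rules behind -MINSitelen and -OVErhang can be illustrated with a short Python sketch. It is not the program's implementation: the `Enzyme` record, the sample enzyme list, and the `select_enzymes` function are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enzyme:
    name: str
    site: str      # recognition sequence
    overhang: int  # 5 or 3 for a 5' or 3' overhang, 0 for a blunt cut


# A few well-known enzymes used purely as sample data.
ENZYMES = [
    Enzyme("EcoRI", "GAATTC", 5),
    Enzyme("PstI", "CTGCAG", 3),
    Enzyme("SmaI", "CCCGGG", 0),
    Enzyme("HaeIII", "GGCC", 0),
]


def select_enzymes(enzymes, min_site_len=0, overhangs=None):
    """Keep enzymes whose site is long enough and whose overhang type is allowed.

    overhangs=None accepts every overhang type.
    """
    return [
        e for e in enzymes
        if len(e.site) >= min_site_len
        and (overhangs is None or e.overhang in overhangs)
    ]
```

In this sketch, `select_enzymes(ENZYMES, overhangs={5, 3})` plays the role of -OVErhang=5,3 and keeps only EcoRI and PstI, while `select_enzymes(ENZYMES, min_site_len=6)` mirrors -MINSitelen=6 and drops HaeIII.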
Enhancement: The maximum range of interest for the input template sequence is now 350,000 bp; previously the maximum length was 32,000.
Comparison
Enhancement: When comparing two sequences using the parameter -WORdsize on the command line, the maximum range of interest for the vertical sequence is now 350,000 sequence characters. Previously the vertical sequence in a word comparison was limited to a maximum range of 32,000.
Enhancement: Previously the default stringency in window/stringency comparisons was hardwired into the program and may not have been appropriate if you chose an alternate scoring matrix. Now the default stringency is calculated from the symbol comparison values in the scoring matrix. As always, you can override the default stringency in response to the program prompt or with the -STRIngency command-line parameter. The stringency is now an integer value; previously it was a floating point number.
-PENAlizedlength allows you to specify the maximum penalized length for any gap in an alignment. For instance, if you specify -PENAlizedlength=20, then all gaps longer than 20 characters will be penalized the same as a gap of length 20. This parameter may be useful, for instance, when you are aligning a cDNA with the corresponding genomic DNA containing large introns.
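The effect of capping the penalized length can be shown with a small Python sketch. It assumes the common affine form (gap creation penalty plus gap extension penalty times gap length), which these notes do not spell out; the function name and the penalty values are illustrative only.

```python
def gap_penalty(gap_length, gap_weight, length_weight, penalized_length=None):
    """Affine gap penalty with an optional cap on the penalized length.

    Assumed form: gap_weight + length_weight * length, where the length
    is limited to penalized_length when a cap is given.
    """
    if gap_length <= 0:
        return 0
    if penalized_length is None:
        effective_length = gap_length
    else:
        effective_length = min(gap_length, penalized_length)
    return gap_weight + length_weight * effective_length
```

With a cap of 20, a 5,000-character gap (for example, a large intron) costs the same as a 20-character gap, so long introns no longer dominate the alignment score.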
Enhancement: You can create longer alignments than in the previous version. Instead of restricting the amount of computer memory used in the alignment to a fixed size, the programs now allow you to use all available computer memory for longer alignments. As in the previous version, input sequences may not be more than 30,000 sequence characters long.
-BATch allows you to submit the program to the batch queue for processing after the program prompts you for all the required information. (This program previously supported the -BATch parameter, but this support was undocumented.)
Enhancement: The maximum length for any of the input query sequences is now 350,000 bp; previously the maximum length was 32,000.
Enhancement: The maximum length for any input sequence is now 350,000 bp; previously the maximum length was 32,000.
Database Searching
Enhancement: You are no longer prompted for the number of matches you want reported in the output list file. Instead you are prompted for the maximum expectation value. Matches appear in the output list only if their z-scores are expected less frequently by chance than this value. If you explicitly set an output list size on the command line with the -LIStsize command-line parameter, then you are not prompted for the maximum expectation value.
Enhancement: The final alignments produced by FastA protein searches now allow unlimited gaps. Previously alignments were restricted to a band of 32 residues. To allow unlimited gaps in alignments produced by FastA nucleotide searches and TFastA, use the new -SWalign command-line parameter.
Enhancement: In addition to the new -SWalign command-line parameter mentioned above, these programs now support the following optional parameters:
-MINLength specifies the minimum length of a sequence to be searched in the search set.
-MAXLength specifies the maximum length of a sequence to be searched in the search set.
Enhancement: You can use the output from FastA and TFastA as input to all Wisconsin Package programs that accept list files. Previously you had to specify -NOALIGN on the FastA or TFastA command line to produce a list file other programs could use as input.
Enhancement: The list file created by FastA and TFastA includes the Begin:, End:, and Strand: attributes for each sequence in the list. These attributes indicate the region of each sequence in the search set that was aligned with the query sequence.
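A script that consumes these attributes might extract them as sketched below. The layout of the entry line (a sequence name followed by `Begin:`, `End:`, and `Strand:` pairs on one non-empty line) is an assumption for illustration, and the sequence name in the usage note is hypothetical.

```python
import re

# Matches the three attributes named above, each followed by its value.
ATTRIBUTE = re.compile(r"(Begin|End|Strand):\s*(\S+)")


def parse_list_entry(line):
    """Split a list-file entry into the sequence name and its attributes.

    Assumes the name is the first whitespace-separated field on the line.
    """
    name = line.split()[0]
    attributes = {}
    for key, value in ATTRIBUTE.findall(line):
        # Begin and End are positions; Strand is kept as '+' or '-'.
        attributes[key] = value if key == "Strand" else int(value)
    return name, attributes
```

For example, `parse_list_entry("gb:example1 Begin: 101 End: 350 Strand: -")` returns the name together with a dictionary holding the aligned region and strand.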
Enhancement: The FastA and TFastA output files now include a list of those databases that were searched.
Enhancement: You can save the alignment output in a format that other programs and scripts can easily parse if you use the -MARKx=10 command-line parameter. Programmers and script writers may find this feature useful, but most users of FastA and TFastA can ignore it.
Change: These programs no longer support the -NOINCrease and -SCAle optional parameters.
Change: The default scoring matrix for protein searches in FastA and TFastA is BLOSUM50; previously the default was the PAM250 scoring matrix.
The default scoring matrix for nucleotide sequence searches in FastA has changed slightly. Matches now have a value of +5 and mismatches have a value of -4; previously matches had a value of +4 and mismatches had a value of -3.
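To see what the new nucleotide values mean in practice, the following Python sketch scores an ungapped comparison of two equal-length sequences under both sets of values. The function is illustrative only and does not reproduce how FastA itself scores alignments.

```python
def ungapped_score(seq_a, seq_b, match, mismatch):
    """Sum match and mismatch values over two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Seven matches and one mismatch.
new_score = ungapped_score("ACGTACGT", "ACGTTCGT", match=5, mismatch=-4)
old_score = ungapped_score("ACGTACGT", "ACGTTCGT", match=4, mismatch=-3)
```

Here the new values give 7 x 5 - 4 = 31, and the previous values give 7 x 4 - 3 = 25.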
Change: By default, FastA and TFastA now determine a rigorous local alignment score for those matches with an initn score above a given threshold. The programs then use these scores as the basis for retaining the best matches. Previously you had to add -OPTall to the program command line to determine the list of best scores in this manner.
Enhancement: The FrameSearch output file now includes a list of those databases that were searched.
Enhancement: If you specify multiple query sequences as input, and you request that the score distribution histogram for each search be written to a figure file, the program now writes a separate figure file for each query sequence. Previously all score distribution histograms were written to a single figure file.
Change: The program plots a score distribution histogram for each search by default. Previously, you had to specify -PLOt, -FIGure, or -PSINClude on the command line to plot the histogram.
ToBLAST
Multiple Sequence Analysis
-INSitu allows you to realign a portion of an existing alignment without changing the remainder of the alignment. You specify the portion to realign with the -BEGin and -END command-line parameters.
Enhancement: PileUp can now take into account the Strand: sequence attribute (+ or -) to align each sequence in a list file. As always, you can restrict the range for each sequence in a list file using the Begin: and End: sequence attributes.
Change: When you create a non-end-weighted alignment (the default), the gaps at the ends of each sequence are written as tildes (~). Tildes represent differences in input sequence lengths rather than missing characters. When you create an end-weighted alignment in PileUp by adding -ENDWeight to the command line, gaps at the ends of each sequence are written as periods (.) since those gaps are significant and may represent missing characters in the sequence. For more information see "New Gap Character" in the Package-Wide Enhancements section of these release notes.
Change: If you use the -DIFferences command-line parameter, a sequence character is shown in the alignment when its comparison with the consensus symbol has a value less than the threshold specified with -THReshold. Previously, the program showed a sequence character in the output alignment only when its comparison with the coalition-defining symbol had a value less than this threshold. Since the coalition-defining symbol was not necessarily the same as consensus symbol, the displayed symbols were not easily related to the consensus symbol.
Change: Previously if all of the sequences in a multiple sequence alignment were not of equal length, Pretty padded the shorter sequences at the end with period (.) gap characters to the length of the longest sequence. Now, the program pads the shorter sequences at the end with tilde (~) gap characters to signify that the gaps do not represent missing characters but rather differences in input sequence lengths. For more information see "New Gap Character" in the Package-Wide Enhancements section of these release notes.
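The padding behavior described above amounts to the following Python sketch; the function name is hypothetical and the sketch works on plain strings rather than on alignment files.

```python
def pad_to_longest(sequences, pad_char="~"):
    """Pad shorter sequences at the end to the length of the longest one."""
    width = max((len(s) for s in sequences), default=0)
    return [s.ljust(width, pad_char) for s in sequences]
```

Calling `pad_to_longest(["ACGT", "AC", "ACGTAA"])` pads with tildes as in Version 9, while passing `pad_char="."` reproduces the earlier period padding.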
Enhancement: Previously the default threshold for consensus calculation was hardwired into the program and may not have been appropriate if you chose an alternate scoring matrix. Now the default threshold is calculated from the symbol comparison values in the scoring matrix. As always, you can override the default threshold with the -THReshold command-line parameter. The threshold is now an integer value; previously it was a floating point number.
Enhancement: This program now accepts sequences of unequal length as input. Each input sequence is treated as though it was padded at the end with gap characters to the length of the longest input sequence.
Enhancement: This program now supports the following optional parameters:
-OUTfile writes an output file with the average similarity value at each position in the alignment.
-NOPLOt suppresses the plot of the average similarity value at each position in the alignment.
-CMASK writes a grayscale colormask file according to the average similarity value at each position in the alignment. This can be used to shade each column of the alignment in the Editor mode of SeqLab, where darker regions represent regions of high conservation and lighter regions represent regions of low conservation.
Change: When you specify multiple sequences as input in response to the program prompt, you are no longer prompted for the sequence range. You can still modify the sequence range with the -BEGin and -END command-line parameters.
Enhancement: This program now accepts up to 5,000 sequences as input. The previous limit was 100 sequences.
-MSF writes an MSF (multiple sequence format) file with all of the input sequences aligned to each other and to the profile consensus sequence.
Enhancement: This program now supports the following optional parameter:
-MSF writes an MSF (multiple sequence format) file with all of the input sequences aligned to each other and to the profile consensus sequence.
Evolutionary Analysis
Enhancement: This program now accepts sequences of unequal length as input. For each pairwise comparison between the sequences, the shorter sequence is treated as though it was padded at the end with gap characters to the length of the longer sequence.
NewDiverge
Enhancement: This program can now accept multiple sequence input, such as list files, MSF or RSF files, or specifications using the * wildcard character. If multiple sequences are specified in a list file, you can specify the range and strand for each sequence with the Begin:, End:, and Strand: sequence attributes.
Enhancement: This program now supports the following optional parameters:
-TOFiles writes two additional output files when you specify at least three sequences as input to the program. One additional output file has a .ks file extension and contains a matrix of the estimated number of synonymous substitutions between each pair of input sequences. The other additional output file has a .ka file extension and contains a matrix of the estimated number of nonsynonymous substitutions between each pair of input sequences. You can use either of these additional matrix files as input to the GrowTree program.
Pattern Recognition
Repeat
Enhancement: Previously, the match display threshold was hardwired into the program and may not always have been appropriate if you chose an alternate scoring matrix. Now, the default match display threshold is calculated from the symbol comparison values in the scoring matrix. As always, you can override the default match display threshold with the -PAIr command-line parameter.
RNA Secondary Structure
Enhancement: Previously, the default minimum number of bonds per stem was not calculated from the values in the scoring matrix and may not have been appropriate if you chose an alternate scoring matrix. Now, the default minimum number of bonds per stem is calculated from the symbol comparison values in the scoring matrix. As always, you can override the default value with the -BONds command-line parameter or by typing a different value in response to the program prompt.
Enhancement: Previously, the match display threshold was hardwired into the program and may not have been appropriate if you chose an alternate scoring matrix. Now, the default match display threshold is calculated from the symbol comparison values in the scoring matrix. As always, you can override the default match display threshold with the -PAIr command-line parameter.
Sequence Exchange
Enhancements: Reformat now accepts input from stdin if you specify -INfile=- on the command line. If the stdin input does not contain a heading that is separated from the sequence by a line containing two dots (..), then add -NOHEAding to the Reformat command line.
Enhancement: This program now supports the following optional parameters:
-RSF allows you to reformat one or more sequences into a new RSF file. For more information see "New File Format: RSF Files" in the Package-Wide Enhancements section of these release notes.
Several new optional parameters are concerned with reformatting scoring matrices:
-OLDCMPformat converts a pre-Version 9 triangular scoring matrix (containing floating point values) to the rectangular scoring matrix format (containing integer values) that you can use in Version 9 of the Wisconsin Package. By default, when you use this parameter, each floating point value in the input matrix is first multiplied by 10 and then rounded to the nearest integer in the output matrix.
-SCAle, when used with either -COMParison or -OLDCMPformat, allows you to scale each value in the scoring matrix by a constant value. For instance -SCAle=5 creates an output scoring matrix in which each comparison value is fivefold greater than in the input matrix.
-EQUALSformat, when used with either -COMParison or -OLDCMPformat, converts a scoring matrix to a format that is less compact but that some find easier to read. Any program that reads scoring matrices can read this equals format file.
-GAPweight, when used with either -COMParison or -OLDCMPformat, allows you to specify the default gap creation penalty that will be associated with the reformatted scoring matrix. For more information see "New Scoring Matrices" in the Package-Wide Enhancements section of these release notes.
-LENgthweight, when used with either -COMParison or -OLDCMPformat, allows you to specify the default gap extension penalty that will be associated with the reformatted scoring matrix. For more information see "New Scoring Matrices" in the Package-Wide Enhancements section of these release notes.
-PROtein or -NUCleotide, when used with either -COMParison or -OLDCMPformat, allows you to specify the type of the reformatted scoring matrix. For more information see "File Typing in Version 9" in the Package-Wide Enhancements section of these release notes.
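The numeric side of the -OLDCMPformat and -SCAle conversions can be sketched in Python as follows. The sketch works on in-memory lists rather than on actual scoring matrix files, assumes the old matrix is stored as a lower triangle, and uses Python's `round`, which rounds exact ties to the nearest even integer (these notes do not say how Reformat handles ties). How -SCAle combines with the default factor of 10 is not stated here, so the two functions are kept separate.

```python
def convert_old_matrix(triangle, factor=10):
    """Expand a lower-triangular float matrix into a square integer matrix.

    triangle[i] holds the i + 1 values for row i (columns 0 through i).
    Each value is multiplied by factor and rounded to an integer.
    """
    size = len(triangle)
    square = [[0] * size for _ in range(size)]
    for i, row in enumerate(triangle):
        for j, value in enumerate(row):
            scaled = round(value * factor)
            square[i][j] = scaled
            square[j][i] = scaled  # mirror across the diagonal
    return square


def scale_matrix(matrix, scale):
    """Multiply every comparison value by a constant, as -SCAle is described."""
    return [[value * scale for value in row] for row in matrix]
```

For instance, an old value of 1.5 becomes 15 after the default conversion, and `scale_matrix(matrix, 5)` makes each value fivefold greater.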
Protein Analysis
Enhancement: When a match is found to a profile derived from a motif defined in the PROSITE Dictionary of Protein Sites and Patterns, the corresponding PROSITE abstract is now written to the .scan output file along with the alignment between the query sequence and the profile. You can suppress writing the PROSITE abstract with the new -NOREFerence command-line parameter.
Enhancement: In addition to the new -NOREFerence parameter, this program now supports the following optional parameter:
-BATch allows you to submit the program to the batch queue for processing after the program prompts you for all the required information.
Manipulation
Change: This program calculates default gap creation and extension penalties from the symbol comparison values in the scoring matrix and writes them in an auxiliary data block in the output matrix file. (For more information see "New Scoring Matrices" in the Package-Wide Enhancements section of these release notes.) When you use the output matrix file with other programs, you can override the default values with the -GAPweight and -LENgthweight command-line parameters.
Display
-A4 moves all left margins to the left 9/72 inch and raises all top and bottom margins up by 24/72 inch. This command centers documents on A4 paper without changing their pagination or filling in any way.
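The offsets are given in units of 1/72 inch. The Python sketch below converts them to inches and compares them with the shift that would center a US Letter layout on A4 paper. That the original layout is US Letter is an assumption made for this example, as are the coordinate convention (origin at the bottom left) and the function names.

```python
UNITS_PER_INCH = 72
LETTER = (612, 792)  # US Letter, 8.5 x 11 inches, in 1/72-inch units
A4 = (595, 842)      # A4, 210 x 297 mm, rounded to whole 1/72-inch units


def to_inches(units):
    """Convert 1/72-inch units to inches."""
    return units / UNITS_PER_INCH


def a4_shift(x, y, left=9, up=24):
    """Apply the documented -A4 offsets to a point (origin at bottom left)."""
    return (x - left, y + up)


def centering_offsets(source=LETTER, target=A4):
    """Leftward and upward shifts that keep a source layout centered on target."""
    shift_left = (source[0] - target[0]) / 2
    shift_up = (target[1] - source[1]) / 2
    return (shift_left, shift_up)
```

Under these assumptions the exact centering shifts are 8.5 units left and 25 units up, close to the documented 9/72 and 24/72 inch.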
Package-Wide Enhancements
Program Name Changes
- ToBLAST is renamed GCGToBlast.
- NewDiverge is renamed Diverge. The program formerly named Diverge is no longer supported.
- WPI, the graphical user interface to the Wisconsin Package, is enhanced and renamed SeqLab. For more information see the SeqLab, the Improved Graphical User Interface section in these release notes.
Commands No Longer Supported
New Gap Character
In addition to the existing period (.) gap character, the Wisconsin Package now supports a new gap sequence character, the tilde (~). Programs in the Wisconsin Package run from the command line or from the Main List mode of SeqLab treat the two gap characters identically in input sequences. Programs in the Wisconsin Package run from the Editor mode of SeqLab remove any tilde gap characters from the right end of each input sequence before performing their analyses.
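The two behaviors described above can be sketched in Python. The helper names are hypothetical; the sketch only illustrates treating both gap characters alike and trimming tildes from the right end of a sequence.

```python
GAP_CHARACTERS = ".~"


def is_gap(character):
    """Treat the period and the tilde identically as gap characters."""
    return len(character) == 1 and character in GAP_CHARACTERS


def ungapped_length(sequence):
    """Count the sequence characters that are not gaps."""
    return sum(1 for character in sequence if not is_gap(character))


def strip_trailing_tildes(sequence):
    """Remove tilde gap characters from the right end only."""
    return sequence.rstrip("~")
```

Note that `strip_trailing_tildes` leaves periods and any leading or internal tildes in place, for example turning `ACG..T~~~` into `ACG..T`.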
In the future, programs run from either the command line or from SeqLab may differentiate the two gap characters in their analyses. The period gap character will increasingly be used as a space holder that may represent a missing character in a sequence. For example, the period gap character may represent a missed base call in a contig alignment in fragment assembly. The tilde gap character will increasingly be used as a simple place holder that never represents an actual character in a sequence. For example, two tildes may be used in a translated sequence to align each codon in a nucleotide sequence with its corresponding single-letter amino acid symbol. As another example, gaps at the ends of sequences in an alignment may be written as tildes when those gaps are due to differences in input sequence lengths rather than missing characters in the input sequences. See Appendix III in the Program Manual for a list of all supported GCG sequence characters.
The Plus Symbol (+) Is No Longer a Valid Sequence Character
FastA-Format User Sequences
Sequence analysis programs in the Wisconsin Package now accept input sequences from files in FastA format when you add -FASTA to the program command line. Alternatively, you can use the global switch % seqformat fasta to automatically set the programs to accept sequences from files in FastA format. Warning: If the FastA-format sequence file contains multiple sequences, only the first one is read by the analysis program.
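The warning above means that a multiple-sequence FastA file behaves as if only its first record existed. A Python sketch of reading just the first record is shown below; it illustrates the behavior and is not the package's own reader.

```python
def read_first_fasta(lines):
    """Return (header, sequence) for the first FastA record only."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record begins; ignore the rest
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FastA header line found")
    return header, "".join(chunks)
```

Given a file containing two records, the function returns the name and sequence of the first and never looks at the second, so split multi-sequence files before analysis if every sequence matters.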
File Typing in Version 9
Many of the output files created by Wisconsin Package programs in Version 9 will indicate the file type on the top line of the file. This line begins with two exclamation points (!!) and is followed by text specifying the type of data in the file and the version number of the file. For example, an individual sequence file created in Version 9 will display a line such as
!!AA_SEQUENCE 1.0
as the first line of the file. The file type must remain the first line of the file and you should not alter it in any way. Files created without file types before Version 9 will work in Version 9 of the Wisconsin Package. File formats new to Version 9, like RSF files, are required to have file types.
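A script that needs to recognize typed files could inspect the first line as in this Python sketch. It assumes the line consists of the two exclamation points, a type name, and a version number separated by whitespace; the function name is hypothetical.

```python
def parse_file_type(first_line):
    """Return (file_type, version) from a '!!TYPE version' line, or None.

    None is returned for lines that do not start with '!!' or that do not
    contain exactly a type name and a version number.
    """
    if not first_line.startswith("!!"):
        return None
    parts = first_line[2:].split()
    if len(parts) != 2:
        return None
    return parts[0], parts[1]
```

For the example line above, the function returns the type `AA_SEQUENCE` and the version `1.0`; an untyped file from an earlier version yields `None`.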
Many of the data files used by Wisconsin Package programs in Version 9 will similarly contain file types on the top line of the file. As with sequence file types, data file types must remain the first line of the file, and you should not alter them in any way. SeqLab may not recognize data files created without file types before Version 9 of the Wisconsin Package. The new scoring matrix file format in Version 9 is required to have a file type as the first line in the file to be recognized by any Wisconsin Package program.
New Scoring Matrices
BLOSUM Matrices
New Scoring Matrix Format
The format and content of scoring matrices is changed in Version 9 of the Wisconsin Package. To see an example of the new default scoring matrix format, copy a representative scoring matrix to your local directory by typing % fetch blosum45.cmp and then view the contents of the file with any text editor. In Version 9, the scoring matrix in the data file is rectangular; previously the scoring matrix was triangular. Also, the values in the scoring matrix are now integers; previously the values were floating point numbers. These changes make the format and content of scoring matrices provided by the Wisconsin Package more similar to scoring matrices provided by others.
You can convert an old-style scoring matrix to the format required for Version 9 with % reformat -OLDCMPformat. See the Reformat notes in the Program Enhancements section of these release notes for a listing of other new command-line parameters that affect the reformatting of scoring matrices.
The very top line of the scoring matrix file is the file type. (For more information see "File Typing in Version 9" in this section of these release notes.)
In Version 9, each scoring matrix can optionally specify its own default gap creation and extension penalties in an auxiliary data block. Just like the symbol comparison values in the scoring matrix, these penalties are now integers. To see the format of the auxiliary data block, look at the representative matrix you've already fetched. Any program that requires gap penalties will use the defaults found in the auxiliary data block. If optional default gap penalties are not specified for a scoring matrix, any program that requires gap penalties will calculate defaults from the symbol comparison values in the matrix. As always, you can override the default gap creation and extension penalties in response to the program prompts or on the program command line with the -GAPweight and -LENgthweight command-line parameters.
Changes to Alignment Display
BLAST-Format Scoring Matrices
Using the -MATRix command-line parameter, you can specify BLAST-format scoring matrices as alternate scoring matrices in Wisconsin Package programs. However you cannot specify default gap penalties in an auxiliary data block in a BLAST-format scoring matrix file. Any program that reads a BLAST-format scoring matrix will calculate its own default gap penalties from the symbol comparison values in the matrix. To convert a native BLAST-format scoring matrix to the standard format used by the Wisconsin Package in Version 9, use % reformat -COMParison.
Translation Tables
The alternate translation tables provided with the Wisconsin Package in the GenMoreData directory have been renamed and supplemented with additional tables. See Appendix VII in the Program Manual for a complete list of the tables provided.
New Graphics Formats
You can initialize your graphics configuration to create color EPS file output from Wisconsin Package graphics programs. At the system prompt, type % postscript CEPSF. When the computer prompts you for the name of the port to which your CEPSF (Color Encapsulated PostScript Format) device is connected, respond with the name of the file you want to contain the color EPS instructions.
In addition, you can initialize your graphics configuration to create GIF (Graphics Interchange Format(c)) file output from Wisconsin Package graphics programs. At the system prompt, type % gif. The computer prompts you to select either GIF87a or GIF89a (GIF89a is a newer GIF version with extensions that are not supported by all GIF viewers), the name of the GIF output file, and the graphics image width and height.
For Version 9.0, GIF is an optional graphics driver sold separately from the Wisconsin Package. The Graphics Interchange Format is the Copyright property of CompuServe Incorporated. GIF is the Service Mark property of CompuServe Incorporated. The GIF-LZW compression software is licensed under U.S. Patent 4,558,302 and foreign counterparts.
X Windows Graphics Output
the graphics window background on a monochrome display will switch from its default color (either white or black) to the opposite color.
Using Tilde (~) in Specifying Directory Paths
Specifying Graphics Output Filenames
% postscript laserwriter '$program$.ps'
The new tokens include:
token        generates filename with
---------    -----------------------------
$program$    the name of the program (e.g. mapplot)
$host$       the name of the computer
$user$       the name of the user running the program
$time$       the time of day in numeric format
% postscript laserwriter '$program$-$time$.ps'
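The token substitution can be pictured with the Python sketch below. It is not the package's implementation: the function name is hypothetical, and the HHMMSS layout used for $time$ is an assumption, since the notes say only that the time is in numeric format.

```python
import getpass
import socket
import time


def expand_tokens(template, program, host=None, user=None, now=None):
    """Replace the filename tokens in template with concrete values.

    host, user, and now default to the current machine, user, and local
    time; now may be a struct_time or an equivalent 9-item tuple.
    """
    values = {
        "$program$": program,
        "$host$": host if host is not None else socket.gethostname(),
        "$user$": user if user is not None else getpass.getuser(),
        # Assumed numeric format: hours, minutes, seconds.
        "$time$": time.strftime("%H%M%S", now if now is not None else time.localtime()),
    }
    for token, value in values.items():
        template = template.replace(token, value)
    return template
```

With this sketch, the template `$program$-$time$.ps` expanded for mapplot at 14:05:09 would give `mapplot-140509.ps`, so successive runs write to different files.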
- Creator/author of the sequence
- Sequence weight
- Creation date
- One-line description of the sequence
- Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
- Known sequence features
RSF files are useful within SeqLab, the graphical user interface to the Wisconsin Package. Because they store positional information, you can display RSF files within SeqLab's Editor to view and edit sequence alignments and features. The features annotation allows you to graphically view and align sequences based on features as well as run programs on sequence regions selected by features. You also will find RSF files useful for distributing sequences to colleagues, since these files contain each sequence's data and descriptive information. For more information on RSF files, see "Using Rich Sequence Format (RSF) Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide.
Documentation Enhancements
Program Manual Reorganization
- Comparison - Pairwise or Multiple
- Database Searching - Reference or Sequence
- Editing and Publication
- Evolution
- Fragment Assembly
- Gene Finding and Pattern Recognition
- Importing/Exporting
- Mapping
- Primer Selection
- Protein Analysis
- RNA Secondary Structure
- Translation
- Utilities
Some programs appear within multiple functional categories.
Online Help
Online help for the Wisconsin Package now includes the User's Guide as well as the Program Manual. In addition, the online help has been converted to HTML, and typing % genhelp or % genmanual now displays the text-only browser Lynx for navigating between topics and links. To use a different browser, such as Netscape, use the documentation URL in the banner that displays on your screen when you initialize the Package. As was available with the previous version of online help, you will still be able to navigate to a specific online help topic with a command like % genhelp map. For more information, see the "Improved Online Help" document accompanying this release.
Program Documentation Removed from the Program Manual
In addition, the DBIndex program has moved from the Program Manual to the Database Utilities chapter of the System Support Manual.
Data Files Manual Discontinued
You will find the following information in this section:
- Program Bug Fixes
- Package-Wide Bug Fixes
Program Bug Fixes
Problem: You were unable to enter the tilde (~) to indicate NOT syntax when you specified a pattern to find in the sequence you were editing.
Update: You can enter the tilde (~) to indicate NOT syntax when you specify a pattern to find in the sequence you are editing.
Problem: The reading frames for the forward strand protein translations were not changed if you specified a beginning position other than the first position in the sequence. For instance, frame a always began at the first position in the sequence rather than at the beginning of the selected range. Similarly, reading frames for the reverse strand protein translations were not changed if you specified an ending position other than the last position in the sequence. For instance, frame f always began at the last position in the sequence rather than at the end of the selected range.
Update: If you specify a beginning position other than the first position in the sequence, the forward strand reading frames are changed so that the a frame starts at the beginning of the selected range. If you specify an ending position other than the last position in the sequence, the reverse strand reading frames are changed so that the f frame starts at the end of the selected range.
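The corrected behavior can be pictured with a small Python sketch. Only the anchoring of frame a at the beginning of the range and frame f at the end of the range comes from the text above; the function name frame_starts and the placement of frames b, c, d, and e at one- and two-base offsets are assumptions made for illustration.

```python
def frame_starts(begin, end):
    """Return an assumed 1-based start position for each reading frame."""
    return {
        "a": begin,      # from the text: frame a starts at the range beginning
        "b": begin + 1,  # assumption: offset by one base
        "c": begin + 2,  # assumption: offset by two bases
        "d": end - 2,    # assumption: offset by two bases
        "e": end - 1,    # assumption: offset by one base
        "f": end,        # from the text: frame f starts at the range end
    }


# A range of 101..400 in a longer sequence.
print(frame_starts(101, 400))
```

Before the fix, frames a and f would have been anchored at position 1 and at the last position of the whole sequence, regardless of the selected range.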
Problem: Unless you added the undocumented -PROtein command-line parameter, the program recognized all input sequences as nucleic acids.
Update: The program determines the sequence type from the Type: field of the input sequence (see Appendix VI in the Program Manual) in the same way the sequence type is determined by other Wisconsin Package analysis programs.
Problem: If you looked for silent restriction sites by adding -SILent to the command line, and you specified a sequence range that did not extend to the end of the sequence, the program sometimes crashed if a silent restriction site was identified at the very end of the sequence range.
Update: The program identifies silent restriction sites at all appropriate sites in the sequence without crashing.
Problem: In rare cases, Prime calculated incorrect annealing scores between two primers. Prime also occasionally calculated incorrect annealing scores between a primer and possible false priming sites on the template sequence when you specified either -ALLANNEALTemplate or -ENDANNEALTemplate on the command line. Because of this, some primers were incorrectly rejected from consideration and others might not have been ranked at their appropriate positions in the output list.
Update: Prime correctly calculates annealing scores between two primers or between a primer and possible false priming sites on the template sequence.
Problem: If you specified an input file of primer sequences with the -PRImers command-line parameter, and you also specified either -ALLANNEALTemplate or -ENDANNEALTemplate on the command line, Prime sometimes calculated incorrect annealing scores between the primers and possible false priming sites on the template sequence. Because of this, some primers were incorrectly rejected from consideration and others may not have been ranked at their appropriate positions in the output list.
Update: Prime correctly calculates annealing scores when you specify an input file of primer sequences and you also specify either -ALLANNEALTemplate or -ENDANNEALTemplate on the command line.
Problem: If you chose an alternative output filename and you specified -BATch on the command line, the output filename you chose was ignored and the output file was given the default name.
Update: Prime writes its output to the file you name when you specify -BATch on the command line.
Problem: If you aligned sequences contained in an MSF file, the output listed the name of the MSF file and not the names of the sequences.
Update: If you align sequences in an MSF file, the output lists the names of the sequences.
Problem: If either input sequence was contained in an MSF file, the output alignment listed the name of the MSF file instead of the name of the sequence within the MSF file.
Update: The output alignment lists the name of the sequence within the MSF file.
Problem: If you specified a group of sequences within an MSF file as input, each overlap listed the name of the MSF file instead of the names of the overlapping sequences.
Update: Each overlap lists the names of the overlapping sequences within the MSF file.
Problem: If you specified a database to search on the command line with an expression like -INfile2=MyDir:nuc and a database of the same name, but in a different directory, was also found in one of the database menu files (blast.rdbs, blast.ldbs, or blast.sdbs), the database in the menu file was searched.
Update: The database you specify on the command line is searched.
Known Problem: With the increasing sizes of the databases, it is possible that you will not have enough memory to run a local BLAST search. We have already seen this happen on some machines when a nucleotide sequence is used to search GenEMBL. The program terminates before the search begins and the program output file contains an "out of memory" error message.
Possible Work-arounds: Increase the limits on your account by typing % unlimit before running BLAST. This may help if the query sequence is short. If this doesn't work, search each strand of a nucleotide query sequence separately using the commands % blast -TOPstrand and % blast -BOTtomstrand. If this doesn't work, speak to your system manager about either partitioning the large BLAST database into two or more smaller BLAST databases or increasing swap space.
Problem: If you specified a query on the command line without a parameter (e.g. % lookup Smithies), the program stopped after complaining that no valid libraries exist.
Update: If you specify a query on the command line without a parameter, the program behaves as if you had used the -ALLtext parameter with the query and proceeds normally.
Problem: If your output list file contained comments exactly eighty characters in length, you could have problems if you tried to use the file as input to other Wisconsin Package programs. If you tried to do so, all sequence entries following those with long comment lines were skipped. Also, the long comment lines may have been missing a character at the end of the line.
Update: If your output list file contains comments exactly eighty characters in length, you can use the file as input to other Wisconsin Package programs without problem. Also, the long comment lines are no longer missing any part of the comment.
Problem: If you searched for sequences in SWISS-PROT, the program occasionally crashed.
Update: The program finds the appropriate matches in SWISS-PROT to your query and completes normally.
Problem: If you searched for some authors associated with sequence entries in SWISS-PROT, no matching entries were found.
Update: You can search for any authors associated with sequence entries in SWISS-PROT and the appropriate matching entries are reported.
Problem: If you searched for feature names containing apostrophes (e.g. 5'UTR) in GenBank, the program displayed a message claiming a syntax error in the feature name, and matching entries were not found.
Update: You can search for feature names containing apostrophes in GenBank, and the program reports the appropriate matching entries.
Problem: If you searched PIR and selected fragment output by adding -FRAgments to the command line, the program crashed.
Update: The program stops normally after displaying a message reporting that fragment output from PIR is currently unavailable.
Problem: If you added -PAMfactor to the command line when searching with a nucleic acid query sequence, the command-line parameter was ignored. Instead of using a scoring matrix for the calculation of initial diagonal scores as you requested, the program used a constant factor for each match.
Update: When you add -PAMfactor to the command line in a nucleic acid search, the program uses a scoring matrix for the calculation of initial diagonal scores.
Problem: If either the query or matching search set sequence contained more than six letters, only the first six were displayed to the left of the sequences in the alignment output.
Update: There is now space for up to twelve letters in the sequence names to the left of the sequences in the alignment output.
Problem: If you specified a search set that contained no valid sequences, and you added -Default to the command line, the program automatically substituted SwissProt:* (for a nucleotide query) or EST:* (for a protein query) as the search set.
Update: If you specify a search set containing no valid sequences and you add -Default to the command line, the program displays an error message and stops.
Problem: The program occasionally miscalculated the "Percent similarity" reported for each alignment when you specified an identity threshold with the -PAIr command-line parameter.
Update: The program correctly calculates the "Percent similarity" reported for each alignment under all circumstances.
Problem: If you ran FrameSearch with -BATch on the command line, and you specified multiple query sequences as input, the program used the sequence length of the first query sequence as the ending position for all subsequent query sequences.
Update: If you specify multiple query sequences as input, the program uses the entire length of each sequence in the search (unless you specify -BEGin or -END on the command line).
Problem: If you allowed mismatches between the pattern and sequence by specifying -MISmatch on the command line, and a mismatch occurred in that portion of the pattern containing OR matching, the program sometimes missed finding all the appropriate matches.
Update: All appropriate matches are found when you allow mismatches and your pattern contains OR matching.
Problem: If you specified -PERFect on the command line in a search of nucleotide sequences, only the forward (top) strand of each nucleotide sequence was searched for matches to each pattern.
Update: Both strands of each nucleotide sequence are searched for perfect (non-ambiguous) matches to each pattern.
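Searching both strands for an exact pattern can be sketched in Python as follows. This is an illustration of the general technique, not the program's implementation: the function names and the (strand, position) output are hypothetical, and bottom-strand hits are found by searching the top strand for the reverse complement of the pattern.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(seq, pattern):
    """Return (strand, 1-based start) for each exact match on either strand.

    Positions are given in top-strand coordinates; overlapping matches
    are included.
    """
    hits = []
    for strand, target in (("+", pattern), ("-", reverse_complement(pattern))):
        for match in re.finditer("(?=" + re.escape(target) + ")", seq):
            hits.append((strand, match.start() + 1))
    return hits


print(find_both_strands("AAGCTTCCC", "GGG"))
```

In this sketch a palindromic pattern such as AAGCTT is reported once for each strand at the same position.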
Problem: If no matches were found, then no output file was created.
Update: If no matches are found, then an output file is written indicating this result.
Problem: If you specified patterns containing spaces on the command line using the -PATterns command-line parameter, no matches were reported. However, if you specified patterns containing spaces in response to the program prompt, the spaces were automatically removed before searching and the appropriate matches were reported.
Update: If you specify patterns containing spaces on the command line, the spaces are removed before searching and the appropriate matches are reported.
Problem: If you searched for patterns in sequences contained in an MSF file, the output listed the name of the MSF file and not the names of the sequences in which the patterns were found.
Update: The output lists the names of the sequences within the MSF file in which the patterns are found.
Problem: If you entered a character pattern consisting of two words separated by more than one space, and you added -BATch to the command line, the program removed all but one of the spaces separating the words before searching for the pattern.
Update: StringSearch no longer removes any spaces between words in a character pattern when you add -BATch to the command line.
Problem: If you created a list file of sequence names with LookUp, and then used that list as input to StringSearch for a definitions search, StringSearch was unable to find matches to any text patterns.
Update: You can use a LookUp output file as input to StringSearch for a definitions search, and appropriate matches to the text patterns you specify are found.
Problem: If you tried to create a BLAST-searchable database from nucleic acid sequences containing X sequence symbols, GCGToBLAST complained that X was an invalid nucleic acid code because BLAST does not recognize the X as a nucleic acid ambiguity code.
Update: GCGToBLAST converts each occurrence of X (or x) in nucleotide sequences into an N (or n) nucleic acid ambiguity symbol in the BLAST-searchable database.
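The substitution described above amounts to a case-preserving character mapping. A minimal Python sketch, with a hypothetical function name and no claim about how GCGToBLAST does it internally:

```python
X_TO_N = str.maketrans("Xx", "Nn")


def x_to_n(seq):
    """Replace X with N and x with n, leaving all other symbols unchanged."""
    return seq.translate(X_TO_N)


print(x_to_n("ACGXxTTA"))
```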
Problem: You could not increase the combined length of all gaps that could be added to each sequence in the alignment with the -MAXGap command-line parameter unless you also decreased the maximum segment length of each input sequence with the -MAXSeg command-line parameter.
Update: When you increase the combined length of all gaps that can be added to each sequence in the alignment with the -MAXGap command-line parameter, the maximum segment length of each input sequence is automatically reduced so that the sum of the maximum segment length and the maximum gap length is equal to 7,000.
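The relationship between the two limits is simple arithmetic, sketched here in Python. The total of 7,000 comes from the text; the function name and the range check on the gap value are assumptions for illustration.

```python
TOTAL_LIMIT = 7000  # from the text: segment length plus gap length


def max_segment_for(max_gap):
    """Return the maximum segment length implied by a chosen maximum gap."""
    # Assumption: the gap limit must leave room for a non-empty segment.
    if not 0 <= max_gap < TOTAL_LIMIT:
        raise ValueError("max_gap must be between 0 and 6999")
    return TOTAL_LIMIT - max_gap


print(max_segment_for(2000))
```

For example, raising the gap limit to 2,000 leaves a maximum segment length of 5,000.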
Problem: On the command line, you were unable to use a normal MSF sequence specification like myseqs.msf{*}. Instead, you had to use a LineUp-specific syntax like % lineup myseqs.msf.
Update: LineUp accepts any single or multiple sequence specification on the command line using the normal syntax.
Problem: If your local directory contained a set.keys file specifying keyboard key redefinitions for editing nucleotide sequences and the first sequence you entered into LineUp was a nucleotide sequence, the key redefinitions were ignored.
Update: If the first sequence entered into LineUp is a nucleotide sequence, the keyboard keys are redefined according to the specifications in the Set.Keys file in your local directory.
Problem: When you used the ZIp command to align a protein sequence to an existing protein consensus sequence, the program could propose meaningless alignments involving the reverse complement (-) strand of the protein sequence.
Update: The program does not propose alignments that involve the reverse complement strand of a protein sequence.
Problem: You were unable to enter the tilde (~) to indicate NOT syntax when you specified a pattern to find in the sequence you were editing.
Update: You can enter the tilde (~) to indicate NOT syntax.
Problem: If you added -PROFile to the command line and specified non-profile input, the program displayed an error message and continued prompting for additional parameters. If you responded to the additional program prompts, the program crashed.
Update: If you add -PROFile to the command line and specify non-profile input, the program displays an error message and stops.
Problem: If you tried to plot the running average similarity among the sequences in an alignment longer than 11,500 symbols, the program rejected any density for the plot you specified in response to the program prompt and continued to prompt you for a new density. You had to use <Ctrl>C to exit the program.
Update: You can plot the running average similarity among the sequences in an alignment of any length. However, since the entire plot must fit on a single page, the plot becomes more difficult to read as the length of the alignment increases.
Problem: If you entered more than the maximum limit of 100 sequences as input, the program displayed an error message for each additional sequence it tried to read.
Update: You can now enter up to 5,000 sequences as input. If this limit is exceeded, the program stops immediately.
Problem: If you specified a search set that contained no valid sequences, and you added -Default to the command line, the program automatically substituted SwissProt:* (for a protein profile) or EMBL:* (for a nucleotide profile) as the search set.
Update: If you specify a search set containing no valid sequences and you add -Default to the command line, the program displays an error message and stops.
Problem: If you specified a search set containing sequences whose lengths were all very similar, the program crashed while trying to normalize the scores.
Update: The program no longer tries to normalize the scores when the lengths of all the sequences are very similar.
Problem: If you chose to reconstruct a tree from a distance matrix using the UPGMA method, the branch lengths in the output tree were incorrect.
Update: The branch lengths in the output tree are correct.
Problem: If you specified a range of the input sequence to analyze, and you chose to reverse that specified segment either by adding -REVerse to the command line or by responding to the program prompt, the entire sequence was first reversed and then the specified segment was selected from the reverse sequence strand.
Update: If you specify a range of the input sequence to analyze, and you choose to reverse the specified segment, the specified segment is first chosen from the forward sequence strand and then this segment is reversed. This is consistent with the behavior of other Wisconsin Package programs that offer you the option to analyze the reverse strand of the input sequence.
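The difference in the order of operations is easy to see on a toy sequence. The Python sketch below uses plain string reversal and 1-based inclusive positions; the function names are hypothetical, and any complementing that a program may apply to the reverse strand is left out because it does not affect which segment is chosen.

```python
def reverse_then_select(seq, begin, end):
    """Old behavior: reverse the whole sequence, then take begin..end."""
    return seq[::-1][begin - 1:end]


def select_then_reverse(seq, begin, end):
    """New behavior: take begin..end from the forward strand, then reverse."""
    return seq[begin - 1:end][::-1]


seq = "ABCDEFGHIJ"
print(reverse_then_select(seq, 2, 4))  # segment taken from the reversed strand
print(select_then_reverse(seq, 2, 4))  # segment BCD, reversed
```

With the new order, the range you specify always refers to positions on the forward strand.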
Problem: If a repeat was longer than 55 bases or residues, a length of 55 was reported in the output file to the right of the repeat alignment.
Update: If a repeat is longer than 55 bases or residues, the correct length is reported in the output file to the right of the repeat alignment. However, only the first 55 bases of the repeat are actually displayed in the alignment.
Problem: If you tabulated the codon usage of a single sequence specified in a list file, any attributes associated with that sequence were ignored unless you added -Default to the command line.
Update: If you tabulate the codon usage of a single sequence specified in a list file, any begin, end, and strand attributes associated with that sequence are used as default input values by the program.
Problem: If you created a figure file of the 1-dimensional panel graph plot by specifying -FIGure on the command line, and you also specified a font for all text characters in the plot using -FONT, the program crashed.
Update: You can specify both -FIGure and -FONT on the PlotStructure command line and create a figure file of a 1-dimensional panel graph plot without problem.
Problem: If you ran Translate noninteractively by specifying multiple sequences as input on the command line or by adding -Default to the command line, the program occasionally gave the output file a nucleotide sequence type. This occurred when the translated sequence contained only amino acid symbols that could be recognized as IUPAC-IUB nucleotide ambiguity symbols.
Update: Translate always writes an output file with a protein sequence type.
Problem: If you translated a single sequence specified in a list file, any attributes associated with that sequence were ignored unless you added -Default to the command line.
Update: If you translate a single sequence specified in a list file, any begin, end, strand, and join attributes associated with that sequence are used as default input values by the program.
Problem: If your protein input sequence contained gap characters and you selected one of the table of back-translations menu choices (option a or b), then the table of back translations contained three periods in a row (...). Even though the output file also contained a GCG sequence appended after the table, the file was not recognized as a GCG sequence file by analysis programs.
Update: If your protein input sequence contains gap characters and you select one of the table of back-translations menu choices, the gap characters are back-translated to three tildes in a row (~~~). GCG analysis programs recognize the output file, also containing a GCG sequence appended after the table, as a GCG sequence file.
Problem: If you interactively assembled a user sequence and then chose to G)et segments from another sequence, but specified the same user sequence again, the program didn't recognize the sequence the second time.
Update: You can repeatedly assemble fragments from a single sequence by specifying the same sequence name each time in response to the program prompt.
Problem: If you assembled a single sequence specified in a list file, any attributes associated with that sequence were ignored unless you added -Default to the command line.
Update: If you assemble a single sequence specified in a list file, any begin, end, strand, and join attributes associated with that sequence are used as default input values by the program.
Problem: If you entered a multiple sequence specification as input (for example, a sequence specification with an asterisk (*) wildcard) and that multiple sequence specification referenced only a single sequence, the program didn't read the sequence.
Update: You can enter any valid single or multiple sequence specification as input, even if the multiple sequence specification references a single sequence. This is consistent with the behavior of other Wisconsin Package programs that accept either single or multiple sequences as input.
Problem: If you selected one of the translation menu choices with numbering (option F or G), the translations may have been numbered incorrectly in several instances. For example, if the translation began in the middle of a row of the nucleotide sequence, it was numbered as if it began at the beginning of the row. If you chose three-letter translations (option F), and an amino acid began at the end of one row and stopped at the beginning of the next row, the numbering was incorrect. If you selected more than one discontinuous translation range (e.g. translations of exons separated by introns), they were numbered as if the entire sequence from the beginning of the first range to the end of the last range had been translated.
Update: The translation numbering is now correct in Publish.
Problem: If your attempt to reformat a sequence did not succeed, the input file was deleted.
Update: If your attempt to reformat a sequence does not succeed, the input file is not deleted.
Problem: If you reformatted a scoring matrix found in a directory other than your local directory by specifying a directory path along with the filename, the output scoring matrix was written to that same directory by default.
Update: If you reformat a scoring matrix found in a directory other than your local directory, the output scoring matrix is written to your local directory by default.
Problem: Staden-format nucleotide sequences containing lowercase IUPAC-IUB sequence characters were converted to periods (.) in the GCG-format output sequence files.
Update: Staden-format nucleotide sequences containing lowercase IUPAC-IUB sequence characters are unchanged in the GCG-format output sequence files. Appendix III of the Program Manual contains an updated list of the mappings between Staden and GCG sequence characters.
Problem: In a FastA-format input file, if the documentation following the sequence name contained two adjacent periods (..), the output sequence file had two lines containing two adjacent periods. This file was not recognized as a GCG sequence file by analysis programs.
Update: If the documentation following the sequence name contains two adjacent periods (..) in a FastA-format input file, the program inserts a blank space between the periods (. .) in the output file. GCG analysis programs will then recognize the output sequence file as a GCG sequence file.
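A minimal Python sketch of the period-separating step follows. The function name is hypothetical, and the handling of runs longer than two periods (a space inserted between every adjacent pair) is an assumption; the text describes only the two-period case.

```python
import re


def separate_adjacent_periods(doc):
    """Insert a space after any period that is directly followed by a period."""
    return re.sub(r"\.(?=\.)", ". ", doc)


print(separate_adjacent_periods("human gene..partial cds"))
```

A lookahead is used rather than a plain replacement of ".." so that no two adjacent periods survive in longer runs.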
Problem: If the documentation following the sequence name for a FastA-format sequence was longer than 511 characters, the program crashed.
Update: If the documentation following the sequence name for a FastA-format sequence is longer than 511 characters, it is written as several shorter documentation lines in the GCG-format output sequence file and the program completes normally.
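Splitting an overlong documentation string into shorter lines can be sketched with Python's standard textwrap module. The function name and the 80-character width are assumptions; the text does not state the line length the program uses.

```python
import textwrap


def split_documentation(doc, width=80):
    """Break a long documentation string into lines of at most `width` characters."""
    # Assumption: lines are broken at whitespace, and width=80 is illustrative.
    return textwrap.wrap(doc, width=width)


long_doc = "word " * 200  # about 1,000 characters, well over 511
lines = split_documentation(long_doc)
print(len(lines), max(len(line) for line in lines))
```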
Package-Wide Bug Fixes
Problem: In any Wisconsin Package plotting program, if you used the -COPies command-line parameter to specify more than one copy for a plot sent to a PostScript device, only a single copy was plotted.
Update: All of the plot copies you specify with the -COPies command-line parameter are actually plotted.
Problem: In any program that recognizes the begin: and end: sequence attributes in a list file, if you specified -BEGin or -END on the command line without any value, the program ignored that command-line parameter.
Update: In any program that recognizes the begin: sequence attribute in a list file, if you specify -BEGin without any value, the program uses a beginning position of 1 for each sequence; beginning positions specified for individual sequences in the list file are ignored. In any program that recognizes the end: sequence attribute in a list file, if you specify -END without any value, the program uses the end of each sequence as the ending position; ending positions specified for individual sequences in the list file are ignored.
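The precedence rule described above can be sketched in Python. The function and parameter names are hypothetical; begin_flag and end_flag stand for -BEGin and -END given on the command line without a value, and list_begin and list_end stand for the begin: and end: attributes of one sequence in the list file.

```python
def resolve_range(seq_length, list_begin=None, list_end=None,
                  begin_flag=False, end_flag=False):
    """Return the (begin, end) positions used for one sequence."""
    # A valueless -BEGin overrides the list file and starts at position 1.
    begin = 1 if begin_flag or list_begin is None else list_begin
    # A valueless -END overrides the list file and stops at the sequence end.
    end = seq_length if end_flag or list_end is None else list_end
    return begin, end


# A 100-symbol sequence listed with begin:10 end:50, run with a bare -BEGin.
print(resolve_range(100, list_begin=10, list_end=50, begin_flag=True))
```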
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc. a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.