Overview
Types of Sequence Files
Using Database Sequences
Submitting Sequences to the Databases
Specifying Database Sequences by Name
Specifying Database Sequences by Accession Number
Using Single Sequence Files
Creating and Editing Single Sequences
Specifying Single Sequence Files
Specifying Sequence Type (Nucleotide or Protein)
Using List Files
Creating and Editing List Files by Hand
Programs that Create List Files
Specifying List Files
Using Rich Sequence Format (RSF) Files
Programs that Create RSF Files
Editing RSF Files
Specifying RSF Files
Using Multiple Sequence Format (MSF) Files
Programs that Create MSF Files
Editing MSF Files
Specifying MSF Sequences
Finding and Copying Database Sequence Files
Finding Database Sequences
Copying Sequences from the Databases
Viewing Sequences
Viewing Database Sequences
Viewing Sequences in Your Directory
Reformatting Sequence Files to GCG Format
Reformatting Sequence Files
For Advanced Users
Using Personal Databases
Creating Personal Databases
Specifying Personal Databases
Refining a Sequence List
This chapter teaches you about the heart of the Wisconsin Package: using sequences. It provides information that you must know to work with the sequence databases (GenBank, EMBL (abridged), SWISS-PROT, and PIR-Protein) and to use your own sequences with Wisconsin Package programs for specific analysis.
You'll learn how to
The Wisconsin Package works with many different types of sequence files:
The Wisconsin Package provides you access to nucleotide and protein database sequences. These sequences are included in the following databases:
Note: To find out more about the databases, read the release notes that accompany each database release. If your site receives the GCG Database Update Service, these release notes are located in the directory with the logical name genmoredata. For each database you will find a file of release notes with the name of the database and the extension ".release". For example, if you want to find out more about the GenBank database, type
% to genmoredata
% more genbank.release
To refer to sequences in these databases, use the logical names listed in the Nucleic Acid Databases and Protein Databases tables in this section. You will notice that in some cases there is more than one logical name to refer to a database; use whichever you are most comfortable with. For example, to refer to sequences in GenBank, you could use the logical name GenBank or GB.
Note that you can refer to the sequences in GenBank and EMBL, excluding EST, STS, and GSS sequences, collectively with the logical names GenEMBL or GE. To search all sequences, including EST, STS, and GSS sequences, use the logical names GenEMBLPlus or GEP. If you know the specific GenEMBL division you want to search, for example Bacterial, you can search that division alone by using the logical names Bacterial or Ba.
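The relationship between long and short logical names can be pictured as a small lookup table. The following Python sketch is illustrative only: it lists just the logical names quoted in this section, and LOGICAL_NAMES and canonical_database are hypothetical helper names, not part of the Wisconsin Package. Your site's tables may differ.

```python
# Hypothetical lookup table; only logical names quoted in this chapter appear.
LOGICAL_NAMES = {
    "genbank": "GenBank", "gb": "GenBank",
    "genembl": "GenEMBL", "ge": "GenEMBL",
    "genemblplus": "GenEMBLPlus", "gep": "GenEMBLPlus",
    "bacterial": "Bacterial", "ba": "Bacterial",
}


def canonical_database(name):
    """Return the long logical name for a long or short form, ignoring case."""
    try:
        return LOGICAL_NAMES[name.lower()]
    except KeyError:
        raise ValueError(f"unknown logical name: {name!r}") from None


print(canonical_database("GE"))   # GenEMBL
print(canonical_database("gep"))  # GenEMBLPlus
```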
Each sequence in the databases contains not only the sequence data but also taxonomic information about the organism and the bibliographic citation. Below is an example of the sequence Dro5S from the Invertebrate section of GenBank.
Note: Because databases are site-dependent, the above list may not include all the databases available to you, or your site may name the databases differently. In addition, because the divisions of GenBank and EMBL are subject to change, this table may not be complete.
You can submit sequences to be included in a future release of GenBank, EMBL, SWISS-PROT, and PIR. See Appendix II of the Program Manual for the data submission form for GenBank and EMBL. For SWISS-PROT, submit the form to EMBL. For PIR, submit the form to GenBank. All of these database submittal services require that you submit sequences in ASCII format.
Choose one of the following.
Note: WWW addresses sometimes change. If the above URL does not work, look at NCBI's and EBI's home pages for more information, or contact GCG's technical support staff.
You can specify database sequence entries by name. Note, however, that a sequence name is subject to change from release to release of the database. For instance, let's say an existing database sequence is merged with another sequence; the complete, merged sequence may acquire the name of the second sequence while the first sequence name is omitted. A more stable way of tracking a sequence from release to release is by its accession number, as is described in "Specifying Database Sequences by Accession Number" in this section.
Choose one of the following.
Note: Database names are case-insensitive. That is, you can type them in uppercase, lowercase, or mixed case.
There are also a number of logical names that refer to the individual divisions of GenEMBL, GenBank, EMBL, and PIR-Protein. For example, GB_In refers only to those sequences in the Invertebrate division of the GenBank database, such as GB_In:Dro5S. To refer to this same division in GenEMBL (GenBank plus EMBL), you would type Invertebrate or In, for instance In:Dro5S.
For more information on the database logical names, see the Nucleic Acid Databases and Protein Databases tables earlier in this chapter.
The sequence names of entries in the databases sometimes change from release to release, and the same entry may have a different name in GenBank and EMBL. Because of this, publications refer to sequences by accession number. Using accession numbers offers three advantages over sequence names:
Specifying a database sequence by accession number is much like specifying one by name. Database names and accession numbers are case-insensitive. That is, you can type them in uppercase, lowercase, or mixed case.
Type the name of the database (for example, GE, which is the GenEMBL database), a colon (:), and the accession number (for example, U00069)--GE:U00069. For more information on the database logical names, see the Nucleic Acid Databases and Protein Databases tables earlier in this chapter.
Note: You cannot use wildcards to specify sequences by accession number.
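The rules above (database name, colon, accession number, no wildcards, case ignored) can be sketched in a few lines of Python. This is a minimal illustration, not Wisconsin Package code; accession_spec and same_spec are hypothetical helper names, and only the asterisk is treated as a wildcard here.

```python
def accession_spec(database, accession):
    """Build a database:accession specification such as GE:U00069."""
    if "*" in accession:
        # The text states that wildcards cannot be used with accession numbers.
        raise ValueError("wildcards are not allowed in accession numbers")
    return f"{database}:{accession}"


def same_spec(a, b):
    """Compare two specifications the way the text describes: ignoring case."""
    return a.lower() == b.lower()


print(accession_spec("GE", "U00069"))       # GE:U00069
print(same_spec("ge:u00069", "GE:U00069"))  # True
```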
If you don't know which database contains the accession number, type % typedata -REFerence accession_number, for example % typedata -REFerence U00069. The program finds the sequence file in the appropriate database and displays its reference information (that is, everything but the sequence itself) on your screen. The first line of this reference information tells you the database in which the sequence resides. For example, in the illustration below, the sequence U00069 is in the Bacterial (BCT) database.
If you also want to see the sequence information, use % typedata without the -REFerence parameter. Or, if you want to copy the sequence to your directory, use the Fetch program.
When a sequence is first entered into EMBL, GenBank, PIR, or SWISS-PROT, it is assigned a unique primary accession number. If that sequence is ever merged with another sequence, the accession number of the original sequence becomes a secondary accession number in the merged sequence.
The Wisconsin Package programs treat primary and secondary accession numbers the same, as long as the accession number you use is unique. Therefore, you can access unique secondary accession numbers as well as primary accession numbers. However, if you use an accession number that occurs more than once in a database, or if you try to use an accession number that does not exist, Wisconsin Package programs will display a message saying they cannot read your sequence. If this is the case, use the LookUp program to determine the accession number's corresponding sequence name and/or primary accession number.
If the accession number you use to specify a sequence has become a secondary accession number, there is no guarantee that the sequence is exactly the same as when it had a primary accession number. That is, the original sequence may be only a portion of a new, larger entry.
You may want to find out whether a primary accession number has become secondary. For example, suppose you want to view a sequence cited in a journal. If you retrieve that sequence by accession number from the databases, it may already have been incorporated into a larger sequence.
Choose from the following.
The reference information scrolls on your screen with the accession numbers near the top. The primary accession number always appears first, before the secondary accession numbers.
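The "primary first" rule can be applied mechanically once you have the accession numbers in the order they are displayed. The sketch below assumes a GenBank-style line that starts with the keyword ACCESSION; the helper name split_accessions and the secondary number L12345 are hypothetical, chosen only for illustration.

```python
def split_accessions(accession_line):
    """Split an ACCESSION line into (primary, [secondary, ...]).

    Assumes the primary accession number is listed first, as the text states.
    """
    fields = accession_line.split()
    if not fields or fields[0].upper() != "ACCESSION":
        raise ValueError("not an ACCESSION line")
    numbers = fields[1:]
    if not numbers:
        raise ValueError("no accession numbers found")
    return numbers[0], numbers[1:]


# U00069 is the example from this chapter; L12345 is a made-up secondary number.
primary, secondary = split_accessions("ACCESSION   U00069 L12345")
print(primary)    # U00069
print(secondary)  # ['L12345']
```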
Much of the work you perform may revolve around single sequences, which are sequence files stored in your personal directories. There are three ways to create single sequence files: 1) by using SeqEd, 2) by using a text editor and the Reformat program, or 3) by using SeqLab, the graphical user interface to the Wisconsin Package.
Below is an example single nucleotide sequence file created with SeqEd.
You can store single database sequences in your personal directories as well as import single sequences created by other sequence analysis software and reformat them to use with the Wisconsin Package. For more information on importing sequences, see the "Reformatting Sequence Files to GCG Format" section in this chapter.
You can create sequences from scratch in the Wisconsin Package or edit existing sequences. Each sequence must have a "type" associated with it, denoting the sequence as either a nucleotide or a protein. To specify the sequence type, you can add the parameter -NUCleotide or -PROtein to the command line when you run SeqEd or Reformat. If you forget to do so, the programs determine the type for you based on the symbols in the sequence. Note that because nucleotide and protein sequences share some symbols, the programs can guess incorrectly at the sequence type.
Choose from the following.
Heading. (optional) May contain any number of lines of text at the top of the file describing the sequence.
Dividing Line. Consists of a single line containing two periods in succession (..) to separate heading information from the sequence. This line is required only if you include heading information.
Sequence. Contains the sequence information in any format. No line of the sequence can be longer than 512 characters.
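The three parts described above can be assembled programmatically before you run Reformat on the result. The following Python sketch is one possible way to do so, under stated assumptions: draft_sequence_file is a hypothetical helper, the 60-character line width is an arbitrary choice well under the 512-character limit, and rejecting a heading that contains two periods is a precaution inferred from the role of the dividing line.

```python
MAX_LINE = 512  # limit on sequence line length stated in the text


def draft_sequence_file(sequence, heading="", width=60):
    """Return text with an optional heading, a '..' dividing line, and the sequence."""
    if not 1 <= width <= MAX_LINE:
        raise ValueError("width must be between 1 and 512")
    if ".." in heading:
        # Precaution (assumption): '..' in the heading could be mistaken for the divider.
        raise ValueError("heading must not contain two periods in succession")
    lines = []
    if heading:
        lines.extend(heading.splitlines())
        lines.append("..")  # dividing line, required only when a heading is present
    compact = "".join(sequence.split())  # drop any whitespace in the raw sequence
    lines.extend(compact[i:i + width] for i in range(0, len(compact), width))
    return "\n".join(lines) + "\n"


print(draft_sequence_file("ACGT" * 20, heading="Test fragment typed by hand"))
```

The resulting text still has to be passed through Reformat before Wisconsin Package programs will accept it.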
Note: You also can use a text editor to modify existing sequence files, although we do not recommend this method. Once you modify a sequence with a text editor, the checksum of the sequence changes, and Wisconsin Package programs will not recognize the sequence. Therefore, if you use a text editor to modify a sequence, you must use the Reformat program to rewrite the file into GCG format.
Choose one of the following.
TIP - Sometimes the sequence files do not have characters in common; that is, you cannot use a wildcard to name several of them. If this is the case, you can create a list file to name multiple sequences. For more information, see "Using List Files" in this chapter.
Sequence type (nucleotide or protein) is an inherent part of a sequence. You can determine the type of a sequence by looking at the sequence file. Sequences in GCG format contain a dividing line between optional text heading and the sequence data. Consider the following example of a typical dividing line:
Gamma.Seq Length: 11375 December 1, 1996 10:09 Type: N Checksum: 6474 ..
The sequence type should appear on the dividing line as either Type: N for nucleotide or Type: P for protein. If the dividing line doesn't contain a Type: field, the Wisconsin Package infers the sequence type from the characters in the sequence. This inference may not always be correct.
If the Type: field of any sequence is incorrect or missing, you should correct it with the Reformat program.
Use the Reformat program. Type % reformat -NUCleotide filename or % reformat -PROtein filename. For more information on Reformat, see the Program Manual.
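Checking the Type: field by eye becomes tedious with many files. The Python sketch below shows one way to read the field from a dividing line and to choose the matching Reformat parameter. It is an illustration under the assumption that the dividing line looks like the example above; sequence_type and reformat_command are hypothetical helper names, and the command is only assembled, not run.

```python
import re


def sequence_type(dividing_line):
    """Return 'N' or 'P' from a dividing line, or None if no Type: field exists."""
    if not dividing_line.rstrip().endswith(".."):
        raise ValueError("a dividing line must end with two periods")
    match = re.search(r"\bType:\s*([NP])\b", dividing_line)
    return match.group(1) if match else None


def reformat_command(filename, seq_type):
    """Build the Reformat command line that sets the given sequence type."""
    flag = {"N": "-NUCleotide", "P": "-PROtein"}[seq_type]
    return ["reformat", flag, filename]


line = "Gamma.Seq Length: 11375 December 1, 1996 10:09 Type: N Checksum: 6474 .."
print(sequence_type(line))                 # N
print(reformat_command("gamma.seq", "N"))  # ['reformat', '-NUCleotide', 'gamma.seq']
```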
A list file, formerly known as a file of sequence names, is what its name implies: a file containing a list of sequence names and their locations. You can think of list files as a way to organize your sequences on a project-by-project basis.
You will find list files useful for specifying sequences from multiple locations--such as different databases, single sequences, RSF sequences, and MSF sequences in your personal directories--in one file that you can use as input to a program. List files can contain any number of the following types of sequences:
You can use list files with any program that accepts multiple sequences as input. A program prompt asking "What sequence(s)?" implies that the program accepts multiple sequences.
Below is an example of a list file.
In addition to sequence specifications, each sequence in a list file may optionally contain sequence attributes. These attributes include:
Begin Position. (Begin:n) Shows the base position you want to start with, where n = 1 to the length of the sequence.
End Position. (End:n) Shows the base position you want to end with, where n = 1 to the length of the sequence.
Strand. (Strand:+ or -) Defines the forward or reverse complement nucleic acid sequence strand, where + = forward strand and - = reverse strand.
Sequence Topology: Linear or Circular. (Circ:T or F) Defines the strand as linear or circular, where T = circular and F = linear.
Sequence Weight. (Wgt:n.n) Defines the sequence weight, or the significance of the sequence in comparison to other sequences. You may not want every sequence to count equally toward a result, so you can give some sequences greater weight than others. This attribute is of use only when you are using two or more sequences in the analysis.
Join. (Join:Sequence_Name) Indicates that the sequence segment should be concatenated with the next sequence in the list that has an identical Join:Sequence_Name attribute. Several contiguous sequences specified in a list file with the same Join:Sequence_Name attribute are concatenated together. (Assemble, Translate, and LookUp are the only Wisconsin Package programs that use the Join attribute. SeqLab uses the Join attribute to concatenate list file sequences in the Editor.)
Note: At the release of Version 9.0, Assemble, CodonFrequency, Distances, Diverge, FrameSearch, PileUp, PlotSimilarity, ProfileMake, Seg, Translate, and Xnu use some or all of these sequence attributes in the command-line version of the Package.
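To make the Begin, End, and Strand attributes concrete, the Python sketch below applies them to a short nucleotide string. It is an illustration, not the Package's implementation: apply_attributes is a hypothetical helper, positions are taken as 1-based and inclusive, and it assumes Begin and End are measured on the forward strand before any reverse complement is taken.

```python
# Complement table for unambiguous nucleotide symbols only (an assumption).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def apply_attributes(sequence, begin=1, end=None, strand="+"):
    """Return the segment selected by Begin/End, reverse-complemented for Strand:-."""
    if end is None:
        end = len(sequence)
    if not 1 <= begin <= end <= len(sequence):
        raise ValueError("need 1 <= begin <= end <= sequence length")
    segment = sequence[begin - 1:end]  # convert 1-based inclusive to a Python slice
    if strand == "-":
        segment = segment.translate(COMPLEMENT)[::-1]
    elif strand != "+":
        raise ValueError("strand must be '+' or '-'")
    return segment


print(apply_attributes("AACCGGTT", begin=2, end=5))              # ACCG
print(apply_attributes("AACCGGTT", begin=2, end=5, strand="-"))  # CGGT
```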
File Type. (optional) Begins with the line (all uppercase) !!SEQUENCE_LIST 1.0. SeqLab uses the file type to improve performance when loading files. Do not edit or delete the file type. If present, it must appear on the first line of the file.
Description. (optional) Contains informative text, including the date of creation, describing what is in the file.
Dividing Line. (required) Includes two periods (..) that must appear on the line preceding the sequence list.
Sequence List. (required) Includes the single sequences from your personal directory or a database, sequence specifications using wildcards, RSF files, MSF files, or list files. You must provide the database or directory specification. You can add sequences in any order.
Sequence Attributes. (optional) Can include the begin and end position, indicate the forward or reverse strand, define the strand as linear or circular, give the sequence a weight in comparison with other sequences, and indicate whether the sequence is concatenated with other sequences in the list.
Sequence Comments. (optional) Includes an exclamation point (!) followed by a short comment or definition of the sequence(s) or list file.
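The components listed above can be read back with a short parser. The Python sketch below is a simplified illustration rather than the Package's own reader: parse_list_file is a hypothetical helper, the sample file contents are made up from names used in this chapter, and it assumes attributes follow the sequence specification on the same line, separated by spaces.

```python
def parse_list_file(text):
    """Return [(specification, {attribute: value}), ...] from list file text."""
    entries = []
    in_list = False
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop comments, including the !! file type
        if not in_list:
            if ".." in line:  # the dividing line ends the description
                in_list = True
            continue
        if not line:
            continue
        spec, *tokens = line.split()
        attributes = {}
        for token in tokens:
            key, sep, value = token.partition(":")
            if not sep:
                raise ValueError(f"expected Name:value, got {token!r}")
            attributes[key] = value
        entries.append((spec, attributes))
    return entries


SAMPLE = """!!SEQUENCE_LIST 1.0
Hypothetical example list
..
GB_In:Dro5S  Begin:1 End:100 Strand:+
gamma.seq    ! a single sequence in the current directory
@hsp70.list
"""

for spec, attributes in parse_list_file(SAMPLE):
    print(spec, attributes)
```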
Use a text editor of your choice and modify the file as necessary.
TIP - One way to specify a subset of sequences is to "comment out" those unwanted sequences within the list file. If you comment out sequences instead of deleting them, you can use them at a later time.
To comment out sequences:
Some Wisconsin Package programs can produce output in list file format. Any program that creates multiple sequence output files and can organize those sequence specifications in a list file supports the -LIStfile parameter. You can then use that list file as input to other programs.
Programs that can create list output files and their parameters (if necessary) are listed below.
Note: Some of the programs listed above, such as LineUp and ProfileSearch, may include additional program-specific information in the output list file. In addition, FastA and BLAST may include sequence alignments. This extra information does not affect the list file's performance.
Type an at sign (@) and the name of the list file and extension, for example @hsp70.list.
Note: You cannot use wildcards to specify a list file. For example, you cannot specify @hsp*.list.
A Rich Sequence Format (RSF) file contains one or more sequences that may or may not be related. In addition to the sequence data, each sequence can be richly annotated with descriptive sequence information such as:
RSF files are especially useful with SeqLab, the graphical user interface to the Wisconsin Package. Because they store positional information, you can display RSF files within SeqLab's Editor mode to view and edit sequence alignments and features. The features annotation allows you to graphically view and align sequences based on features as well as run programs on sequence regions selected by feature. You also will find RSF files useful for distributing sequences to colleagues, since these files contain each sequence's data and descriptive information.
Note: If you plan on using SeqLab for the bulk of your analyses, it is best to save your files as RSF if possible. RSF files are more richly annotated than list files or MSF files, which do not save sequence features information as part of the file.
Below is an example of an RSF file.
You may find the following components in an RSF file:
Choose one of the following.
Use SeqLab. If you load an RSF file into SeqLab's Editor, it graphically displays the sequences in the file. For more information, see Chapter 2, The Editor: Editing Single Sequences and Multiple Sequence Alignments in the SeqLab Guide.
Choose one of the following.
You can combine multiple sequences in a single file, called a Multiple Sequence Format (MSF) file. MSF files include not only the sequence name but also the sequence itself, which is usually aligned with the other sequences in the file. Three Wisconsin Package programs, PileUp, LineUp, and Reformat, can create MSF files. You can specify a single sequence within an MSF file, a subset of sequences, or all sequences. Like other sequences, those in an MSF file can be used with other Wisconsin Package programs.
The following illustration shows an MSF file created with PileUp.
You may find the following components in an MSF file:
PileUp, LineUp, and Reformat create MSF files. These programs and their parameters (if necessary) are listed below.
Note: If you use % reformat -MSF to create an MSF file, it does not align the sequences.
Use LineUp. For more information, see the Program Manual.
You also can use a text editor to modify an MSF file. If you do so, however, the file's checksum changes, and Wisconsin Package programs will not recognize the file. Therefore, if you use a text editor to modify an MSF file, you must use the Reformat program with the -MSF parameter to rewrite it into GCG format.
Choose from the following.
Note: You cannot use wildcards in an MSF filename (that is, you cannot specify pic*.msf). You can use wildcards only between the curly brackets { }. Also, an MSF sequence specification must contain a sequence name or wildcard within the curly brackets. The MSF filename alone is not enough.
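The restrictions in this note can be expressed as a simple pattern check. The Python sketch below is an illustration only: parse_msf_spec is a hypothetical helper, pic1.msf is a made-up filename, and only the asterisk is treated as a wildcard.

```python
import re

# Filename without wildcards or brackets, then a non-empty name or wildcard in { }.
MSF_SPEC = re.compile(r"^(?P<file>[^{}*]+)\{(?P<names>[^{}]+)\}$")


def parse_msf_spec(spec):
    """Return (filename, names) for a specification such as pic1.msf{*}."""
    match = MSF_SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a valid MSF sequence specification: {spec!r}")
    return match.group("file"), match.group("names")


print(parse_msf_spec("pic1.msf{*}"))  # ('pic1.msf', '*')
```

With this check, pic*.msf{*} is rejected because the wildcard sits in the filename, and pic1.msf is rejected because it names no sequence within the curly brackets.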
TIP - One way to specify a subset of sequences is to "comment out" those unwanted sequences within the MSF file. If you comment out sequences instead of deleting them, you can use them at a later time.
To comment out sequences:
This section teaches you how to
The Wisconsin Package helps you find sequences in the GenBank, EMBL, PIR, and SWISS-PROT databases. Its sequence identification programs look through any set of entries you name to find all the sequences that contain some common attribute.
For more information on these programs, see the Program Manual.
The Wisconsin Package makes it easy for you to copy sequences from the databases to your directory. You can copy single or multiple sequences.
Choose from the following.
Note: If you do not know the database in which a sequence resides, you can simply type the sequence name and Fetch will find it. However, if you do this, Fetch searches through a number of directories, taking longer to complete and possibly finding files you are not interested in.
TIP - You also can copy multiple sequences from the databases by creating a list file of those sequences of interest (see "Using List Files" in this chapter for more information). This method is useful if the sequence names do not have characters in common. Then, to copy the sequences from the database, type % fetch @list_filename, for example % fetch @hiv-gag.list. The sequences in the list file are copied to your current directory.
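If you already have the sequence specifications in a script, you can generate the list file rather than type it. The Python sketch below is a minimal illustration: make_list_file and fetch_command are hypothetical helpers, the two specifications are examples taken from earlier in this chapter, and the Fetch command is only assembled as an argument list, not executed.

```python
def make_list_file(specs, description=""):
    """Return list file text: optional description, '..' dividing line, one spec per line."""
    if ".." in description:
        # Precaution (assumption): '..' in the description could be read as the divider.
        raise ValueError("description must not contain two periods in succession")
    lines = []
    if description:
        lines.append(description)
    lines.append("..")  # required dividing line before the sequence list
    lines.extend(specs)
    return "\n".join(lines) + "\n"


def fetch_command(list_filename):
    """Build the Fetch command line that copies every sequence named in the list file."""
    return ["fetch", "@" + list_filename]


text = make_list_file(["GB_In:Dro5S", "GE:U00069"], "Sequences to copy")
print(text)
print(fetch_command("hiv-gag.list"))  # ['fetch', '@hiv-gag.list']
```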
You may want to read the reference information associated with a sequence or view the sequence itself. You can easily view the contents of sequence files by using the TypeData program or the more command. Using these commands, you can view database sequences or those in your personal directories, including single sequences, RSF files, MSF files, or list files.
Type % typedata entry_name, for example % typedata GB_IN:Dro5S. The sequence data, including reference information, scrolls on your screen. Note that you cannot edit a file using the TypeData command.
You can control screen output in the following ways:
For more information on controlling screen output, see "Controlling Screen Output" in the "Quick Reference" section of Chapter 1, Getting Started.
Type % more filename, for example % more gamma.seq. The sequence data, including reference information, displays one screen at a time. To advance from screen to screen, press the <Space Bar>.
At some point in your work with the Wisconsin Package, you may need to reformat sequence files into GCG format. This may happen when
You can use a number of differently formatted sequences with the Wisconsin Package--sequences created with a text editor or automated sequencer; sequences in a different software format (for example Staden or IntelliGenetics); or sequences in the database formats of GenBank, EMBL, SWISS-PROT, or PIR.
Each sequence in the Wisconsin Package must have a "type" associated with it, denoting the sequence as either a nucleotide or a protein. To specify the sequence type, you can add the parameter -NUCleotide or -PROtein to the command line when you run Reformat, FromStaden, FromEMBL, FromFastA, FromGenBank, FromPIR, or FromIG. If you forget to do so, the programs will determine the type for you based on the symbols in the sequence. Note that because nucleotide and protein sequences share some symbols, the programs can guess incorrectly at the sequence type.
Choose one of the following.
Note: If the sequence file contains descriptive or reference information in addition to the sequence information, you first must open the file in a text editor and insert a line that contains two periods (..) above the sequence information. Then use Reformat to rewrite the sequence to GCG format.
Note: You can use Staden sequences directly with the Wisconsin Package without reformatting them by adding -STAden to the command line when you run a Wisconsin Package program.
Note: You can use FastA sequences directly with the Wisconsin Package without reformatting them by adding -FASTA to the command line when you run a Wisconsin Package program.
The information in this section is intended for users who are familiar with using sequences within the Wisconsin Package. This section teaches you how to
You can create your own personal databases, similar to GenBank and EMBL databases, for searching with the Wisconsin Package. This option is particularly advantageous if you frequently work with large list files. A large set of sequences is more compact to store and faster to search if it is assembled into a database. Thus, you can convert your large list files into databases for faster searching. When sequences are assembled into a database, all Wisconsin Package programs work with them exactly as they work with the public databases (GenBank, EMBL, etc.).
The program DataSet creates databases from any set of sequences you specify.
The program displays the prompt "What should I call the database?"
Your personal database logical names are automatically assigned in a shell script called .datasetrc in your home directory.
Specifying a personal database you created using DataSet is the same as specifying a sequence from a public database such as GenEMBL, GenBank, SWISS-PROT, etc.
Type the logical name of your database, followed by a colon (:), followed by the sequence(s) of interest. For instance, using the example above, you could type HSP:Hs70_Brelc to specify a single sequence in the personal database, or HSP:* to specify all sequences in the personal database. For more information, see "Using Database Sequences" in this chapter.
You can refine list files, RSF files, or MSF files to fit your analysis needs:
For more information on the above programs, see the Program Manual.
Note: You cannot combine MSF files in this way.
Note: You cannot "comment out" sequences in RSF files in this way.
Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com
Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc., a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.
Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.