REFORMAT

Reformat rewrites sequence files to make them usable by the Wisconsin Package(TM) or to alter their appearance. The following are some of the manipulations that Reformat can perform:

- Converting sequence files that were prepared or edited with a text editor or transferred to your computer from another computer into GCG format.

- Converting among multiple sequence format (MSF) files, rich sequence format (RSF) files, and individual sequence files in GCG format.

- Correcting the sequence type (protein or nucleic acid) of sequence files that have no type or that were incorrectly typed when they were created.

- Converting nucleic acid sequences between DNA (T, t) and RNA (U, u) representations.

- Converting peptide sequences between one-letter and three-letter amino acid representations.

- Converting sequences to all uppercase or all lowercase characters.

- Removing gap characters from sequence files.
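The character-level manipulations in the list above can be pictured with a minimal Python sketch. It is not Reformat's implementation: the function names are hypothetical, the gap characters (. and ~) are the ones named for -DEGap later in this document, and the input is assumed to be a bare sequence string with any embedded comments already removed.

```python
# Hypothetical helpers for illustration only; not part of the Wisconsin Package.
TO_DNA = str.maketrans("Uu", "Tt")    # RNA (U, u) -> DNA (T, t)
TO_RNA = str.maketrans("Tt", "Uu")    # DNA (T, t) -> RNA (U, u)
DEGAP = str.maketrans("", "", ".~")   # delete the gap characters . and ~


def to_dna(seq: str) -> str:
    """Replace U with T and u with t, leaving every other character alone."""
    return seq.translate(TO_DNA)


def to_rna(seq: str) -> str:
    """Replace T with U and t with u, leaving every other character alone."""
    return seq.translate(TO_RNA)


def degap(seq: str) -> str:
    """Remove gap characters from the sequence."""
    return seq.translate(DEGAP)
```

Case conversion needs no helper in this sketch: `seq.upper()` and `seq.lower()` correspond to the all-uppercase and all-lowercase conversions.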

In order to use Reformat on individual sequence files, the files must contain a heading, a dividing line, and a sequence, as described below. You can use a text editor to make your "foreign" sequence files conform to this arrangement.

HEADING

The heading of a sequence file may contain any number of lines of text at the top of the file to describe the sequence. The heading must not contain two adjacent periods (..) anywhere within it.

DIVIDING LINE

The heading is followed by a dividing line: a line containing two adjacent periods (..). Any information on the line other than the two periods is lost during reformatting. The dividing line may be omitted if there is absolutely no heading. All GCG data files contain a dividing line to separate the data from a documentary heading.

SEQUENCE

After the dividing line comes the sequence in any format you wish. It is conventional to use uppercase letters for known parts of the sequence and lowercase letters for uncertain parts. As in the example below, the sequence may have documentary comments embedded within it. You may either use two adjacent slash characters (//) to mark the end of the sequence data or just allow the sequence to go on until the end of the file.
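The heading, dividing line, and sequence arrangement described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Reformat's parser: the function name is hypothetical, and the handling of text that shares a line with the // marker (kept if it precedes the marker) is a guess that the manual does not confirm.

```python
def split_sequence_file(text: str) -> tuple[str, str]:
    """Split raw file text into (heading, sequence_text).

    The heading ends at the first line containing two adjacent periods.
    The sequence runs to a '//' marker or to the end of the file.
    """
    lines = text.splitlines()
    # The dividing line is the first line that contains "..".
    divider = next((i for i, line in enumerate(lines) if ".." in line), None)
    if divider is None:
        # No dividing line: assume there is no heading at all.
        heading_lines, body = [], lines
    else:
        # Anything else on the dividing line itself is discarded.
        heading_lines, body = lines[:divider], lines[divider + 1:]

    sequence_lines = []
    for line in body:
        before, marker, _ = line.partition("//")
        sequence_lines.append(before)  # assumption: keep text before the marker
        if marker:
            break  # optional end-of-sequence marker reached
    return "\n".join(heading_lines).strip(), "\n".join(sequence_lines).strip()
```

The sequence text returned here may still contain embedded comments, spaces, and line breaks; removing them is a separate step.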

SEQUENCE CHARACTERS

The alphabet of legitimate sequence characters and their meanings are defined in Appendix III. Legitimate sequence characters include all uppercase and lowercase letters. Wisconsin Package programs support the IUB-IUPAC standard ambiguity codes for the representation of nucleic acid ambiguities and the standard one-letter amino acid codes. Reformat, like all other Wisconsin Package programs, will ignore all characters that are not in the alphabet of legitimate sequence characters.

EXAMPLE

Here is a session using Reformat to rewrite a sequence file prepared with a text editor (see the INPUT FILES topic below) into GCG format:


% reformat

 REFORMAT what sequence file(s) ?  reformat.txt

    reformat.txt  length: 1636 bp

%

OUTPUT FILE

Here is part of the output file from the example above:


!!NA_SEQUENCE 1.0

Human fetal Beta globin G gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.

The region below is used to demonstrate REFORMAT.  It
starts at base 2101 of the sequence reported in Cell (gamma.seq).

reformat.txt  Length: 1636  October 14, 1996 10:56  Type: N  Check: 398  ..

       1  AGGAAGCACC CTTCAGCAGT TCCACA
                                      >Cap (G gamma)>
                                      CACT CGCTTCTGGA ACGTCTGAGG

      51  TTATCAATAA GCTCCTAGTC CAGACGCC
                                        >coding (G gamma)>
                                        AT GGGTCATTTC ACAGAGGAGG

    ////////////////////////////////////////////////////////////

    1551  CTTTCAAGGA TAGGCTTTAT TCTGCAAGCA ATACAAATAA TAAATCTATT

    1601  CTGCTAAGAG ATCAC
                          <POLYA (G gamma)<
                          ACATG GTTGTCTTCA GTTCTT

INPUT FILES

Here is part of the input file used for the example above:


Human fetal Beta globin G gamma
from Shen, Slightom and Smithies,  Cell 26; 191-203.
Analyzed by Smithies et al. Cell 26; 345-353.

The region below is used to demonstrate REFORMAT.  It
starts at base 2051 of the sequence reported in Cell.

                            ..

AGGAAGCACC CTTCAGCAGT TCCACA>Cap (G gamma)>CACT CGCTT
CTGGA ACGTCTGAGG
TTATCAATAA GCTCCTAGTC CAGACGCC>coding (G gamma)>AT

////////////////////////////////////////////////////////

GCTCACTGCC CATGATGCAG
AGCTTTCAAG GATAGGCTTT ATTCTGCAAG CAATACAAAT AATAAATCTA
TTCTGCTAAG AGATCAC<POLYA (G gamma)<ACATGGTTGTCTTCAGTTCTT

RELATED PROGRAMS

SeqEd is a general-purpose sequence editor.

All Wisconsin Package programs that write sequence files, such as Assemble, BackTranslate, ExtractPeptide, FromStaden, GetSeq, PepData, PileUp, Reverse, SeqEd, Shuffle, and Translate, write their sequences in GCG format.

The programs FromEMBL, FromFastA, FromGenBank, FromIG, FromPIR, and FromStaden are designed to bring files from six popular formats into GCG format. These specialized reformatting programs, in addition to reformatting the sequences, also convert the sequence characters into the nearest IUB-IUPAC equivalent character (see Appendix III).

ChopUp converts a non-GCG sequence file containing lines as long as 32,000 characters into a new file containing lines no longer than 50 characters. The new file can be read by Reformat to create a GCG-format sequence file.

BreakUp reads a GCG-format sequence file containing more than 350,000 sequence characters and writes it as a set of separate, shorter, overlapping sequence files that can be analyzed by GCG programs.

DataSet creates a GCG data library from any set of sequences in GCG format. GCGToBLAST combines any set of GCG sequences into a database that you can search with BLAST.

RESTRICTIONS

A sequence may not contain more than 350,000 sequence symbols. BreakUp can convert a GCG-format sequence file containing more than 350,000 sequence symbols into a set of separate, shorter overlapping sequence files. Embedded comments more than 125 characters long are truncated to 125 characters. Input lines may not be more than 511 characters. ChopUp can convert a file with lines exceeding 511 characters to a file suitable for input to Reformat.

CONSIDERATIONS

Filename Extensions

Nucleic acid and peptide sequences are generally named with the filename extensions .seq and .pep, respectively.

Use Staden Format Directly

The command % seqformat Staden sets your process so that most programs accept sequences in the format used by the Staden programs directly without the need for reformatting. The command % seqformat GCG restores the system to expect sequences in GCG format.

You can use Reformat on Staden files (or any files that contain only sequence characters) without modification as long as all the sequence characters in the file belong to the IUB-IUPAC code representation. If your Staden file contains any of Staden's ambiguity codes, use the FromStaden program instead.

Use FastA Format Directly

The command % seqformat FastA sets your process so that most programs accept user sequences in FastA format without the need for reformatting. The command % seqformat GCG restores the system to expect sequences in GCG format.

Input from stdin

Reformat accepts input from stdin if you specify -INfile=- on the command line. If the stdin input does not contain a heading that is separated from the sequence by a line containing two dots (..), then add -NOHEAding to the Reformat command line.

Multiple Sequence Format (MSF) and Rich Sequence Format (RSF) Files

Reformat can be used to convert between MSF, RSF and individual sequence format files. All embedded comments are lost when converting from individual sequence to either MSF or RSF format. In addition, when the sequence files are specified using a list file, any sequence attributes specified in the list file are ignored during the conversion to the new file. When converting from an RSF file all sequence features are lost. Likewise, when converting to an RSF file no attempt is made to create a list of sequence features from the original sequence's reference. Access to sequence features is currently available only from within SeqLab. (In Chapter 2, Using Sequence Files and Databases of the User's Guide, see "Using Multiple Sequence Format (MSF) Files" for help in specifying sequences in MSF files, "Using Rich Sequence Format Files" for help with RSF files, and "Using List Files" for information about list files.)

Following are several examples of the commands you might type to convert between MSF or RSF and individual sequence format files. These examples use the files hsp70.msf, hsp70.rsf and pretty.list, which can be copied to your local directory with the % fetch command.

To copy all of the sequences in hsp70.msf into separate sequence files, use

% reformat hsp70.msf{*}

To copy all of the sequences in hsp70.rsf into separate sequence files, use

% reformat hsp70.rsf{*}

To copy the sequence Hs70_Plafa from hsp70.msf into a separate sequence file, use

% reformat hsp70.msf{hs70_plafa}

To collect all of the sequences named in pretty.list into an RSF file, use

% reformat -RSF @pretty.list

To collect the mouse sequences in hsp70.msf into a separate MSF file, use

% reformat -MSF hsp70.msf{*mouse}

If you edit hsp70.msf with a text editor to manually adjust the alignment, you must use Reformat to rewrite the MSF file so that it can be used with Wisconsin Package programs by using

% reformat -MSF hsp70.msf{*}

FORMAT CONTROL

For individual sequence files and MSF files, you can control the number of sequence characters per line and the number of characters in each block by setting parameters on the command line. Additionally for individual sequence files, you can control how many blank lines appear between sequence lines. Reformat defaults to groups of 10 characters in lines of 50, with a blank line between each sequence line.
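As a rough picture of the default layout (blocks of 10 characters, lines of 50, one blank line between sequence lines), here is a minimal Python sketch. The function name is hypothetical. The eight-column right-aligned position number followed by two spaces is inferred from the OUTPUT FILE example, and the limits on line size (1 to 120) and block size (1 to the line size) are those given under OPTIONAL PARAMETERS. Embedded comments and the file heading are not handled.

```python
def format_sequence(seq: str, linesize: int = 50, blocksize: int = 10,
                    blanklines: int = 1) -> str:
    """Lay out a bare sequence string in numbered, blocked lines."""
    if not 1 <= linesize <= 120:
        raise ValueError("linesize must be between 1 and 120")
    if not 1 <= blocksize <= linesize:
        raise ValueError("blocksize must be between 1 and linesize")
    if blanklines < 0:
        raise ValueError("blanklines must be zero or more")

    out = []
    for start in range(0, len(seq), linesize):
        chunk = seq[start:start + linesize]
        # Cut the line into space-separated blocks.
        blocks = [chunk[i:i + blocksize] for i in range(0, len(chunk), blocksize)]
        # Number each line with the 1-based position of its first character.
        out.append(f"{start + 1:>8}  " + " ".join(blocks))
        out.extend([""] * blanklines)
    return "\n".join(out).rstrip("\n")
```

Calling `format_sequence(seq, linesize=60, blocksize=3, blanklines=0)` would, for example, produce codon-sized blocks with no blank lines between sequence lines.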

CHECKSUM

For each sequence in an MSF, RSF or individual sequence file, Reformat calculates a checksum based on the exact sequence. Reformat always adds the checksum to the file containing the sequence. All Wisconsin Package programs that read sequences recalculate the checksum and compare it to the value written by Reformat to ensure the integrity of the data. If there is disagreement between the newly calculated and previously written values of checksum, the program stops and displays an error message. There is one chance in ten thousand that two different sequences would have the same checksum.
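This manual does not describe how the checksum is computed. The Python sketch below uses a position-weighted character sum taken modulo 10,000, a form that third-party tools commonly attribute to GCG. Treat the weighting (a counter cycling from 1 to 57) and the uppercase folding as assumptions rather than documented behavior; only the range of 10,000 possible values is consistent with the one-in-ten-thousand figure above. The function name is hypothetical.

```python
def sequence_checksum(seq: str) -> int:
    """Return a checksum in the range 0..9999 for a bare sequence string.

    Assumption: each character code is weighted by a counter that cycles
    from 1 to 57, and case is ignored.
    """
    total = 0
    for position, char in enumerate(seq):
        weight = position % 57 + 1       # 1, 2, ..., 57, 1, 2, ...
        total += weight * ord(char.upper())
    return total % 10000
```

Because the weight depends on position, swapping two different adjacent characters changes the result, which is the property that lets a reading program detect most accidental edits.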

EMBEDDED COMMENTS

You may embed comments of up to 125 characters within a sequence in an individual sequence file by enclosing them in special comment-delimiting characters. Comments are very helpful for documenting sequences, especially sequences assembled from several sources or sequences containing many genes.

Comment Delimiting Characters

Embedded comments can begin with one of the characters <, >, or $. Each comment must begin and end with the same character.
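The delimiter rule above can be sketched with a regular expression in Python. This is an illustration, not Reformat's parser: the function name is hypothetical, comments are assumed to sit on a single line, and the 125-character truncation mirrors the limit stated under RESTRICTIONS.

```python
import re

# A comment opens with <, > or $ and closes with the same character.
COMMENT = re.compile(r"([<>$])(.*?)\1")


def split_comments(sequence_text: str) -> tuple[str, list[str]]:
    """Return (sequence without comments, list of comment texts)."""
    # Keep at most 125 characters of each comment, as in RESTRICTIONS.
    comments = [m.group(2)[:125] for m in COMMENT.finditer(sequence_text)]
    stripped = COMMENT.sub("", sequence_text)
    return stripped, comments
```

Applied to the fragment `TCCACA>Cap (G gamma)>CACT` from the input file shown earlier, this separates the comment text `Cap (G gamma)` from the sequence characters `TCCACACACT`.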

Suggestions

The embedded comments below seem useful for the sequences we have annotated.


        >coding>         beginning of coding sequence
        <coding<         termination of coding sequence
        >Cap>            cap site
        >IVS>            intervening sequence donor
        <IVS<            intervening sequence acceptor
        <PolyA<          poly-A addition site
        >Transcript>     beginning of transcript
        <Transcript<     end of transcript
        >Promoter>       promoter
        >Ribosome>       ribosome binding site

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Minimal Syntax: % reformat [-INfile=]reformat.txt -Default

Prompted Parameters: None

Local Data Files:

-DATa=translate.txt        three-letter to one-letter codes

Optional Parameters:

[-OUTfile=]NewSeqName      names the output file
-EXTension=.seq            specifies a file name extension for the output
-LIStfile[=reformat.list]  writes a list file of output sequence names
-MSF                       reformats sequences into an MSF output file
-RSF                       reformats sequences into an RSF output file
-PROtein or -NUCleotide    insists that the sequences are reformatted as
                           protein or nucleotide sequences
-DEGap                     removes gap characters (. and ~) from the sequence
-LINesize=50               sets number of characters per line
-BLOcksize=10              sets number of characters per block
-BLAnklines=1              puts blank lines between the sequence lines
-NONUMbering               suppresses numbering
-NOCOMments                suppresses comments
-DNA                       changes U into T
-RNA                       changes T into U
-UPPer                     makes all sequence characters uppercase
-LOWer                     makes all sequence characters lowercase
-ONEIntothree              translates one-letter peptides into three-letter
-THReeintoone              translates three-letter peptides into one-letter
-NOHEAding                 input sequence from stdin contains no header
                           information

-COMparison                reformats a scoring matrix instead of a sequence
                           (used with -PROtein or -NUCleotide, insists that
                           the matrix is reformatted as a protein or
                           nucleotide scoring matrix)
-GAPweight=12              specifies the gap creation penalty associated
                           with the scoring matrix
-LENgthweight=4            specifies the gap extension penalty associated
                           with the scoring matrix
-SCAle=10                  multiplies each value in the scoring matrix by 10
                           (use any number from .01 to 100.0)
-EQUALSformat              writes the scoring matrix in a form that may be
                           more easily read
-OLDCMPformat              converts a pre-Version 9 scoring matrix into a
                           Version 9 scoring matrix (all options used with
                           -COMparison can also be used with -OLDCMPformat;
                           -PROtein or -NUCleotide must be specified with
                           -OLDCMPformat)
-TRANSlate=filename.txt    lets you name the translation table
-NOMONitor                 suppresses the screen trace showing each output
                           file

SCORING MATRICES

After modifying a scoring matrix, you may want to reformat it to give it a nicer appearance. To use Reformat for this purpose, run the program with % reformat -COMparison. (See Appendix VII for more information about scoring matrices.)

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

In the rare event that you are using Reformat to convert a three-letter amino acid sequence into a one-letter sequence, Reformat looks for translate.txt as a local data file.

The translation of codons to amino acids, the identification of potential start codons and stop codons, and the mappings of one-letter to three-letter amino acid codes are all defined in a translation table in the file translate.txt. If the standard genetic code does not apply to your sequence, you can provide a modified version of this file in your working directory or name an alternative file on the command line with an expression like -TRANSlate=mycode.txt. Translation tables are discussed in more detail in Appendix VII.
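To make the one-letter and three-letter mapping concrete, here is a minimal Python sketch. It does not read translate.txt, whose layout is not shown in this document; instead it hard-codes the 20 standard amino acid codes. The function names are hypothetical, and the three-letter input is assumed to be a run of three-letter codes with optional whitespace between them.

```python
# Standard one-letter and three-letter codes for the 20 common amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_TO_ONE = {three.upper(): one for one, three in ONE_TO_THREE.items()}


def one_into_three(peptide: str) -> str:
    """Translate one-letter codes into three-letter codes."""
    return "".join(ONE_TO_THREE[aa] for aa in peptide.upper())


def three_into_one(peptide: str) -> str:
    """Translate three-letter codes into one-letter codes."""
    letters = "".join(peptide.split()).upper()  # ignore spaces and line breaks
    if len(letters) % 3:
        raise ValueError("expected a whole number of three-letter codes")
    return "".join(THREE_TO_ONE[letters[i:i + 3]]
                   for i in range(0, len(letters), 3))
```

A symbol outside the 20 standard residues raises `KeyError` in this sketch; how Reformat itself treats such symbols is not stated here.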

OPTIONAL PARAMETERS

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-OUTfile=newseqname

selects an output filename other than the name of the input file.

-EXTension=.seq

selects a filename extension other than the input filename extension.

-LIStfile=reformat.list

writes a list file with the names of the output sequence files. This list file is suitable for input to other Wisconsin Package programs that support list files (see Chapter 2, Using Sequence Files and Databases in the User's Guide). If you don't specify a file name, Reformat creates one, using reformat for the file name and .list for the file name extension. If -MSF is on the command line, this parameter is ignored and no list file is written.

-MSF

reformats all input sequences into a multiple sequence format (MSF) output file.

-RSF

reformats all input sequences into a rich sequence format (RSF) output file.

-PROtein or -NUCleotide

reformats the sequence as a protein or nucleotide sequence.

-DEGap

removes gap characters (. and ~) from the sequence.

-LINesize=50

lets you set the number of sequence characters per line to any number between 1 and 120 in MSF and individual sequence files.

-BLOcksize=10

lets you set the number of sequence characters in each block to any number between 1 and the line size in MSF and individual sequence files.

-BLAnklines=1

leaves zero or more blank lines between the sequence lines in individual sequence files.

-NONUMbering

suppresses the numbering next to each sequence line in individual sequence files.

-NOCOMments

removes any comments from the input individual sequence file.

-DNA

substitutes T for U and t for u in the whole sequence.

-RNA

substitutes U for T and u for t in the whole sequence.

-UPPer

puts all sequence characters into uppercase.

-LOWer

puts all sequence characters into lowercase.

-ONEIntothree

changes a peptide sequence of one-letter codes into three-letter codes (see Appendix III). Wisconsin Package programs for peptide sequence analysis can use peptide sequences in one-letter codes only.

-THReeintoone

changes a peptide sequence from three-letter codes into one-letter codes (see Appendix III). Wisconsin Package programs for peptide sequence analysis can use peptide sequences in one-letter codes only.

-COMparison

reformats a scoring matrix.

-GAPweight

specifies the default gap creation penalty associated with the scoring matrix. This penalty is written in the auxiliary data block in the output scoring matrix file. If you don't specify a default gap creation penalty with -GAPweight, the program calculates a reasonable default and writes it in the auxiliary data block. (See Appendix VII for information about the auxiliary data block in scoring matrix files.)

-LENgthweight

specifies the default gap extension penalty associated with the scoring matrix. This penalty is written in the auxiliary data block in the output scoring matrix file. If you don't specify a default gap extension penalty with -LENgthweight, the program calculates a reasonable default and writes it in the auxiliary data block. (See Appendix VII for information about the auxiliary data block in scoring matrix files.)

-SCAle=10

multiplies each value in the scoring matrix and each gap penalty in the auxiliary data block by 10. (See Appendix VII for information about the auxiliary data block in scoring matrix files.) You can specify any scale from 0.01 to 100.0 and each value in the matrix and each gap penalty is multiplied by this number and then rounded to the nearest integer.

-PROtein or -NUCleotide

reformats the matrix as either a protein or nucleotide scoring matrix. (See Appendix VII for information about scoring matrix types.)

-EQUALSformat

writes the scoring matrix in a format which is less compact but may be more easily read by some people. This equals format file can be read by any program that reads scoring matrices.

-OLDCMPformat

converts a pre-Version 9 scoring matrix to the Version 9 scoring matrix format. By default, each floating point value in the pre-Version 9 matrix is first multiplied by 10 and then rounded to the nearest integer. You must add either -PROtein or -NUCleotide to specify the type of the converted scoring matrix. (See Appendix VII for information about scoring matrix types.) All of the optional parameters that may be used with -COMparison may also be used with -OLDCMPformat.

-NOHEAding

indicates that the input sequence from stdin contains no heading information.

-TRANSlate=filename.txt

Usually, translation is based on the translation table in a default or local data file called translate.txt. This parameter allows you to use a translation table in a different file. (See Appendix VII for information about translation tables.)

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

Printed: November 18, 1996 13:08 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks

Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.