PROFILEMAKE

[ Program Manual | User's Guide | Data Files | Databases ]

Table of Contents

FUNCTION

DESCRIPTION

SPECIFYING SEQUENCES FOR PROFILEMAKE

CALCULATING THE PROFILE

FUNCTION [ Top | Next ]

ProfileMake creates a position-specific scoring table, called a profile, that quantitatively represents the information from a group of aligned sequences. The profile can then be used for database searching (ProfileSearch) or sequence alignment (ProfileGap).

DESCRIPTION [ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileMake uses the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)) to create a profile from a group of aligned sequences. A profile is a table that contains all of the comparison information of a group of aligned sequences. These sequences must be aligned (see the RELATED PROGRAMS topic below) before you run ProfileMake. The profile contains as many rows as there are positions in the aligned sequences. Each row contains a score for the alignment of the corresponding position of the aligned sequences with each possible base or residue.

The profile is the input data for ProfileSearch, which can find sequences in the database similar to your group of aligned sequences, and ProfileGap, which can make an optimal alignment between the aligned sequences and another sequence.

The aligned sequences may be specified to ProfileMake with an ambiguous file expression or in a list file similar to the input for Pretty or LineUp. (See Chapter 2, Using Sequence Files and Databases in the User's Guide for more information.)

EXAMPLE [ Previous | Top | Next ]

Here is a session using ProfileMake to make a profile from aligned 70 kd heat shock and heat shock cognate peptide sequences (these sequences were aligned in the example session for PileUp):


% profilemake

    Profile of what aligned sequence(s) hsp70.msf{*}

 hsp70.msf{hs70_plafa}, begin: 1  end: 718  len: 718  weight: 1.00
 hsp70.msf{hs70_thean}, begin: 1  end: 718  len: 718  weight: 1.00
 hsp70.msf{hs70_leido}, begin: 1  end: 718  len: 718  weight: 1.00

 /////////////////////////////////////////////////////////////////

    What should I call the output file (* hsp70.prf *) ?

%

OUTPUT [ Previous | Top | Next ]

Here is some of the output file:


!!AA_PROFILE 1.0
(Peptide) PROFILEMAKE v4.50 of: hsp70.msf{*}  Length: 718
  Sequences: 28  MaxScore: 2172.36  October 11, 1996 11:41

                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10

         hsp70.msf{Hs70_Plafa}  From: 1         To: 718       Weight: 1.00
         hsp70.msf{Hs70_Thean}  From: 1         To: 718       Weight: 1.00

         /////////////////////////////////////////////////////////////////

         hsp70.msf{Dnak_Ecoli}  From: 1         To: 718       Weight: 1.00

Symbol comparison table: GenRunData:blosum62.cmp  FileCheck: 6430

     Relaxed treatment of non-observed characters
     Exponential weighting of characters
Cons A    B    C    D    E    F    G    H    I    K    L  ... Gap  Len  ..
 M   -1   -3   -1   -3   -2    0   -3   -2    1   -1    2 ...   9    9
 L   -1   -4   -1   -4   -3    0   -4   -3    2   -2    4 ...   9    9

 /////////////////////////////////////////////////////////////////////

 E   -1    2   -4    2    5   -3   -2    0   -3    1   -3 ...   2    2
 V    0   -3   -1   -3   -2   -1   -3   -3    3   -2    1 ...   2    2
 B   -2    6   -3    6    2   -3   -1   -1   -3   -1   -4 ...   2    2
 * 1553    0  132 1273 1380  667 1497  197 1132 1400 1327 ...

INPUT FILES [ Previous | Top | Next ]

ProfileMake accepts multiple sequences (two or more) all of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of ProfileMake depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS [ Previous | Top | Next ]

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

RESTRICTIONS [ Previous | Top | Next ]

We have little experience using nucleotide sequences with profile analysis.

Profiles must be no more than 1000 residues long. ProfileMake cannot accept more than 5000 aligned sequences for the profile. It is your responsibility to ensure that the sequences input to ProfileMake are in alignment.

SPECIFYING SEQUENCES FOR PROFILEMAKE [ Previous | Top | Next ]

The sequences used to make the profile can be specified in two ways. (See Chapter 2, Using Sequence Files and Databases in the User's Guide for more information.) A group of sequences may be named with an ambiguous expression like kf*.pep or pileup.msf{*}. The sequences may also be specified in a list file, and a beginning and ending position can be assigned to each sequence in the list with the begin: and end: sequence attributes, respectively. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide.) Make sure that the sequence ranges you specify will result in the sequences being in alignment. If beginning and ending positions are not specified, the entire sequence is used.

If the sequences are specified in a list file, you can optionally specify a weight for each sequence with the weight: sequence attribute. A weight of 1.0 is assumed if none is specified with the sequence.

You can assign weights to sequences in an MSF file by editing the MSF file and modifying the weight on the name/weight line for each sequence. (See "Using Multiple Sequence (MSF) Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for a complete description of MSF files.)

You can assign vote weights to sequences in an RSF (rich sequence format) file by modifying the weight attribute for each sequence within SeqLab. (See "Using Rich Sequence Format (RSF) Files" in Chapter 2, Using Sequence Files and Databases in the User's Guide for a complete description of RSF files. Also see "Viewing Sequence Attribute and Reference Information" in Chapter 2, Editing Sequences in the SeqLab Guide for more information about modifying the weight attribute for each sequence within an RSF file.)

If a sequence from an MSF or RSF file is listed in a list file with a weight, the sequence weight is taken from the list file (the sequence weight in the MSF or RSF file is ignored).

Part of a file of sequence names that could be used as input to ProfileMake follows.


A multiple sequence alignment represented as a list file for input to
the program PROFILEMAKE.
5/3/90   ..

fa10.ugly    begin: 201       end: 250       weight: 0.5
fa12.ugly    begin: 201       end: 250       weight: 0.5
fo1k.ugly    begin: 201       end: 250       weight: 1.0
e.ugly       begin: 201       end: 250       weight: 1.0

////////////////////////////////////////////////////////
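As a rough illustration of how such a list file could be read, here is a minimal Python sketch. It assumes that everything up to and including the line ending in ".." is comment text and that each later line holds a sequence name followed by optional begin:, end:, and weight: attributes. The function name parse_list_file and these parsing rules are illustrative assumptions, not the Wisconsin Package's own parser.

```python
import re

# Attribute names come from the manual; the parsing rules are an assumption.
ATTR = re.compile(r"(begin|end|weight):\s*(\S+)", re.IGNORECASE)

def parse_list_file(text):
    """Return one dict per sequence entry found below the '..' divider line."""
    entries = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Treat everything up to the line ending in '..' as comment text.
            if stripped.endswith(".."):
                in_body = True
            continue
        if not stripped or set(stripped) == {"/"}:
            continue  # skip blank lines and the ///// elision marker
        name = stripped.split()[0]
        attrs = {key.lower(): value for key, value in ATTR.findall(stripped)}
        entries.append({
            "name": name,
            "begin": int(attrs["begin"]) if "begin" in attrs else None,
            "end": int(attrs["end"]) if "end" in attrs else None,
            # A weight of 1.0 is assumed when none is given.
            "weight": float(attrs.get("weight", 1.0)),
        })
    return entries

example = """\
A multiple sequence alignment represented as a list file for input to
the program PROFILEMAKE.
5/3/90   ..

fa10.ugly    begin: 201       end: 250       weight: 0.5
e.ugly       begin: 201       end: 250
"""
for entry in parse_list_file(example):
    print(entry)
```

The second entry carries no weight: attribute, so the sketch falls back to 1.0, matching the default described above.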

CALCULATING THE PROFILE [ Previous | Top | Next ]

Similarity Scores

In a scoring matrix, a score can be found for the comparison of any two sequence symbols. (See Appendix VII for more information.) Given a group of aligned sequences, a score can be calculated for the comparison of a symbol to each position of the aligned sequences. This comparison score differs from position to position in the aligned sequences, because each position contains a different spectrum of sequence symbols. The overall score is, in a sense, the average of the comparison scores for the sequence symbols found at a particular aligned sequence position.

Each row of a profile contains the scores for a comparison of the corresponding position of a multiple sequence alignment to each possible sequence symbol. For example, if a profile is made from a group of aligned protein sequences, the 10th row of the profile has values for the comparison of the 10th position in the alignment to each possible amino acid. The profile has as many rows as there are positions in the alignment, and each row has as many comparison scores as there are amino acid symbols. Thus, the profile is a position-specific scoring matrix for every position in a multiple sequence alignment.
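The idea of a profile row can be sketched in a few lines of Python. This is a simplified illustration only: it uses a made-up three-letter scoring matrix (not blosum62.cmp), plain linear weighting, and no sequence weights or gap handling, so the numbers do not correspond to real ProfileMake output. The names TOY_MATRIX, score, and profile_row are hypothetical.

```python
from collections import Counter

# Toy 3-letter scoring matrix, made up for illustration (not blosum62.cmp).
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "L"): -1, ("A", "V"): 0,
    ("L", "L"): 4, ("L", "V"): 1,
    ("V", "V"): 4,
}

def score(a, b):
    """Look up a symmetric pair score in the toy matrix."""
    return TOY_MATRIX.get((a, b), TOY_MATRIX.get((b, a)))

def profile_row(column, alphabet="ALV"):
    """Score every symbol in `alphabet` against one alignment column.

    Each observed symbol contributes its matrix score, weighted by the
    fraction of the column it occupies (linear weighting).
    """
    counts = Counter(column)
    total = sum(counts.values())
    return {
        candidate: sum(n / total * score(candidate, seen)
                       for seen, n in counts.items())
        for candidate in alphabet
    }

# One alignment position holding three L's and one V.
row = profile_row("LLLV")
print(row)
```

Repeating profile_row for every column of an alignment would give one row per position, which is the table shape described above.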

The consensus sequence character is the symbol with the largest value in each row of the profile. It is used solely for the display of alignments and not for the calculation of the optimal alignment between a profile and a sequence.

The last row of the profile contains the composition for the whole profile. In the A column, for instance, the total number of A's in the multiple sequence alignment is shown.
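The consensus character and the composition row can be illustrated with a short Python sketch. The row values and the three-sequence alignment are invented for the example, and the sketch assumes unweighted counts and an arbitrary tie-break (the first symbol wins); the manual does not say how ProfileMake handles either point.

```python
from collections import Counter

def consensus_char(row):
    """Return the symbol with the largest score in one profile row."""
    return max(row, key=row.get)

def composition(aligned_sequences, gap_chars=".~"):
    """Count every non-gap symbol across the whole alignment."""
    counts = Counter()
    for seq in aligned_sequences:
        counts.update(ch for ch in seq.upper() if ch not in gap_chars)
    return counts

row = {"A": -0.75, "L": 3.25, "V": 1.75}   # illustrative row values
print(consensus_char(row))
print(composition(["MLV.A", "MLVAA", "ML~~A"]))
```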

Sequence Symbol Weights

As stated above, the comparison score of an alignment position and a given sequence symbol is an average of the comparison scores for the different sequence symbols at that position. This average is weighted so that a symbol's weight in the calculation of the average score increases along with its fraction of the symbols at that position. Two types of weighting are currently used. Linear weighting (chosen with the command-line parameter -NOLOGwgt) gives a weight to each symbol that is directly proportional to the number of occurrences of that symbol at a given position. The default logarithmic weighting gives a symbol that predominates at a given position a disproportionately higher weight than a symbol that occurs only once. This causes positions in the aligned sequences that have many identical residues to bias the profile more strongly towards the identical residues than when linear weighting is used.

Using either kind of weighting, the weight for a residue is 0 when that residue does not occur at a given position; the weight is 1 when only that residue is found at a given position.
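To make the contrast between the two weighting schemes concrete, here is a small Python sketch. The linear form follows directly from the description above. The logarithmic form is an assumption: it is one function that gives 0 for an absent residue and 1 for a residue that fills the position, and this manual does not state whether ProfileMake uses exactly this expression.

```python
import math

def linear_weight(count, total):
    """Weight directly proportional to the symbol's share of the column."""
    return count / total

def log_weight(count, total):
    """An assumed logarithmic form: ln(1 - n/(N+1)) / ln(1/(N+1)).

    It yields 0 when count == 0 and 1 when count == total.
    """
    return math.log(1 - count / (total + 1)) / math.log(1 / (total + 1))

total = 4   # four aligned sequences at this position
for count in range(1, total + 1):
    print(count, round(linear_weight(count, total), 3),
          round(log_weight(count, total), 3))
```

Under this assumed form, a residue seen three times out of four receives about four times the weight of a residue seen once, compared with exactly three times under linear weighting.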

If the number of aligned sequences is fairly small, the sequence symbols observed at each position of the alignment may not represent the whole spectrum of symbols that would be observed if more sequences were available. In these cases, even residues that are not observed at a given position in the alignment should perhaps be given a small weight. For nucleic acids, non-observed bases are given a weight of 0 by default. The default for proteins is to give non-observed amino acids a weight equal to 0.025 divided by the sum of the sequence weights. The -STRINgent command-line parameter gives non-observed sequence symbols a weight of 0.
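The rule for non-observed symbols can be written as a short Python function. The function name and argument names are hypothetical; the values 0.025 and 0 come from the paragraph above.

```python
def non_observed_weight(sequence_weights, is_protein, stringent=False):
    """Weight given to a symbol that never occurs at an alignment position."""
    if stringent or not is_protein:
        return 0.0   # -STRINgent, or the nucleic acid default
    return 0.025 / sum(sequence_weights)

# 28 protein sequences, each with the default weight of 1.0.
print(non_observed_weight([1.0] * 28, is_protein=True))
```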

Gap Coefficients

The profile also includes position-specific gap coefficients, expressed as percentages. The gap coefficient determines the penalty that an alignment must pay in order to create a gap, and the gap length coefficient determines the penalty that must be paid in order to extend a gap. The actual gap penalties are calculated by multiplying the position-specific gap coefficients by the gap penalties specified when running the other Profile programs.

All gaps in the aligned sequences that overlap are treated as a single gap for purposes of calculating gap coefficients. The gap is considered to begin at the position of the leftmost gap character (. or ~) in any of the sequences, and to end at the rightmost gap character. The position-specific gap coefficients are reduced from 100 percent as a function of the longest gap through the position of interest in the aligned sequences. The gap coefficient G and gap length coefficient L are calculated as



G = C_G x R_G / ( 1 + (R_L x GapLength) )

L = C_L x R_G / ( 1 + (R_L x GapLength) )

where GapLength is the length of the gap as defined above. GAPCoefficient (C_G), LENGTHCoefficient (C_L), GAPRatio (R_G), and LENGTHRatio (R_L) have default values of 100, 100, 0.33, and 0.1, respectively, but can be changed by optional parameters entered on the command line (see the COMMAND-LINE SUMMARY topic below).
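The gap calculation can be sketched in Python under a few stated assumptions. Overlapping gaps from different sequences are merged into one interval, and every column inside a merged gap is assigned that gap's full length. Columns with no gap keep the full coefficients. The gap length coefficient is assumed to use LENGTHCoefficient where the gap coefficient uses GAPCoefficient, as the -GAPRatio description under OPTIONAL PARAMETERS indicates. Gaps that merely abut without overlapping are kept separate here, which is a guess. All function names are hypothetical.

```python
def gap_intervals(seq, gap_chars=".~"):
    """Yield (start, end) half-open column ranges of gap runs in one sequence."""
    start = None
    for i, ch in enumerate(seq):
        if ch in gap_chars:
            if start is None:
                start = i
        elif start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(seq))

def merged_gap_lengths(aligned_sequences):
    """Return, per column, the length of the merged gap covering it (0 if none)."""
    width = len(aligned_sequences[0])
    intervals = sorted(iv for seq in aligned_sequences
                       for iv in gap_intervals(seq))
    merged = []
    for start, end in intervals:
        if merged and start < merged[-1][1]:      # overlaps the previous gap
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    lengths = [0] * width
    for start, end in merged:
        for col in range(start, end):
            lengths[col] = end - start
    return lengths

def gap_coefficients(gap_length, c_g=100.0, c_l=100.0, r_g=0.33, r_l=0.1):
    """Return (G, L) for a column; gap-free columns keep the full coefficients."""
    if gap_length == 0:
        return c_g, c_l
    factor = r_g / (1 + r_l * gap_length)
    return c_g * factor, c_l * factor

alignment = ["MKV..LA", "MK...LA", "MKVAALA"]   # invented toy alignment
for col, length in enumerate(merged_gap_lengths(alignment)):
    print(col, length, gap_coefficients(length))
```

In the toy alignment the two-column gap in the first sequence overlaps the three-column gap in the second, so columns 2 through 4 are all treated as lying in a single gap of length 3.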

You can edit the profile with a text editor and change the gap coefficients to any values you wish.

CONSIDERATIONS [ Previous | Top | Next ]

If you edit a profile, the "Length:" entry in the profile heading must agree with the actual length of the profile (number of rows).
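A quick consistency check after hand-editing could look like the following Python sketch. It assumes the layout shown in the OUTPUT topic: a Length: field in the heading, a column-header line beginning with "Cons", one line per profile row, and a final composition line beginning with "*". Skipping lines that begin with "!" is a further assumption. The toy profile text is invented for the example and is not real ProfileMake output.

```python
import re

def check_profile_length(text):
    """Return (declared Length: value, number of profile rows counted)."""
    declared = int(re.search(r"Length:\s*(\d+)", text).group(1))
    rows = 0
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_table:
            # The table starts on the line after the "Cons" column header.
            in_table = stripped.startswith("Cons")
            continue
        if not stripped or stripped.startswith("!"):
            continue   # assumed: blank and comment lines are not rows
        if stripped.startswith("*"):
            break      # composition row ends the table
        rows += 1
    return declared, rows

toy_profile = """\
(Peptide) PROFILEMAKE v4.50 of: toy.msf{*}  Length: 2
Cons A    L  Gap  Len  ..
 M   -1    2  100  100
 L   -1    4  100  100
 *    0    3
"""
declared, rows = check_profile_length(toy_profile)
print("declared:", declared, "rows:", rows)
```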

If you create a profile from a single peptide sequence, you should use the -STRINgent command-line parameter to give a weight of 0 to all symbols not occurring at each position in the sequence.

COMMAND-LINE SUMMARY [ Previous | Top | Next ]

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % profilemake [-INfile=]hsp70.msf{*} -Default

Prompted Parameters:

[-OUTfile=]hsp70.prf   name of output file containing profile

Local Data Files:

-MATRix=blosum62.cmp     scoring matrix for peptides
-MATRix=profiledna.cmp   scoring matrix for nucleic acids

Optional Parameters:

-BEGin=1                 sets the beginning position in the aligned
                           sequences
-END=718                 sets the ending position in the aligned
                           sequences
-WEIGHT=1                sets the weight for all input sequences
-GAPCoefficient=100      sets the maximum gap creation penalty in a
                           region WITH NO gaps
-LENGTHCoefficient=100   sets the maximum gap extension penalty in a region
                           WITH NO gaps
-GAPRatio=0.33           GAPRatio multiplied by GAPCoefficient (or by
                           LENGTHCoefficient) sets the maximum gap creation
                           (or extension) penalty in a region WITH gaps
-LENGTHRatio=0.1         determines how rapidly gap creation and extension
                           penalties decrease with increasing gap size
-NOLOGwgt                linear weighting for symbols is used to produce
                           the profile score.  The default is exponential
                           weighting
-STRINgent               symbols not occurring at a particular position
                           in aligned sequences are given a weight of 0
-SEQout[=pretty.pep]     writes the consensus into a sequence file

ACKNOWLEDGMENT [ Previous | Top | Next ]

Profile analysis was first described in 1987 by Michael Gribskov, Andrew McLachlan and David Eisenberg (Proc. Natl. Acad. Sci. USA 84; 4355-4358). Other recent publications describing profile technology are referenced at the end of the Profile Analysis Essay above. The profile programs in the Wisconsin Package were developed and communicated to us by Dr. Gribskov.

LOCAL DATA FILES [ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Local Scoring Matrices

This program reads one or more scoring matrices for the comparison of sequence characters. The program automatically reads the program default scoring matrix file in a public data directory unless you either 1) have a data file with exactly the same name as the program default scoring matrix in your current working directory; or 2) have a data file with exactly the same name as the program default scoring matrix in the directory with the logical name MyData; or 3) name a file on the command line with an expression like -MATRix=mymatrix.cmp. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see "Using a Special Kind of Data File: A Scoring Matrix" in Chapter 4, Using Data Files in the User's Guide.
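The search order for a matrix named without a directory can be sketched as a simple first-match lookup. In this Python illustration the four directories are temporary stand-ins for the working directory, MyData, GenMoreData, and GenRunData; how the real logical names resolve to paths depends on the installation and is not modeled. The function name find_matrix is hypothetical.

```python
import os
import tempfile

def find_matrix(name, search_dirs):
    """Return the first existing path for `name`, trying directories in order."""
    if os.path.dirname(name):
        # A directory was given explicitly, so no search is performed.
        return name if os.path.isfile(name) else None
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None

# Stand-in directories for: working directory, MyData, GenMoreData, GenRunData.
with tempfile.TemporaryDirectory() as root:
    dirs = [os.path.join(root, d)
            for d in ("cwd", "mydata", "genmoredata", "genrundata")]
    for d in dirs:
        os.mkdir(d)
    for d in dirs[1:3]:   # place a copy in MyData and GenMoreData only
        open(os.path.join(d, "mymatrix.cmp"), "w").close()
    found = find_matrix("mymatrix.cmp", dirs)
    print(os.path.basename(os.path.dirname(found)))
```

Because the MyData stand-in comes before the GenMoreData stand-in in the list, its copy of the file is the one returned.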

ProfileMake reads a scoring matrix file called blosum62.cmp for peptide alignments or profiledna.cmp for nucleotide alignments. The peptide scoring matrix is based on substitutions between amino acid pairs in ungapped blocks of aligned protein segments as measured by Henikoff and Henikoff. The nucleotide scoring matrix has 10 for matches, -6 for mismatches, and intermediate positive values for overlaps between IUPAC-IUB ambiguity symbols. All comparisons to four-way ambiguity symbols N, X, or gap (. or ~) are given a value of 0. Read the header of the matrix files for more information about their construction. (See Appendix VII for more information about scoring matrices.)

OPTIONAL PARAMETERS [ Previous | Top | Next ]

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-MATRix=mymatrix.cmp

allows you to specify a scoring matrix file name other than the program default. If you don't include a directory specification when you name a file on the command line with -MATRix, the program searches for the file first in your local directory, then in the directory with the logical name MyData, then in the public data directory with the logical name GenMoreData, and finally in the public data directory with the logical name GenRunData. For more information see the Local Scoring Matrices topic above.

-BEGin=1

sets the beginning position for all input sequences. When the beginning position is set from the command line, ProfileMake ignores beginning positions specified for individual sequences in a list file.

-END=100

sets the ending position for all input sequences. When the ending position is set from the command line, ProfileMake ignores ending positions specified for sequences in a list file.

-WEIGHT=1.0

sets the sequence weight for all input sequences. When the weight is set from the command line, ProfileMake ignores weights specified for individual sequences in a list file, a multiple sequence format (MSF) file, or a rich sequence format (RSF) file.
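The precedence among the places a sequence weight can come from can be summarized in a short Python sketch: the command line overrides a list file, a list file overrides an MSF or RSF file, and 1.0 is used when nothing is specified. The function and argument names are hypothetical.

```python
def effective_weight(command_line=None, list_file=None, alignment_file=None):
    """Pick the sequence weight from the highest-priority source that set one."""
    for value in (command_line, list_file, alignment_file):
        if value is not None:
            return value
    return 1.0   # default when no weight is specified anywhere

print(effective_weight(list_file=0.5, alignment_file=2.0))
print(effective_weight(command_line=1.0, list_file=0.5))
print(effective_weight())
```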

-GAPCoefficient=100

sets the maximum gap coefficient for the profile. This coefficient is expressed as a percentage and has a default maximum value of 100 percent. This value is found in each row of the profile where the corresponding alignment has no gaps at all. The gap coefficient is reduced from 100 percent at positions in the alignment that have gaps. In the other profile programs, the gap coefficient in each row of the profile is multiplied by an interactively specified gap creation penalty to calculate the penalty for creating a gap at that position.

-LENGTHCoefficient=100

sets the maximum gap length coefficient for the profile. This coefficient is expressed as a percentage and has a default maximum value of 100 percent. This value is found in each row of the profile where the corresponding alignment has no gaps at all. The gap length coefficient is reduced from 100 percent at positions in the alignment that have gaps. In the other profile programs, the gap length coefficient in each row of the profile is multiplied by an interactively specified gap extension penalty to calculate the penalty for extending a gap at that position.

-GAPRatio=0.33

is used to calculate the gap and gap length coefficients for a row of the profile where the multiple sequence alignment has gaps. GAPRatio multiplied by GAPCoefficient is approximately equal to the maximum gap coefficient in a region with gaps. Similarly, GAPRatio multiplied by LENGTHCoefficient is approximately equal to the maximum gap length coefficient in a region with gaps.

-LENGTHRatio=0.1

determines how rapidly the gap coefficient and gap length coefficient decrease with increasing gap size. With a gap of length GapLength, both of these coefficients decrease from their maximum values by a factor of



GAPRatio / ( 1 + (LENGTHRatio x GapLength) )

-NOLOGwgt

uses linear weighting of the residues at each position in the aligned sequences. The weight of each residue is directly proportional to the number of times the residue occurs at a given position in the aligned sequences. The default is exponential weighting that causes positions in the aligned sequences with many identical residues to bias the profile more strongly towards the identical residues than does linear weighting.

-STRINgent

gives a weight of 0 to all symbols not occurring at a given position in the aligned sequences. This is the default for nucleic acids. For proteins, residues not occurring at a position in the aligned sequences are given a small weight by default.

-SEQout=hsp70.pep

writes the consensus from the profile into a new sequence file. This sequence output file is written in addition to the file with the profile. The sequence file can be named by you on the command line or ProfileMake gives it the same name as the profile, but with the extension .seq for DNA or .pep for protein.
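The default naming rule for the consensus sequence file amounts to swapping the profile's extension, as in this small Python sketch (the function name is hypothetical):

```python
from pathlib import Path

def default_seqout_name(profile_name, is_protein):
    """Same base name as the profile, with .pep for protein or .seq for DNA."""
    return str(Path(profile_name).with_suffix(".pep" if is_protein else ".seq"))

print(default_seqout_name("hsp70.prf", is_protein=True))
```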

Printed: November 18, 1996 13:04 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.