PROFILESEARCH

[ Program Manual | User's Guide | Data Files | Databases ]

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
NORMALIZATION OF SCORES
CONSIDERATIONS
SUGGESTIONS
PROFILE FILE FORMAT
COMMAND-LINE SUMMARY
LOCAL DATA FILES
OPTIONAL PARAMETERS

FUNCTION

[ Top | Next ]

ProfileSearch uses a profile (representing a group of aligned sequences) as a query to search the database for new sequences with similarity to the group. The profile is created with the program ProfileMake.

DESCRIPTION

[ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

Using the method of Gribskov, et al. (Methods in Enzymology, 183; 146-159 (1989)), ProfileSearch accepts a profile from ProfileMake and uses it to search a database (or any set of sequences you specify) for sequences that are similar to the aligned probe sequences used to create the profile. The algorithm calculates the score (quality) of the optimal alignment between the profile and each sequence in the database and creates a list of all of the sequences in the database with an alignment score above some threshold. The results of ProfileSearch are corrected for compositional effects of the sequence and for systematic effects of the sequence length on the score. The output list can be displayed as optimal alignments with ProfileSegments.

The gap creation and gap extension penalties specified for ProfileSearch are maximum values. The actual position-specific gap penalties at any position are determined by multiplying the gap creation penalty by the percent value in the second to the last column of the profile, and the gap extension penalty by the percent value in the last column of the profile.

ProfileSearch does a lot of computing so you will probably want to run it in the batch queue (see the CONSIDERATIONS topic below).

EXAMPLE

[ Previous | Top | Next ]

Here is a session using ProfileSearch to search through the entire protein sequence database with a profile generated from a group of aligned 70 kd heat shock and heat shock cognate protein sequences. The profile was generated in the example session from the ProfileMake program.


% profilesearch

  PROFILESEARCH with what query profile ?  hsp70.prf

  "hsp70.prf" is a profile of length: 718

  Search for query in what sequences(s) (* SwissProt:* *) ?

  What is the gap creation penalty (* 24.0 *) ?

  What is the gap extension penalty (* 0.27 *) ?

  What should I call the output file (* hsp70.pfs *) ?

       1 Sequences          924 aa searched      SW:104K_THEPA
     501 Sequences      194,360 aa searched      SW:A85A_MYCTU
   1,001 Sequences      411,750 aa searched      SW:ACT2_ARTSX

  ////////////////////////////////////////////////////////////

  49,001 Sequences   18,299,815 aa searched      SW:YTRE_PSESS
  49,501 Sequences   18,428,958 aa searched      SW:ZN23_HUMAN

 Writing results...

              Sequences searched: 49571

   CPU time (seconds) for search: 4950.63 sec.
        Total CPU time (seconds): 4991.51 sec.
            Output file: hsp70.pfs

%

OUTPUT

[ Previous | Top | Next ]

Here is some of the output file:


(Peptide) PROFILESEARCH of: hsp70.prf Length: 718 to: SwissProt:*

	 Scores are corrected for composition effects

	         Gap Weight: 24.00
	  Gap Length Weight: 0.27
	 Sequences Examined: 49571
	 CPU time (seconds): 4950
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Profile information:
(Peptide) PROFILEMAKE v4.50 of: hsp70.msf{*}  Length: 718
  Sequences: 28  MaxScore: 2172.36  October 11, 1996 11:41
                          Gap: 1.00              Len: 1.00
                     GapRatio: 0.33         LenRatio: 0.10
         hsp70.msf{Hs70_Plafa}  From: 1         To: 718       Weight: 1.00
         hsp70.msf{Hs70_Thean}  From: 1         To: 718       Weight: 1.00 . . .
*    *    *    *    *    *    *    *    *    *    *    *    *    *    *    *
Normalization: 					October 11, 1996 16:28

	 Curve fit using 40 length pools
	 2 of 42 pools were rejected

	 Normalization equation:

		 Calc_Score = 519.50 * ( 1.0 - exp(-0.0026*SeqLen + 0.0812) )

	 Correlation for curve fit: 0.280

	 Z score calculation:
	 Average and standard deviation calculated using 49397 scores
	 174 of 49571 scores were rejected

		 Z_Score = ( Score/Calc_Score - 1.060 ) / 0.227

          Sequence  Strd ZScore   Orig Length ! Documentation  ..
SW:HS7B_DROME         +   17.49 240.64     68 ! P11146 drosophila melanogas ...
SW:HS7C_HUMAN         +   17.15 2056.48    646 ! P11142 homo sapiens (human ...
SW:HS7C_RAT           +   17.15 2056.44    646 ! P08109 rattus norvegicus ( ...

/////////////////////////////////////////////////////////////////////////// ...

SW:NXL4_NAJNA         +    2.50  83.91     71 ! P25672 naja naja (indian co ...
SW:YRL5_MYCCA         +    2.50  83.90     71 ! P43040 mycoplasma capricolu ...
SW:UBIL_CAEBR         +    2.50  81.89     70 ! Q07371 caenorhabditis brigg ...

The output file from ProfileSearch is an ordered list of the sequences with the highest alignment scores when compared to the profile. Unless the normalization procedure is disabled or fails, the file is ordered according to the Z scores (see the NORMALIZATION OF SCORES topic below).

The documentation section of the file (before the sequence listing) is divided by rows of asterisks into three parts. The first section describes the conditions used in running ProfileSearch, the number of sequences examined, and the amount of CPU time used. The second section reports the documentary information from the profile. The third section of the documentation records the process of the normalization (see the NORMALIZATION OF SCORES topic below).

INPUT FILES

[ Previous | Top | Next ]

ProfileSearch requires a profile as one of its input files. You can create profiles from aligned sequences by means of the ProfileMake program. In the ProfileDir directory, GCG provides a large number of amino acid profiles derived from the PROSITE database.

ProfileSearch accepts multiple (two or more) sequences of the same type as the search set. (They should be of the same type as the sequences that were used to create the profile.) You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of ProfileSearch depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.

RELATED PROGRAMS

[ Previous | Top | Next ]

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

RESTRICTIONS

[ Previous | Top | Next ]

We have little experience using nucleotide sequences with profile analysis.

Because of memory constraints, searches are currently limited to a maximum of 100,000 protein sequences and 50,000 nucleic acid sequences (because both forward and reverse strands are searched). Therefore, you cannot search the entire nucleic acid sequence database at one time!. If you attempt to search more than 50,000 nucleotide sequences, ProfileSearch reports the results of first 50,000 sequences searched, as well as the name of the last sequence searched. Because of memory constraints, only the first 350,000 sequence symbols in each sequence are searched.

ALGORITHM

[ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

NORMALIZATION OF SCORES

[ Previous | Top | Next ]

The scores for comparison of sequences and a profile are systematically correlated with the lengths of the sequences. ProfileSearch corrects the observed alignment scores for these systematic affects of sequence length to give normalized scores.

The relationship between the length of a sequence (SeqLen) and the observed score for comparison of the profile and sequence (Score) is usually very close to


Score = C * (1 - e(A*SeqLen+B))

where A, B, and C are some constants. These constants can be empirically determined by fitting the alignment scores of the profile and the search set sequences to the above equation.

The scores for each sequence are sorted by the sequence length and pooled together in groups. Each pool contains at least 50 sequences (20 sequences if there are fewer than 1,500 total scores to normalize), but if the sequences are very similar in length, the program continues to add sequences to the pool until a sequence is encountered that is more than 50 residues longer than the shortest sequence in the pool. For each pool, the mean length, score, and the standard deviation of the scores are determined. If a pool contains more than the minimum number of sequences (50 or 20), these calculations are derived from an evenly-spaced sample of 50 (or 20) sequences from the pool rather than from all sequences in the pool.

The best fit of these values to the above equation is determined by non-linear fitting by the Levenberg-Marquardt method. The fit is weighted by the standard deviations of the pools; pools that contain values that represent true similarity between the sequence and profile have large standard deviations and, therefore, make relatively small contributions to the fit.

The assumption of the normalization is that the alignment scores being normalized represent sequences unrelated to the profile. Because there are usually some sequences that are related to the profile, a second pass of normalization is made. In the second pass, the average score of each length pool is compared to the calculated score for a sequence of the same length. Pools whose average scores differ from their expected scores by more than 5.0 times the average standard deviation of all pools are eliminated from consideration. This effectively eliminates most pools that contain many sequences related to the profile, since these pools have average scores with large deviations from the predicted scores. The curve fitting procedure is now repeated and new values for A, B, and C calculated.

Each normalized score (Score(N)) is then calculated


Score(N) = Score(orig) / (C * (1 - e(A*SeqLen+B)))

The average (Score(N)(avg)) and standard deviation (SD(N)) of all normalized scores are then calculated. Normalized scores whose original scores differ from the calculated score for a sequence of the same length by more than 5.0 times the average standard deviation of all length pools are omitted from this calculation. This ensures that sequences similar to the profile will not affect the calculation of the mean and standard deviation.

The Z scores are the differences in standard deviation units between each normalized comparison score and the mean normalized comparison score for sequences unrelated to the profile. Therefore, a Z score of 5.0 means that a comparison score is significant at the 5.0 sigma level. The Z score is calculated as


ZScore = (Score(N) - Score(N)(avg)) / SD(N)

The Z score has a mean of 0.0 and a standard deviation of 1.0.

The third section of the documentation in the ProfileSearch output file records the process of the normalization described above. First, the number of length pools used in the normalization, and the number of pools rejected because of high standard deviation are reported. The empirically derived equation of the curve is then presented, along with the correlation coefficient for the agreement between the calculated curve and the observed results. This value is typically about 0.95. If it is much lower (e.g., less than 0.90), the reported Z scores are not accurate. Finally, the equation used for the calculation of Z scores is reported, along with the number of scores omitted from the calculation of the normalized mean and standard deviation.

Optionally, a file recording the observed and calculated scores for the length pools used in the normalization procedure may be produced with the command line parameter -FITfile. For each pool used in the calculation, the file contains the average and standard deviation of the lengths and observed scores, and the calculated score for a sequence of the average length of the pool.

Failure of Normalization

The normalization procedure described above appears to be robust, but it must be remembered that it is based on an empirically derived description of the distribution of scores. In particular, this normalization may not be appropriate for very long profiles with large numbers of rows with low gap creation penalties. It also fails if the sequences in the search set are all very similar in length.

If the first pass of the curve fitting procedure is unsuccessful, the Z scores are all reported as zero. If the first pass is successful, but the second pass fails, Z scores are calculated and the entries in the output file are sorted based on the Z scores; however, a warning message is produced in the documentation of the output. If fewer than 400 entries are saved in the output or if fewer than three length pools are present, normalization will not be attempted and all Z scores will be reported as zero.

CONSIDERATIONS

[ Previous | Top | Next ]

ProfileSearch attempts to choose default maximum gap creation and extension penalties that are appropriate for the query profile it reads. You can use -GAPweight and -LENgthweight or respond to the program prompts to specify alternative maximum gap penalties if you don't want to accept the default values.

ProfileSearch requires the computer to make a calculation that is proportional to the product of the profile length times the length of the database. The example search with a profile of length 718 tested against the whole SWISSPROT protein sequence database used almost one and a half hours of CPU time. Searches of the nucleotide sequence database with profiles prepared from nucleotide sequence alignments may take substantially longer because of the larger size of the database and the necessity to search both forward and reverse strands of each database sequence.

Since ProfileSearch can take such a long time to run, most searches should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using the command-line parameter -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide.

ProfileSearch/ProfileSegments finds only the best fit of the profile to any sequence. Be aware that other regions with a lower degree of similarity to the profile may also exist in the same sequence, especially in nucleic acid sequences.

SUGGESTIONS

[ Previous | Top | Next ]

Because ProfileSearch can search a maximum of 50,000 nucleotide sequences at one time (see the RESTRICTIONS topic above), GenEMBL must be divided into smaller parts in order to search the entire nucleic acid sequence database. Each division can contain several of the GenEMBL subsections specified in a list file. (See Chapter 2, Using Sequence Files and Databases in the User's Guide for help in naming GenEMBL subsections in a list file.)

PROFILE FILE FORMAT

[ Previous | Top | Next ]

Look at the entry for ProfileMake for a description and an example of what profile files look like.

COMMAND-LINE SUMMARY

[ Previous | Top | Next ]

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % profilesearch [-INfile1=]hsp70.prf -Default

Prompted Parameters:

[-INfile2=]SW:*          the search set
-GAPweight=4.50          maximum position-specific gap creation penalty
-LENgthweight=0.05       maximum position-specific gap extension penalty
[-OUTfile=]hsp70.pfs     output file name

Local Data Files: None

Optional Parameters:

-LIStsize=100           sets a limit to the size of the output list
                             (100,000 is the default)
-SEQLimit=100000        maximum number of sequences to search
                             (can't be set higher than 100,000)
-MINList=2.5            lowest Z score to report in output list.
-MINSeq=50              minimum length sequence to examine in the search
-MINNorm=50             minimum length sequence to use in normalization
-NONORmalize            turns off normalization of comparison scores
-NOAVErage              does not adjust quality scores for sequence
                             composition
-NOSEArch               normalizes a pre-existing file of search scores
-FITfile[=hsp70.fit]    output file with curve fitting information for
                             normalized scores
-CPUlimit=1000          limits the search to 1,000 seconds (default
                             is 86,400)
-SINce=6.90             limits search to sequences dated on or after
                             June 1990 (does not work for PIR databases)
-NOMONitor              suppresses the screen trace for the search set
                             sequences
-BATch                  submits program to the batch queue

LOCAL DATA FILES

[ Previous | Top | Next ]

None.

OPTIONAL PARAMETERS

[ Previous | Top | Next ]

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-LIStsize=100

sets a limit to the number of entries in the output list. The default is the maximum number of sequences that can be searched (currently 100,000 protein sequences, 50,000 nucleic acid sequences).

-SEQLimit=1000

sets a limit on the number of sequences to search. The value cannot be set higher than the default value of 100,000.

-MINList=2.5

sets the lowest Z score for an entry to be reported in the output list. If Z scores aren't calculated, then all entries (up to the maximum number determined by the -LIStsize parameter) are reported. The default reports an entry if its score is 2.5 standard deviations above the expected score for a sequence of the same length.

-MINSeq=50

sets the minimum length for a sequence to be searched.

-MINNorm=50

set the minimum length of a sequence to be used in the normalization.

-NONORmalize

turns off the normalization of alignment scores. The listed sequences are sorted by their original comparison scores, rather than the scores adjusted for systematic effects of sequence length. In the output file, all Z scores are 0.

-NOAVErage

turns off the adjustment of scores for sequence composition. In the default ( -AVErage), a score due to the similarity in composition between the profile and sequence of interest is subtracted from the original alignment score.

-NOSEArch

turns off the database search and normalizes a previously existing file of ProfileSearch scores, instead. The normalization occurs only if the existing file contains over 400 saved entries.

-FITfile=pretty.fit

writes an output file containing the observed and calculated scores for the length pools used in the normalization procedure.

-CPUlimit=1000

sets a limit in seconds beyond which the search is aborted. The limit defaults to 86,400 seconds (24 hours). ProfileSearch reports the results for the sequences searched before the time limit is exceeded.

-SINce=6.90

limits the search to sequences that have been entered into the database or modified since June 1990. As this is being written, only the EMBL, GenBank, and SWISS-PROT databases support this parameter.

-MONitor=100

monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 100 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.


-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: November 18, 1996 13:05 (1162)

[ Program Manual | User's Guide | Data Files | Databases ]


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc. a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com