PROFILESCAN

[ Program Manual | User's Guide | Data Files | Databases ]

Table of Contents

FUNCTION

DESCRIPTION

FUNCTION [ Top | Next ]

ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.

DESCRIPTION [ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileScan uses the method of Gribskov et al. (CABIOS 4(1); 61-66 (1988)) to find structural and sequence motifs in protein sequences. These motifs are represented as profiles in a library. ProfileScan aligns each profile motif to the sequence, and displays all alignments between the profile and sequence that have a normalized score above a set threshold. Because more than one alignment between a sequence and a particular motif can be found, each repeat of a duplicated structure (such as the zinc finger motif) can be presented.

EXAMPLE [ Previous | Top | Next ]

Here is a session using ProfileScan to search for known structural motifs in the sequence Ygbyad from the PIR database:


% profilescan

 PROFILESCAN of what sequence(s) ?  PIR:Ygbyad

                  Begin (* 1 *) ?
                End (*  1392 *) ?

 What profile library (* profilescan.fil *) ?

 What should I call the alignment output file (* ygbyad.scan *) ?

 What should I call the summary output file (* ygbyad.sum *) ?

Beginning initial scan...
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 ........................................
 .........

Beginning multiple alignment to matching patterns...
.

%

OUTPUT [ Previous | Top | Next ]

Here is some of the .scan output file:


 PROFILESCAN of : ygbyad  check: 5237  from: 1  to: 1392

L-aminoadipate-semialdehyde dehydrogenase (EC 1.2.1.31) - yeast (Saccharomyces
 cerevisiae)
N;Alternate names: alpha-aminoadipate reductase; protein YBR0910; protein
 YBR115c
C;Species: Saccharomyces cerevisiae
C;Date: 31-Dec-1991 #sequence_revision 31-Dec-1991 #text_change 01-Sep-1995
C;Accession: JU0448; S48279; S45983; A25815; S37810; S25367; S34171; S44694
R;Morris, M.E.; Jinks-Robertson, S. . . .

 Compare to profile library: GenRunData:profilescan.fil

 ..
 -------------------------------------------------------------------------------
 Profile: profiledir:amp_binding.prf
   Gap weight:  4.50     Gap Length weight:   0.05
   Ave match:   0.12     Ave mismatch     :  -0.10
(Peptide) PROFILEMAKE v4.40 of: 0455.Msf2{*}  Length: 59
  Sequences: 28  MaxScore: 15.35  December 2, 1992  01:06
This profile is derived from PROSITE release 10.0 and has been tested
by a database search against SWISS-PROT release 26.0.  A comparison
of the SWISS-PROT annotation and the results of the database search
follows.  For further information about this motif, consult the . . .

Profile: profiledir:amp_binding.prf     alignment: 1

 Quality:  10.69       Gaps: 0
   Ratio:   0.21     Length: 51
 Normalized quality:  2.34
                  .         .         .         .         .
S    399 DHYKDTRTGVVVGPDSNPTLSFTSGSEGIPKGVLGRHFSLAYYFNWMSKR 448
         :. .:: :.....::. : | |||||:| |||||  | ::.   . ::::
P      7 EQSEDTETTQPDDPEDLAFIIFTSGTTGKPKGVMLTHKGVVNSVSSLSDR 56

S    449 F 449
         |
P     57 F 57

*****************************************
* Putative AMP-binding domain signature *
*****************************************

It has been shown [1 to 5] that a number of prokaryotic and eukaryotic enzymes
which all probably act via  an ATP-dependent  covalent binding of AMP to their
substrate, share a region of sequence similarity. These enzymes are:

//////////////////////////////////////////////////////////////////////////////

-Consensus pattern: [LIVMFY]-x(2)-[STG](2)-G-[ST]-[STEI]-[SG]-x-[PASLIVM]-K
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: 5.

-Note: in a majority of cases the residue that  follows  the Lys at the end of
 the pattern is a Gly.

-Last update: November 1995 / Pattern and text revised.

[ 1] Toh H.
     Protein Seq. Data Anal. 4:111-117(1991).
[ 2] Smith D.J., Earl A.J., Turner G.
     EMBO J. 9:2743-2750(1990).
[ 3] Schroeder J.
     Nucleic Acids Res. 17:460-460(1989).
[ 4] Mallonee D.H., Adams J.L., Hylemon P.B.
     J. Bacteriol. 174:2065-2071(1992).
[ 5] Turgay K., Krause M., Marahiel M.A.
     Mol. Microbiol. 6:529-546(1992).

//////////////////////////////////////////////////////////////////////////////

The .sum file lists the number of occurrences of each motif in the sequence of interest, the score for each occurrence, and the threshold score for that motif.

INPUT FILES [ Previous | Top | Next ]

ProfileScan takes as input one or more protein sequences. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. If ProfileScan rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS [ Previous | Top | Next ]

PileUp creates a multiple sequence alignment from a group of related sequences. LineUp is a multiple sequence editor used to create multiple sequence alignments. Pretty displays multiple sequence alignments.

ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences. ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using predetermined parameters to determine significance.

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns.
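To illustrate what a PROSITE pattern expresses, here is a minimal Python sketch that translates a pattern into a regular expression. It is not how Motifs is implemented: prosite_to_regex is a hypothetical helper and handles only a simplified subset of the pattern syntax. The pattern and the sequence fragment are taken from the sample .scan output in this document (the AMP-binding consensus pattern and residues 399 to 448 of Ygbyad).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified, hypothetical helper: supports single residues, x (any
    residue), [..] (allowed set), {..} (forbidden set), repeat counts
    such as (2) or (2,4), and the < and > anchors.
    """
    element_re = re.compile(
        r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?(>?)"
    )
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

# Consensus pattern and sequence fragment copied from the sample output.
pattern = "[LIVMFY]-x(2)-[STG](2)-G-[ST]-[STEI]-[SG]-x-[PASLIVM]-K"
fragment = "DHYKDTRTGVVVGPDSNPTLSFTSGSEGIPKGVLGRHFSLAYYFNWMSKR"
fragment_start = 399  # sequence position of the first residue in the fragment

regex = prosite_to_regex(pattern)
match = re.search(regex, fragment)
if match:
    print(regex)
    print(match.group(), fragment_start + match.start())
```

Unlike a profile, a pattern of this kind either matches or does not; there is no score or threshold involved.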

RESTRICTIONS [ Previous | Top | Next ]

Unknown.

ALGORITHM [ Previous | Top | Next ]

See the Profile Analysis Essay for an introduction to associating distantly related proteins and finding structural motifs.

ProfileScan acts similarly to ProfileGap to align the motif profile to a sequence. Unlike ProfileGap, all alignments with scores above a set threshold are displayed. The scores are normalized for systematic effects of sequence length on the score. Since the average normalized score for sequences unrelated to the profile is expected to be 1.0, the threshold can be viewed as the factor by which an alignment score must exceed the expected alignment score for unrelated sequences to be reported. For instance, if the threshold is set at 2.0, an alignment is reported if its normalized score is at least 2.0 times the expected score for sequences unrelated to the profile.

In practice, two possible thresholds, high and interesting, can be selected. The threshold values for each motif are present in the motif library file, profilescan.fil. The interesting level is usually set at 3.0 standard deviations above the mean score for sequences in the database unrelated to the profile, and the high level is usually set at the 5.0 to 6.0 standard deviation level. The default high threshold can be overridden with the -INTEResting command-line parameter. (See the entry for ProfileSearch in the Program Manual for a complete description of normalized scores.)
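The reporting rule can be sketched in a few lines of Python. This is only an illustration of the description above, not ProfileScan's own code: the names MotifThresholds and reported are hypothetical, the threshold values are made up (real values come from profilescan.fil), and the comparison assumes that a score equal to the threshold is reported.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MotifThresholds:
    """Per-motif thresholds on the normalized score (hypothetical container)."""
    high: float
    interesting: float

def reported(scores, thresholds, interesting=False):
    """Return the normalized scores that would be reported.

    The high threshold is used unless interesting=True, which mirrors
    the effect of the -INTEResting parameter.
    """
    cutoff = thresholds.interesting if interesting else thresholds.high
    return [score for score in scores if score >= cutoff]

# Made-up thresholds and scores, for illustration only.
thresholds = MotifThresholds(high=2.0, interesting=1.5)
scores = [2.34, 1.70, 0.95]

default_hits = reported(scores, thresholds)
interesting_hits = reported(scores, thresholds, interesting=True)
print(default_hits, interesting_hits)
```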

Validated Profiles

The motif library consists of validated profiles derived from aligned sequences known to contain each structural motif. A validated profile has the following properties: 1) all of the sequences used to create the profile correctly align to the profile; and 2) all sequences known to contain the motif score above the high threshold. The scores for these sequences are higher in every case than the scores for sequences known to lack the motif. Operationally, the process of creating a validated profile is as follows:

1) Each sequence known to contain the motif is aligned to the profile using ProfileGap. The alignment generated should correspond to the original alignment. If the alignments differ significantly, they are repeated with different gap creation and gap extension penalties until they agree.

2) Each motif profile is compared to all the sequences in the database using ProfileSearch. All sequences known to contain the motif represented by the profile should have higher scores than any sequences that lack the motif.

3) If the profile does not adequately discriminate between sequences with the motif and those without, and if changing the gap creation and gap extension penalties does not improve the discrimination, the alignments are examined by eye to determine why the sequences without the motif are giving high scores. The profile can then be edited by hand to reduce the scores in the profile at the positions that are contributing to the high scores of the sequences lacking the motif.
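The discrimination test in the validation procedure above amounts to checking that every sequence with the motif outscores every sequence without it. A minimal Python sketch of that check follows. The function name, the sequence names, and the scores are all hypothetical; in practice the scores would come from a ProfileSearch run.

```python
def discrimination_report(with_motif, without_motif):
    """Check whether a profile separates members from non-members.

    Both arguments map a sequence name to its score against the profile.
    The profile discriminates cleanly when every sequence known to
    contain the motif scores higher than every sequence lacking it.
    """
    lowest_member = min(with_motif.values())
    # Non-members that score at least as high as the weakest member.
    offenders = sorted(
        name for name, score in without_motif.items() if score >= lowest_member
    )
    return {
        "validated": not offenders,
        "lowest_member_score": lowest_member,
        "offenders": offenders,
    }

# Made-up scores for illustration.
members = {"seqA": 9.1, "seqB": 7.4, "seqC": 6.8}
non_members = {"seqX": 3.2, "seqY": 7.0}

report = discrimination_report(members, non_members)
print(report)
```

In this made-up case seqY would be the sequence whose alignment is examined by eye before the profile is edited.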

CONSIDERATIONS [ Previous | Top | Next ]

ProfileScan may report multiple occurrences of a motif profile in a protein sequence. The alignments may represent repeats of a duplicated structure, or they may represent distinct alignments between the motif profile and the same region of the protein sequence. These alternatives can be distinguished by looking at the alignments in the .scan file.
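As a first pass before reading the .scan file, alignments can be grouped by whether their sequence ranges overlap. The Python sketch below is a hypothetical helper, not part of ProfileScan. It assumes each alignment is given as inclusive (start, end) sequence coordinates. The range 399 to 449 is from the sample output; the other ranges are made up.

```python
def group_by_region(alignments):
    """Group alignments whose sequence ranges overlap.

    alignments is a list of (start, end) pairs, inclusive. Alignments in
    the same group cover a common region and may be alternative
    alignments; separate groups suggest repeats of the motif.
    """
    groups = []
    group_end = None
    for start, end in sorted(alignments):
        if groups and start <= group_end:
            groups[-1].append((start, end))
            group_end = max(group_end, end)
        else:
            groups.append([(start, end)])
            group_end = end
    return groups

# 399-449 is from the sample output; the other two ranges are made up.
regions = group_by_region([(900, 950), (399, 449), (405, 452)])
print(regions)
```

Overlap alone does not settle the question; it only shows which alignments to compare in the .scan file.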

COMMAND-LINE SUMMARY [ Previous | Top | Next ]

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % profilescan [-INfile=]PIR:Kihua -Default

Prompted Parameters:

-BEGin=1 -END=194           range of interest for sequence
-REVerse                    use reverse strand (nucleic acid only)
[-LIBrary=]profilescan.fil  profile library file
[-OUTfile=]kihua.scan       paired alignment output file name
[-SUMfile=]kihua.sum        summary output file name

Optional Parameters:

-INTEResting       reports scores higher than the INTERESTING threshold,
                     rather than the default HIGH
-NOAVErage         does not adjust quality score for sequence composition
-NOREFerence       suppresses printing the PROSITE abstract
-PAIr=1.0,0.5,0.1  thresholds for displaying '|', ':', and '.'
-BATch             submits the program to run in the batch queue

LOCAL DATA FILES [ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

ProfileScan reads a library file containing a list of each validated profile, the high and interesting thresholds, the gap creation and gap extension penalties for each profile, and the three constants A, B, and C used for length dependent normalization of scores. See the entry for ProfileSearch in the Program Manual for details on the calculation of these constants.

Any profile can be used by ProfileScan by including its file name and appropriate values in the library file. Values for the two thresholds and two gap penalties must be included for each profile added to the library file. If values for the three constants A, B, and C are omitted from the library file, the values 0.0, 0.0, and 1.0 are used, respectively.
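The required and optional values for a library entry can be modelled as a small Python record. This sketch does not reproduce the layout of profilescan.fil, which is not described here. The class and field names are hypothetical, and the threshold values are made up. The gap penalties are the ones shown for amp_binding.prf in the sample output.

```python
from dataclasses import dataclass

@dataclass
class LibraryEntry:
    """One profile's settings (hypothetical model of a library entry)."""
    profile_file: str
    high: float            # required: high threshold
    interesting: float     # required: interesting threshold
    gap_creation: float    # required: gap creation penalty
    gap_extension: float   # required: gap extension penalty
    a: float = 0.0         # normalization constant A, default if omitted
    b: float = 0.0         # normalization constant B, default if omitted
    c: float = 1.0         # normalization constant C, default if omitted

# Thresholds are made up; gap penalties are from the sample output.
entry = LibraryEntry(
    "profiledir:amp_binding.prf",
    high=2.0,
    interesting=1.5,
    gap_creation=4.50,
    gap_extension=0.05,
)
print(entry.a, entry.b, entry.c)
```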

OPTIONAL PARAMETERS [ Previous | Top | Next ]

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-INTEResting

reports alignments whose scores are higher than the interesting threshold, rather than the default, more stringent high threshold.

-NOAVErage

turns off the adjustment of scores for sequence composition. By default (-AVErage), a score due to the similarity in composition between the profile and sequence of interest is subtracted from the original alignment score.

-NOREFerences

If a motif profile was derived from a pattern defined in the PROSITE Dictionary of Protein Sites and Patterns, the PROSITE abstract normally appears beneath each alignment in the .scan file. Use -NOREFerences if you don't want to see this information in the output.

-PAIr=1.0,0.5,0.1

The paired output file from this program displays sequence similarity by putting a pipe character (|), colon (:), and period (.) between similar sequence symbols. The thresholds for the characters are determined by the values in the profile. The pipe character is put between symbols whose comparison value in the profile is at least the average positive value in the profile plus one tenth the difference between the maximum and average values in the profile. The colon character threshold is the average positive value in the profile. The period character threshold is the larger of the average positive value in the profile minus one tenth the difference between the maximum and average values, and one half the average value.
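The three display thresholds described above can be written out as a short Python sketch. It follows one reading of the description and is not ProfileScan's actual code: it assumes that "average" always means the average of the positive values in the profile, and the function names and sample values are hypothetical.

```python
def pair_thresholds(profile_values):
    """Compute the '|', ':' and '.' thresholds from a profile's values.

    Assumes the "average" in the description is the average of the
    positive comparison values in the profile.
    """
    positives = [value for value in profile_values if value > 0]
    average = sum(positives) / len(positives)
    tenth = (max(profile_values) - average) / 10
    return {
        "|": average + tenth,
        ":": average,
        ".": max(average - tenth, average / 2),
    }

def pair_symbol(value, thresholds):
    """Return the symbol to print for one comparison value."""
    for symbol in ("|", ":", "."):
        if value >= thresholds[symbol]:
            return symbol
    return " "

# Made-up comparison values for illustration.
thresholds = pair_thresholds([1.5, 0.5, 1.0, -0.3, -0.2])
symbols = [pair_symbol(v, thresholds) for v in (1.5, 1.0, 0.97, 0.2)]
print(thresholds, symbols)
```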

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

Printed: November 18, 1996 13:07 (1162)


Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.