MOTIFS

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
PROSITE ABSTRACTS
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
MISMATCHES
PATTERN FILE
FREQUENT MOTIFS
SUGGESTIONS
CONSIDERATIONS
DEFINING PATTERNS
COMMAND-LINE SUMMARY
ACKNOWLEDGEMENTS
LOCAL DATA FILES
OPTIONAL PARAMETERS

FUNCTION

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

DESCRIPTION

Motifs looks for protein sequence motifs by checking your protein sequence for every sequence pattern in the PROSITE Dictionary. Motifs can recognize the patterns with some of the symbols mismatched, but not with gaps. Currently, Motifs can only search for patterns in protein sequences.

The PROSITE Dictionary includes an informative abstract for every motif. Motifs displays the abstract below each motif it finds in your sequence.

The PROSITE Dictionary was written by Dr. Amos Bairoch of the University of Geneva.

EXAMPLE

Here is a session using Motifs to look for sequence motifs in SW:kad1_human:


% motifs

 MOTIFS from what protein sequence(s) ?  Sw:Kad1_Human

 What should I call the output file (* kad1_human.motifs *) ?

          KAD1_HUMAN len:        194 .......................

             Total finds:          1
            Total length:        194
         Total sequences:          1
          CPU time (sec):       2.72

             Output file:"/usr/users/abernathy/kad1_human.motifs"

%

OUTPUT

Here is some of the output file:


 MOTIFS from: SW:Kad1_Human

 Mismatches: 0                October 1, 1996 13:16  ..

          KAD1_HUMAN  Check: 1652  Length: 194   ! P00568 homo sapiens ...

______________________________________________________________________________

Adenylate_Kinase      (L,I,V,M,F,Y,W)3DG(F,Y)PRx3(N,Q)
                          (L,I,F){3}DG(Y)PRx{3}(Q)
            90: NTSKG           FLIDGYPREVQQ           GEEFE

******************************
* Adenylate kinase signature *
******************************

Adenylate kinase  (EC 2.7.4.3) (AK) [1]  is  a  small  monomeric  enzyme
that catalyzes the reversible transfer of MgATP to AMP (MgATP + AMP = MgADP
+ ADP). In mammals there are three different isozymes:

 - AK1 (or myokinase), which is cytosolic.
 - AK2, which is located in the outer compartment of mitochondria.
 - AK3 (or GTP:AMP phosphotransferase),  which is located in the
   mitochondrial matrix and which uses MgGTP instead of MgATP.

The sequence of  AK has also  been  obtained from different  bacterial
species and from yeast.

Two other enzymes have been found to be evolutionary related to AK. These
are:

 - Yeast uridylate kinase (EC 2.7.4.-) (UK) (gene URA6) [2] which catalyzes
   the transfer of a phosphate group from ATP to UMP to form UDP and ADP.
 - Slime mold UMP-CMP kinase (EC 2.7.4.14) [3] which catalyzes the transfer
   of a phosphate group from ATP to either CMP or UMP to form CDP or UDP
   and ADP.

Several regions of  AK  family enzymes  are well conserved, including the
ATP-binding domains.  We have  selected the  most conserved  of  all
regions as a signature for this type  of  enzyme.   This  region includes
an aspartic acid residue that is  part of the  catalytic  cleft  of  the
enzyme  and  that  is involved in  a salt  bridge.  It  also  includes an
arginine  residue whose modification leads to inactivation of the enzyme.

-Consensus pattern: [LIVMFYW](3)-D-G-[FY]-P-R-x(3)-[NQ]
-Sequences known to belong to this class detected by the pattern: ALL,
 except for Schistosoma mansoni (blood fluke) and Yersinia enterocolitica AK.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Note: archebacterial AK do not belong to this family [4].

-Last update: November 1995 / Text revised.

[ 1] Schulz G.E.
     Cold Spring Harbor Symp. Quant. Biol. 52:429-439(1987).
[ 2] Liljelund P., Sanni A., Friesen J.D., Lacroute F.
     Biochem. Biophys. Res. Commun. 165:464-473(1989).
[ 3] Wiesmueller L., Noegel A.A., Barzu O., Gerisch G., Schleicher M.
     J. Biol. Chem. 265:6339-6345(1990).
[ 4] Kath, T.H., Schmid, R., Schaefer, G.
     Arch. Biochem. Biophys. 307:405-410(1993).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Above each find, the complex expression being searched ((L,I,V,M,F,Y,W)3DG(F,Y)PRx3(N,Q)) is simplified ((L,I,F){3}DG(Y)PRx{3}(Q)) so that you can see what was actually found. The find is displayed in the midst of the five symbols flanking it on either side in the original sequence. The number to the left of the find is the first coordinate of the motif (not of the flanking symbols). In the example above, 90 is the coordinate of the first F in FLIDGYPREVQQ, not of the first N in NTSKG.
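
The coordinate convention described above can be illustrated with a minimal Python sketch. It is not part of Motifs: the function name flanked_find and the 22-symbol fragment are assumptions made for this example. The fragment holds only the symbols visible in the sample output, so the motif starts at coordinate 6 here rather than 90.

```python
def flanked_find(sequence: str, start: int, length: int, flank: int = 5):
    """Split a find into (left flank, motif, right flank).

    `start` is the 1-based coordinate of the first symbol of the motif,
    not of the flanking symbols.
    """
    begin = start - 1                 # convert to a 0-based index
    end = begin + length
    left = sequence[max(0, begin - flank):begin]
    right = sequence[end:end + flank]
    return left, sequence[begin:end], right


# Toy data: only the symbols shown in the sample output, not all of KAD1_HUMAN.
fragment = "NTSKGFLIDGYPREVQQGEEFE"
left, motif, right = flanked_find(fragment, start=6, length=12)
print(f"{6:>14}: {left}           {motif}           {right}")
```

The reported coordinate belongs to the first F of FLIDGYPREVQQ; the flanking symbols are added only for display.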

PROSITE ABSTRACTS

The PROSITE Dictionary contains an extensive abstract of the current literature for each motif. Motifs displays the abstract below each pattern that is found. If the same pattern is found in more than one sequence, the abstract is only shown below the pattern in the first sequence in which the pattern is found. Several different patterns may share the same abstract. If you want to reduce the size of your output you can suppress these abstracts with the command-line parameter -NOREFerence. When abstracts are being suppressed there will be a filename, such as 0179.pdoc, that appears in parentheses below each pattern found. You can use the Fetch program to make a copy of this file in order to look at the abstract.

INPUT FILES

Motifs takes as input one or more protein sequences. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example SW:*. If Motifs rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence.

RELATED PROGRAMS

FindPatterns and all of the Wisconsin Package(TM) mapping programs use the same search algorithm and pattern file format as Motifs. ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.

RESTRICTIONS

Patterns may not be more than 350 characters long.

MISMATCHES

Motifs will not introduce gaps but it can tolerate mismatches when it is run with the command-line parameter -MISmatch=n. Mismatched finds are shown in the output in lowercase. Mismatches cannot occur within NOT expressions (see the DEFINING PATTERNS topic below).
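
The following Python sketch shows one way a gap-free search with a mismatch limit could work. It is an illustration under stated assumptions, not the algorithm Motifs uses: the list-of-positions representation of a pattern and the function name find_with_mismatches are hypothetical, and only fixed-length patterns are handled.

```python
from typing import Iterator, List, Tuple

# Each position is (symbols, negated). A negated position matches any symbol
# NOT in `symbols` and is never allowed to count as a mismatch.
Position = Tuple[str, bool]


def find_with_mismatches(seq: str, pattern: List[Position],
                         max_mismatch: int) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start, displayed find) for windows within the limit."""
    width = len(pattern)
    for i in range(len(seq) - width + 1):
        shown = []
        mismatches = 0
        for symbol, (symbols, negated) in zip(seq[i:i + width], pattern):
            ok = (symbol not in symbols) if negated else (symbol in symbols)
            if ok:
                shown.append(symbol.upper())
            elif negated:
                mismatches = max_mismatch + 1   # NOT positions must match
                break
            else:
                mismatches += 1
                shown.append(symbol.lower())    # mismatches shown in lowercase
        if mismatches <= max_mismatch:
            yield i + 1, "".join(shown)


# RGF(Q,A)S written as a list of positions (hypothetical representation).
pattern = [("R", False), ("G", False), ("F", False), ("QA", False), ("S", False)]
hits = list(find_with_mismatches("MRGFQSKRGLAS", pattern, max_mismatch=1))
print(hits)
```

With one mismatch allowed, the toy sequence yields an exact find at coordinate 2 (RGFQS) and a one-mismatch find at coordinate 8 (RGlAS).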

PATTERN FILE

In addition to your protein sequence, Motifs reads a local data file like the one below to find the search patterns. This file is modeled on the enzyme data files for the mapping programs described in Appendix VII. The offset field is not used by Motifs, but the field must have a number in it to make the file compatible with the mapping files.

The exact column used for each field does not matter, only the order of the fields in the line. You can give several patterns the same name, but you must put all of the entries for that name on adjacent lines of this file. The patterns may not be more than 350 characters long. Blank lines and lines that start with an exclamation point (!) are ignored.

Here is some of the standard public pattern data file:


PROSITETOGCG of: Prosite.Doc and Prosite.Dat  December 18, 1995 11:18

Release 13.0  (11/95)

Name            Offset Pattern                  ..                 PDoc_Name

11s_Seed_Storage     1 NGx(D,E)2x(L,I,V,M,F)C(S,T)x{11,12}(P,A,G)D 0284.pdoc
1433_1               1 RNL(L,I)SV(G,A)YKN(I,V)                     0633.pdoc
1433_2               1 YK(D,E)STLIMQLL(R,H)DNLTLW(T,A)(S,A)        0633.pdoc
25a_Synth_1          1 GGSx(A,G)(K,R)xTxL(K,R)(G,S,T)xSD(A,G)      0653.pdoc
25a_Synth_2          1 RPVILDPx(D,E)PT                             0653.pdoc

////////////////////////////////////////////////////////////////////////////

Zinc_Finger_C2h2     1 Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H       0028.pdoc
Zinc_Finger_C3hc4    1 CxHx(L,I,V,M,F,Y)Cx2C(L,I,V,M,Y,A)          0449.pdoc
Zinc_Protease        1 (G,S,T,A,L,I,V,N)x2HE(L,I,V,M,F,Y,W)~(D,E,H,R,K,P ...
Zn2_Cy6_Fungal       1 (G,A,S,T,P,V)Cx2C(R,K,H,S,T,A,C,W)x2(R,K,H)x2Cx{5 ...
Zp_Domain            1 (L,I,V,M,F,Y,W)x7(S,T,A,P,D,N)x3(L,I,V,M,F,Y,W)x( ...
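
The field layout described above can be read with a short Python sketch. This is an illustration, not GCG code. It assumes that entries begin after the heading line that contains the ".." divider, as in the listing above, and that no field contains white space. The names PatternEntry and read_patterns are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class PatternEntry(NamedTuple):
    name: str
    offset: int       # unused by Motifs, but the field must hold a number
    pattern: str
    pdoc_name: str


def read_patterns(lines: Iterable[str]) -> Iterator[PatternEntry]:
    """Yield entries from pattern-file lines, skipping header and comments."""
    in_data = False
    for raw in lines:
        fields = raw.split()
        if not in_data:
            # Assumption: everything up to the ".." line is header text.
            in_data = ".." in fields
            continue
        if not fields or raw.lstrip().startswith("!"):
            continue                  # blank line or comment
        name, offset, pattern, pdoc_name = fields
        yield PatternEntry(name, int(offset), pattern, pdoc_name)


sample = """\
PROSITETOGCG of: Prosite.Doc and Prosite.Dat  December 18, 1995 11:18

Name            Offset Pattern                  ..                 PDoc_Name

! comment lines start with an exclamation point
1433_1               1 RNL(L,I)SV(G,A)YKN(I,V)                     0633.pdoc
25a_Synth_2          1 RPVILDPx(D,E)PT                             0653.pdoc
""".splitlines()

entries = list(read_patterns(sample))
for entry in entries:
    print(entry.name, entry.pattern, entry.pdoc_name)
```

Because only the order of the fields matters, splitting each line on white space is enough; the column positions are never consulted.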

FREQUENT MOTIFS

The PROSITE Dictionary contains a number of short sequence patterns that occur frequently in protein sequences. Most of these frequently found patterns are sites of post-translational modification, but more specific patterns such as leucine zippers also fall into this category. Such frequently found patterns are not normally shown by Motifs, but you can display them by using the command-line parameter -FREquent. Even more than for other patterns in the PROSITE Dictionary, finding one of these frequently occurring patterns does not assure you that the protein actually has the corresponding function.

Here are some of the patterns that the PROSITE Dictionary classifies as frequently occurring:


;Amidation           1 xG(R,K)(R,K)                             0009.pdoc
;Asn_Glycosylation   1 N~(P)(S,T)~(P)                           0001.pdoc
;Camp_Phospho_Site   1 (R,K)2x(S,T)                             0004.pdoc
;Ck2_Phospho_Site    1 (S,T)x2(D,E)                             0006.pdoc
;Glycosaminoglycan   1 SGxG                                     0002.pdoc
;Leucine_Zipper      1 Lx6Lx6Lx6L                               0029.pdoc
;Microbodies_Cter    1 (S,A,G,C,N)(R,K,H)(L,I,V,M,A,F)>         0299.pdoc
;Myristyl            1 G~(E,D,R,K,H,P,F,Y,W)x2(S,T,A,G,C,N)~(P) 0008.pdoc
;Pkc_Phospho_Site    1 (S,T)x(R,K)                              0005.pdoc
;Rgd                 1 RGD                                      0016.pdoc
;Tyr_Phospho_Site    1 (R,K)x{2,3}(D,E)x{2,3}Y                  0007.pdoc

SUGGESTIONS

The file prosite.seqcat contains short definitions of each motif in the PROSITE Dictionary. You can use Fetch to copy this file to your local directory so that you can view it with a text editor. The PDoc_Name field in the pattern file prosite.patterns has the name of a PDoc (PROSITE Document) file containing the abstract for each pattern. You can use Fetch to look at any abstracts of interest. If you run Motifs with the command-line parameter -NOREFerences, the name of the corresponding PDoc file is shown below each pattern found.

If you specify more than one sequence, Motifs displays each one's name on the screen as it is searched. However, unless you use the -SHOw parameter, the output file shows only those sequences in which a motif was actually found.

If you run Motifs with the command-line parameter -NAMes, the output file is a list file. (See "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide for more information about list files.)

CONSIDERATIONS

With the publication of the PROSITE Dictionary, Amos Bairoch has shown that regular expressions can reliably recognize known protein sequence motifs. When new examples of a known motif are discovered, these expressions can usually be modified to recognize the new example. The process of modifying a regular expression so that it covers all of the members of a newly expanded family of similar sequences could be referred to as "ambiguation."

The problem with regular expressions is that they often fail to recognize sequences that are not yet known to be members of the sequence family. You should consider using Profile technology if your aim is to bring together similar sequences whose association has not yet been recognized.

There are a few patterns in PROSITE that are defined with rules rather than regular expressions. Motifs does not look for these patterns.

DEFINING PATTERNS

FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions that match many different sequences. The expressions can include any legal GCG sequence character (see Appendix III). The expressions can also include several non-sequence characters, which are used to specify OR matching, NOT matching, begin and end constraints, and repeat counts. For instance, the expression TAATA(N){20,30}ATG means TAATA, followed by 20 to 30 of any base, followed by ATG. Following is an explanation of the syntax for pattern specification.

Implied Sets and Repeat Counts

Parentheses () enclose one or more symbols that can be repeated some number of times. Braces {} enclose numbers that tell how many times the symbols within the preceding parentheses must be found.

Sometimes, you can leave out part of an expression. If braces appear without preceding parentheses, the numbers in the braces define the number of repeats for the immediately preceding symbol. One or both of the numbers within the braces may be missing. For instance, both the pattern GATG{2,}A and the pattern GATG{2}A mean GAT, followed by G repeated from 2 to 350,000 times, followed by A; the pattern GATG{}A means GAT, followed by G repeated from 0 to 350,000 times, followed by A; the pattern GAT(TG){,2}A means GAT, followed by TG repeated from 0 to 2 times, followed by A; the pattern GAT(TG){2,2}A means GAT, followed by TG repeated exactly 2 times, followed by A. (If the pattern in the parentheses is an OR expression (see below), it cannot be repeated more than 2,000 times.)
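
The brace rules above can be restated as a small Python function. This is a sketch that follows the examples in the preceding paragraph, not GCG source code; the name repeat_bounds is hypothetical, and the lower limit for OR expressions (2,000) is not modeled.

```python
MAX_REPEAT = 350_000   # upper limit quoted in the text above


def repeat_bounds(braces: str) -> tuple:
    """Return (minimum, maximum) repeats for a brace expression like '{2,}'."""
    inner = braces.strip()[1:-1]          # drop the surrounding { and }
    low, _, high = inner.partition(",")
    minimum = int(low) if low else 0      # missing first number means 0
    # A missing second number, with or without a comma, leaves the upper
    # bound open, following the GATG{2}A and GATG{}A examples above.
    maximum = int(high) if high else MAX_REPEAT
    return minimum, maximum


for expression in ("{2,}", "{2}", "{}", "{,2}", "{2,2}", "{20,30}"):
    print(expression, repeat_bounds(expression))
```

Note that, under these rules, an exact count needs both numbers, as in {2,2}.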

OR Matching

If you are searching nucleic acids, the ambiguity symbols defined in Appendix III let you define any combination of G, A, T, or C. If you are searching proteins, you can specify any of several symbol choices by enclosing the different choices in parentheses and separating the choices with commas. For instance, RGF(Q,A)S means RGF followed by either Q or A followed by S. The length of choices need not be the same, and there can be up to 31 different choices within each set of parentheses. The pattern GAT(TG,T,G){1,4}A means GAT followed by any combination of TG, T, or G from 1 to 4 times followed by A. The sequence GATTGGA matches this pattern. There can be several parentheses in a pattern, but parentheses cannot be nested.
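
For readers more familiar with conventional regular expressions, the OR examples above can be approximated with Python's re module. This is a rough equivalence for illustration only; the helper name or_set_to_regex is hypothetical, and the GCG search algorithm itself is not shown.

```python
import re


def or_set_to_regex(choices: str) -> str:
    """Turn a choice list such as '(TG,T,G)' into a non-capturing group."""
    options = choices.strip("()").split(",")
    return "(?:" + "|".join(re.escape(option) for option in options) + ")"


# GAT(TG,T,G){1,4}A, with the explicit {1,4} carried over unchanged.
regex = "GAT" + or_set_to_regex("(TG,T,G)") + "{1,4}A"
print(regex)
print(bool(re.fullmatch(regex, "GATTGGA")))   # GAT + TG + G + A

# RGF(Q,A)S: RGF followed by either Q or A followed by S.
protein = "RGF" + or_set_to_regex("(Q,A)") + "S"
print(bool(re.fullmatch(protein, "RGFAS")))
```

Choices of different lengths (TG versus T or G) map naturally onto regex alternation.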

NOT Matching

The pattern GC~CAT means GC, followed by any symbol except C, followed by AT. The pattern GC~(A,T)CC means GC, followed by any symbol except A or T, followed by CC.
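
The two NOT examples above correspond to negated character classes in conventional regular expressions. The Python sketch below shows that rough equivalence. It assumes that every choice inside a negated set is a single symbol, and the helper name not_to_regex is hypothetical.

```python
import re


def not_to_regex(pattern: str) -> str:
    """Rewrite ~X and ~(A,B,...) as negated character classes."""
    # ~(A,T) becomes [^AT]
    pattern = re.sub(
        r"~\(([^)]*)\)",
        lambda m: "[^" + m.group(1).replace(",", "") + "]",
        pattern,
    )
    # ~C becomes [^C]
    return re.sub(r"~(\w)", r"[^\1]", pattern)


single = not_to_regex("GC~CAT")
multiple = not_to_regex("GC~(A,T)CC")
print(single, multiple)
print(bool(re.fullmatch(single, "GCGAT")), bool(re.fullmatch(single, "GCCAT")))
```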

Begin and End Constraints

The pattern <GACCAT can only be found if it occurs at the beginning of the sequence range being searched. Likewise, the pattern GACCAT> would only be found if it occurs at the end of the sequence range.
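
The begin and end constraints behave like the anchors of a conventional regular expression applied to the range being searched. The Python sketch below illustrates this for plain patterns without other special characters; the helper name anchored_regex and the toy sequence are assumptions made for the example.

```python
import re


def anchored_regex(pattern: str) -> str:
    """Map a leading '<' and a trailing '>' to regex anchors."""
    prefix = suffix = ""
    if pattern.startswith("<"):
        prefix, pattern = r"\A", pattern[1:]
    if pattern.endswith(">"):
        suffix, pattern = r"\Z", pattern[:-1]
    return prefix + re.escape(pattern) + suffix


sequence = "TTGACCATTT"
# Whole sequence: GACCAT is not at the beginning, so there is no find.
whole = re.search(anchored_regex("<GACCAT"), sequence)
# Range beginning at coordinate 3: GACCAT now starts the range.
ranged = re.search(anchored_regex("<GACCAT"), sequence[2:])
print(whole is None, ranged is not None)
```

The constraint is relative to the sequence range, not to the full entry, which is why the same pattern is found only in the second search.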

COMMAND-LINE SUMMARY

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % motifs [-INfile=]SW:Kad1_Human -Default

Prompted Parameters:

[-OUTfile=]kad1_human.motifs    the output file name

Local Data Files:

-DATa=prosite.patterns   file of protein sequence patterns

Optional Parameters:

-NOREFerence    suppresses the PROSITE abstract for each pattern found
-FREquent       shows motifs that are frequently found in proteins
-MISmatch=1     allows one mismatch
-NAMes          writes the output as a list file
-APPend         appends the pattern data file to your output file
-SHOw           shows every file searched, even if no pattern was found
-ONCe           limits finds to patterns found only once
-MINCuts=2      limits finds to patterns found a minimum of 2 times
-MAXCuts=3      limits finds to patterns found a maximum of 3 times
-EXCLude=n1,n2  excludes patterns found between positions n1 and n2
-NOMONitor      suppresses the screen trace showing each file
-NOSUMmary      suppresses the screen summary at the end of the program
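
If you drive Motifs from a script, the command line can be assembled from the parameters listed above. The Python sketch below only builds the argument list and does not run the program. The function name motifs_command and its keyword arguments are hypothetical; the parameter spellings are taken from the summary above.

```python
from typing import List, Optional


def motifs_command(infile: str, outfile: Optional[str] = None,
                   mismatches: int = 0, frequent: bool = False,
                   references: bool = True) -> List[str]:
    """Build an argument list for a non-interactive Motifs run."""
    args = ["motifs", f"-INfile={infile}"]
    if outfile:
        args.append(f"-OUTfile={outfile}")
    if mismatches:
        args.append(f"-MISmatch={mismatches}")
    if frequent:
        args.append("-FREquent")
    if not references:
        args.append("-NOREFerence")     # suppress the PROSITE abstracts
    args.append("-Default")             # suppress all program interaction
    return args


print(" ".join(motifs_command("SW:Kad1_Human")))
print(" ".join(motifs_command("SW:Kad1_Human", mismatches=1, references=False)))
```

The first call reproduces the minimal syntax shown above, written with the optional -INfile= prefix.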

ACKNOWLEDGEMENTS

The publication of the PROSITE Dictionary of Protein Sites and Patterns by Dr. Amos Bairoch of the University of Geneva is one of the great achievements of sequence analysis. Dr. Bairoch's prodigious efforts can be seen in every abstract of this extraordinary collection. His generosity in distributing it freely and his patience in compiling it so carefully put all of us in his debt.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

Motifs reads the regular expressions for the motifs of interest from the file prosite.patterns.

OPTIONAL PARAMETERS

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-NOREFerences

suppresses the PROSITE abstract that normally appears below each pattern that is found.

-FREquent

Frequently found patterns, such as post-translational modifications, are not normally shown by Motifs. You can display them by using this command-line parameter.

-NAMes

writes the output file as a list file suitable for input to other Wisconsin Package programs that support indirect file specification (see "Using List Files" in Chapter 2, Using Sequence Files and Databases of the User's Guide). The output showing the location of the patterns found is suppressed when you choose the list file format.

-SHOw

Usually, Motifs shows that a motif was searched only if there were one or more matches in the sequence. With the -SHOw parameter, Motifs shows every motif searched whether or not a pattern was actually found in the sequence. (-SHOw is equivalent to setting -MINCuts=0.)

-MISmatch=1

causes Motifs to recognize places where patterns are found with one or fewer mismatches. If you allow too many mismatches, you get too much output. The display uses case to distinguish between matches and mismatches.

-APPend

appends the pattern data file to your output file. (See the PATTERN FILE topic above.)

The descriptions of the exclusionary parameters below were written for the Wisconsin Package mapping programs. A find in these applications is referred to as a cut while a pattern is referred to as a restriction enzyme recognition site.

The -MINCuts, -MAXCuts, -ONCe, and -EXCLude parameters suppress the display of selected enzymes. The list of excluded enzymes in the program output includes both selected enzymes that cut within excluded ranges and selected enzymes that did not cut the right number of times.

-MINCuts=2

excludes enzymes that do not cut at least two times.

-MAXCuts=2

excludes enzymes that cut more than two times.

-ONCe

excludes, from the set of enzymes displayed, those enzymes that cut your sequence more than once (equivalent to setting both -MINCuts and -MAXCuts to one).

-EXCLude=n1,n2[,n3,n4,...]

excludes enzymes that cut anywhere within one or more ranges of the sequence. If an enzyme is found within an excluded range, then the enzyme is not displayed. The list of excluded enzymes includes enzymes that cut within excluded ranges. The ranges are defined with sets of two numbers. The numbers are separated by commas. Spaces between numbers are not allowed. The numbers must be integers that fall within the sequence beginning and ending points you have chosen. The range may be circular if circular mapping is being done. Exclusion is not done if there are any non-numeric characters in the numbers or numbers out of range or if there is an odd number of integers following the parameter.
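
The rules for the exclusionary parameters can be summarized in a Python sketch. It is an illustration written from the descriptions above, not GCG code. The function names parse_exclude and is_displayed are hypothetical, circular ranges are not handled, and the default of one for mincuts is an assumption based on the statement that -SHOw is equivalent to -MINCuts=0.

```python
from typing import List, Optional, Sequence, Tuple

Range = Tuple[int, int]


def parse_exclude(value: str, begin: int, end: int) -> Optional[List[Range]]:
    """Parse an -EXCLude value such as '10,20,50,60' into ranges.

    Returns None (no exclusion is done) if there are non-numeric
    characters, an odd number of integers, or numbers out of range.
    """
    parts = value.split(",")
    if len(parts) % 2 or not all(p.isascii() and p.isdigit() for p in parts):
        return None
    numbers = [int(p) for p in parts]
    if any(n < begin or n > end for n in numbers):
        return None
    return list(zip(numbers[0::2], numbers[1::2]))


def is_displayed(find_positions: Sequence[int], mincuts: int = 1,
                 maxcuts: Optional[int] = None,
                 excluded: Optional[List[Range]] = None) -> bool:
    """Decide whether a pattern with these find coordinates is displayed."""
    count = len(find_positions)
    if count < mincuts or (maxcuts is not None and count > maxcuts):
        return False
    for low, high in excluded or []:
        if any(low <= pos <= high for pos in find_positions):
            return False
    return True


ranges = parse_exclude("80,100", begin=1, end=194)
print(ranges, is_displayed([90], excluded=ranges))
print(is_displayed([5, 40, 90], mincuts=1, maxcuts=1))   # like -ONCe
```

In this sketch a pattern found at coordinate 90 is suppressed by the excluded range 80 to 100, and a pattern found three times is suppressed when both limits are set to one.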

-MONitor

This program normally monitors its progress on your screen. However, when you use -Default to suppress all program interaction, you also suppress the monitor. You can turn it back on with this parameter. If you are running the program in batch, the monitor will appear in the log file.

-SUMmary

writes a summary of the program's work to the screen when you've used the -Default parameter to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.

You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.

Printed: November 18, 1996 13:07 (1162)

Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Copyright (c) 1982, 1983, 1985, 1986, 1987, 1989, 1991, 1994, 1995, 1996, 1997 Genetics Computer Group, Inc. a wholly owned subsidiary of Oxford Molecular Group, Inc. All rights reserved.

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Genetics Computer Group

www.gcg.com