HTHSCAN*

[ Program Manual | User's Guide | Data Files | Databases ]

Table of Contents

FUNCTION

DESCRIPTION

FUNCTION [ Top | Next ]

HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.

DESCRIPTION [ Previous | Top | Next ]

HTHScan predicts helix-turn-helix (H-T-H) motifs in protein sequences. For each sequence, HTHScan prints a list of possible H-T-H motifs sorted in descending order according to score. Associated with each score is the probability of achieving that score in the target sequence by chance using the given family-specific weight matrix. HTHScan has weight matrices for the araC and lysR families of H-T-H motifs and one for homeobox domains.

EXAMPLE [ Previous | Top | Next ]

Here is a session with HTHScan that was used to find H-T-Hs in the arabinose operon regulatory protein araC sequence from E. coli:


% hthscan

  HTHScan of what sequence(s)? SW:Arac_Ecoli

                  Begin (* 1 *) ?
                End (*   292 *) ?

  Search using weight matrix for which H-T-H family:

      A.  AraC
      B.  LysR
      C.  Homeobox

     Please choose one: (* A *):

  Only display H-T-Hs whose score exceeds (* 4.0 *) ?

  What should I call the output file (* arac_ecoli.hthscan *) ?

  Input sequences processed: 1
  Number of sequences with predicted H-T-Hs: 1

  Results written to file "arac_ecoli.hthscan".
  CPU time (sec): 0.80

%

OUTPUT [ Previous | Top | Next ]

Here is the output file:


HTHScan of sw:arac_ecoli  August 6, 1997 14:34

  Weight matrix: GenRunData:htharac.dat
  Minimum score for H-T-Hs (threshold): 4.0

> sequence: sw:arac_ecoli
      name: arac_ecoli  check: 4061  from: 1  to: 292

   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 0.000E+00

  Databases searched:
        SWISS-PROT, Release 34.0, Released on 30Nov96, Formatted on 30Dec1996
  Input sequences searched: 1
  Number of sequences with predicted H-T-Hs: 1
  CPU time (sec): 0.80

The N-terminus->C-terminus direction of the predicted H-T-H is from left to right. The position of the first residue in the H-T-H is shown to the left. The position of the last residue in the H-T-H is shown to the right.

Below the H-T-H display is the score computed for the predicted H-T-H and the probability of random occurrence of that score or better in a sequence whose positions are independent of one another and whose residue distribution matches that of the target sequence (or is even, if you use -EVEn).
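If you need the predictions in another program, the report can be read back with a few lines of Python. The sketch below is not part of the Wisconsin Package; it assumes that every prediction is laid out exactly as in the single example above (a numbered line with start position, residues, and end position, followed by a Score line and a Probability line). The function name parse_hits is hypothetical, and the sketch does not track which sequence a hit belongs to when a report covers several sequences.

```python
import re

# A hit line looks like "   1. 197 IASVAQHVCLSPSRLSHLFR 216":
# rank, start position, motif residues, end position.
HIT_RE = re.compile(r"^\s*(\d+)\.\s+(\d+)\s+([A-Za-z]+)\s+(\d+)\s*$")


def parse_hits(text):
    """Return one dict per predicted H-T-H found in the report text."""
    hits = []
    current = None
    for line in text.splitlines():
        match = HIT_RE.match(line)
        if match:
            current = {
                "rank": int(match.group(1)),
                "start": int(match.group(2)),
                "motif": match.group(3),
                "end": int(match.group(4)),
            }
            hits.append(current)
            continue
        stripped = line.strip()
        # Score and Probability lines belong to the most recent hit line.
        if current is not None and stripped.startswith("Score:"):
            current["score"] = float(stripped.split(":", 1)[1])
        elif current is not None and stripped.startswith("Probability:"):
            current["probability"] = float(stripped.split(":", 1)[1])
    return hits


# The prediction lines from the example output above.
SAMPLE = """\
   1. 197 IASVAQHVCLSPSRLSHLFR 216
      Score: 39.8
      Probability: 0.000E+00
"""

hits = parse_hits(SAMPLE)
print(hits)
```

The motif length can be checked against the reported positions: end minus start plus one should equal the number of residues shown.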

INPUT FILES [ Previous | Top | Next ]

The input to HTHScan is one or more protein sequences. If HTHScan rejects your protein sequence, turn to Appendix VI to see how to change or set the type of a sequence. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*.

RELATED PROGRAMS [ Previous | Top | Next ]

Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds. FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal. SPScan scans protein sequences for the presence of secretory signal peptides (SPs). CoilScan locates coiled-coil segments in protein sequences.

CONSIDERATIONS [ Previous | Top | Next ]

Because of the way HTHScan sorts and stores predicted H-T-H motifs during scanning, no particular ordering is guaranteed among H-T-H motifs that have exactly the same score.

ALGORITHM [ Previous | Top | Next ]

HTHScan uses a log-odds position-weight matrix ("weight matrix") to detect the presence of H-T-H motifs in protein sequences. The weight matrix encodes the H-T-H motif as a set of weights representing the likelihood of each amino acid residue to appear in each position of the motif. The score reported by HTHScan for each prediction is a measure of the local goodness of fit between the target sequence and the H-T-H signal represented by the weight matrix. This score is the sum of the weights corresponding to the amino acid residues found in the target sequence at each weight matrix position.
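The scoring rule described above can be illustrated with a small Python sketch. It is not HTHScan's code: the three-position matrix, its weights, and the three-letter alphabet are invented for the example, and the function name scan is hypothetical. Only the rule itself (sum the weights of the residues found at each matrix position, and keep windows whose score exceeds the threshold) is taken from the text.

```python
# Toy weight matrix: one dict per motif position, residue -> weight.
# These weights are made up; real matrices cover all 20 amino acids.
MATRIX = [
    {"A": 2.0, "K": -1.0, "L": -1.0},
    {"A": -1.0, "K": 1.5, "L": -0.5},
    {"A": -1.0, "K": -1.0, "L": 2.5},
]


def scan(sequence, matrix, threshold):
    """Score every window of len(matrix) residues against the matrix.

    Returns (score, first_position, last_position, residues) tuples for
    windows whose score exceeds the threshold, highest score first.
    Positions are 1-based, as in the HTHScan report.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # The score is the sum of the weights for the residue seen
        # at each matrix position.
        score = sum(column[res] for column, res in zip(matrix, window))
        if score > threshold:
            hits.append((score, start + 1, start + width, window))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


hits = scan("LAKLKA", MATRIX, threshold=4.0)
print(hits)
```

In this toy sequence only the window AKL at positions 2 to 4 scores above 4.0, because it is the only window that matches the preferred residue at every position.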

The statistical significance of each score is computed as the probability of random occurrence of that score or better in a sequence with the same amino acid residue distribution as the target sequence and whose positions are all independent of each other (Claverie, J.-M. and Audic, S. CABIOS 12(5); 431-439 (1996)).
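One common way to obtain such a probability is to build the exact distribution of window scores one matrix position at a time. The Python sketch below does this for a toy matrix; it is offered as an illustration under stated assumptions and is not claimed to be the procedure of Claverie and Audic or the one HTHScan implements. The matrix, the function names, and the rounding of weights to one decimal place are all assumptions of the sketch, and the result is the probability for a single window rather than for a whole sequence.

```python
from collections import Counter, defaultdict

# Toy weight matrix (made-up weights over a three-letter alphabet).
MATRIX = [
    {"A": 2.0, "K": -1.0, "L": -1.0},
    {"A": -1.0, "K": 1.5, "L": -0.5},
    {"A": -1.0, "K": -1.0, "L": 2.5},
]


def residue_frequencies(sequence):
    """Observed residue distribution of the target sequence."""
    counts = Counter(sequence)
    return {res: n / len(sequence) for res, n in counts.items()}


def tail_probability(matrix, freqs, observed, scale=10):
    """P(score >= observed) for one window of independent residues.

    Weights are converted to integers (weight * scale, rounded) so that
    equal scores can be merged while the distribution is built up.
    """
    dist = {0: 1.0}  # scaled partial score -> probability
    for column in matrix:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for res, freq in freqs.items():
                nxt[partial + round(column[res] * scale)] += prob * freq
        dist = nxt
    cutoff = round(observed * scale)
    return sum(prob for score, prob in dist.items() if score >= cutoff)


freqs = residue_frequencies("LAKLKA")  # A, K and L each occur twice
p_value = tail_probability(MATRIX, freqs, observed=6.0)
print(p_value)
```

Here each residue has frequency 1/3, and only one of the 27 possible three-residue windows reaches a score of 6.0, so the tail probability is 1/27.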

The weight matrices used by HTHScan were prepared using sequence sets taken from Pfam Release 2.0 (Sonnhammer, E.L.L. et al. Proteins, in press (1997)). The Pfam families used were HTH 1 (bacterial regulatory helix-turn-helix proteins, lysR family), HTH 2 (bacterial regulatory helix-turn-helix proteins, araC family), and homeobox (homeobox domain). The log-odds weight matrices were constructed from these sequences with MEME version 2.1 (Bailey, T.L. and Elkan, C. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36 (1994)).
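To show what a log-odds weight is, the Python sketch below derives one from a handful of already aligned motif instances. This is only the textbook definition (the logarithm of the frequency observed at a position divided by the background frequency); MEME discovers motifs in unaligned sequences by its own procedure, which this sketch does not attempt to reproduce. The instances, the background frequencies, the pseudocount, the base-2 logarithm, and the function name are all assumptions made for the example.

```python
import math


def log_odds_matrix(instances, background, pseudocount=1.0):
    """Build a log-odds matrix from equal-length aligned motif instances.

    background maps each residue of the alphabet to its expected frequency.
    A pseudocount keeps unseen residues from producing log(0).
    """
    width = len(instances[0])
    alphabet = sorted(background)
    matrix = []
    for pos in range(width):
        column = [seq[pos] for seq in instances]
        total = len(column) + pseudocount * len(alphabet)
        weights = {}
        for res in alphabet:
            freq = (column.count(res) + pseudocount) / total
            weights[res] = math.log2(freq / background[res])
        matrix.append(weights)
    return matrix


# Made-up aligned instances over a three-letter alphabet.
INSTANCES = ["AKL", "AKL", "AKA", "LKL"]
BACKGROUND = {"A": 1 / 3, "K": 1 / 3, "L": 1 / 3}

matrix = log_odds_matrix(INSTANCES, BACKGROUND)
for position, weights in enumerate(matrix, start=1):
    print(position, {res: round(w, 2) for res, w in weights.items()})
```

Residues that occur at a position more often than the background predicts receive positive weights, and residues that occur less often receive negative weights.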

COMMAND-LINE SUMMARY [ Previous | Top | Next ]

All parameters for this program may be put on the command line. Use the parameter -CHEck to see the summary below and to have a chance to add things to the command line before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % hthscan [-INfile=]SwissProt:ARAC_ECOLI -Default

Prompted Parameters:

-BEGin=1 -END=292                sequence range of interest
-FAMily=arac                     specify weight matrix by H-T-H family:
                                   "arac", "lysr", or "homeobox"
-THRESHold=4.0                   minimum score for H-T-H detection
[-OUTfile=]arac_ecoli.hthscan    name of results file

Local Data Files:

-DATa=htharac.dat                weight matrix for the araC family H-T-Hs
      HTHLysR.Dat                weight matrix for the lysR family H-T-Hs
      HTHHomeobox.Dat            weight matrix for the homeobox family H-T-Hs

Optional Parameters:

-NUMTOPscores=3                  maximum number of H-T-Hs to report
-EVEn                            assume even target residue distribution
-NOPROBabilities                 don't compute score probabilities
-VERbose                         use verbose output
-MONitor                         display screen trace of progress
-NOSUMmary                       suppress report of run information
                                   to screen at exit
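
When HTHScan is run from a script, the command line can be assembled from the parameters listed above. The Python sketch below only builds the argument list and does not run anything. The helper name build_command is hypothetical, and combining these parameters with -Default on one command line is an assumption of the sketch; the parameter spellings themselves are taken from the summary.

```python
FAMILIES = ("arac", "lysr", "homeobox")


def build_command(infile, family="arac", threshold=4.0,
                  outfile=None, numtop=None):
    """Return an hthscan argument list suitable for subprocess.run()."""
    if family not in FAMILIES:
        raise ValueError(f"family must be one of {FAMILIES}")
    args = [
        "hthscan",
        f"-INfile={infile}",
        f"-FAMily={family}",
        f"-THRESHold={threshold}",
    ]
    if outfile is not None:
        args.append(f"-OUTfile={outfile}")
    if numtop is not None:
        args.append(f"-NUMTOPscores={numtop}")
    # -Default accepts the default for anything not given explicitly,
    # so the program does not stop to prompt.
    args.append("-Default")
    return args


command = build_command("SwissProt:ARAC_ECOLI", numtop=3)
print(" ".join(command))
```

Passing the list to subprocess.run(command, check=True) would start the program on a system where the Wisconsin Package is installed and initialized.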

ACKNOWLEDGEMENT [ Previous | Top | Next ]

We thank Tim Bailey, Charles Elkan, and Bill Grundy for MEME (http://www.sdsc.edu/MEME), which was used to create the log-odds weight matrices. We thank Erik Sonnhammer, Sean Eddy, and Richard Durbin for the Pfam protein domain family database (http://www.sanger.ac.uk/Software/Pfam/), which was used to create input sequence sets for MEME.

HTHScan was written by Ted Slater.

LOCAL DATA FILES [ Previous | Top | Next ]

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

If you choose to search for the araC family of H-T-H motifs (the default), HTHScan will use the weight matrix file HTHAraC.Dat. If you choose to search for the lysR family of H-T-H motifs, HTHScan will use the weight matrix file HTHLysR.Dat. If you choose to search for the homeobox family of H-T-H motifs, HTHScan will use the weight matrix file HTHHomeobox.Dat.

OPTIONAL PARAMETERS [ Previous | Top | Next ]

The parameters listed below can be set from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

-FAMily=arac

allows you to specify the weight matrix used by choosing the H-T-H motif family by name. You may specify arac for the araC family of bacterial regulatory proteins (represented by the weight matrix file HTHAraC.Dat), lysr for the lysR family of bacterial regulatory proteins (represented by the weight matrix file HTHLysR.Dat), or homeobox for the homeobox domain (represented by the weight matrix file HTHHomeobox.Dat).

-THRESHold=4.0

allows you to specify the minimum acceptable score for an H-T-H motif prediction. If you do not specify this parameter on the command line, and you run HTHScan interactively, HTHScan will prompt you for this value giving a default taken from the weight matrix file itself.

-NUMTOPscores=3

specifies the maximum number of predicted H-T-H motifs to report for each sequence scanned. For example, if you specify -NUMTOPscores=3 on the command line, HTHScan will display no more than three of the highest scoring H-T-Hs predicted for each sequence. Use -NUMTOPscores=1 if you want to see only the highest scoring H-T-H in each sequence. By default, HTHScan will display all H-T-Hs predicted for each sequence.
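
The effect of this parameter can be mimicked when post-processing a list of predictions. The Python sketch below is an illustration only: the function name top_hits is hypothetical, all hits except the first are invented, and breaking ties by start position is a choice made by the sketch (HTHScan itself guarantees no particular order among equal scores).

```python
def top_hits(hits, numtop=None):
    """Return hits sorted by descending score, keeping at most numtop.

    Each hit is a (score, start, end) tuple. numtop=None keeps every
    hit, mirroring the default of reporting all predictions.
    """
    # Sort by score (highest first), then by start position so that
    # equal scores come out in a reproducible order.
    ordered = sorted(hits, key=lambda hit: (-hit[0], hit[1]))
    return ordered if numtop is None else ordered[:numtop]


# The first hit is from the example output; the rest are made up.
HITS = [(5.0, 40, 59), (39.8, 197, 216), (5.0, 10, 29), (12.1, 100, 119)]

best_three = top_hits(HITS, numtop=3)
print(best_three)
```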

-EVEn

tells HTHScan to assume that amino acid residues are distributed evenly throughout the length of the target sequence for the purpose of calculating score probabilities. This makes HTHScan perform a little faster, because it does not have to compute the actual distribution of residues in each input sequence. However, reliability of the score probability calculations may be adversely affected.

-NOPROBabilities

tells HTHScan to forgo the calculation of the probability of random occurrence of the score in a sequence with even amino acid residue distribution whose positions are all independent of each other. This makes HTHScan run much faster.

-VERbose

tells HTHScan to print more documentation about each sequence to the output file. The number of lines of documentation printed depends upon the -Doclines global switch described in the User's Guide.

-MONitor=10

monitors this program's progress on your screen. Use this parameter to see this same monitor in the log file for a batch process. If the monitor is slowing down the program because your terminal is connected to a slow modem, suppress it with -NOMONitor.

The monitor is updated every time the program processes 10 sequences or files. You can use a value after the parameter to set this monitoring interval to some other number.

-NOSUMmary

tells HTHScan not to print a summary of the run just before it exits.

Printed: August 27, 1997 11:05 (1162)

Documentation Comments: doc-comments@gcg.com
Technical Support: help@gcg.com

Licenses and Trademarks: Wisconsin Package is a trademark of Genetics Computer Group, Inc. GCG and the GCG logo are registered trademarks of Genetics Computer Group, Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.