Clustering Practical Brazil 2001

Tutorial for clustering of a small EST dataset.

• ESTs and partial transcripts are in the databases or are generated from local projects. It is necessary to extract the ESTs required.
• Formats
  – EMBL, GenBank, DDBJ
  – SRS and NCBI sequence extraction

Format is very important, as it determines the degree to which the sequence data can be manipulated. A GenBank entry contains several annotations that are useful in the process of reconstruction. The most important of these are the 3' or 5' designation, the library source, and the clone ID.

The European Bioinformatics Institute, the DNA Data Bank of Japan, and the National Center for Biotechnology Information all share information on DNA entries using their own 'flatfile' formats. The information contained is the same across the three systems; only the format in which the data is presented differs. As the transcript reconstruction system we are going to use requires GenBank formatted files or strictly defined FASTA files for best results, we will use the GenBank format.

TODO Look at the datasets provided. Open the file named ESTs.genbank (using the UNIX command 'more', or an editor such as vi or Emacs). Note that the first line of the file details some important features:

LOCUS AA000001 474 bp mRNA EST 17-OCT-1996

LOCUS        Introduces the unique identifier (AA000001) given to this record. It can be used subsequently to extract the same information about the EST once it has been imported into stackPACK.
474 bp       The total length of the EST in characters, including A, C, T, G and N.
mRNA         The 'molecule' type. It implies that the data in the record was created as a transcript from an underlying genome sequence.
EST          A useful definition, as any data stored in the system with this tag will be recognised as an EST.
17-OCT-1996  The date the record was submitted or updated.
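The fields of the LOCUS line described above can be pulled apart with a few lines of Python. This is only a minimal sketch: it assumes the simple seven-token layout of the example line (real GenBank LOCUS lines can carry extra fields), and the function name parse_locus is made up for illustration. It is not part of stackPACK.

```python
from datetime import date

# Month abbreviations as they appear in GenBank dates (e.g. 17-OCT-1996).
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_locus(line):
    """Split a LOCUS line laid out like the tutorial example into named fields.

    Assumption: exactly seven whitespace-separated tokens, as in
    'LOCUS AA000001 474 bp mRNA EST 17-OCT-1996'.
    """
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    day, month, year = fields[6].split("-")
    return {
        "accession": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "date": date(int(year), MONTHS[month.upper()], int(day)),
    }


record = parse_locus("LOCUS AA000001 474 bp mRNA EST 17-OCT-1996")
print(record["accession"], record["length"], record["date"].isoformat())
```

Having the date as a real date object makes it easy to separate, for example, records submitted before 2000 from later ones.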
The submission date is very useful, as it can be used to assess the kinds of errors you may expect in the data. For instance, ESTs submitted by the IMAGE consortium prior to 2000 were all created using slab gel sequencers. Later submissions, or submissions by other groups, may have been created using capillary sequencers.

QS What common sort of error can result from sequences run on a slab gel sequencer?

DEFINITION zd84h07.s1 Soares_fetal_heart_NbHH19W Homo sapiens cDNA clone IMAGE:347389 3', mRNA sequence.

Note the .s1 extension on the sequence reaction name zd84h07.s1. This extension can indicate the direction from the template in which the reaction was performed. .s1 and .r1 are commonly used by groups supplying reaction name information. Note also that the definition contains the context of the reaction: 3' means that the reaction was performed from the end nearest the poly(A) tail of the transcript.

QS What is the clone ID? What is the name of the library from which the EST was sequenced? What does IMAGE mean?

Choice of dataset: ESTs are roughly the same length and always, by definition, represent only part of a complete transcript. In this exercise we will work only with ESTs, not a mixture of ESTs, cDNAs and genomic sequences.

There are two ways to interface with the stackPACK sequence reconstruction system:
a. Via the web interface.
b. Via the command line interface.

We will perform a brief exercise using the command line interface. The data generated can then be analysed using the web interface.

Each clustering run performed with stackPACK is associated with a single input file and a single project. Projects are created and managed by the stack_ProjectManager program, which provides operations to list, delete and create projects, to display summary statistics, and to get information on specified projects.
Each operation has its own defined set of necessary parameters.

---------------------------------------------------------------------
Command: stack_ProjectManager
Usage:   stack_ProjectManager [parameters]

Valid operations, with their parameters, are:
  -menu     run with a simple interactive menu
  -create   create a new project
  -list     list all projects
  -info     get information on specified project
  -summary  display summary stats for specified project
  -delete   delete specified project
---------------------------------------------------------------------

3.3 Projects must be created before the clustering pipeline can process data. One project is created per data input file, and it is usual for the project name to reflect the type of data to be processed. Projects can be created through the command line or through the basic menuing system.

Command line creation of a project

3.3.1 :
---------------------------------------------------------------------
Command: stack_ProjectManager -create
Info:    creates and manages projects
Usage:   stack_ProjectManager -create [Project] [Project Info] [Project owner]
Where:
  Project=        Brief one-word alphanumeric project name. Project names may not include any punctuation and may not begin with a number.
  Project Info=   One-line project description for your reference. Put multi-word descriptions in quotations. Project info is shown in subsequent project listings.
  Project owner=  Email address or name of owner of project

TODO Set up your own project. Use a name that is unique to you.

Example: stack_ProjectManager -create testolf "clustering olfactory data" liza
---------------------------------------------------------------------

The above example returns the following from the system:

  stackPACK version 2.0
  Creating project: testolf
  Description: clustering olfactory data
  Owner: liza
  Created project 'testolf'

TODO Create a unique project with your name and a unique name for the data.

Projects can also be created using a command-line menu programme.
You do not need to re-create your project, but you can view details of the project you have created using the menu system. You can search based upon your username.

TODO View the list of created projects using the command line menu system.

3.3.2 : The same project can be created through the simple menuing system for stack_ProjectManager. An example is given below. Comments on the right preceded by an arrow are for the user's information.

---------------------------------------------------------------------
Command: stack_ProjectManager -menu

Stackpack Project Manager
=========================
1... List all projects
2... Create a project
9... Delete a project
q... Exit project manager
2                                  <---select #2 Create a project

Create Project
--------------
1... Project Owner: stackpack      <---default owner is 'stackpack'
2... Project Name:
3... Project Description:
c... Create Project
q... Return to main menu
> 1                                <---select option 1 to enter project owner
Project Owner: liza                <---enter name of project owner

Create Project
--------------
1... Project Owner: liza
2... Project Name:
3... Project Description:
c... Create Project
q... Return to main menu
> 2                                <---select option 2 to enter project name
Project Name: testolf              <---enter project name

Create Project
--------------
1... Project Owner: liza
2... Project Name: testolf
3... Project Description:
c... Create Project
q... Return to main menu
> 3                                <---select option 3 to enter one-line description
Project Description: clustering olfactory data    <---enter description

Create Project
--------------
1... Project Owner: liza
2... Project Name: testolf
3... Project Description: clustering olfactory data
c... Create Project
q... Return to main menu
> c                                <---IMPORTANT: after entering details, you must now type "c" to create your project
---------------------------------------------------------------------

Creating project: 'testolf'...
Project created successfully

Once your project has been created successfully, type "q" to exit the project manager.
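The naming rule given for -create (one word, alphanumeric, no punctuation, not starting with a number) can be checked before a name is handed to the project manager. The snippet below is a sketch of one reading of that rule, restricted to ASCII letters and digits. The helper name is hypothetical and stackPACK itself may apply a different check.

```python
import re

# Assumed reading of the rule: ASCII letters and digits only,
# and the first character must be a letter.
PROJECT_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_valid_project_name(name):
    """Return True if the name satisfies the assumed project naming rule."""
    return PROJECT_NAME.fullmatch(name) is not None


for candidate in ["testolf", "2olf", "test-olf", "test olf"]:
    print(candidate, is_valid_project_name(candidate))
```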
TODO Check that you can operate the command line interface so that you can create and delete projects. Now create a project that is unique to you for the next step.

3.4 The sequences from the input data file must be imported into stackPACK's database before the clustering engine can process them. Data in GenBank or a range of FASTA formats may be imported. Non-alphabetic characters (including '*' and digits) in the sequence lines are automatically stripped out when the file is read in. Some input files cause the import to fail. These are usually hand-edited sequences that contain unexpected characters or hidden control characters. TRY AND BE A CLEAN, WELL BEHAVED SEQUENCE SUBMITTER.

TODO Read through the following paragraphs and then do the next exercise.

---------------------------------------------------------------------
Command: stack_ImportGenbank
Info:    imports GenBank format input sequences
Usage:   stack_ImportGenbank [Project] [source file] [organism]
Where:
  Project=      Brief one-word alphanumeric project name.
  Source file=  Input data file name and path.
  Organism=     Organism under study; stackPACK will only import sequences with this organism designation.

Example: stack_ImportGenbank yourprojectname ESTs.genbank "Homo sapiens"
---------------------------------------------------------------------

The above example returns the following from the system:

  Importing Genbank data.
  Project: yourprojectname
  Filename: ESTs.genbank
  Organism: Homo sapiens
  .........................................................
  .........................................................
  stack_ImportGenbank completed.
  Imported: 1 sequences.
  Processed: 1 sequences in total.

Open a web browser and point it to http://amoeba.procc.fiocruz.br/stackpack
Choose the WebProbe option from the top of the page and enter, in the open selection box, the project name you created earlier. Choose 'summary report' from the clickbox.
You will be presented with a report that contains a single accession, the one you uploaded. Click on 'singletons'. Click on the accession number. You will see the sequence but, importantly, you will also see a lot of relevant information about the sequence, such as clone orientation.

QS What is the Clone ID?

TODO Now create a new project at the command line and import a file using the FASTA import command:

---------------------------------------------------------------------
Command: stack_ImportFasta
Info:    imports FASTA format input sequences
Usage:   stack_ImportFasta [Project] [Source file] [Format=GUESS]
Where:
  Project=      Brief one-word alphanumeric project name.
  Source file=  Input data file name and path.
  Format=       Type of FASTA format file input. Options: GUESS|SIMPLE|STACK|NCBI. Default format is GUESS. See section 2 above for a more detailed description of each format.

Example: stack_ImportFasta projectname smallolf.seq GUESS
---------------------------------------------------------------------

The above example returns the following from the system:

  Importing Fasta data.
  Project: projectname
  Filename: smallolf.seq
  Format: SIMPLE
  ....................................................
  ....................................................
  stack_ImportFasta completed.
  Imported: 461 sequences.
  Processed: 461 sequences in total.

TODO Import the test sequence file smallolf.seq for the class project. Remember that the file is in FASTA format, and that you must have created a project to import the file into.

QS How do you know that the import was completed successfully?

Pre-processing and Clustering the imported data

3.5 The clustering procedure is intended to group together those sequences that share identical regions. A common problem in EST clustering is contamination with a sequence that is common to several members of the input EST data set but does not represent valid gene data. The masking step helps ensure that ESTs submitted for clustering are free of such artifacts before clustering begins.
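In stackPACK, masking means replacing the contaminated portion of a sequence with 'x' characters so that later steps ignore it. The toy sketch below illustrates that idea only. It masks exact, ungapped occurrences of a contaminant, whereas the real masking tools find matches by alignment and tolerate mismatches. The vector fragment and the function name are made up for illustration.

```python
def mask_exact(sequence, contaminants):
    """Replace every exact occurrence of each contaminant with 'x' characters.

    Toy stand-in for alignment-based masking: only exact, ungapped
    matches are found, and matching is case-insensitive.
    """
    masked = list(sequence)
    upper = sequence.upper()
    for contaminant in contaminants:
        target = contaminant.upper()
        if not target:
            continue
        start = upper.find(target)
        while start != -1:
            # Overwrite the matched stretch, keeping the sequence length unchanged.
            masked[start:start + len(target)] = "x" * len(target)
            start = upper.find(target, start + 1)
    return "".join(masked)


# Hypothetical vector fragment embedded in a hypothetical EST.
VECTOR = "GGATCCTCTAGAGTCGACCTGCAG"
est = "ACGTACGT" + VECTOR + "TTGACA"
masked = mask_exact(est, [VECTOR])
print(masked)
```

Because the masked stretch keeps its original length, base positions in the masked EST still line up with the unmasked read.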
stackPACK uses CrossMatch (Green, 1999) or RepeatMasker to mask input sequences against a database containing:
- Repeat sequences. (For STACKdb production Electric Genetics uses RepBase (Jurka, 1995). Your system administrator may have installed RepBase or another repeat database more pertinent to your work.)
- Common vector sequences, distributed by NCBI.
- Other potential contaminants such as rodent, mitochondrial and ribosomal DNA.

The sequences are masked by replacing the contaminated portions of the sequence with x's, which are ignored by further steps in the clustering pipeline, ensuring that only valid sequence data contributes to the associations made to generate a cluster. Masked regions are nonetheless retained during the clustering pipeline and are visible in the EST View, PHRAP Alignment View and CRAW Alignment View. Consensus sequence positions where all contributing sequences are masked (i.e., are 'x') are calculated as "n".

When stackPACK is installed, the system administrator performing the installation should place a repeat database for the generic, system-wide configuration in /usr/local/stackpack/supporting/
This repeat database will then be used by default and need not be specified on the command line.

SETTING YOUR PARAMETERS INDEPENDENT OF SYSTEMWIDE SETTINGS

Alternatively, any of the system-wide settings can be overridden in the file .stackpackrc in the user's home directory.

The stackPACK software has a system-wide configuration file located at: /etc/stackpack
Users wishing to configure stackPACK differently for their own use may do so by creating an individual configuration file named ".stackpackrc" in their home directory. Key parameters that can be adjusted by the user in .stackpackrc include the repeat masking file, the number of processors used for the clustering step and, for expert users, parameters for each of the programs called externally by stackPACK.
stackPACK first sources /etc/stackpack for parameters. It then sources ~/.stackpackrc in the user's home directory to see if it overrides any of the settings declared in /etc/stackpack. Thus, the user can override any parameter in /etc/stackpack in ~/.stackpackrc.

The easiest way to create the .stackpackrc file is to copy /etc/stackpack to the user's home directory as .stackpackrc and then edit it.

Example:
  cp /etc/stackpack ~/.stackpackrc
  vi ~/.stackpackrc

Pre-processing: MASKING the data

---------------------------------------------------------------------
Command:  stack_Mask
Info:     masks an input file of sequences against common contaminants
Requires: external.bin/cross_match
          FASTA file of sequences to mask against (by default found in /usr/local/stackpack/supporting)
Usage:    stack_Mask [Project] [Repeat file] [Batchsize=250]
Where:
  Project=      brief one-word alphanumeric project name
  Repeat file=  location of FASTA file of sequences to mask against
  Batchsize=    The number of ESTs compared at a time to the repeat database. If the parameter is not specified, a default value of 250 is used.

Example: stack_Mask testolf /edata2/repeat.seq
---------------------------------------------------------------------

TODO Mask your input data now by running the stack_Mask programme according to the usage directions above.

The above example returns the following from the system:

  Masking sequence data
  Project: projectname
  Mask file: /usr/local/stackpack/supporting/repeatseq
  Batch size: 250 sequences
  Processing: 597 sequences in total
  Parameters: minmatch=12 minscore=20
  stack_Mask finished
  Processed 461 sequences

FAQ: Why change the batchsize?
Answer: The larger the batchsize, the more RAM is required to process the data. If your computer has limited RAM, or you find the system running out of RAM during the stack_Mask process, re-run your data with a smaller batchsize.

CLUSTERING the masked sequences.
3.6 The clustering step of stackPACK uses d2_cluster, a high-performance comparison algorithm that rapidly determines the relative similarity of large datasets of genetic sequences (Hide W., Burke J., Davison D. Biological Evaluation of d2, an Algorithm for High-Performance Sequence Comparison. Journal of Computational Biology 1(3): 199-215).

d2_cluster implements a loose approach to sequence clustering by identifying and counting matching n-length words (n=6), in contrast with the strict approach in which clusters are built up based on matching entire sequence fragments. While the strict methodology yields cluster members that are highly related, the loose approach presents the opportunity to detect clusters that are related by re-arrangement or alternative splicing. Although the resulting clusters are likely to be more 'noisy', the combination with verification tools for multiple sequence alignments eliminates this noise and produces networks of highly related sequences for further analysis.

d2_cluster, a word-based, greedy clustering algorithm, is separate from the assembly tool (PHRAP) and identifies ESTs that are greater than 96% identical over a window of 150 bases. It is a word multiplicity comparison method that uses an agglomerative algorithm developed specifically for rapidly and accurately partitioning transcript databases into index classes, clustering ESTs and full-length sequences according to minimal linkage or "transitive closure" rules.

Agglomerative clustering means that every sequence begins in its own cluster and the final clustering is constructed through a series of mergers. Here the mergers follow minimal linkage, sometimes called single linkage or "transitive closure".
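The agglomerative, single-linkage idea can be sketched in a few lines of Python. This is not the d2 measure. As a stand-in it simply counts distinct 6-letter words shared by two sequences, and the threshold min_shared is an arbitrary value chosen for the toy data. All sequences and names below are made up. Every sequence starts in its own cluster, and two clusters are merged whenever any pair of their members is similar enough.

```python
from itertools import combinations


def words(seq, n=6):
    """Return the set of distinct n-letter words in a sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def single_linkage(seqs, n=6, min_shared=10):
    """Cluster sequences by single linkage on a shared-word count.

    Stand-in similarity test (an assumption, not d2): two sequences are
    'similar' if they share at least min_shared distinct n-letter words.
    """
    parent = {name: name for name in seqs}  # every sequence starts alone

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    word_sets = {name: words(seq, n) for name, seq in seqs.items()}
    for a, b in combinations(seqs, 2):
        if len(word_sets[a] & word_sets[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


left = "ACGTTGCAAGGCTTAACCGGTATC"
right = "GATTACAGGCCTTAGCGGATCCAT"
toy = {
    "A": left,                     # shares no 6-letter word with B
    "B": right,
    "C": left[4:] + right[:20],    # overlaps both A and B
    "D": "TTTTTTTTTTCCCCCCCCCCAAAA",
}
clusters = single_linkage(toy)
print(clusters)
```

In the toy data A and B have no word in common, yet they end up in one cluster because C is similar to both, while D stays a singleton.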
The term transitive closure refers to the property that any two sequences with a given level of similarity will be in the same cluster, and that this membership is passed on: A and B are placed in the same cluster even if they share no similarity with each other, provided there exists a sequence C with enough similarity to both A and B.

---------------------------------------------------------------------
Command:  stack_Cluster
Info:     runs d2_cluster on the input data file
Requires: bin.ext/enc_db
          bin.ext/d2_cluster
Usage:    stack_Cluster [Project]
Where:
  Project=  brief one-word alphanumeric project name

Example: stack_Cluster testolf
---------------------------------------------------------------------

TODO Perform the clustering step by executing stack_Cluster correctly. How could you speed up the clustering step if you needed to?

QS How do you know when this stage has completed?

The above example returns the following from the system:

  Clustering sequence data
  Project: testolf
  Exporting:
  Clustering: 2600 sequences
  Using: 1 cpus
  Parameters: word_size=6 similarity_cutoff=096 minimum_sequence_size=50 window_size=100
  Finished clustering
  Importing results
  stack_Cluster finished
  Created 281 clusters
  921 sequences were members of a cluster
  Generating some statistics
  ============================================================
  =                  CLUSTER STATISTICS                      =
  ============================================================
  There are 1679 singletons
  There are 210 clusters with 2 sequences
  There are 37 clusters with 3 sequences
  There are 7 clusters with 4 sequences
  There are 6 clusters with 5 sequences
  There are 8 clusters with 6 sequences
  There are 4 clusters with 7 sequences
  There are 1 clusters with 8 sequences
  There are 2 clusters with 9 sequences
  There are 1 clusters with 10 sequences
  There are 1 clusters with 12 sequences
  There are 1 clusters with 13 sequences
  There are 1 clusters with 15 sequences
  There are 1 clusters with 53 sequences
  There are 1 clusters with 127 sequences

ASSEMBLING THE CLUSTERED ESTs

3.7 To take advantage of the benefits
of looser clustering, it is necessary to further align and analyze the clusters generated by d2_cluster. The related but loose clusters are thus subsequently processed by PHRAP to identify, characterize and isolate any sequence divergence. PHRAP aligns and assembles the ESTs grouped together by d2_cluster, and improves alignment quality by removing particularly distinct sequences as singletons. stackPACK retains the PHRAP alignment, even though it is further processed and regenerated by the stack_Analysis step.

---------------------------------------------------------------------
Command:  stack_Assemble
Info:     runs PHRAP on clusters generated by d2_cluster
Requires: bin.ext/phrap
          bin.ext/ace2gde
Usage:    stack_Assemble [Project]
Where:
  Project=  brief one-word alphanumeric project name

Example: stack_Assemble testolf
---------------------------------------------------------------------

TODO Perform an assembly of your test dataset using the commands above.

The above example returns the following from the system:

  Assembling cluster data
  Project: testolf
  Processing: 281 clusters
  Parameters: vector_bound=0 forcelevel=0 trim_score=150 penalty=-2 gap_init=-4 gap_ext=-3 ins_gap_ext=-3 del_gap_ext=-3 maxgap=30
  Cluster: 241 generated 2 sub-contigs
  stack_Assemble finished
  Processed 281 clusters
  Total contigs generated: 282
  Total clusters that had multiple contigs: 1
  Total clusters that did not have a contig: 0

Note: The cluster generated by d2_cluster may be split into one or more contigs by PHRAP.

ANALYZING THE ASSEMBLED CLUSTERS

3.8 Analyzing Data

Aligned clusters, particularly those generated by a loose clustering engine, need to be further processed for errors, such as those inherent in single-pass sequences, and alignments analyzed for alternate forms of expressed sequences.
Although PHRAP aligns sequences, these alignments lack information about variation within the cluster and do not help users distinguish alternative splicing or other scientifically interesting events from alignment problems induced by low sequence quality or experimental artifacts. CRAW is thus employed to analyze alignments, partition sub-assemblies and provide a simple means to view clusters. After CRAW processing, stackPACK further analyzes clusters to refine consensus sequences, maximize consensus sequence length, create final alignments and select the best consensus sequence.

CRAW works by verifying agreement along the columns of a multiple sequence alignment, using the data to sort related sequences within each cluster and to generate IUPAC-conformant consensus sequences for each sub-cluster. A sub-cluster is generated if 50% or more of a 100 base window differs from the remaining sequences of a cluster, excluding the initial 100 bases of any read.

The approach depends fundamentally on the alignment quality of each assembly. A poor alignment will yield erroneous sub-clusters, and too low a gap penalty may yield too many columns in agreement and thus fail to create sub-clusters where they would be appropriate.

---------------------------------------------------------------------
Command:  stack_Analysis
Info:     runs CRAW on aligned sequence data; further analyses CRAW subassemblies
Requires: bin.ext/craw
Usage:    stack_Analysis [Project]

Example: stack_Analysis testolf
---------------------------------------------------------------------

TODO Run CRAW on your assembled ESTs using stack_Analysis.
The above example returns the following from the system:

  Analysing contig data
  Project: testolf
  Processing: 282 contigs
  Parameters: sig=05 window_size=100 ignore_first=50
  reassigning lone singleton 1871 from 0 to 1
  reassigning lone singleton 2416 from 0 to 1

LINKING THE PROCESSED SINGLETONS AND CLUSTERS

3.9 All ESTs generated from the same cDNA clone correspond to a single gene. Each EST is searched for clone identification so that the transcripts corresponding to the same gene can be identified and linked. Only a proportion of ESTs in GenBank currently have documented clone information. This information is used to extend the length of the cluster consensus sequences by joining clusters that contain ESTs that share clone IDs. The program can therefore create linked clusters only if the input sequences contain clone information.

Given that the clone ID information is solely annotation-based and may have namespace overlaps depending on the data source(s), this step is best handled near the end of the processing pipeline. Furthermore, unless a specific 5'-3' pair can be identified as a seed for each gene consensus, the procedure is transitive in nature and may lead to extensive clone-linked networks whose biological significance remains to be explored.

To avoid spurious linking, the program currently requires at least two independent clone ID matches before two clusters will link. Additionally, if the program detects a high sequence/cloneID ratio, it will not process the linking. The default setting for this max_seq_per_clone parameter is 2.

When a closed set of clone-linked consensi has been identified, the program will attempt to order them as 5'-unassigned-3' based on a majority rule from the EST annotations in each cluster. To form a final consensus sequence, the non-redundant best cluster consensi are joined by linker segments of 20 Ns.
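The linking rule and the joining step can be illustrated with a short sketch. It assumes one reading of "at least two independent clone ID matches", namely that two clusters must have at least two clone IDs in common, and it leaves out the max_seq_per_clone check and the 5'/3' ordering. The cluster names, clone IDs and function names are made up, and stack_Link itself may work differently.

```python
from itertools import combinations

LINKER = "N" * 20  # linker segment of 20 Ns, as described in the text


def linkable_pairs(cluster_clones, min_shared_clones=2):
    """Return pairs of clusters that share enough clone IDs to be linked.

    cluster_clones maps a cluster name to the set of clone IDs found in
    the annotations of its ESTs.
    """
    pairs = []
    for a, b in combinations(sorted(cluster_clones), 2):
        if len(cluster_clones[a] & cluster_clones[b]) >= min_shared_clones:
            pairs.append((a, b))
    return pairs


def join_consensi(ordered_consensi):
    """Join already-ordered consensus sequences with the N linker."""
    return LINKER.join(ordered_consensi)


cluster_clones = {
    "cl1": {"347389", "347390", "100"},
    "cl2": {"347389", "347390"},
    "cl3": {"100"},
}
pairs = linkable_pairs(cluster_clones)
linked = join_consensi(["ACGT", "TTGA"])
print(pairs, len(linked))
```

In this toy data cl1 and cl2 link because they share two clone IDs, while cl3 shares only one with cl1 and is left unlinked. The joined consensus carries the 20-N linker between its two parts.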
The linker length of 20 was chosen based on the word size employed by BLAST, so that alignment breaks would be preferentially inserted at these linker regions.

---------------------------------------------------------------------
Command: stack_Link
Info:    creation of linked clusters
Usage:   stack_Link [Project]

Example: stack_Link testolf
---------------------------------------------------------------------

TODO Link your processed ESTs using the command structure above.

The above example returns the following from the system:

  Linking cluster data
  Project: testolf
  Links: 2
  Pass 1 - Reducing clusters
  Pass 2 - Identifying links and updating the Database

VIEWING, ANALYZING AND EXTRACTING data from the COMPLETED PROJECT

TODO Open your web browser and point it to the suggested URL for this tutorial. Then go to the next section titled 'The Web Interface'.

4. The stackPACK results are stored in a relational database and are viewed and exported by using the web interface components WebProbe(tm) and WebReport(tm). WebProbe provides viewing tools that link consensus sequences, alignments, expression analysis and external data sources like UniGene. WebReport provides access to a list of predefined reports that can be selected and downloaded for further data evaluation or to create searchable databases of your clustering results.

The web-based interface is typically invoked by opening the following location in your browser, where [hostname] stands for the name of your stackPACK server:

  http://[hostname]/stackpack/

The hostname can be confirmed by viewing the WEBPROBE entry in the file /etc/stackpack. For example, if the hostname is "myhost.egenetics.com", the WEBPROBE entry will look like this (HTTP_SERVER holds whatever was set at installation):

  [WEBPROBE]
  HTTP_SERVER=http://myhost.egenetics.com
  HTML_LOCATION=/stackpack
  CGI_LOCATION=/cgi-bin/stackpack

5.
Configuration is as described above under 'SETTING YOUR PARAMETERS INDEPENDENT OF SYSTEMWIDE SETTINGS': stackPACK first sources the system-wide file /etc/stackpack and then ~/.stackpackrc in the user's home directory, so any parameter set in ~/.stackpackrc overrides the system-wide value.

6. For more information about stackPACK or answers to technical questions, please contact the Electric Genetics team on:
  phone   +27 (0)21 959-3964
  fax     +27 (0)21 959-2512
  e-mail  support@egenetics.com
  web     www.egenetics.com

The web interface

The web interface gives access to the following programs:
* WebPipe: Project creation and initialization of the clustering process.
* WebProjectManager: Project manipulation and management.
* WebProbe: Viewing of results.
* WebReport: Output reporting for download and evaluation.
* Detailed User Manual information is available under this web interface, with "help" links available from every page of the interface.

As we have already completed the clustering run using the command line, we will only use the viewing and extraction tools. Note that the entire process can be performed via the web interface if desired.
TODO Go to the section: 'How to use the WebProjectManager'

Description of formats for the system:

* GenBank flatfile format
  o GenBank flatfile format is defined as the format of sequence entries in the GenBank database or as downloaded from the NCBI web site (e.g., Entrez search results) when GenBank format is specified. The full GenBank format specification is found in section 3.4 of the GenBank release notes.

* Simple FASTA format
  o >[accession].[direction] [clone ID]
    (where direction is either "r1" for a 3-Prime clone or "f1" for a 5-Prime clone)
  o e.g.
    >37463.f1 g83244
    ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
    TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
    CTCAGTCGTACGTACGTACGT

* stack FASTA format
  o >[accession] [gi] | [accession] CLONE: [clone] CLONE_LIB:[clonelib] LEN: [len] FILE [source file] [direction] DEFN:[descriptive text]
  o e.g.
    >T27877 g609975 | T27877 CLONE: 17194 CLONE_LIB: Human Eye LEN:505 bp FILE gbest3.seq 5-PRIME DEFN:EST19137 Homo sapiens cDNA 5'end
    ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
    TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
    CTCAGTCGTACGTACGTACGT

* NCBI FASTA format
  o As retrieved through NCBI's Entrez when selecting the FASTA option from the display menu. A basic description can be found at http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
  o e.g.
    >gi|4468770|emb|AJ009167.1|TSAJ9167 Trypanosoma sp. 18S rRNA gene, isolate K&A
    ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
    TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
    CTCAGTCGTACGTACGTACGT

* Mixed or unknown FASTA formats
  o Files with mixed FASTA header line formats or files with FASTA header lines not described above can also be imported.
  o If stackPACK does not identify one of its pre-defined FASTA headers, the program determines an accession number for the sequence entry by extracting all valid characters (alphanumeric or "_" or ".") found between the > and the first space.
  o Sequences will ONLY be imported if there are 20 or fewer valid characters between the > and the first space. You should therefore ensure that these characters are unique for each sequence entered for processing.
  o If possible, other details in the header may be parsed in as well. Otherwise, the remainder of the line is ignored.
  o The Mixed or unknown FASTA format option can NOT be used if the input data file contains sequences in NCBI FASTA format, as this format typically has more than 20 characters between the > and the first space. Sequences in NCBI FASTA format will thus ONLY be imported if the NCBI FASTA format option is specified.

The minimum requirement for a FASTA format input file to stackPACK is a header of the form:
  >[accession number]

To view the results of the processing, access the files via the WebProjectManager:

TODO How to use the WebProjectManager
* Click on WebProjectManager in the menu bar for a list of the various projects.
* Click on:
  o Project Name for a status report.
  o Info for a full summary report.
  o Delete for the deletion of a project.

Click on the name of your project and wait for a description to be generated.

The status report displays the following project information (from left to right):
* Cluster data
  o Total number of sequences processed
  o Number of multi-sequence clusters
  o Number of sequences in multi-sequence clusters
  o Number of singletons (singles)
* Clone linked cluster data
  o Number of sequences in linked multi-sequence clusters (clonelinks)
  o Number of clonelink consensus sequences (ln#)
  o Number of unlinked singletons
  o Number of unlinked multi-sequence clusters

WebProjectManager simply lists the data in the project and gives you access to its management. The completed processing pipeline results in sets of files and relationships that can be explored. The WebProbe tool allows for exploration of the project.

How to use the WebProbe

TODO Click on WebProbe in the menu bar.
* Enter your project name, or click on "(..)" for a full list of projects.
* Query the WebProbe:
  o by accession number.
  o for clusters with potential alternate expression forms.
  o for a summary report.
* Click on "Go Ahead" to view the sequence data.

TODO Accession number: You can query an accession taken from the input file you used. Try: plasmodium (as the project name). You can also use a stackPACK accession such as cl1 (cluster 1). Try the accession number cl1 and project name plasmodium. Note that the system is case sensitive.

The system will look for any cluster or singleton that contains the accession you have chosen and return it to you.

You can also generate a report of the whole project, which returns a list of all clusters. Try project plasmodium and click the 'project summary' button. After a few moments, you will see something that looks like this:

Summary Report for plasmodium

The following is a summary of the project contents. Please note that you can click on any one of the listed items to view them in WebProbe.

Number of Input sequences for the project is 1115
Number of Clonelinks in the project is 23
* ln1 ( 2 )
* ln2 ( 2 )
* ln3 ( 2 )
Number of Clusters in the project is
Number of Singleton ESTs in the project is

What is the difference between a clone-linked cluster and a cluster?

Go to project smallolf, or your own version of smallolf, click on a clonelinked cluster, and view the first report that appears. How many sub-clusters make up the clonelinked cluster?
Here is an example of the first view you may see:

>ln1; COVERAGE:0.74; TOTAL_ESTS:4; LENGTH:863bp ; MAP: ; CATEGORIES:
GGAGAGCAGGCCCTACTTCCAGGGAACAGGTTGAGATCTGGAGTCCCTGTAGGGTCAGA
GCTAGAGGACCCAGAGGAGGAAGTCCTGGAAGGCCTTCCTGGAGGAGGGGCTGTCAGAG
CTGAGTCCAAACTGAAGAGGCATTTGCAATCCAGGAGAAAGCGACCCCTGGTAGGGGnA
GCTGnCAAGAGGAAAAGCTGAGAGATACCAAGAAATGCAAGGGACCTGCATCCCCATGC
ATCCCTCTGCCCATCTGCAGGGGCACTTAGAAGTACACGGAGCCCTCGCTGTCTCCTTG
GGTCATCGAATTTCTGGATCTGAGTCTTGAGATGCCTCAGTTTACCCTTCAGGTAGGTn
GGCAGCGAGCCTGCTTnTCCAGGGAAGCCAGGGTnCCTAGGCAGGGCGAGACCCGGAAG
TTTTnNNNNNNNNNNNNNNNNNNNNTCAAAACCAGCGCCCCCCGCCCTCCGTGCCAGCC
CCAGCCGGGACCCCACAAGGCAAAGACCAAGAAGATTGTGTTTGAGGATGAGTTGCTCT
CCCAGGCCCTCCTGGCGnCCAAGAAGCCTATTGGAGCCATCCCTAAGGGGCATAAGCCT
AGGCCCCACCCAGTGCCCGACTATGAGCTTAAGTACCCGCCAGTGAGCAGTGAGAGGGA
ACGGAGCCGCTATGTCGCAGTGTTCCAGGGACCAGTACGGAGAGTTCTTGGGAGCTCCA
GCACGGAGGTGGGGGTGTTGCACAGGCAAAGTTCAGGGCAGCTGGGAGGCCCTGCTTGA
GCTCCCTTGCCCCCACCCCAAAGCCAGAAGGGAGGGCCCAATTTGCAGCCCGGTTTTTG
GAGGGATTTTTAGATTTGAAGnGATTGGTTGAnTTTT

Note that to the left of the consensus sequence you are viewing, you will see folders that represent the subsequences and alignments that make up the consensus view.

This sequence is named ln1. That means that it is a clonelinked sequence, and that it was the first one generated in this project.
COVERAGE gives the average coverage of the consensus by ESTs, here 74%.
TOTAL_ESTS gives the number of ESTs that make up the consensus, here 4.
LENGTH gives the length of the contiguous consensus, here 863bp.
No MAP data is available in this entry. No CATEGORIES are available either.

A string of NNNNNNNNNNN characters joins two parts of this consensus. It is an arbitrary spacer placed there to bridge the unknown distance between the end of one linked sequence and the start of the next.

'n' characters are placed into the consensus where there is not enough information to call the base at that position.

TODO Now select 'CRAW ANALYSIS VIEW'. This will provide a processed view of the alignment.
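If a consensus like the one above has been saved to a file, its header fields and its N/n content can also be read with a few lines of Python. The following is only a minimal sketch and not part of stackPACK: the function names are hypothetical, the header layout is taken from the single ln1 example shown above, and treating runs of upper-case 'N' as the linking spacer and lower-case 'n' as uncalled bases is an assumption based on that example.

```python
import re


def parse_consensus_header(header):
    """Split a header such as '>ln1; COVERAGE:0.74; ...' into (id, fields).

    Assumes fields are separated by ';' and written as KEY:VALUE,
    as in the ln1 example. Empty values (e.g. 'MAP:') become ''.
    """
    parts = [p.strip() for p in header.lstrip(">").split(";")]
    seq_id = parts[0]
    fields = {}
    for part in parts[1:]:
        if ":" in part:
            key, value = part.split(":", 1)
            fields[key.strip()] = value.strip()
    return seq_id, fields


def describe_consensus(sequence):
    """Count lower-case 'n' bases and locate runs of upper-case 'N'.

    Assumption: runs of two or more 'N' are linking spacers, while
    lower-case 'n' marks a position with too little information.
    Each spacer run is reported as (start_index, run_length).
    """
    spacer_runs = [
        (m.start(), m.end() - m.start())
        for m in re.finditer(r"N{2,}", sequence)
    ]
    return {"ambiguous_n": sequence.count("n"), "spacer_runs": spacer_runs}


if __name__ == "__main__":
    header = ">ln1; COVERAGE:0.74; TOTAL_ESTS:4; LENGTH:863bp ; MAP: ; CATEGORIES:"
    seq_id, fields = parse_consensus_header(header)
    print(seq_id, fields)
    # A short made-up sequence, used only to exercise the function.
    print(describe_consensus("ACGTnAC" + "N" * 10 + "GGTnA"))
```

The spacer positions tell you where the consensus switches from one linked sub-sequence to the next, which is useful when deciding which part to submit to a database search.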
In the CRAW analysis view, try the buttons on the interface to find:
* A cluster sequence that has 2 or more ESTs
* A poor quality consensus region of a cluster
* An alternate form of a transcript

Extract an alternate sequence from project plasmodium and submit it to a BLAST search. What species of Plasmodium is it? Do you suspect alternate splicing?

Go to project alts. This project shows alternate splicing in a human gene. What gene is it, and how do you find the exon structure that matches the alternate splicing detected by stackPACK?

TODO
* Click on WebReport in the menu bar.
* Enter the Project Name.
* Choose the type of report you wish to generate.

Here you have the opportunity to generate a report that is non-redundant. Using the 'non-redundant output of entire project' button, you can save the whole project in FASTA format to your work area. Another useful form of this information is comma delimited format, for subsequent analysis or for extraction into a database or an Excel spreadsheet. There are several options for data output, and these vary depending on what you require from the project.

COMPARISON VS TIGR AND UNIGENE

Download the consensus sequence for cl1 in the alts project. Search against:
* The BLAST database at NCBI, www.ncbi.nlm.nih.gov
* The Ensembl database at www.sanger.ac.uk

What chromosome is the gene from? What is the name of the gene? Can you tell which exons vary? Look at the viewer and decide upon the gene name of this sequence.

How many UniGene clusters seem to represent this sequence? Are they (or is it) identified as having the same function?

Search this gene against the TIGR THC database. How many THC results are returned? Are they assigned the same gene name?

The exercise above should give you some indication of the role that algorithm choice and system design play in transcript reconstruction.
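When you prepare your own FASTA file for a further run, it can help to check the header lines against the rule given earlier for mixed or unknown FASTA formats before importing them. The sketch below is one reading of that rule, not stackPACK's own code: the function names are hypothetical, and it assumes that invalid characters are simply skipped when the accession is built and that a header is rejected when more than 20 valid characters remain.

```python
import re
from collections import Counter

MAX_VALID_CHARS = 20  # limit stated in the format description


def extract_accession(header):
    """Return the accession derived from a FASTA header, or None.

    Takes the text between '>' and the first space and keeps only
    the valid characters (alphanumeric, '_' or '.'). Returns None
    when the line is not a header, when nothing valid remains, or
    when more than MAX_VALID_CHARS valid characters are present.
    """
    if not header.startswith(">"):
        return None
    token = header[1:].split(" ", 1)[0]
    accession = "".join(re.findall(r"[A-Za-z0-9_.]", token))
    if not accession or len(accession) > MAX_VALID_CHARS:
        return None
    return accession


def find_duplicates(headers):
    """List accessions that occur more than once, in first-seen order."""
    counts = Counter(
        acc for acc in map(extract_accession, headers) if acc is not None
    )
    return [acc for acc, n in counts.items() if n > 1]


if __name__ == "__main__":
    headers = [
        ">37463.f1 g83244",
        ">T27877 g609975 | T27877 CLONE: 17194",
        ">gi|4468770|emb|AJ009167.1|TSAJ9167 Trypanosoma sp. 18S rRNA gene",
    ]
    for h in headers:
        print(h, "->", extract_accession(h))
    print("duplicates:", find_duplicates(headers))
```

Under these assumptions the NCBI-style header is rejected because it has more than 20 valid characters before the first space, which matches the note that such files need the NCBI FASTA format option instead.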