BioParser Home Page

>> Introduction
>> Announcements
>> License
>> Distribution Structure
>> MySQL Database Configuration
>> BioParser Configuration
>> Installation
>> Input File Format
>> Parsing Options and Usage
>> Crash Recovery System
>> Filtering Options and Usage
>> BioParser Output Files
>> Reference
>> Requests

Introduction

The widely used programs BLAST (both National Center for Biotechnology Information [NCBI] and Washington University versions) and FASTA for similarity searches in nucleotide and protein databases usually produce copious output, and when large query sets are used, human inspection rapidly becomes impractical. BioParser is a Perl program for parsing sequence similarity analysis reports. Making extensive use of the BioPerl Toolkit, the program filters, stores and returns components of these reports in either ASCII or HTML format. BioParser can also automatically load the parsed information into a local MySQL database, allowing subsequent filtering of hits and/or alignments with specific attributes.

Announcements

[03-May-2006] BioParser 1.2.0 is now available

The new version is able to parse and analyze the results obtained with the sequence similarity search program HMMER (both HMMSEARCH and HMMPFAM). In addition, the BioParser Browser was improved with new search fields (QDesc, HDesc, QLength, HLength) and operators (like/not like, and/or), making BioParser even more flexible. Request a copy.

[03-May-2006] BioParser Web is launched

An on-line version of BioParser is now available. Parse and analyze your BLAST, FASTA, SSEARCH or HMMER results on our server free of charge!

[12-July-2006] BioParser 1.2.1

BioParser 1.2.0 has been updated with the BioPerl version available from its Concurrent Versions System (CVS) repository. The current and past distributions of BioPerl are unable to handle the new BLAST output style.

[16-January-2007] BioParser 1.2.2

BioParser has been updated with version 1.5.2 of BioPerl. The new BioPerl version introduces some useful features, such as a faster Bio::SearchIO, and it contains many bug fixes since the 1.5.1 release;

The function accounting for the Run SQL field has been updated to avoid misinterpretation of some mathematical symbols;

Miscalculations of "Queries without Hits" and "Hits without HSPs" reports have been fixed;

Other small changes in the code have been made to improve the way BioParser deals with the sequence frame information in TBLASTN, TFASTX, and TFASTY reports, and also to accurately calculate the Ident(%) and Pos(%) in HMMSEARCH and HMMPFAM reports.

[05-February-2007] BioParser 1.2.3

Some numeric data fields in the BioParser MySQL database structure have been updated to account for sequence similarity reports containing a huge number of queries/hits/HSPs and/or very large sequences.

License

BioParser v1.2.3
This software is licensed for non-commercial use only.
Distributed under the terms of:
Creative Commons Attribution-NonCommercial-NoDerivs 2.0 License.
http://creativecommons.org/licenses/by-nc-nd/2.0/legalcode

Distribution Structure

bioparser-1.2.3/
Root directory; license file, installation instructions, version changes.
bioparser-1.2.3/parser/
BioParser GUI; necessary files to use the Perl/Tk parser interface.
bioparser-1.2.3/browser/
BioParser CGI browser; necessary files to use the web interface to analyse your parsed results; BioParser manual in HTML format.
bioparser-1.2.3/bioperl/unix/
BioPerl 1.5.2 (bioperl-1.5.2) compressed in .tar.gz.
bioparser-1.2.3/bioperl/windows/
BioPerl 1.5.2 (bioperl-1.5.2) PPD for ActivePerl for Windows.
bioparser-1.2.3/sql/
BioParser MySQL database structure (SQL format).

MySQL Database Configuration

BioParser needs only one account to work with the database, but it can also be set up with two different accounts (read-write and read-only). For security purposes, for instance, the parser may be given a read-write account while the CGI interface is given only a read-only account. To create the default BioParser database, start the MySQL client (mysql under Unix or mysql.exe under Windows), and issue the following command:

mysql> CREATE DATABASE bioparser;

This creates a database to import the data; you can choose any name you wish. Now you need to "enter" your database with this command:

mysql> USE bioparser;

To create the default tables, issue the following command in the same mysql client session:

mysql> SOURCE /path/to/bioparser.sql;

NOTE: Under Windows you can use either "/" or "\" to delimit directories. Importantly, the directory path cannot contain spaces; otherwise, mysql may not find the file. The default tables created by bioparser.sql are: bp_query, bp_hit, bp_hsp and bp_report (Figure 10). You can rename the tables by running the following command:

mysql> RENAME TABLE tbl_name TO new_tbl_name;

BioParser Configuration

BioParser depends on the following software:

Perl (tested under 5.8.8);
MySQL (tested under 4.1.22).

NOTE: MySQL 5 is not supported yet.

Perl - http://www.perl.org/
ActiveState Perl for Windows - http://www.activestate.com/Products/ActivePerl/
MySQL Database - http://dev.mysql.com/downloads/mysql/4.0.html

You also need the following Perl Modules:

Tk;
POE;
XML::Simple;
DBI;
DBD::mysql;
CGI::Simple;
Class::MakeMethods;
Bundle::BioPerl.

Unix users can find Perl Modules in the Comprehensive Perl Archive Network (CPAN) website at http://search.cpan.org/.

Windows users, using ActiveState Perl, should use the "ppm" utility to search and install the required modules.

It is highly recommended that you use the provided BioPerl distribution (version 1.5.2) found in the "bioperl/unix/" or "bioperl/windows/" directory. The new BioPerl version introduces some useful features, such as a faster Bio::SearchIO, and it contains many bug fixes since the 1.5.1 release. Windows users can install it using the command:

C:\path\to\bioperl.ppd> ppm install bioperl-1.5.2_100.ppd

You need to set up both the parser (BioParser GUI) and CGI (BioParser CGI browser) interfaces. The parser GUI has a built-in configuration dialog for entering the database information. The CGI interface can be set up with a regular text editor. Both programs store their configuration in XML files, which are straightforward to edit.

To set up the CGI, edit the config.xml file in the "browser" directory, which has the following structure:

You also need to set up the available databases by editing the dbase.xml file located in the "browser" directory as follows:

Installation

Installation is simple: you just need to copy the contents of some directories found in the BioParser distribution to a new directory of your choice.

For the BioParser GUI, copy the contents of the "parser" directory to wherever you want to host those files (e.g., /usr/local/bioparser).

For the BioParser CGI browser, copy the contents of the "browser" directory to your web hosting directory (e.g., ~/public_html/bioparser/).

To work with BioParser browser, launch your web browser and type:

http://your.server.name/path/to/bioparser/index.pl

If you have followed each step correctly, you should see the main query page.

NOTE: You need a CGI-enabled web server (preferably Apache) with Perl and the required modules installed.

Input File Format

BioParser accepts any BLAST (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX) [NCBI or Washington University version], FASTA (FASTA, FASTX/FASTY, TFASTX/TFASTY), SSEARCH or HMMER (HMMSEARCH, HMMPFAM) output in ASCII (plain text) or XML (supported only for the NCBI BLAST version) format. The program does not support any other input format. It parses single or multiple reports, i.e., search results from one or several query sequences (or profiles) simultaneously. The files single_blast_report and multiple_blast_report are examples of BLASTP run outputs (ASCII format) for a single query and for multiple (three in this example) query sequences, respectively. Before parsing your sequence alignment report, make sure that it follows exactly one of these accepted formats; otherwise BioParser may crash.

Parsing Options and Usage

A schematic representation of the full system architecture is presented in Figure 1. Basically, BioParser takes a BLAST, FASTA, SSEARCH or HMMER report file as an input and uses the Bio::SearchIO module of the BioPerl library to parse most of the information in this file.

Three parsing options are offered: (i) saving the parsed information in ASCII (plain text) format, in which the parsed elements are displayed as a table (columns representing the sequence similarity report attributes, and lines corresponding to distinct alignments); (ii) saving it in HTML format, in which the parsed elements are displayed as a list (with the following nested structure: results -> query -> hit -> HSP; see Table 1); or (iii) transferring the parsed information to a local MySQL database, which BioParser does automatically. This last option is preferred for the analysis of large data sets.
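To make the difference between the nested list and the table layout concrete, the following Python sketch models the results -> query -> hit -> HSP hierarchy and flattens it into one row per alignment. It is only an illustration: the class names, the flatten and to_ascii functions, the tab separator and the sample values are all assumptions, not BioParser's own code or its exact output format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HSP:
    rank: int
    score: float
    evalue: float


@dataclass
class Hit:
    name: str
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class Query:
    name: str
    hits: List[Hit] = field(default_factory=list)


def flatten(queries):
    """Return one row per alignment (HSP), as in the table layout."""
    rows = []
    for query in queries:
        for hit in query.hits:
            for hsp in hit.hsps:
                rows.append((query.name, hit.name, hsp.rank, hsp.score, hsp.evalue))
    return rows


def to_ascii(rows):
    """Join the rows into tab-separated lines under a header line."""
    header = ("QueryName", "HitName", "HSP", "Score", "Evalue")
    lines = ["\t".join(header)]
    lines += ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


# Hypothetical sample data: two queries, the second one without hits.
queries = [
    Query("q1", [Hit("h1", [HSP(1, 250.0, 1e-50), HSP(2, 80.0, 1e-10)]),
                 Hit("h2", [HSP(1, 60.0, 0.001)])]),
    Query("q2", []),  # a query without hits contributes no rows
]
rows = flatten(queries)
print(to_ascii(rows))
```

In this sketch a query without hits simply disappears from the table, which is one reason such queries are worth counting separately.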

BioParser also includes a web-based interface (BioParser Browser), which offers a user-friendly environment to interact with the MySQL database, allowing the user to apply a number of selection criteria to the parsed data so as to filter out hits and/or alignments with specific features (see section Filtering Options and Usage).

Table 1 summarizes the available parsing options and all BLAST, FASTA, SSEARCH or HMMER attributes which can be parsed with BioParser.

Examples of ASCII and HTML outputs can be found in the BioParser Output Files section of this page.

Figure 1. Schematic representation of the BioParser system architecture.

Table 1. Summary of BioParser parsing options, showing all BLAST, FASTA, SSEARCH or HMMER attributes available for parsing with their corresponding description.

Parsing Options and Attributes
Report Section	HTML	ASCII	Database	Description
RESULTS	Algorithm	Algorithm	algorithm	Algorithm
RESULTS	Version	Version	version	Algorithm version
RESULTS	DB Name	DB Name	db_name	Database name
RESULTS	DB Letters	DB Letters	db_letters	Number of residues in database
RESULTS	DB Entries	DB Entries	db_entries	Number of database entries
RESULTS	Parameters	Parameters	parameters	Parameters used
QUERY	Name	QueryName	query_name	Query name
QUERY	Accession	-	query_accession	Query accession number
QUERY	Description	-	query_desc	Query description
QUERY	Length	QLength	query_len	Query length
QUERY	No. Hits	-	-	Number of hits
HIT	Name	HitName	hit_name	Hit name
HIT	Accession	-	hit_accession	Hit accession number
HIT	Description	-	hit_desc	Hit description
HIT	Length	HLength	hit_len	Hit length
HIT	Identities (%)	HIdent(%)	hit_fr_identities	Overall fraction of identical positions across all HSPs (aligned regions only)
HIT	Positives (%)	HPos(%)	hit_fr_positives	Overall fraction of conserved positions across all HSPs (aligned regions only)
HIT	Frac. Aligned Query (%)	HAlnQuery(%)	hit_fr_aln_query	Fraction of the query sequence which has been aligned across all HSPs (not including intervals between non-overlapping HSPs)
HIT	Frac. Aligned Hit (%)	HAlnHit(%)	hit_fr_aln_hit	Fraction of the hit sequence which has been aligned across all HSPs (not including intervals between non-overlapping HSPs)
HIT	No. HSPs	-	-	Number of HSPs for a given hit
HSP	HSP Rank	HSP	-	Rank of the HSP within a given hit
HSP	Query Frame	QFrame	hsp_query_frame	Frame of the query sequence (0 = -1/+1; 1 = -2/+2; 2 = -3/+3); not defined for (wu-)blastn, (wu-)blastp, fastn, fastp and (wu-)tblastn
HSP	Hit Frame	HFrame	hsp_hit_frame	Frame of the hit sequence (0 = -1/+1; 1 = -2/+2; 2 = -3/+3); not defined for (wu-)blastn, (wu-)blastp, fastn, fastp and (wu-)blastx
HSP	Query Strand	QStrand	hsp_query_strand	Strand of the query (1 = Plus; -1 = Minus; 0 = not defined)
HSP	Hit Strand	HStrand	hsp_hit_strand	Strand of the hit (1 = Plus; -1 = Minus; 0 = not defined)
HSP	Score	Score	hsp_score	Raw score
HSP	Bits	Bits	hsp_bitscore	Bit score
HSP	E value	Evalue	hsp_evalue	Expectation value for the HSP (e-value)
HSP	Identities	Ident	hsp_identities	Number of identical residues
HSP	Identities (%)	Ident(%)	hsp_fr_identities	Fraction of identical positions for a given HSP
HSP	Positives	Pos	hsp_positives	Number of conserved residues
HSP	Positives (%)	Pos(%)	hsp_fr_positives	Fraction of conserved positions for a given HSP
HSP	Query Gaps	QGaps	hsp_query_gaps	Number of gaps in the query alignment
HSP	Hit Gaps	HGaps	hsp_hit_gaps	Number of gaps in the hit alignment
HSP	HSP Length	HSPLen	hsp_length	Length of HSP (full length of the alignment)
HSP	Query Overlap	QOverlap	hsp_query_overlap	Length of query participating in alignment minus gaps
HSP	Hit Overlap	HOverlap	hsp_hit_overlap	Length of hit participating in alignment minus gaps
HSP	Frac. Aligned Query (%)	AlnQuery(%)	hsp_fr_aln_query	Fraction of the query sequence which has been aligned within a given HSP
HSP	Frac. Aligned Hit (%)	AlnHit(%)	hsp_fr_aln_hit	Fraction of the hit sequence which has been aligned within a given HSP
HSP	-	-	hsp_query_start	Query start position from the alignment
HSP	-	-	hsp_query_end	Query end position from the alignment
HSP	-	-	hsp_hit_start	Hit start position from the alignment
HSP	-	-	hsp_hit_end	Hit end position from the alignment
HSP	Query range	QRange	-	Query start and end positions from the alignment
HSP	Hit range	HRange	-	Hit start and end positions from the alignment
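The percentage attributes in Table 1 can be illustrated with a short Python sketch. It assumes that Ident(%) is the number of identical residues divided by the HSP length, and that the hit-level "fraction aligned" values merge overlapping HSP ranges so that shared positions are counted once and the intervals between non-overlapping HSPs are excluded. These formulas are one reading of the descriptions above, not a confirmed account of how BioParser computes the values; the function names and numbers are hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def merged_length(ranges):
    """Total length covered by 1-based inclusive (start, end) ranges.

    Overlapping ranges are merged so shared positions count once, and
    intervals between non-overlapping ranges are not counted.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(ranges):
        if current_end is None or start > current_end:
            # Disjoint from the running range: close it and start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlaps the running range: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


# HSP level: 90 identical residues in an alignment of length 100.
hsp_ident_pct = percent(90, 100)

# Hit level: three HSPs on a query of 200 residues; the first two overlap.
query_ranges = [(1, 100), (51, 150), (181, 200)]
hit_aln_query_pct = percent(merged_length(query_ranges), 200)

print(hsp_ident_pct)      # 90.0
print(hit_aln_query_pct)  # 85.0
```

Here positions 1-150 and 181-200 are covered, giving 170 aligned residues out of 200; the uncovered interval 151-180 is left out.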

Usage

Parsing to ASCII or HTML format

Select the appropriate input file format using the drop-down button of the Input Options Format field (Figure 2): blast, for any BLAST or WU-BLAST ASCII output file; blastxml, for BLAST XML output files; fasta, for any FASTA or SSEARCH output; hmmer, for HMMSEARCH or HMMPFAM reports. Enter the path to the input file in the Input Options File field or browse to it;
Select the desired output file format using the drop-down button of the Output Options Format field (ASCII or HTML) (Figure 2). Enter the path to the output file in the Output Options File field or browse to it;
Press the Parse button.
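The correspondence between search programs and the values of the Input Options Format field can be summarized as a lookup. The Python sketch below is only a reading aid built from the program names listed in the Input File Format section; the function input_format and its xml argument are hypothetical and are not part of BioParser.

```python
BLAST_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}
FASTA_PROGRAMS = {"fasta", "fastx", "fasty", "tfastx", "tfasty", "ssearch"}
HMMER_PROGRAMS = {"hmmsearch", "hmmpfam"}


def input_format(program, xml=False):
    """Return the Input Options Format value for a search program name.

    The xml flag marks an XML report; according to the manual this is
    supported only for NCBI BLAST (the function cannot tell NCBI BLAST
    and WU-BLAST reports apart, so that check is left to the caller).
    """
    name = program.lower()
    if name in BLAST_PROGRAMS:
        return "blastxml" if xml else "blast"
    if xml:
        raise ValueError("XML input is supported only for NCBI BLAST reports")
    if name in FASTA_PROGRAMS:
        return "fasta"
    if name in HMMER_PROGRAMS:
        return "hmmer"
    raise ValueError("unsupported program: " + program)


print(input_format("BLASTP"))
print(input_format("blastx", xml=True))
print(input_format("SSEARCH"))
print(input_format("hmmpfam"))
```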

Figure 2. Example of parsing to ASCII format.

Parsing to a local MySQL database

Select the appropriate input file format using the drop-down button of the Input Options Format field (Figure 3): blast, for any BLAST or WU-BLAST ASCII output file; blastxml, for BLAST XML output files; fasta, for any FASTA or SSEARCH output; hmmer, for HMMSEARCH or HMMPFAM reports. Enter the path to the input file in the Input Options File field or browse to it;
Select Database using the drop-down button of the Output Options Format field (Figure 3);

Figure 3. Example of parsing to database.

Press the Config button (Figures 2 and 3);
If a new database is going to be created with the parsed information, please follow the database creation and the BioParser CGI browser configuration steps described in the MySQL Database Configuration and the BioParser Configuration sections of this page. After creating the local database, fill in the Database Config fields with the database configuration settings or apply the default options (Figure 4). The Server name, Username and Password must be supplied. Uncheck (mouse click) all the Database Options fields (Save Password can optionally be selected);

Figure 4. Database configuration.

If the database already exists, fill in the Database Config fields with the database configuration settings. If you want to delete any stored information before adding a new one, check the suitable Empty Table field(s) (Figure 5). On the other hand, if you want to update the database with new data, uncheck the appropriate Empty Table field(s) and, optionally, check the Optimize Tables field (recommended) (Figure 6). Be aware that, in this case, the Report Table will be overwritten and the previously stored information will be lost. In both cases, Save Password can optionally be selected;
Press the Parse button.

Figure 5. Emptying all tables of an existing database before parsing the new information.

Figure 6. Updating all tables of an existing database with new parsed information.

Crash Recovery System

BioParser includes a crash recovery system that allows the parsing-to-database process to be interrupted without data loss. The parsing process can be resumed according to the following instructions:

Input the original sequence alignment report file and database configuration settings;
In the Database Options section, check (mouse click) the Optimize Tables field (highly recommended) and uncheck (mouse click) all Empty Table fields;
Press the Resume button instead of the Parse one.

Filtering Options and Usage

If the parsed information has been stored in a local MySQL database, the user can select queries that are related to hits and/or alignments with particular attributes through the BioParser browser (Figure 7). At least thirteen different attributes can be used to filter the parsed results: QueryName (name of the query sequence), HitName (name of the hit sequence), QDesc (description of the query sequence), HDesc (description of the hit sequence), QLength (length of the query sequence), HLength (length of the hit sequence), Score, Bits, Ident(%) (fraction of identical positions for a given HSP), AlnQuery(%) (fraction of the query sequence which has been aligned within a given HSP), AlnHit(%) (fraction of the hit sequence which has been aligned within a given HSP), Evalue (expectation value for the HSP), and SizeDiff (difference in length, expressed as a fraction, between the query and hit sequences). Users can choose one attribute or a combination of attributes, connecting them with a logical AND or a logical OR.

Figure 7. An overview of the BioParser browser showing the filtering and display options, and the SQL field.

Usage

To demonstrate the applications of BioParser browser, the BLASTP output file multiple_blast_report will be used as an example. This file contains reports resulting from consecutive BLAST searches against the NCBI Protein Reference Sequences database (RefSeq) using three distinct hypothetical proteins encoded by the human parasite Trypanosoma cruzi. After parsing the multiple_blast_report file and loading it into a local MySQL database with BioParser, let us select only the records in which the fraction of identical positions in the HSP is greater than or equal to 90% AND the fraction of the query sequence which has been aligned within the HSP is greater than or equal to 90% AND the difference in length between the query and hit sequences is within 20% (for example, if a query sequence has 100 residues, any hit sequence ranging from 80 residues to 120 residues would satisfy this constraint, and vice-versa). Also, let us select only a few attributes to be displayed in the BioParser browser web output: QueryName, QLength, HitName, HLength, Score, Bits, Evalue, Ident(%), Pos(%), QGaps, HGaps, HSPLen, AlnQuery(%), and AlnHit(%). The procedure is straightforward (Figure 8):

Launch the BioParser browser;
Check the Identity, AlnQuery(%), and SizeDiff fields of the BioParser browser Filtering Options section. Select the "greater than or equal to" symbol (>=) on the right-hand side of the Identity and AlnQuery(%) fields with the corresponding drop-down button. Fill in the blanks on the right-hand side of the Identity, AlnQuery(%), and SizeDiff fields with the following values, respectively: 90, 90, and 20;
Check the following fields in the BioParser browser Display Options section: QueryName, QLength, HitName, HLength, Score, Bits, Evalue, Ident(%), Pos(%), QGaps, HGaps, HSPLen, AlnQuery(%), and AlnHit(%);
Optionally, the order in which the records should be displayed in the database searching result can be changed by checking the Order by field and using the drop-down buttons in it (Filtering Options section). The number of records that should be displayed per page can also be altered by filling in the blank in the List records per page field (Filtering Options section);
Press the Query button in the Filtering Options section. The result will be displayed in the same window.
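The selection criteria used in this example can also be written as a small predicate. The Python sketch below assumes that SizeDiff is the absolute difference between the query and hit lengths expressed as a percentage of the query length, which matches the 80 to 120 residue example for a query of 100 residues; the exact formula used by BioParser is not stated here. The dictionary keys and sample records are hypothetical.

```python
def size_diff_percent(query_len, hit_len):
    """Length difference as a percentage of the query length (assumed definition)."""
    return 100.0 * abs(query_len - hit_len) / query_len


def passes(record, min_ident=90.0, min_aln_query=90.0, max_size_diff=20.0):
    """True when a record satisfies all three constraints (logical AND)."""
    return (
        record["ident_pct"] >= min_ident
        and record["aln_query_pct"] >= min_aln_query
        and size_diff_percent(record["q_len"], record["h_len"]) <= max_size_diff
    )


# Hypothetical records, one per alignment (HSP).
records = [
    {"query": "q1", "hit": "h1", "q_len": 100, "h_len": 80,
     "ident_pct": 95.0, "aln_query_pct": 92.0},   # passes: SizeDiff is 20
    {"query": "q1", "hit": "h2", "q_len": 100, "h_len": 121,
     "ident_pct": 99.0, "aln_query_pct": 99.0},   # fails: SizeDiff is 21
    {"query": "q2", "hit": "h3", "q_len": 300, "h_len": 310,
     "ident_pct": 85.0, "aln_query_pct": 97.0},   # fails: identity below 90
]

selected = [r for r in records if passes(r)]
print([(r["query"], r["hit"]) for r in selected])  # [('q1', 'h1')]
```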

Figure 8. Screenshot of the BioParser browser showing an example of filtering usage and the corresponding database searching result (see text for explanation).

The database searching result is displayed as a table in which the columns represent the selected sequence similarity attributes, and each line corresponds to a different alignment (Figure 8).

In this example, only 5 alignments out of 335 (total number of HSPs presented in the BLASTP report; press the Report Info button for details) satisfy the constraints. Records 1, 2 and 4 (from the top) correspond to self-alignments, whereas records 3 and 5 indicate two possible duplications in the genome of T. cruzi (the sequences in each pair are nearly identical).

Selected records can be exported to a plain text file. Check the records you would like to export (or use Check All) and click the Export selected to ASCII link (Figure 8). Press the Download button if you want to export all retrieved records to a plain text file in a single step (Figure 8).

More experienced users can refine their selection using the SQL search field (Run SQL), which accepts Structured Query Language statements (see the syntax for SQL statements supported in MySQL at http://dev.mysql.com/doc/mysql/en/SQL_Syntax.html). The information required to query the database created by BioParser with SQL is displayed in Figure 9 and Figure 10 (the default database structure and its tables and fields, respectively).

The proposed BioParser database structure is simple and intuitive: for each aligned pair present in the sequence similarity report, the attributes related to the query and hit sequences are stored (without redundancy) in the bp_query and bp_hit tables, respectively. The attributes that characterize each alignment (HSP) are stored in the bp_hsp table, which is linked to the query and hit tables by two foreign keys: query_id and hit_id, respectively (Figure 9).
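The relationship between the three tables can be tried out without a MySQL server. The Python sketch below rebuilds a reduced version of the structure in an in-memory SQLite database and joins bp_hsp to bp_query and bp_hit through query_id and hit_id. Only the table names, the two foreign keys and the column names taken from Table 1 come from this page; the hsp_id key, the column types, the storage of the fraction columns as percentages and the sample rows are assumptions, so check Figure 10 before adapting the SELECT statement to a real BioParser database.

```python
import sqlite3

# SQLite stands in for MySQL here so that the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bp_query (query_id INTEGER PRIMARY KEY, query_name TEXT, query_len INTEGER);
CREATE TABLE bp_hit (hit_id INTEGER PRIMARY KEY, hit_name TEXT, hit_len INTEGER);
CREATE TABLE bp_hsp (
    hsp_id INTEGER PRIMARY KEY,                      -- assumed key name
    query_id INTEGER REFERENCES bp_query(query_id),  -- FK to the query table
    hit_id INTEGER REFERENCES bp_hit(hit_id),        -- FK to the hit table
    hsp_fr_identities REAL,                          -- assumed to hold percentages
    hsp_fr_aln_query REAL,
    hsp_evalue REAL
);
INSERT INTO bp_query VALUES (1, 'q1', 100), (2, 'q2', 250);
INSERT INTO bp_hit VALUES (1, 'h1', 105), (2, 'h2', 240);
INSERT INTO bp_hsp VALUES
    (1, 1, 1, 98.0, 95.0, 1e-60),
    (2, 1, 2, 45.0, 60.0, 1e-5),
    (3, 2, 2, 92.5, 91.0, 1e-80);
""")

# Each query and hit is stored once; alignments reference them by id.
sql = """
SELECT q.query_name, h.hit_name, s.hsp_fr_identities, s.hsp_evalue
FROM bp_hsp AS s
JOIN bp_query AS q ON q.query_id = s.query_id
JOIN bp_hit AS h ON h.hit_id = s.hit_id
WHERE s.hsp_fr_identities >= ? AND s.hsp_fr_aln_query >= ?
ORDER BY s.hsp_evalue
"""
result = conn.execute(sql, (90.0, 90.0)).fetchall()
for row in result:
    print(row)
conn.close()
```

Because hit h2 is stored once and referenced by two alignments, the sketch also shows how the structure avoids redundancy in the query and hit tables.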

Figure 9. Entity-Relationship Diagram showing the relational structure of the default database created by BioParser. PK - primary key; FK - foreign key.

Figure 10. An outline of the BioParser default database tables and fields. The var field in the bp_report table contains the following information: algorithm, version, database (db) name, db letters, db entries, parameters, total queries, total hits, total hsps, queries without hits, and hits without hsps.

BioParser Output Files

The following files are parsed results obtained with the previous BLASTP outputs (single_blast_report and multiple_blast_report) using BioParser:

Reference

Please cite the following article when using BioParser:

Catanho M, Mascarenhas D, Degrave W, de Miranda AB. BioParser: A Tool for Processing of Sequence Similarity Analysis Reports. Applied Bioinformatics. 5(1):49-53, 2006.

Requests

To obtain a copy of BioParser, please contact bioinfoteam@fiocruz.br.
Comments, questions, suggestions and problem reports are also welcome.