The Genome Comparison Project

Help Map Protein Relationships

World Community Grid and the Oswaldo Cruz Institute, Fiocruz, will be comparing genomic information to improve the quality and interpretation of biological data and our understanding of biological systems, host-pathogen and environmental interactions. This information can play a critical role in the development of better drugs and vaccines, and improved diagnostic procedures.

>> Overview
>> About the Project
>> Project FAQs
>> Additional questions and answers about the project and results
>> Project Progress
>> Research Participants

Overview

The Genome Comparison Project: Improving protein functional annotation in databases

Over the years, a rather large body of secondary information (structural, functional, similarities to other entries and a variety of cross-references) has been attached to protein database entries. Once such information is entered, it rarely gets updated or corrected. Thus, annotation of predicted protein function is often incomplete, uses non-standardized nomenclature or can be incorrect when inferred from previous, incorrectly annotated sequences. Additionally, many proteins are composed of several structural and/or functional domains (modules comprising distinct evolutionary, functional and structural units), which can be overlooked by automated annotation procedures. Moreover, the comparative information available today is huge when compared to the early days of genomics.

The main objective of the Genome Comparison Project is to perform a complete pairwise comparison between all predicted protein sequences, obtaining similarity indices that will be used, together with the standardized Gene Ontology (http://www.geneontology.org), as a reference repository for the annotator community, providing an invaluable data source for biologists. The sequence similarity comparison program used in the Genome Comparison Project is called SSEARCH (W.R. Pearson [1991] Genomics 11:635-650), a freely available implementation of the rigorous Smith-Waterman algorithm (T. F. Smith and M. S. Waterman [1981] J. Mol. Biol. 147:195-197), which finds the mathematically best local alignment between pairs of sequences. (An algorithm is an organized procedure for performing a given type of calculation or solving a given type of problem.)

As a result, precise annotation, correction of inconsistencies, and assignment of possible functions to hypothetical proteins of unknown function will be possible. Moreover, proteins with multiple domains and functional elements will be correctly spotted. Even distant relationships will be detected.


About the Project

Genome Comparison Project: A Layperson's Explanation

Genes, genomes and genomic data

Genes are the hereditary units in all living organisms. They are essential components of the genome (the complete set of genetic information) of those organisms, and are responsible for the physical development, the metabolism and (to some extent) the behavior of those organisms. Most genes encode proteins: large molecules, made of long chains of smaller molecules called amino acids, that account for most of the biochemical reactions carried out by the cell. Some genes instead produce very important RNA molecules; others do not encode any molecule at all but are important from a structural or regulatory point of view. In any case, the molecules produced as the result of the activity of a given gene are known as gene products.

Genomes are DNA or RNA molecules stored and organized in one or more linear or circular chromosomes. Bacteria, for example, store their genome in the cytoplasm; eukaryotic organisms (organisms that consist of one or more cells, each of which has a nucleus and other well-developed intracellular compartments) organize their genome in the nucleus, as well as in specialized organelles such as mitochondria and, in plants, chloroplasts.

Since the 1990s, international efforts have led to the determination of the complete genome sequence of more than 400 organisms (http://www.genomesonline.org/), such as bacteria, yeasts, protozoan parasites, invertebrates and vertebrates, including Homo sapiens, and plants. More than 1,500 genome investigations are currently ongoing, representing medical, commercial, environmental and industrial interests or important research models. As this work continues, new genome sequences are becoming available at an ever faster pace, adding to the fragmentary data available from thousands of organisms, including viruses. The resulting data have the potential to disclose the principles underlying the genetics, biochemistry and evolution of these organisms, as well as to enable the development of new prognostic markers, better drugs and vaccines, and improved diagnostic procedures, among others.

A varying fraction of a genome (from close to 100% in bacteria to less than 2% in humans) encodes the proteins that dictate structural and functional cellular activities. Computer analysis has predicted which regions of the genome encode proteins (from several hundred or several thousand proteins in bacteria to about 30,000 proteins and their variants in humans). However, the prediction of the cellular functions of those derived proteins (structural, enzymatic, transport and signaling functions, etc.) is mostly hypothetical. The vast majority of probable functions have been attributed by in silico (computer) analysis, using sequence comparison with proteins in databases. Thus far, only a small fraction of predicted proteins have had their functions confirmed by laboratory experiments.

Protein coding genes and their annotation

Release 19 of RefSeq (September 2006), a reference sequence collection (http://www.ncbi.nlm.nih.gov/RefSeq/), registers more than 2.8 million predicted protein coding genes, from 3,774 organisms, including viruses. Most of the identifications of putative protein encoding genes and their associated protein sequences together with their functional annotation (the assignment of predicted biological functions and structural features to raw sequence data) have been done using bioinformatics tools and database comparisons. Such structural and functional annotation has been building up over the years, based on cross-referencing between the growing databases. While several efforts are under way to construct a carefully verified reference set of proteins where attributed function has been experimentally verified, using a reference set of nomenclature for gene, protein and cellular function (called Gene Ontology - GO [http://www.geneontology.org]) and standardized annotation rules, such a database does not yet exist.

Biological functions, complex systems and biodiversity

The biological systems within a cell are of great complexity, and our understanding of the whole protein content of a cell, protein interactions, biochemical pathways and their regulation is only very partial. A database reflecting all the primary sequence relationships between the corresponding proteins from all known organisms at the genomic level will be invaluable to improve our understanding of this complexity.

Additionally, the database will benefit many experimental approaches to the analysis of the biodiversity on our planet. Scientists investigating environmental samples or fragmentary analysis of new organisms will be able to use the results of the Genome Comparison analysis to investigate different aspects of the genetics and biochemistry of these organisms. Moreover, the description and analysis of evolutionary relationships between proteins (and microorganisms) based on such genome analysis will be a major step towards understanding the evolution of genome structure and the biochemical and structural organization of organisms. Large-scale initiatives such as the description of the Tree of Life and the cataloging of biodiversity will greatly benefit from the Genome Comparison database.

New drugs, vaccines and diagnostics

Scientific research and (bio)technological development based on genomics are making increasing progress towards new diagnostics, as well as the development of new drugs and vaccines. Comparative genomics and the knowledge of biochemical pathways and cellular processes are of utmost importance in this field. In addition, functional analysis and protein interaction studies are of key importance for understanding how microorganisms, cells in a multicellular organism, and pathogens interact with their environment (and/or hosts), opening up the way for the design of new control strategies for infectious and parasitic diseases, as well as metabolic and chronic or degenerative diseases.

World Community Grid and genome functional annotation

Stringent pairwise sequence comparisons are computationally intensive, and an all-against-all comparison of predicted proteins from all completely sequenced genomes available today is almost impossible to achieve without support from World Community Grid's very large grid structure. The resulting information matrix will form an invaluable database that can be continuously extended as new genome sequences become available and will form the basic material for many functional studies within the scientific community at large.
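To give a feel for the scale of an all-against-all comparison, the short Python sketch below counts the unordered pairs among n sequences. The figure of 2.8 million sequences is borrowed from the RefSeq release 19 count quoted earlier on this page and is only an assumption for illustration; the set of proteins actually compared by the project may be different. The function name `pair_count` is hypothetical.

```python
def pair_count(n: int, include_self: bool = False) -> int:
    """Number of unordered pairs among n sequences.

    With include_self=True, each sequence is also compared with itself.
    """
    if include_self:
        return n * (n + 1) // 2
    return n * (n - 1) // 2


# Assumption for illustration only: the RefSeq release 19 figure quoted above.
n_proteins = 2_800_000
pairs = pair_count(n_proteins)
print(f"{n_proteins:,} sequences -> {pairs:,} pairwise comparisons")
```

Because the count grows with the square of the number of sequences, doubling the input roughly quadruples the work, which is why the task suits a large grid of volunteer computers.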


Project FAQs

* What is Fiocruz Genome Comparison?
* Why is the Genome Comparison Project comparing protein sequences?
* How are proteins compared in the Genome Comparison Project?
* What are the potential benefits of the Genome Comparison Project?
* How do I join the Genome Comparison Project?
* How does the Genome Comparison software work?
* What computers can run Genome Comparison?
* What do those circles, symbols and letters in the Genome Comparison agent application window mean?

What is Fiocruz Genome Comparison?

Fiocruz Genome Comparison is a project of the Bioinformatics Team at the Department of Biochemistry and Molecular Biology of Fiocruz. It uses distributed computing, drawing on your computer's idle resources, to calculate the level of sequence similarity across the whole protein content encoded in the completely sequenced genomes of hundreds of organisms, including humans and several other species of medical, commercial, industrial, or research importance. The calculated similarity indices will be used, together with the standardized Gene Ontology, as a reference repository for the annotator community, providing an invaluable data source for biologists.

Why is the Genome Comparison Project comparing protein sequences?

Only a fraction of the predicted protein content encoded in completely sequenced genomes has actually had its biological function and expression confirmed through laboratory analysis. The assignment of predicted biological functions and structural features to raw sequence data is called annotation, and is accomplished mostly by comparing the sequences with predicted proteins or protein-coding genes whose information is stored in public domain databases around the world. However, annotation is often incomplete, uses non-standardized nomenclature or can be incorrect when inferred from previous, incorrectly annotated sequences. Thus, a controlled all-against-all comparative database would be of great use as a reference.

How are proteins compared in the Genome Comparison Project?

Biological sequences (DNA, RNA, and proteins) are mostly compared in pairs through a process called pairwise sequence alignment, which consists of placing two sequences side by side in such a way that the number of identical positions between them is maximized. The sequences can be aligned globally (taking the whole sequences) or locally (taking parts of the sequences), depending on the context and the purpose. The sequence similarity comparison program used in the Genome Comparison Project is called SSEARCH (W.R. Pearson [1991] Genomics 11:635-650), a freely available implementation of the rigorous Smith-Waterman algorithm (T. F. Smith and M. S. Waterman [1981] J. Mol. Biol. 147:195-197), which finds the mathematically best local alignment between pairs of sequences. (An algorithm is an organized procedure for performing a given type of calculation or solving a given type of problem.)
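As a rough illustration of local alignment, here is a minimal Python sketch of the Smith-Waterman scoring recurrence. It is not SSEARCH and does not reproduce its scores: the simple match/mismatch values and the linear gap penalty are assumptions chosen for readability, whereas real protein comparisons rely on a substitution matrix and typically on a more refined gap model. The function name and the example sequences are hypothetical.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b.

    Simplified scoring (assumption): fixed match/mismatch values and a
    linear gap penalty, instead of a substitution matrix.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The shared stretch "MSTK" is found even though the flanking letters differ.
print(smith_waterman_score("XXMSTKYY", "QQMSTKWW"))
```

The zero in the `max` is what makes the alignment local: a poorly matching region never drags the score below zero, so the best-scoring shared stretch is found wherever it sits in the two sequences.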

What are the potential benefits of the Genome Comparison Project?

How do I join the Genome Comparison Project?

All you need to do to join Genome Comparison is download and install the free software provided by the World Community Grid. Once the software has been installed, your computer is then automatically set to work and you can continue using your computer as usual.

How does the Genome Comparison software work?

The software automatically downloads small pieces of data (predicted protein sequences) and performs sequence comparisons to accurately calculate the level of similarity among them. After your computer processes the information, the results are sent by World Community Grid to Fiocruz, where they are analyzed by the Bioinformatics Team at the Department of Biochemistry and Molecular Biology. Large-scale comparative analysis with the Smith-Waterman algorithm is computationally intensive and demands enormous computing power, which is why World Community Grid needs you (and your friends!) to participate in Genome Comparison.

What computers can run Genome Comparison?

Currently, the system requirements for this project are:

What do those circles, symbols and letters in the Genome Comparison agent application window mean?

The panel presented in the Genome Comparison agent application window represents the entities involved in the comparison process and a summary of the result achieved for a pair of them.

The small circles on the left side symbolize two different genes, pertaining to two distinct genomes or to a single genome. Inside of each circle we can see the unique number that identifies the predicted protein sequence encoded by the gene in the source database.

The large circle on the right side of the panel shows the corresponding protein sequences, their descriptions, and the abbreviated name of the similarity scores and their calculated values for this particular pair of sequences.

The protein sequences are represented by an ordered string of letters (as encoded in their respective genes). Each of those letters stands for a different amino acid (M for methionine, S for serine, and so on) in the protein.
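The letter notation can be made concrete with a small Python sketch that spells out a sequence using the standard one-letter codes for the 20 common amino acids. The helper name `spell_out` and the example sequence are hypothetical; any letter outside the 20 standard codes is simply reported as "unknown" here.

```python
# Standard one-letter codes for the 20 common amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "Q": "glutamine", "E": "glutamic acid", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def spell_out(sequence: str) -> list[str]:
    """Translate a string of one-letter codes into amino acid names."""
    return [AMINO_ACIDS.get(letter, "unknown") for letter in sequence.upper()]


# Hypothetical fragment, for illustration only.
print(spell_out("MSK"))
```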

Most protein sequences are hypothetical or putative, which means that their existence has been computationally predicted but their expression by the respective cell or organism has not yet been experimentally confirmed.

Overall, one can infer a low level of similarity between this particular pair of sequences, based on the values achieved for the computed similarity scores:

| Parameter | Value | Short description |
| --- | --- | --- |
| s-w | [91] | Smith-Waterman score. The raw score obtained for aligning two sequences, according to a particular substitution matrix |
| bits | [29.5] | Bit score. The normalized raw score |
| E() | [0.2] | Expected value or E-value. Represents the number of alignments with the same score or higher expected by chance |
| %_id | [0.304] | Fraction of identical positions for a given alignment |
| alen | [79] | Alignment length |
| an0 | [8] | Start position of the query sequence in the alignment |
| ax0 | [85] | End position of the query sequence in the alignment |
| an1 | [243] | Start position of the subject sequence in the alignment |
| ax1 | [321] | End position of the subject sequence in the alignment |
| gapq | [1] | Number of gaps introduced in the query during the alignment |
| gapl | [0] | Number of gaps introduced in the subject during the alignment |
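The example values can be cross-checked with a few lines of Python. The sketch below assumes that each gap occupies exactly one alignment column, so that the residues covered in a sequence plus the gaps inserted into it add up to the alignment length; that reading of the fields is an assumption drawn from the descriptions above, not a statement about the program's internals. The dictionary layout and function names are hypothetical.

```python
# Example values from the panel described above.
hit = {
    "s-w": 91, "bits": 29.5, "E()": 0.2, "%_id": 0.304, "alen": 79,
    "an0": 8, "ax0": 85, "an1": 243, "ax1": 321, "gapq": 1, "gapl": 0,
}


def spans_consistent(h: dict) -> bool:
    """Check that covered residues plus gaps equal the alignment length.

    Assumption: each counted gap occupies one alignment column.
    """
    query_span = h["ax0"] - h["an0"] + 1     # residues covered in the query
    subject_span = h["ax1"] - h["an1"] + 1   # residues covered in the subject
    return (query_span + h["gapq"] == h["alen"]
            and subject_span + h["gapl"] == h["alen"])


def identical_positions(h: dict) -> int:
    """Approximate count of identical positions in the alignment."""
    return round(h["%_id"] * h["alen"])


print(spans_consistent(hit), identical_positions(hit))
```

Under that assumption the figures agree with one another: the query covers 78 residues plus 1 gap, the subject covers 79 residues with no gaps, and about 24 of the 79 aligned positions are identical.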