The Genome Comparison Project

Project Progress

(updated August 2nd, 2007)

In the first phase of the project on the World Community Grid, more than 2.8 million protein sequences from 3,774 organisms, including viruses and more than 400 organisms whose complete genome sequence had been deciphered, were compared all-against-all. Most of those protein sequences had been predicted by computer analysis of the genetic code, which has been determined by many research groups since the sixties and deposited in public databases, together with their mostly putative functional annotation.

For the Genome Comparison analysis, sequences were grouped in blocks of 2,000 each, and more than 1 million block-to-block comparisons were required. Starting on December 20th, 2006, 4 million block comparisons were carried out (including redundancy and verification), and this phase was completed by March 31st, 2007.
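The relationship between dataset size, block size, and the number of block-to-block comparisons can be sketched with a little arithmetic. The following Python snippet is only an illustration: the function names are hypothetical, and it assumes that each pair of blocks is compared once (symmetric comparison) and that every block is also compared against itself. The project's actual scheduling may have differed.

```python
def block_count(n_sequences: int, block_size: int = 2000) -> int:
    """Number of blocks needed to hold n_sequences (last block may be partial)."""
    return (n_sequences + block_size - 1) // block_size


def block_pair_count(n_blocks: int, include_self: bool = True) -> int:
    """Number of unordered block pairs in an all-against-all comparison.

    Assumption: comparing block A with block B also covers B with A.
    With include_self=True, each block is additionally compared with itself.
    """
    if include_self:
        return n_blocks * (n_blocks + 1) // 2
    return n_blocks * (n_blocks - 1) // 2


# Illustrative input: exactly 2.8 million sequences (the text says "more than").
n_blocks = block_count(2_800_000)
pairs = block_pair_count(n_blocks)
print(n_blocks, pairs)
```

Under these assumptions, exactly 2.8 million sequences give 1,400 blocks and 980,700 block pairs. Because the dataset held somewhat more than 2.8 million sequences, a slightly larger block count (for example 1,415 blocks, giving 1,001,820 pairs) would be consistent with the "more than 1 million" figure quoted above.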

For the second phase of the project, the initial dataset was updated with newly published predicted protein sequences mostly from genomic data, adding 393,999 new sequences. Additionally, a fully curated reference dataset was added (SwissProt – with 254,609 protein sequences), contributing to controlled annotation and data cross-referencing. This part of the Genome Comparison finished on May 14th, 2007.

Finally, an experimental dataset of about 3 million potential protein sequences, derived from Open Reading Frames (ORFs) lacking a classical computational coding prediction, is now being analyzed. This is an attempt to discover additional non-classical coding patterns in genome sequences. This final phase of the project is expected to take an additional 4 months of World Community Grid processing.

The Genome Comparison Project (detailed statistics) is running concurrently with several other projects on the World Community Grid.

Statistics for the Genome Comparison Project (all phases) on August 2nd, 2007: