Bioinfo Helpdesk & On-line Training

Characteristics of Biological Data

Biological data possess many special characteristics that make biological data management challenging.

Biological data is highly complex compared with the data in most other applications. Definitions of such data must be able to represent a complex substructure of data as well as the relationships within it, and must ensure that no information is lost during biological data modeling. The data model must be able to represent any level of complexity in any data schema, relationship, or schema substructure, not just data in a hierarchical, binary, or tabular format. For example, the NCBI biological data model treats a biological sequence as a simple integer coordinate system with which diverse data can be associated. A wide range of data, such as the sequence of amino acids, is closely linked to that coordinate system (Ostell, Wheelan & Kans 2001).
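
The coordinate-system idea can be illustrated with a minimal Python sketch. This is not the NCBI data model itself: the class names (BioSequence, Feature), the zero-based half-open intervals, and the 12-base sequence are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Feature:
    """An annotation anchored to the half-open interval [start, end)."""
    start: int
    end: int
    kind: str
    note: str = ""


@dataclass
class BioSequence:
    """A sequence treated as an integer coordinate system (0 .. length-1)."""
    identifier: str
    residues: str
    features: List[Feature] = field(default_factory=list)

    def annotate(self, start: int, end: int, kind: str, note: str = "") -> Feature:
        # Reject intervals that do not fit on the sequence coordinates.
        if not 0 <= start < end <= len(self.residues):
            raise ValueError("interval falls outside the sequence coordinates")
        feature = Feature(start, end, kind, note)
        self.features.append(feature)
        return feature

    def features_at(self, position: int) -> List[Feature]:
        # Every kind of data is looked up the same way: by coordinate.
        return [f for f in self.features if f.start <= position < f.end]


# Made-up example sequence and annotations (hypothetical data).
seq = BioSequence("example_1", "ATGGCCAAGTAA")
seq.annotate(0, 12, "gene", "hypothetical gene")
seq.annotate(0, 3, "start_codon")
seq.annotate(9, 12, "stop_codon")

print([f.kind for f in seq.features_at(10)])  # ['gene', 'stop_codon']
```

Because every annotation refers only to integer positions, new kinds of data can be attached to a sequence without changing the sequence representation itself.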

The amount and range of variability in data is high. Thus, there must be flexibility in handling data types and values. Frequent exceptions to biological data structures may require a choice of data types for a given piece of data. Also, there is often overlap in the data types between different organisms and different genome projects.

Schemas in biological databases change at a rapid pace. Most relational and object database systems currently cannot extend a schema in place. As a result, nucleotide sequence databases such as GenBank re-release the entire database with new schemas (currently NCBI-GenBank Flat File Release 123.0 of April 15, 2001) instead of changing the system incrementally as changes become necessary, though this may seem transparent to the user.

Different biologists will most likely represent the same data differently, even when using the same system. Given the complexity of biological data, there are many ways to model any given entity, and the result generally depends on the emphasis of the designer. Although two individuals may produce different data models when asked to interpret the same entity, the models will have numerous points in common. It is extremely useful for scientists to be able to run queries across these common points to understand the connections between seemingly unrelated concepts. This can be accomplished by linking data elements in a network of schemas.

Most biologists will not know or care about the data structure or the schema design.

Thus, the interface to the biological database or resource should display information to the user in a manner that is appropriate for the problem being addressed and that reflects the underlying data structures. That is, database access should be through a transparent, intuitive user interface. A transparent system hides the implementation details from the users (e.g. in a centralized system the data storage system is shielded from the user, while in a distributed approach both the data and the network are shielded from the user). A fully transparent system, whether centralized or distributed, provides a high level of support for the development of complex applications, which is especially necessary for biological data. An intuitive system reduces the demand for documentation and the time needed to learn it. For the casual or general user, the view into the database is often a web interface, usually with preset search forms; these may limit access to the database, since only a finite number of form combinations exists. For the power user, a command-line interface often provides additional combinations of queries, especially in conjunction with a scripting language. It is generally accepted that 95% of the users use 5% of the computing resources (usually via a web interface), while 5% of the users use 95% of the computing resources (usually via a command-line interface).

The context of data provides additional meaning for its use in biological applications. Integrating more contexts provides a richer interpretation of a biological data value, whereas isolated data values are meaningless in biological systems. For example, "145" is meaningless without units and context: it could be gene 145, a gene beginning at position 145, a gene ending at position 145, taxonomy identification 145, a count of 145 bases, etc.
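
One way to keep a value tied to its context is to store the two together. The following Python sketch is only an illustration; the names Meaning and ContextualValue are hypothetical and do not come from any existing database schema.

```python
from dataclasses import dataclass
from enum import Enum


class Meaning(Enum):
    """Possible interpretations of a bare integer (illustrative list only)."""
    GENE_ID = "gene identifier"
    START_POSITION = "gene start position"
    END_POSITION = "gene end position"
    TAXONOMY_ID = "taxonomy identifier"
    LENGTH_IN_BASES = "number of bases"


@dataclass(frozen=True)
class ContextualValue:
    """An integer that always travels with its interpretation."""
    value: int
    meaning: Meaning

    def describe(self) -> str:
        return f"{self.meaning.value}: {self.value}"


# The same integer under each interpretation.
readings = [ContextualValue(145, m) for m in Meaning]
for reading in readings:
    print(reading.describe())

# Equal integers with different meanings do not compare equal.
same = ContextualValue(145, Meaning.GENE_ID) == ContextualValue(145, Meaning.TAXONOMY_ID)
print(same)  # False
```

Carrying the meaning with the value prevents, for instance, a taxonomy identifier from being silently compared with a sequence position.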

Defining and representing complex queries is extremely important to the biologist. Complex queries must be possible without knowledge of the data structure. The average user will NOT be able to construct complex queries unaided, so a tool for building these queries is necessary beyond predefined query templates.
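
A query-building tool of this kind could let users combine simple conditions into complex ones without seeing the storage layout. The Python sketch below shows one possible shape, using composable predicates over in-memory records. The function names and the three records are invented for illustration; a real tool would translate such conditions into the query language of the underlying database.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Predicate = Callable[[Record], bool]


def field_equals(name: str, expected: object) -> Predicate:
    """Condition: the named field equals the expected value."""
    return lambda record: record.get(name) == expected


def field_between(name: str, low: int, high: int) -> Predicate:
    """Condition: the named integer field lies in [low, high]."""
    def check(record: Record) -> bool:
        value = record.get(name)
        return isinstance(value, int) and low <= value <= high
    return check


def all_of(*predicates: Predicate) -> Predicate:
    return lambda record: all(p(record) for p in predicates)


def any_of(*predicates: Predicate) -> Predicate:
    return lambda record: any(p(record) for p in predicates)


def negate(predicate: Predicate) -> Predicate:
    return lambda record: not predicate(record)


def run_query(records: List[Record], predicate: Predicate) -> List[Record]:
    return [r for r in records if predicate(r)]


# Made-up records standing in for database entries.
records = [
    {"name": "geneA", "organism": "E. coli", "length": 900},
    {"name": "geneB", "organism": "E. coli", "length": 2400},
    {"name": "geneC", "organism": "S. cerevisiae", "length": 1500},
]

# "E. coli genes whose length is NOT between 1000 and 3000 bases"
query = all_of(field_equals("organism", "E. coli"),
               negate(field_between("length", 1000, 3000)))
print([r["name"] for r in run_query(records, query)])  # ['geneA']
```

Each building block corresponds to something a form or dialog could expose, so the combinations are not limited to a fixed set of preset templates.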

Users of biological information often require access to previous versions of existing data. For example, GenBank uses the Accession.version number for protein sequences within the flat file records. The version also corresponds to a gi number representing a specific sequence. The Accession remains the same during updates, but every updated protein sequence receives a new gi. GenBank is a nucleotide-centric public database, but proteins are the translated products of nucleotides.
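
The versioning scheme described above can be sketched in Python as follows. This is an illustrative model, not GenBank software: the class names, the accession "XX000001", and the gi values are made up, and the sketch assumes that the version counter starts at 1 and increases by one with each sequence update.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SequenceVersion:
    accession: str   # stays the same across updates
    version: int     # assumed to increase by one per update
    gi: int          # a new gi for every updated sequence
    residues: str

    @property
    def accession_version(self) -> str:
        return f"{self.accession}.{self.version}"


def parse_accession_version(text: str) -> Tuple[str, int]:
    """Split 'ACCESSION.version' into its two parts."""
    accession, sep, version = text.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an Accession.version identifier: {text!r}")
    return accession, int(version)


class VersionHistory:
    """Keeps every version of every sequence so old ones stay retrievable."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[SequenceVersion]] = {}

    def update(self, accession: str, gi: int, residues: str) -> SequenceVersion:
        history = self._versions.setdefault(accession, [])
        record = SequenceVersion(accession, len(history) + 1, gi, residues)
        history.append(record)
        return record

    def get(self, accession_version: str) -> SequenceVersion:
        accession, version = parse_accession_version(accession_version)
        history = self._versions.get(accession, [])
        if not 1 <= version <= len(history):
            raise KeyError(accession_version)
        return history[version - 1]

    def latest(self, accession: str) -> SequenceVersion:
        return self._versions[accession][-1]


# Hypothetical accession and gi numbers, for illustration only.
history = VersionHistory()
history.update("XX000001", gi=1001, residues="MAK")
history.update("XX000001", gi=1002, residues="MAKL")

print(history.get("XX000001.1").residues)             # MAK
print(history.latest("XX000001").accession_version)   # XX000001.2
```

Keeping the full list per accession, rather than overwriting a record on update, is what makes previous versions available to users.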

All of these characteristics reveal that today's DBMSs do not fully meet the requirements of complex biological data and that further research in database management is necessary (Commission on Physical Sciences, Mathematics, and Applications 1999; Elmasri & Navathe 2000).

References:

Ostell, J.M., Wheelan, S.J. and Kans, J.A. 2001. The NCBI Data Model. In Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd Edition. John Wiley & Sons Publishing. ISBN: 0471383910. pp. 19-44.

Elmasri, R.A. and Navathe, S.B. 2000. Fundamentals of Database Systems, 3rd Edition. Addison-Wesley Publishing. ISBN: 0805317554.

Genetic Sequence Data Bank. April 15, 2001. NCBI-GenBank Flat File Release 123.0 Distribution Release Notes. ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt