Biological data possess many special characteristics that make their
management challenging.
Biological data are highly complex compared with the data in most other
applications. Definitions of such data must be able to represent a complex
substructure of data as well as the relationships within it, and must ensure
that no information is lost during biological data modeling. The data model
must be able to represent any level of complexity in any data schema,
relationship, or schema substructure, and not just in a hierarchical, binary,
or tabular data format. For example, the NCBI biological data model treats a
biological sequence as a simple integer coordinate system with which diverse
data can be associated. A wide range of data, such as the sequence of amino
acids, is closely linked to this coordinate system (Ostell, Wheelan & Kans
2001).
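The coordinate-system idea can be illustrated with a minimal Python sketch. It is not the NCBI data model itself: the class names, the 0-based half-open intervals, and the example annotations are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A piece of data attached to an interval of sequence coordinates."""
    start: int  # 0-based, inclusive (a convention chosen for this sketch)
    end: int    # 0-based, exclusive
    kind: str
    value: str


@dataclass
class CoordinateSequence:
    """A sequence viewed only as the integer coordinates 0 .. length - 1."""
    identifier: str
    length: int
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, kind, value):
        """Attach a new piece of data to the interval [start, end)."""
        if not (0 <= start < end <= self.length):
            raise ValueError("interval lies outside the coordinate system")
        self.annotations.append(Annotation(start, end, kind, value))

    def at(self, position):
        """Return every annotation whose interval covers the position."""
        return [a for a in self.annotations if a.start <= position < a.end]


# Hypothetical sequence and annotations, invented for this example.
seq = CoordinateSequence(identifier="example_seq", length=300)
seq.annotate(0, 300, "source", "hypothetical organism")
seq.annotate(30, 120, "coding_region", "hypothetical gene")
seq.annotate(30, 120, "protein_product", "example_protein")

kinds_at_50 = [a.kind for a in seq.at(50)]
kinds_at_200 = [a.kind for a in seq.at(200)]
```

Because every annotation refers only to integer positions, new kinds of data can be attached without changing the representation of the sequence.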
The amount and range of variability in the data are high. Thus, there must be
flexibility in handling data types and values. Frequent exceptions to
biological data structures may require a choice of data types for a
given piece of data. Also, there is often overlap in the data types
between the different organisms and the different genome projects.
Schemas in biological databases change at a rapid pace. It is currently not
possible in most relational and object database systems to extend the
schema. As a result, nucleotide sequence databases such as GenBank re-release
the entire database with new schemas (currently NCBI-GenBank Flat File
Release 123.0 of April 15, 2001) instead of incrementally changing the system
as changes become necessary, although this may seem transparent to the
user.
Representations of the same data by different biologists will most likely be
different (even when using the same system). Given the complexity of
biological data, there are many ways to model any given entity, and the
results generally depend upon the emphasis of the designer. Although two
individuals may produce different data models if asked to interpret the same
entity, the models will have numerous points in common. It is extremely
useful for scientists to be able to run queries across these common points
to understand the connection between seemingly unrelated concepts. This
can be accomplished by linking data elements in a network of schemas.
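As a rough illustration of linking data elements across schemas, the sketch below takes two hypothetical records that describe the same entity under different field names, together with an assumed hand-curated mapping between equivalent fields. The field names, values, and helper functions are all invented for this example.

```python
# Two hypothetical records for the same entity, modeled by different designers.
model_a = {"gene_symbol": "geneX", "organism": "Example organism", "chromosome": "2"}
model_b = {"locus_name": "geneX", "species": "Example organism", "expression_tissue": "leaf"}

# Assumed, hand-curated links between equivalent elements of the two schemas.
schema_links = {"gene_symbol": "locus_name", "organism": "species"}


def common_points(a, b, links):
    """Return the linked field pairs whose values agree in both records."""
    return {
        (field_a, field_b): a[field_a]
        for field_a, field_b in links.items()
        if field_a in a and field_b in b and a[field_a] == b[field_b]
    }


def merged_view(a, b, links):
    """Combine both records, keeping model A's names for the linked fields."""
    linked_in_b = set(links.values())
    view = dict(a)
    view.update({k: v for k, v in b.items() if k not in linked_in_b})
    return view


shared = common_points(model_a, model_b, schema_links)
combined = merged_view(model_a, model_b, schema_links)
```

Once the common points are known, a query can reach data that only one of the models records, here the chromosome from the first model and the expression tissue from the second.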
Most biologists will not know or care about the data structure or the schema
design. Thus, the interface to the biological database or resource should
display information to the user in a manner that is appropriate for the
problem being addressed and that reflects the underlying data structures.
That is, database access should be provided through a transparent, intuitive
user interface. A transparent system hides the implementation details from
the users (e.g. in a centralized system the data storage system is shielded
from the user, while in a distributed approach both the data and the network
are shielded from the user). A fully transparent system (whether centralized
or distributed) provides a high level of support for the development of
complex applications, which is especially necessary for biological data. An
intuitive system reduces the demand for documentation and the time needed to
learn it. Often the view into the database is a web interface (for the
casual or general user), usually with preset search forms, which may limit
access to the database because only a finite number of form combinations
exists. A command-line interface (for the power user) often provides
additional combinations of queries, especially in conjunction with a
scripting language. A common rule of thumb holds that 95% of the users use
5% of the computing resources (usually via a web interface) while 5% of the
users use 95% of the computing resources (usually via a command-line
interface).
The context of data provides additional meaning for its use in biological
applications. The more contexts that are integrated, the richer the
interpretation of a biological data value; isolated data values, however,
are meaningless in biological systems. For example, "145" is meaningless
without units and context: it could denote gene 145, a gene that begins at
position 145, a gene that ends at position 145, taxonomy identification 145,
a length of 145 bases, etc.
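One way to keep a value together with its context is to store both in a single record, as in the following minimal sketch. The `Context` categories and the `ContextualValue` class are hypothetical names chosen for illustration, not part of any existing database schema.

```python
from dataclasses import dataclass
from enum import Enum


class Context(Enum):
    """Hypothetical contexts in which the bare number 145 might appear."""
    GENE_ID = "gene identifier"
    START_POSITION = "gene start position"
    END_POSITION = "gene end position"
    TAXONOMY_ID = "taxonomy identifier"
    LENGTH = "sequence length"


@dataclass(frozen=True)
class ContextualValue:
    """A value that is never stored without its context and optional unit."""
    value: int
    context: Context
    unit: str = ""

    def describe(self):
        unit = f" {self.unit}" if self.unit else ""
        return f"{self.context.value}: {self.value}{unit}"


start = ContextualValue(145, Context.START_POSITION)
length = ContextualValue(145, Context.LENGTH, unit="bases")

same_number = start.value == length.value  # True: both hold 145
same_meaning = start == length             # False: the contexts differ
```

The two records hold the same integer, yet they compare as unequal because equality takes the context into account.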
Defining and representing complex queries is extremely important to the
biologist. Complex queries must be available without knowledge of the data
structure. The average user will not be able to construct complex queries
unaided, so a tool for building these queries will be necessary beyond
predefined query templates.
Users of biological information often require access to previous versions
of existing data. For example, GenBank uses the Accession.version number
for protein sequences within the flat file records. The version also
corresponds to a gi number representing a specific sequence. The Accession
remains the same during updates, but every updated protein sequence
receives a new gi. GenBank is a nucleotide-centric public database, but
proteins are translated products of nucleotides.
All of these characteristics reveal that today's DBMSs do not fully meet
the requirements of complex biological data and that further research in
database management is necessary (Commission on Physical Sciences,
Mathematics, and Applications 1999; Elmasri & Navathe 2000).
References:
Ostell, J.M., S.J. Wheelan, and J.A. Kans. 2001. The NCBI Data Model. In
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins,
2nd Edition. John Wiley & Sons Publishing. ISBN: 0471383910. pp. 19-44.
Elmasri, R.A. and S.B. Navathe. 2000. Fundamentals of Database Systems,
3rd Edition. Addison-Wesley Publishing. ISBN: 0805317554.
Genetic Sequence Data Bank. April 15, 2001. NCBI-GenBank Flat File Release
123.0 Distribution Release Notes. ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt