Lange, Matthias(1);Köhler, Jacob(2);Schweizer, Patrick(1) - Algebraic Concepts for Data Domains in Life Science

ECCB 2002 Poster

Title: Algebraic Concepts for Data Domains in Life Science	P88
Lange, Matthias(1); Köhler, Jacob(2); Schweizer, Patrick(1)
lange@ipk-gatersleben.de
1: Institute of Plant Genetics and Crop Plant Research (IPK)
2: Bioinformatics / Medical Informatics, Technical Faculty, Bielefeld

Introduction
Relational databases use generic domains, i.e. datatypes [3] such as String, Integer or Boolean, although the data actually stored use more specific vocabularies (enzyme number, systematic species name, English species name, CAS registry number etc.). Relational DBMS provide operators and functions for String comparison and conversion. However, EC numbers, for example, form only a subset of the datatype String, and comparison operators specific to EC numbers are missing. It might be useful to compare enzymes by their subclass, but this is not possible with generic string operators such as <, >, = etc.
Therefore, in SEMEDA [2], the actual vocabulary of a database attribute can be defined in addition to its generic domain (String, Integer etc.). Based on these definitions, vocabulary-specific functions and operators could be defined for comparison and for conversion between vocabularies.
In the following, the term domain refers to the generic domains of relational databases, whereas the term vocabulary refers to the actual vocabulary used. Vocabularies are thus subsets of domains.

Concepts of Data Domain Algebra

Formally, a vocabulary Vn can be defined as a finite or infinite set of data elements E. Thus, the overall set V, which includes all data elements, can be defined as:

[formula1.gif]

Furthermore, a bijective function dom is defined, which assigns a data element to a vocabulary: [formula2.gif]

Vocabularies may overlap. They can be arranged in specialization hierarchies, where the roots are generic data domains as used in relational databases, and the vocabularies are defined as sub-domains of the database domains.
To work with these specialized vocabularies, relations, operators and convert functions are defined:

-relations: [formula3.gif]

-operators: [formula4.gif]

-convert functions: [formula5.gif]

Thus, for each vocabulary an algebraic structure S can be given, which also includes a function: [formula6.gif]

(For the formulas see web representation of the poster abstract)
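
Since the formula images are not reproduced here, the following is only one possible reading that is consistent with the surrounding prose. It is not the authors' notation: the symbols r, o, c, R_n, O_n, C_n, the index m and the binary arity of relations and operators are assumptions made for illustration, and the function dom is left out because its exact definition cannot be recovered from the text.

```latex
\begin{align*}
V   &= \bigcup_{n} V_n, \qquad V_n = \{E_1, E_2, \dots\} \\
r   &\subseteq V_n \times V_n      && \text{(a relation on a vocabulary)} \\
o   &: V_n \times V_n \to V_n      && \text{(an operator on a vocabulary)} \\
c   &: V_n \to V_m                 && \text{(a convert function between vocabularies)} \\
S_n &= (V_n, R_n, O_n, C_n)        && \text{(vocabulary with its relations, operators and convert functions)}
\end{align*}
```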

S applies to a vocabulary, and all sub-vocabularies inherit the relations, operators or convert functions of their super-vocabularies, unless the sub-vocabulary redefines an element of the parent vocabulary (polymorphism). In addition, vocabulary specific relations, operators and functions may be defined.
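
The inheritance and redefinition rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SEMEDA implementation: the class Vocabulary, its resolve method, the member names "equals" and "length", and the normalization applied to EC numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Vocabulary:
    """A node in a vocabulary specialization hierarchy (hypothetical model)."""
    name: str
    parent: Optional["Vocabulary"] = None
    # Relations, operators and convert functions defined directly on this node.
    members: Dict[str, Callable] = field(default_factory=dict)

    def resolve(self, member: str) -> Callable:
        """Look up a member here first, then walk up the super-vocabularies."""
        node = self
        while node is not None:
            if member in node.members:
                return node.members[member]
            node = node.parent
        raise KeyError(f"{member!r} is not defined for vocabulary {self.name!r}")


def normalize_ec(text: str) -> str:
    """Strip whitespace and an optional leading 'EC' (assumed convention)."""
    text = text.strip()
    if text.upper().startswith("EC"):
        text = text[2:]
    return text.strip()


def ec_equals(a: str, b: str) -> bool:
    """Vocabulary-specific equality that ignores the optional 'EC' prefix."""
    return normalize_ec(a) == normalize_ec(b)


# Root: the generic database domain String with plain string equality.
string = Vocabulary("String", members={"equals": lambda a, b: a == b,
                                        "length": len})

# Sub-vocabulary: redefines "equals" but inherits "length" from String.
ec_number = Vocabulary("ECNumber", parent=string,
                       members={"equals": ec_equals})
```

Here ec_number.resolve("equals") returns the redefined comparison, while ec_number.resolve("length") falls through to the member inherited from String.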

Discussion
For example, the most general data domain is the simple String. At the next level, numbers, EC numbers, enzymatic reactions, substance names etc. might be defined, where the children of numbers are integers, floating point numbers etc.
An example usage for relations is a comparison operation that compares EC numbers by subclass or compares enzyme-catalyzed reactions given in different data formats.
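
A relation of this kind might look as follows in Python. The sketch assumes that an EC number is written as four dot-separated fields and that "same subclass" means agreement in the first two fields; the function names parse_ec and same_subclass are hypothetical.

```python
from typing import Tuple


def parse_ec(ec: str) -> Tuple[str, ...]:
    """Split an EC number such as '1.1.1.1' into its four fields."""
    parts = tuple(ec.strip().split("."))
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return parts


def same_subclass(a: str, b: str) -> bool:
    """Relation: True if both EC numbers share class and subclass."""
    return parse_ec(a)[:2] == parse_ec(b)[:2]


# Generic string comparison does not respect the numeric fields:
# as plain strings, "1.10.1.1" sorts before "1.2.1.1".
lexical_misorder = "1.10.1.1" < "1.2.1.1"
```

The last line illustrates why the generic operators are insufficient: subclass 10 is ordered before subclass 2 when the values are treated as plain strings.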
Example usages for operators are the semantic merging of literature abstracts or the clustering of EST fragments to build consensus sequences.
Example usages for convert functions are the conversion of a DNA sequence from GenBank format to FASTA format, the conversion of temperature from °F to °C, the conversion of concentrations of chemical substances etc. Convert functions are especially useful for transferring data between databases, which often use different formats for equivalent data.
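
As a minimal sketch of convert functions, the temperature example can be written in Python with a small lookup table keyed by source and target vocabulary. The registry CONVERTERS, the decorator converter, the function convert and the vocabulary names "Fahrenheit" and "Celsius" are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Maps (source vocabulary, target vocabulary) to a convert function.
CONVERTERS: Dict[Tuple[str, str], Callable[[float], float]] = {}


def converter(source: str, target: str):
    """Decorator that registers a convert function for a vocabulary pair."""
    def register(func):
        CONVERTERS[(source, target)] = func
        return func
    return register


@converter("Fahrenheit", "Celsius")
def fahrenheit_to_celsius(value: float) -> float:
    return (value - 32.0) * 5.0 / 9.0


@converter("Celsius", "Fahrenheit")
def celsius_to_fahrenheit(value: float) -> float:
    return value * 9.0 / 5.0 + 32.0


def convert(value: float, source: str, target: str) -> float:
    """Apply the registered convert function, if one exists for the pair."""
    try:
        func = CONVERTERS[(source, target)]
    except KeyError:
        raise LookupError(f"no convert function from {source} to {target}") from None
    return func(value)
```

Keying the table by vocabulary pair keeps each conversion explicit, so a missing conversion is reported as an error rather than silently treating the value as a generic number.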
In contrast to OODBMS, relational DBMS do not allow new domains to be defined. Date [1] criticizes this, since it is not an inherent property of relational databases [3] but rather a result of how relational DBMS were implemented. The algebraic definitions introduced here could therefore extend relational life science databases so that they support more appropriate and powerful data queries. They could also support the task of data integration: approaches such as the one described in [4] could use these principles to integrate data and discover new relationships.
Two possibilities for the implementation of the suggested data structure exist. On the one hand, the data structure might be implemented on its own, outside a DBMS. On the other hand, it is possible to implement it within a relational DBMS. Many relational DBMSs such as Postgres and Oracle allow user-defined functions to be implemented within the DBMS, using proprietary built-in scripting languages or external native libraries. Thus it only remains to model the tree structure of the vocabularies and to define within this tree which operators and functions can be applied to which vocabulary.
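
The second possibility can be illustrated with a user-defined function registered inside a relational DBMS. The sketch below uses SQLite through Python's standard sqlite3 module purely because it is self-contained; it stands in for the Postgres or Oracle mechanisms mentioned above and is not the authors' implementation. The table enzyme, its rows and the function ec_subclass are hypothetical.

```python
import sqlite3


def ec_subclass(ec):
    """Return the 'class.subclass' prefix of an EC number, or None if malformed."""
    if ec is None:
        return None
    parts = ec.strip().split(".")
    if len(parts) != 4:
        return None
    return ".".join(parts[:2])


conn = sqlite3.connect(":memory:")
# Register the vocabulary-specific function so that SQL queries can call it.
conn.create_function("ec_subclass", 1, ec_subclass)

conn.execute("CREATE TABLE enzyme (name TEXT, ec TEXT)")
conn.executemany(
    "INSERT INTO enzyme VALUES (?, ?)",
    [("enzyme_a", "1.1.1.1"), ("enzyme_b", "1.1.3.4"), ("enzyme_c", "2.7.1.1")],
)

# Select all enzymes in the same subclass as a given EC number.
rows = conn.execute(
    "SELECT name FROM enzyme "
    "WHERE ec_subclass(ec) = ec_subclass(?) ORDER BY name",
    ("1.1.1.27",),
).fetchall()
```

The query compares values by subclass inside the database, which is the kind of vocabulary-specific comparison that generic string operators cannot express.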

[1] Date, C. J. 2000. An Introduction to Database Systems, 7th edition. Addison-Wesley, Reading, Mass.
[2] Köhler, J. and S. Schulze-Kremer. 2001. The Semantic Metadatabase (SEMEDA): Ontology Based Integration of Federated Molecular Biological Data Sources. In GCB, German Conference on Bioinformatics.
[3] Louis, G. and A. Pirotte. 1982. A Denotational Definition of the Semantics of DRC, A Domain Relational Calculus. Pages 348-356 in Eighth International Conference on Very Large Data Bases. Morgan Kaufmann, Mexico City, Mexico.
[4] Freier, A., R. Hofestädt, M. Lange, U. Scholz and A. Stephanik. 2002. BioDataServer: A SQL-based service for the online integration of life science data. In Silico Biology, 2(0005).