ECCB 2002 Poster

Title: GeneRule: A Tool for Extraction and Validation of Knowledge for Rule-Based Expert Systems
P107
Monossov, Vladimir; Guthke, Reinhard

Monossov@pmail.hki-jena.de, rguthke@pmail.hki-jena.de
Hans Knöll Institute for Natural Products Research, Beutenbergstr. 11, D-07745 Jena, Germany

GeneRule is an interactive data processing system developed for building knowledge bases from databases with missing and faulty values, e.g. from DNA microarrays. GeneRule is implemented purely in Java 2.
The input data for the system consists of a learning set and a testing set. The objects in these sets must be divided into two classes, e.g. malignant and normal cells. Each object is described by attribute values. The information on some objects may be incomplete, i.e. some attribute values may be missing. From these sets, the system generates rules that assign an object to one of the two classes. The rules have the form "IF C THEN A", where C is a conjunction of attribute values and A is the conclusion.
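The rule form "IF C THEN A" can be illustrated with a minimal Java sketch. It is not taken from GeneRule itself: the class name RuleSketch, the attribute names geneA and geneB, and the choice that a missing value never satisfies a condition are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class RuleSketch {

    /** A rule "IF C THEN A": C is a conjunction of attribute = value conditions. */
    static final class Rule {
        final Map<String, String> conditions; // the conjunction C
        final String conclusion;              // the conclusion A (class label)

        Rule(Map<String, String> conditions, String conclusion) {
            this.conditions = conditions;
            this.conclusion = conclusion;
        }

        /**
         * Returns true only if the object satisfies every condition.
         * Assumption: a missing value (absent key or null) never satisfies a condition.
         */
        boolean matches(Map<String, String> object) {
            for (Map.Entry<String, String> c : conditions.entrySet()) {
                String value = object.get(c.getKey());
                if (value == null || !value.equals(c.getValue())) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conditions = new HashMap<>();
        conditions.put("geneA", "high"); // hypothetical attribute names
        conditions.put("geneB", "low");
        Rule rule = new Rule(conditions, "malignant");

        Map<String, String> complete = new HashMap<>();
        complete.put("geneA", "high");
        complete.put("geneB", "low");

        Map<String, String> incomplete = new HashMap<>();
        incomplete.put("geneA", "high"); // geneB is missing

        System.out.println(rule.matches(complete));   // true
        System.out.println(rule.matches(incomplete)); // false
    }
}
```

Treating a missing value as "condition not satisfied" is the conservative choice; the abstract does not state how GeneRule itself handles this case.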
The program consists of the DataReader, RulesGenerator, RulesReader, ControlRulesBlock and ControlDataBlock, as shown in Figure 1. DataReader checks the input data set and transforms it into encoded data sets with discrete feature values. RulesGenerator builds the decision rules as a disjunctive normal form (DNF). Each conjunction in the DNF can be presented by RulesReader in a pattern of the professional (domain) language and can be edited with a text processor. Some (wrong) rules can be deleted, and the processed DNF can later be used again by GeneRule. The performance of the DNF rules can be tested on the data sets by ControlRulesBlock. Low-efficiency rules can be excluded from the DNF automatically. The recognition of the objects of the data set can be tested by ControlDataBlock.
GeneRule was successfully tested on an artificial data set of 100 objects of two classes x 50000 values x 10 attributes of values (processing took 6-8 hours).
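The automatic exclusion of low-efficiency rules can be sketched as follows. This is only an illustration under stated assumptions, not the ControlRulesBlock implementation: the integer encoding, the MISSING code, the definition of efficiency as the fraction of target-class objects a rule recognizes, and the threshold parameter minEfficiency are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class RulePruningSketch {
    static final int MISSING = -1; // assumed code for a missing attribute value

    /** A rule is a conjunction of {featureIndex, requiredValue} pairs. */
    static boolean matches(int[][] rule, int[] object) {
        for (int[] condition : rule) {
            int value = object[condition[0]];
            if (value == MISSING || value != condition[1]) {
                return false;
            }
        }
        return true;
    }

    /** Fraction of the given objects that the rule recognizes (assumed efficiency measure). */
    static double efficiency(int[][] rule, int[][] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (int[] object : data) {
            if (matches(rule, object)) {
                hits++;
            }
        }
        return (double) hits / data.length;
    }

    /** Keeps only the rules whose efficiency reaches the threshold. */
    static List<int[][]> prune(List<int[][]> rules, int[][] data, double minEfficiency) {
        List<int[][]> kept = new ArrayList<>();
        for (int[][] rule : rules) {
            if (efficiency(rule, data) >= minEfficiency) {
                kept.add(rule);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Four encoded objects of the target class, two features each.
        int[][] data = { {1, 0}, {1, 1}, {0, 1}, {1, MISSING} };
        int[][] ruleA = { {0, 1} };         // feature 0 == 1
        int[][] ruleB = { {0, 1}, {1, 1} }; // feature 0 == 1 AND feature 1 == 1

        List<int[][]> rules = new ArrayList<>();
        rules.add(ruleA);
        rules.add(ruleB);

        System.out.println(efficiency(ruleA, data));        // 0.75
        System.out.println(efficiency(ruleB, data));        // 0.25
        System.out.println(prune(rules, data, 0.5).size()); // 1
    }
}
```

A fuller efficiency measure would presumably also penalize rules that fire on objects of the other class; that part is left out here for brevity.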

Fig. 1 Flowchart of the program GeneRule (for the figure see web representation of the poster abstract)

The performance of this tool is illustrated by the classification of AML and ALL leukaemia (AML: acute myeloid leukaemia; ALL: acute lymphoblastic leukaemia) based on gene expression monitoring with Affymetrix microarrays, using data from http://www-genome.wi.mit.edu/cancer. The microarrays contained probes for 6817 human genes. The data set consisted of a learning set of 38 leukaemia samples (27 ALL and 11 AML) and an examination set of 34 leukaemia samples (20 ALL and 14 AML).
In the first stage of the data processing, the microarray data were transformed by DataReader into logic (high/low) form. In the second stage, the decision rules for AML and ALL were created by RulesGenerator (20124 decision rules for ALL and 235875 for AML). This took about 3 hours of computer time (Athlon 1800, 2 GB memory). In the third stage, the decision rules were checked by ControlRulesBlock on the examination set, and the best rules (7 for ALL and 55 for AML) were selected and included in the knowledge base. The conjunction range, i.e. the number of values in a rule, was 2 to 9.
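The first stage, the transformation of expression values into high/low form, can be sketched in Java as shown below. The abstract does not say how DataReader chooses its cut-off, so the threshold parameter, the use of NaN to mark a missing measurement, and the names HighLowSketch, Level and encode are assumptions for illustration only.

```java
import java.util.Arrays;

public class HighLowSketch {
    enum Level { HIGH, LOW, MISSING }

    /**
     * Encodes expression values as HIGH or LOW relative to a threshold.
     * Assumption: a missing measurement is represented as NaN in the input.
     */
    static Level[] encode(double[] expression, double threshold) {
        Level[] encoded = new Level[expression.length];
        for (int i = 0; i < expression.length; i++) {
            if (Double.isNaN(expression[i])) {
                encoded[i] = Level.MISSING;
            } else if (expression[i] >= threshold) {
                encoded[i] = Level.HIGH;
            } else {
                encoded[i] = Level.LOW;
            }
        }
        return encoded;
    }

    public static void main(String[] args) {
        double[] values = { 120.0, 30.0, Double.NaN }; // made-up expression values
        System.out.println(Arrays.toString(encode(values, 100.0)));
        // prints [HIGH, LOW, MISSING]
    }
}
```

In practice the threshold would likely be chosen per gene, for example from the learning set, rather than fixed globally.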
Each decision rule for ALL recognizes at least 55% of the ALL samples in the examination set and 77% in the training set. For the AML rules, the corresponding figures are 35% in the examination set and 81% in the training set. Taken together, the rules of the complete knowledge base recognize all ALL and AML samples without errors. With this tool, the user can revise the rule base in order to turn it into a knowledge base.
The next version of the tool is intended to support automatic rule processing using existing knowledge bases.
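How rules that each recognize only part of a class can together recognize every sample follows from the disjunctive form: a sample counts as recognized if at least one rule fires on it. The sketch below shows this with a made-up rule-by-sample matrix; the class name CoverageSketch, the method names and the data are hypothetical and do not reproduce the leukaemia results.

```java
public class CoverageSketch {

    /** Fraction of samples recognized by a single rule. */
    static double ruleRate(boolean[] firesForRule) {
        if (firesForRule.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (boolean fired : firesForRule) {
            if (fired) {
                hits++;
            }
        }
        return (double) hits / firesForRule.length;
    }

    /** Fraction of samples recognized by at least one rule (the disjunction of all rules). */
    static double unionRate(boolean[][] fires) {
        if (fires.length == 0 || fires[0].length == 0) {
            return 0.0;
        }
        int samples = fires[0].length;
        int covered = 0;
        for (int s = 0; s < samples; s++) {
            for (boolean[] rule : fires) {
                if (rule[s]) {
                    covered++;
                    break; // one firing rule is enough for this sample
                }
            }
        }
        return (double) covered / samples;
    }

    public static void main(String[] args) {
        // fires[r][s] is true if rule r recognizes sample s (made-up values).
        boolean[][] fires = {
            { true,  true,  false, false },
            { false, false, true,  false },
            { false, true,  false, true  }
        };
        System.out.println(ruleRate(fires[0])); // 0.5
        System.out.println(ruleRate(fires[1])); // 0.25
        System.out.println(unionRate(fires));   // 1.0
    }
}
```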