Reasoning in uncertain knowledge bases with soft and hard rules
Despite the vast improvements in the field of information extraction in recent years, the very nature of the applied extraction techniques entails that the resulting knowledge bases may exhibit a high degree of uncertainty or even inconsistency. Often, these inconsistencies are not obvious even to a human and can only be uncovered through intricate and potentially expensive inference steps. In large RDF knowledge bases with millions of facts, an automated search for inconsistencies can be accomplished only through sophisticated, logic-based inference techniques, and an overly eager removal of inconsistencies can even lead to the loss of correct data. Automatic reasoning in large knowledge bases therefore requires high robustness with respect to incomplete, imprecise, or even inconsistent data, which we summarize here under the term uncertain data. While traditional query processing techniques for RDF (based on the SPARQL query language) are limited to deterministic data, the research in our latest project, URDF, focuses particularly on efficient, rule-based, and statistical inference techniques for uncertain RDF knowledge bases.
The image shows a visualization of such a knowledge base, automatically extracted from Wikipedia, containing facts about Max Planck. The edges represent entity relations associated with Max Planck, both extracted directly and deduced by inference.
In URDF, we can express, for example, that many people live in the same place as their spouses. Being a rather “soft” form of an inference rule, this rule will also be violated by a number of instances (in this case, people) in the real world, which means that any knowledge derived from such a soft rule will itself be uncertain. On the other hand, we can safely exclude the possibility that a person was born in two different geographic locations. This represents a strict constraint, i.e., a “hard” rule, which must not be violated by any instance in our knowledge base. In a logic-based representation, these rules are often formulated as implications, so-called Horn clauses, in which a conjunctive condition over facts implies a new fact:
marriedTo(person1, person2) ∧ livesIn(person2, location1) ⇒ livesIn(person1, location1)
The above rule is formulated in first-order predicate logic. More generally, our URDF framework allows us to identify common patterns among the instances in the knowledge base, and we can generalize (i.e., learn) first-order rules from these patterns in an inductive way. Conversely, we can then deductively reason about individual instances in our knowledge base by applying these first-order rules. For example, we can apply the above first-order rule to the persons “Angela Merkel” and “Joachim Sauer” and the place “Berlin-Mitte”. This form of uncertain reasoning allows us to infer that “Joachim Sauer” might live in “Berlin-Mitte”, given that we know that his wife, “Angela Merkel”, lives in “Berlin-Mitte”:
marriedTo(Joachim_Sauer, Angela_Merkel) ∧ livesIn(Angela_Merkel, Berlin-Mitte) ⇒ livesIn(Joachim_Sauer, Berlin-Mitte)
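The deduction step above can be sketched as simple forward chaining over a small fact base. The following Python snippet is purely illustrative (the fact representation and function name are our own, not URDF's actual interface): it grounds the soft rule against all matching fact pairs and collects the newly derived facts.

```python
# Illustrative sketch (not URDF's API): forward application of the soft rule
#   marriedTo(x, y) AND livesIn(y, z)  =>  livesIn(x, z)
# over a tiny fact base represented as (relation, subject, object) triples.

facts = {
    ("marriedTo", "Joachim_Sauer", "Angela_Merkel"),
    ("livesIn", "Angela_Merkel", "Berlin-Mitte"),
}

def apply_lives_with_spouse(facts):
    """Derive livesIn(x, z) from marriedTo(x, y) and livesIn(y, z)."""
    derived = set()
    for rel1, x, y in facts:
        if rel1 != "marriedTo":
            continue
        for rel2, y2, z in facts:
            if rel2 == "livesIn" and y2 == y:
                new_fact = ("livesIn", x, z)
                if new_fact not in facts:
                    derived.add(new_fact)
    return derived

print(apply_lives_with_spouse(facts))
# → {('livesIn', 'Joachim_Sauer', 'Berlin-Mitte')}
```

Since the rule is soft, a real system would attach a confidence weight to the derived fact rather than treating it as certain; that weighting is what the probabilistic evaluation strategies below address.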
Moreover, a special type of Horn clause can be formulated by grouping purely negated facts into a disjunction, as in the following example:
¬ bornIn(Angela_Merkel, Hamburg) ∨ ¬ bornIn(Angela_Merkel, Bremen)
With this Horn clause, we can express that “Angela Merkel” cannot have been born in both “Hamburg” and “Bremen”. Negations of this form thus give us a formal means to identify inconsistencies: if, in the extraction step, both facts bornIn(Angela_Merkel, Hamburg) and bornIn(Angela_Merkel, Bremen) have erroneously been inserted into the knowledge base, the above clause states that at most one of the two facts may be true.
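Checking such a hard rule against an extracted fact base amounts to finding fact pairs that jointly violate it. The sketch below is again illustrative (the data and helper function are our own): it flags every person for whom more than one birthplace was extracted.

```python
# Illustrative sketch: detect violations of the hard rule
#   NOT bornIn(p, a)  OR  NOT bornIn(p, b)   (for distinct places a, b)
# i.e. no person may have more than one extracted birthplace.

extractions = {
    ("bornIn", "Angela_Merkel", "Hamburg"),
    ("bornIn", "Angela_Merkel", "Bremen"),   # erroneous second extraction
    ("bornIn", "Max_Planck", "Kiel"),
}

def birthplace_conflicts(extractions):
    """Return each person whose extracted birthplaces violate the hard rule."""
    by_person = {}
    for rel, person, place in extractions:
        if rel == "bornIn":
            by_person.setdefault(person, set()).add(place)
    return {person: places
            for person, places in by_person.items()
            if len(places) > 1}

print(birthplace_conflicts(extractions))
# reports the Hamburg/Bremen conflict for Angela_Merkel only
```

Note that this check only localizes the inconsistency; deciding which of the conflicting facts to keep is exactly the reasoning problem the evaluation strategies below solve.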
Efficient evaluation strategies
URDF supports both logic-based (propositional) and probabilistic evaluation strategies to answer user queries. In propositional reasoning, the user is guaranteed to obtain a consistent view of a potentially inconsistent knowledge base in response to a query. This problem is reduced to the maximum satisfiability problem (Max-Sat), a classic problem in propositional logic. For URDF, we have developed a highly efficient method for solving a generalization of the Max-Sat problem, which is specifically tailored to the combination of soft and hard rules described above. In probabilistic reasoning, on the other hand, URDF assigns not only binary true/false values to facts but also confidence weights, which correspond to a probabilistic interpretation of the derivation of the facts via the rules.
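To make the Max-Sat view concrete, the following minimal sketch enumerates all truth assignments over a handful of facts, discards those that violate a hard clause, and keeps the one maximizing the total weight of satisfied soft clauses. The variable names and weights are invented for illustration; URDF's actual solver is an efficient specialized algorithm, not exhaustive enumeration.

```python
# Minimal weighted Max-Sat sketch with hard clauses (exhaustive, for
# illustration only; infeasible beyond a handful of variables).
from itertools import product

variables = ["bornIn_HH", "bornIn_HB", "livesIn_Berlin"]

# Soft clauses: (weight, [(variable, polarity), ...]); each clause is a
# disjunction of literals. Weights here mimic extractor confidences.
soft = [
    (0.8, [("bornIn_HH", True)]),       # bornIn(Merkel, Hamburg)
    (0.3, [("bornIn_HB", True)]),       # bornIn(Merkel, Bremen)
    (0.9, [("livesIn_Berlin", True)]),  # livesIn(Merkel, Berlin-Mitte)
]
# Hard clause: NOT bornIn_HH OR NOT bornIn_HB (at most one birthplace).
hard = [[("bornIn_HH", False), ("bornIn_HB", False)]]

def satisfied(clause, assignment):
    """A clause holds if any of its literals matches the assignment."""
    return any(assignment[v] == pol for v, pol in clause)

def best_world():
    """Pick the assignment of maximum soft weight among those that
    satisfy every hard clause."""
    best, best_w = None, -1.0
    for values in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, values))
        if not all(satisfied(c, a) for c in hard):
            continue  # hard rules may never be violated
        w = sum(wt for wt, c in soft if satisfied(c, a))
        if w > best_w:
            best, best_w = a, w
    return best, best_w

world, weight = best_world()
# Hamburg (0.8) wins over Bremen (0.3); Berlin is kept; weight ≈ 1.7
```

The hard clause forces the solver to drop one of the two conflicting birthplaces, and the weights determine which one survives, which is precisely the consistent view returned to the user.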
This enormous expressivity of rule-based and probabilistic inference techniques poses major challenges for the development of new and efficient evaluation strategies for user queries. Both logic-based and probabilistic evaluation strategies are subject to a much higher combinatorial complexity than traditional query evaluation strategies in relational databases. Exact evaluation algorithms therefore cannot be applied to the large volumes of data that we obtain through information extraction from the World Wide Web. Our research concentrates on efficient approximation algorithms with good approximation guarantees, which are specifically tailored to these evaluation strategies.
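As a flavor of why approximation guarantees are attainable at all for Max-Sat, consider the textbook observation that a uniformly random truth assignment satisfies a clause with k distinct literals with probability 1 − 2⁻ᵏ, and therefore satisfies at least half of the total clause weight in expectation. The sketch below (our own toy clauses, not URDF's algorithms) checks this empirically.

```python
# Textbook 1/2-approximation intuition for Max-Sat: a random assignment
# satisfies each k-literal clause with probability 1 - 2**(-k), so the
# expected satisfied weight is at least half the total weight.
# (Illustrative only; URDF uses refined algorithms tailored to its rules.)
import random

def random_assignment(variables, rng):
    """Assign each variable True/False uniformly at random."""
    return {v: rng.random() < 0.5 for v in variables}

def satisfied_weight(clauses, assignment):
    """Total weight of satisfied clauses.
    clauses: list of (weight, [(variable, polarity), ...])."""
    return sum(w for w, clause in clauses
               if any(assignment[v] == pol for v, pol in clause))

rng = random.Random(42)  # fixed seed for reproducibility
variables = ["x1", "x2", "x3"]
clauses = [(1.0, [("x1", True), ("x2", False)]),
           (2.0, [("x2", True), ("x3", True)]),
           (1.5, [("x3", False)])]

total = sum(w for w, _ in clauses)
trials = 10000
avg = sum(satisfied_weight(clauses, random_assignment(variables, rng))
          for _ in range(trials)) / trials
# the empirical average comfortably exceeds total / 2
```

For this clause set the expected satisfied weight is 0.75 · 1.0 + 0.75 · 2.0 + 0.5 · 1.5 = 3.0 out of a total weight of 4.5, well above the 1/2 guarantee; stronger guarantees require more sophisticated algorithms.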
About the author:
Martin Theobald studied Informatics in Saarbrücken, finishing his Diploma in 2002. He received his doctorate in Informatics from Saarland University in 2006. Since then, he has been employed as a Postdoctoral Researcher with Dept. 5. After a postdoctoral stay at Stanford University from 2006 to 2008, he was appointed as a Senior Researcher. His research interests include
contact: mtb (at) mpi-inf.mpg.de
Created by: Bertram Somieski/MPI-INF, 06/06/2012 12:17 PM | Last modified by: Bertram Somieski/MPI-INF, 06/18/2012 02:26 PM