Campus Event Calendar

Event Entry

What and Who

Scoring Search Results in the Presence of Overlapping Data Sources

Silke Trißl
Humboldt-Universität zu Berlin
Talk
AG 1, AG 3, AG 5, RG2, AG 2, AG 4, RG1, SWS  
AG Audience
English

Date, Time and Location

Friday, 3 August 2007
14:00
90 Minutes
E1 4
433 (Rotunda 4th floor)
Saarbrücken

Abstract

Data integration projects in the life sciences often gather data on a particular subject from multiple sources. Some of these sources overlap to a certain degree. Therefore, integrated search results may be supported by one, a few, or all data sources. To reflect these differences, results of a query should be ranked according to the number of data sources that support them.

In my talk I will discuss what such a ranking scheme should look like, which is not clear per se. Either results supported by only a few sources are ranked high because this information is potentially new, or such results are ranked low because the evidence supporting them is limited. We defined a surprisingness score, which prefers results supported by few sources, and a confidence score, which prefers frequently encountered information. Unlike many other scoring schemes, our proposal is purely data-driven and does not require users to specify preferences among sources. Both scores take the concrete overlaps of data sources into account and do not presume statistical independence. I will also present some results using the Columba database, an integrated database on protein structures from the PDB and their annotations such as fold, function, or sequence.
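
The two opposing ranking directions can be illustrated with a minimal sketch. It is not the scoring scheme presented in the talk: it only counts supporting sources and ignores the overlaps between them, which the actual scores take into account. The source names (A, B, C), the result identifiers, and the formulas k/n and 1/k are assumptions chosen for illustration.

```python
from collections import defaultdict


def support_sets(sources):
    """Map each result to the set of source names that contain it."""
    support = defaultdict(set)
    for name, results in sources.items():
        for result in results:
            support[result].add(name)
    return dict(support)


def rank(sources, prefer="confidence"):
    """Rank results by a naive, count-based score (illustrative assumption).

    confidence:     k / n  (more supporting sources ranks higher)
    surprisingness: 1 / k  (fewer supporting sources ranks higher)
    where k is the number of supporting sources and n the number of sources.
    """
    support = support_sets(sources)
    n = len(sources)

    def score(result):
        k = len(support[result])
        return k / n if prefer == "confidence" else 1 / k

    # Sort by descending score; break ties by result identifier.
    return sorted(support, key=lambda r: (-score(r), r))


# Hypothetical overlapping sources.
sources = {
    "A": {"r1", "r2", "r3"},
    "B": {"r1", "r2"},
    "C": {"r1"},
}

print(rank(sources, "confidence"))
print(rank(sources, "surprisingness"))
```

On this toy input the two calls return the same results in opposite order, which is the tension described above.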

Contact

Ralf Schenkel
+49 681 9325 504

Ralf Schenkel, 07/26/2007 12:55 -- Created document.