Campus Event Calendar

Event Entry

What and Who

Design and Evaluation of an IR-Benchmark for SPARQL Fulltext Queries

Arunav Mishra
Fachrichtung Informatik - Saarbrücken
PhD Application Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
AG Audience
English

Date, Time and Location

Monday, 27 May 2013
11:00
60 Minutes
E1 4
24
Saarbrücken

Abstract

In this thesis, we design a new IR benchmark that aims to bridge the prevailing gap between traditional keyword-based retrieval techniques and Semantic-Web-based retrieval techniques. We present a unique, entity-centric data collection, coined Wikipedia-LOD, that aims to combine the benefits of both text-oriented and structured retrieval settings. The collection combines RDF data from the DBpedia and YAGO2 structured Knowledge Bases (KBs) with textual data from the contents of Wikipedia articles into XML-ified documents, called Wiki-XML documents, one for every Wikipedia entity. To evaluate such a collection, we introduce a new query format, called SPARQL-fulltext (SPARQL-FT) queries. We design the SPARQL-FT query format by extending the W3C standard SPARQL with an additional FTContains operator that constrains an entity by a set of keywords representing a fulltext condition. We design a query benchmark of 90 queries by manually translating Jeopardy-style Natural Language (NL) questions into SPARQL-FT queries. We present Wikipedia-LOD (v1.1) as the core collection for the newly introduced INEX 2012 Linked Data (LOD) track, which defines three tasks over the collection, namely an Ad-hoc retrieval task, a Faceted retrieval task, and the new Jeopardy task. For the Jeopardy task, we provide the query benchmark designed in this thesis to evaluate the participating engines.
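
To give a flavor of the query format, the following is a minimal sketch of what a SPARQL-FT query might look like; the FILTER-style placement of FTContains, the example natural-language question, and the DBpedia class and property names (dbo:Film, dbo:director, dbr:Stanley_Kubrick) are illustrative assumptions and need not match the actual benchmark queries.

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dbr: <http://dbpedia.org/resource/>

  # Hypothetical NL question: "Which film directed by Stanley Kubrick is set in space?"
  SELECT ?film WHERE {
    ?film rdf:type dbo:Film .                   # structured condition over the RDF triples
    ?film dbo:director dbr:Stanley_Kubrick .    # structured condition over the RDF triples
    FILTER FTContains(?film, "space odyssey")   # fulltext condition over the entity's Wiki-XML article
  }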

In this thesis, we further describe the indexing, ranking, and query processing techniques that we implement in our new SPAR-Key engine in order to process this new kind of SPARQL-FT queries, provided in the context of the Jeopardy task of the INEX 2012 Linked Data track. For the rapid development of a query engine that can handle this particular combination of XML mark-up and RDF-style resource/property pairs, we opt for a relational DBMS as the storage back-end, which allows us to index the collection and to evaluate both the SPARQL- and keyword-related conditions of the Jeopardy queries under one common application layer. Additionally, our engine comes with a rewriting layer that translates the SPARQL-based query patterns into unions of conjunctive SQL queries, thus formulating joins over both the DBpedia triples and the keywords extracted from the XML articles.
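
As a rough illustration of this rewriting step (under assumed table names and a simplified schema, e.g. a Triples(subject, predicate, object) table and a Keywords(entity, term) index), the SPARQL-FT sketch above might be translated into a conjunctive SQL query along these lines:

  SELECT DISTINCT t1.subject AS film
  FROM   Triples t1, Triples t2, Keywords k1, Keywords k2
  WHERE  t1.predicate = 'rdf:type'     AND t1.object = 'dbo:Film'
    AND  t2.subject   = t1.subject
    AND  t2.predicate = 'dbo:director' AND t2.object = 'dbr:Stanley_Kubrick'
    -- keyword conditions derived from FTContains, joined on the same entity
    AND  k1.entity = t1.subject AND k1.term = 'space'
    AND  k2.entity = t1.subject AND k2.term = 'odyssey';
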
Finally, we perform a detailed evaluation of the effectiveness of our query engine by processing the benchmark queries. We present the results from the official INEX'12 evaluations of the Jeopardy task, which were performed with Ad-hoc-search-style relevance assessments obtained through crowdsourcing. However, we show that such an evaluation does not truly comply with the task definition, and hence a re-evaluation with QA-style assessments is required. For the re-evaluation, we create a gold result set, or ground truth, by mapping the already known correct answers of the NL questions to the corresponding Wikipedia entities. By outperforming our competitors in terms of MRR and NDCG, we show the definite advantage of exploiting both structured and unstructured information to improve Question-Answering and Entity-Retrieval tasks.
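
For reference, the standard definitions of these two measures (stated here in their common textbook form, not in the exact configuration used in the official evaluation) are

  MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}, \qquad nDCG@k = \frac{DCG@k}{IDCG@k} \quad \text{with} \quad DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)},

where rank_i is the rank of the first correct answer for query i, rel_i is the graded relevance of the result at rank i, and IDCG@k is the DCG@k of the ideal ranking.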

Aaron Alsancak, 05/21/2013 14:09
Aaron Alsancak, 05/21/2013 14:08 -- Created document.