Campus Event Calendar

Event Entry

What and Who

Adapting Named Entity Disambiguation for Arabic Text

Mohamed Gad-Elrab
International Max Planck Research School for Computer Science - IMPRS
PhD Application Talk
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Monday, 4 May 2015
09:15
75 Minutes
E1 4
024
Saarbrücken

Abstract

Named Entity Disambiguation (NED) is the problem of mapping mentions of ambiguous names in natural language text onto canonical entities, such as people or places, registered in a knowledge base. Recent advances in this field enable semantic understanding of content in different types of text. While the problem has been extensively studied for English text, support for other languages, and for Arabic in particular, is still in its infancy. At the same time, Arabic Web content (e.g. in social media) has been increasing dramatically over the last years. We therefore see great potential for endeavors that support entity-level analytics of these data. AIDArabic is the first work in that direction: it uses evidence from both the English and Arabic Wikipedia to enrich the existing AIDA system, allowing the disambiguation of Arabic content to a knowledge base generated automatically from Wikipedia.

The contributions of this work are threefold: 1) We introduce techniques for automatically augmenting AIDArabic’s entity catalog and disambiguation ingredients using information beyond interwiki links. We achieve this by fusing the output of lightweight machine translation, transliteration, and external web sources. 2) We introduce a language-specific input processing module to handle the particularities of the Arabic language. 3) We automatically build a test corpus from parallel corpora to overcome the absence of standard benchmarks for Arabic NED systems. We evaluated single components as well as the full pipeline using a mix of manual and automatic assessment. Initial enrichment statistics show that our system can disambiguate mentions to one of 2.4M entities instead of only 140K in the original AIDArabic.

Contact

Jennifer Gerling
1800

Jennifer Gerling, 05/02/2015 18:55 -- Created document.