Campus Event Calendar

Event Entry

What and Who

Knowledge-driven Entity Recognition and Disambiguation in Biomedical Text

Amy Siu
MMCI
Promotionskolloquium
AG 1, AG 2, AG 3, AG 4, AG 5, SWS, RG1, MMCI  
Public Audience
English

Date, Time and Location

Monday, 4 September 2017
16:00
60 Minutes
E1 4
024
Saarbrücken

Abstract

Entity recognition and disambiguation (ERD) in the biomedical domain
are notoriously difficult problems due to the variety of entities and
their often long names in many variations. Existing work focuses heavily
on the molecular level in two ways. First, they target scientific
literature as the input text genre. Second, they target single, highly
specialized entity types such as chemicals, genes, and proteins.
However, a wealth of biomedical information is also buried in the vast
universe of Web content. In order to fully utilize all the information
available, there is a need to tap into Web content as an additional
input. Moreover, there is a need to cater for other entity types such as
symptoms and risk factors since Web content focuses on consumer health.
The goal of this thesis is to investigate ERD methods that are
applicable to all entity types in scientific literature as well as Web
content. In addition, we focus on under-explored aspects of the
biomedical ERD problems -- scalability, long noun phrases, and
out-of-knowledge base (OOKB) entities.
This thesis makes four main contributions, all of which leverage
knowledge in UMLS (Unified Medical Language System), the largest and
most authoritative knowledge base (KB) of the biomedical domain. The
first contribution is a fast dictionary lookup method for entity
recognition that maximizes throughput while balancing the loss of
precision and recall. The second contribution is a semantic type
classification method targeting common words in long noun phrases. We
develop a custom set of semantic types to capture word usages; besides
biomedical usage, these types also cope with non-biomedical usage and
the case of generic, non-informative usage. The third contribution is a
fast heuristics method for entity disambiguation in MEDLINE abstracts,
again maximizing throughput but this time maintaining accuracy. The
fourth contribution is a corpus-driven entity disambiguation method that
addresses OOKB entities. The method first captures the entities
expressed in a corpus as latent representations that comprise in-KB and
OOKB entities alike before performing entity disambiguation.

Contact

Daniela Alessi
5000

Daniela Alessi, 08/25/2017 10:06 -- Created document.