Toward Deep Semantic Understanding: Event-Centric Multimodal Knowledge Acquisition
University of Illinois Urbana-Champaign
Manling Li is a Ph.D. candidate in the Computer Science Department at the University of Illinois Urbana-Champaign. Her work on multimodal knowledge extraction won the ACL'20 Best Demo Paper Award, and her work on scientific information extraction from COVID-19 literature won the NAACL'21 Best Demo Paper Award. She was a recipient of the Microsoft Research PhD Fellowship in 2021, and was selected as a DARPA Riser and an EECS Rising Star in 2022. She was awarded the C.L. Dave and Jane W.S. Liu Award and has been selected as a Mavis Future Faculty Fellow. She led 19 students to develop the UIUC information extraction system, which ranked 1st in the DARPA AIDA evaluations in 2019 and 2020. She has more than 30 publications on multimodal knowledge extraction and reasoning, and has given tutorials on event-centric multimodal knowledge at ACL'21, AAAI'21, NAACL'22, AAAI'23, and other venues. Additional information is available at https://limanling.github.io/.
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI
Please note that this is a virtual talk, which will be videocast to Saarbrücken and Kaiserslautern.
Traditionally, multimodal information consumption has been entity-centric, focusing on concrete concepts (such as objects, object types, and physical relations, e.g., a person in a car), but it has lacked the ability to understand abstract semantics (such as events and the semantic roles of objects, e.g., driver, passenger, mechanic). Yet such event-centric semantics are the core knowledge being communicated, regardless of whether it takes the form of text, images, videos, or other data modalities.
At the core of my research in Multimodal Information Extraction (IE) is bringing such deep semantic understanding to the multimodal world. My work opens up a new research direction, Event-Centric Multimodal Knowledge Acquisition, to transform traditional entity-centric single-modal knowledge into event-centric multimodal knowledge. This transformation poses two significant challenges: (1) understanding multimodal semantic structures that are abstract (such as events and the semantic roles of objects): I will present my solution of zero-shot cross-modal transfer (CLIP-Event), which is the first to model event semantic structures for vision-language pretraining and the first to support zero-shot multimodal event extraction; (2) understanding long-horizon temporal dynamics: I will introduce the Event Graph Model, which empowers machines to capture complex timelines, intertwined relations, and multiple alternative outcomes. I will also show its positive results on long-standing open problems, such as timeline generation, meeting summarization, and question answering. Such event-centric multimodal knowledge enables the next generation of information access, allowing us to effectively access historical scenarios and reason about the future. Finally, I will lay out how I plan to grow a deep semantic understanding of the language world and the vision world, moving from concrete to abstract, from static to dynamic, and ultimately from perception to cognition.
Please contact the Office team for Zoom link information.