Audiovisual Data Processing for Robust Human-Machine-Communication and Media Retrieval
Björn Schuller
Lehrstuhl für Mensch-Maschine-Kommunikation, TU München, Germany
Talk
Björn Schuller received his diploma (1999) and his doctoral degree (2006) in electrical engineering and information technology for his work in Automatic Speech and Emotion Recognition from TUM (Munich University of Technology), one of Germany's first three Excellence Universities, where he currently works as a senior researcher and lecturer in Pattern Recognition and Speech Processing and is an officially acknowledged candidate for the PD Dr.-Ing. habil. degree. He is a member of the ACM, IEEE, and ISCA, and has authored and co-authored more than 100 publications in books, journals, and peer-reviewed conference proceedings in the fields of signal processing and machine learning. He is best known for his work advancing Speech Processing, Affective Computing, and Music Information Retrieval. He has served as a reviewer for several scientific journals, and as an invited speaker, session and challenge organizer, chairman, and programme committee member of numerous international conferences. His project steering board activity and involvement in current and past research projects include SEMAINE, funded by the European Community's Seventh Framework Programme, the HUMAINE CEICES initiative, and projects funded by companies such as BMW, Continental, Daimler, Siemens, Toyota, and VDO. His advisory board activities comprise membership as an invited expert in the W3C Emotion Incubator and Emotion Markup Language Incubator Groups, and his election to the Executive Committee of the HUMAINE Association, where he chairs the Special Interest Group on Emotion Recognition from Speech.
Audiovisual signal processing approaches are widely agreed to be superior to their unimodal counterparts with respect to robustness, fail-safety, and user comfort in a multiplicity of Human-Computer Interaction and Multimedia Retrieval tasks. Typical application scenarios comprise both synergistic and concurrent multimodality. The main problem of integration is usually the asynchrony of the audio and video cues, or of textual information. This talk therefore provides a short introduction to early, late, and hybrid integration strategies. Emphasis is placed on preserving as much of the available knowledge as possible during the synchronization and integration of streams. To this end, diverse machine learning approaches are discussed, comprising Graphical Models, Multidimensional Dynamic Time Warping, and Meta-Classification. Insight into their effectiveness is given through a number of recent application scenarios, such as multimodal Emotion and Behaviour Recognition, Meeting Segmentation, and Music Retrieval, selected to cover the named integration types.
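The distinction between early (feature-level) and late (decision-level) integration mentioned above can be sketched as follows. This is a minimal illustration, not material from the talk itself: the feature values, class posteriors, and the fixed-weight averaging rule are all hypothetical stand-ins for the actual classifiers and fusion schemes discussed.

```python
import numpy as np

def early_fusion(audio_feat, video_feat):
    """Early (feature-level) fusion: concatenate per-frame feature
    vectors from both modalities into one joint vector, which a
    single classifier would then process."""
    return np.concatenate([audio_feat, video_feat])

def late_fusion(audio_post, video_post, w_audio=0.5):
    """Late (decision-level) fusion: combine the class posteriors of
    independent unimodal classifiers by a weighted average and
    renormalize so the result is again a probability distribution."""
    fused = w_audio * np.asarray(audio_post) \
        + (1.0 - w_audio) * np.asarray(video_post)
    return fused / fused.sum()

# Hypothetical per-frame features (e.g. energy/pitch for audio,
# facial-movement descriptors for video)
audio = np.array([0.2, 0.7])
video = np.array([0.1, 0.9, 0.4])
joint = early_fusion(audio, video)   # 5-dimensional joint vector

# Hypothetical per-class posteriors from two unimodal classifiers
p_audio = [0.6, 0.3, 0.1]
p_video = [0.2, 0.5, 0.3]
p_fused = late_fusion(p_audio, p_video, w_audio=0.6)
```

Late fusion of this kind tolerates asynchronous modalities more easily, since each stream is classified on its own time base before the decisions are merged; hybrid schemes combine both ideas.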