Making Meaning: Predictive Processing of Narratives in Humans and Machines
Viktor Kewenig
UCL
Talk
My research sits at the intersection of AI, cognitive neuroscience, and philosophy, with a focus on language. Naturally, I am a big fan of Wittgenstein. After studying logic, philosophy of science, and philosophy of mind at Cambridge, I moved into the cognitive neuroscience of language comprehension during my MSc at UCL. Currently, as part of the Eco-Brain Leverhulme DTP, my PhD research employs multimodal computational methods and ecologically valid experimental designs to investigate alignment principles between AI and human cognition. I also collaborate with Microsoft Research Cambridge to study how generative AI impacts human cognition in knowledge work and education.
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI
In this talk, I examine how humans and artificial systems process narratives by integrating information from multiple modalities. First, I describe a behavioural study in which participants predicted upcoming words in short film clips while their eye movements were recorded. We compared unimodal (text-only) and multimodal (vision + text) computational models that differ in architecture, finding that models with cross-modal attention matched human word predictions and gaze patterns more closely, especially when the film clips provided meaningful visual context. Second, I discuss a neuroimaging study in which participants listened to extended stories during fMRI. Using unimodal and multimodal variants of a large language model, we predicted brain activity (encoding) and decoded semantic content from the recorded signals. The multimodal model substantially outperformed the unimodal version, predicting widespread brain activation and improving semantic decoding. Notably, only the multimodal model benefited significantly from the inclusion of brain data, suggesting that multimodal embeddings are more biologically plausible. Taken together, these findings indicate that narrative comprehension relies on distributed, multimodal processes, and that incorporating non-linguistic cues (e.g., vision and audition) into large language models yields richer, more human-like representations of meaning.
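As a rough illustration of the encoding analysis described above (a minimal sketch, not the exact pipeline used in the study), the snippet below fits a voxelwise ridge-regression encoding model that maps stimulus embeddings, either from a text-only or a multimodal model, to fMRI responses and compares held-out prediction accuracy. All file names, the train/test split, and the regularisation grid are hypothetical placeholders.

```python
# Minimal sketch of a voxelwise encoding-model comparison, assuming
# precomputed stimulus embeddings (unimodal vs. multimodal) and BOLD
# responses aligned to the same time points. Data-loading names and
# parameters are hypothetical; the actual study pipeline may differ.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def encoding_score(X, Y, alphas=(1.0, 10.0, 100.0, 1000.0)):
    """Fit ridge regression from embeddings X (time x features) to BOLD Y
    (time x voxels); return per-voxel correlation on held-out time points."""
    # Contiguous (unshuffled) split to respect the temporal structure of fMRI data.
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, shuffle=False)
    model = RidgeCV(alphas=alphas).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    # Pearson correlation per voxel between predicted and observed responses.
    Y_hat_c = Y_hat - Y_hat.mean(axis=0)
    Y_te_c = Y_te - Y_te.mean(axis=0)
    num = (Y_hat_c * Y_te_c).sum(axis=0)
    den = np.sqrt((Y_hat_c ** 2).sum(axis=0) * (Y_te_c ** 2).sum(axis=0)) + 1e-12
    return num / den

# Hypothetical inputs: embeddings from a text-only vs. a vision+text model,
# and the BOLD time series for one participant.
X_unimodal = np.load("embeddings_text_only.npy")     # (n_timepoints, n_features)
X_multimodal = np.load("embeddings_multimodal.npy")  # (n_timepoints, n_features)
Y_bold = np.load("bold_responses.npy")                # (n_timepoints, n_voxels)

r_uni = encoding_score(X_unimodal, Y_bold)
r_multi = encoding_score(X_multimodal, Y_bold)
print("mean voxel r (unimodal):  ", r_uni.mean())
print("mean voxel r (multimodal):", r_multi.mean())
```

In this kind of comparison, a higher held-out correlation for the multimodal embeddings would be read as evidence that they capture more of the variance in brain responses than text-only embeddings do.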