Campus Event Calendar

Event Entry

What and Who

Large-Scale Language Patterns across Social Groups

Lucy Li
University of California, Berkeley
Talk

Lucy is a PhD student at the University of California, Berkeley, affiliated with Berkeley Artificial Intelligence Research (BAIR) and the School of Information. Her research lies at the intersection of natural language processing, computational social science, and AI fairness. Lucy has been recognized by EECS Rising Stars and Rising Stars in Data Science, and has received an American Educational Research Association (AERA) Best Paper Award and an NSF Graduate Research Fellowship. During her PhD, she interned at Microsoft Research and the Allen Institute for AI; the latter named her Outstanding Intern of the Year in 2022.
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI  
AG Audience
English

Date, Time and Location

Thursday, 26 September 2024
10:00
60 Minutes
E1 5
029
Saarbrücken

Abstract

NLP methods can measure large-scale language patterns across social groups, and these measurements can in turn answer sociolinguistic questions and inform model development. First, I'll present studies quantifying community-specific words and meanings across two domains: online discussion forums and scholarly literature. In these studies, I leverage perspectives from sociolinguistics to relate communities' use of distinctive language to various social factors, such as community behavior and cross-community impact. Second, I'll present a study analyzing the effects of large language model (LLM) pretraining data practices on different social groups. Model development begins with data curation, but decisions around whose data is retained or removed during this initial stage are under-scrutinized. Using a new dataset of 10.3 million website creators' self-descriptions, we investigate how pretraining data filters affect web pages spanning a range of social and geographic origins. Our experiments illuminate a range of implicit preferences in data curation: we show that some "quality" classifiers act like topical domain filters, and that English language ID filters can overlook English content from some regions of the world.

Contact

Claudia Richter
+49 681 9303 9103

Virtual Meeting Details

Zoom
925 3862 0453
passcode not visible
logged in users only

Claudia Richter, 09/13/2024 13:59 -- Created document.