Large-Scale Language Patterns across Social Groups
Lucy Li
University of California, Berkeley
Talk
Lucy is a PhD student at the University of California, Berkeley, affiliated with Berkeley Artificial Intelligence Research (BAIR) and the School of Information. Her research lies at the intersection of natural language processing, computational social science, and AI fairness. Lucy has been recognized by EECS Rising Stars and Rising Stars in Data Science, and has received an American Educational Research Association (AERA) Best Paper Award and an NSF Graduate Research Fellowship. During her PhD, she interned at Microsoft Research and the Allen Institute for AI, the latter of which named her Outstanding Intern of the Year in 2022.
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI
NLP methods can measure large-scale language patterns across social groups, and these measurements can in turn answer sociolinguistic questions and inform model development. First, I'll present studies quantifying community-specific words and meanings across two domains: online discussion forums and scholarly literature. In these studies, I leverage perspectives from sociolinguistics to relate communities' use of distinctive language to various social factors, such as community behavior and cross-community impact. Second, I'll present a study analyzing the effects of large language model (LLM) pretraining data practices on different social groups. Model development begins with data curation, but decisions around whose data is retained or removed during this initial stage are under-scrutinized. Using a new dataset of 10.3 million website creators' self-descriptions, we investigate how pretraining data filters affect web pages spanning a range of social and geographic origins. Our experiments illuminate a range of implicit preferences in data curation: we show that some "quality" classifiers act like topical domain filters, and English language ID filters can overlook English content from some regions of the world.