Large-Scale Language Patterns across Social Groups
Lucy Li
University of California, Berkeley
Talk
Lucy is a PhD student at the University of California, Berkeley, affiliated with Berkeley Artificial Intelligence Research (BAIR) and the School of Information. Her research lies at the intersection of natural language processing, computational social science, and AI fairness. Lucy has been recognized by EECS Rising Stars and Rising Stars in Data Science, and has received an American Educational Research Association (AERA) Best Paper Award and an NSF Graduate Research Fellowship. During her PhD, she interned at Microsoft Research and the Allen Institute for AI, the latter of which named her Outstanding Intern of the Year in 2022.
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI
NLP methods can measure large-scale language patterns across social groups, and these measurements can in turn answer sociolinguistic questions and inform model development. First, I'll present studies quantifying community-specific words and meanings across two domains: online discussion forums and scholarly literature. In these studies, I leverage perspectives from sociolinguistics to relate communities' use of distinctive language to various social factors, such as community behavior and cross-community impact. Second, I'll present a study analyzing the effects of large language model (LLM) pretraining data practices on different social groups. Model development begins with data curation, but decisions around whose data is retained or removed during this initial stage are under-scrutinized. Using a new dataset of 10.3 million website creators' self-descriptions, we investigate how pretraining data filters affect web pages spanning a range of social and geographic origins. Our experiments illuminate a range of implicit preferences in data curation: we show that some "quality" classifiers act like topical domain filters, and English language ID filters can overlook English content from some regions of the world.