Campus Event Calendar: Nhi Pham (01/27/2023 in Virtual talk/zoom)

Event Entry

What and Who

A Twitter corpus of linguistically and geographically diverse varieties of English

Nhi Pham

New York University, Abu Dhabi Campus

PhD Application Talk

AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI

AG Audience

English

Date, Time and Location

Friday, 27 January 2023

14:30

30 Minutes

Virtual talk

zoom

Abstract

The increasing prevalence of social media presents a growing opportunity to collect and analyze
examples of varieties of English. Whilst these varieties were, and in many cases still are,
used only in spoken contexts or hard-to-access private messages, social media sites like Twitter
provide a platform for users to communicate informally in a publicly available and scrapeable
format. Notably, varieties like Indian English (Hinglish), Singaporean English (Singlish), and
African-American English (AAE) can be commonly found online. These varieties present a
challenge to existing natural language processing (NLP) tools, as they often differ
orthographically, syntactically and semantically from the standard English for which the vast
majority of these tools are built.

NLP models trained on standard English texts have been observed to produce biased
outcomes for users of underrepresented varieties (Blodgett and O’Connor, 2017). Some research
has aimed to overcome the inherent biases caused by unrepresentative data through techniques
like data augmentation or adjusting training models. We aim to address the issue of bias at its
root, the data itself, by producing an annotated dataset of tweets from countries with high
proportions of underserved English variety speakers. We label each tweet using six categorical
classifications along a pseudo-spectrum that measures the degree of standard English and that
thereby indirectly aims to surface the manifestations of English varieties in these tweets.

Following best annotation practices and leveraging the international student community
at New York University, our growing corpus currently features 122,000 tweets taken from five
countries, labeled by annotators who are from those countries and can communicate in a
regionally dominant variety of English. We hope to contribute to the growing literature
identifying and reducing the implicit demographic discrepancies in NLP.

References
Blodgett, S. L. and O’Connor, B. (2017). Racial disparity in natural language processing: A case
study of social media African-American English. CoRR, abs/1707.00061.

Contact

Jennifer Gerling

+49 681 9325 1801

Virtual Meeting Details

System used:

Zoom

Meeting URL:

https://zoom.us/j/93723568507

Jennifer Gerling, 01/26/2023 17:26 -- Created document.
