Campus Event Calendar

Event Entry

What and Who

A Twitter corpus of linguistically and geographically diverse varieties of English

Nhi Pham
New York University, Abu Dhabi Campus
PhD Application Talk
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI  
AG Audience
English

Date, Time and Location

Friday, 27 January 2023
14:30
30 Minutes
Virtual talk
zoom

Abstract

The increasing prevalence of social media presents a growing opportunity to collect and analyze
examples of varieties of English. Whilst these varieties were, and in many cases still are,
used only in spoken contexts or hard-to-access private messages, social media sites like Twitter
provide a platform for users to communicate informally in a publicly available and scrapeable
format. Notably, varieties like Indian English (Hinglish), Singaporean English (Singlish), and
African-American English (AAE) can be commonly found online. These varieties present a
challenge to existing natural language processing (NLP) tools as they often differ
orthographically, syntactically and semantically from standard English, for which the vast
majority of these tools are built.

NLP models trained on standard English texts have been observed to produce biased
outcomes for users of underrepresented varieties (Blodgett and O’Connor, 2017). Some research
has aimed to overcome the inherent biases caused by unrepresentative data through techniques
like data augmentation or adjusting training models. We aim to address the issue of bias at its
root, the data itself, by producing an annotated dataset of tweets from countries with high
proportions of underserved English variety speakers. We label each tweet using six categorical
classifications along a pseudo-spectrum that measures the degree of standard English and that
thereby indirectly aims to surface the manifestations of English varieties in these tweets.

Following best annotation practices and leveraging the international student community
at New York University, our growing corpus currently features 122,000 tweets taken from five
countries, labeled by annotators who are from those countries and can communicate in a
regionally-dominant variety of English. We hope to contribute to the growing literature
identifying and reducing the implicit demographic discrepancies in NLP.

References
Blodgett, S. L. and O’Connor, B. (2017). Racial disparity in natural language processing: A case
study of social media African-American English. CoRR, abs/1707.00061.

Contact

Jennifer Gerling
+49 681 9325 1801

Virtual Meeting Details

Zoom

Jennifer Gerling, 01/26/2023 17:26 -- Created document.