examples of varieties of English. Whilst these varieties were (and, in many cases, still are)
used only in spoken contexts or hard-to-access private messages, social media sites like Twitter
provide a platform for users to communicate informally in a publicly available and scrapeable
format. Notably, varieties like Indian English (Hinglish), Singaporean English (Singlish), and
African-American English (AAE) can be commonly found online. These varieties present a
challenge to existing natural language processing (NLP) tools as they often differ
orthographically, syntactically and semantically from standard English, for which the vast
majority of these tools are built.
It has been observed that NLP models trained on standard English texts produce biased
outcomes for users of underrepresented varieties (Blodgett and O’Connor, 2017). Some research
has aimed to overcome the inherent biases caused by unrepresentative data through techniques
like data augmentation or adjustments to model training. We aim to address the issue of bias at its
root, the data itself, by producing an annotated dataset of tweets from countries with high
proportions of underserved English variety speakers. We label each tweet using six categorical
classifications along a pseudo-spectrum that measures how closely the tweet adheres to standard English and
thereby indirectly aims to surface the manifestations of English varieties in these tweets.
Following best annotation practices and leveraging the international student community
at New York University, we have built a growing corpus that currently features 122,000 tweets taken from five
countries, labeled by annotators who are from those countries and can communicate in a
regionally dominant variety of English. We hope to contribute to the growing literature
on identifying and reducing the implicit demographic discrepancies in NLP.
References
Blodgett, S. L. and O’Connor, B. (2017). Racial disparity in natural language processing: A case
study of social media African-American English. CoRR, abs/1707.00061.