Seminars and Events
Socially Responsible Data for Large Multilingual Language Models
Event Details
Large Language Models (LLMs) have rapidly increased in size and apparent capability over the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving for models to accommodate languages of communities outside the Global North, which include many languages that have been historically underrepresented in digital realms. These languages are often termed “low-resource languages” or “long-tail languages”, and LLM performance on them is generally poor. While expanding the use of LLMs to more languages may bring many potential benefits, such as assisting cross-community communication and language preservation, great care must be taken to ensure that data collection in these languages is not extractive and does not reproduce exploitative practices of the past. Collecting data from languages spoken by previously colonized peoples, Indigenous peoples, and non-Western communities raises many complex sociopolitical and ethical questions, e.g., around consent, cultural safety, and data sovereignty. Furthermore, linguistic complexity and cultural nuances are often lost in LLMs. This position paper builds on recent scholarship and our own work to outline several relevant social, cultural, and ethical considerations, along with potential ways to address them through qualitative research, community partnerships, and participatory design approaches. We provide twelve recommendations for consideration when collecting language data from underrepresented language communities outside the Global North.
February 4, 2025
Zoom link: https://usc.zoom.us/j/95143932471?pwd=lfKAdRoLzLnawp4lu8wdT0ibc2CyT8.1
Meeting ID: 951 4393 2471
Passcode: 2025
This presentation will not be recorded.
Host: Rebecca Dorn
POC: Maura Covaci
Speaker Bio
Erin has worked at the crossroads of linguistics, social sciences, and tech for over 15 years. She has been an ontologist, linguist, and researcher at Google for nearly a decade. Her current work with Google Research's Impact Lab team focuses on how communities understand the intersection of culture, identity, and technology, with the goal of building community-driven, evidence-based ontologies that represent a plurality of identities across Google’s products, including LLMs. In 2020, Erin had the honor of serving as a Google Fellow with the Trevor Project Fellowship on the award-winning Crisis Contact Simulator, featured in TIME's 100 Best Inventions of 2021. She later brought that experience to a second fellowship with Reflex AI, focused on peer-to-peer mental health conversations for veterans. Before Google Research, Erin worked on ontology design for Google’s Knowledge Graph and on text/image classification with the Ads Privacy and Safety teams. When she isn't busy with tech, you can find Erin racing sailboats on Santa Monica Bay!