Seminars and Events

ISI Natural Language Seminar

NL Seminar: Fair Comparisons and Fundamental Ideas for Open-Vocabulary Generative Language and Translation Models

Event Details

How can we fairly compare the performance of generative language and translation models across multiple languages? We will see how to use probabilistic and information-theoretic measures: first to evaluate (monolingual) open-vocabulary language models by total bits, and then, turning to the case of translationese, to consider what “information” means and how to use it to compare machine translation models. In both cases, we get a glimpse of which linguistic and non-linguistic factors might make languages easier or harder for models. The last part of the talk will (time permitting) propose some opinionated guidelines for open-vocabulary language modeling, and present work in progress on taxonomizing tokenization methods and the literature around open-vocabulary modeling.
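The core idea behind comparing open-vocabulary models by total bits, rather than per-token perplexity, can be illustrated with a minimal sketch. The function and values below are hypothetical, assuming only that each model provides per-token (natural-log) probabilities for the same held-out text; because the total is summed over the whole text, it stays comparable even when the models use different tokenizations.

```python
import math

def total_bits(token_log_probs):
    """Total bits to encode a text: sum of -log2 p(token) over all tokens.

    Inputs are natural-log probabilities, so we divide by ln(2).
    """
    return -sum(lp / math.log(2) for lp in token_log_probs)

# Two hypothetical models scoring the SAME text with different tokenizations:
# model A segments it into 4 coarse tokens, model B into 6 finer subwords.
model_a_logprobs = [math.log(p) for p in (0.10, 0.20, 0.05, 0.10)]
model_b_logprobs = [math.log(0.30)] * 6

bits_a = total_bits(model_a_logprobs)
bits_b = total_bits(model_b_logprobs)

# Per-token perplexity is not comparable across the two vocabularies,
# but total bits over the same text is.
print(f"model A: {bits_a:.2f} bits, model B: {bits_b:.2f} bits")
```

A sanity check on the unit: a single token with probability 0.5 costs exactly one bit, so `total_bits([math.log(0.5)])` is 1.0.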

REMINDER: Meeting hosts will only admit guests they know into the Zoom meeting, so you are strongly encouraged to sign into Zoom with your USC account. If you are an outside visitor, please inform nlg DASH seminar DASH admin2 AT isi.edu beforehand so we are aware of your attendance and can let you in.

Speaker Bio

Sabrina is a PhD student at Johns Hopkins University and a part-time research intern at HuggingFace, currently researching open-vocabulary language modeling for unit discovery across typologically varied languages. Her pre-PhD work focused on formal language theory applied to parsing and translation. During her PhD she has published on morphology, fair language model comparison, stochastic romanization (at Google AI), and metacognition and calibration for chatbots (at Facebook AI Research); she has also co-organized workshops and shared tasks on morphology and typology, and is currently involved in the BigScience summer workshop on large language models.