Seminars and Events
NL Seminar - Manipulating Large Language Model Predictions Through Data
Event Details
This talk will be a live presentation only; it will not be recorded.
Speaker: Alexander Wan, University of California, Berkeley
Conference Room Location: ISI-MDR #689. In-person attendance is limited to USC/ISI faculty, staff, and students; the talk is open to the public virtually via Zoom.
REMINDER:
If you do not have access to the 6th floor, please check in at the main reception desk on the 10th floor, and someone will escort you to the conference room before the start of the talk.
Meeting hosts will only admit guests they know to the Zoom meeting, so you are strongly encouraged to sign into Zoom with your USC account.
If you are an outside visitor, please send your full name, title, and workplace to (nlg-seminar-host(at)isi.edu) beforehand so we are aware of your attendance, and let us know whether you plan to attend in person or virtually.
For more information on the NL Seminar series and upcoming talks, please visit:
https://nlg.isi.edu/nl-seminar/
Large language models rely on large amounts of unmoderated data at every stage of the training and deployment pipeline. In this talk, I will show how these lax data requirements enable adversaries to manipulate both training and test data, opening the door to a myriad of possible attacks. First, at training time, I will show that adversaries can modify instruction-tuning datasets to systematically manipulate predictions across a range of tasks or induce degenerate outputs across hundreds of arbitrary tasks, using as few as 100 poison examples. At inference time, additional data is often used in retrieval- or tool-augmented models. Naturally, these models encounter information from a wide variety of sources of varying quality. Humans face this same range of sources but can judge trustworthiness based on factors such as the style of argumentation or the recency of information. We show not only that model predictions differ significantly from human credibility judgements, but also that gaps in this judgement create opportunities for adversaries to manipulate answers to user queries.
Speaker Bio
Alexander Wan is a third-year undergraduate at UC Berkeley majoring in Computer Science, Statistics, and Mathematics. He works closely with researchers in the Berkeley NLP Group and the MSU Heterogeneous Learning and Reasoning lab, focusing on improving the robustness and interpretability of large language models. He is also more broadly interested in the intersection of machine learning and cognitive science: using current ML models to better understand human cognition and building more robust models through cognitively inspired architectures and training.
Subscribe here to learn more about upcoming seminars: https://www.isi.edu/events/
Hosts: Jon May and Justin Cho