Seminars and Events

ISI Natural Language Seminar

NL Seminar: Red Teaming Language Model Detectors with Language Models

Event Details

Speaker: Yihan Wang, UCLA

Conference Room Location: ISI-MDR #689. In-person attendance will be permitted for USC/ISI faculty, staff, and students only. The event is open to the public virtually via Zoom.

REMINDER:

If you do not have access to the 6th floor, please check in at the main reception desk on the 10th floor and someone will escort you to the conference room prior to the start of the talk.

Meeting hosts will only admit guests they know to the Zoom meeting, so you are strongly encouraged to sign in to Zoom with your USC account.

If you are an outside visitor, please send your full name, title, and workplace to nlg-seminar-host(at)isi.edu beforehand so we are aware of your attendance. Also, let us know whether you plan to attend in person or virtually.

For more information on the NL Seminar series and upcoming talks, please visit:

https://nlg.isi.edu/nl-seminar/

Hosts: Jon May and Justin Cho

The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent works have proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM’s output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Unlike previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems. This talk may also introduce some of our other recent works on trustworthy and ethical LLMs.
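To make the first attack strategy more concrete, below is a minimal Python sketch (not the speaker's implementation) of a greedy word-substitution attack: an auxiliary model proposes context-aware synonyms, and a candidate replacement is kept only if it lowers the detector's "LLM-generated" score. The `propose_synonyms` and `detector_score` callables are hypothetical stand-ins for an auxiliary LLM and a trained detector; the toy versions at the bottom exist only so the sketch runs end to end.

```python
from typing import Callable, List

def greedy_substitution_attack(
    text: str,
    propose_synonyms: Callable[[List[str], int], List[str]],  # auxiliary-LLM stand-in
    detector_score: Callable[[str], float],  # higher = more likely LLM-generated
    max_edits: int = 10,
) -> str:
    """Greedily replace words with synonyms that lower the detector score."""
    words = text.split()
    best_score = detector_score(text)
    edits = 0
    for i in range(len(words)):
        if edits >= max_edits:
            break
        original = words[i]
        for candidate in propose_synonyms(words, i):
            words[i] = candidate
            score = detector_score(" ".join(words))
            if score < best_score:  # keep the edit only if it fools the detector more
                best_score = score
                edits += 1
                break
            words[i] = original  # revert if no improvement
    return " ".join(words)

# Toy stand-ins for demonstration only; a real attack would query an
# auxiliary LLM for context-aware synonyms and a detector for scores.
TOY_SYNONYMS = {"large": ["big", "sizable"], "reveal": ["show", "indicate"]}

def toy_synonyms(words: List[str], i: int) -> List[str]:
    return TOY_SYNONYMS.get(words[i].lower(), [])

def toy_detector(text: str) -> float:
    # Pretend the detector keys on the word "large".
    return 0.9 if "large" in text.lower() else 0.3

if __name__ == "__main__":
    adv = greedy_substitution_attack("large models reveal patterns", toy_synonyms, toy_detector)
    print(adv)  # e.g. "big models reveal patterns"
```

The second strategy described in the abstract would replace the per-word loop with a search over instructional prompts fed to the generating LLM, scoring each prompt by the detector's response to the resulting generation.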

Speaker Bio

Yihan is a PhD candidate in the Department of Computer Science at UCLA, advised by Prof. Cho-Jui Hsieh. Her research interest lies in trustworthy and generalizable machine learning. She is a recipient of the 2023 UCLA-Amazon Fellowship. More details can be found at https://yihanwang617.github.io.

If the speaker approves recording of this NL Seminar talk, it will be posted on our USC/ISI YouTube page within 1-2 business days: https://www.youtube.com/user/USCISI.

Subscribe here to learn more about upcoming seminars: https://www.isi.edu/events/            