Why Did the Chicken Cross the Road? To Evaluate Your LLM

A USC ISI researcher and his collaborators propose separating the AI makers from the AI testers to create more robust systems.

by Julia Cohen

July 1, 2024

The rapid advancement of artificial intelligence, especially with the development of large language models like ChatGPT, raises an intriguing question: can AI handle genuinely unexpected situations?

For example, if an autonomous vehicle encounters a squirrel on the road, that might be anticipated and manageable by current systems. However, a chicken appearing unexpectedly in an urban setting presents a more complex challenge. A human driver would be surprised by the chicken, but would still likely make a reasonable decision, such as stopping the car or avoiding the chicken. What would the autonomous vehicle do? Researchers aren’t sure.

Mayank Kejriwal, Principal Scientist at USC Viterbi Information Sciences Institute (ISI), recently co-authored “Challenges, Evaluation, and Opportunities for Open World Learning,” which was published on June 24, 2024 in Nature Machine Intelligence. Kejriwal, who was joined by researchers from the University of Texas at Dallas and the University of Maryland for this four-year effort, proposed new, radical, and rigorous evaluation methods to the AI research community because, he said, “until we know that an AI model can handle genuinely unexpected events, we can’t fully trust it.”

Learning to Expect the Unexpected

Current AI models, including ChatGPT, rely on extensive training data to function. The idea is that with enough data, these models can anticipate and manage most real-world scenarios by generalizing — applying what they’ve learned in one place to new, unfamiliar settings.

Perhaps the autonomous vehicle would generalize the city-dwelling chicken as an animal and avoid it as it would avoid a squirrel. But perhaps this model has never, in all of its training, seen an image of a walking bird. The chicken could be interpreted as a significant, unforeseen alteration in the environment. Will the vehicle use general principles, like humans do, to adapt and respond to this alteration? That’s the key question for AI generalization, and it’s where OWL comes in.

Open world learning (OWL) is an approach to machine learning that deals with the presence of new, unknown classes by recognizing and adapting to them. These models operate under the assumption that they may encounter whole new categories of things that were not available during training.

Kejriwal and his co-authors propose that the AI community separate OWL into progressively more difficult versions: weak, semi-strong, and strong. A system with “weak OWL” can detect unexpected events. A “semi-strong OWL” system must also be able to characterize the unexpected event. A “strong OWL” system must be able to adapt to such events in real time.
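To make the three levels concrete, here is a minimal Python sketch. It is an illustration only and not the authors' method: the `ToyOpenWorldModel` class, the two-dimensional "prototype" points, and the novelty threshold are all hypothetical choices. In this toy, adaptation is handed a label for the new class, which is a large simplification of adapting in real time.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class ToyOpenWorldModel:
    # Hypothetical toy model: each known class is one 2-D prototype point.
    prototypes: dict = field(default_factory=dict)
    novelty_threshold: float = 1.0  # assumed value, chosen for illustration

    def nearest(self, x):
        # Return (label, distance) of the closest known prototype.
        label = min(self.prototypes, key=lambda k: dist(x, self.prototypes[k]))
        return label, dist(x, self.prototypes[label])

    def detect(self, x) -> bool:
        # Weak OWL: flag that the input is unlike anything known.
        return self.nearest(x)[1] > self.novelty_threshold

    def characterize(self, x) -> str:
        # Semi-strong OWL: describe the novelty relative to known classes.
        label, d = self.nearest(x)
        return f"unknown, closest to '{label}' at distance {d:.2f}"

    def adapt(self, x, new_label: str) -> None:
        # Strong OWL (simplified): update the model so the novelty is known.
        self.prototypes[new_label] = x


model = ToyOpenWorldModel({"squirrel": (0.0, 0.0), "pedestrian": (5.0, 5.0)})
chicken = (0.0, 2.0)

print(model.detect(chicken))        # weak: is this unexpected?
print(model.characterize(chicken))  # semi-strong: what is it like?
model.adapt(chicken, "chicken")     # strong: incorporate it
print(model.detect(chicken))        # no longer flagged as novel
```

Each method builds on the previous one, which mirrors the idea that the levels are progressively harder: a system cannot characterize or adapt to an event it has not first detected.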

Stronger OWL is crucial in domains that have significant human consequences, such as self-driving cars, cybersecurity, healthcare, AI-assisted education, space exploration, and climate modeling and simulation.

Evaluating OWL with the Red Team-Blue Team Approach

To accurately characterize OWL and the robustness of a system, the researchers suggest a radical shift in AI evaluation methodology. Kejriwal, who is also a Research Assistant Professor in the USC Viterbi School of Engineering Department of Industrial and Systems Engineering, summarized the method: “We are proposing that the people who are building the AI should not be evaluating the AI.” This involves a “Red Team-Blue Team” approach, where the AI developers (Blue Team) are distinct from the evaluators (Red Team). The Red Team would design and administer tests without the developers knowing what those tests would be, ensuring that the AI is assessed on its ability to handle genuinely unexpected situations.

Kejriwal elaborated on the critical issue of “benchmark overfitting,” when researchers, knowingly or unknowingly, design their systems to perform well on specific benchmarks. This happens because developers are familiar with these benchmarks and tailor their systems accordingly. “The principles you are incorporating, the strategies you are using, whether you know it or not, are designed for that benchmark,” he explained.

To combat this, Kejriwal advocates for a system similar to clinical trials in medicine, where rigorous, blind evaluations are standard. “We need to have unexpected things in a test benchmark,” he said. In such a system, Red Teams would test models without revealing the benchmarks in advance to the Blue Teams. While DARPA has implemented similar practices, Kejriwal argues they are not widespread or stringent enough.
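The separation of roles described above can be sketched in a few lines of Python. This is a hedged illustration, not a description of any real evaluation harness: `blue_team_model`, the `RedTeam` class, and the test scenes are hypothetical names invented for the example. The point it shows is that the developers submit a model and receive only an aggregate score, while the test cases stay with the evaluators.

```python
from typing import Callable


def blue_team_model(scene: str) -> str:
    # Hypothetical developer-built policy: brakes only for hazards
    # that appeared in its (assumed) training data.
    known_hazards = {"squirrel", "pedestrian"}
    return "brake" if scene in known_hazards else "continue"


class RedTeam:
    def __init__(self) -> None:
        # Held-out cases, never shown to the developers in advance.
        self._hidden_cases = [
            ("squirrel", "brake"),
            ("empty road", "continue"),
            ("chicken", "brake"),  # the genuinely unexpected event
        ]

    def evaluate(self, model: Callable[[str], str]) -> float:
        # Report only the aggregate accuracy, not the individual cases.
        correct = sum(
            model(scene) == expected for scene, expected in self._hidden_cases
        )
        return correct / len(self._hidden_cases)


score = RedTeam().evaluate(blue_team_model)
print(f"blind-evaluation accuracy: {score:.2f}")
```

Because the model was tuned only on hazards its developers anticipated, it fails the chicken case, which is the kind of gap a benchmark known in advance would be unlikely to expose.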

A Call to Action

“Our hope is that people will take our evaluation suggestions and start to incorporate them in their work,” Kejriwal explained. Specifically, he envisions the use of red team-blue team exercises and the introduction of new, unseen benchmarks.

“Conferences, reviewers, and funding agencies like the National Science Foundation (NSF) need to start imposing these requirements,” he said. The NSF, being a major funding agency, has a strong influence on the direction of research. Kejriwal suggests that just as researchers must include sections on broader impacts and data management in their proposals, they should also be required to include plans for strengthening the robustness of their studies. “It should be compulsory,” he insists.

So, why did the chicken cross the road? Because, unbeknownst to the Blue Team, the Red Team was using a chicken crossing an urban street as a genuinely unexpected event in a benchmark test of the Blue Team’s algorithm!

This article may feature some AI-assisted content for clarity, consistency, and to help explore complex scientific concepts with greater depth and creative range.