Artificial Intelligence Seminar

Evaluating Sparse Autoencoders with Board Game Models

Event Details

Abstract:

Sparse Autoencoders (SAEs) have recently become one of the most popular approaches in interpretability, and a flurry of new SAE variants has been proposed as a result. However, these variants are hard to evaluate because natural language offers no underlying ground truth from which to build objective interpretability metrics. We examine the setting of board games, using OthelloGPT and ChessGPT, and create two supervised metrics: “coverage” to assess individual feature quality and “board reconstruction” to measure overall state capture. Additionally, we propose a new SAE training approach called “p-annealing”. Our metrics reveal improvements that were hidden by existing proxy metrics, and the p-annealing approach performs best on our metrics. While SAEs achieve high performance on board reconstruction (F1 scores of 0.85 and 0.95 on Chess and Othello, respectively), they don’t match the performance of linear probes, suggesting that current techniques may not capture all of a model’s board state information.

Papers:

Intro to Sparse Autoencoders: What are SAEs, how do they work, and what next steps should the field take? The content is similar to this blog post: https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html

Board Game Models: Covers the paper https://arxiv.org/abs/2408.00113 and the accompanying blog post https://adamkarvonen.github.io/machine_learning/2024/06/12/sae-board-game-eval.html

February 21, 2025

Join Zoom Meeting
https://usc.zoom.us/j/94409584905?pwd=Sm5LVkd0bndUdEluM3piK0NWTUQrUT09

Meeting ID: 944 0958 4905
Passcode: 822247

Host: Abel Salinas
POC: Justina Gilleland

Speaker Bio

I am mostly interested in machine learning and software engineering. Lately, much of my focus has been on large language models, both in using them as a tool in combination with formal methods and in understanding and interpreting them.

Outside of work, I race dirt bikes. I race A class in hard enduro, and B class in regular enduro and hare scrambles.

This AI Seminar presentation will be recorded and posted on our USC/ISI YouTube page within 1-2 business days: https://www.youtube.com/user/USCISI

Subscribe here to learn more about upcoming seminars: https://www.isi.edu/events/