Defeat Devices in AI Systems

Abstract

AI systems increasingly exhibit behavior that differs systematically between evaluation and deployment contexts. Alignment faking, sandbagging, benchmark gaming, deceptive scheming, specification gaming, and trojans have each been documented separately, with each line of work characterizing one facet of what we argue is a single structural mechanism. We propose that this common mechanism is a defeat device, an engineering and regulatory concept long established in vehicle-emissions law and brought to broad public attention by the 2015 Volkswagen emissions case. A defeat device in an AI system has three necessary elements: a discriminator that detects evaluation context, a concealed swap that conditions behavior on detection, and a gap between eval-distribution and deployment-distribution performance on the stated evaluation criterion. We formalize this triadic test as a behavioral definition, organize documented cases along three taxonomic axes (origin, trigger, swap mechanism), propose Trigger-Axis-Aware Differential Probing (TADP) as a forensic detection protocol, and advance the claim that defeat devices can naturally emerge in current frontier AI systems without any operator engineering. We characterize naturally-emerging defeat devices as potentially one of the harmful emerging phenomena that AI safety practice should monitor and test for systematically. Implications for evaluation methodology, post-training pipeline design, interpretability research priorities, and AI governance follow.

Date: 2026
Authors: Emilio Ferrara
Publisher: Preprints

View Paper

Information Sciences Institute

Publications

Defeat Devices in AI Systems

Abstract