Overview
Watch a 43-minute lecture from the Simons Institute in which Jacob Hilton of the Alignment Research Center examines backdoors in machine learning models and their relationship to deceptive alignment. A backdoor lets an adversary modify a model so that it responds differently to specific trigger inputs while behaving normally otherwise. Follow a formal analysis of defendability against backdoors, framed as a game between an attacker and a defender. See how this seemingly simple game gives rise to complex strategies involving learning and obfuscation, and review the theoretical findings about those strategies. Consider how the results might apply to deceptive alignment, a scenario in which an AI system intentionally behaves well during training only to act uncooperatively later, and why backdoors serve as a useful analogy for studying broader AI alignment challenges.
Syllabus
Formal backdoor detection games and deceptive alignment
Taught by
Simons Institute