Formal Backdoor Detection Games and Deceptive Alignment in Machine Learning

Overview

Watch a 43-minute lecture from the Simons Institute in which Jacob Hilton of the Alignment Research Center examines backdoors in machine learning models and their relationship to deceptive alignment. A backdoor lets an adversary modify a model so that it responds differently to specific trigger inputs while behaving normally otherwise. The lecture formalizes defendability against backdoors as a game between an attacker and a defender, and shows how this seemingly simple game gives rise to complex strategies involving learning and obfuscation. It presents theoretical findings about these strategies and considers how they might apply to deceptive alignment, a scenario in which an AI system intentionally behaves well during training only to act uncooperatively later. Backdoors thus serve as an analogy for studying broader AI alignment challenges.
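To make the notion of a backdoor concrete, here is a minimal Python sketch. It is a toy illustration only, not the formal game from the lecture: the parity "model", the integer input domain, and all names (`original_model`, `make_backdoored`, `disagreement_rate`) are hypothetical choices made for this example.

```python
import random


def original_model(x: int) -> int:
    """Hypothetical 'clean' classifier: returns the parity of the input."""
    return x % 2


def make_backdoored(model, trigger: int):
    """Return a copy of `model` whose output is flipped only on `trigger`."""
    def backdoored(x: int) -> int:
        if x == trigger:
            return 1 - model(x)  # deviate on the secret trigger input
        return model(x)          # behave normally everywhere else
    return backdoored


def disagreement_rate(f, g, samples) -> float:
    """Fraction of sampled inputs on which the two models disagree."""
    return sum(f(x) != g(x) for x in samples) / len(samples)


rng = random.Random(0)
DOMAIN = 10**6  # assumed input space: integers in [0, DOMAIN)

# The attacker picks a secret trigger and plants the backdoor.
trigger = rng.randrange(DOMAIN)
backdoored_model = make_backdoored(original_model, trigger)

# Random black-box sampling rarely hits the trigger, so the two models
# look almost identical to anyone who only tests on typical inputs.
samples = [rng.randrange(DOMAIN) for _ in range(1000)]
rate = disagreement_rate(original_model, backdoored_model, samples)

print("disagreement on random samples:", rate)
print("disagreement on the trigger:",
      original_model(trigger) != backdoored_model(trigger))
```

The sketch shows why testing on typical inputs is a weak defense in this toy setting: the two models differ on a single point out of a million, so a defender needs some other source of leverage to tell them apart.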

Syllabus

Formal backdoor detection games and deceptive alignment

Taught by

Simons Institute


Related Courses

AI Safety – Full Course from Safe.AI on Machine Learning & Ethics

Safety Alignment in Large Language Models - Making Safety More Than Token Deep

Backdoor Attacks and Defenses in AI Systems - Seminar Series

Prover-Verifier Games for Improving LLM Output Legibility

Understanding and Steering Generative AI Systems

The Security of Large Language Models

