Formal Backdoor Detection Games and Deceptive Alignment in Machine Learning

Simons Institute via YouTube

Overview

Watch a 43-minute lecture from the Simons Institute in which Jacob Hilton of the Alignment Research Center examines backdoors in machine learning models and their relationship to deceptive alignment. A backdoor lets an adversary modify a model so that it responds differently to specific trigger inputs while behaving normally otherwise. The lecture formalizes defendability against backdoors as an attacker-defender game and presents theoretical findings on the strategies, involving learning and obfuscation, that emerge from this seemingly simple game structure. It then considers how these findings might apply to deceptive alignment, a scenario in which AI systems intentionally demonstrate good behavior during training only to act uncooperatively later, and why backdoors serve as an important analogy for studying broader AI alignment challenges.
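
As a rough illustration of the backdoor notion described above, the following Python sketch builds a toy classifier and a modified copy that differs from it on exactly one trigger input. It is not the formal game from the lecture; the names `original_model`, `make_backdoored`, `disagreement_rate`, and the trigger value are all hypothetical and chosen only for this example.

```python
from typing import Callable, Iterable

Model = Callable[[int], int]


def original_model(x: int) -> int:
    """Toy classifier (an assumption for illustration): label 1 if x >= 50, else 0."""
    return 1 if x >= 50 else 0


def make_backdoored(model: Model, trigger: int) -> Model:
    """Return a copy of `model` that flips its output on `trigger` only."""
    def backdoored(x: int) -> int:
        if x == trigger:
            return 1 - model(x)  # different behavior on the trigger input
        return model(x)          # normal behavior everywhere else
    return backdoored


def disagreement_rate(f: Model, g: Model, inputs: Iterable[int]) -> float:
    """Fraction of the given inputs on which the two models disagree."""
    xs = list(inputs)
    return sum(f(x) != g(x) for x in xs) / len(xs)


TRIGGER = 37  # hypothetical trigger chosen by the attacker
backdoored_model = make_backdoored(original_model, TRIGGER)

if __name__ == "__main__":
    # The two models differ on one input out of 100.
    print(disagreement_rate(original_model, backdoored_model, range(100)))
```

In this sketch the two models agree on every input except the trigger, so a defender who only compares outputs on inputs that happen not to include the trigger would see no difference between them.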

Syllabus

Formal backdoor detection games and deceptive alignment

Taught by

Simons Institute
