Scalably Understanding AI with AI - Using AI Systems for Model Behavior Analysis

Overview

Watch a 46-minute lecture by UC Berkeley professor Jacob Steinhardt at the Simons Institute on how AI can be used to understand and analyze other AI systems. Learn about behavior elicitation techniques in which investigator agents, trained with reinforcement learning and supervised fine-tuning, automatically find prompts that elicit specific model behaviors. Discover improved neuron description methods that use 8B-parameter open-weight models to generate high-quality natural language explanations of neural network activations. See these tools applied through the Monitor observability interface to diagnose puzzling model behaviors, including why language models make certain numerical comparison errors. Gain insight into the complex pipeline that leads from training data to learned representations to observed behaviors, with a focus on using AI tools to better understand and steer these systems.