Scalable Extraction of Training Data from Production Language Models
Overview
Watch a technical lecture from Google DeepMind researcher Nicholas Carlini exploring methods to extract training data from production language models. Learn about two novel attacks that successfully extract megabytes of training data from ChatGPT, despite its alignment training designed to prevent such extraction. Discover how the first attack exploits repetitive word patterns to cause model divergence and reveal training data, while the second attack leverages fine-tuning APIs to bypass safety measures. Gain insights into the implications of these findings for alignment strategies and privacy-preserving machine learning, with specific focus on how production models handle training data reproduction and the effectiveness of current safety mechanisms. Explore the broader context of language model security, alignment challenges, and the balance between model capabilities and data privacy.
Syllabus
Scalable Extraction of Training Data from (Production) Language Models
Taught by
Simons Institute