Watch a 41-minute research talk from OpenAI's Yining Chen at the Simons Institute exploring how prover-verifier games can improve the legibility and verifiability of large language model (LLM) outputs. Discover a training algorithm, inspired by prover-verifier games, that aims to make LLM solutions to grade-school math problems clearer and easier to check. Learn how the algorithm trains small verifiers to predict solution correctness while simultaneously training "helpful" provers that produce accurate, convincing solutions and "sneaky" provers that craft incorrect solutions designed to fool the verifier. Examine how this training transfers to human verification: humans become more accurate at checking legitimate solutions and better at detecting deceptive ones. Understand the broader implications for AI alignment and the potential of legibility training with small verifiers to make large language models more interpretable to human users.
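In the underlying work, both provers are large language models trained with reinforcement learning against a much smaller LLM verifier on grade-school math; the sketch below is only a toy analogue of that checkability game, not the talk's actual implementation. It uses invented stand-ins (random feature vectors for "solutions", a logistic-regression verifier, and provers that best-respond by selecting high-scoring samples) to illustrate the alternating structure: each round, the verifier is refit on correct solutions from the helpful prover and incorrect-but-convincing ones from the sneaky prover.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy feature dimension standing in for a solution's text

def sample_solutions(n):
    """Sample toy (features, is_correct) pairs standing in for candidate solutions."""
    feats = rng.normal(size=(n, DIM))
    # Hidden ground truth: correctness depends on a direction unknown to the verifier.
    truth_dir = np.ones(DIM) / np.sqrt(DIM)
    labels = (feats @ truth_dir + 0.1 * rng.normal(size=n)) > 0
    return feats, labels.astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_verifier(feats, labels, steps=500, lr=0.1):
    """Fit a small logistic-regression verifier to predict solution correctness."""
    w = np.zeros(DIM)
    for _ in range(steps):
        p = sigmoid(feats @ w)
        w -= lr * feats.T @ (p - labels) / len(labels)
    return w

def best_response(w, want_correct, pool=2000, keep=200):
    """Prover best-responds: among solutions with the desired correctness,
    pick the ones the current verifier scores as most convincing."""
    feats, labels = sample_solutions(pool)
    cand = feats[labels == (1.0 if want_correct else 0.0)]
    top = np.argsort(sigmoid(cand @ w))[-keep:]
    return cand[top], np.full(keep, 1.0 if want_correct else 0.0)

# Alternating checkability training: verifier vs. helpful and sneaky provers.
feats, labels = sample_solutions(400)
w = train_verifier(feats, labels)
for rnd in range(5):
    helpful_x, helpful_y = best_response(w, want_correct=True)
    sneaky_x, sneaky_y = best_response(w, want_correct=False)
    # The verifier retrains on both provers' outputs with ground-truth labels,
    # learning to accept correct solutions and reject sneaky ones.
    feats = np.vstack([feats, helpful_x, sneaky_x])
    labels = np.concatenate([labels, helpful_y, sneaky_y])
    w = train_verifier(feats, labels)
    fooled = sigmoid(sneaky_x @ w).mean()
    print(f"round {rnd}: verifier score on sneaky solutions = {fooled:.2f}")
```

Running the loop shows the verifier's score on the sneaky prover's solutions dropping across rounds, the toy counterpart of the robustness the talk reports against deceptive provers.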
Syllabus
Prover-Verifier Games Improve Legibility of LLM Outputs
Taught by
Simons Institute