Overview
Explore language model interpretability research with Python in this 40-minute conference talk from Conf42 Python 2024. The talk introduces mechanistic interpretability and its core toolkit, causal interventions, then walks through examples of interpretability research and the libraries and packages that support it. It covers the nnsight architecture, how to perform an intervention on a language model, the anatomy of an intervention, information flow within models, and techniques for analyzing activation vectors. It closes with resources for further study.
Syllabus
intro
preamble
intro to mechanistic interpretability
mech interp
mech interp toolkit: causal interventions
example mech interp research
interpretability libraries and packages
nnsight architecture
example intervention
anatomy of an intervention
model internal i/o are nodes on the intervention graph
information flow
average-out information vectors
getting the average activation vector
adding the average to one-shot activation vector
resources
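The last syllabus items before "resources" (getting the average activation vector, then adding it to another activation vector) can be illustrated with a toy sketch. This is not code from the talk and does not use the nnsight API: the tiny two-layer network, its weights, and every function name below are hypothetical, chosen only to show the general shape of such an intervention in plain Python.

```python
# Toy illustration only: a hand-written two-layer network stands in for a
# language model. All names and weights here are hypothetical.

W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
W_OUT = [1.0, -1.0, 0.5]                     # linear readout weights


def hidden(x, w_in):
    """First layer: one ReLU unit per row of w_in (the 'activation vector')."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def readout(h, w_out):
    """Second layer: linear readout of the hidden activation vector."""
    return sum(w * hi for w, hi in zip(w_out, h))


def forward(x, w_in, w_out, intervene=None):
    """Run the model, optionally rewriting the hidden activations mid-pass."""
    h = hidden(x, w_in)
    if intervene is not None:
        h = intervene(h)  # the causal intervention happens here
    return readout(h, w_out)


def average_activation(inputs, w_in):
    """Element-wise mean of the hidden activations over several inputs."""
    acts = [hidden(x, w_in) for x in inputs]
    return [sum(column) / len(acts) for column in zip(*acts)]


# Step 1: collect the average activation vector over some reference inputs.
reference_inputs = [[1.0, 0.0], [3.0, 0.0]]
avg = average_activation(reference_inputs, W_IN)


# Step 2: add that average to the activations of a different input.
def add_average(h):
    return [hi + ai for hi, ai in zip(h, avg)]


target_input = [0.0, 1.0]
baseline = forward(target_input, W_IN, W_OUT)
patched = forward(target_input, W_IN, W_OUT, intervene=add_average)

print("average activation:", avg)
print("baseline output:", baseline)
print("patched output:", patched)
```

Comparing `baseline` with `patched` shows how much the output depends on the information carried by the added vector. With a real model, the same pattern would read and write activations at a chosen layer through a library's hooks instead of an explicit `intervene` callback.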
Taught by
Conf42