Overview
Explore Direct Preference Optimization (DPO), a cutting-edge technique in machine learning, through this comprehensive 43-minute video tutorial by Trelis Research. Learn how DPO differs from traditional fine-tuning methods and how it compares to RLHF. Dive into practical applications using datasets like UltraChat and Anthropic's Helpful and Harmless. Follow along with a detailed DPO notebook run-through, interpret evaluation results using Weights and Biases, and set up Runpod for a one-epoch training run. Gain access to valuable resources, including Google Slides, datasets, and scripts to enhance your understanding and implementation of DPO in advanced fine-tuning projects.
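To make the workflow concrete, the sketch below shows what a one-epoch DPO training run can look like using the Hugging Face TRL library. The model name, the ultrafeedback_binarized preference dataset, and the hyperparameters here are illustrative assumptions; the notebook, datasets, and scripts provided with the video may differ.

# Minimal DPO training sketch with Hugging Face TRL (illustrative, not the video's exact notebook).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "HuggingFaceH4/mistral-7b-sft-beta"  # assumed SFT starting checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", and "rejected" columns.
# Depending on your TRL version, the dataset may need reformatting into plain-text columns.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                        # strength of the implicit KL penalty toward the reference model
    num_train_epochs=1,              # a single epoch, as in the video's Runpod run
    per_device_train_batch_size=2,
    learning_rate=5e-7,
    report_to="wandb",               # log training curves to Weights and Biases
)

trainer = DPOTrainer(
    model=model,                     # the reference model is created automatically if not passed
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # older TRL releases use tokenizer= here instead
)
trainer.train()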
Syllabus
Direct Preference Optimisation
Video Overview
How does “normal” fine-tuning work?
How does DPO work?
DPO Datasets: UltraChat
DPO Datasets: Helpful and Harmless
DPO vs RLHF
Required datasets and SFT models
DPO Notebook Run through
DPO Evaluation Results
Weights and Biases Results Interpretation
Runpod Setup for 1 epoch Training Run
Resources
Taught by
Trelis Research