Overview
Learn about an innovative approach to Large Language Model training in this 24-minute technical presentation that introduces ORPO (Odds Ratio Preference Optimization), a "reference model-free" monolithic optimization algorithm. Explore the theoretical-physics perspective behind this new preference-aligned Supervised Fine-Tuning (SFT) method, examining parallels between regularization-term methodologies and Lagrange multipliers. Delve into how ORPO eliminates the need for a separate preference alignment phase, and compare its performance against Llama 2 and Mistral 7B models. Based on the paper "ORPO: Monolithic Preference Optimization without Reference Model," gain insights into this streamlined approach, which folds preference alignment directly into the training process.
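To make the "monolithic" idea concrete, the sketch below shows the general shape of an ORPO-style objective as described in the cited paper: the usual SFT loss on the preferred response plus a weighted odds-ratio term that favors chosen over rejected responses, with no separate reference model. Function and variable names (orpo_loss, logp_chosen, logp_rejected, nll_chosen, lam) are illustrative assumptions, not code from the presentation or paper.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen: torch.Tensor,
              logp_rejected: torch.Tensor,
              nll_chosen: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """Sketch of an ORPO-style objective.

    logp_chosen / logp_rejected: length-normalized log-probabilities of the
        preferred and rejected responses under the model being trained.
    nll_chosen: standard SFT negative log-likelihood of the preferred response.
    lam: weight on the odds-ratio term (hyperparameter).
    """
    # odds(y|x) = P(y|x) / (1 - P(y|x)); computed in log space for stability
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))

    # Odds-ratio penalty: push the odds of the chosen response above the rejected one
    ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # Monolithic objective: SFT loss plus the weighted odds-ratio penalty,
    # so preference alignment happens inside the same training step
    return nll_chosen + lam * ratio_term
```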
Syllabus
ORPO: NEW DPO Alignment and SFT Method for LLM
Taught by
Discover AI