Overview
Learn about an innovative approach to Large Language Model training in this 24-minute technical presentation that introduces ORPO (Odds Ratio Preference Optimization), a "reference model-free" monolithic optimization algorithm. Explore the theoretical-physics perspective behind this new preference-aligned Supervised Fine-Tuning (SFT) method, examining parallels between regularization-term methodologies and Lagrange multipliers. Delve into how ORPO eliminates the need for a separate preference alignment phase, and compare its performance against Llama 2 and Mistral 7B models. Based on the paper "ORPO: Monolithic Preference Optimization without Reference Model," gain insights into this streamlined approach, which folds preference alignment directly into the training process.
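To make the "monolithic" idea concrete, the sketch below shows the general shape of an ORPO-style objective as described in the cited paper: the usual SFT loss on the preferred response plus a weighted odds-ratio term that favors chosen over rejected responses, with no separate reference model. Function and variable names (orpo_loss, logp_chosen, logp_rejected, nll_chosen, lam) are illustrative assumptions, not code from the presentation or paper.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen: torch.Tensor,
              logp_rejected: torch.Tensor,
              nll_chosen: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """Sketch of an ORPO-style objective.

    logp_chosen / logp_rejected: length-normalized log-probabilities of the
        preferred and rejected responses under the model being trained.
    nll_chosen: standard SFT negative log-likelihood of the preferred response.
    lam: weight on the odds-ratio term (hyperparameter).
    """
    # odds(y|x) = P(y|x) / (1 - P(y|x)); computed in log space for stability
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))

    # Odds-ratio penalty: push the odds of the chosen response above the rejected one
    ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # Monolithic objective: SFT loss plus the weighted odds-ratio penalty,
    # so preference alignment happens inside the same training step
    return nll_chosen + lam * ratio_term
```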
Syllabus
ORPO: NEW DPO Alignment and SFT Method for LLM
Taught by
Discover AI