Distillation of Transformer Models - Tutorial and Code Walk-through

Trelis Research via YouTube

Overview

Dive into an extensive 1-hour 21-minute video tutorial on the distillation of transformer models. Explore distillation techniques, including layer and width pruning, applied to models like Whisper, Flux, and Minitron. Learn how to initialize student models, compare pre-training with distillation, and understand the difference between cross-entropy loss and KL-divergence. Follow along with a detailed code walk-through covering the pruning, distillation, and instruction fine-tuning of a SmolLM 135M model down to a 99M version. Gain insights into multi-GPU setups, performance monitoring with TensorBoard, and dataset selection for instruction fine-tuning. Evaluate distillation performance through a series of model comparisons, and pick up practical tips on pruning approaches, learning rates, and batch sizes. Access additional resources, including slides, research papers, and datasets, to deepen your understanding of transformer model distillation.
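To make the loss comparison concrete, here is a minimal PyTorch sketch (not code from the video; the temperature value is an illustrative assumption) of the two objectives discussed: hard-label cross-entropy versus KL-divergence against the teacher's softened output distribution.

```python
# Minimal sketch of the two loss styles compared in the tutorial:
# hard-label cross-entropy versus KL-divergence against a teacher.
# The temperature of 2.0 is an illustrative assumption.
import torch.nn.functional as F

def hard_label_loss(student_logits, labels):
    # Standard next-token cross-entropy against the ground-truth token ids.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL-divergence between the teacher's and student's output distributions,
    # softened by a temperature so low-probability tokens still carry signal.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2
```

A weighted blend of the two terms is also common in practice; the tutorial compares how each behaves when training the pruned student.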

Syllabus

AI model distillation: Whisper, Flux, Minitron, gpt-4o-mini?
Video Overview - Distillation Tutorial and Code Walk-through
Distillation Examples: Diffusion - Flux Schnell / Dev; Transcription - Distil-Whisper; LLMs - Nvidia Minitron
How distillation works
Student model initialization
Layer / depth pruning
Width pruning
Pre-training versus distillation
Cross-entropy loss vs KL-divergence
Instruction fine-tuning
Distilling SmolLM 135M to a 99M model (see the pruning sketch after this syllabus)
Code walk-through setup
Pruning Notebook
Layer Pruning
Width Pruning
Why pruning works
Distillation Script - Multi-GPU Setup
Distillation Script Walk-through
Distillation Configuration File Walk-through
Distillation Startup and Performance Monitoring with TensorBoard
Instruction fine-tuning and dataset selection
Instruction FT Startup and Performance Monitoring with TensorBoard
Running inference to evaluate distillation performance
Teacher model performance (base SmolLM 135M)
SmolLM Instruct model performance
Raw pruned model performance (layer-pruned 99M)
Width + layer pruning performance (raw 99M)
Distilled model performance before instruction tuning (99M)
Instruction tuning performance evaluation
SmolLM 135M Instruct performance
Instruction-tuned distilled model performance (99M model)
Final Tips: best pruning approach, learning rate, batch size, and model size effects
Video Resources
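
For the syllabus items on layer/depth pruning and on distilling SmolLM 135M to a 99M model, the following is a minimal, hedged sketch of how a student might be initialized by depth-pruning the teacher. The model ID, the layers dropped, and the output path are illustrative assumptions rather than the tutorial's exact choices.

```python
# Hedged sketch: initialize a smaller student by depth-pruning a SmolLM-style
# teacher before distillation. The model ID, which layers are dropped, and the
# output path are illustrative assumptions, not the tutorial's exact settings.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")

layers = model.model.layers  # the stack of transformer decoder blocks
drop = set(range(10, 20))    # drop a run of middle layers (assumed indices)
kept = [layer for i, layer in enumerate(layers) if i not in drop]

# Re-number the surviving layers so KV-cache indexing stays consistent.
for new_idx, layer in enumerate(kept):
    layer.self_attn.layer_idx = new_idx

model.model.layers = nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)

model.save_pretrained("smollm-99M-student-init")  # student checkpoint for distillation
```

Width pruning instead shrinks hidden and MLP dimensions within each layer; either way, the pruned checkpoint serves only as the student's starting point, and the subsequent distillation run against the full 135M teacher aims to recover the quality lost to pruning.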

Taught by

Trelis Research
