Overview
Syllabus
AI Model Distillation: Whisper, Flux, Minitron, gpt-4o-mini?
Video Overview - Distillation Tutorial and Code Walk-through
Distillation Examples: Diffusion (Flux Schnell / Dev), Transcription (Distil-Whisper), LLMs (Nvidia Minitron)
How distillation works
Student model initialization
Layer / depth pruning (see the pruning sketch after the syllabus)
Width pruning
Pre-training versus distillation
Cross-entropy loss vs KL-divergence (see the loss sketch after the syllabus)
Instruction fine-tuning
Distilling SmolLM 135M to a 99M model
Code walk-through setup
Pruning Notebook
Layer Pruning
Width Pruning
Why does pruning work?
Distillation Script - Multi-GPU Setup
Distillation Script Walk-through
Distillation Configuration File Walk-through
Distillation Startup and Performance Monitoring with TensorBoard
Instruction fine-tuning and dataset selection
Instruction FT Startup and Performance Monitoring with TensorBoard
Running inference to evaluate distillation performance
Teacher model performance: base SmolLM 135M
SmolLM Instruct model performance
Raw pruned model performance: layer-pruned 99M
Width + layer pruning performance: raw 99M
Distilled model performance before instruction tuning: 99M
Instruction tuning performance evaluation
SmolLM 135M Instruct performance
Instruction-tuned distilled model performance: 99M model
Final Tips: best pruning approach, learning rate, batch size, and model size effects
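As a rough illustration of the layer / depth pruning chapter above, the sketch below drops a contiguous block of decoder layers from a Llama-style checkpoint to initialize a smaller student. The model id HuggingFaceTB/SmolLM-135M, the dropped layer indices, and the output path are assumptions for illustration, not the exact settings from the video's pruning notebook.

```python
# Minimal layer (depth) pruning sketch for a Llama-style causal LM, assuming
# the Hugging Face id "HuggingFaceTB/SmolLM-135M" and the usual
# `model.model.layers` ModuleList layout; adjust names for other models.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM-135M", torch_dtype=torch.bfloat16
)

# Drop a contiguous block of decoder layers (indices 20-27 here, purely
# illustrative), keeping the earliest and final layers intact.
keep = [i for i in range(model.config.num_hidden_layers) if not (20 <= i <= 27)]
model.model.layers = torch.nn.ModuleList([model.model.layers[i] for i in keep])
model.config.num_hidden_layers = len(keep)

# Save the pruned checkpoint; reloading from this directory gives a clean,
# renumbered student model to use as the distillation starting point.
model.save_pretrained("smollm-depth-pruned")
```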
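For the cross-entropy vs KL-divergence chapter, here is a generic, Hinton-style distillation objective in PyTorch: the student is trained to match the teacher's temperature-softened token distribution (KL term) while optionally keeping an ordinary next-token cross-entropy term. The temperature, mixing weight alpha, and function name are illustrative assumptions, not the exact loss used in the video's distillation script.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Flatten (batch, seq, vocab) -> (tokens, vocab); labels are assumed
    # already aligned with the logits (i.e. shifted for next-token prediction).
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)
    t = teacher_logits.view(-1, vocab)

    # Soft targets: KL-divergence between temperature-softened teacher and
    # student distributions, scaled by T^2 as in Hinton et al. (2015).
    kl = F.kl_div(
        F.log_softmax(s / T, dim=-1),
        F.softmax(t / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard targets: ordinary next-token cross-entropy against the labels.
    ce = F.cross_entropy(s, labels.view(-1))
    return alpha * kl + (1 - alpha) * ce
```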
Video Resources
Taught by
Trelis Research