Overview
Syllabus
Multi-GPU Distributed Training
Video Overview
Choosing a GPU setup
Understanding VRAM requirements in detail
Understanding Optimisation and Gradient Descent
How does the Adam optimiser work?
How the Adam optimiser affects VRAM requirements
Effect of activations, model context and batch size on VRAM
Tip for GPU setup: start with a small batch size
Reducing VRAM with LoRA and quantisation
Quality trade-offs with quantisation and LoRA
Choosing between MP, DDP and FSDP
Distributed Data Parallel
Model Parallel and Fully Sharded Data Parallel (FSDP)
Trade-offs with DDP and FSDP
How does DeepSpeed compare to FSDP?
Using FSDP and DeepSpeed with Accelerate
Code examples for MP, DDP and FSDP
Using SSH with rented GPUs (Runpod)
Installation
(Slight detour) Setting a username and email for GitHub
Basic Model Parallel (MP) fine-tuning script
Fine-tuning script with Distributed Data Parallel (DDP)
Fine-tuning script with Fully Sharded Data Parallel (FSDP)
Running ‘accelerate config’ for FSDP
Saving a model after FSDP fine-tuning
Quick demo of a complete FSDP LoRA training script
Quick demo of an inference script after training
Wrap up
Taught by
Trelis Research