Multi-GPU Fine-tuning with DDP and FSDP

Trelis Research via YouTube

Distributed Data Parallel (13 of 29)

Classroom Contents

  1. Multi-GPU Distributed Training
  2. Video Overview
  3. Choosing a GPU setup
  4. Understanding VRAM requirements in detail
  5. Understanding Optimisation and Gradient Descent
  6. How does the Adam optimiser work?
  7. How the Adam optimiser affects VRAM requirements (see the VRAM estimate sketch after this list)
  8. Effect of activations, model context and batch size on VRAM
  9. Tip for GPU setup - start with a small batch size
  10. Reducing VRAM with LoRA and quantisation
  11. Quality trade-offs with quantisation and LoRA
  12. Choosing between MP, DDP or FSDP
  13. Distributed Data Parallel
  14. Model Parallel and Fully Sharded Data Parallel (FSDP)
  15. Trade-offs with DDP and FSDP
  16. How does DeepSpeed compare to FSDP?
  17. Using FSDP and DeepSpeed with Accelerate
  18. Code examples for MP, DDP and FSDP
  19. Using SSH with rented GPUs (Runpod)
  20. Installation
  21. Slight detour: setting a username and email for GitHub
  22. Basic Model Parallel (MP) fine-tuning script
  23. Fine-tuning script with Distributed Data Parallel (DDP) (see the DDP sketch after this list)
  24. Fine-tuning script with Fully Sharded Data Parallel (FSDP)
  25. Running ‘accelerate config’ for FSDP
  26. Saving a model after FSDP fine-tuning (see the FSDP saving sketch after this list)
  27. Quick demo of a complete FSDP LoRA training script
  28. Quick demo of an inference script after training
  29. Wrap up
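
As a rough companion to chapters 7 and 8: full fine-tuning with the Adam optimiser in mixed precision needs on the order of 16 bytes of VRAM per parameter (weights, gradients, an fp32 master copy, and the two Adam moments), before counting activations. The sketch below just does that arithmetic; the 7B parameter count is an assumed example, not a figure from the video.

```python
# Rough VRAM estimate for full fine-tuning with Adam in mixed precision.
# Rule-of-thumb sketch, not an exact figure: the 7B model size is assumed,
# and activation memory (batch size and context dependent) is excluded.
params = 7e9  # assumed model size: 7B parameters

bytes_per_param = (
    2     # bf16/fp16 weights
    + 2   # bf16/fp16 gradients
    + 4   # fp32 master copy of the weights
    + 4   # Adam first moment (m), fp32
    + 4   # Adam second moment (v), fp32
)         # = 16 bytes per parameter

total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB before activations")  # roughly 112 GB for a 7B model
```

This is also why chapters 10 and 11 help: LoRA confines gradients and optimiser states to a small adapter, and quantisation shrinks the frozen base weights.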
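
For chapter 23, this is roughly the shape of a Distributed Data Parallel fine-tuning script in plain PyTorch. It is a sketch rather than the script from the video: `build_model()` and `build_dataset()` are hypothetical stand-ins for the real model and data loading, and it assumes a launch such as `torchrun --nproc_per_node=<num_gpus> ddp_train.py`.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)          # hypothetical helper
    model = DDP(model, device_ids=[local_rank])   # every GPU keeps a full replica

    dataset = build_dataset()                     # hypothetical helper
    sampler = DistributedSampler(dataset)         # each rank sees a different shard
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for epoch in range(1):
        sampler.set_epoch(epoch)                  # reshuffle the shards each epoch
        for batch in loader:
            batch = {k: v.to(local_rank) for k, v in batch.items()}
            optimizer.zero_grad()
            loss = model(**batch).loss            # assumes an HF-style model output
            loss.backward()                       # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```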
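
For chapters 24 to 26, the sketch below shows the two places where FSDP differs from DDP: wrapping the model with PyTorch's FullyShardedDataParallel so parameters, gradients and optimiser states are sharded across GPUs, and gathering a full state dict onto rank 0 before saving. It follows the standard PyTorch FSDP pattern rather than the exact code from the video; `build_model()` is again a hypothetical stand-in, and in practice you would add an auto-wrap policy and typically launch via `accelerate launch` after running `accelerate config` (chapter 25).

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)

dist.init_process_group(backend="nccl")          # launched with torchrun, as in the DDP sketch
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model()                            # hypothetical helper, as in the DDP sketch

# Unlike DDP, FSDP shards parameters, gradients and optimiser states across ranks.
# In practice you would also pass an auto_wrap_policy so each transformer block
# becomes its own FSDP unit; it is omitted here to keep the sketch short.
model = FSDP(model, device_id=local_rank)

# ... training loop as in the DDP sketch ...

# Saving: gather the sharded parameters into a single full state dict on rank 0.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    cpu_state = model.state_dict()

if dist.get_rank() == 0:
    torch.save(cpu_state, "fsdp_finetuned.pt")   # hypothetical output path

dist.destroy_process_group()
```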
