Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA

CNCF [Cloud Native Computing Foundation] via YouTube

Classroom Contents

  1. Intro
  2. Deep Learning Applications
  3. AI/DL: Models, Frameworks, Hardware
  4. Trends: Big Data, Larger Models
  5. Sample Multi-GPU Node: DGX-1
  6. Distributed Training Applications Multi-GPU, Multi-node
  7. K8s Challenges & Outline
  8. K8s Orchestration Flow
  9. Sample PyTorch Job Launch
  10. Array Jobs and MPI Operator
  11. SRIOV CNI for K8s Multi-Rail
  12. Gang Scheduling Multi-Node Pods
  13. PodGroup Queue and Manager
  14. Demo
  15. Sample Job Real-Time Telemetry
  16. Sample BERT K8s Scaling
  17. Shared K8s Cluster for Multi-node
  18. Scheduler Dashboard
  19. Summary and Future Work
