Class Central Classrooms beta
YouTube videos curated by Class Central.
Classroom Contents
Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA
- 1 Intro
- 2 Deep Learning Applications
- 3 AI/DL: Models, Frameworks, Hardware
- 4 Trends: Big Data, Larger Models
- 5 Sample Multi-GPU Node: DGX-1
- 6 Distributed Training Applications Multi-GPU, Multi-node
- 7 K8s Challenges & Outline
- 8 K8s Orchestration Flow
- 9 Sample PyTorch Job Launch
- 10 Array Jobs and MPI Operator
- 11 SR-IOV CNI for K8s Multi-Rail
- 12 Gang Scheduling Multi-Node Pods
- 13 PodGroup Queue and Manager
- 14 Demo
- 15 Sample Job Real-Time Telemetry
- 16 Sample BERT K8s Scaling
- 17 Shared K8s Cluster for Multi-node
- 18 Scheduler Dashboard
- 19 Summary and Future Work