CNCF [Cloud Native Computing Foundation]

Training Foundation Model Workloads on Kubernetes at Scale with MCAD

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Explore how IBM Research built Vela, a cloud-native AI supercomputer, to train foundation models on Kubernetes at scale. Learn about the challenges of supporting multiple frameworks such as PyTorch, Ray, and Spark for diverse research teams. Discover how the Multi-Cluster App Dispatcher (MCAD) queues custom resources for large-scale AI training and how it interacts with the underlying Kubernetes scheduler. Gain insight into the implementation of gang priority, gang preemption, and fault tolerance for training jobs that span hundreds of GPUs and run for extended periods. This conference talk is for researchers and developers who work with foundation models and need to scale AI workloads on Kubernetes.
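
To make the ideas of gang priority and gang preemption more concrete, here is a minimal Python sketch of an all-or-nothing dispatcher. It is a toy model under stated assumptions, not MCAD's actual algorithm or API: the names `Job`, `GangDispatcher`, `submit`, and `finish` are hypothetical, GPUs are treated as a single pool, and a preempted job is simply put back in the queue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    gpus: int      # size of the gang: all GPUs are needed at the same time
    priority: int  # larger number means more important


class GangDispatcher:
    """Toy all-or-nothing dispatcher (illustrative only, not MCAD's real logic)."""

    def __init__(self, total_gpus: int) -> None:
        self.total_gpus = total_gpus
        self.running: List[Job] = []
        self.queue: List[Job] = []

    def free_gpus(self) -> int:
        return self.total_gpus - sum(job.gpus for job in self.running)

    def submit(self, job: Job) -> None:
        self.queue.append(job)
        self.dispatch()

    def finish(self, job: Job) -> None:
        self.running.remove(job)
        self.dispatch()

    def dispatch(self) -> None:
        # Walk a snapshot of the queue from highest to lowest priority.
        for job in sorted(self.queue, key=lambda j: -j.priority):
            victims = self._victims_for(job)
            if victims is None:
                continue  # the gang cannot fit as a whole, so it keeps waiting
            for victim in victims:
                # Gang preemption: the whole victim job is evicted and requeued.
                self.running.remove(victim)
                self.queue.append(victim)
            self.queue.remove(job)
            self.running.append(job)

    def _victims_for(self, job: Job) -> Optional[List[Job]]:
        """Return the running jobs to evict, or None if the job cannot fit."""
        free = self.free_gpus()
        victims: List[Job] = []
        # Evict the lowest-priority running jobs first, and only jobs
        # whose priority is strictly lower than the incoming job's.
        for candidate in sorted(self.running, key=lambda j: j.priority):
            if free >= job.gpus or candidate.priority >= job.priority:
                break
            victims.append(candidate)
            free += candidate.gpus
        return victims if free >= job.gpus else None


dispatcher = GangDispatcher(total_gpus=8)
small_a = Job("small-a", gpus=4, priority=1)
small_b = Job("small-b", gpus=4, priority=1)
big = Job("big", gpus=8, priority=5)

dispatcher.submit(small_a)
dispatcher.submit(small_b)
dispatcher.submit(big)      # evicts both small jobs to start the full gang
dispatcher.finish(big)      # the requeued small jobs start again
```

The key design choice in this sketch is that a job is never started on a subset of its GPUs: it either gets its full allocation, possibly by evicting entire lower-priority jobs, or it stays queued.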

Syllabus

Training Foundation Model Workloads on Kubernetes at Scale W... Abhishek Malvankar & Olivier Tardieu

Taught by

CNCF [Cloud Native Computing Foundation]
