Take a Peek Under the Hood of Cloud-Native AI at Scale
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Explore a keynote on the challenges of running large-scale AI workloads on Kubernetes and the approaches used to address them. Learn how to manage clusters with thousands of interconnected GPUs, handle hardware failures, and optimize GPU scheduling for performance. Discover why the presenters consider Kubernetes and CNCF projects well suited to managing complex generative AI infrastructure, with practical insights on implementing monitoring and observability tools. The talk draws on real-world experience with cloud-native AI deployments to show how to build resilient, scalable systems for training foundation models.
Syllabus
Keynote: Take a Peek Under the Hood of Cloud-Native AI at Scale - Chen Goldberg & Peter Salanki
Taught by
CNCF [Cloud Native Computing Foundation]