Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Learn how to run AI model training on Kubernetes in this conference talk, which addresses the complexity of distributed training and LLM fine-tuning. The talk introduces the new Kubeflow TrainJob API and its integration with Kubernetes JobSet, designed to simplify the handling of large training datasets and distributed strategies such as data and model parallelism. It covers how to configure failure handling, success criteria, and gang-scheduling for large-scale distributed training while maintaining fault tolerance and elasticity. It also explains the reusable and extendable training runtimes in TrainJob and how they give data scientists a more streamlined Python experience, making Kubernetes more accessible for ML development and enabling rapid iteration.
Syllabus
Democratizing AI Model Training on Kubernetes with Kubeflow Train... Andrey Velichkevich & Yuki Iwai
Taught by
CNCF [Cloud Native Computing Foundation]