
CNCF [Cloud Native Computing Foundation]

Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Learn how to run AI model training on Kubernetes in this conference talk, which addresses the complexities of distributed training and LLM fine-tuning. Explore the new Kubeflow TrainJob API and its integration with Kubernetes JobSet, designed to simplify handling large training datasets and distributed strategies such as data parallelism and model parallelism. Discover how to configure failure handling, success criteria, and gang-scheduling for large-scale distributed training while maintaining fault tolerance and elasticity. Gain insights into reusable and extendable training runtimes within TrainJob, and understand how they give data scientists a more streamlined Python experience. See how these features make Kubernetes more accessible for ML development and enable rapid iteration on models.
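To make the configuration elements named above more concrete, here is a minimal Python sketch that builds two manifests as plain dictionaries: a TrainJob that points at a reusable runtime, and a JobSet carrying a failure policy and a success policy. This is an illustration, not material from the talk. The field names (`runtimeRef`, `trainer.numNodes`, `failurePolicy.maxRestarts`, `successPolicy.operator`) and the API versions are assumptions based on the public Kubeflow Trainer and JobSet APIs and should be checked against the versions installed in your cluster. The helper names, the runtime name, and the image are hypothetical. Gang-scheduling is not shown.

```python
import json


def build_trainjob(name, runtime_name, num_nodes):
    """Return a TrainJob manifest that references a reusable training runtime.

    Assumption: apiVersion and field names follow the Kubeflow Trainer
    v1alpha1 API; verify against your installed CRDs.
    """
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # The runtime holds the reusable distributed-training template.
            "runtimeRef": {"name": runtime_name},
            # The data scientist only overrides what differs per run.
            "trainer": {"numNodes": num_nodes},
        },
    }


def build_jobset(name, image, num_nodes, max_restarts):
    """Return a JobSet manifest with failure handling and success criteria.

    Assumption: apiVersion and field names follow the JobSet v1alpha2 API.
    """
    pod_spec = {
        "restartPolicy": "Never",
        "containers": [{"name": "trainer", "image": image}],
    }
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            # Failure handling: recreate the whole set a bounded number of times.
            "failurePolicy": {"maxRestarts": max_restarts},
            # Success criteria: every replica of the listed jobs must complete.
            "successPolicy": {
                "operator": "All",
                "targetReplicatedJobs": ["node"],
            },
            "replicatedJobs": [
                {
                    "name": "node",
                    "replicas": 1,
                    "template": {
                        "spec": {
                            # One pod per training node, all required to finish.
                            "parallelism": num_nodes,
                            "completions": num_nodes,
                            "backoffLimit": 0,
                            "template": {"spec": pod_spec},
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names and image, used only for illustration.
    print(json.dumps(build_trainjob("demo", "torch-distributed", 4), indent=2))
    print(json.dumps(build_jobset("demo", "example.com/train:latest", 4, 3), indent=2))
```

The resulting JSON could be saved to a file and submitted with `kubectl apply -f`, assuming the corresponding controllers and CRDs are installed. Keeping the TrainJob small and pushing the pod template into a runtime is the separation the overview describes: platform engineers own the runtime, and data scientists change only per-run values such as the node count.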

Syllabus

Democratizing AI Model Training on Kubernetes with Kubeflow Train... Andrey Velichkevich & Yuki Iwai

Taught by

CNCF [Cloud Native Computing Foundation]

