Improving GPU Utilization and Accelerating Model Training with Kubernetes Scheduling Framework and NRI

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Discover how to enhance GPU utilization and speed up model training using the Kubernetes scheduling framework and the Node Resource Interface (NRI) in this 24-minute conference talk by He Cao from ByteDance. Learn about the limitations of vanilla Kubernetes in managing heterogeneous resources and explore the solutions implemented at ByteDance. Gain insights into GPU-sharing scheduling techniques that allow fine-grained resource allocation, improving GPU utilization in AI inference scenarios. Understand the implementation of topology-aware scheduling and customized GPU-RDMA affinity strategies at the root-complex level to accelerate large model training with GPUDirect RDMA. This talk provides practical guidance for optimizing resource management and performance of AI workloads on Kubernetes clusters.
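
To make the GPU-sharing idea concrete, here is a minimal Go sketch of a Filter-style fit check for pods that request a fraction of a GPU plus dedicated GPU memory. The types, units, and field names below are illustrative assumptions for this sketch, not the actual plugin or resource scheme presented in the talk.

```go
package main

import "fmt"

// Hypothetical fractional-GPU request: compute in hundredths of a GPU
// (50 == 0.5 GPU) and GPU memory in MiB. Units are an assumption for
// illustration only.
type GPURequest struct {
	CoreCentis int // requested GPU compute, in 1/100ths of a full GPU
	MemoryMiB  int // requested GPU memory, in MiB
}

// Per-device view of a node's GPUs after subtracting what running pods
// already consume.
type GPUDevice struct {
	FreeCoreCentis int
	FreeMemoryMiB  int
}

// fits reports whether any single GPU on the node can host the request.
// A Filter-style plugin would run a check like this for each candidate
// node; a Score plugin could then prefer the node (or device) with the
// least leftover capacity to reduce fragmentation.
func fits(req GPURequest, devices []GPUDevice) (int, bool) {
	for i, d := range devices {
		if d.FreeCoreCentis >= req.CoreCentis && d.FreeMemoryMiB >= req.MemoryMiB {
			return i, true
		}
	}
	return -1, false
}

func main() {
	node := []GPUDevice{
		{FreeCoreCentis: 30, FreeMemoryMiB: 8192},   // GPU 0: 0.3 GPU and 8 GiB free
		{FreeCoreCentis: 100, FreeMemoryMiB: 24576}, // GPU 1: fully free
	}
	req := GPURequest{CoreCentis: 50, MemoryMiB: 10240} // pod asks for 0.5 GPU, 10 GiB

	if idx, ok := fits(req, node); ok {
		fmt.Printf("pod fits on GPU %d\n", idx)
	} else {
		fmt.Println("node filtered out: no single GPU has enough free capacity")
	}
}
```

In a setup like the one the talk describes, the device chosen by the scheduler would then typically be enforced on the node side, for example by an NRI plugin that adjusts the container's device access at creation time; the sketch above covers only the scheduler-side fit check.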

Syllabus

Improving GPU Utilization and Accelerating Model Training with Scheduling Framework and NRI - He Cao

Taught by

CNCF [Cloud Native Computing Foundation]

