Improving GPU Utilization and Accelerating Model Training with Kubernetes Scheduling Framework and NRI

Overview

Discover how to enhance GPU utilization and speed up model training using the Kubernetes scheduling framework and the Node Resource Interface (NRI) in this 24-minute conference talk by He Cao from ByteDance. Learn about the limitations of vanilla Kubernetes in managing heterogeneous resources and explore the solutions implemented at ByteDance. Gain insights into GPU-sharing scheduling techniques that allow fine-grained resource allocation, improving GPU utilization in AI inference scenarios. Understand how topology-aware scheduling and customized GPU-RDMA affinity strategies at the root complex level accelerate large model training with GPUDirect RDMA. This talk provides valuable knowledge for optimizing resource management and performance of AI workloads on Kubernetes clusters.

Syllabus

Improving GPU Utilization and Accelerating Model Training with Scheduling Framework and NRI - He Cao

Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Boosting Training and Inference Performance via Topology-Aware Scheduling of Heterogeneous Resources

Building a Fine-Grained and Intelligent Resource Management System on Kubernetes

ML Training Acceleration with Heterogeneous Resources in ByteDance

Precision Matters: Scheduling GPU Workloads on Kubernetes

Improving GPU Utilization Using Kubernetes

Maximizing GPU Utilization Over Multi-Cluster - Challenges and Solutions for Cloud-Native AI Platform
