ML Training Acceleration with Heterogeneous Resources in ByteDance

Overview

This 19-minute conference talk from KubeCon + CloudNativeCon Europe 2022 covers how ByteDance accelerates machine learning training on heterogeneous resources. It describes maximizing GPU utilization through sharing mechanisms, optimizing resource allocation with NUMA affinity, and achieving high-throughput network communication with RDMA CNI and Intel SR-IOV technology. The talk also addresses performance for large-scale distributed models and the management of diverse CPU/GPU resources. Topics include GPU offline training (network and scheduling), GPU online serving, unified GPU scheduling, and future work.

Syllabus

Intro
GPU Offline Training (Network)
GPU Offline Training (Scheduling)
GPU Online Serving
GPU Unified Scheduling
Future Work

Taught by

CNCF [Cloud Native Computing Foundation]
