Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training

Overview

Explore how topology modeling and topology-aware scheduling can accelerate Large Language Model (LLM) training in this 45-minute CNCF conference talk by William Wang of Huawei. Delve into the shift from compute bottlenecks to network bottlenecks in the era of LLM training and inference, and examine high-throughput, low-latency interconnect technologies such as NVLink and NVSwitch used in hyper-computers. Analyze how inter-node communication and intra-node resource interconnects affect AI workload performance, particularly for LLM training. Learn how to model the topology of underlying resources at the NUMA, Rack, Super Pod, and Hyper Computer levels. Discover techniques for making schedulers topology-aware to improve resource allocation and performance. Investigate how to coordinate topology-aware scheduling with Dynamic Resource Allocation (DRA) on nodes, addressing Kubernetes' current limitations in handling topology awareness for AI workloads.

Syllabus

Leverage Topology Modeling and Topology-Aware Scheduling to Accelerate LLM Training - William Wang

Taught by

CNCF [Cloud Native Computing Foundation]

Related Courses

Boosting Training and Inference Performance via Topology-Aware Scheduling of Heterogeneous Resources

Predictable Performance Through Prometheus and Topology Aware Scheduling

Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano

Improving GPU Utilization and Accelerating Model Training with Kubernetes Scheduling Framework and NRI

Training Large Language Models on Kubernetes

Unleashing the Power of Dynamic Resource Allocation for Just-in-Time GPU Slicing
