Boosting Training and Inference Performance via Topology-Aware Scheduling of Heterogeneous Resources

CNCF [Cloud Native Computing Foundation] via YouTube

Overview

Learn how ByteDance optimizes LLM workload performance through enhanced topology-aware scheduling in this technical conference talk. Explore solutions for managing high-density processors, including die-level affinity and anti-affinity configuration between memory-bandwidth-intensive pods. Discover techniques for achieving inter-RDMA affinity at the top-of-rack (ToR) switch level to prevent switch congestion, implementing GPU-RDMA affinity at the PCIe switch level to accelerate communication via GPUDirect RDMA, and establishing job-level topology affinity on top of the Kubernetes scheduler's pod-level operations. Gain insights into addressing Kubernetes topology-management limitations for new-generation processors, the shift of performance bottlenecks from computation to networking, and practical approaches for handling heterogeneous resources such as GPUs and RDMA NICs.
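As a rough illustration of the anti-affinity configuration described above, a Kubernetes pod spec can use `podAntiAffinity` with a `topologyKey` to keep two memory-bandwidth-intensive pods off the same topology domain. This is a minimal sketch, not ByteDance's actual configuration: the `workload` label and the die-level `topologyKey` (`example.com/numa-die`) are assumed names, since standard node labels stop at hostname/zone granularity and die-level topology labels must be published by the cluster itself.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  labels:
    workload: membw-intensive   # marks this pod as memory-bandwidth-intensive (assumed label)
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              workload: membw-intensive   # repel other pods carrying the same label
          # Hypothetical node label exposing die-level topology; a real cluster
          # would need a component (e.g. a node labeller) to publish it.
          topologyKey: example.com/numa-die
  containers:
    - name: worker
      image: training-image:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

The same mechanism inverted — `podAffinity` with a ToR-level `topologyKey` — would express the talk's inter-RDMA affinity idea of packing a job's pods under one switch; job-level (gang-style) topology decisions, however, require logic beyond these built-in per-pod rules, which is the gap the talk addresses.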

Syllabus

Boosting Training and Inference Performance via Topology-Aware Scheduling of Heterogeneous Resources - He Cao

Taught by

CNCF [Cloud Native Computing Foundation]

