Overview
Explore a Ray Summit 2024 conference talk in which Huawei engineers Boyuan Chen, Chong Yin Tan, and Xiaoshuang Liu describe integrating 10,000 Ascend NPUs into a Ray cluster. Discover the technical challenges they encountered, and the solutions they developed, while migrating existing business cases to Ray and implementing Huawei Ascend NPU support. Learn about their custom full-stack Ray observability engine for debugging and optimizing very large clusters, and about how NPU and GPU tasks are scheduled within the same infrastructure. Gain insights into strategies for maximizing resource utilization and maintaining stability in large-scale AI deployments, including the successful migration of a hyperscale inference pipeline to Ray. Suited to organizations and engineers interested in scaling distributed computing and AI infrastructure.
Syllabus
Scaling Ray to 10K NPUs: Huawei's Hyperscale Journey | Ray Summit 2024
Taught by
Anyscale