Overview
Explore GPU cloud infrastructure optimization in this technical conference talk, which examines hardware-level considerations for AI systems. Learn how to fine-tune a range of machine learning models on an H100 cluster, with detailed analysis of critical components such as the pod scheduler, device plugin, GPU/NUMA topology, and the RoCE/NCCL stack. Gain insights from first-hand experimental results demonstrating the relationship between model performance and device and operator configurations within nodes, focusing in particular on the CNN, RNN, and Transformer models from MLPerf. Master the often-overlooked hardware aspects of AI infrastructure that can significantly affect the performance and efficiency of distributed machine learning.
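For readers unfamiliar with the components named above, the following minimal Python sketch (illustrative only, not taken from the talk) shows one common way to inspect a node's GPU/NUMA topology and set NCCL environment variables before launching a distributed job. The HCA and interface names are hypothetical placeholders; real values depend on the cluster.

    # Illustrative sketch only -- not from the talk. Inspects GPU/NUMA
    # topology via nvidia-smi and sets common NCCL environment variables
    # before a distributed training launch.
    import os
    import subprocess

    def show_gpu_topology():
        # Prints the GPU interconnect matrix (NVLink/PCIe) and NUMA affinity.
        result = subprocess.run(["nvidia-smi", "topo", "-m"],
                                capture_output=True, text=True)
        print(result.stdout)

    def configure_nccl():
        # Verbose NCCL logging helps diagnose transport selection (RoCE vs TCP).
        os.environ["NCCL_DEBUG"] = "INFO"
        # Hypothetical device names: restrict NCCL to the RDMA NICs and the
        # network interface that actually carries inter-node traffic.
        os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"   # placeholder HCAs
        os.environ["NCCL_SOCKET_IFNAME"] = "eth0"     # placeholder interface

    if __name__ == "__main__":
        show_gpu_topology()
        configure_nccl()

In practice, the topology matrix guides process placement (pinning each training process to the NUMA node closest to its GPU and NIC), which is the kind of per-node configuration the talk's experiments evaluate.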
Syllabus
Optimize Your AI Cloud Infrastructure: A Hardware Perspective - Liang Yan, CoreWeave
Taught by
Linux Foundation