Overview
Learn how to build and manage GPU clusters for AI workloads in this technical presentation from Cisco at AI Field Day 5. Explore the challenges of setting up GPU clusters and their solutions, including inter-GPU networking optimization using Cisco Nexus 9000 Series switches. Discover how to reduce cluster setup time from weeks to hours using validated solutions, and understand critical network designs like "Rails Optimized" and "Fly" for efficient GPU communication. Master the concepts of collective communication protocols, dynamic load balancing, and static pinning for optimal data flow between GPUs. Gain insights into creating lossless networks using priority-based flow control (PFC) and using Nexus Dashboard for monitoring and anomaly detection. Follow along as a machine learning engineer demonstrates building a generative AI application on on-premises GPU infrastructure, showing how to process billions of tokens efficiently while maintaining data security. See real-world applications of AI/ML infrastructure in network engineering through practical examples of real-time insights and anomaly detection.
Syllabus
Demystifying Artificial Intelligence and Machine Learning Infrastructure for a Network Engineer
Taught by
Tech Field Day