Overview
Explore a technical presentation from Cloud Field Day 19 on Broadcom's high-performance Ethernet NIC solutions for AI/ML clusters. Distinguished Engineer and Architect Hemal Shah explains how the growing complexity of AI/ML workloads demands robust networking, focusing on the advanced features of the Thor 2 400G NIC. Learn about RDMA over Converged Ethernet (RoCE), congestion control mechanisms, and the importance of end-to-end fabric management in large-scale networks. Discover a reference architecture for AI/ML clusters that scales to thousands of GPUs and combines Broadcom switches and NICs. Gain insights into key technical specifications, including a PCIe Gen 5 x16 host interface, hardware root of trust security, and bidirectional line-rate throughput with low latency. Understand how these networking solutions contribute to shorter job completion times and better overall cluster performance in demanding AI/ML environments.
Syllabus
Broadcom Thor 2: High Performance Ethernet NIC for AI/ML
Taught by
Tech Field Day