PANAMA: In-Network Aggregation for Shared Machine Learning Clusters

MLOps World: Machine Learning in Production via YouTube

Overview

Explore PANAMA, an in-network aggregation framework for distributed machine learning training on shared clusters. The system has two components: a custom in-network hardware accelerator that aggregates floating-point gradients at line rate without compromising accuracy, and a lightweight load-balancing and congestion control protocol. Discover how PANAMA exploits the communication patterns of data-parallel ML jobs to share network resources fairly while maintaining high throughput for long-running jobs and low latency for short jobs and latency-sensitive traffic. Examine the feasibility of PANAMA through an FPGA-based prototype with 10 Gbps transceivers and large-scale simulations. Learn how the framework reduces the average training time of large jobs by up to a factor of 1.34 and cuts the 99th-percentile completion time of non-aggregation flows by up to 4.5x. The talk is presented by Nadeen Gebara, a Ph.D. student at Imperial College London.
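
For readers new to the topic, the short sketch below illustrates what "gradient aggregation" means in data-parallel training: each worker computes a gradient vector of the same length, and the vectors are summed element by element before the result is returned to every worker. It is only a software illustration under stated assumptions, not PANAMA's hardware design or protocol; the function name `aggregate_gradients` and the sample values are hypothetical.

```python
from typing import List


def aggregate_gradients(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized gradient vectors, one per worker."""
    if not worker_grads:
        raise ValueError("need at least one worker gradient")
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all gradient vectors must have the same length")
    # zip(*...) groups the i-th element of every worker's vector together.
    return [sum(column) for column in zip(*worker_grads)]


# Hypothetical gradients from three data-parallel workers.
grads = [[0.5, -1.0, 2.0], [1.5, 0.25, -2.0], [-1.0, 0.75, 1.0]]
summed = aggregate_gradients(grads)
# Workers typically divide the sum by the worker count to get the mean gradient.
averaged = [value / len(grads) for value in summed]
print(summed)
print(averaged)
```

An in-network approach performs this summation inside the network rather than on the workers or a separate server, so each worker sends its gradients once and receives the aggregated result back.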

Syllabus

PANAMA: In-Network Aggregation for Shared Machine Learning Clusters

Taught by

MLOps World: Machine Learning in Production
