MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

USENIX via YouTube

Overview

Explore a conference talk on MAST, a global scheduler for ML training workloads across geo-distributed datacenters at hyperscale. Learn about the challenges of manual datacenter region selection in public clouds and how MAST addresses these issues in Meta's private cloud. Discover the three key design principles enabling MAST to schedule complex ML training workloads globally: temporal decoupling, scope decoupling, and exhaustive search. Understand how MAST successfully balances load across global regions, reducing the GPU demand-to-supply ratio for high-priority workloads from 2.63 to 0.98 in the most overloaded region. Gain insights into the global-scheduling abstraction provided by MAST and its impact on hardware utilization and profitability.

Syllabus

OSDI '24 - MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Taught by

USENIX
