Explore a conference talk on CASSINI, a network-aware job scheduler for machine learning clusters. The talk introduces a geometric abstraction that accounts for the communication patterns of different jobs when placing them on network links. Learn how CASSINI's Affinity graph finds time-shift values that interleave the communication phases of jobs sharing the same network link. Compared with state-of-the-art ML schedulers, CASSINI improves average and tail job completion times by up to 1.6x and 2.5x, respectively. Experiments with 13 common ML models on a 24-server testbed also show that it reduces the number of ECN-marked packets in the cluster by up to 33x.
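To illustrate what interleaving communication phases with a time shift means, here is a minimal toy sketch. It is not CASSINI's geometric abstraction or its Affinity graph. It is a brute-force search under simplifying assumptions: time is divided into integer slots, each job is described by a hypothetical tuple `(period, comm_start, comm_len)`, and two jobs share a single link. The function names `demand` and `best_shift` are made up for this example.

```python
from math import gcd


def demand(period, comm_start, comm_len, shift, horizon):
    """Return a 0/1 list: 1 in every time slot where the job uses the link.

    The job repeats every `period` slots and communicates for `comm_len`
    slots starting at `comm_start`; `shift` delays the whole pattern.
    """
    return [
        1 if (t - shift - comm_start) % period < comm_len else 0
        for t in range(horizon)
    ]


def best_shift(job_a, job_b):
    """Brute-force the delay for job_b that minimizes slots where both
    jobs communicate at once. Returns (shift, overlapping_slots)."""
    period_a, period_b = job_a[0], job_b[0]
    # The combined pattern repeats after the least common multiple.
    horizon = period_a * period_b // gcd(period_a, period_b)
    a = demand(*job_a, shift=0, horizon=horizon)

    best = None
    for s in range(period_b):
        b = demand(*job_b, shift=s, horizon=horizon)
        overlap = sum(x & y for x, y in zip(a, b))
        if best is None or overlap < best[1]:
            best = (s, overlap)  # keep the smallest shift on ties
    return best


if __name__ == "__main__":
    # Two identical jobs: 10-slot iterations, 4 slots of communication.
    print(best_shift((10, 0, 4), (10, 0, 4)))  # prints (4, 0)
```

In this toy case, delaying the second job by 4 slots removes all overlap on the shared link, which is the intuition behind time-shifting jobs rather than letting their communication bursts collide.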
Overview
Syllabus
NSDI '24 - CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
Taught by
USENIX