Overview
Explore a 22-minute conference talk from NSDI '23 that introduces SYNDICATE, a novel framework designed to optimize communication bottlenecks in large-scale machine learning training. Delve into the challenges posed by emerging ML training deployments, including larger models and hybrid-parallel training techniques. Discover how SYNDICATE addresses these issues through a new abstraction called "motif" and joint optimization of scheduling and execution planning. Learn how the framework breaks large communication tasks into smaller pieces, maximizing network utilization and overlap with compute. Understand how this approach improves the training speed of state-of-the-art large models by 21-74%. Gain insights into the future of ML training optimization and its potential impact on the fields of machine learning and distributed systems.
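The intuition behind splitting large communication tasks into smaller pieces can be illustrated with a toy timeline model. This is a minimal sketch, not SYNDICATE's actual algorithm: the function names (`chunk_tensor`, `simulate_schedule`) and the two-stream cost model are assumptions made for illustration, under the simplification that compute and network transfers proceed at constant rates on separate streams.

```python
def chunk_tensor(num_elements, chunk_size):
    """Split a flat tensor of num_elements into (offset, length) chunks."""
    chunks = []
    offset = 0
    while offset < num_elements:
        length = min(chunk_size, num_elements - offset)
        chunks.append((offset, length))
        offset += length
    return chunks

def simulate_schedule(num_elements, chunk_size, comm_rate, compute_time):
    """Toy timeline: compute produces gradients chunk by chunk on one
    stream; each chunk's collective runs on the network stream as soon as
    both the chunk and the network are free. With fine-grained chunks,
    total time approaches max(compute, comm) instead of their sum."""
    chunks = chunk_tensor(num_elements, chunk_size)
    per_chunk_compute = compute_time / len(chunks)
    compute_done = 0.0  # when the current chunk's data is ready
    comm_free = 0.0     # when the network is next available
    finish = 0.0
    for _, length in chunks:
        compute_done += per_chunk_compute      # chunk's gradient is ready
        start = max(compute_done, comm_free)   # wait for data and network
        comm_free = start + length / comm_rate
        finish = comm_free
    return finish
```

For example, with a transfer that takes 10 time units and compute that also takes 10, a single monolithic collective finishes at 20 (`simulate_schedule(1000, 1000, 100, 10)`), while splitting into ten chunks overlaps the two and finishes at 11 (`simulate_schedule(1000, 100, 100, 10)`) — the kind of overlap gain the talk's chunking strategy targets.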
Syllabus
NSDI '23 - Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning...
Taught by
USENIX