Overview
Explore a 15-minute conference talk from USENIX NSDI '23 that introduces Bamboo, a distributed system designed to significantly reduce the cost of training large Deep Neural Network (DNN) models. Learn how Bamboo leverages preemptible instances and introduces redundant computations into the training pipeline to achieve resilience and efficiency in the face of frequent preemptions. Discover how this approach outperforms traditional checkpointing techniques, delivering a 3.7× improvement in training throughput, while reducing costs by 2.4× compared to using on-demand instances. Gain insights into the challenges of training increasingly large DNN models and the novel solutions proposed to make this process more affordable for organizations and research labs of all sizes.
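To make the redundant-computation idea mentioned above concrete, below is a minimal, self-contained Python sketch of one way a pipeline could tolerate a preempted stage: each stage keeps a replica of a neighbor's layers and steps in for that neighbor if its instance is reclaimed. The Stage and run_pipeline names, the successor-direction replica, and the failure handling are illustrative assumptions, not Bamboo's actual design or code from the talk.

```python
# Conceptual sketch only (assumed design, not Bamboo's implementation):
# each pipeline stage also holds a copy of its successor's layers, so a
# surviving stage can redo a preempted neighbor's work and the forward
# pass still completes without falling back to a checkpoint.
from typing import Callable, List

Layer = Callable[[float], float]


class Stage:
    """One pipeline stage: its own layers plus a replica of its successor's."""

    def __init__(self, own: List[Layer], successor_replica: List[Layer]):
        self.own = own
        self.successor_replica = successor_replica
        self.preempted = False  # set True when the spot instance is reclaimed

    def forward(self, x: float, layers: List[Layer]) -> float:
        for layer in layers:
            x = layer(x)
        return x


def run_pipeline(stages: List[Stage], x: float) -> float:
    """Run a forward pass; a predecessor covers for any preempted stage."""
    i = 0
    while i < len(stages):
        stage = stages[i]
        if not stage.preempted:
            x = stage.forward(x, stage.own)
        elif i > 0 and not stages[i - 1].preempted:
            # Redundant computation: the previous stage holds a replica of
            # this stage's layers and executes them on its behalf.
            x = stages[i - 1].forward(x, stages[i - 1].successor_replica)
        else:
            raise RuntimeError(f"no surviving replica for stage {i}")
        i += 1
    return x


if __name__ == "__main__":
    # Four toy "layers" split across two stages; stage 0 also replicates stage 1.
    layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * 10]
    stages = [
        Stage(own=layers[:2], successor_replica=layers[2:]),
        Stage(own=layers[2:], successor_replica=[]),
    ]
    print(run_pipeline(stages, 1.0))  # normal run
    stages[1].preempted = True        # simulate a spot preemption
    print(run_pipeline(stages, 1.0))  # stage 0's replica covers the lost stage
```

Both runs produce the same output, which is the point of the sketch: with a neighbor's layers replicated ahead of time, a preemption does not force the pipeline to roll back and restore state before continuing.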
Syllabus
NSDI '23 - Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
Taught by
USENIX