SHADE - Enable Fundamental Cacheability for Distributed Deep Learning Training

Overview

This FAST '23 conference talk presents SHADE, a caching system designed for distributed deep learning training (DLT) that targets the I/O performance bottleneck in accelerator-driven environments. SHADE uses importance sampling to detect fine-grained variations in importance at the per-sample level and bases its caching decisions for distributed DLT jobs on them. A rank-based approach captures the relative importance of samples across different minibatches, and importance scores are updated dynamically as training proceeds. The talk reports significant improvements in cache hit ratio and overall training performance, particularly for computer vision models. It also covers the challenges posed by exponentially growing dataset sizes and the distinctive I/O workload behavior of DLT applications, and discusses what SHADE's techniques imply for storage system design for deep learning.
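To make the idea of rank-based, importance-aware caching more concrete, here is a minimal Python sketch. It is not SHADE's implementation: the names `rank_scores` and `ImportanceCache`, the use of per-sample loss as the importance signal, and the evict-the-least-important policy are all assumptions chosen for illustration. The sketch converts the losses in one minibatch into ranks, so that scores stay comparable across minibatches, and keeps the highest-scoring samples in a fixed-size cache.

```python
def rank_scores(losses):
    """Map per-sample losses from one minibatch to rank-based scores in (0, 1].

    The sample with the highest loss gets score 1.0. Using ranks instead of
    raw losses keeps scores comparable across minibatches (illustrative choice).
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    scores = [0.0] * len(losses)
    for rank, idx in enumerate(order, start=1):
        scores[idx] = rank / len(losses)
    return scores


class ImportanceCache:
    """Hypothetical fixed-capacity cache that keeps the most important samples."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}        # sample_id -> cached payload
        self.importance = {}  # sample_id -> most recent importance score

    def get(self, sample_id):
        # Returns the cached payload, or None on a miss.
        return self.data.get(sample_id)

    def update(self, sample_id, payload, score):
        # Record the latest score, then decide whether the sample is cached.
        self.importance[sample_id] = score
        if sample_id in self.data:
            return
        if len(self.data) < self.capacity:
            self.data[sample_id] = payload
            return
        # Cache is full: replace the least important cached sample,
        # but only if the new sample is more important than it.
        victim = min(self.data, key=lambda s: self.importance[s])
        if self.importance[victim] < score:
            del self.data[victim]
            self.data[sample_id] = payload


# Usage: one minibatch of four samples, cache capacity of two.
cache = ImportanceCache(capacity=2)
batch_ids = ["a", "b", "c", "d"]
losses = [0.9, 0.1, 0.5, 0.3]
for sid, score in zip(batch_ids, rank_scores(losses)):
    cache.update(sid, payload=f"data-{sid}", score=score)
print(sorted(cache.data))  # ['a', 'c']
```

In this toy run the two samples with the highest loss ranks stay cached. A real system would also have to coordinate scores across distributed workers and refresh them as the model changes, both of which the sketch leaves out.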

Syllabus

FAST '23 - SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

Taught by

USENIX

Related Courses

DEPART - Replica Decoupling for Distributed Key-Value Storage

Deep Learning for Fundamental Sciences Using High-Performance Computing

Distributed Deep Learning on Apache Mesos with GPUs and Gang Scheduling

BGL - GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Pollux - Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

Stanford Seminar - HPC Opportunities in Deep Learning - Greg Diamos, Baidu
