Overview
Explore a groundbreaking approach to optimizing deep learning training (DLT) in distributed settings in this conference talk from FAST '23. Dive into SHADE, a novel DLT-aware caching system that addresses the I/O performance bottleneck in accelerator-driven environments. Learn how SHADE leverages importance sampling to detect fine-grained importance variations at the per-sample level and make informed caching decisions for distributed DLT jobs. Discover the rank-based approach that captures relative importance across minibatches and dynamically updates importance scores as training progresses. Examine the significant improvements in cache hit ratio and overall training performance that SHADE achieves, particularly for computer vision models. Gain insights into the challenges posed by exponentially growing dataset sizes and the distinctive I/O workload behavior of DLT applications, and understand how SHADE's techniques can inform storage system design for deep learning.
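To make the caching idea concrete, below is a minimal Python sketch of the two techniques the talk highlights: rank-normalizing per-sample losses within a minibatch so that importance scores stay comparable across minibatches, and using those scores to decide which samples a bounded cache keeps. This is an illustrative toy under assumed interfaces, not SHADE's actual implementation; the names ImportanceAwareCache and rank_scores are invented for the example.

```python
def rank_scores(per_sample_losses):
    """Rank-based importance within one minibatch.

    The highest-loss sample receives the top rank, and ranks are normalized
    to (0, 1], so scores remain comparable across minibatches even when the
    raw loss scales differ (a stand-in for SHADE's rank-based importance).
    """
    order = sorted(range(len(per_sample_losses)),
                   key=lambda i: per_sample_losses[i])
    ranks = [0.0] * len(per_sample_losses)
    for rank, idx in enumerate(order):
        ranks[idx] = (rank + 1) / len(per_sample_losses)
    return ranks


class ImportanceAwareCache:
    """Toy importance-aware sample cache (illustrative only).

    Keeps the samples with the highest importance scores; when the cache is
    full, an incoming sample is admitted only if it is more important than
    the least-important cached entry, which is then evicted.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}    # sample_id -> cached sample payload
        self.scores = {}  # sample_id -> latest importance score

    def update_score(self, sample_id, score):
        # Refresh a cached sample's importance (e.g., from its most recent
        # per-sample loss), mirroring the idea of dynamically updating
        # importance scores during training.
        if sample_id in self.scores:
            self.scores[sample_id] = score

    def put(self, sample_id, payload, score):
        if sample_id in self.data:
            self.scores[sample_id] = max(self.scores[sample_id], score)
            return
        if len(self.data) >= self.capacity:
            victim = min(self.scores, key=self.scores.get)
            if self.scores[victim] >= score:
                return  # incoming sample is no more important; skip caching
            del self.data[victim]
            del self.scores[victim]
        self.data[sample_id] = payload
        self.scores[sample_id] = score

    def get(self, sample_id):
        return self.data.get(sample_id)  # None signals a cache miss


if __name__ == "__main__":
    cache = ImportanceAwareCache(capacity=2)
    losses = [0.1, 0.9, 0.5]  # hypothetical per-sample losses for one minibatch
    for sid, score in zip(["a", "b", "c"], rank_scores(losses)):
        cache.put(sid, payload=f"tensor-{sid}", score=score)
    print(sorted(cache.data))  # ['b', 'c']: the low-importance 'a' was evicted
```

In this sketch, admission and eviction both key off the rank-normalized score, which is what lets an importance-aware policy hold onto high-loss samples and raise the effective hit ratio relative to recency-based policies such as LRU.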
Syllabus
FAST '23 - SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
Taught by
USENIX