Overview
Discover strategies to optimize I/O performance and maintain GPU utilization during machine learning model training in the cloud. This 27-minute conference talk explores the challenges of data-intensive training, focusing on the frequent small-file I/O that workloads such as image and audio training demand. Learn about a novel architecture designed to enhance the entire data pipeline and sustain the high throughput GPUs require. Gain insights into implementing this architecture for PyTorch workloads on Kubernetes in public cloud environments, addressing the data access patterns and I/O challenges specific to model training as opposed to traditional data analytics.
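As background for the talk's theme of keeping the GPU fed, the sketch below shows a generic PyTorch input-pipeline tuning pattern (parallel workers, pinned memory, prefetching). It is an illustrative assumption, not the specific architecture presented in the talk; the dataset path "data/train" and all parameter values are hypothetical placeholders.

```python
# Illustrative sketch only: generic PyTorch DataLoader tuning to overlap
# small-file I/O with GPU compute. Not the architecture from the talk.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Typical preprocessing for many small image files, where decoding and
# file reads dominate unless they are overlapped with GPU work.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# "data/train" is a placeholder ImageFolder-style dataset path.
train_set = datasets.ImageFolder("data/train", transform=transform)

train_loader = DataLoader(
    train_set,
    batch_size=128,
    shuffle=True,
    num_workers=8,          # parallel file reads and decoding
    pin_memory=True,        # faster host-to-GPU copies
    prefetch_factor=4,      # batches each worker loads ahead of time
    persistent_workers=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for images, labels in train_loader:
    # non_blocking copies overlap with the previous step's GPU kernels
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```

Even a well-tuned loader like this can stall when the underlying cloud storage cannot serve small files fast enough, which is the bottleneck the talk's architecture aims to address further down the data pipeline.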
Syllabus
How to Eliminate the I/O Bottleneck and Continuously Feed the GPU While Training in the... - Lu Qiu
Taught by
Linux Foundation