Explore a conference talk on building high-performance storage for large-scale machine learning systems. Discover how I/O bottlenecks, especially when moving data from global filesystems, can significantly increase training time and limit system scalability. Learn about approaches to these challenges, including the adoption of high-speed hardware and software improvements such as thread models, load-balancing SDKs, read/write splitting, and read-path optimization. Gain insights into achieving lower latency and higher throughput for more efficient ML model training and data processing.
High Performance Storage Solution for Large-scale ML Systems
CNCF [Cloud Native Computing Foundation] via YouTube
Overview
Syllabus
High Performance Storage Solution for Large-scale ML Systems - Hongjian Yu & Pengfei Zheng
Taught by
CNCF [Cloud Native Computing Foundation]