Overview
Explore the benefits of using Ray to build a heterogeneous training cluster for deep learning models at Netflix. Learn how to set up a cluster with a mix of CPU and GPU instances, run distributed training jobs, and leverage Ray's automatic resource allocation for scheduling different types of workers. Discover best practices for configuring and managing persistent clusters using Ray, while addressing challenges in building and maintaining such systems. Gain insights into how Netflix's Machine Learning Platform team optimizes infrastructure for various use cases, including recommendations, content understanding, and artwork generation. Understand the importance of reliable, scalable, and robust training and deployment of machine learning models in the entertainment industry.
Syllabus
Heterogeneous Training Cluster with Ray at Netflix
Taught by
Anyscale