Best Practices for Productionizing Distributed Training with Ray Train

Overview

Learn best practices for productionizing distributed training with Ray Train in this 30-minute tutorial from Anyscale. The session covers techniques for making large-scale machine learning workloads fault tolerant: restoring experiments, recovering from node failures, snapshotting experiment state to persistent cloud storage, and checkpointing large models. It shows the small additions a Ray Train application needs to get fault-tolerant model training, and how to handle issues such as out-of-memory errors and storage failures in multi-node distributed training, which matter especially when training large language models. It also explains how fault tolerance can reduce costs by making spot instances practical, since training progress is preserved when an instance fails.
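The checkpoint-and-resume pattern behind these techniques can be illustrated with a minimal sketch. It deliberately uses only the Python standard library and is not Ray Train's API: the names `save_checkpoint`, `load_checkpoint`, `train`, `run_with_retries`, and `SimulatedNodeFailure` are hypothetical, a local directory stands in for persistent cloud storage, and the "model" is a single number. The tutorial's own code is assumed to differ.

```python
import json
import os
import tempfile


class SimulatedNodeFailure(RuntimeError):
    """Hypothetical stand-in for a worker or node dying mid-training."""


def save_checkpoint(storage_dir, state):
    # Write to a temporary file, then rename, so a crash during the
    # write cannot leave a half-written checkpoint behind.
    os.makedirs(storage_dir, exist_ok=True)
    final_path = os.path.join(storage_dir, "checkpoint.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)


def load_checkpoint(storage_dir):
    # Return the last saved state, or None when starting fresh.
    path = os.path.join(storage_dir, "checkpoint.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(storage_dir, num_epochs, fail_at_epoch=None):
    # Resume from the latest checkpoint if one exists.
    state = load_checkpoint(storage_dir) or {"epoch": 0, "weight": 0.0}
    for epoch in range(state["epoch"], num_epochs):
        if fail_at_epoch is not None and epoch == fail_at_epoch:
            raise SimulatedNodeFailure(f"lost a node during epoch {epoch}")
        state["weight"] += 0.5  # placeholder for a real optimization step
        state["epoch"] = epoch + 1
        save_checkpoint(storage_dir, state)
    return state


def run_with_retries(storage_dir, num_epochs, max_failures, fail_at_epoch=None):
    # Restart training after a failure, up to max_failures times.
    failures = 0
    while True:
        try:
            return train(storage_dir, num_epochs, fail_at_epoch)
        except SimulatedNodeFailure:
            failures += 1
            if failures > max_failures:
                raise
            fail_at_epoch = None  # the simulated fault is transient


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as storage_dir:
        print(run_with_retries(storage_dir, num_epochs=5, max_failures=1,
                               fail_at_epoch=3))
```

Because the state is saved after every epoch, the retry picks up at the failed epoch instead of epoch 0. This is what makes preemptible (spot) capacity workable: a lost instance costs at most the work done since the last checkpoint.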

Syllabus

Best Practices for Productionizing Distributed Training with Ray Train

Taught by

Anyscale


