Explore the capabilities of Alpa, a Ray-native library designed for automated training and serving of large models like GPT-3. Discover how Alpa simplifies model-parallel training of complex deep learning models by generating execution plans that unify data, operator, and pipeline parallelism.

Learn about Alpa's approach to distributing training across two hierarchical levels of parallelism: inter-operator and intra-operator. Understand how Alpa constructs a new hierarchical space of model-parallel execution plans and uses compilation passes to derive optimal plans from it. Examine Alpa's efficient runtime, which orchestrates two-level parallel execution on distributed compute devices.

Compare Alpa's performance to hand-tuned model-parallel training systems and explore its versatility in handling models with heterogeneous architectures. Delve into both the algorithmic aspects and the engineering/system implementation, with a focus on Ray's crucial role as a building block of the Alpa runtime. This 31-minute talk from Anyscale at Ray Summit provides valuable insights into advanced techniques for scaling complex deep learning models across distributed computing environments.
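To make the "hierarchical space of execution plans" concrete, here is a toy sketch, not Alpa's actual algorithm or API: it enumerates inter-operator choices (splitting a chain of operators into contiguous pipeline stages) and, per stage, picks an intra-operator sharding strategy. The operator names, sharding strategies, and cost model below are all hypothetical placeholders for illustration.

```python
from itertools import combinations

# Hypothetical 4-operator model and per-sharding cost factors
# (none of these names or numbers come from Alpa itself).
OPS = ["embed", "attn", "mlp", "head"]
SHARDINGS = {"replicated": 1.0, "row": 0.6, "col": 0.7}

def stage_partitions(ops):
    """Yield every way to split `ops` into contiguous pipeline stages
    (the inter-operator level of the plan space)."""
    n = len(ops)
    for k in range(n):                         # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield [ops[a:b] for a, b in zip(bounds, bounds[1:])]

def plan_cost(stages, sharding_choice):
    """Toy cost: the pipeline bottleneck (heaviest stage, scaled by its
    sharding factor) plus a small per-stage communication penalty."""
    loads = [len(stage) * SHARDINGS[sharding_choice[i]]
             for i, stage in enumerate(stages)]
    return max(loads) + 0.1 * len(stages)

best = None
for stages in stage_partitions(OPS):
    # Intra-operator level: pick a sharding per stage
    # (here, naively, the cheapest one).
    choice = [min(SHARDINGS, key=SHARDINGS.get) for _ in stages]
    cost = plan_cost(stages, choice)
    if best is None or cost < best[0]:
        best = (cost, stages, choice)

print(best)
```

Even this tiny example shows why the space is "massive": stage cuts multiply with per-stage sharding choices, which is why Alpa relies on compilation passes rather than manual tuning to find a good plan.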