Overview
This AWS re:Invent 2023 session introduces Amazon SageMaker HyperPod, a purpose-built solution for accelerating foundation model (FM) training. Explore how to run FM training for weeks or months without interruption. Discover how the system monitors cluster health and automatically repairs and replaces faulty nodes while preserving training progress. Gain insight into the preconfigured SageMaker distributed training libraries, which split training data and models into smaller segments and process them in parallel across cluster nodes to make full use of the cluster's compute and network infrastructure.
Syllabus
AWS re:Invent 2023 - [LAUNCH] Introducing Amazon SageMaker HyperPod (AIM362)
Taught by
AWS Events