Checkpoint Offloading SSD - Enhancing Performance and Scalability in LLM Training

Open Compute Project via YouTube

Overview

Learn about innovative storage solutions for Large Language Model (LLM) training in this technical presentation from SK Hynix experts. Explore how checkpoint offloading SSD technology addresses performance bottlenecks and enhances scalability in LLM training environments. Discover methods for managing model states, including parameters, momentums, and variances, while reducing data movement between GPUs and storage. Examine experimental results demonstrating how AI storage solutions can optimize GPU memory usage and improve overall training efficiency by offloading optimizer operations to storage. Gain insights into practical approaches for handling interruptions and failures during LLM training through persistent storage strategies.
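The checkpointing pattern the talk describes can be illustrated with a minimal sketch. This is not SK Hynix's implementation, just a hedged stdlib-only Python example of the general idea: periodically persisting optimizer state (parameters, momentums, variances) to storage so training can resume after an interruption; all names and the checkpoint interval are illustrative assumptions.

```python
# Illustrative sketch only (not the presenters' code): persist optimizer
# state to local storage so a training loop can survive an interruption.
import os
import pickle
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "llm_ckpt.pkl")

def save_checkpoint(step, state, path=CHECKPOINT):
    """Atomically write model state (params, momentums, variances) to storage."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename: no torn checkpoint on a crash

def load_checkpoint(path=CHECKPOINT):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, {"params": [0.0], "momentums": [0.0], "variances": [0.0]}
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

# Start clean for this demo, then simulate a training loop that
# checkpoints every 5 steps and is resumed from persistent storage.
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)

step, state = load_checkpoint()
for step in range(step, 10):
    state["params"][0] += 0.1          # stand-in for an optimizer update
    if step % 5 == 4:                  # checkpoint interval (assumption)
        save_checkpoint(step + 1, state)

resumed_step, resumed_state = load_checkpoint()
```

A real system would stream GPU tensors to NVMe rather than pickling Python objects, but the atomic-rename pattern for crash-safe checkpoints carries over.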

Syllabus

Checkpoint Offloading SSD - Enhancing Performance and Scalability in LLM Training

Taught by

Open Compute Project
