Check-N-Run - A Checkpointing System for Training Deep Learning Recommendation Models

USENIX via YouTube

Classroom Contents

  1. Intro
  2. Recommendation Models are important: use cases
  3. Recommendation Model Architecture
  4. High Performance Training at Meta
  5. The Criticality of Checkpointing: failure recovery ensures progress
  6. Checkpoint Challenges
  7. Check-N-Run
  8. Checkpointing Workflow
  9. Reducing WB with Differential Checkpointing
  10. Approaches for Differential Checkpointing: One-Shot Differential Checkpoint, Consecutive Incremental Checkpoint, Intermittent Differential Checkpoint
  11. Checkpoint Quantization: compress the checkpoint without degrading training accuracy
  12. Comparing Quantization Strategies: uniform quantization, non-uniform quantization using k-means, adaptive uniform quantization
  13. Quantization Bit-width Selection
  14. Overall Reduction
  15. Summary