
Handling Multi-Terabyte LLM Checkpoints - MLOps Podcast #228

MLOps.community via YouTube

Overview

This podcast episode features Simon Karasik, Machine Learning Engineer at Nebius AI, on handling multi-terabyte LLM checkpoints. It covers the challenges of LLM checkpointing, including checkpoint sizes, techniques for saving and loading very large checkpoints, and how to select cloud storage options for checkpointing. Simon also describes his background in machine learning, which spans ads, speech, and tax. Other topics include zombie model garbage collection, the evolution of LLMs, and the importance of confidence in AI training, followed by the differences between Slurm and Kubernetes, lessons on storage choices, and the essential components for setting up LLM infrastructure. The episode also touches on Argo workflows, Kubernetes node troubleshooting, and the complexities of fine-tuning, storage, and networking in LLM development. It closes with practical advice: start simple before moving to more complex setups, and consider the needs of the specific model.

Syllabus

Simon's preferred beverage
Takeaways
Simon's tech background
Zombie model garbage collection
The road to LLMs
Trained models Simon worked on
LLM checkpoints
Confidence in AI training
Different checkpoints
Checkpoint parts
Slurm vs Kubernetes
Storage choice lessons
Paramount components for setup
Argo workflows
Kubernetes node troubleshooting
Cloud virtual machines have pre-installed monitoring
Fine-tuning
Storage, networking, and complexity in network design
Start simple before advanced; consider model needs
Join us at our first in-person conference on June 25 all about AI Quality

Taught by

MLOps.community
