Overview
Explore task migration at scale using CRIU in this Linux Plumbers Conference talk. Dive into Google's experience using Checkpoint/Restore in Userspace (CRIU) to migrate container workloads between machines without losing application state. Learn about the challenges of supporting production workloads, integrating with existing container infrastructure, and managing migratable containers at scale. Discover the impact on efficiency and utilization in Google's computing infrastructure, which manages millions of simultaneous jobs in data centers worldwide. Gain insights into the current state of the CRIU project, the new requirements that arise from large-scale deployment, and lessons learned in practice. Topics include networking, storage, task environment, performance, user experience, and adoption challenges. Discuss potential improvements to CRIU in performance, security, and time handling. Consider the future direction of CRIU and task migration in Linux as a whole, including migration time optimization, weight feed, scheduling, remote storage, and persistent disk implementation.
Syllabus
Introduction
Borg
Tasks
Evictions
Effects of Evictions
Transparent Migration
Networking
Migration Workflow
Networking at Google
Storage at Google
Task Environment
Performance
User Experience
Adoption Challenges
CRIU Improvements
Performance and Security
Handling Time
Future Work
Questions
Migration time
Weight feed
Scheduling
Remote Storage
Persistent Disk
Taught by
Linux Plumbers Conference