iCheck - Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems

OpenFabrics Alliance via YouTube

Overview

Learn about iCheck, a novel checkpointing framework, in this technical presentation from researchers at the Technical University of Munich. Explore how RDMA and malleable, multilevel, application-level checkpointing can address the critical challenge of system failures in exascale supercomputers. Discover the implementation details of iCheck's RDMA-enabled, configurable, multi-agent checkpoint transfer mechanism, which minimizes the application's resource usage. Examine how the libfabric library provides RDMA support, giving remote access to preregistered memory regions without CPU involvement and thereby improving throughput and reducing latency. Understand the two checkpoint and restart methods, based on RDMA read and write operations, along with the push and pull transfer techniques. See the performance improvements demonstrated through integration with ls1 mardyn, LULESH, a Jacobi 2D heat simulation, and synthetic applications, which reach up to 5000x better performance than traditional in-house checkpointing mechanisms.
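
To make the push and pull idea concrete, here is a minimal Python sketch that simulates one-sided transfers between preregistered memory regions. It is not iCheck's API and makes no libfabric calls: `RegisteredRegion`, `rdma_write`, `rdma_read`, `checkpoint_push`, and `checkpoint_pull` are all hypothetical names introduced for illustration. It also assumes that "push" means the application writes its state into an agent's region and "pull" means an agent reads the state out of the application's region, which the talk may define differently.

```python
class RegisteredRegion:
    """Hypothetical stand-in for a preregistered RDMA memory region."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)


def rdma_write(data: bytes, remote: RegisteredRegion, offset: int = 0) -> None:
    """Simulated one-sided write: the initiator copies data into the remote
    region; no code runs on the side that owns the region."""
    if offset < 0 or offset + len(data) > len(remote.buf):
        raise ValueError("write exceeds the registered region")
    remote.buf[offset:offset + len(data)] = data


def rdma_read(remote: RegisteredRegion, length: int, offset: int = 0) -> bytes:
    """Simulated one-sided read: the initiator copies data out of the remote
    region; no code runs on the side that owns the region."""
    if offset < 0 or offset + length > len(remote.buf):
        raise ValueError("read exceeds the registered region")
    return bytes(remote.buf[offset:offset + length])


def checkpoint_push(app_state: bytes, agent_region: RegisteredRegion) -> None:
    """Assumed 'push': the application writes its state into the agent's region."""
    rdma_write(app_state, agent_region)


def checkpoint_pull(app_region: RegisteredRegion, length: int) -> bytes:
    """Assumed 'pull': the agent reads the state from the application's region."""
    return rdma_read(app_region, length)


if __name__ == "__main__":
    state = b"iteration=42"

    # Push: the application initiates the transfer.
    agent_region = RegisteredRegion(64)
    checkpoint_push(state, agent_region)

    # Pull: the agent initiates the transfer, so the application keeps computing.
    app_region = RegisteredRegion(64)
    app_region.buf[:len(state)] = state
    pulled = checkpoint_pull(app_region, len(state))

    print(pulled == rdma_read(agent_region, len(state)))
```

In this sketch, a restart would run the same two operations in the opposite direction, moving the saved bytes from the agent back into the application's region.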

Syllabus

iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems

Taught by

OpenFabrics Alliance
