Learn about iCheck, a novel checkpointing framework, in this technical presentation from researchers at the Technical University of Munich. Explore how RDMA and malleable, multilevel application-level checkpointing can address the critical challenge of system failures in exascale supercomputers. Discover the implementation details of iCheck's RDMA-enabled, configurable, multi-agent-based checkpoint transfer mechanism, which minimizes application resource usage. Examine how the libfabric library enables RDMA support, allowing remote access to preregistered memory regions without CPU involvement, which improves throughput and reduces latency. Understand the two methods for checkpoint and restart operations, based on RDMA read and RDMA write, along with the push and pull transfer techniques. See the performance improvements demonstrated through integration with applications such as ls1 mardyn, LULESH, a Jacobi 2D heat simulation, and synthetic applications, achieving up to 5000x better performance than traditional in-house checkpointing mechanisms.
iCheck - Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems
OpenFabrics Alliance via YouTube
Syllabus
iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems
Taught by
OpenFabrics Alliance