Overview
Syllabus
Intro
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
Designing (MPI+X) Programming Models at Exascale
Overview of the MVAPICH2 Project
MVAPICH2 Release Timeline and Downloads
Architecture of MVAPICH2 Software Family for HPC, DL/ML, and Data Science
Highlights of MVAPICH2 2.3.6-GA Release
Startup Performance on TACC Frontera
Performance of Collectives with SHARP on TACC Frontera
Performance Engineering Applications using MVAPICH2 and TAU
Overview of Some of the MVAPICH2-X Features
Impact of DC Transport Protocol on NEURON
Cooperative Rendezvous Protocols
Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand
Shared Address Space (XPMEM)-based Collectives Design
MVAPICH2-GDR 2.3.6
Highlights of Some MVAPICH2-GDR Features for HPC, DL, ML and Data Science
MVAPICH2-GDR with CUDA-aware MPI Support (see the sketch after this syllabus)
Performance with On-the-fly Compression Support in MVAPICH2-GDR
Collectives Performance on DGX A100 - Small Message
MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
Distributed TensorFlow on ORNL Summit (1,536 GPUs)
Distributed TensorFlow on TACC Frontera (2,048 CPU nodes)
PyTorch, Horovod and DeepSpeed at Scale: Training ResNet-50 on 256 V100 GPUs (see the sketch after this syllabus)
Dask Architecture
Benchmark #1: Sum of CuPy Array and its Transpose (see the sketch after this syllabus)
Benchmark #2: cuDF Merge (TACC Frontera GPU Subsystem)
MVAPICH2-GDR Upcoming Features for HPC and DL
Funding Acknowledgments
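
The "MVAPICH2-GDR with CUDA-aware MPI Support" topic above is about passing GPU-resident buffers directly to MPI calls. Below is a minimal sketch of that usage pattern, assuming an mpi4py build (3.1 or later, for CUDA array interface support) linked against a CUDA-aware MPI library such as MVAPICH2-GDR; mpi4py, CuPy, and the two-rank layout are illustrative choices, not materials from the course.

    # Minimal sketch, assuming mpi4py >= 3.1 built against a CUDA-aware MPI
    # such as MVAPICH2-GDR. Launch with two ranks, e.g.:
    #   mpirun -np 2 python cuda_aware_sketch.py
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    buf = cp.arange(1 << 20, dtype=cp.float32)   # buffer lives in GPU memory

    if rank == 0:
        # The device pointer is handed straight to MPI; a CUDA-aware
        # library moves the data without an explicit host staging copy.
        comm.Send(buf, dest=1, tag=11)
    elif rank == 1:
        out = cp.empty_like(buf)
        comm.Recv(out, source=0, tag=11)
        cp.cuda.get_current_stream().synchronize()
        print("rank 1 checksum:", float(out.sum()))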
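For the "PyTorch, Horovod and DeepSpeed at Scale" topic, here is a skeleton of Horovod-style data-parallel training in PyTorch. It is not the course's training script: the Linear layer stands in for ResNet-50, and the loss and hyperparameters are placeholders.

    # Skeleton of Horovod data-parallel training in PyTorch.
    # Launch with e.g.: horovodrun -np 4 python hvd_sketch.py
    import torch
    import horovod.torch as hvd

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())      # one GPU per rank

    model = torch.nn.Linear(1024, 1000).cuda()   # stand-in for ResNet-50
    opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).square().mean()          # dummy loss
        opt.zero_grad()
        loss.backward()                          # gradients allreduced by Horovod
        opt.step()

Under MVAPICH2-GDR, the allreduce traffic that Horovod generates during backward() is what the GPU-optimized collectives in the talk accelerate.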
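For "Benchmark #1", a single-GPU CuPy analogue of the sum-of-an-array-and-its-transpose operation. The syllabus lists the benchmark alongside the Dask material, so the course version presumably runs distributed across workers; this sketch only shows the core kernel, with an assumed matrix size and CUDA-event timing.

    # Single-GPU CuPy analogue of the sum-of-array-and-transpose kernel;
    # the matrix size and the timing loop are illustrative assumptions.
    import cupy as cp

    n = 4096
    a = cp.random.rand(n, n, dtype=cp.float64)

    start, stop = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    b = a + a.T                                  # element-wise sum on the GPU
    stop.record()
    stop.synchronize()
    print("elapsed:", cp.cuda.get_elapsed_time(start, stop), "ms")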
Taught by
Linux Foundation