Optcast: Open-Source Aggregation Offloading for Distributed Deep Learning

Open Compute Project via YouTube

Overview

Learn about an open-source prototype for in-network aggregation in this 18-minute technical talk by NTT researcher Nariaki Tateiwa. Explore how distributed deep learning across multiple GPUs depends on the performance of the Allreduce collective, and how offloading aggregation can improve it. Discover an approach reported to run 1.72x faster than standard NCCL Allreduce while supporting various transport protocols, including Socket, InfiniBand, and RoCE, without requiring specialized hardware. Understand the advantage of performing aggregation in the network, on switches or servers, which needs roughly half the bandwidth of traditional Allreduce algorithms, and learn why standardizing these techniques through OCP matters for broader datacenter adoption. Gain insight into how this NCCL plugin implementation makes high-performance distributed AI model training accessible across different computing environments.
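The bandwidth comparison in the overview can be illustrated with a small back-of-the-envelope model. The sketch below is not Optcast's code or NCCL's API; all function names are hypothetical. It assumes the standard cost model for ring Allreduce, in which each node sends 2(N-1)/N times the buffer size, and assumes that with in-network aggregation each node sends its buffer once to an aggregator, which returns the element-wise sum.

```python
from typing import List, Sequence


def ring_allreduce_bytes_per_node(num_nodes: int, message_bytes: int) -> float:
    """Bytes each node sends in ring Allreduce (reduce-scatter + all-gather).

    Assumption: textbook cost model, 2 * (N - 1) / N * message size.
    """
    return 2 * (num_nodes - 1) / num_nodes * message_bytes


def in_network_bytes_per_node(message_bytes: int) -> float:
    """Bytes each node sends when an aggregator reduces on its behalf.

    Assumption: every node uploads exactly one copy of its buffer.
    """
    return float(message_bytes)


def aggregate(buffers: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum, i.e. the reduction a switch or server would perform."""
    return [sum(values) for values in zip(*buffers)]


def traffic_ratio(num_nodes: int, message_bytes: int) -> float:
    """How many times more data a node sends with ring Allreduce."""
    return (ring_allreduce_bytes_per_node(num_nodes, message_bytes)
            / in_network_bytes_per_node(message_bytes))


if __name__ == "__main__":
    # Three hypothetical workers, each holding a two-element gradient.
    print(aggregate([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
    # The ratio approaches 2 as the number of nodes grows.
    for n in (2, 8, 64, 1024):
        print(n, traffic_ratio(n, 1 << 20))
```

Under these assumptions the ratio tends toward 2 for large node counts, which is one way to read the "half the bandwidth" figure. The measured 1.72x speedup cited above comes from the talk and is not derived from this model.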

Taught by

Open Compute Project
