

Hardware-aware Algorithms for Language Modeling - FlashAttention and Mamba Architectures

Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube

Overview

Watch a technical lecture from SPCL_Bcast #50 in which Tri Dao presents recent developments in hardware-aware algorithms for language modeling. The first part covers FlashAttention, focusing on FlashAttention-3's optimizations for NVIDIA Hopper GPUs: warp specialization, interleaving of block-wise matmul and softmax operations, and block quantization that exploits hardware FP8 support. Learn how FlashAttention-3 achieves a 1.5-2.0× speedup with FP16 on H100 GPUs and reaches up to 1.3 PFLOPs/s with FP8.

The second part explores subquadratic-time architectures, specifically structured state space models (SSMs), and shows how the Mamba and Mamba-2 architectures address SSMs' weakness at content-based reasoning through a selection mechanism that makes the SSM parameters functions of the input. Understand how these simplified architectures match or surpass modern Transformers in language modeling while delivering 5× higher inference throughput and linear scaling in sequence length, validated at the 3B-parameter scale.
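To make the block-wise attention idea concrete, here is a minimal NumPy sketch of tiled attention with an online softmax, the core trick FlashAttention builds on. It is an illustration only: the function name, shapes, and block size are assumptions, and the real kernels fuse these loops in on-chip memory and add the Hopper-specific optimizations (warp specialization, FP8 block quantization) discussed in the talk.

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=128):
    """Hypothetical sketch of tiled attention with an online softmax.

    Processes keys/values one tile at a time, so the full N x N score
    matrix is never materialized; only running statistics are kept.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)           # running (unnormalized) output
    m = np.full(N, -np.inf)        # running row-wise max of scores
    l = np.zeros(N)                # running row-wise softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]       # one key tile
        Vb = V[start:start + block_size]       # matching value tile
        S = (Q @ Kb.T) * scale                 # scores against this tile only

        m_new = np.maximum(m, S.max(axis=1))   # update running max
        P = np.exp(S - m_new[:, None])         # tile-local exponentials
        correction = np.exp(m - m_new)         # rescale old accumulators
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]                      # final softmax normalization
```

On random inputs this matches naive softmax(QKᵀ/√d)V up to floating-point error, while only one N × block_size tile of scores exists at any time, which is what makes the hardware-aware fused kernel possible.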
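The selection mechanism in Mamba can likewise be illustrated with a short sequential reference of a selective SSM recurrence. This is a hedged sketch, not the released implementation: the function name, single-channel shapes, and the simplified Euler-style treatment of B are assumptions made for brevity; production code runs a hardware-aware parallel scan over all channels at once.

```python
import numpy as np

def selective_scan(x, dt, A, B, C):
    """Hypothetical single-channel reference for a Mamba-style selective SSM.

    h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * x_t   (diagonal A)
    y_t = C_t . h_t

    Shapes (assumed): x (L,), dt (L,), A (n,), B (L, n), C (L, n).
    """
    L, n = B.shape
    h = np.zeros(n)                 # hidden state
    y = np.zeros(L)
    for t in range(L):
        # dt[t], B[t], C[t] are computed from the input at step t ("selection"),
        # so the model decides per token what to write into and read from the state.
        h = np.exp(dt[t] * A) * h + dt[t] * B[t] * x[t]
        y[t] = C[t] @ h
    return y
```

The key point is that dt, B, and C vary with the input at each step; this input dependence is what distinguishes the selective SSM in Mamba from earlier time-invariant SSMs and enables content-based reasoning while keeping cost linear in sequence length.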

Syllabus

[SPCL_Bcast #50] Hardware-aware Algorithms for Language Modeling

Taught by

Scalable Parallel Computing Lab, SPCL @ ETH Zurich
