Watch a technical lecture in which Tri Dao presents recent developments in hardware-optimized algorithms for language modeling at SPCL_Bcast #50. Dive into the latest advances in FlashAttention, focusing on optimizations for Hopper GPUs: warp-specialization, interleaving of block-wise matmul and softmax operations, and block quantization with FP8 support. Learn how FlashAttention-3 achieves its performance gains on H100 GPUs, reaching a 1.5-2.0× speedup over FlashAttention-2 with FP16 and up to 1.3 PFLOPs/s with FP8.

Then explore the evolution of subquadratic-time architectures, specifically structured state space models (SSMs), and discover how the Mamba and Mamba-2 architectures overcome earlier SSMs' inability to perform content-based reasoning through a selection mechanism that lets the state update depend on the input. Understand how these simplified architectures match or surpass modern Transformers in language modeling while delivering 5× higher inference throughput and linear scaling in sequence length, validated at the 3B-parameter scale.
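To make the attention side concrete, the sketch below is a minimal NumPy illustration, not Tri Dao's kernels (which are fused CUDA), of the core FlashAttention idea: process keys and values block by block while keeping a running, numerically stable softmax, so the full attention matrix is never materialized. The function name and block size are illustrative choices, not part of the lecture.

import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    # Single-head attention computed block by block with an online softmax.
    # Toy self-attention sketch: Q, K, V are (seq_len, head_dim) arrays.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of logits per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale                      # one block of logits
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)           # rescale past accumulators
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Quick check against the naive quadratic formulation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)

Because only one block of scores is held at a time, memory traffic scales with the block size rather than the full sequence length, which is the property the fused GPU kernels exploit.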
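On the SSM side, the following toy sketch (again NumPy, purely sequential, with hypothetical parameter names and a simplified Euler-style discretization) illustrates the selection mechanism: the projections B_t and C_t and the step size dt_t are computed from the current input, so the recurrence can decide per token what to write into and read out of its fixed-size state. The real Mamba layer evaluates this recurrence with a hardware-aware parallel scan rather than a Python loop.

import numpy as np

def selective_ssm(x, A, W_B, W_C, w_dt, b_dt):
    # x:    (seq_len, d_model) input sequence
    # A:    (d_model, d_state) fixed state matrix, kept negative for a decaying state
    # W_B:  (d_model, d_state) projection giving the input-dependent B_t
    # W_C:  (d_model, d_state) projection giving the input-dependent C_t
    # w_dt: (d_model,) and b_dt: (d_model,) give the input-dependent step size
    seq_len, d_model = x.shape
    d_state = A.shape[1]
    h = np.zeros((d_model, d_state))                 # one small state per channel
    y = np.zeros((seq_len, d_model))
    for t in range(seq_len):
        dt = np.logaddexp(0.0, x[t] @ w_dt + b_dt)   # softplus keeps the step positive
        B_t = x[t] @ W_B                              # selection: B depends on the input
        C_t = x[t] @ W_C                              # selection: C depends on the input
        A_bar = np.exp(dt[:, None] * A)               # discretized state transition
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * x[t][:, None]
        y[t] = h @ C_t
    return y

# Tiny usage example with random, illustrative parameters.
rng = np.random.default_rng(0)
d_model, d_state, T = 8, 4, 32
x = rng.standard_normal((T, d_model))
A = -np.exp(rng.standard_normal((d_model, d_state)))
W_B, W_C = rng.standard_normal((d_model, d_state)) * 0.1, rng.standard_normal((d_model, d_state)) * 0.1
w_dt, b_dt = rng.standard_normal(d_model) * 0.1, rng.standard_normal(d_model) * 0.1
print(selective_ssm(x, A, W_B, W_C, w_dt, b_dt).shape)   # (32, 8)

Because the state has a fixed size regardless of context length, per-token generation cost stays constant, which is where the reported inference-throughput advantage over attention-based models comes from.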