Watch a technical lecture from SPCL_Bcast #50 in which Tri Dao presents recent work on hardware-aware algorithms for language modeling. The first part covers FlashAttention-3 and its optimizations for Hopper GPUs: warp-specialization to overlap computation with data movement, interleaving of block-wise matmul and softmax operations, and block quantization that takes advantage of hardware FP8 support. Learn how FlashAttention-3 achieves a 1.5-2.0× speedup over FlashAttention-2 on H100 GPUs with FP16 and reaches up to 1.3 PFLOPs/s with FP8. The second part covers subquadratic-time architectures, specifically structured state space models (SSMs), and shows how Mamba and Mamba-2 address the weakness of earlier SSMs in content-based reasoning through a selection mechanism. Understand how these simplified architectures match or surpass modern Transformers in language modeling at scales up to 3B parameters while delivering 5× higher inference throughput and scaling linearly with sequence length.
Hardware-aware Algorithms for Language Modeling - FlashAttention and Mamba Architectures
Scalable Parallel Computing Lab, SPCL @ ETH Zurich via YouTube
Overview
Syllabus
[SPCL_Bcast #50] Hardware-aware Algorithms for Language Modeling
Taught by
Scalable Parallel Computing Lab, SPCL @ ETH Zurich