Overview
Explore a comprehensive presentation on SparQ Attention, delivered by Ivan Chelombiev and Luka Ribar from GraphCore. Delve into their work on increasing the inference throughput of Large Language Models (LLMs) by reducing memory bandwidth requirements in attention blocks. Learn about the technique of selectively fetching the cached history, which can be applied to existing LLMs during inference without modifying pre-training or requiring additional fine-tuning. Discover how SparQ Attention can decrease attention memory bandwidth requirements by up to eight times while maintaining accuracy, as demonstrated through evaluations of Llama 2 and Pythia models on a range of downstream tasks. Gain insights into the latest advances in AI optimization and LLM efficiency, and understand the potential impact of this research on the future of language model deployment and performance.
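To illustrate the selective-fetching idea, here is a minimal, simplified single-head sketch in NumPy. The names (q, K, V) and the parameters r and k are illustrative assumptions, the scaling and batching details of the published method are omitted, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparq_style_attention(q, K, V, r=16, k=64):
    """Simplified sketch: q is a (d,) query; K and V are (seq, d) cached keys/values."""
    d = q.shape[-1]
    k = min(k, K.shape[0])

    # Step 1: approximate the attention scores by reading only the r key
    # components aligned with the largest-magnitude query components.
    idx_r = np.argsort(-np.abs(q))[:r]
    s_hat = softmax(q[idx_r] @ K[:, idx_r].T / np.sqrt(d))

    # Step 2: fetch full keys/values only for the top-k positions under the
    # approximate scores, and attend over that small subset.
    idx_k = np.argsort(-s_hat)[:k]
    s = softmax(q @ K[idx_k].T / np.sqrt(d))
    y_top = s @ V[idx_k]

    # Step 3: blend with the mean cached value, weighted by how much of the
    # approximate probability mass the selected positions carry.
    alpha = s_hat[idx_k].sum()
    return alpha * y_top + (1.0 - alpha) * V.mean(axis=0)
```

The bandwidth saving comes from steps 1 and 2: only r components of each cached key and the k most relevant full key/value pairs need to be transferred from memory, rather than the entire cached history.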
Syllabus
We're very excited to welcome both Ivan Chelombiev and Luka Ribar from GraphCore. They will be presenting their work on SparQ Attention.
Taught by
Unify