

Quant-LLM: Accelerating Large Language Model Serving via FP6-Centric Algorithm-System Co-Design

USENIX via YouTube

Overview

Explore a conference talk that delves into Quant-LLM, an approach for accelerating the serving of large language models through FP6-centric algorithm-system co-design on modern GPUs. Learn about the challenges of supporting FP6 quantization on GPUs and discover TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and other irregular quantization bit-widths. Understand how Quant-LLM integrates the TC-FPx kernel into an existing inference system, providing new end-to-end support for quantized LLM inference and achieving better trade-offs between inference cost and model quality. Examine experimental results showing that Quant-LLM enables LLaMA-70b inference on a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. Access the publicly available source code and gain insights into how 6-bit quantization can effectively reduce LLM size while preserving model quality across a variety of applications.
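To make the FP6 idea concrete, below is a minimal Python sketch of what quantizing weights to a 6-bit floating-point format means numerically. It is not the paper's TC-FPx GPU kernel: the E3M2 bit split (1 sign, 3 exponent, 2 mantissa bits) and the exponent bias of 3 are assumptions chosen for illustration, and production systems typically also apply per-channel scaling before rounding.

    # Illustrative sketch only: round FP32 weights to the nearest value
    # representable in an assumed FP6 E3M2 format (1 sign, 3 exponent,
    # 2 mantissa bits, exponent bias 3).
    import numpy as np

    def fp6_e3m2_values():
        """Enumerate every value representable in the assumed E3M2 layout."""
        vals = []
        for sign in (1.0, -1.0):
            for exp in range(8):           # 3 exponent bits
                for man in range(4):       # 2 mantissa bits
                    if exp == 0:           # subnormals: (man/4) * 2^(1 - bias)
                        v = (man / 4.0) * 2.0 ** (1 - 3)
                    else:                  # normals: (1 + man/4) * 2^(exp - bias)
                        v = (1.0 + man / 4.0) * 2.0 ** (exp - 3)
                    vals.append(sign * v)
        return np.unique(np.array(vals, dtype=np.float32))

    def quantize_to_fp6(weights, grid):
        """Round each weight to its nearest FP6 grid point."""
        idx = np.abs(weights[..., None] - grid[None, ...]).argmin(axis=-1)
        return grid[idx]

    grid = fp6_e3m2_values()
    w = (np.random.randn(4, 4) * 0.1).astype(np.float32)
    w_q = quantize_to_fp6(w, grid)
    print("max abs quantization error:", np.abs(w - w_q).max())

Brute-force nearest-value rounding over the grid is fine here because FP6 has at most 64 representable values; the engineering challenge the talk addresses is not the rounding itself but executing matrix multiplications over such irregular 6-bit weights efficiently on Tensor Cores.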

Syllabus

USENIX ATC '24 - Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric...

Taught by

USENIX
