Overview
Explore a conference talk that delves into Quant-LLM, an approach to accelerating the serving of large language models through FP6-centric algorithm-system co-design on modern GPUs. Learn about the challenges of supporting FP6 quantization on GPUs and discover TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary bit-width quantization. Understand how Quant-LLM integrates the TC-FPx kernel into existing inference systems, providing new end-to-end support for quantized LLM inference and achieving better trade-offs between inference cost and model quality. Examine experimental results demonstrating Quant-LLM's ability to enable LLaMA-70b inference on a single GPU, achieving significantly higher normalized inference throughput than the FP16 baseline. Access the publicly available source code and gain insights into the potential of six-bit quantization for reducing LLM size while preserving model quality across various applications.
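To make the six-bit quantization mentioned above concrete, what follows is a minimal, illustrative Python sketch of FP6 round-to-nearest quantization. It assumes an E3M2 bit layout (1 sign bit, 3 exponent bits, 2 mantissa bits, exponent bias 3, no inf/NaN encodings); the talk's exact FP6 format, scaling scheme, and Tensor Core-level bit packing are not reproduced here, and the function names are hypothetical.

    import numpy as np

    # Illustrative sketch (not the talk's implementation): a reference,
    # non-performant FP6 quantizer. Assumed layout: 1 sign bit,
    # 3 exponent bits, 2 mantissa bits, exponent bias 3, no inf/NaN.
    EXP_BITS, MAN_BITS = 3, 2
    BIAS = (1 << (EXP_BITS - 1)) - 1  # bias = 3

    def fp6_values():
        """Enumerate every value representable in the assumed FP6 E3M2 format."""
        vals = []
        for e in range(1 << EXP_BITS):          # exponent field: 0..7
            for m in range(1 << MAN_BITS):      # mantissa field: 0..3
                if e == 0:                      # subnormal: 0.m * 2^(1-bias)
                    mag = (m / (1 << MAN_BITS)) * 2.0 ** (1 - BIAS)
                else:                           # normal: 1.m * 2^(e-bias)
                    mag = (1 + m / (1 << MAN_BITS)) * 2.0 ** (e - BIAS)
                vals.extend([mag, -mag])
        return np.unique(np.array(vals))

    GRID = fp6_values()  # 63 distinct representable values in [-28, 28]

    def quantize_fp6(x):
        """Round each float to the nearest FP6-representable value,
        saturating at the format's maximum magnitude."""
        x = np.clip(np.asarray(x, dtype=np.float64), GRID.min(), GRID.max())
        idx = np.abs(x[..., None] - GRID).argmin(axis=-1)
        return GRID[idx]

    if __name__ == "__main__":
        w = np.random.randn(8).astype(np.float32)
        print("fp32:", w)
        print("fp6 :", quantize_fp6(w))

In a real serving stack, weights would first be scaled (for example, per output channel) so their range matches the format, and the packed 6-bit codes would be dequantized on the fly inside the GPU GEMM kernel; the lookup-table rounding above is only a reference for which values a 6-bit float can represent.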
Syllabus
USENIX ATC '24 - Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
Taught by
USENIX