Unlocking Local LLMs with Quantization
Overview
Learn about quantization's evolution and its impact on local Large Language Models in this 40-minute conference talk from Hugging Face's Marc Sun. Trace quantization's development through influential papers such as QLoRA and GPTQ, and discover its practical applications across different stages of model development. Gain insights into pre-training a 1.58-bit model, fine-tuning with PEFT + QLoRA, and optimizing inference performance using torch.compile or custom kernels. Understand how the open-source community is making quantized models more accessible through the transformers library and GGUF models from llama.cpp, enabling broader adoption of local LLMs.
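To give a flavor of the core idea behind the techniques covered in the talk, here is a minimal sketch of symmetric 8-bit weight quantization in plain Python. The function names are illustrative only and do not come from transformers, PEFT, or llama.cpp; real methods like GPTQ and QLoRA are considerably more sophisticated (per-group scales, non-uniform data types, calibration).

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map floats onto the
    # integer range [-127, 127] using a single scale factor.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float weights from the stored integers.
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 2.0]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Storing 8-bit (or 4-bit, or 1.58-bit) integers plus a scale instead of 16/32-bit floats is what shrinks memory footprint enough to run LLMs locally.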
Syllabus
Unlocking Local LLMs with Quantization - Marc Sun, Hugging Face
Taught by
Linux Foundation