
Quantizing LLMs and Converting to GGUF Format for Faster and Smaller Models

Venelin Valkov via YouTube

Overview

Learn how to optimize large language models (LLMs) through quantization in this video tutorial, which walks through converting and deploying models for resource-constrained environments. Master model quantization with llama.cpp, serve the result with Ollama, and compare performance between the full-precision and quantized versions. Follow along with hands-on demonstrations covering Google Colab setup, llama.cpp installation, GGUF conversion, model evaluation, and integration with Ollama. Gain practical insight into reducing model size and improving inference speed without significant accuracy loss, using a fine-tuned Llama 3.2 model as the running example. Ideal for developers and machine learning practitioners who want to deploy LLMs efficiently on devices with limited computational resources.
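To make the workflow concrete, here is a minimal sketch of the conversion and quantization steps, assuming a local llama.cpp checkout built with CMake and a fine-tuned model directory. The file names, paths, and the Q4_K_M preset are illustrative assumptions, not the exact commands from the video.

```python
# Hypothetical sketch: convert a Hugging Face checkpoint to GGUF, then
# quantize it with llama.cpp. All paths and file names are assumptions.
import subprocess

MODEL_DIR = "llama-3.2-finetuned"            # fine-tuned HF model directory (assumed)
F16_GGUF = "llama-3.2-finetuned-f16.gguf"    # full-precision GGUF output
Q4_GGUF = "llama-3.2-finetuned-q4_k_m.gguf"  # 4-bit quantized output

# Step 1: convert the Hugging Face model to a full-precision GGUF file
# using llama.cpp's conversion script.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", MODEL_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the GGUF file with the llama-quantize binary built
# from llama.cpp (Q4_K_M is a common size/quality trade-off).
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", F16_GGUF, Q4_GGUF, "Q4_K_M"],
    check=True,
)
```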
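Once a quantized GGUF file exists, it can be loaded directly from Python. The snippet below is a small sketch using the llama-cpp-python bindings mentioned in the syllabus; the model file name carries over from the sketch above and is an assumption.

```python
# Minimal llama-cpp-python inference sketch; the GGUF file name is assumed.
from llama_cpp import Llama

# Load the quantized model with a modest context window.
llm = Llama(model_path="llama-3.2-finetuned-q4_k_m.gguf", n_ctx=2048)

# Run a short chat completion against the quantized model.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    max_tokens=64,
)
print(output["choices"][0]["message"]["content"])
```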
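Finally, the same GGUF file can be registered with Ollama. This sketch writes a one-line Modelfile pointing at the quantized file, then creates and runs the model via the Ollama CLI; the model name is a made-up placeholder.

```python
# Hypothetical sketch: register the quantized GGUF with Ollama and run it.
import subprocess

# A Modelfile whose FROM line points at the local GGUF file.
with open("Modelfile", "w") as f:
    f.write("FROM ./llama-3.2-finetuned-q4_k_m.gguf\n")

# Create a named Ollama model from the Modelfile, then run a quick prompt.
subprocess.run(["ollama", "create", "my-llama-3.2", "-f", "Modelfile"], check=True)
subprocess.run(["ollama", "run", "my-llama-3.2", "Hello!"], check=True)
```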

Syllabus

- Welcome
- Text tutorial on MLExpert.io
- Fine-tuned model on HuggingFace
- Why quantize your model?
- Google Colab Setup
- Install llama.cpp
- Convert HF model to GGUF
- Run the quantized model with llama-cpp-python
- Evaluate full-precision vs quantized model
- Use your quantized model in Ollama
- Conclusion

Taught by

Venelin Valkov
