Running Gemma 2B and Llama-2 7B with Model Quantization - A Hands-on Lab
Overview
Learn to run open-weight Large Language Models (LLMs) in this hands-on lab tutorial that explores the Gemma 2B and Llama-2 7B models on Google Colab's free T4 GPU tier. Discover essential resources for building LLMs from scratch, and learn how to estimate the hardware requirements for deploying a given model.

Master practical techniques for loading and prompting Gemma 2B, including completion, zero-shot, and one-shot approaches. Explore the differences between the base and instruction-tuned versions, examining the impact of fine-tuning and RLHF on model behavior.

Gain expertise in model quantization across 16-bit, 8-bit, and 4-bit implementations, and learn to load and run Llama-2 7B with 4-bit quantization. A comprehensive lab notebook accompanies the tutorial for hands-on practice, with detailed demonstrations of chat formatting and model optimization strategies.
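A quick way to reason about those hardware requirements is to estimate weight memory as parameter count times bytes per parameter. The Python sketch below is illustrative, not taken from the lab notebook: the parameter counts are approximate published figures, and the estimate ignores KV cache, activations, and framework overhead, so real usage runs higher. It still shows why Llama-2 7B is tight on a 16 GB T4 at 16-bit but comfortable at 4-bit.

```python
# Rough weight-memory estimate: parameters * bytes per parameter.
# Parameter counts are approximate; overhead (KV cache, activations,
# CUDA context) is ignored, so treat these as lower bounds.
def weights_gib(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

for name, params in [("Gemma 2B", 2.5), ("Llama-2 7B", 6.7)]:
    for label, width in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
        print(f"{name} @ {label}: ~{weights_gib(params, width):.1f} GiB")
```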
Syllabus
- Resources to build your own LLM from scratch
- Determining hardware requirements to run LLMs
- Loading Gemma 2B (see the loading-and-prompting sketch after this list)
- Prompting Gemma 2B: completion, zero-shot, and one-shot
- Loading and using the instruction-tuned version of Gemma 2B with chat formatters (chat-template sketch below)
- Model quantization: 16-bit, 8-bit, and 4-bit
- Loading Llama-2 7B with 4-bit quantization (quantized-loading sketch below)
- Running Llama-2 7B Chat
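As a rough illustration of the Gemma 2B steps above, here is a minimal sketch using Hugging Face transformers, assuming a Colab T4 runtime and that you have accepted the Gemma license on the Hub and authenticated. The model ID google/gemma-2b is the public base checkpoint; the prompts are illustrative, not taken from the lab.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # base (non-instruct) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits the T4's 16 GB
    device_map="auto",
)

def complete(prompt: str, max_new_tokens: int = 30) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Completion: the base model simply continues the text.
print(complete("The capital of France is"))

# Zero-shot: state the task directly and let the model answer.
print(complete("Classify the sentiment as positive or negative.\n"
               "Review: I loved this movie!\nSentiment:"))

# One-shot: prepend a single worked example to fix the output format.
print(complete("Review: The plot was dull.\nSentiment: negative\n"
               "Review: I loved this movie!\nSentiment:"))
```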
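For the instruction-tuned variant, the tokenizer ships a chat template that emits Gemma's turn markers, so you don't have to format the conversation by hand. A minimal sketch, assuming the google/gemma-2b-it checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"  # instruction-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain RLHF in one sentence."}]
# apply_chat_template wraps the turn in Gemma's expected control tokens.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(input_ids, max_new_tokens=60)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```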
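Quantization shrinks the per-parameter byte width (16-bit to 8-bit to 4-bit) at some cost in fidelity, which is what lets a 7B model fit on a T4. One common route, sketched below under the assumption that bitsandbytes is used via transformers' BitsAndBytesConfig, combines the 4-bit load with a chat-template generation; the Llama-2 checkpoint is gated, so access must be granted on the Hub first.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization: weights stored in 4 bits, matmuls run in fp16.
# 8-bit follows the same pattern with BitsAndBytesConfig(load_in_8bit=True).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated: request access on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Llama-2 Chat's prompt format (including a system turn) is applied by
# the tokenizer's chat template; the messages here are illustrative.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does 4-bit quantization trade off?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(input_ids, max_new_tokens=80)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```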
Taught by
Donato Capitella