Explore memory optimization techniques for large language models in this technical talk featuring Seonyeong Heo from Kyung-Hee University, who demonstrates how to deploy 7-billion-parameter models on memory-constrained devices. Learn how key-value caching in decoder-only transformers reduces computational overhead by storing the keys and values computed at earlier decoding steps and reusing them, rather than recomputing them for the whole prefix at every step. Dive into dynamic compression methods for optimizing memory usage, including quantization, pruning, and dimensionality reduction with autoencoders. Understand how weighted quantization can achieve high compression rates while keeping errors minimal through proper fine-tuning. Gain valuable insights into efficient memory management strategies that improve LLM performance in resource-constrained environments, making these powerful models more feasible and energy-efficient for on-device applications.
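
To make the caching idea concrete, here is a minimal sketch of one autoregressive decoding step with a single-head key-value cache. It is illustrative rather than code from the talk: the head dimension `D`, the function name `decode_step`, and the use of NumPy are all assumptions.

```python
import numpy as np

D = 64  # assumed head dimension for illustration

def decode_step(q, k_new, v_new, K, V):
    """One decoding step: append this token's key/value to the cache,
    then attend over all cached positions instead of recomputing
    keys and values for the entire prefix."""
    K = np.vstack([K, k_new[None, :]])   # reuse previously stored keys
    V = np.vstack([V, v_new[None, :]])   # reuse previously stored values
    scores = K @ q / np.sqrt(D)          # one attention score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the cache
    return weights @ V, K, V             # attention output plus the grown cache

# The cache grows by one row per generated token.
K = np.empty((0, D))
V = np.empty((0, D))
for _ in range(3):
    q, k, v = np.random.randn(3, D)
    out, K, V = decode_step(q, k, v, K, V)
```

Because the cache grows linearly with sequence length, its memory footprint, not the arithmetic, becomes the bottleneck for long contexts on small devices, which is what motivates the compression techniques the talk covers next.

One of those compression levers, quantization, can be sketched just as simply. The uniform min-max scheme below is a generic illustration, not necessarily the talk's weighted variant; the 8-bit width and helper names are assumptions.

```python
import numpy as np

def quantize(X, bits=8):
    """Uniform min-max quantization: store low-bit integer codes plus
    the scale/offset needed to reverse them (assumes bits <= 8)."""
    lo, hi = float(X.min()), float(X.max())
    scale = (hi - lo) / (2**bits - 1) or 1.0  # guard against constant input
    codes = np.round((X - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

X = np.random.randn(4, 64).astype(np.float32)  # stand-in for cached tensors
codes, scale, lo = quantize(X)                 # 4x smaller than float32
max_err = np.abs(dequantize(codes, scale, lo) - X).max()  # about scale / 2
```

Going from 32-bit floats to 8-bit codes already cuts memory by a factor of four, with per-element rounding error bounded by half a quantization step; per the description, the talk's weighted quantization pushes toward higher compression rates while fine-tuning keeps the remaining error minimal.

The autoencoder direction can be illustrated as well. The sketch below uses SVD (PCA), the optimal linear autoencoder, to compress cached vectors from D to d dimensions; in the talk's setting the encoder and decoder would be learned networks, and the dimensions here are assumptions.

```python
import numpy as np

# Compress D-dimensional cached vectors to d dimensions and reconstruct them.
D, d = 64, 16                        # assumed original and compressed widths
X = np.random.randn(1024, D)         # stand-in for cached vectors

Xc = X - X.mean(axis=0)              # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:d].T                         # (D, d) encoder: top-d principal axes

Z = Xc @ W                           # encode: 4x fewer values to store
X_hat = Z @ W.T + X.mean(axis=0)     # decode: approximate reconstruction
mse = np.mean((X - X_hat) ** 2)      # reconstruction error paid for compression
```

Storing the d-dimensional codes instead of the original vectors yields a D/d compression ratio, traded against the reconstruction error `mse`.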
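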
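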