

Memory Optimization Techniques for On-Device Large Language Models

tinyML via YouTube

Overview

Explore memory optimization techniques for large language models (LLMs) in this technical talk by Seonyeong Heo of Kyung Hee University, who shows how 7-billion-parameter models can be deployed on memory-constrained devices. Learn how key-value (KV) caching in decoder-only transformers stores the key and value tensors computed for earlier tokens and reuses them at each generation step, so the model does not recompute them for the whole sequence every time it emits a new token. Then dive into dynamic compression methods for reducing memory usage, including quantization, pruning, and dimensionality reduction with autoencoders. See how weighted quantization can reach high compression rates while keeping errors small when it is paired with appropriate fine-tuning. Along the way, pick up memory management strategies that make LLMs more practical and energy-efficient for on-device applications in resource-constrained environments.
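To make the memory budget concrete, here is a minimal Python sketch that is not taken from the talk. It assumes a Llama-style 7B configuration (32 layers, hidden size 4096, standard multi-head attention, fp16 KV cache) and uses plain symmetric per-tensor int8 quantization, a much simpler scheme than the weighted quantization the talk discusses. The helper names `weight_memory_gb`, `kv_cache_bytes`, `quantize_int8`, and `dequantize_int8` are hypothetical, chosen only for illustration.

```python
def weight_memory_gb(num_params: int, bits: int) -> float:
    """Storage needed for the weights at a given bit width, in GB (1e9 bytes)."""
    return num_params * bits / 8 / 1e9


def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size: one key and one value vector of width hidden_size
    per layer per cached token (assumes standard multi-head attention,
    no grouped-query attention; bytes_per_elem=2 means fp16)."""
    return 2 * num_layers * hidden_size * seq_len * batch * bytes_per_elem


def quantize_int8(values):
    """Symmetric per-tensor int8 quantization:
    scale = max|x| / 127, q = clamp(round(x / scale), -127, 127)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [qi * scale for qi in q]


if __name__ == "__main__":
    n_params = 7_000_000_000
    for bits in (16, 8, 4):
        # Prints 14.0, 7.0 and 3.5 GB for 16-, 8- and 4-bit weights
        print(f"{bits}-bit weights: {weight_memory_gb(n_params, bits):.1f} GB")

    # Assumed 7B config: 32 layers, hidden size 4096, 4096-token context
    kv = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=4096)
    print(f"fp16 KV cache for 4096 tokens: {kv / 2**30:.1f} GiB")  # 2.0 GiB

    x = [0.12, -0.5, 0.33, 0.99, -0.01]
    q, scale = quantize_int8(x)
    x_hat = dequantize_int8(q, scale)
    err = max(abs(a - b) for a, b in zip(x, x_hat))
    # Round-to-nearest keeps each error within half a quantization step
    print(q, f"max abs error = {err:.4f} (bound {scale / 2:.4f})")
```

Under these assumptions the weights alone need about 14 GB at 16 bits, which is why lower bit widths matter on devices, and the KV cache grows linearly with context length, adding another 2 GiB for a 4096-token context.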

Syllabus

Memory Optimization for On-Device LLMs

Taught by

tinyML
