Explore memory optimization techniques for large language models in this technical talk featuring Seonyeong Heo from Kyung-Hee University, who demonstrates how to deploy 7-billion-parameter models on memory-constrained devices. Learn how key-value caching in decoder-only transformers reduces computational overhead by storing the keys and values computed at earlier decoding steps and reusing them, rather than recomputing them for the whole prefix at every step. Dive into dynamic compression methods for optimizing memory usage, including quantization, pruning, and dimensionality reduction with autoencoders. Understand how weighted quantization can achieve high compression rates while keeping errors minimal through proper fine-tuning. Gain valuable insights into efficient memory management strategies that improve LLM performance in resource-constrained environments, making these powerful models more feasible and energy-efficient for on-device applications.
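
To make the caching idea concrete, here is a minimal sketch of one autoregressive decoding step with a single-head key-value cache. It is illustrative rather than code from the talk: the head dimension `D`, the function name `decode_step`, and the use of NumPy are all assumptions.

```python
import numpy as np

D = 64  # assumed head dimension for illustration

def decode_step(q, k_new, v_new, K, V):
    """One decoding step: append this token's key/value to the cache,
    then attend over all cached positions instead of recomputing
    keys and values for the entire prefix."""
    K = np.vstack([K, k_new[None, :]])   # reuse previously stored keys
    V = np.vstack([V, v_new[None, :]])   # reuse previously stored values
    scores = K @ q / np.sqrt(D)          # one attention score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the cache
    return weights @ V, K, V             # attention output plus the grown cache

# The cache grows by one row per generated token.
K = np.empty((0, D))
V = np.empty((0, D))
for _ in range(3):
    q, k, v = np.random.randn(3, D)
    out, K, V = decode_step(q, k, v, K, V)
```

Because the cache grows linearly with sequence length, its memory footprint, not the arithmetic, becomes the bottleneck for long contexts on small devices, which is what motivates the compression techniques the talk covers next.

One of those compression levers, quantization, can be sketched just as simply. The uniform min-max scheme below is a generic illustration, not necessarily the talk's weighted variant; the 8-bit width and helper names are assumptions.

```python
import numpy as np

def quantize(X, bits=8):
    """Uniform min-max quantization: store low-bit integer codes plus
    the scale/offset needed to reverse them (assumes bits <= 8)."""
    lo, hi = float(X.min()), float(X.max())
    scale = (hi - lo) / (2**bits - 1) or 1.0  # guard against constant input
    codes = np.round((X - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

X = np.random.randn(4, 64).astype(np.float32)  # stand-in for cached tensors
codes, scale, lo = quantize(X)                 # 4x smaller than float32
max_err = np.abs(dequantize(codes, scale, lo) - X).max()  # about scale / 2
```

Going from 32-bit floats to 8-bit codes already cuts memory by a factor of four, with per-element rounding error bounded by half a quantization step; per the description, the talk's weighted quantization pushes toward higher compression rates while fine-tuning keeps the remaining error minimal.

The autoencoder direction can be illustrated as well. The sketch below uses SVD (PCA), the optimal linear autoencoder, to compress cached vectors from D to d dimensions; in the talk's setting the encoder and decoder would be learned networks, and the dimensions here are assumptions.

```python
import numpy as np

# Compress D-dimensional cached vectors to d dimensions and reconstruct them.
D, d = 64, 16                        # assumed original and compressed widths
X = np.random.randn(1024, D)         # stand-in for cached vectors

Xc = X - X.mean(axis=0)              # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:d].T                         # (D, d) encoder: top-d principal axes

Z = Xc @ W                           # encode: 4x fewer values to store
X_hat = Z @ W.T + X.mean(axis=0)     # decode: approximate reconstruction
mse = np.mean((X - X_hat) ** 2)      # reconstruction error paid for compression
```

Storing the d-dimensional codes instead of the original vectors yields a D/d compression ratio, traded against the reconstruction error `mse`.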
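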
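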