
YouTube

How to Code Long-Context LLMs - LongLoRA Implementation with Llama 2 100K

Discover AI via YouTube

Overview

Learn to implement and understand long-context Large Language Models in this technical tutorial video, which explains how LongLoRA is applied to Llama 2 100K. Dive into essential concepts including Flash Attention 2, vision transformers, and rotary positional embeddings while exploring the theoretical foundations and practical implementation details of extending LLM context lengths. Master the technical aspects of the transformer architecture, embedding and normalization layers, and model tokenization needed to work with extended context lengths such as Claude 100K, ChatGPT 32K, and Llama 2 100K. Explore performance figures and scientific preprints, and understand why certain architectural choices affect long-sequence processing in LLMs. Follow along with code examples optimized for Flash Attention 2 to implement these concepts in your own projects, which is particularly useful when dealing with lengthy scientific articles exceeding 32K or 64K tokens.
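As a rough companion to the video, the sketch below shows one way, assuming the Hugging Face transformers and peft libraries, to load a Llama 2 checkpoint with Flash Attention 2 and linear RoPE (rotary positional embedding) scaling, then attach a LongLoRA-style adapter that also keeps the embedding and normalization layers trainable. The model ID, scaling factor, and LoRA hyperparameters are illustrative placeholders, not the exact settings used in the video or the LongLoRA repo.

```python
# Minimal sketch: long-context Llama 2 with Flash Attention 2 and a
# LongLoRA-style adapter. Requires a CUDA GPU plus the transformers,
# peft, accelerate, and flash-attn packages. Values are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama-2-style checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",     # needs flash-attn installed
    rope_scaling={"type": "linear", "factor": 8.0},  # stretch 4K positions toward ~32K
    device_map="auto",
)

# LongLoRA-style setup: low-rank adapters on the attention projections,
# while the token embedding and RMSNorm layers stay fully trainable.
# Module names follow the Hugging Face Llama implementation.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "norm"],  # trained in full, not low-rank
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Sanity check: tokenize a long document and inspect its length.
long_text = "..."  # e.g. a lengthy scientific article
inputs = tokenizer(long_text, return_tensors="pt")
print(inputs["input_ids"].shape[1], "tokens")
```

Keeping the embedding and normalization layers trainable alongside the low-rank attention adapters is the detail LongLoRA adds on top of plain LoRA to make long-context fine-tuning work well.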

Syllabus

Introduction
Flash Attention
What is LongLoRA
Vision Transformers
Simplest solution
LongLoRA
Why is this happening
Scientific Preprint
Transformer Architecture
Performance figures
Summary
LongLoRA repo
Rotary positional embedding
Model tokenizer
Embedded normalization layers

Taught by

Discover AI

