

LongNet: Understanding Transformer Scaling to 1 Billion Tokens - A Technical Overview

AI Bites via YouTube

Overview

Explore LongNet, a transformer architecture from Microsoft Research, in this 12-minute technical video that breaks down how the model scales its input sequence length to 1 billion tokens. Learn why the quadratic cost of standard self-attention limits sequence length in large language models, starting with an overview of sparse attention mechanisms before diving into the dilated attention approach that lifts this limitation. Discover the implementation details of multi-head dilated attention, distributed training strategies for long sequences, and the performance evaluations that demonstrate LongNet's effectiveness. Through clear explanations and structured segments, gain insight into this ambitious step toward artificial general intelligence, complete with references to foundational work on self-attention and sparse attention architectures.
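For a concrete picture of the mechanism the video describes, here is a minimal single-head, non-causal sketch of one dilated attention branch in NumPy. It is an illustrative assumption, not the paper's reference implementation: the function name `dilated_attention` and its parameters are made up for this sketch. The sequence is split into segments of length `segment_len`, every `dilation`-th position within a segment is kept, and dense attention runs only on that slice, dropping the per-segment cost from O(w²) to O((w/r)²).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(q, k, v, segment_len, dilation):
    """One (segment_len, dilation) branch of dilated attention.

    q, k, v: (seq_len, d) arrays; assumes seq_len is divisible by
    segment_len. Illustrative sketch only, not the paper's code.
    """
    seq_len, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, seq_len, segment_len):
        # Keep every `dilation`-th position inside this segment.
        idx = np.arange(start, start + segment_len, dilation)
        scores = q[idx] @ k[idx].T / np.sqrt(d)  # dense attention on the sparse slice
        out[idx] = softmax(scores) @ v[idx]      # scatter results back in place
    # Positions skipped by this branch stay zero; in LongNet they are
    # covered by other (segment length, dilation) branches.
    return out

# With fixed segment length w and dilation r, each of the N/w segments costs
# O((w/r)^2 * d), so the whole pass is linear in sequence length N, versus
# O(N^2 * d) for full self-attention.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
print(dilated_attention(q, k, v, segment_len=8, dilation=2).shape)  # (16, 8)
```

In the full model, the outputs of several such branches are combined with dynamic weights, and different heads shift which positions each branch selects; those details are omitted here for brevity.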

Syllabus

- Intro
- Computational Complexity in LLMs
- Sparse Attention Paper
- Self-Attention Overview
- Dilated Attention
- Multi-head Dilated Attention
- Distributed Training
- Evaluation of LongNet Dilated Attention

Taught by

AI Bites
