Overview
Syllabus
intro: ChatGPT, Transformers, nanoGPT, Shakespeare
reading and exploring the data
tokenization, train/val split
data loader: batches of chunks of data (see the data-loading sketch below)
simplest baseline: bigram language model, loss, generation (see the bigram sketch below)
training the bigram model
port our code to a script
version 1: averaging past context with for loops, the weakest form of aggregation
the trick in self-attention: matrix multiply as weighted aggregation (see the matrix-multiply sketch below)
version 2: using matrix multiply
version 3: adding softmax
minor code cleanup
positional encoding
THE CRUX OF THE VIDEO: version 4: self-attention (see the attention-head sketch below)
note 1: attention as communication
note 2: attention has no notion of space, operates over sets
note 3: there is no communication across batch dimension
note 4: encoder blocks vs. decoder blocks
note 5: attention vs. self-attention vs. cross-attention
note 6: "scaled" self-attention. why divide by sqrt(head_size)
inserting a single self-attention block into our network
multi-headed self-attention (see the Transformer block sketch below)
feedforward layers of transformer block
residual connections
layernorm and its relationship to our previous batchnorm
scaling up the model! creating a few variables. adding dropout
encoder vs. decoder vs. both (?) Transformers
super quick walkthrough of nanoGPT, batched multi-headed self-attention
back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
conclusions
Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :
Oops I should be using the head_size for the normalization, not C
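The sketches below illustrate several of the syllabus steps. First, a minimal data-loading sketch covering character-level tokenization, the train/val split, and batching of chunks. The toy text, the stoi/itos naming, and the hyperparameter values are illustrative assumptions, not taken from the video.

```python
import torch

text = "to be or not to be"          # stand-in for the Tiny Shakespeare corpus
chars = sorted(set(text))            # vocabulary of unique characters
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]            # string -> list of integers
decode = lambda ids: "".join(itos[i] for i in ids) # list of integers -> string

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))             # 90/10 train/val split
train_data, val_data = data[:n], data[n:]

block_size = 8                       # maximum context length
batch_size = 4

def get_batch(split):
    # sample batch_size random chunks; targets are the inputs shifted by one
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch("train")          # xb, yb: (batch_size, block_size)
```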
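Next, a minimal bigram sketch: a single embedding table whose row i holds the next-token logits for token i, with cross-entropy loss and sampling-based generation. The class and variable names follow common nanoGPT-style conventions and are assumptions here; the training loop is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # row i holds the logits for the token that follows token i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)         # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)   # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx

vocab_size = 65  # Tiny Shakespeare has 65 unique characters
model = BigramLanguageModel(vocab_size)
out = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20)
```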
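The matrix-multiply sketch below covers versions 2 and 3 of the aggregation trick: averaging over past context expressed as a matrix multiply with a lower-triangular weight matrix, and the same weights recovered by masking with -inf and applying softmax. Tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 2: lower-triangular matrix of row-normalized ones
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
xbow2 = wei @ x                                   # (T, T) @ (B, T, C) -> (B, T, C)

# version 3: same weights via masking and softmax (the form self-attention uses)
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float("-inf"))   # tokens from the future cannot communicate
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

assert torch.allclose(xbow2, xbow3)
```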
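The attention-head sketch below shows a single causal ("decoder") self-attention head, including the causal mask that the first erratum refers to and the 1/sqrt(head_size) scaling that the second erratum refers to. Dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C = 4, 8, 32        # batch, time, embedding dimension
head_size = 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)                                         # (B, T, head_size)
q = query(x)                                       # (B, T, head_size)
# scaled attention scores: divide by sqrt(head_size), not by C (per the erratum)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))    # causal mask: no looking at the future
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v                                      # (B, T, head_size)
```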
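Finally, the Transformer block sketch combines multi-headed self-attention, the feed-forward layer, residual connections, and (pre-)layernorm. Hyperparameter names and values are illustrative assumptions; dropout and positional encoding are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

block_size = 8   # maximum context length

class Head(nn.Module):
    """One head of causal self-attention (as in the sketch above)."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q = self.key(x), self.query(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ self.value(x)

class MultiHeadAttention(nn.Module):
    """Several heads in parallel, concatenated over channels, then projected."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(Head(n_embd, head_size) for _ in range(n_head))
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class FeedForward(nn.Module):
    """Per-token MLP; inner dimension is 4x the embedding dimension."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) followed by computation (MLP), with residuals."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # residual connection around attention
        x = x + self.ffwd(self.ln2(x))   # residual connection around feed-forward
        return x

x = torch.randn(4, block_size, 32)       # (batch, time, n_embd)
out = Block(n_embd=32, n_head=4)(x)      # output has the same shape
```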
Taught by
Andrej Karpathy