Overview
Syllabus
intro: ChatGPT, Transformers, nanoGPT, Shakespeare
reading and exploring the data
tokenization, train/val split
data loader: batches of chunks of data (see the data-loading sketch below)
simplest baseline: bigram language model, loss, generation (see the bigram sketch below)
training the bigram model
port our code to a script
version 1: averaging past context with for loops, the weakest form of aggregation
the trick in self-attention: matrix multiply as weighted aggregation (see the matrix-multiply sketch below)
version 2: using matrix multiply
version 3: adding softmax
minor code cleanup
positional encoding
THE CRUX OF THE VIDEO: version 4: self-attention (see the attention-head sketch below)
note 1: attention as communication
note 2: attention has no notion of space, operates over sets
note 3: there is no communication across batch dimension
note 4: encoder blocks vs. decoder blocks
note 5: attention vs. self-attention vs. cross-attention
note 6: "scaled" self-attention. why divide by sqrt(head_size)
inserting a single self-attention block into our network
multi-headed self-attention (see the Transformer block sketch below)
feedforward layers of transformer block
residual connections
layernorm and its relationship to our previous batchnorm
scaling up the model! creating a few variables. adding dropout
encoder vs. decoder vs. both (?) Transformers
super quick walkthrough of nanoGPT, batched multi-headed self-attention
back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
conclusions
Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :
Oops I should be using the head_size for the normalization, not C
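The sketches below illustrate several of the syllabus steps. First, a minimal data-loading sketch covering character-level tokenization, the train/val split, and batching of chunks. The toy text, the stoi/itos naming, and the hyperparameter values are illustrative assumptions, not taken from the video.

```python
import torch

text = "to be or not to be"          # stand-in for the Tiny Shakespeare corpus
chars = sorted(set(text))            # vocabulary of unique characters
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]            # string -> list of integers
decode = lambda ids: "".join(itos[i] for i in ids) # list of integers -> string

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))             # 90/10 train/val split
train_data, val_data = data[:n], data[n:]

block_size = 8                       # maximum context length
batch_size = 4

def get_batch(split):
    # sample batch_size random chunks; targets are the inputs shifted by one
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch("train")          # xb, yb: (batch_size, block_size)
```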
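Next, a minimal bigram sketch: a single embedding table whose row i holds the next-token logits for token i, with cross-entropy loss and sampling-based generation. The class and variable names follow common nanoGPT-style conventions and are assumptions here; the training loop is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # row i holds the logits for the token that follows token i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)         # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)   # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, idx_next], dim=1)
        return idx

vocab_size = 65  # Tiny Shakespeare has 65 unique characters
model = BigramLanguageModel(vocab_size)
out = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20)
```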
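The matrix-multiply sketch below covers versions 2 and 3 of the aggregation trick: averaging over past context expressed as a matrix multiply with a lower-triangular weight matrix, and the same weights recovered by masking with -inf and applying softmax. Tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 2: lower-triangular matrix of row-normalized ones
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
xbow2 = wei @ x                                   # (T, T) @ (B, T, C) -> (B, T, C)

# version 3: same weights via masking and softmax (the form self-attention uses)
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float("-inf"))   # tokens from the future cannot communicate
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

assert torch.allclose(xbow2, xbow3)
```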
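The attention-head sketch below shows a single causal ("decoder") self-attention head, including the causal mask that the first erratum refers to and the 1/sqrt(head_size) scaling that the second erratum refers to. Dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C = 4, 8, 32        # batch, time, embedding dimension
head_size = 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)                                         # (B, T, head_size)
q = query(x)                                       # (B, T, head_size)
# scaled attention scores: divide by sqrt(head_size), not by C (per the erratum)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))    # causal mask: no looking at the future
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v                                      # (B, T, head_size)
```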
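Finally, the Transformer block sketch combines multi-headed self-attention, the feed-forward layer, residual connections, and (pre-)layernorm. Hyperparameter names and values are illustrative assumptions; dropout and positional encoding are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

block_size = 8   # maximum context length

class Head(nn.Module):
    """One head of causal self-attention (as in the sketch above)."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q = self.key(x), self.query(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ self.value(x)

class MultiHeadAttention(nn.Module):
    """Several heads in parallel, concatenated over channels, then projected."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(Head(n_embd, head_size) for _ in range(n_head))
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class FeedForward(nn.Module):
    """Per-token MLP; inner dimension is 4x the embedding dimension."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) followed by computation (MLP), with residuals."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # residual connection around attention
        x = x + self.ffwd(self.ln2(x))   # residual connection around feed-forward
        return x

x = torch.randn(4, block_size, 32)       # (batch, time, n_embd)
out = Block(n_embd=32, n_head=4)(x)      # output has the same shape
```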
Taught by
Andrej Karpathy