Class Central Classrooms (beta)
YouTube videos curated by Class Central.
Classroom Contents
Let's Build GPT - From Scratch, in Code, Spelled Out
- 1 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
- 2 reading and exploring the data
- 3 tokenization, train/val split
- 4 data loader: batches of chunks of data
- 5 simplest baseline: bigram language model, loss, generation
- 6 training the bigram model
- 7 port our code to a script
- 8 version 1: averaging past context with for loops, the weakest form of aggregation
- 9 the trick in self-attention: matrix multiply as weighted aggregation
- 10 version 2: using matrix multiply
- 11 version 3: adding softmax
- 12 minor code cleanup
- 13 positional encoding
- 14 THE CRUX OF THE VIDEO: version 4: self-attention
- 15 note 1: attention as communication
- 16 note 2: attention has no notion of space, operates over sets
- 17 note 3: there is no communication across batch dimension
- 18 note 4: encoder blocks vs. decoder blocks
- 19 note 5: attention vs. self-attention vs. cross-attention
- 20 note 6: "scaled" self-attention. why divide by sqrt(head_size)
- 21 inserting a single self-attention block into our network
- 22 multi-headed self-attention
- 23 feedforward layers of transformer block
- 24 residual connections
- 25 layernorm and its relationship to our previous batchnorm
- 26 scaling up the model! creating a few variables. adding dropout
- 27 encoder vs. decoder vs. both (?) Transformers
- 28 super quick walkthrough of nanoGPT, batched multi-headed self-attention
- 29 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
- 30 conclusions
- 31 Oops: "tokens from the _future_ cannot communicate", not "past". Sorry! :)
- 32 Oops I should be using the head_size for the normalization, not C
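
For orientation, here is a minimal sketch of the bigram baseline covered in chapters 5-6 above: a single embedding table that maps each token directly to the logits for the next token, plus a sampling loop for generation. Names such as `BigramLanguageModel` and `vocab_size` follow the video, but the details here are illustrative rather than the video's exact code.

```python
# Minimal bigram language model sketch (cf. chapters 5-6). Assumes PyTorch;
# vocab_size and the training data are placeholders.
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)           # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)    # last time step only
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```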
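
And a sketch of the core trick from chapters 8-14 and note 6 (chapter 20), including the fix from chapter 32: past context is aggregated with a lower-triangular masked softmax, first with uniform weights and then with data-dependent weights from a single scaled self-attention head. Shapes and names (B, T, C, head_size) mirror the video; the snippet is a standalone illustration, not the video's full script.

```python
# Masked-softmax aggregation and a single scaled self-attention head
# (cf. chapters 8-14, 20, and the normalization fix in chapter 32).
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32                               # batch, time, channels
head_size = 16
x = torch.randn(B, T, C)

# Uniform averaging of past context via a lower-triangular softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))  # tokens from the future cannot communicate
wei = F.softmax(wei, dim=-1)                     # each row sums to 1
out_avg = wei @ x                                # (T, T) @ (B, T, C) -> (B, T, C)

# Single self-attention head: data-dependent weights instead of uniform ones
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)                                       # (B, T, head_size)
q = query(x)                                     # (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5  # scale by sqrt(head_size), not C
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ value(x)                             # (B, T, head_size)
```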