Let's Build GPT - From Scratch, in Code, Spelled Out

Andrej Karpathy via YouTube

Classroom Contents

  1. intro: ChatGPT, Transformers, nanoGPT, Shakespeare
  2. reading and exploring the data
  3. tokenization, train/val split
  4. data loader: batches of chunks of data
  5. simplest baseline: bigram language model, loss, generation (see the first sketch after this list)
  6. training the bigram model
  7. port our code to a script
  8. version 1: averaging past context with for loops, the weakest form of aggregation
  9. the trick in self-attention: matrix multiply as weighted aggregation (see the second sketch after this list)
  10. version 2: using matrix multiply
  11. version 3: adding softmax
  12. minor code cleanup
  13. positional encoding
  14. THE CRUX OF THE VIDEO: version 4: self-attention (see the third sketch after this list)
  15. note 1: attention as communication
  16. note 2: attention has no notion of space, operates over sets
  17. note 3: there is no communication across the batch dimension
  18. note 4: encoder blocks vs. decoder blocks
  19. note 5: attention vs. self-attention vs. cross-attention
  20. note 6: "scaled" self-attention. why divide by sqrt(head_size)
  21. inserting a single self-attention block into our network
  22. multi-headed self-attention
  23. feedforward layers of the transformer block
  24. residual connections
  25. layernorm and its relationship to our previous batchnorm
  26. scaling up the model! creating a few variables. adding dropout
  27. encoder vs. decoder vs. both (?) Transformers
  28. super quick walkthrough of nanoGPT, batched multi-headed self-attention
  29. back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
  30. conclusions
  31. Oops: "tokens from the _future_ cannot communicate", not "past". Sorry!
  32. Oops: I should be using head_size for the normalization, not C
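
Chapters 5-6 build the simplest possible baseline. The snippet below is a minimal sketch of that idea, not the video's exact code: a bigram language model whose embedding table directly stores next-token logits, trained with cross-entropy and sampled autoregressively. The vocab_size of 65 and the random toy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # row i holds the next-token logits for token i
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)             # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)       # distribution for the next token
            idx = torch.cat([idx, torch.multinomial(probs, num_samples=1)], dim=1)
        return idx

vocab_size = 65                                  # e.g. a character-level Shakespeare vocabulary
m = BigramLanguageModel(vocab_size)
xb = torch.randint(vocab_size, (4, 8))           # toy batch of token indices
yb = torch.randint(vocab_size, (4, 8))
logits, loss = m(xb, yb)
print(loss.item())                               # roughly ln(65) ≈ 4.17 before any training
print(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20).shape)
```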
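
Chapters 8-11 develop the "mathematical trick": averaging each token's past context can be written as a matrix multiply with a lower-triangular weight matrix, and the same weights can be recovered by masking followed by softmax. A rough sketch of the three equivalent versions, with toy shapes assumed:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2                 # batch, time, channels (toy sizes)
x = torch.randn(B, T, C)

# version 1: explicit for-loops, average of all tokens up to and including t
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# version 2: the same aggregation as a weighted matrix multiply
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)          # each row sums to 1
xbow2 = wei @ x                                   # (T, T) @ (B, T, C) -> (B, T, C)

# version 3: identical weights via masking + softmax (the form attention uses)
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float("-inf"))   # future tokens cannot be attended to
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

print(torch.allclose(xbow, xbow2, atol=1e-6), torch.allclose(xbow, xbow3, atol=1e-6))
```

The masked-softmax form is exactly what self-attention reuses in chapter 14, with data-dependent scores replacing the zeros.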
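
Chapters 14-25 then assemble the real thing. The sketch below is a hedged reconstruction, not a copy of the video's code: a masked, "scaled" single attention head, a multi-head wrapper, the 4x feed-forward layer, and a pre-norm Transformer block with residual connections. Hyperparameter names (n_embd, block_size, head_size) follow the video's vocabulary; the values here are toy assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8   # embedding width and maximum context length (toy values)

class Head(nn.Module):
    """One head of masked (decoder-style) self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)           # (B, T, head_size)
        # "scaled" attention: divide by sqrt(head_size) so softmax stays diffuse (note 6)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5            # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))   # mask the future
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                                 # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads in parallel, concatenated then projected back to n_embd."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class FeedForward(nn.Module):
    """Per-token MLP with the 4x inner expansion from the Transformer paper."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (attention) then computation (MLP), each on a residual path."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        self.sa = MultiHeadAttention(n_head, n_embd // n_head)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)   # pre-norm: LayerNorm applied before each sublayer
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual connection around attention
        x = x + self.ffwd(self.ln2(x))    # residual connection around the MLP
        return x

x = torch.randn(4, block_size, n_embd)    # (batch, time, channels)
print(Block(n_embd, n_head=4)(x).shape)   # torch.Size([4, 8, 32])
```

Stacking several such blocks, adding token and positional embeddings, and finishing with a LayerNorm plus a linear head gives the model that the later chapters scale up and that nanoGPT implements in batched form.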
