Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization

Andrej Karpathy via YouTube

Current chapter (23 of 31): batch size schedule, weight decay, FusedAdamW, 90ms
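The optimizer changes named in this chapter title fit in a few lines of PyTorch. Below is a minimal sketch, assuming PyTorch 2.x: weight decay applied only to 2-D parameters, and `fused=True` to select the fused AdamW kernel. The function name and hyperparameter values are illustrative, not lifted verbatim from the video.

```python
import torch
import torch.nn as nn

def configure_optimizer(model: nn.Module, weight_decay: float = 0.1,
                        lr: float = 6e-4, device_type: str = "cuda"):
    # Decay only parameters that participate in matmuls/embeddings (dim >= 2);
    # biases and normalization parameters (dim < 2) are left undecayed.
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # fused=True dispatches to a single fused kernel instead of a
    # per-parameter Python loop; guarded on CUDA, where it is supported.
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95), eps=1e-8,
                             fused=(device_type == "cuda"))
```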

Classroom Contents

  1. intro: Let’s reproduce GPT-2 124M
  2. exploring the GPT-2 124M OpenAI checkpoint
  3. SECTION 1: implementing the GPT-2 nn.Module
  4. loading the huggingface/GPT-2 parameters
  5. implementing the forward pass to get logits
  6. sampling init, prefix tokens, tokenization
  7. sampling loop
  8. sample, auto-detect the device
  9. let’s train: data batches B,T → logits B,T,C
  10. cross entropy loss
  11. optimization loop: overfit a single batch
  12. data loader lite
  13. parameter sharing wte and lm_head
  14. model initialization: std 0.02, residual init
  15. SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
  16. Tensor Cores, timing the code, TF32 precision, 333ms
  17. float16, gradient scalers, bfloat16, 300ms
  18. torch.compile, Python overhead, kernel fusion, 130ms
  19. flash attention, 96ms
  20. nice/ugly numbers. vocab size 50257 → 50304, 93ms
  21. SECTION 3: hyperparameters, AdamW, gradient clipping
  22. learning rate scheduler: warmup + cosine decay (sketched after this list)
  23. batch size schedule, weight decay, FusedAdamW, 90ms
  24. gradient accumulation (sketched after this list)
  25. distributed data parallel DDP
  26. datasets used in GPT-2, GPT-3, FineWeb EDU
  27. validation data split, validation loss, sampling revive
  28. evaluation: HellaSwag, starting the run
  29. SECTION 4: results in the morning! GPT-2, GPT-3 repro
  30. shoutout to llm.c, equivalent but faster code in raw C/CUDA
  31. summary, phew, build-nanogpt github repo
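The learning-rate schedule from chapter 22 is simple enough to sketch directly: linear warmup to a peak rate, then a cosine decay down to a floor. A minimal sketch of that shape follows; the step counts and rates are placeholders, not the values used in the run.

```python
import math

max_lr = 6e-4            # peak learning rate (placeholder value)
min_lr = max_lr * 0.1    # decay floor
warmup_steps = 100       # placeholder
max_steps = 5000         # placeholder

def get_lr(step: int) -> float:
    # 1) linear warmup to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the end of the schedule: hold at the floor
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

Applied per step by writing the value into the optimizer, e.g. `for g in optimizer.param_groups: g["lr"] = get_lr(step)`.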
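Gradient accumulation (chapter 24) simulates a large batch on limited GPU memory by summing gradients over several micro-batches before each optimizer step. A minimal sketch of the pattern, where `model`, `data_loader`, and `optimizer` are hypothetical stand-ins for the training objects:

```python
grad_accum_steps = 8  # micro-batches per optimizer step (placeholder)

for micro_step in range(grad_accum_steps):
    x, y = data_loader.next_batch()   # hypothetical loader
    logits, loss = model(x, y)        # hypothetical model returning a mean loss
    # Scale so the accumulated gradient matches that of one big
    # mean-reduced batch rather than a sum over micro-batches.
    loss = loss / grad_accum_steps
    loss.backward()                   # gradients accumulate in .grad
optimizer.step()
optimizer.zero_grad()
```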
