SECTION 3: hyperparameters, AdamW, gradient clipping
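As a quick orientation for this chapter, here is a minimal PyTorch sketch of the two ideas named in its title: an AdamW optimizer and global gradient-norm clipping. The model, data, and hyperparameter values below are illustrative assumptions, not the video's exact code.

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the GPT-2 nn.Module
batches = [(torch.randn(8, 768), torch.randn(8, 768)) for _ in range(10)]  # dummy data

# AdamW with GPT-3-style betas/eps; lr and weight_decay are assumed values
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)

for x, y in batches:
    optimizer.zero_grad(set_to_none=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # clip the global gradient norm to 1.0 before stepping the optimizer
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```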
Classroom Contents
Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization
- 1 intro: Let’s reproduce GPT-2 124M
- 2 exploring the GPT-2 124M OpenAI checkpoint
- 3 SECTION 1: implementing the GPT-2 nn.Module
- 4 loading the huggingface/GPT-2 parameters
- 5 implementing the forward pass to get logits
- 6 sampling init, prefix tokens, tokenization
- 7 sampling loop
- 8 sample, auto-detect the device
- 9 let’s train: data batches B,T → logits B,T,C
- 10 cross entropy loss
- 11 optimization loop: overfit a single batch
- 12 data loader lite
- 13 parameter sharing wte and lm_head
- 14 model initialization: std 0.02, residual init
- 15 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
- 16 Tensor Cores, timing the code, TF32 precision, 333ms
- 17 float16, gradient scalers, bfloat16, 300ms
- 18 torch.compile, Python overhead, kernel fusion, 130ms
- 19 flash attention, 96ms
- 20 nice/ugly numbers. vocab size 50257 → 50304, 93ms
- 21 SECTION 3: hyperparameters, AdamW, gradient clipping
- 22 learning rate scheduler: warmup + cosine decay (see the sketch after this list)
- 23 batch size schedule, weight decay, FusedAdamW, 90ms
- 24 gradient accumulation (see the sketch after this list)
- 25 distributed data parallel DDP
- 26 datasets used in GPT-2, GPT-3, FineWeb EDU
- 27 validation data split, validation loss, sampling revive
- 28 evaluation: HellaSwag, starting the run
- 29 SECTION 4: results in the morning! GPT-2, GPT-3 repro
- 30 shoutout to llm.c, equivalent but faster code in raw C/CUDA
- 31 summary, phew, build-nanogpt github repo
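For chapter 22, a minimal sketch of a linear-warmup plus cosine-decay learning-rate schedule of the kind the video implements by hand; all constants (max_lr, min_lr, warmup_steps, max_steps) are assumed values for illustration.

```python
import math

max_lr = 6e-4          # assumed peak learning rate
min_lr = max_lr * 0.1  # floor the decay at 10% of the peak
warmup_steps = 10
max_steps = 50

def get_lr(step):
    # 1) linear warmup from ~0 up to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the decay horizon, hold at min_lr
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# applied each step by writing the value into the optimizer's param groups:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```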
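For chapter 24, a minimal sketch of gradient accumulation: several micro-batch backward passes are summed before a single optimizer step, with the loss scaled by the number of accumulation steps so the gradient matches a full-batch mean. The model, data, and grad_accum_steps value are hypothetical.

```python
import torch

model = torch.nn.Linear(32, 32)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
grad_accum_steps = 4  # assumed; in practice chosen to hit a target tokens-per-step

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x, y = torch.randn(8, 32), torch.randn(8, 32)   # dummy micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss = loss / grad_accum_steps   # scale so accumulated grads average correctly
    loss.backward()                  # gradients add up across micro-steps
optimizer.step()                     # one optimizer step for the whole large batch
```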