
YouTube

Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization

Andrej Karpathy via YouTube

Overview

Embark on a comprehensive 4-hour journey to reproduce GPT-2 (124M) from scratch in this in-depth video tutorial. Explore the entire process, from building the GPT-2 network to optimizing its training for maximum efficiency. Follow along as the instructor sets up the training run according to GPT-2 and GPT-3 paper specifications, initiates the process, and analyzes the results. Gain insights into model architecture, parameter loading, forward pass implementation, sampling techniques, and data handling. Dive into advanced topics such as mixed precision training, GPU optimization, gradient accumulation, and distributed data parallel processing. Learn about hyperparameter tuning, learning rate scheduling, and evaluation methods. By the end, you'll have a thorough understanding of building and training a GPT-2 model, with practical knowledge applicable to larger language models.

Syllabus

intro: Let’s reproduce GPT-2 124M
exploring the GPT-2 124M OpenAI checkpoint
SECTION 1: implementing the GPT-2 nn.Module
loading the huggingface/GPT-2 parameters
implementing the forward pass to get logits
sampling init, prefix tokens, tokenization
sampling loop
sample, auto-detect the device
let’s train: data batches B,T → logits B,T,C
cross entropy loss
optimization loop: overfit a single batch
data loader lite
parameter sharing wte and lm_head
model initialization: std 0.02, residual init
SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
Tensor Cores, timing the code, TF32 precision, 333ms
float16, gradient scalers, bfloat16, 300ms
torch.compile, Python overhead, kernel fusion, 130ms
flash attention, 96ms
nice/ugly numbers. vocab size 50257 → 50304, 93ms
SECTION 3: hyperparameters, AdamW, gradient clipping
learning rate scheduler: warmup + cosine decay
batch size schedule, weight decay, FusedAdamW, 90ms
gradient accumulation
distributed data parallel DDP
datasets used in GPT-2, GPT-3, FineWeb EDU
validation data split, validation loss, sampling revive
evaluation: HellaSwag, starting the run
SECTION 4: results in the morning! GPT-2, GPT-3 repro
shoutout to llm.c, equivalent but faster code in raw C/CUDA
summary, phew, build-nanogpt github repo
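To give a flavor of the sampling loop covered early in the syllabus, here is a minimal top-k sampling sketch in plain Python (no PyTorch). The logits, k value, and function name are illustrative, not the course's exact code:

```python
import math
import random

def sample_top_k(logits, k, rng):
    # Keep only the k largest logits, softmax over them, and sample one index.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    mx = logits[top[0]]  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - mx) for i in top]
    r = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w
        if r <= acc:
            return idx
    return top[-1]

rng = random.Random(0)
next_token = sample_top_k([0.0, 10.0, -1.0], k=2, rng=rng)
```

In the video the same idea is applied per step of an autoregressive loop: run the forward pass, take the logits at the last position, and sample the next token.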
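The "nice/ugly numbers" chapter pads the vocabulary from 50257 up to 50304 so tensor dimensions divide evenly into powers of two, which keeps GPU kernels efficient. A small sketch of that rounding (the helper name and the choice of multiple are assumptions, though 50304 is indeed the next multiple of 128 above 50257):

```python
def pad_to_multiple(n, multiple=128):
    # Round n up to the nearest multiple, e.g. a power-of-two-friendly size.
    return ((n + multiple - 1) // multiple) * multiple

padded_vocab = pad_to_multiple(50257)  # 50304
```

The extra rows of the embedding are simply never used by real tokens; the model learns to drive their probabilities toward zero.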
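Section 3's learning rate schedule follows the GPT-3 paper: linear warmup followed by cosine decay down to a minimum. A minimal sketch of that schedule (the specific max_lr, min_lr, and step counts here are placeholders, not the course's exact settings):

```python
import math

def get_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=10, max_steps=50):
    # 1) Linear warmup from ~0 up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) After max_steps, hold at the floor.
    if step > max_steps:
        return min_lr
    # 3) Cosine decay from max_lr down to min_lr in between.
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

Each training step calls a function like this and writes the result into the optimizer's parameter groups before stepping.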
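The gradient accumulation chapter simulates a large batch on limited memory: sum gradients over several micro-batches, scale appropriately, then take one optimizer step. A torch-free toy (a single parameter w with per-example loss (w*x - y)^2; all names here are illustrative) showing that an accumulated step matches one full-batch step:

```python
def grad(w, x, y):
    # d/dw of the squared error (w*x - y)^2 for one example.
    return 2.0 * (w * x - y) * x

def step_full_batch(w, xs, ys, lr):
    # One SGD step on the mean gradient over the whole batch.
    g = sum(grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * g

def step_accumulated(w, micro_batches, lr):
    # Accumulate gradients across micro-batches, then take a single
    # step on the mean; equivalent to the full-batch update above.
    g, n = 0.0, 0
    for xs, ys in micro_batches:
        for x, y in zip(xs, ys):
            g += grad(w, x, y)
            n += 1
    return w - lr * g / n
```

In the PyTorch version this scaling shows up as dividing each micro-batch loss by the number of accumulation steps before calling backward, with optimizer.step() only after the last micro-batch.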

Taught by

Andrej Karpathy
