Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization

Andrej Karpathy via YouTube

Current chapter (23 of 31): batch size schedule, weight decay, FusedAdamW, 90ms
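The optimizer changes named in this chapter title fit in a few lines of PyTorch. Below is a minimal sketch, assuming PyTorch 2.x: weight decay applied only to 2-D parameters, and `fused=True` to select the fused AdamW kernel. The function name and hyperparameter values are illustrative, not lifted verbatim from the video.

```python
import torch
import torch.nn as nn

def configure_optimizer(model: nn.Module, weight_decay: float = 0.1,
                        lr: float = 6e-4, device_type: str = "cuda"):
    # Decay only parameters that participate in matmuls/embeddings (dim >= 2);
    # biases and normalization parameters (dim < 2) are left undecayed.
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # fused=True dispatches to a single fused kernel instead of a
    # per-parameter Python loop; guarded on CUDA, where it is supported.
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95), eps=1e-8,
                             fused=(device_type == "cuda"))
```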

Classroom Contents

  1. intro: Let’s reproduce GPT-2 124M
  2. exploring the GPT-2 124M OpenAI checkpoint
  3. SECTION 1: implementing the GPT-2 nn.Module
  4. loading the huggingface/GPT-2 parameters
  5. implementing the forward pass to get logits
  6. sampling init, prefix tokens, tokenization
  7. sampling loop
  8. sample, auto-detect the device
  9. let’s train: data batches B,T → logits B,T,C
  10. cross entropy loss
  11. optimization loop: overfit a single batch
  12. data loader lite
  13. parameter sharing wte and lm_head
  14. model initialization: std 0.02, residual init
  15. SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
  16. Tensor Cores, timing the code, TF32 precision, 333ms
  17. float16, gradient scalers, bfloat16, 300ms
  18. torch.compile, Python overhead, kernel fusion, 130ms
  19. flash attention, 96ms
  20. nice/ugly numbers. vocab size 50257 → 50304, 93ms
  21. SECTION 3: hyperparameters, AdamW, gradient clipping
  22. learning rate scheduler: warmup + cosine decay (sketched after this list)
  23. batch size schedule, weight decay, FusedAdamW, 90ms
  24. gradient accumulation (sketched after this list)
  25. distributed data parallel DDP
  26. datasets used in GPT-2, GPT-3, FineWeb EDU
  27. validation data split, validation loss, sampling revive
  28. evaluation: HellaSwag, starting the run
  29. SECTION 4: results in the morning! GPT-2, GPT-3 repro
  30. shoutout to llm.c, equivalent but faster code in raw C/CUDA
  31. summary, phew, build-nanogpt github repo
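The learning-rate schedule from chapter 22 is simple enough to sketch directly: linear warmup to a peak rate, then a cosine decay down to a floor. A minimal sketch of that shape follows; the step counts and rates are placeholders, not the values used in the run.

```python
import math

max_lr = 6e-4            # peak learning rate (placeholder value)
min_lr = max_lr * 0.1    # decay floor
warmup_steps = 100       # placeholder
max_steps = 5000         # placeholder

def get_lr(step: int) -> float:
    # 1) linear warmup to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the end of the schedule: hold at the floor
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

Applied per step by writing the value into the optimizer, e.g. `for g in optimizer.param_groups: g["lr"] = get_lr(step)`.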
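Gradient accumulation (chapter 24) simulates a large batch on limited GPU memory by summing gradients over several micro-batches before each optimizer step. A minimal sketch of the pattern, where `model`, `data_loader`, and `optimizer` are hypothetical stand-ins for the training objects:

```python
grad_accum_steps = 8  # micro-batches per optimizer step (placeholder)

for micro_step in range(grad_accum_steps):
    x, y = data_loader.next_batch()   # hypothetical loader
    logits, loss = model(x, y)        # hypothetical model returning a mean loss
    # Scale so the accumulated gradient matches that of one big
    # mean-reduced batch rather than a sum over micro-batches.
    loss = loss / grad_accum_steps
    loss.backward()                   # gradients accumulate in .grad
optimizer.step()
optimizer.zero_grad()
```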
