SECTION 3: hyperparameters, AdamW, gradient clipping
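As a quick orientation for this chapter, here is a minimal PyTorch sketch of the two ideas named in its title: an AdamW optimizer and global gradient-norm clipping. The model, data, and hyperparameter values below are illustrative assumptions, not the video's exact code.

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the GPT-2 nn.Module
batches = [(torch.randn(8, 768), torch.randn(8, 768)) for _ in range(10)]  # dummy data

# AdamW with GPT-3-style betas/eps; lr and weight_decay are assumed values
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)

for x, y in batches:
    optimizer.zero_grad(set_to_none=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # clip the global gradient norm to 1.0 before stepping the optimizer
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```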
Classroom Contents
Reproducing GPT-2 (124M) from Scratch - Implementation and Optimization
- 1 intro: Let’s reproduce GPT-2 124M
- 2 exploring the GPT-2 124M OpenAI checkpoint
- 3 SECTION 1: implementing the GPT-2 nn.Module
- 4 loading the huggingface/GPT-2 parameters
- 5 implementing the forward pass to get logits
- 6 sampling init, prefix tokens, tokenization
- 7 sampling loop
- 8 sample, auto-detect the device
- 9 let’s train: data batches B,T → logits B,T,C
- 10 cross entropy loss
- 11 optimization loop: overfit a single batch
- 12 data loader lite
- 13 parameter sharing wte and lm_head
- 14 model initialization: std 0.02, residual init
- 15 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
- 16 Tensor Cores, timing the code, TF32 precision, 333ms
- 17 float16, gradient scalers, bfloat16, 300ms
- 18 torch.compile, Python overhead, kernel fusion, 130ms
- 19 flash attention, 96ms
- 20 nice/ugly numbers. vocab size 50257 → 50304, 93ms
- 21 SECTION 3: hyperparameters, AdamW, gradient clipping
- 22 learning rate scheduler: warmup + cosine decay (see the sketch after this list)
- 23 batch size schedule, weight decay, FusedAdamW, 90ms
- 24 gradient accumulation (see the sketch after this list)
- 25 distributed data parallel DDP
- 26 datasets used in GPT-2, GPT-3, FineWeb EDU
- 27 validation data split, validation loss, sampling revive
- 28 evaluation: HellaSwag, starting the run
- 29 SECTION 4: results in the morning! GPT-2, GPT-3 repro
- 30 shoutout to llm.c, equivalent but faster code in raw C/CUDA
- 31 summary, phew, build-nanogpt github repo
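For chapter 22, a minimal sketch of a linear-warmup plus cosine-decay learning-rate schedule of the kind the video implements by hand; all constants (max_lr, min_lr, warmup_steps, max_steps) are assumed values for illustration.

```python
import math

max_lr = 6e-4          # assumed peak learning rate
min_lr = max_lr * 0.1  # floor the decay at 10% of the peak
warmup_steps = 10
max_steps = 50

def get_lr(step):
    # 1) linear warmup from ~0 up to max_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the decay horizon, hold at min_lr
    if step > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr in between
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# applied each step by writing the value into the optimizer's param groups:
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)
```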
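For chapter 24, a minimal sketch of gradient accumulation: several micro-batch backward passes are summed before a single optimizer step, with the loss scaled by the number of accumulation steps so the gradient matches a full-batch mean. The model, data, and grad_accum_steps value are hypothetical.

```python
import torch

model = torch.nn.Linear(32, 32)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
grad_accum_steps = 4  # assumed; in practice chosen to hit a target tokens-per-step

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x, y = torch.randn(8, 32), torch.randn(8, 32)   # dummy micro-batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss = loss / grad_accum_steps   # scale so accumulated grads average correctly
    loss.backward()                  # gradients add up across micro-steps
optimizer.step()                     # one optimizer step for the whole large batch
```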