YouTube

Let's Build GPT - From Scratch, in Code, Spelled Out

Andrej Karpathy via YouTube

Overview

Dive into a comprehensive tutorial on building a Generatively Pretrained Transformer (GPT) from scratch, following the "Attention is All You Need" paper and OpenAI's GPT-2/GPT-3 models. Explore the connections to ChatGPT and watch GitHub Copilot assist in writing GPT code. Begin with an introduction to ChatGPT, Transformers, nanoGPT, and Shakespeare, then progress through data exploration, tokenization, and implementing a baseline bigram language model. Delve into the core concepts of self-attention, including matrix multiplication for weighted aggregation, positional encoding, and multi-headed attention. Build the Transformer architecture step-by-step, incorporating feedforward layers, residual connections, and layer normalization. Conclude with insights on encoder vs. decoder Transformers, a walkthrough of nanoGPT, and discussions on pretraining, fine-tuning, and RLHF in the context of ChatGPT and GPT-3.
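
The central trick the lecture builds toward is that "averaging past context" can be done with a single matrix multiply rather than Python for-loops. The snippet below is a minimal illustrative sketch of that idea, assuming PyTorch and the batch/time/channel tensor shapes used in the video; it is a reconstruction for orientation, not the lecture's exact notebook code.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch (not the lecture's exact code): averaging each token's
# past context with one matrix multiply instead of Python for-loops.
B, T, C = 4, 8, 2                    # batch, time (context length), channels
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))                      # lower-triangular: no future tokens
wei = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)                             # each row sums to 1 -> uniform average

out = wei @ x                                            # (T, T) @ (B, T, C) -> (B, T, C)
# out[b, t] is the mean of x[b, 0..t]; self-attention later replaces these
# uniform weights with data-dependent ones computed from queries and keys.
```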

Syllabus

intro: ChatGPT, Transformers, nanoGPT, Shakespeare
reading and exploring the data
tokenization, train/val split
data loader: batches of chunks of data
simplest baseline: bigram language model, loss, generation
training the bigram model
port our code to a script
version 1: averaging past context with for loops, the weakest form of aggregation
the trick in self-attention: matrix multiply as weighted aggregation
version 2: using matrix multiply
version 3: adding softmax
minor code cleanup
positional encoding
THE CRUX OF THE VIDEO: version 4: self-attention
note 1: attention as communication
note 2: attention has no notion of space, operates over sets
note 3: there is no communication across batch dimension
note 4: encoder blocks vs. decoder blocks
note 5: attention vs. self-attention vs. cross-attention
note 6: "scaled" self-attention. why divide by sqrthead_size
inserting a single self-attention block to our network
multi-headed self-attention
feedforward layers of transformer block
residual connections
layernorm and its relationship to our previous batchnorm
scaling up the model! creating a few variables. adding dropout
encoder vs. decoder vs. both (?) Transformers
super quick walkthrough of nanoGPT, batched multi-headed self-attention
back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
conclusions
Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :
Oops I should be using the head_size for the normalization, not C
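
For reference, here is a minimal sketch of the single self-attention head the syllabus builds toward, folding in both correction notes above: future tokens are masked out, and attention scores are scaled by sqrt(head_size). It assumes PyTorch; the class and argument names (Head, n_embd, head_size, block_size) are illustrative rather than copied verbatim from nanoGPT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal ("decoder") self-attention -- illustrative sketch."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # causal mask: a position may attend only to itself and earlier positions
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                       # x: (B, T, n_embd)
        B, T, C = x.shape
        k = self.key(x)                         # (B, T, head_size)
        q = self.query(x)                       # (B, T, head_size)
        # "scaled" attention: divide by sqrt(head_size), per the correction note above
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5           # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # no peeking at the future
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)                       # (B, T, head_size)
        return wei @ v                          # (B, T, head_size)
```

In the full model, several such heads run in parallel (multi-headed attention), and their concatenated output feeds the feedforward layers, residual connections, and layer norm listed above.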

Taught by

Andrej Karpathy
