
YouTube

Building the GPT Tokenizer - From Strings to Tokens and Back

Andrej Karpathy via YouTube

Overview

Dive into a comprehensive 2-hour lecture on building the GPT Tokenizer from scratch. Explore the crucial role of tokenization in Large Language Models (LLMs), and see why the tokenizer is trained in a separate stage from the model itself. Learn about Byte Pair Encoding, Unicode, and various encoding methods. Implement key components like encoding, decoding, and regex patterns. Compare different tokenizer libraries and examine tokenization quirks in LLMs. Gain hands-on experience through exercises, including creating your own GPT-4 tokenizer. Discover insights on multimodal tokenization and potential future improvements in the field.
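
As a rough illustration of the Byte Pair Encoding idea mentioned above, the following minimal sketch trains merges on raw UTF-8 bytes: count adjacent pairs, merge the most common pair into a new token id, and repeat. The function names (`get_pair_counts`, `merge`, `train_bpe`) and the toy input string are illustrative assumptions, not code taken from the lecture.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the 256 byte values."""
    ids = list(text.encode("utf-8"))  # raw bytes as integers 0..255
    merges = {}                       # (id, id) -> new id, in merge order
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break                     # fewer than two tokens left
        pair = max(counts, key=counts.get)
        new_id = 256 + step           # new ids start after the byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

# Toy input chosen for illustration only.
text = "aaabdaaabac"
ids, merges = train_bpe(text, num_merges=3)
print("tokens:", ids)
print("compression ratio:", len(text.encode("utf-8")) / len(ids))
```

Each merge shortens the token sequence while growing the vocabulary by one entry, which is the trade-off behind the compression ratio.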

Syllabus

intro: Tokenization, GPT-2 paper, tokenization-related issues
tokenization by example in a Web UI (tiktokenizer)
strings in Python, Unicode code points
Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
daydreaming: deleting tokenization
Byte Pair Encoding (BPE) algorithm walkthrough
starting the implementation
counting consecutive pairs, finding most common pair
merging the most common pair
training the tokenizer: adding the while loop, compression ratio
tokenizer/LLM diagram: it is a completely separate stage
decoding tokens to strings
encoding strings to tokens
regex patterns to force splits across categories
tiktoken library intro, differences between GPT-2/GPT-4 regex
walkthrough of the GPT-2 encoder.py released by OpenAI
special tokens, how tiktoken handles them, GPT-2/GPT-4 differences
minbpe exercise time! write your own GPT-4 tokenizer
sentencepiece library intro, used to train Llama 2 vocabulary
how to set vocabulary size? revisiting gpt.py transformer
training new tokens, example of prompt compression
multimodal [image, video, audio] tokenization with vector quantization
revisiting and explaining the quirks of LLM tokenization
final recommendations
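
To make the encoding and decoding items in the syllabus concrete, here is a small sketch of both directions for a byte-level BPE tokenizer. It assumes a merge table that maps pairs of token ids to new ids in the order they were learned; the three-entry `merges` table below is a hypothetical example, and the function names are illustrative rather than taken from the lecture or from any library.

```python
def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode(text, merges):
    """Turn a string into token ids, applying the earliest-learned merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pairs absent from the table rank as infinity, so they are never chosen
        # while a mergeable pair remains.
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Turn token ids back into a string by expanding each id to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # relies on merge-order insertion
        vocab[new_id] = vocab[a] + vocab[b]
    data = b"".join(vocab[i] for i in ids)
    # Arbitrary token sequences may not be valid UTF-8, so replace bad bytes.
    return data.decode("utf-8", errors="replace")

# Hypothetical merge table: "aa" -> 256, then 256 + "a" -> 257, then 257 + "b" -> 258.
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}

tokens = encode("aaab", merges)
print(tokens)
print(decode(tokens, merges))
```

Decoding uses `errors="replace"` because not every sequence of token ids corresponds to valid UTF-8; text produced by `encode` always round-trips.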

Taught by

Andrej Karpathy
