Classroom Contents
Building the GPT Tokenizer - From Strings to Tokens and Back
- 1 intro: Tokenization, GPT-2 paper, tokenization-related issues
- 2 tokenization by example in a Web UI (tiktokenizer)
- 3 strings in Python, Unicode code points
- 4 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32 (see the first sketch after this list)
- 5 daydreaming: deleting tokenization
- 6 Byte Pair Encoding (BPE) algorithm walkthrough
- 7 starting the implementation
- 8 counting consecutive pairs, finding most common pair
- 9 merging the most common pair
- 10 training the tokenizer: adding the while loop, compression ratio (see the BPE training sketch after this list)
- 11 tokenizer/LLM diagram: it is a completely separate stage
- 12 decoding tokens to strings
- 13 encoding strings to tokens (encode/decode are sketched after this list)
- 14 regex patterns to force splits across categories (the GPT-2 pattern is shown below)
- 15 tiktoken library intro, differences between GPT-2/GPT-4 regex
- 16 GPT-2 encoder.py released by OpenAI walkthrough
- 17 special tokens, how tiktoken handles them, GPT-2/GPT-4 differences (see the tiktoken sketch below)
- 18 minbpe exercise time! write your own GPT-4 tokenizer
- 19 sentencepiece library intro, used to train the Llama 2 vocabulary (a minimal training call is sketched below)
- 20 how to set the vocabulary size? revisiting the gpt.py transformer (see the final sketch below)
- 21 training new tokens, example of prompt compression
- 22 multimodal [image, video, audio] tokenization with vector quantization
- 23 revisiting and explaining the quirks of LLM tokenization
- 24 final recommendations
- 25 ??? :)
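
The sketches below illustrate the main technical steps in the chapters above. They are minimal, hedged reconstructions in the spirit of the lecture, not its exact code. First, the Unicode chapters: a Python string is a sequence of code points, and UTF-8 maps it to the raw bytes a byte-level tokenizer starts from.

```python
# Strings are sequences of Unicode code points; UTF-8 maps them to bytes.
s = "hello 안녕 👋"
print([ord(c) for c in s])        # one code point per character
raw = s.encode("utf-8")           # variable-length byte encoding
print(list(raw))                  # ASCII chars take 1 byte, Hangul 3, the emoji 4
print(raw.decode("utf-8") == s)   # lossless round trip -> True
```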
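The core of chapters 8-10: count consecutive pairs, merge the most frequent pair into a new token id, and repeat in a loop until the desired number of merges. This follows the structure of the lecture's implementation; the toy training text here is made up for illustration.

```python
def get_stats(ids):
    """Count occurrences of each consecutive pair of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Training loop: start from raw UTF-8 bytes and learn `num_merges` merges.
text = "tokenization is fun, tokenization is weird, tokenization is everywhere. " * 4
tokens = list(text.encode("utf-8"))   # ids 0..255 to begin with
num_merges = 20
merges = {}                           # (pair) -> new token id
ids = list(tokens)
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most common consecutive pair
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```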
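Chapters 12-13, continuing from the training sketch above (it reuses `get_stats`, `merge`, and `merges`): a vocabulary maps each token id back to bytes for decoding, and encoding replays the learned merges in training order.

```python
# Vocabulary: bytes 0..255 plus one bytes object per learned merge.
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():  # dicts preserve insertion (merge) order
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    """Token ids -> string; errors='replace' guards against invalid UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

def encode(s):
    """String -> token ids, applying merges in the order they were learned."""
    ids = list(s.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        # among the pairs present, pick the one learned earliest in training
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                     # nothing left to merge
        ids = merge(ids, pair, merges[pair])
    return ids

print(decode(encode("tokenization is fun")))  # round-trips the input
```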
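Chapter 14's split pattern. This is the GPT-2 pattern as published in OpenAI's encoder.py; it requires the third-party `regex` package because the stdlib `re` lacks `\p{...}` Unicode classes. Text is chunked by this pattern and BPE merges never cross chunk boundaries.

```python
import regex as re  # pip install regex; stdlib `re` has no \p{L}/\p{N}

# GPT-2's split pattern from OpenAI's encoder.py.
gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)
print(gpt2_pat.findall("Hello world123 how's it going!!!"))
# ['Hello', ' world', '123', ' how', "'s", ' it', ' going', '!!!']
```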
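Chapters 15 and 17 in code: tiktoken exposes both vocabularies, and by default it refuses special tokens in user text unless they are explicitly allowed.

```python
import tiktoken  # pip install tiktoken

enc_gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2 vocabulary
enc_gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4 vocabulary
# GPT-4's regex and merges treat runs of whitespace differently than GPT-2's,
# so the same indented string tokenizes into far fewer tokens.
print(enc_gpt2.encode("    hello world"))
print(enc_gpt4.encode("    hello world"))

# Special tokens raise an error by default; they must be allowed explicitly.
print(enc_gpt4.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```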
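Chapter 19: sentencepiece trains directly on raw text. A minimal BPE training call in the style the lecture uses for a Llama-2-like setup; `toy.txt` is a placeholder corpus you supply, and the options shown are a small subset of what the lecture configures.

```python
import sentencepiece as spm  # pip install sentencepiece

# Minimal BPE training; `toy.txt` is a placeholder corpus.
# byte_fallback=True keeps rare bytes encodable, as in Llama 2's tokenizer.
spm.SentencePieceTrainer.train(
    input="toy.txt", model_prefix="tok400",
    model_type="bpe", vocab_size=400, byte_fallback=True,
)

sp = spm.SentencePieceProcessor(model_file="tok400.model")
print(sp.encode("hello 안녕하세요", out_type=str))  # pieces; ▁ marks word starts
```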
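Chapter 20's point in miniature: in a GPT-style model the vocabulary size surfaces in exactly two modules, the token embedding table and the final projection to logits. A hedged sketch using generic PyTorch module names, not the lecture's gpt.py verbatim:

```python
import torch.nn as nn

vocab_size, n_embd = 50257, 768  # GPT-2-scale numbers, for illustration only
token_embedding = nn.Embedding(vocab_size, n_embd)  # token id -> vector
lm_head = nn.Linear(n_embd, vocab_size)             # vector -> logits per token
# Growing the vocabulary enlarges both tables and makes each individual token
# rarer in the training data -- the trade-off the chapter discusses.
```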