Building the GPT Tokenizer - From Strings to Tokens and Back

Andrej Karpathy via YouTube

Classroom Contents

  1. intro: Tokenization, GPT-2 paper, tokenization-related issues
  2. tokenization by example in a Web UI (tiktokenizer)
  3. strings in Python, Unicode code points
  4. Unicode byte encodings: ASCII, UTF-8, UTF-16, UTF-32 (see the Unicode sketch below)
  5. daydreaming: deleting tokenization
  6. Byte Pair Encoding (BPE) algorithm walkthrough
  7. starting the implementation
  8. counting consecutive pairs, finding the most common pair
  9. merging the most common pair
  10. training the tokenizer: adding the while loop, compression ratio (see the training sketch below)
  11. tokenizer/LLM diagram: it is a completely separate stage
  12. decoding tokens to strings
  13. encoding strings to tokens (see the encode/decode sketch below)
  14. regex patterns to force splits across categories (see the split-pattern sketch below)
  15. tiktoken library intro, differences between the GPT-2/GPT-4 regex (see the tiktoken sketch below)
  16. GPT-2 encoder.py released by OpenAI, walkthrough
  17. special tokens, tiktoken handling of, GPT-2/GPT-4 differences
  18. minbpe exercise time! write your own GPT-4 tokenizer
  19. sentencepiece library intro, used to train Llama 2 vocabulary
  20. how to set vocabulary size? revisiting gpt.py transformer
  21. training new tokens, example of prompt compression
  22. multimodal [image, video, audio] tokenization with vector quantization
  23. revisiting and explaining the quirks of LLM tokenization
  24. final recommendations
  25. ??? :)
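The sketch below illustrates the ground covered in chapters 3 and 4: a Python string is a sequence of Unicode code points, and UTF-8 encodes those code points into the raw byte stream that byte-level BPE starts from. This is generic Python, not code from the video:

```python
# A Python string is a sequence of Unicode code points.
s = "héllo 👋"
print([ord(ch) for ch in s])   # the code point of each character

# UTF-8 encodes code points as a variable-length byte stream (1-4 bytes each).
b = s.encode("utf-8")
print(list(b))                 # raw byte values, each in 0..255
print(b.decode("utf-8") == s)  # decoding round-trips to the original string
```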
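Chapters 7 through 10 build the BPE trainer step by step: count consecutive token pairs, merge the most common pair into a new token, and repeat in a loop while tracking the compression ratio. A minimal sketch of that loop; the helper names get_stats and merge follow the convention used in the video's minbpe code, but treat the details as illustrative:

```python
def get_stats(ids):
    """Count how often each consecutive pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# training loop: start from raw UTF-8 bytes, repeatedly merge the top pair
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))
merges = {}                        # pair -> new token id, in learning order
for i in range(3):                 # the number of merges sets the vocab size
    stats = get_stats(ids)
    top = max(stats, key=stats.get)
    idx = 256 + i                  # new ids start after the 256 raw byte values
    ids = merge(ids, top, idx)
    merges[top] = idx
print(f"compression ratio: {len(text.encode('utf-8')) / len(ids):.2f}x")
```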
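Chapters 12 and 13 add the two directions of the tokenizer interface on top of the learned merges. A simplified sketch reusing get_stats and merge from the previous block; the greedy rule in encode (always apply the earliest-learned merge available) follows the approach shown in the video:

```python
def decode(ids, merges):
    """Map token ids back to a string via the vocab implied by `merges`."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (p0, p1), idx in merges.items():  # dicts preserve insertion order
        vocab[idx] = vocab[p0] + vocab[p1]
    raw = b"".join(vocab[i] for i in ids)
    return raw.decode("utf-8", errors="replace")  # tolerate invalid UTF-8

def encode(text, merges):
    """Turn a string into token ids, applying earliest-learned merges first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break              # no learnable merge left in this sequence
        ids = merge(ids, pair, merges[pair])
    return ids

# round trip with the merges learned in the training sketch above
assert decode(encode("aaabdaaabac", merges), merges) == "aaabdaaabac"
```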
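Chapter 14's regex pre-splits text into chunks (letters, numbers, punctuation, whitespace) so that BPE merges never cross category boundaries. The pattern below is the GPT-2 split regex from OpenAI's encoder.py; it needs the third-party regex package for the \p{...} Unicode classes:

```python
import regex as re  # pip install regex; the stdlib `re` lacks \p{...} classes

# the GPT-2 split pattern from OpenAI's encoder.py
gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(re.findall(gpt2_pat, "Hello world123 how's it going!!!?"))
# -> ['Hello', ' world', '123', ' how', "'s", ' it', ' going', '!!!?']
# each chunk is BPE-merged independently, so merges never span categories
```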
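Chapter 15 introduces the tiktoken library for running OpenAI's pretrained tokenizers (inference only; it does not train new vocabularies). A small usage sketch, assuming the tiktoken package is installed:

```python
import tiktoken  # pip install tiktoken

enc_gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2 vocabulary
enc_gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4 vocabulary

text = "    hello world"
print(enc_gpt2.encode(text))  # GPT-2 keeps each leading space as its own token
print(enc_gpt4.encode(text))  # GPT-4's regex lets runs of whitespace merge
assert enc_gpt4.decode(enc_gpt4.encode(text)) == text  # lossless round trip
```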
