Byte Pair Encoding (BPE): the tokenizer that made GPTs practical
Introduction Byte Pair Encoding (BPE) is a subword tokenization scheme that gives us the best of both worlds: compact vocabulary sizes (not the full wordlist), the ability to represent any unknown word (by falling back to subwords/characters), and meaningful shared pieces (roots, suffixes) that help models generalize. GPT-2 used a BPE tokenizer with a vocabulary of ≈50,257 tokens, and OpenAI’s tiktoken is a fast Rust-backed implementation you can use today. Below I explain the why, the how (intuition + algorithm), and a short hands-on demo using tiktoken. ...