Tokenization under the hood: BPE, WordPiece, SentencePiece, and Unigram compared

You deploy a chatbot. English queries average 42 tokens each. Then a Spanish-speaking user sends "¿Cómo puedo restablecer mi contraseña?" and it eats 103 tokens. Two weeks later, the same model starts outputting "Ġcon" at the edges of its generations and you cannot tell if it is a bug or a feature. The finance team flags a 40% month-over-month cost increase that no one can explain.

This is what happens when tokenization is treated as invisible plumbing. Every major LLM pipeline uses one of four subword tokenization algorithms, and the choice determines vocabulary size, handling of rare words, cross-language efficiency, and inference cost. Understanding which one your model uses -- and why -- is the difference between shipping a cost-efficient product and discovering mid-quarter that your token-per-query ratio quietly doubled.

Why this matters

Tokenization directly controls three things that hit your bottom line: