Tokenization under the hood: BPE, WordPiece, SentencePiece, and Unigram compared

You deploy a chatbot. English queries average 42 tokens each. Then a Spanish-speaking user sends "¿Cómo puedo restablecer mi contraseña?" and it eats 103 tokens. Two weeks later, the same model starts outputting "Ġcon" at the edges of its generations and you cannot tell if it is a bug or a feature. The finance team flags a 40% month-over-month cost increase that no one can explain.

This is what happens when tokenization is treated as invisible plumbing. Most production LLM pipelines rely on one of four subword tokenization approaches, and the choice shapes vocabulary size, handling of rare words, cross-language efficiency, and inference cost. Understanding which one your model uses, and why, is the difference between shipping a cost-efficient product and discovering mid-quarter that your token-per-query ratio has quietly doubled.
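To make the cross-language cost gap concrete, here is a minimal sketch of BPE-style merge learning in pure Python. It is a toy, not the tokenizer of any particular model: the function names (`train_bpe`, `apply_merge`, `encode`) and the tiny corpus are hypothetical, and it starts from characters and skips pre-tokenization, whereas production tokenizers typically start from bytes and use far larger corpora. The point it illustrates is only that merges learned from one language's text do little to compress another's.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; break ties deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[tuple(apply_merge(list(symbols), best))] += freq
        words = merged
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


if __name__ == "__main__":
    # Hypothetical English-only training corpus.
    english_corpus = ["reset", "password", "reset", "my", "password", "how"]
    merges = train_bpe(english_corpus, num_merges=20)

    for word in ["password", "contraseña"]:
        tokens = encode(word, merges)
        print(word, len(tokens), tokens)
```

With this corpus, "password" collapses into a single token because the trainer saw it, while "contraseña" stays fragmented: no learned merge contains "ñ", so it can never join a longer token. Real vocabularies are vastly larger, but the same mechanism is one plausible reason a Spanish query can cost more tokens than an English one of similar length.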

Why this matters

Tokenization directly controls three things that hit your bottom line:
