Beginner ⏱️ 4 min

🎓 What is Tokenization?

How language models convert text into numbers they can process

What is Tokenization?

Tokenization is the process of converting text into discrete units (tokens) that language models can process. It’s the first step in any NLP pipeline and directly impacts model performance, vocabulary size, and multilingual capability.

Why Tokenization Matters

Language models work with numbers, not text. Tokenization bridges this gap:

"Hello, world!" → [15496, 11, 1917, 0] → Model → [output tokens] → "Response"

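A minimal sketch of this round trip, assuming the Hugging Face transformers package and the GPT-2 tokenizer (both assumptions here; the exact IDs depend on which tokenizer you load):

# Round-trip a string through a tokenizer: text -> IDs -> text.
# Assumes the `transformers` package; token IDs are tokenizer-specific.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Hello, world!")   # text -> list of integer token IDs
print(ids)

text = tokenizer.decode(ids)              # token IDs -> text
print(text)                               # Hello, world!
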
Tokenization Methods

1. Word-Level

Splits on whitespace and punctuation.

  • ❌ Huge vocabulary (100K+ words)
  • ❌ Can’t handle unknown words
  • ❌ Poor for morphologically rich languages

2. Character-Level

Each character is a token.

  • ✅ Small vocabulary
  • ❌ Very long sequences
  • ❌ Loses word-level meaning
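
To see the difference between the first two methods, you can split the same sentence both ways. The snippet below is a toy Python illustration; the regex-based "word-level" split is a simplification, not a production tokenizer:

import re

sentence = "Hello, how are you?"

# Word-level (toy): split on whitespace, keep punctuation as separate tokens
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(word_tokens)    # ['Hello', ',', 'how', 'are', 'you', '?']

# Character-level: every character (including spaces) becomes a token
char_tokens = list(sentence)
print(len(word_tokens), len(char_tokens))   # 6 vs. 19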

3. Subword (Modern Standard)

Balances vocabulary size and sequence length.

Algorithm                 | Used By
BPE (Byte-Pair Encoding)  | GPT, LLaMA
WordPiece                 | BERT, DistilBERT
Unigram                   | T5, XLNet
SentencePiece             | Many multilingual models

Byte-Pair Encoding (BPE)

The most popular subword method:

  1. Start with a character-level vocabulary
  2. Find the most frequent adjacent token pair
  3. Merge it into a new token
  4. Repeat until the desired vocabulary size is reached

Example:

"lower" → ["l", "o", "w", "e", "r"]
After BPE: ["low", "er"]
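
Below is a minimal Python sketch of the merge loop, in the style of the classic Sennrich et al. BPE algorithm. The toy corpus, word frequencies, and number of merges are illustrative assumptions, and the end-of-word marker used by real implementations is omitted for brevity:

import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every free-standing occurrence of the pair into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is pre-split into characters, with its frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(10):                    # 10 merges for this toy example
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")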

Vocabulary Size Trade-offs

Size         | Pros                            | Cons
Small (32K)  | Better generalization           | Longer sequences, more compute
Large (128K) | Shorter sequences, less compute | Larger embedding matrix
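
To make the "larger embedding matrix" cost concrete, here is a back-of-the-envelope calculation; the hidden size of 4096 is an assumption for illustration:

# Embedding parameters = vocabulary size x hidden size (hidden size assumed).
hidden_size = 4096
for vocab_size in (32_000, 128_000):
    params = vocab_size * hidden_size
    print(f"{vocab_size:,} tokens -> {params / 1e6:.0f}M embedding parameters")
# 32,000 tokens -> 131M embedding parameters
# 128,000 tokens -> 524M embedding parameters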

Tokenization Efficiency

The same sentence can require very different numbers of tokens depending on the language:

Language  | Tokens for “Hello, how are you?”
English   | ~6 tokens
Chinese   | ~12 tokens
Japanese  | ~15 tokens

This affects context length and cost for non-English users.
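
You can measure this yourself. The sketch below assumes the tiktoken package and its cl100k_base encoding, with rough translations of the sample sentence; actual counts vary by tokenizer and phrasing:

# Compare token counts for (roughly) the same sentence across languages.
# Assumes the `tiktoken` package; results depend on the chosen encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
    "Japanese": "こんにちは、お元気ですか？",
}

for language, text in samples.items():
    print(f"{language}: {len(enc.encode(text))} tokens")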

Special Tokens

Token   | Purpose
<BOS>   | Beginning of sequence
<EOS>   | End of sequence
<PAD>   | Padding for batching
<UNK>   | Unknown token
<MASK>  | Masked language modeling
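
Each model family defines its own concrete strings for these roles. A small sketch, assuming the Hugging Face transformers package and the bert-base-uncased tokenizer (BERT uses [CLS]/[SEP] where other models use <BOS>/<EOS>):

# Inspect a model's special tokens and see them added during encoding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.special_tokens_map)
# e.g. {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#       'cls_token': '[CLS]', 'mask_token': '[MASK]'}

ids = tokenizer("Hello, world!")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']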
