What is Tokenization?
Tokenization is the process of converting text into discrete units (tokens) that language models can process. It's the first step in any NLP pipeline and directly impacts model performance, vocabulary size, and multilingual capability.
Why Tokenization Matters
Language models work with numbers, not text. Tokenization bridges this gap:
"Hello, world!" β [15496, 11, 1917, 0] β Model β [output tokens] β "Response"
Tokenization Methods
1. Word-Level
Splits on whitespace and punctuation.
- ❌ Huge vocabulary (100K+ words)
- ❌ Can't handle unknown words
- ❌ Poor for morphologically rich languages
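A minimal illustration of word-level splitting (the regex here is just illustrative, not any library's standard tokenizer):

```python
import re

def word_tokenize(text: str) -> list[str]:
    # Split on whitespace and treat punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```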
2. Character-Level
Each character is a token.
- ✅ Small vocabulary
- ❌ Very long sequences
- ❌ Loses word-level meaning
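Character-level tokenization needs no learned vocabulary at all, which also shows why sequences blow up:

```python
text = "Hello, world!"
tokens = list(text)  # every character, including spaces and punctuation, is a token
print(tokens)        # ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']
print(len(tokens))   # 13 tokens for a 13-character string; sequences grow quickly
```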
3. Subword (Modern Standard)
Balances vocabulary size and sequence length.
| Algorithm | Used By |
|---|---|
| BPE (Byte-Pair Encoding) | GPT, LLaMA |
| WordPiece | BERT, DistilBERT |
| Unigram | T5, XLNet |
| SentencePiece | Many multilingual models |
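A quick way to see subword splitting in practice is to compare two of these tokenizers, assuming the Hugging Face transformers library and the pretrained vocabularies are available (exact splits depend on each model's learned vocabulary):

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE; BERT uses WordPiece (## marks word-internal pieces).
bpe = AutoTokenizer.from_pretrained("gpt2")
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")

word = "tokenization"
print(bpe.tokenize(word))        # rare word split into frequent subwords
print(wordpiece.tokenize(word))  # e.g. ['token', '##ization']
```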
Byte-Pair Encoding (BPE)
The most popular subword method:
- Start with a character-level vocabulary
- Count the most frequent pair of adjacent tokens in the training corpus
- Merge that pair into a single new token
- Repeat until the desired vocabulary size is reached
Example:
"lower" β ["l", "o", "w", "e", "r"]
After BPE: ["low", "er"]
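The loop below is a toy, from-scratch sketch of that training procedure on a tiny word list; it is not the implementation used by any particular model, but it follows the same merge-counting idea:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE trainer: learn `num_merges` merges from a list of words."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, merging the chosen pair into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["lower", "lowest", "newer", "wider"], num_merges=4)
print(merges)        # learned merges in frequency order, e.g. ('w', 'e') first
print(list(corpus))  # the same words, now written as merged subword symbols
```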
Vocabulary Size Trade-offs
| Size | Pros | Cons |
|---|---|---|
| Small (32K) | Better generalization, smaller embedding matrix | Longer sequences, more compute |
| Large (128K) | Shorter sequences, less compute | Larger embedding matrix |
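To make the embedding-matrix cost concrete, here is a back-of-the-envelope calculation assuming a hypothetical hidden size of 4096 (the embedding layer stores one vector per vocabulary entry):

```python
d_model = 4096                      # hypothetical embedding dimension
for vocab_size in (32_000, 128_000):
    params = vocab_size * d_model   # one d_model-sized row per token
    print(f"{vocab_size:>7} tokens -> {params / 1e6:.0f}M embedding parameters")
# Prints roughly 131M parameters for 32K and 524M for 128K.
```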
Tokenization Efficiency
Different languages tokenize differently:
| Language | Tokens for "Hello, how are you?" |
|---|---|
| English | ~6 tokens |
| Chinese | ~12 tokens |
| Japanese | ~15 tokens |
This affects context length and cost for non-English users.
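One way to observe this is to count tokens for translated versions of the same sentence; the sketch below uses tiktoken's cl100k_base encoding, and the exact counts (and the sample translations) are illustrative rather than canonical:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":  "Hello, how are you?",
    "Chinese":  "你好，你好吗？",
    "Japanese": "こんにちは、お元気ですか？",
}
for lang, text in samples.items():
    print(f"{lang}: {len(enc.encode(text))} tokens")
# Non-Latin scripts typically need more tokens for the same sentence.
```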
Special Tokens
| Token | Purpose |
|---|---|
| `<BOS>` | Beginning of sequence |
| `<EOS>` | End of sequence |
| `<PAD>` | Padding for batching |
| `<UNK>` | Unknown token |
| `<MASK>` | For masked language modeling |
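The exact names vary between model families. A sketch with Hugging Face's BERT tokenizer (which uses [CLS] and [SEP] in the roles of `<BOS>` and `<EOS>`) shows where special tokens appear in a padded batch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize two sentences of different length; the shorter one is padded.
batch = tok(["Hello, world!", "Hi"], padding=True)
print(tok.convert_ids_to_tokens(batch["input_ids"][1]))
# e.g. ['[CLS]', 'hi', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(tok.special_tokens_map)
# Maps generic roles (unk, sep, pad, cls, mask) to this model's token strings.
```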
Related Concepts
- Context Length - How tokens affect input limits
- Transformer - Architecture that processes tokens