What is Speculative Decoding?
Speculative Decoding is an inference acceleration technique that uses a smaller, faster "draft" model to propose multiple tokens at once, which are then verified by the larger target model in a single parallel forward pass. Because verification preserves the target model's output distribution, it can achieve a 2-3x speedup with no quality loss.
How It Works
Traditional Autoregressive Decoding
Token 1 → Token 2 → Token 3 → Token 4 → ...
(Each token requires a full forward pass)
Speculative Decoding
Draft model proposes: [T1, T2, T3, T4, T5]
Target model verifies all in one pass
Accept: [T1, T2, T3] ✓   Reject: [T4, T5] ✗
Resample T4 from the target's corrected distribution, then continue drafting from there...
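The draft-then-verify loop above can be sketched in a few lines. Below is a minimal, self-contained Python sketch of the standard rejection-sampling verification rule; `draft_probs` and `target_probs` are toy stand-ins for real models, and the vocabulary size and draft length are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8   # toy vocabulary size (illustrative)
K = 5       # tokens proposed by the draft model per step

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def draft_probs(prefix):
    # Stand-in for the small draft model's next-token distribution.
    return softmax(np.cos(np.arange(VOCAB) + len(prefix)))

def target_probs(prefix):
    # Stand-in for the large target model's next-token distribution.
    return softmax(1.2 * np.cos(np.arange(VOCAB) + len(prefix)))

def speculative_step(prefix):
    # 1) Draft model proposes K tokens autoregressively (cheap per token).
    drafted, q = [], []
    for _ in range(K):
        dist = draft_probs(prefix + drafted)
        tok = int(rng.choice(VOCAB, p=dist))
        drafted.append(tok)
        q.append(dist)

    # 2) Target model scores all K drafted positions in one parallel pass
    #    (written as a loop here only for clarity).
    p = [target_probs(prefix + drafted[:i]) for i in range(K)]

    # 3) Verify left to right: accept token i with probability min(1, p_i/q_i).
    accepted = []
    for i, tok in enumerate(drafted):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual max(0, p - q). This is what
            # keeps the overall output distribution identical to the target's.
            residual = np.maximum(p[i] - q[i], 0.0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            break
    return accepted

print(speculative_step([1, 2, 3]))
```

Real implementations also sample one extra token from the target distribution when all K drafts are accepted, and run the verification as a single batched forward pass rather than the per-position loop shown here.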
Key Properties
| Property | Description |
|---|---|
| Lossless | Output distribution is identical to sampling from the target model alone |
| Speedup | 2-3x typical, depends on acceptance rate |
| Draft Model | Smaller version or distilled model |
| Acceptance Rate | Higher = faster, task-dependent |
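The speedup figure can be made concrete. Under the simple model used in the original speculative decoding papers, if each drafted token is accepted with probability α and k tokens are drafted per target pass, the expected number of tokens produced per target pass is (1 − α^(k+1)) / (1 − α). A back-of-the-envelope helper, with purely illustrative numbers:

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per target forward pass, assuming each
    drafted token is accepted independently with probability alpha."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough speedup vs. plain autoregressive decoding, where draft_cost is
    the cost of one draft forward pass relative to one target pass."""
    tokens = expected_tokens_per_target_pass(alpha, k)
    return tokens / (1 + k * draft_cost)

# Illustrative numbers: 80% acceptance, 5 drafted tokens, draft 20x cheaper.
print(round(speedup(alpha=0.8, k=5, draft_cost=0.05), 2))  # ~2.95x
```

This simple model ignores batching and memory effects, but it shows why the acceptance rate dominates the achievable speedup.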
When Speculative Decoding Helps
- ✅ Code generation (high acceptance rate)
- ✅ Boilerplate text
- ✅ Predictable patterns
- ❌ Creative writing (low acceptance)
- ❌ Complex reasoning (unpredictable)
Draft Model Selection
Good draft models are:
- Much smaller: 10-100x fewer parameters
- Similar distribution: Trained on similar data
- Fast: Low latency per token
Example pairs:
- Llama 70B + Llama 7B
- GPT-4 + GPT-3.5
- Custom distilled models
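The "similar distribution" criterion can be quantified: when drafting from q and verifying against the target distribution p, the per-token acceptance probability is Σ_x min(p(x), q(x)), i.e. one minus the total variation distance between the two distributions. A small numpy sketch with made-up distributions:

```python
import numpy as np

def acceptance_prob(p: np.ndarray, q: np.ndarray) -> float:
    """Per-token acceptance probability when drafting from q and verifying
    against p: sum_x min(p(x), q(x)) = 1 - total variation distance."""
    return float(np.minimum(p, q).sum())

# Made-up next-token distributions over a 4-token vocabulary.
target      = np.array([0.70, 0.20, 0.05, 0.05])
close_draft = np.array([0.60, 0.25, 0.10, 0.05])   # similar -> high acceptance
far_draft   = np.array([0.10, 0.10, 0.40, 0.40])   # dissimilar -> low acceptance

print(acceptance_prob(target, close_draft))  # 0.90
print(acceptance_prob(target, far_draft))    # 0.30
```

In practice, running both models over a sample of your own prompts and averaging this quantity gives a quick estimate of the acceptance rate to expect from a candidate draft model.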
Implementations
| Framework | Support |
|---|---|
| vLLM | Built-in |
| TensorRT-LLM | Supported |
| llama.cpp | Experimental |
| Hugging Face | Assisted generation (`assistant_model` in `generate()`) |
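As a concrete starting point, Hugging Face `transformers` exposes this as assisted generation: pass the draft model via the `assistant_model` argument of `generate()`. A minimal sketch; the checkpoint names are illustrative, and the two models must share a tokenizer and vocabulary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints: a large target model and a much smaller sibling
# from the same family (assumption; any compatible pair works).
target_name = "meta-llama/Llama-2-70b-hf"
draft_name = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_name, device_map="auto")

prompt = "Write a function that reverses a string:"
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# assistant_model enables assisted generation (speculative decoding).
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

vLLM and TensorRT-LLM expose equivalent options through their own configuration; consult each framework's documentation for the exact settings.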
Variants
- Medusa: Multiple prediction heads added to the target model itself, no separate draft model
- Lookahead Decoding: Generates and verifies n-gram guesses without a separate draft model
- EAGLE: Drafts with a lightweight head that reuses the target model's hidden features, keeping overhead minimal
Related Concepts
- Inference Optimization - General speedup techniques
- KV Cache - Memory optimization