Advanced ⏱️ 6 min

πŸŽ“ What is Speculative Decoding?

Accelerating LLM inference by using a smaller draft model to propose tokens

Speculative Decoding is an inference acceleration technique that uses a smaller, faster β€œdraft” model to propose multiple tokens at once, which are then verified by the larger target model in parallel. This can achieve 2-3x speedup with no quality loss.

How It Works

Traditional Autoregressive Decoding

Token 1 β†’ Token 2 β†’ Token 3 β†’ Token 4 β†’ ...
(Each token requires a full forward pass)

Speculative Decoding

Draft model proposes: [T1, T2, T3, T4, T5]
Target model verifies all in one pass
Accept: [T1, T2, T3] βœ“  Reject: [T4, T5] βœ—
Target supplies its own token in place of T4, then drafting continues...
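A minimal sketch of one propose-and-verify step, in the greedy setting where a drafted token is accepted only if it matches the target's own greedy choice. Here draft_model and target_model are hypothetical callables that map a token-id tensor to next-token logits (not any particular library's API), and the draft length k is illustrative:

import torch

def speculative_step(target_model, draft_model, input_ids, k=5):
    """One propose-and-verify step (greedy variant).

    Both models are assumed to map token ids of shape (1, seq_len) to
    logits of shape (1, seq_len, vocab_size); k is the draft length.
    """
    prompt_len = input_ids.shape[1]

    # 1) Draft model proposes k tokens autoregressively (cheap, small model).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids)                        # (1, len, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)  # (1, 1)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, prompt_len:]                       # (1, k)

    # 2) Target model scores prompt + draft in ONE forward pass.
    target_logits = target_model(draft_ids)                    # (1, len + k, vocab)
    # Target's greedy choice at each drafted position (logits at i predict token i+1).
    target_choice = target_logits[:, prompt_len - 1 : prompt_len - 1 + k].argmax(dim=-1)

    # 3) Accept the longest prefix on which draft and target agree.
    matches = (proposed == target_choice)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum())

    # 4) Keep accepted tokens, then append the target's own token at the first
    #    disagreement (or one bonus token if every drafted token was accepted).
    accepted = proposed[:, :n_accept]
    if n_accept < k:
        correction = target_choice[:, n_accept : n_accept + 1]
    else:
        correction = target_logits[:, -1].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, correction], dim=-1)

Looping this step produces exactly the tokens that greedy decoding with the target alone would produce, but the target runs once per several tokens instead of once per token.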

Key Properties

Property          Description
Lossless          Output is mathematically identical to the target model (acceptance rule sketched below)
Speedup           2-3x typical, depends on acceptance rate
Draft Model       Smaller version of the target or a distilled model
Acceptance Rate   Higher = faster; task-dependent
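The Lossless row holds because of the acceptance rule from the original speculative-sampling work: a drafted token x is kept with probability min(1, p_target(x) / p_draft(x)), and on rejection a replacement is sampled from the normalized residual max(p_target - p_draft, 0), which restores the exact target distribution. A rough sketch for a single position, assuming p_target and p_draft are plain next-token probability vectors:

import torch

def accept_or_resample(p_target, p_draft, drafted_token):
    """Verify one drafted token so the result matches sampling from p_target.

    p_target, p_draft: 1-D tensors of next-token probabilities (each sums to 1).
    Returns (accepted, token): the drafted token if accepted, otherwise a token
    resampled from the residual distribution max(p_target - p_draft, 0).
    """
    # Accept with probability min(1, p_target[x] / p_draft[x]).
    ratio = p_target[drafted_token] / p_draft[drafted_token]
    if torch.rand(()) < ratio:
        return True, drafted_token

    # Rejected: resample from the normalized positive residual; this correction
    # is what makes the overall procedure exactly equivalent to the target.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, num_samples=1))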

When Speculative Decoding Helps

  • βœ… Code generation (high acceptance rate)
  • βœ… Boilerplate text
  • βœ… Predictable patterns
  • ❌ Creative writing (low acceptance)
  • ❌ Complex reasoning (unpredictable)
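The split between the ✅ and ❌ cases above comes down to acceptance rate. Under the standard analysis (assuming each of γ drafted tokens is accepted independently with probability α), the expected number of tokens produced per target forward pass is (1 - α^(γ+1)) / (1 - α); actual wall-clock speedup is somewhat lower because the draft model also costs time. A quick back-of-the-envelope check:

def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens generated per target forward pass, assuming each of the
    gamma drafted tokens is accepted independently with probability alpha
    (the verification pass always contributes one token of its own)."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# High acceptance (e.g. boilerplate/code) vs low acceptance (e.g. creative text):
print(expected_tokens_per_target_pass(alpha=0.8, gamma=5))  # ~3.69 tokens per pass
print(expected_tokens_per_target_pass(alpha=0.3, gamma=5))  # ~1.43 tokens per pass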

Draft Model Selection

Good draft models are:

  • Much smaller: 10-100x fewer parameters
  • Similar distribution: Trained on similar data
  • Fast: Low latency per token

Example pairs:

  • Llama 70B + Llama 7B
  • GPT-4 + GPT-3.5
  • Custom distilled models

Implementations

Framework        Support
vLLM             Built-in
TensorRT-LLM     Supported
llama.cpp        Experimental
Hugging Face     assisted_generation (example below)
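For the Hugging Face row, assisted generation is enabled by passing a smaller assistant_model to generate(). The sketch below follows the pattern from the Transformers documentation; the OPT checkpoints, prompt, and max_new_tokens value are illustrative, and exact behavior varies by library version:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Target (large) and assistant/draft (small) checkpoints from the same family,
# so they share a tokenizer; both model names are just examples.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
assistant = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")

# Passing assistant_model switches generate() to assisted (speculative) decoding:
# the assistant drafts tokens and the target verifies them.
outputs = target.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))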

Variants

  • Medusa: Multiple prediction heads added to the same model
  • Lookahead Decoding: No separate draft model; the target generates and verifies n-grams in parallel
  • Eagle: Efficient speculation via a lightweight head over the target's hidden features, with minimal overhead

πŸ•ΈοΈ Knowledge Mesh