What is KV Cache?
KV Cache (Key-Value Cache) is an inference optimization technique that stores previously computed attention key and value tensors during autoregressive generation, trading extra memory for a dramatic reduction in redundant computation and much faster generation.
The Problem
In autoregressive generation, each new token requires computing attention over all previous tokens. Without caching:
Token 1: Compute attention for [1]
Token 2: Compute attention for [1, 2] ← Recomputes 1
Token 3: Compute attention for [1, 2, 3] ← Recomputes 1, 2
...
Token N: Compute attention for [1...N] ← Recomputes 1...(N-1)
Because every step redoes the attention work for the entire prefix, the attention cost at step t grows as O(t²) and the total cost of generating N tokens grows as O(N³).
The Solution
Cache the key (K) and value (V) projections for each layer:
Token 1: Compute K₁, V₁ → Cache
Token 2: Compute K₂, V₂ → Cache, use [K₁,K₂], [V₁,V₂]
Token 3: Compute K₃, V₃ → Cache, use [K₁,K₂,K₃], [V₁,V₂,V₃]
Now each new token only attends over the cached keys and values, which is O(N) work per step and O(N²) in total instead of O(N³).
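As a rough illustration, here is a minimal single-head decode step in PyTorch; the projection weights, dimensions, and toy loop are made up for the sketch, not taken from any particular model:

```python
import torch
import torch.nn.functional as F

d_model, head_dim = 64, 64
W_q = torch.randn(d_model, head_dim)
W_k = torch.randn(d_model, head_dim)
W_v = torch.randn(d_model, head_dim)

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x_t):
    """x_t: hidden state of the newest token, shape (d_model,)."""
    q = x_t @ W_q                       # query for the new token only
    k_cache.append(x_t @ W_k)           # cache K for this token
    v_cache.append(x_t @ W_v)           # cache V for this token
    K = torch.stack(k_cache)            # (t, head_dim) -- reused, never recomputed
    V = torch.stack(v_cache)
    scores = (K @ q) / head_dim ** 0.5  # O(t) work per step instead of O(t^2)
    attn = F.softmax(scores, dim=-1)
    return attn @ V                     # attention output for the new token

for t in range(5):                      # toy autoregressive loop
    out = decode_step(torch.randn(d_model))
```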
Memory Requirements
KV cache memory grows with:
- Batch size: More sequences = more cache
- Context length: Longer sequences = larger cache
- Model size: More layers and heads = more storage
Formula:
KV_memory = 2 × batch × layers × kv_heads × seq_len × head_dim × bytes_per_element
The leading 2 accounts for storing both K and V; for models without GQA/MQA, kv_heads is simply the number of attention heads.
For a 70B model with 32K context and batch 8:
- KV cache alone can use 40+ GB!
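Plugging the formula in for an assumed Llama-2-70B-style configuration (80 layers, 8 KV heads via GQA, head dimension 128, fp16 cache, i.e. 2 bytes per element) gives a quick back-of-the-envelope estimate:

```python
def kv_cache_bytes(batch, layers, kv_heads, seq_len, head_dim, bytes_per_elem):
    # Factor of 2 accounts for storing both K and V tensors.
    return 2 * batch * layers * kv_heads * seq_len * head_dim * bytes_per_elem

total = kv_cache_bytes(batch=8, layers=80, kv_heads=8,
                       seq_len=32_768, head_dim=128, bytes_per_elem=2)
print(f"{total / 1e9:.0f} GB")  # ~86 GB under these assumptions, consistent with "40+ GB"
```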
Memory Optimization Techniques
1. PagedAttention (vLLM)
Manages KV cache like virtual memory pages:
- Non-contiguous memory allocation
- No fragmentation
- Efficient memory sharing between sequences
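A toy sketch of the block-table idea follows; the block size, class, and method names are illustrative only and not vLLM's actual API:

```python
BLOCK_SIZE = 16                               # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.table = {}                       # seq_id -> list of physical block ids
        self.length = {}                      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token of a sequence."""
        n = self.length.get(seq_id, 0)
        blocks = self.table.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:               # current block is full (or none yet)
            blocks.append(self.free.pop())    # any free block works: no contiguity needed
        self.length[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return a finished sequence's blocks to the pool, avoiding fragmentation."""
        self.free.extend(self.table.pop(seq_id, []))
        self.length.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                           # 40 tokens -> 3 blocks of 16
    cache.append_token("seq-0")
print(cache.table["seq-0"])                   # e.g. [1023, 1022, 1021]
```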
2. Grouped Query Attention (GQA)
Shares KV heads across multiple query heads:
- 8x reduction in KV cache for Llama 2 70B
- Minimal quality impact
3. Multi-Query Attention (MQA)
All query heads share a single key/value head:
- Maximum memory savings
- Some quality trade-off
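GQA and MQA are the same knob viewed from two ends: the number of KV heads kept in the cache. A small sketch, assuming a Llama-2-70B-style layout (64 query heads, head dimension 128):

```python
import torch

num_q_heads, head_dim, seq_len = 64, 128, 1024

def kv_cache_shape(num_kv_heads):
    # MHA: num_kv_heads == num_q_heads, GQA: fewer, MQA: exactly 1
    return (num_kv_heads, seq_len, head_dim)

print(kv_cache_shape(64))  # MHA -> (64, 1024, 128)
print(kv_cache_shape(8))   # GQA -> (8, 1024, 128), 8x smaller cache
print(kv_cache_shape(1))   # MQA -> (1, 1024, 128), 64x smaller cache

# At attention time the cached KV heads are broadcast to the query heads:
k_cache = torch.randn(8, seq_len, head_dim)               # 8 KV heads (GQA)
k_for_queries = k_cache.repeat_interleave(num_q_heads // 8, dim=0)
print(k_for_queries.shape)                                # (64, 1024, 128)
```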
4. Sliding Window
Only cache recent tokens:
- Fixed memory regardless of total length
- Used by Mistral models
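A minimal sketch of the idea as a ring buffer that retains only the most recent W tokens; the deque layout and toy string entries are illustrative (real caches store tensors), and the window size is just an example:

```python
from collections import deque

class SlidingWindowCache:
    def __init__(self, window=4096):        # e.g. Mistral 7B uses a 4096-token window
        self.k = deque(maxlen=window)       # oldest entries are evicted automatically
        self.v = deque(maxlen=window)

    def append(self, k_t, v_t):
        self.k.append(k_t)
        self.v.append(v_t)

cache = SlidingWindowCache(window=4)
for t in range(10):
    cache.append(f"K{t}", f"V{t}")
print(list(cache.k))                        # ['K6', 'K7', 'K8', 'K9'] -- memory stays fixed
```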
KV Cache in Practice
| Framework | KV Cache Strategy |
|---|---|
| vLLM | PagedAttention |
| TensorRT-LLM | Paged KV Cache |
| Ollama | Standard caching |
| llama.cpp | Ring buffer option |
Related Concepts
- Flash Attention - Efficient attention computation
- Context Length - Maximum sequence length
- Inference Optimization - General optimization techniques
- VRAM - GPU memory management