Advanced ⏱️ 5 min

πŸŽ“ What is KV Cache?

Key-Value caching mechanism that accelerates autoregressive generation in transformers

KV Cache (Key-Value Cache) is a memory optimization technique that stores previously computed attention key and value tensors during autoregressive generation, avoiding redundant computation and dramatically speeding up inference.

The Problem

In autoregressive generation, each new token requires computing attention over all previous tokens. Without caching:

Token 1: Compute attention for [1]
Token 2: Compute attention for [1, 2]      ← Recomputes 1
Token 3: Compute attention for [1, 2, 3]   ← Recomputes 1, 2
...
Token N: Compute attention for [1...N]     ← Recomputes 1...(N-1)

This leads to O(NΒ²) complexity for generating N tokens.

The Solution

Cache the key (K) and value (V) projections for each layer:

Token 1: Compute K₁, V₁ β†’ Cache
Token 2: Compute Kβ‚‚, Vβ‚‚ β†’ Cache, use [K₁,Kβ‚‚], [V₁,Vβ‚‚]
Token 3: Compute K₃, V₃ β†’ Cache, use [K₁,Kβ‚‚,K₃], [V₁,Vβ‚‚,V₃]

Now step t needs only O(t) work: one new key/value projection plus attention between the single new query and the t cached entries. The total for N tokens drops from O(N³) to O(N²).
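
The same toy head with a cache, again an illustrative sketch rather than production code: each step projects only the newest token and appends the result.

import numpy as np

def attention(q, K, V):
    # scaled dot-product attention for a single query vector
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []            # the KV cache: one entry per past token
for t in range(5):
    h = rng.standard_normal(d)       # hidden state of the new token
    k_cache.append(h @ Wk)           # project the NEW token only, then cache it
    v_cache.append(h @ Wv)
    out = attention(h @ Wq, np.stack(k_cache), np.stack(v_cache))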

Memory Requirements

KV cache memory grows with:

  • Batch size: More sequences = more cache
  • Context length: Longer sequences = larger cache
  • Model size: More layers and heads = more storage

Formula:

KV_memory = 2 × batch × layers × kv_heads × seq_len × head_dim × bytes_per_element

The factor of 2 covers keys and values; bytes_per_element is 2 for FP16/BF16. With standard multi-head attention, kv_heads equals the total head count; GQA and MQA (below) shrink it.

For a 70B model with 32K context and batch 8:

  • KV cache alone can use 40+ GB!
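
Plugging numbers into the formula shows where that figure comes from. The sketch below assumes a Llama-2-70B-style configuration (80 layers, 8 KV heads under GQA, head_dim 128, FP16); kv_cache_bytes is a hypothetical helper:

def kv_cache_bytes(batch, layers, kv_heads, seq_len, head_dim, bytes_per_elem=2):
    # factor of 2 covers keys AND values; bytes_per_elem=2 assumes FP16/BF16
    return 2 * batch * layers * kv_heads * seq_len * head_dim * bytes_per_elem

# Assumed Llama-2-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128
gb = kv_cache_bytes(batch=8, layers=80, kv_heads=8,
                    seq_len=32_768, head_dim=128) / 1e9
print(f"{gb:.1f} GB")   # ~85.9 GB, comfortably past the 40+ GB noted above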

Memory Optimization Techniques

1. PagedAttention (vLLM)

Manages the KV cache like virtual memory pages (see the sketch after this list):

  • Non-contiguous memory allocation
  • Near-zero memory fragmentation
  • Efficient memory sharing between sequences
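
A toy sketch of the block-table idea only; this is not vLLM's actual API, and names such as block_tables and on_new_token are hypothetical:

BLOCK_SIZE = 16                       # tokens per physical cache block
free_blocks = list(range(1024))       # pool of physical blocks
block_tables = {}                     # seq_id -> list of physical block ids

def on_new_token(seq_id, tokens_so_far):
    # A sequence claims a new (not necessarily contiguous) physical block
    # only when it crosses a block boundary.
    if tokens_so_far % BLOCK_SIZE == 0:
        block_tables.setdefault(seq_id, []).append(free_blocks.pop())

def on_sequence_finished(seq_id):
    # Freed blocks return to the pool immediately, ready for reuse
    # by other sequences, so memory does not fragment.
    free_blocks.extend(block_tables.pop(seq_id, []))

on_new_token("seq-A", 0)      # first token allocates seq-A's first block
on_sequence_finished("seq-A") # the block goes straight back to the pool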

2. Grouped Query Attention (GQA)

Shares each cached KV head across a group of query heads (illustrated below):

  • 8x reduction in KV cache for Llama 2 70B
  • Minimal quality impact
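
A toy numpy illustration of the sharing pattern; the head counts match Llama 2 70B's published configuration, but the layout is illustrative:

import numpy as np

n_q_heads, n_kv_heads, head_dim, seq_len = 64, 8, 128, 10
group = n_q_heads // n_kv_heads            # 8 query heads share each KV head

k_cache = np.zeros((n_kv_heads, seq_len, head_dim))   # 8x smaller than MHA's cache
# At attention time, query head h reads cached KV head h // group;
# materializing that mapping looks like:
k_per_query = np.repeat(k_cache, group, axis=0)       # (64, seq_len, head_dim)
# MQA (next section) is simply the n_kv_heads = 1 extreme of this scheme.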

3. Multi-Query Attention (MQA)

All query heads share a single KV pair:

  • Maximum memory savings
  • Some quality trade-off

4. Sliding Window

Only cache the most recent tokens (ring-buffer sketch below):

  • Fixed memory regardless of total length
  • Used by Mistral models
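
A minimal ring-buffer sketch of the idea, assuming a 4096-token window (the size Mistral 7B uses); cache_token is a hypothetical helper:

import numpy as np

WINDOW, head_dim = 4096, 128
k_cache = np.zeros((WINDOW, head_dim))   # fixed-size buffers: memory never grows
v_cache = np.zeros((WINDOW, head_dim))

def cache_token(pos, k, v):
    # Token at absolute position pos lands in slot pos % WINDOW, overwriting
    # the token that slid out of the attention window WINDOW steps earlier.
    slot = pos % WINDOW
    k_cache[slot] = k
    v_cache[slot] = v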

KV Cache in Practice

Framework      KV Cache Strategy
vLLM           PagedAttention
TensorRT-LLM   Paged KV Cache
Ollama         Standard caching
llama.cpp      Ring buffer option

πŸ•ΈοΈ Knowledge Mesh