Mixture of Experts (MoE)

MoE is an architecture that uses conditional computation: only a fraction of the model's parameters run for each token, so total capacity can grow without a proportional increase in per-token compute.

How It Works

Instead of activating all parameters for every token, MoE routes each token to a small subset of “expert” networks (a minimal code sketch follows the list below):

y = Σ_i G(x)_i · E_i(x)
  • G(x): gating function (router) that assigns a weight to each expert for input x
  • E_i: the i-th expert network, typically a feed-forward block
  • Top-k: usually only the 1-2 highest-weighted experts are active per token
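
To make the routing concrete, here is a minimal sketch of a top-k MoE feed-forward layer in PyTorch. The class name, layer sizes, and the plain Python loop over experts are illustrative assumptions; production implementations use fused or sparse dispatch, but the math is the same weighted sum shown above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router G(x): one logit per expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Experts E_i: independent feed-forward blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                                     # (T, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                        # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                        # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: 4 tokens of width 16, 4 experts, 2 active per token.
layer = MoELayer(d_model=16, d_hidden=64, num_experts=4, top_k=2)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```

Each token touches only top_k of the experts, which is exactly why active parameters (and compute) stay far below the total parameter count.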

Benefits

Illustrative comparison between a 70B dense model and an example 8-expert MoE (a quick arithmetic check follows the table):

| Aspect | Dense Model | MoE Model |
| --- | --- | --- |
| Total params | 70B | 141B (8 experts) |
| Active params per token | 70B | ~17B |
| Active weights in VRAM (FP16) | 140GB | ~35GB |
| Speed | Baseline | Faster per token |
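
A quick back-of-envelope check of those memory figures, assuming FP16 weights (2 bytes per parameter); it also makes the "higher total VRAM" caveat in the trade-offs section concrete:

```python
BYTES_PER_PARAM = 2  # FP16

def weight_gb(params_billion: float) -> float:
    """Memory footprint of a parameter count at FP16, in GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM / 1e9

print(weight_gb(70))   # 140.0 -> dense: all 70B weights read for every token
print(weight_gb(17))   # 34.0  -> MoE: only the ~17B active weights read per token
print(weight_gb(141))  # 282.0 -> but all 141B weights must still be resident in memory
```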

Notable MoE Models

  • Mixtral 8x7B: 8 experts, 2 active per token, ~13B active of ~47B total parameters
  • DeepSeek-V2: 236B total parameters, 21B active per token
  • Qwen MoE: multiple variants, e.g. Qwen1.5-MoE-A2.7B (~14B total, 2.7B active)

Trade-offs

Pros:

  • More total capacity for the same per-token compute budget
  • Faster, cheaper inference per token than a dense model of equal total size

⚠️ Cons:

  • Training is more complex: experts must stay load-balanced, usually via auxiliary losses
  • Higher total VRAM, since every expert must be resident even though few are active per token
  • Router quality matters: poor routing collapses tokens onto a few experts and wastes capacity (a common mitigation is sketched below)
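
Because routing and load balance dominate MoE training difficulty, most implementations add an auxiliary loss that pushes tokens to spread evenly across experts. Below is a minimal sketch in the style of the Switch Transformer load-balancing loss; the function name, tensor shapes, and the omitted scaling coefficient are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss encouraging uniform expert usage (Switch-Transformer style).

    router_logits: (num_tokens, num_experts) raw router outputs.
    expert_index:  (num_tokens,) hard top-1 expert choice per token.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens actually dispatched to each expert (hard assignment).
    dispatch_frac = F.one_hot(expert_index, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to each expert (soft assignment).
    mean_prob = probs.mean(dim=0)
    # Minimized (value 1.0) when both distributions are uniform over the experts.
    return num_experts * torch.sum(dispatch_frac * mean_prob)

# Usage: uniform routing over 4 experts hits the minimum value of 1.0.
logits = torch.zeros(8, 4)                 # router is indifferent
index = torch.arange(8) % 4                # tokens spread evenly across experts
print(load_balancing_loss(logits, index))  # tensor(1.)
```

In practice this term is multiplied by a small coefficient and added to the main training loss, nudging the router toward balanced expert utilization without dictating which expert handles which token.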