Intermediate ⏱️ 4 min

🎓 What is AWQ?

Activation-aware Weight Quantization - an efficient 4-bit quantization method

What is AWQ?

AWQ (Activation-aware Weight Quantization) is a 4-bit quantization technique that preserves model accuracy by protecting salient weights based on activation patterns. It achieves better quality than naive quantization while enabling 4-bit inference.

How AWQ Works

AWQ's key insight: only ~1% of weights are critical for accuracy. The steps below (and the sketch after them) show how those weights are protected.

  1. Identify Salient Weights: Analyze activations to find important weights
  2. Protect via Scaling: Scale salient channels to reduce quantization error
  3. Quantize: Apply 4-bit quantization to all weights
  4. Absorb Scales: Merge scaling into adjacent layers
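
To make these steps concrete, here is a minimal NumPy sketch of the activation-aware scaling idea. It is not the real AutoAWQ implementation: the alpha exponent, group size, and helper names are illustrative, and real AWQ searches the scaling strength per layer on calibration data.

import numpy as np

def fake_quantize_4bit(w, group_size=64):
    """Round a weight row to 4-bit levels per group (assumes len(w) % group_size == 0)."""
    w = w.reshape(-1, group_size)
    w_min, w_max = w.min(axis=1, keepdims=True), w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0 + 1e-12            # 16 levels -> 15 steps
    q = np.clip(np.round((w - w_min) / scale), 0, 15)
    return (q * scale + w_min).reshape(-1)            # dequantized ("fake quant") weights

def awq_scale_and_quantize(W, X_calib, alpha=0.5):
    # 1. Identify salient input channels from calibration activations
    importance = np.abs(X_calib).mean(axis=0)         # one value per input channel
    # 2. Protect them: scale weight columns up by s (activations get divided by s)
    s = importance ** alpha
    s = s / s.mean()
    W_scaled = W * s                                  # W is (out_features, in_features)
    # 3. Quantize the scaled weights to 4 bits
    W_q = np.stack([fake_quantize_4bit(row) for row in W_scaled])
    # 4. Return s so the caller can absorb 1/s into the previous layer (X / s at runtime)
    return W_q, s

# y ≈ (X / s) @ W_q.T approximates y = X @ W.T, with less error on the salient channels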

AWQ vs Other Methods

| Method  | Bits | Quality  | Speed | Calibration |
|---------|------|----------|-------|-------------|
| FP16    | 16   | Baseline | 1x    | None        |
| GPTQ    | 4    | Good     | 3-4x  | Required    |
| AWQ     | 4    | Better   | 3-4x  | Faster      |
| GGUF Q4 | 4    | Good     | 2-3x  | None        |

Advantages of AWQ

  1. Better Quality: Outperforms GPTQ at the same bit-width
  2. Faster Calibration: Minutes vs hours for GPTQ
  3. Efficient Kernels: Optimized CUDA implementations
  4. Hardware Friendly: Works well with tensor cores

When to Use AWQ

  • ✅ GPU inference requiring speed
  • ✅ Memory-constrained deployment
  • ✅ When you need better quality than GPTQ at the same bit-width
  • ❌ CPU inference (use GGUF instead)
  • ❌ Very small models (overhead not worth it)

AWQ in Practice

from awq import AutoAWQForCausalLM

# Load a pre-quantized AWQ checkpoint from the Hub;
# fuse_layers=True fuses attention/MLP modules into faster kernels
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True
)
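
Once loaded, the model generates like any Hugging Face causal LM. A minimal usage sketch (the prompt, token budget, and the assumption that the model sits on a CUDA device are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")
inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to("cuda")

# AutoAWQ wraps the underlying HF model, so generate() accepts the usual kwargs
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))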

Many HuggingFace models are available in AWQ format:

  • TheBloke's AWQ collection
  • Official vendor AWQ releases
  • Community quantizations
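
Recent transformers releases can also load these checkpoints directly, as long as the autoawq package is installed; the AWQ quantization config stored in the repo is picked up automatically. A minimal sketch under that assumption:

from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantization_config in the repo tells transformers to use the AWQ kernels
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-AWQ",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")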

Memory Savings

| Model | FP16   | AWQ 4-bit | Savings |
|-------|--------|-----------|---------|
| 7B    | 14 GB  | 4 GB      | 71%     |
| 13B   | 26 GB  | 8 GB      | 69%     |
| 70B   | 140 GB | 40 GB     | 71%     |
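
These figures match a rough back-of-envelope estimate of weight memory alone. The helper below is hypothetical: it assumes packed 4-bit weights plus one FP16 scale and one 4-bit zero-point per group of 128, and it ignores KV cache and activations.

def estimate_awq_gb(n_params, w_bit=4, group_size=128):
    weight_bytes = n_params * w_bit / 8                 # packed 4-bit weights
    overhead_bytes = n_params / group_size * (2 + 0.5)  # FP16 scale + 4-bit zero per group
    return (weight_bytes + overhead_bytes) / 1e9

for n_params in (7e9, 13e9, 70e9):
    fp16_gb = n_params * 2 / 1e9                        # 2 bytes per FP16 weight
    print(f"{n_params / 1e9:.0f}B: FP16 ~{fp16_gb:.0f} GB, AWQ 4-bit ~{estimate_awq_gb(n_params):.1f} GB")

Real checkpoints come out a little larger than this estimate (hence the table's 4/8/40 GB), mostly because embeddings, norms, and usually the output head stay in FP16.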

🕸️ Knowledge Mesh

  • Quantization - General quantization overview
  • GGUF - Alternative quantization format
  • VRAM - GPU memory requirements