What is AWQ?
AWQ (Activation-aware Weight Quantization) is a 4-bit weight-only quantization technique that preserves model accuracy by protecting the small set of salient weights identified from activation patterns. It achieves better quality than naive round-to-nearest quantization while enabling fast 4-bit inference.
How AWQ Works
AWQ's key insight: only ~1% of weights are critical for accuracy, and they are best identified from activation magnitudes rather than from the weights themselves. The method has four steps (a minimal code sketch follows the list):
- Identify Salient Weights: Analyze activations to find important weights
- Protect via Scaling: Scale salient channels to reduce quantization error
- Quantize: Apply 4-bit quantization to all weights
- Absorb Scales: Merge scaling into adjacent layers
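To make the four steps concrete, here is a toy NumPy sketch of the scaling idea. It is illustrative only, under simplifying assumptions: plain round-to-nearest INT4 with per-group scales, mean absolute activation as the salience measure, and a small grid search over the exponent `alpha`; the function names and synthetic data are made up for this example, and real AWQ implementations keep weights packed in INT4 and run fused CUDA kernels.

```python
# Toy NumPy sketch of activation-aware scaling (illustrative, not the real AWQ kernels).
import numpy as np

def quantize_int4(w, group_size=128):
    """Round-to-nearest 4-bit quantization with one scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-8  # INT4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7)
    return (q * scale).reshape(w.shape)             # dequantized, for error comparison

def awq_style_scales(W, X, group_size=128, n_grid=20):
    """Grid-search the exponent alpha; return per-channel scales minimizing output error.

    W: weights (out_features, in_features); X: calibration activations (n_samples, in_features).
    """
    act_magnitude = np.abs(X).mean(axis=0)          # step 1: salience = mean |activation|
    reference = X @ W.T                             # full-precision layer output
    best_err, best_s = np.inf, np.ones(W.shape[1])
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_magnitude ** alpha                  # step 2: protect salient channels
        s /= np.sqrt(s.max() * s.min())             # keep scales centered around 1
        W_q = quantize_int4(W * s, group_size)      # step 3: quantize the scaled weights
        err = np.mean((X @ (W_q / s).T - reference) ** 2)
        if err < best_err:
            best_err, best_s = err, s
    # step 4: at inference the scales are absorbed, since (W_q / s) @ x == W_q @ (x / s)
    return best_s

# Synthetic layer: a handful of input channels carry much larger activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))
channel_magnitude = np.ones(512)
channel_magnitude[rng.choice(512, size=8, replace=False)] = 25.0   # ~1-2% salient channels
X = rng.normal(size=(64, 512)) * channel_magnitude

s = awq_style_scales(W, X)
err_rtn = np.mean((X @ quantize_int4(W).T - X @ W.T) ** 2)
err_awq = np.mean((X @ (quantize_int4(W * s) / s).T - X @ W.T) ** 2)
print(f"round-to-nearest MSE: {err_rtn:.3f}, activation-aware MSE: {err_awq:.3f}")
```

The intuition: a channel's contribution to output error is roughly its activation magnitude times its weight error, so boosting salient columns before quantization and dividing the scales back out afterwards concentrates precision where it matters.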
AWQ vs Other Methods
| Method | Bits | Quality | Speed | Calibration |
|---|---|---|---|---|
| FP16 | 16 | Baseline | 1x | None |
| GPTQ | 4 | Good | 3-4x | Required (slower) |
| AWQ | 4 | Better | 3-4x | Required (fast) |
| GGUF Q4 | 4 | Good | 2-3x | None |
Advantages of AWQ
- Better Quality: Outperforms GPTQ at the same bit-width
- Faster Calibration: Minutes vs hours for GPTQ
- Efficient Kernels: Optimized CUDA implementations
- Hardware Friendly: Works well with tensor cores
When to Use AWQ
- ✅ GPU inference requiring speed
- ✅ Memory-constrained deployment
- ✅ When you need better quality than GPTQ at the same bit-width
- ❌ CPU inference (use GGUF instead)
- ❌ Very small models (overhead not worth it)
AWQ in Practice
```python
from awq import AutoAWQForCausalLM

# Load a pre-quantized 4-bit AWQ checkpoint from the Hugging Face Hub;
# fuse_layers=True enables the fused attention/MLP modules for faster inference.
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True
)
```
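A hedged usage sketch follows; the prompt and generation arguments are placeholders, and it assumes a recent `autoawq` release where the quantized model exposes a `generate()` wrapper around the underlying `transformers` model.

```python
from transformers import AutoTokenizer

# The tokenizer ships in the same repository as the AWQ checkpoint.
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")

inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=64)  # assumes `model` was loaded as above
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```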
Popular AWQ Models
Many HuggingFace models are available in AWQ format:
- TheBloke's AWQ collection
- Official vendor AWQ releases
- Community quantizations
Memory Savings
| Model | FP16 | AWQ 4-bit | Savings |
|---|---|---|---|
| 7B | 14 GB | 4 GB | 71% |
| 13B | 26 GB | 8 GB | 69% |
| 70B | 140 GB | 40 GB | 71% |
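These figures can be sanity-checked with a back-of-the-envelope estimate: 4-bit weights take half a byte per parameter, plus a modest overhead for per-group scales and zero points. The `overhead` factor below is an assumption, and the estimate covers weight storage only (no KV cache or activation memory).

```python
def weight_memory_gb(params_billion: float, bits: int, overhead: float = 1.1) -> float:
    """Approximate weight storage in GB: parameters x bytes-per-parameter x overhead."""
    return params_billion * 1e9 * (bits / 8) * overhead / 1e9

for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16, overhead=1.0)
    awq4 = weight_memory_gb(size, 4)
    print(f"{size}B parameters: FP16 ~{fp16:.0f} GB, AWQ 4-bit ~{awq4:.0f} GB")
```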
Related Concepts
- Quantization - General quantization overview
- GGUF - Alternative quantization format
- VRAM - GPU memory requirements