[!IMPORTANT]
Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ quantizes weights using a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant quantizes the KV cache using a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
Gemma-4-26B-A4B-it-HLWQ-Q5
25.2B MoE (3.8B active) + Vision: PQ5-quantized weights, including all MoE experts.
Download: 26.9 GB (vs 51.6 GB BF16 original)
| Metric | Value |
|---|---|
| Download | 26.9 GB (1.9x smaller) |
| Quantized | 427 linear + 7,680 MoE experts |
| Architecture | 30 layers, 128 experts (top-8) |
| Vision | Supported (Image+Text → Text) |
| Routers | FP16 (exact expert selection) |
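The naming notice describes HLWQ as a deterministic Walsh-Hadamard rotation followed by a Lloyd-Max scalar codebook. The NumPy sketch below illustrates that general idea on a random matrix. It is not the implementation used to produce these weights: the function names (`fwht`, `lloyd_max`, `quantize`, `dequantize`), the single per-tensor scale, the rotation along the last axis, and the codebook initialization are all assumptions made for illustration.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.
    The last dimension must be a power of two."""
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # Butterfly step: pair element j with element j + h in each 2h block.
        y = x.reshape(*lead, n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        x = np.stack([a, b], axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)  # scaling makes the transform its own inverse

def lloyd_max(samples, bits, iters=50):
    """1-D Lloyd-Max codebook: alternate nearest-level assignment and mean update."""
    levels = 2 ** bits
    # Assumed initialization: evenly spaced quantiles of the data.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2   # decision thresholds
        idx = np.searchsorted(edges, samples)          # nearest-level index
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids

def quantize(w, bits=5):
    rotated = fwht(w)                      # deterministic rotation of each row
    scale = float(rotated.std())
    if scale == 0.0:
        scale = 1.0
    normed = rotated / scale
    codebook = lloyd_max(normed.ravel(), bits)
    edges = (codebook[:-1] + codebook[1:]) / 2
    codes = np.searchsorted(edges, normed).astype(np.uint8)
    return codes, codebook, scale

def dequantize(codes, codebook, scale):
    # Look up each code, undo the scale, then undo the rotation.
    return fwht(codebook[codes] * scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256))
codes, codebook, scale = quantize(w, bits=5)
w_hat = dequantize(codes, codebook, scale)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"levels={codebook.size}, relative error={rel_err:.4f}")
```

Because the normalized Hadamard matrix is orthogonal, the reconstruction error in the rotated domain equals the error in the original weight domain, so the codebook can be fitted entirely on the rotated values.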
Quick Start
Expert Offloading (8.6 GB GPU, best for consumer GPUs)
```bash
# vLLM MoE expert cache is in PR #37190 (still open). Use the fork until it lands:
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
```

```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-26B-A4B-it-HLWQ-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```
With KV Cache Compression (5.3x more context)
```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-26B-A4B-it-HLWQ-Q5", kv_cache_nbits=3)
# The KV cache now uses 5.3x less memory, so longer conversations fit.
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
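One plausible reading of the 5.3x figure is the ratio of a 16-bit baseline cache to a 3-bit cache, since 16 / 3 ≈ 5.33. The sketch below works through that arithmetic. Only the layer count (30) comes from this card; the KV head count, head dimension, and context length are hypothetical placeholders, and the estimate ignores any per-block scales or codebooks a real quantized cache would also store.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Approximate KV cache size: one key and one value tensor per layer."""
    values = 2 * tokens * layers * kv_heads * head_dim
    return values * bits / 8

# Hypothetical shape: kv_heads, head_dim, and tokens are illustrative, not taken from the model config.
shape = dict(tokens=8192, layers=30, kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(bits=16, **shape)   # 16-bit cache
compressed = kv_cache_bytes(bits=3, **shape)  # kv_cache_nbits=3
ratio = baseline / compressed

print(f"16-bit: {baseline / 2**30:.2f} GiB, 3-bit: {compressed / 2**30:.2f} GiB, ratio: {ratio:.2f}x")
```

The ratio depends only on the bit widths, so it stays at about 5.33x for any model shape under these assumptions; storage overhead for quantization metadata would bring the practical figure slightly lower.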