Model

Gemma 4 26b A4b It Hlwq Q5

by caiovicentino1 hf-model--caiovicentino1--gemma-4-26b-a4b-it-hlwq-q5
Nexus Index
39.7 Top 100%
S: Semantic 50
A: Authority 0
P: Popularity 16
R: Recency 97
Q: Quality 65
Tech Context
26B Params
4,096 Ctx
Vital Performance
316 DL / 30D
0.0%
Audited 39.7 FNI Score
24 GB GPU, ~21 GB Est. VRAM
Commercial APACHE License
Model Information Summary
Entity Passport
Registry ID hf-model--caiovicentino1--gemma-4-26b-a4b-it-hlwq-q5
License Apache-2.0
Provider huggingface

Compute Threshold

~20.8GB VRAM


* Static estimation for 4-Bit Quantization.


Cite this model

Academic & Research Attribution

BibTeX
@misc{hf_model__caiovicentino1__gemma_4_26b_a4b_it_hlwq_q5,
  author = {caiovicentino1},
  title = {Gemma 4 26b A4b It Hlwq Q5 Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/caiovicentino1/gemma-4-26b-a4b-it-hlwq-q5}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
caiovicentino1. (2026). Gemma 4 26b A4b It Hlwq Q5 [Model]. Free2AITools. https://huggingface.co/caiovicentino1/gemma-4-26b-a4b-it-hlwq-q5


Quick Commands

Ollama Run
ollama run gemma-4-26b-a4b-it-hlwq-q5
HF Download
huggingface-cli download caiovicentino1/gemma-4-26b-a4b-it-hlwq-q5

Nexus Index V2.0


Index Insight

FNI V2.0 for Gemma 4 26b A4b It Hlwq Q5: Semantic (S:50), Authority (A:0), Popularity (P:16), Recency (R:97), Quality (Q:65).


Technical Deep Dive

[!IMPORTANT] Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

Gemma-4-26B-A4B-it-HLWQ-Q5

25.2B MoE (3.8B active) + Vision: PQ5-quantized weights, including all MoE experts.

Download: 26.9 GB (vs 51.6 GB BF16 original)

| Metric | Value |
| --- | --- |
| Download | 26.9 GB (1.9x smaller) |
| Quantized | 427 linear + 7,680 MoE experts |
| Architecture | 30 layers, 128 experts (top-8) |
| Vision | ✅ Image+Text → Text |
| Routers | FP16 (exact expert selection) |
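
The size figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes decimal gigabytes (1 GB = 10^9 bytes) and reuses the 25.2B parameter count from the summary line; the variable names are illustrative and not part of any tool.

```python
# Figures quoted in the model card; GB is assumed to mean 10**9 bytes.
BF16_GB = 51.6
QUANT_GB = 26.9
TOTAL_PARAMS = 25.2e9  # "25.2B MoE" from the summary line

compression = BF16_GB / QUANT_GB
bits_per_param = QUANT_GB * 1e9 * 8 / TOTAL_PARAMS
quantized_tensors = 427 + 7_680  # linear layers + per-expert matrices

print(f"compression: {compression:.1f}x")
print(f"average storage: {bits_per_param:.1f} bits per parameter")
print(f"quantized tensors: {quantized_tensors}")
```

The average comes out above 5 bits per parameter. One possible reading is that the 5-bit codes are stored one per int8 byte alongside norms and unquantized tensors, but the card does not give this breakdown, so treat that as a guess.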

Charts (images not reproduced): Download Size, Quantization Coverage, Gemma Family

Quick Start

Expert Offloading (8.6 GB GPU, best for consumer GPUs)

```bash
# vLLM MoE expert cache is in PR #37190 (still open). Use the fork until it lands:
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
```

```python
from vllm import LLM, SamplingParams

# moe_expert_cache_size is the expert-cache option from the fork installed above.
llm = LLM('google/gemma-4-26B-A4B-it', dtype='bfloat16',
          moe_expert_cache_size=8, enforce_eager=True,
          kernel_config={'moe_backend': 'triton'})
```

Streaming Loader (PQ5 dequant + INT4)

See POLARQUANT_GEMMA4_26B_A4B_VISION.ipynb

GPU Support

| GPU | Method | VRAM |
| --- | --- | --- |
| T4 (16 GB) | Expert offloading | 8.6 GB |
| RTX 4090 (24 GB) | Expert offloading | 8.6 GB |
| A100 (80 GB) | Full load + PQ5 dequant | ~50 GB |

Notebooks

| Notebook | Description |
| --- | --- |
| MoE Quantize | PQ5 quantize all experts + save codes |
| Vision Inference | Multimodal streaming loader |
| Expert Offload | vLLM fork, 14.8 tok/s |

Technical Details

  • MoE experts: 3D nn.Parameter (128, out, in); each expert is quantized independently
  • gate_up_proj: (128, 1408, 2816) per layer
  • down_proj: (128, 2816, 704) per layer
  • 128 experts × 2 params × 30 layers = 7,680 expert quantizations
  • Quantization time: 50 seconds on A100
  • PQ5 codes: int8 + fp16 norms + Hadamard rotation + Lloyd-Max centroids
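
The storage layout in the last bullet (int8 codes, fp16 norms, Hadamard rotation, a centroid table) can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration rather than the repository's implementation: the function names `hadamard`, `encode`, and `decode`, the block size of 128, the one-norm-per-block layout, and the toy matrix shape. The codebook is a stand-in built from Gaussian quantiles, not fitted Lloyd-Max centroids.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix via the Sylvester construction (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def encode(weights, codebook, block=128):
    """Return int8 codes and one fp16 norm per block of `block` weights (assumed layout)."""
    blocks = weights.reshape(-1, block).astype(np.float32)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    unit = blocks / np.maximum(norms, 1e-12)
    # Rotate, then rescale so entries have roughly unit variance.
    rotated = (unit @ hadamard(block)) * np.sqrt(block)
    edges = (codebook[:-1] + codebook[1:]) / 2  # nearest-centroid boundaries
    codes = np.searchsorted(edges, rotated).astype(np.int8)
    return codes, norms.astype(np.float16)

def decode(codes, norms, codebook, shape, block=128):
    """Look up centroids, undo the rotation, and restore the per-block norm."""
    rotated = codebook[codes] / np.sqrt(block)
    blocks = (rotated @ hadamard(block).T) * norms.astype(np.float32)
    return blocks.reshape(shape)

rng = np.random.default_rng(0)
# Stand-in 32-level (5-bit) codebook: Gaussian quantiles, NOT fitted Lloyd-Max centroids.
codebook = np.quantile(rng.standard_normal(200_000), (np.arange(32) + 0.5) / 32)

expert = rng.standard_normal((704, 256)).astype(np.float32)  # toy matrix, not a real expert
codes, norms = encode(expert, codebook)
restored = decode(codes, norms, codebook, expert.shape)
cos_sim = float(expert.ravel() @ restored.ravel()
                / (np.linalg.norm(expert) * np.linalg.norm(restored)))
print(codes.dtype, norms.dtype, f"cos_sim={cos_sim:.4f}")
print(128 * 2 * 30)  # expert quantizations: experts x params x layers
```

Because the toy matrix is already Gaussian, the rotation changes little here; on real weights its role would be to spread outliers across each block before the codebook lookup.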

Citation

```bibtex
@article{polarquant2025,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```

Paper · GitHub · pip install polarquant


Quick Start

Install

```bash
pip install git+https://github.com/caiovicentino/polarengine-vllm.git
```

Load & Generate (1 line!)

```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-26B-A4B-it-HLWQ-Q5")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

With KV Cache Compression (5.3x more context)

```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-26B-A4B-it-HLWQ-Q5", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory, so longer conversations fit.
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
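
The 5.3x figure matches the ratio between a 16-bit and a 3-bit cache entry (16 / 3 ≈ 5.33). The sketch below works that out under stated assumptions: the baseline cache is 16-bit, every layer caches the full context, and per-block scales or other metadata are ignored. The layer count (30) and context length (4,096) come from this page; `kv_heads` and `head_dim` are placeholder values, and `kv_cache_bytes` is a hypothetical helper, not part of `polarengine_vllm`.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int, bits: int) -> float:
    """Bytes for K and V across all layers, ignoring scales and other metadata."""
    return 2 * tokens * layers * kv_heads * head_dim * bits / 8

# layers=30 and tokens=4096 come from the page; kv_heads and head_dim are placeholders.
shape = dict(tokens=4096, layers=30, kv_heads=8, head_dim=128)
baseline = kv_cache_bytes(bits=16, **shape)
compressed = kv_cache_bytes(bits=3, **shape)
ratio = baseline / compressed

print(f"{baseline / 2**20:.0f} MiB -> {compressed / 2**20:.0f} MiB ({ratio:.1f}x)")
```

The ratio depends only on the bit widths, so the placeholder head counts affect the absolute sizes but not the 5.3x result.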

Benchmark

```bash
polarquant bench caiovicentino1/Gemma-4-26B-A4B-it-HLWQ-Q5 --ppl --chart
```

Gradio Demo

```bash
polarquant demo caiovicentino1/Gemma-4-26B-A4B-it-HLWQ-Q5 --share
```

Method: HLWQ

Hadamard Rotation + Lloyd-Max Optimal Centroids

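Unlike GGUF's uniformly spaced quantization levels, HLWQ places quantization levels where weight density is highest. For a fixed number of levels, a Lloyd-Max codebook minimizes mean squared error for the distribution it is fitted to, here Gaussian-distributed neural network weights.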

```text
HLWQ Q5 (cos_sim > 0.996) > GGUF Q5_K_M (~0.99) at same size
```
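
To make the density argument concrete, here is a minimal sketch that fits a 32-level Lloyd-Max codebook to synthetic Gaussian samples and compares it with an evenly spaced grid over the same range. It is a toy experiment, not the author's benchmark: the helper names (`nearest`, `lloyd_max`, `cos_sim`), the sample count, and the iteration count are assumptions, and the plain uniform grid is only a rough stand-in for GGUF's block-wise formats, so the numbers it prints are not the figures quoted above.

```python
import numpy as np

def nearest(codebook, x):
    """Map each value in x to its nearest entry of a sorted codebook."""
    edges = (codebook[:-1] + codebook[1:]) / 2
    return codebook[np.searchsorted(edges, x)]

def lloyd_max(samples, init, iters=300):
    """Alternate nearest-centroid assignment and cell-mean centroid updates."""
    centroids = np.array(init, dtype=np.float64)  # copy, so `init` is untouched
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2
        cell = np.searchsorted(edges, samples)
        counts = np.bincount(cell, minlength=centroids.size)
        sums = np.bincount(cell, weights=samples, minlength=centroids.size)
        filled = counts > 0  # leave empty cells where they are
        centroids[filled] = sums[filled] / counts[filled]
    return centroids

def cos_sim(a, b):
    """Cosine similarity between two 1-D arrays."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
weights = rng.standard_normal(50_000)  # synthetic Gaussian "weights"

limit = np.abs(weights).max()
uniform = np.linspace(-limit, limit, 32)   # 5-bit evenly spaced grid
fitted = lloyd_max(weights, init=uniform)  # 5-bit codebook refined by Lloyd-Max

for name, book in [("uniform", uniform), ("lloyd-max", fitted)]:
    q = nearest(book, weights)
    mse = np.mean((weights - q) ** 2)
    print(f"{name:9s} mse={mse:.5f} cos_sim={cos_sim(weights, q):.4f}")
```

Starting the iteration from the uniform grid keeps the comparison fair: each Lloyd-Max step can only lower the mean squared error on the fitting samples, so the fitted codebook is never worse than its starting point.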

Incomplete Data

Some information about this model is not available. Use with caution and verify details against the original source before relying on this data.

Limitations & Considerations

  • Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • FNI scores are relative rankings and may change as new models are added.
  • License listed as Apache-2.0: verify licensing terms before commercial use.


AI Summary: Based on Hugging Face metadata. Not a recommendation.


Model Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

Identity & Source

id: hf-model--caiovicentino1--gemma-4-26b-a4b-it-hlwq-q5
slug: caiovicentino1--gemma-4-26b-a4b-it-hlwq-q5
source: huggingface
author: caiovicentino1
license: Apache-2.0
tags: safetensors, gemma4, hlwq, moe, quantized, vision, multimodal, image-text-to-text, conversational, arxiv:2502.02617, arxiv:2603.29078, base_model:google/gemma-4-26b-a4b-it, base_model:quantized:google/gemma-4-26b-a4b-it, license:apache-2.0, 8-bit, polarengine, region:us

Technical Specs

architecture: null
params billions: 26
context length: 4,096
pipeline tag: image-text-to-text
vram gb: 20.8
vram is estimated: true
vram formula: VRAM ≈ (params * 0.75) + 0.8 GB (KV) + 0.5 GB (OS)
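
The listed formula reproduces the 20.8 GB estimate when `params` is read as billions of parameters. A minimal worked example follows; the function name and keyword names are illustrative and not part of any tool.

```python
def estimate_vram_gb(params_billions: float,
                     gb_per_billion: float = 0.75,
                     kv_gb: float = 0.8,
                     os_gb: float = 0.5) -> float:
    """Static estimate from the listed formula: weights + KV cache + OS overhead."""
    return params_billions * gb_per_billion + kv_gb + os_gb

estimate = estimate_vram_gb(26)
print(f"{estimate:.1f} GB")
```

This is the page's static estimate only; the GPU Support table earlier on the page reports 8.6 GB with expert offloading and about 50 GB for a full load with PQ5 dequantization.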

Engagement & Metrics

downloads: 316
stars: 0
forks: 0

Data indexed from public sources. Updated daily.