Huihui Gemma 4 26b A4b It Abliterated Nvfp4
| Entity Passport | |
| Registry ID | hf-model--sakamakismile--huihui-gemma-4-26b-a4b-it-abliterated-nvfp4 |
| License | Apache-2.0 |
| Provider | huggingface |
Compute Threshold
~20.8GB VRAM
* Static estimation for 4-Bit Quantization.
Cite this model
Academic & Research Attribution
@misc{hf_model__sakamakismile__huihui_gemma_4_26b_a4b_it_abliterated_nvfp4,
author = {sakamakismile},
title = {Huihui Gemma 4 26b A4b It Abliterated Nvfp4 Model},
year = {2026},
howpublished = {\url{https://huggingface.co/sakamakismile/huihui-gemma-4-26b-a4b-it-abliterated-nvfp4}},
note = {Accessed via Free2AITools Knowledge Fortress}
} đŦTechnical Deep Dive
Full Specifications [+]âž
Quick Commands
ollama run huihui-gemma-4-26b-a4b-it-abliterated-nvfp4 huggingface-cli download sakamakismile/huihui-gemma-4-26b-a4b-it-abliterated-nvfp4 pip install -U transformers âī¸ Nexus Index V2.0
đŦ Index Insight
FNI V2.0 for Huihui Gemma 4 26b A4b It Abliterated Nvfp4: Semantic (S:50), Authority (A:0), Popularity (P:5), Recency (R:98), Quality (Q:65).
Verification Authority
đ What's Next?
Technical Deep Dive
Huihui-gemma-4-26B-A4B-it-abliterated-NVFP4
NVFP4 quantized version of huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated â an abliterated (uncensored) variant of Google's Gemma 4 26B-A4B Mixture-of-Experts model.
49 GB â 16.5 GB with minimal quality loss. Runs on a single NVIDIA Blackwell GPU.
Key Specs
| Base model | huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated |
| Architecture | Gemma 4 MoE â 25.2B total, 3.8B active per token |
| Quantization | NVFP4 (W4A4 â weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor (main) |
| Size | 16.5 GB (single safetensors shard) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM nightly (cu130) |
Quickstart
vLLM (recommended)
vllm serve Lna-Lab/Huihui-gemma-4-26B-A4B-it-abliterated-NVFP4 \
--max-model-len 8192
No --quantization flag needed â vLLM auto-detects compressed-tensors format.
Docker
docker run --gpus '"device=0"' -p 8016:8016 \
-v /path/to/model:/models/current:ro \
--shm-size 16gb \
vllm/vllm-openai:cu130-nightly \
vllm serve /models/current --port 8016 --max-model-len 8192
Python (vLLM)
from vllm import LLM, SamplingParams
llm = LLM(
model="Lna-Lab/Huihui-gemma-4-26B-A4B-it-abliterated-NVFP4",
max_model_len=8192,
gpu_memory_utilization=0.85,
)
output = llm.generate(
["Explain quantum entanglement in simple terms."],
SamplingParams(max_tokens=256, temperature=0.7),
)
print(output[0].outputs[0].text)
Benchmark
Tested on a single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), CUDA Graph PIECEWISE mode.
| Test | Tokens | Speed | Result |
|---|---|---|---|
| English reasoning | 46 | 137 tok/s | PASS |
| Code generation | 64 | 125 tok/s | PASS |
| Long-form output | 354 | 160 tok/s | PASS |
Sustained throughput: ~160 tok/s (post-warmup, single GPU).
Quantization Details
Recipe
default_stage:
default_modifiers:
QuantizationModifier:
targets: [Linear]
ignore: [lm_head, 're:.*embed.*', 're:.*router', 're:.*vision_tower.*']
scheme: NVFP4
What's quantized, what's not
- Quantized (NVFP4): All
Linearlayers in the language model, including MoE expert layers - Kept in BF16:
lm_head, all embedding layers, MoE routers, entire vision tower
Calibration
- Dataset: neuralmagic/calibration (LLM split)
- Samples: 20
- Max sequence length: 8192
- MoE expert calibration handled automatically by llm-compressor's
SequentialGemma4TextExperts
Reproduction
from datasets import load_dataset
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
model = Gemma4ForConditionalGeneration.from_pretrained(
"huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated", dtype="auto"
)
processor = AutoProcessor.from_pretrained(
"huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated"
)
recipe = QuantizationModifier(
targets="Linear",
scheme="NVFP4",
ignore=["lm_head", "re:.*embed.*", "re:.*router", "re:.*vision_tower.*"],
)
ds = load_dataset("neuralmagic/calibration", name="LLM", split="train[:20]")
def preprocess_function(example):
messages = [
{"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
for m in example["messages"]
]
return processor.apply_chat_template(
messages, return_tensors="pt", padding=False, truncation=True,
max_length=8192, tokenize=True, add_special_tokens=False,
return_dict=True, add_generation_prompt=False,
)
ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)
import torch
def data_collator(batch):
assert len(batch) == 1
return {
key: (torch.tensor(value) if key != "pixel_values"
else torch.tensor(value, dtype=torch.bfloat16).squeeze(0))
for key, value in batch[0].items()
}
oneshot(
model=model, recipe=recipe, dataset=ds,
max_seq_length=8192, num_calibration_samples=20,
data_collator=data_collator,
)
model.save_pretrained("output-NVFP4", save_compressed=True)
processor.save_pretrained("output-NVFP4")
Environment
| Package | Version |
|---|---|
| torch | 2.11.0+cu130 |
| transformers | 5.5.4 |
| llmcompressor | 0.1.dev (main @ 3084520) |
| compressed-tensors | 0.15.1a20260414 |
| safetensors | 0.7.0 |
| CUDA | 13.0 |
Requirements
- GPU: NVIDIA Blackwell (RTX 5090, RTX PRO 6000, B200, etc.) â NVFP4 requires SM 120
- VRAM: ~16 GB minimum
- Software: vLLM nightly (cu130 build), or any framework supporting
compressed-tensorsNVFP4
Notes
- This is an abliterated (uncensored) model. The base model has had safety training removed. Use responsibly.
- Vision tower is kept in BF16 â multimodal capabilities are preserved at full precision.
- NVFP4 is a Blackwell-specific format. This checkpoint will not work on Ampere/Hopper GPUs.
Credits
- Base model: huihui-ai (abliteration)
- Original model: Google DeepMind (Gemma 4)
- Quantization tool: vllm-project/llm-compressor
- Quantization method follows RedHatAI/gemma-4-26B-A4B-it-NVFP4 (proven path)
Support the Base Model Author
If you find this model useful, please consider supporting huihui-ai â the creator of the abliterated base model:
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin:
bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
â ī¸ Incomplete Data
Some information about this model is not available. Use with Caution - Verify details from the original source before relying on this data.
View Original Source âđ Limitations & Considerations
- âĸ Benchmark scores may vary based on evaluation methodology and hardware configuration.
- âĸ VRAM requirements are estimates; actual usage depends on quantization and batch size.
- âĸ FNI scores are relative rankings and may change as new models are added.
- â License Unknown: Verify licensing terms before commercial use.
Social Proof
AI Summary: Based on Hugging Face metadata. Not a recommendation.
đĄī¸ Model Transparency Report
Technical metadata sourced from upstream repositories.
đ Identity & Source
- id
- hf-model--sakamakismile--huihui-gemma-4-26b-a4b-it-abliterated-nvfp4
- slug
- sakamakismile--huihui-gemma-4-26b-a4b-it-abliterated-nvfp4
- source
- huggingface
- author
- sakamakismile
- license
- Apache-2.0
- tags
- transformers, safetensors, gemma4, image-text-to-text, nvfp4, quantized, abliterated, vllm, compressed-tensors, blackwell, moe, text-generation, conversational, license:apache-2.0, endpoints_compatible, 8-bit, region:us
âī¸ Technical Specs
- architecture
- null
- params billions
- 26
- context length
- 4,096
- pipeline tag
- text-generation
- vram gb
- 20.8
- vram is estimated
- true
- vram formula
- VRAM â (params * 0.75) + 0.8GB (KV) + 0.5GB (OS)
đ Engagement & Metrics
- downloads
- 48
- stars
- 0
- forks
- 0
Data indexed from public sources. Updated daily.