Nemotron 3 Super 120b A12b Jang 2l
| Entity Passport | |
|---|---|
| Registry ID | hf-model--jangq-ai--nemotron-3-super-120b-a12b-jang_2l |
| License | Other |
| Provider | huggingface |
Compute Threshold
~92.5GB VRAM
* Static estimation for 4-Bit Quantization. [Multi-GPU / Unified Memory Required]
Cite this model
Academic & Research Attribution
@misc{hf_model__jangq_ai__nemotron_3_super_120b_a12b_jang_2l,
  author = {Jangq Ai},
  title = {Nemotron 3 Super 120b A12b Jang 2l Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/jangq-ai/nemotron-3-super-120b-a12b-jang_2l}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
Technical Deep Dive
Quick Commands
huggingface-cli download jangq-ai/nemotron-3-super-120b-a12b-jang_2l
Nexus Index V2.0
Index Insight
FNI V2.0 for Nemotron 3 Super 120b A12b Jang 2l: Semantic (S:50), Authority (A:0), Popularity (P:27), Recency (R:98), Quality (Q:65).
Technical Deep Dive
MLX Studio: the only app that natively supports JANG models with reasoning
Nemotron-3-Super-120B on a 64 GB Mac. The first working Nemotron-H quantization for Apple Silicon. 43 GB, 52 tok/s, 86% MMLU with reasoning. Hybrid Mamba-2 SSM + Latent MoE + Attention architecture.
LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or pip install "jang[mlx]>=2.1.5".
Nemotron-3-Super-120B-A12B: JANG_2L (2.8-bit, 8-bit attention), Reasoning
JANG: Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX
JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.
Key Features
- 86.0% MMLU (200 questions, reasoning mode)
- 51.6 tok/s generation, 136 tok/s prefill
- 43 GB on disk, 42.4 GB GPU RAM
- Reasoning mode: <think>...</think> step-by-step problem solving
- Hybrid architecture: 40 Mamba-2 SSM + 40 Latent MoE (512 experts) + 8 Dense Attention layers
- bfloat16 compute: auto-detected for 512-expert models
Results: JANG vs MLX (200-question MMLU)
Per-subject comparison. Both tested with and without reasoning using identical methodology.
| Subject | JANG_2L No-Think | JANG_2L Reasoning | JANG_4M No-Think | JANG_4M Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning |
|---|---|---|---|---|---|---|
| Abstract Algebra | 12/20 | 16/20 | 10/20 | 19/20 | 9/20 | 19/20 |
| Anatomy | 15/20 | 17/20 | 15/20 | 18/20 | 14/20 | 18/20 |
| Astronomy | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 |
| College CS | 13/20 | 15/20 | 13/20 | 17/20 | 14/20 | 17/20 |
| College Physics | 14/20 | 18/20 | 14/20 | 19/20 | 13/20 | 20/20 |
| HS Biology | 19/20 | 18/20 | 19/20 | 20/20 | 18/20 | 20/20 |
| HS Chemistry | 15/20 | 16/20 | 15/20 | 18/20 | 16/20 | 19/20 |
| HS Mathematics | 8/20 | 18/20 | 6/20 | 18/20 | 6/20 | 18/20 |
| Logical Fallacies | 17/20 | 18/20 | 17/20 | 19/20 | 17/20 | 18/20 |
| World Religions | 18/20 | 17/20 | 17/20 | 19/20 | 16/20 | 19/20 |
| Total | 150/200 (75.0%) | 172/200 (86.0%) | 145/200 (72.5%) | 186/200 (93.0%) | 142/200 (71.0%) | 187/200 (93.5%) |
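The Total row can be re-derived from the per-subject rows. The sketch below copies the scores from the table and sums each column; the names `SCORES`, `COLUMNS`, `column_totals`, and `accuracy_table` are illustrative and not part of JANG or MLX.

```python
# Correct answers per subject (out of 20), copied from the table above.
COLUMNS = [
    "JANG_2L no-think", "JANG_2L reasoning",
    "JANG_4M no-think", "JANG_4M reasoning",
    "MLX 4-bit no-think", "MLX 4-bit reasoning",
]
SCORES = {
    "Abstract Algebra":  (12, 16, 10, 19, 9, 19),
    "Anatomy":           (15, 17, 15, 18, 14, 18),
    "Astronomy":         (19, 19, 19, 19, 19, 19),
    "College CS":        (13, 15, 13, 17, 14, 17),
    "College Physics":   (14, 18, 14, 19, 13, 20),
    "HS Biology":        (19, 18, 19, 20, 18, 20),
    "HS Chemistry":      (15, 16, 15, 18, 16, 19),
    "HS Mathematics":    (8, 18, 6, 18, 6, 18),
    "Logical Fallacies": (17, 18, 17, 19, 17, 18),
    "World Religions":   (18, 17, 17, 19, 16, 19),
}
QUESTIONS_PER_SUBJECT = 20


def column_totals(scores):
    """Sum correct answers per column across all subjects."""
    return [sum(column) for column in zip(*scores.values())]


def accuracy_table(scores):
    """Map each column name to (total correct, accuracy in percent)."""
    total_questions = QUESTIONS_PER_SUBJECT * len(scores)
    return {
        name: (total, 100.0 * total / total_questions)
        for name, total in zip(COLUMNS, column_totals(scores))
    }


if __name__ == "__main__":
    for name, (total, pct) in accuracy_table(SCORES).items():
        print(f"{name}: {total}/200 ({pct:.1f}%)")
```

With only 20 questions per subject, a single answer moves a subject score by 5 points, so per-subject differences of one or two questions are probably within noise.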
Summary
| Metric | JANG_2L | JANG_4M | MLX 3-bit | MLX 4-bit |
|---|---|---|---|---|
| MMLU (no-think) | 75.0% | 72.5% | Crashes | 71.0% |
| MMLU (reasoning) | 86.0% | 93.0% | Crashes | 93.5% |
| Size | 43 GB | 63 GB | N/A | 63 GB |
| GPU RAM | 42.4 GB | 61.2 GB | N/A | 63.3 GB |
| Speed | 51.6 tok/s | 55.1 tok/s | N/A | 59.8 tok/s |
| Fits 64 GB? | YES | YES | N/A | YES |
JANG_2L is 20 GB smaller than MLX 4-bit (43 GB vs 63 GB, about 32% less storage) and scores 4 percentage points higher on no-think MMLU (75.0% vs 71.0%), but trails it in reasoning mode (86.0% vs 93.5%). JANG_4M nearly matches MLX 4-bit in reasoning mode (93.0% vs 93.5%) at the same 63 GB size, with 8-bit attention protection.
MLX 3-bit cannot be created: mlx_lm.convert crashes on Nemotron's mtp.* weights. Only JANG can produce sub-4-bit quantizations of this model.
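The size and accuracy comparisons in this section are plain arithmetic on the Summary table. A minimal sketch, using hypothetical helper names, that separates a percentage-point gap from a relative change:

```python
def storage_saving_pct(small_gb: float, large_gb: float) -> float:
    """Relative storage saved by the smaller model, in percent."""
    return 100.0 * (large_gb - small_gb) / large_gb


def point_gap(a_pct: float, b_pct: float) -> float:
    """Accuracy difference in percentage points (a minus b)."""
    return a_pct - b_pct


def relative_gain_pct(a_pct: float, b_pct: float) -> float:
    """Accuracy difference relative to b, in percent."""
    return 100.0 * (a_pct - b_pct) / b_pct


if __name__ == "__main__":
    # Values taken from the Summary table (JANG_2L vs MLX 4-bit).
    print(f"storage saved: {storage_saving_pct(43, 63):.1f}%")
    print(f"no-think gap: {point_gap(75.0, 71.0):+.1f} points")
    print(f"no-think relative gain: {relative_gain_pct(75.0, 71.0):+.1f}%")
    print(f"reasoning gap: {point_gap(86.0, 93.5):+.1f} points")
```

The no-think advantage is 4 percentage points, which is a somewhat larger figure when expressed relative to the MLX 4-bit score.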
Specs
| Metric | Value |
|---|---|
| Source | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Architecture | Hybrid Mamba-2 SSM + Latent MoE + Dense Attention |
| Layers | 88 (40 Mamba-2 + 40 MoE + 8 Attention) |
| Experts | 512 per MoE layer, top-22 active (12B active params) |
| Latent MoE | fc1_latent_proj (4096→1024), experts in latent space |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| Average bits | 2.76 bpw |
| Disk size | 43 GB |
| GPU RAM | 42.4 GB |
| Speed | 51.6 tok/s generation, 136 tok/s prefill |
| Compute | bfloat16 (auto-detected) |
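As a rough cross-check of the figures in this table, weight storage can be estimated as parameters × bits per weight ÷ 8. The sketch below assumes a nominal 120B parameter count and ignores any per-group scale or metadata overhead. The tier fractions are HYPOTHETICAL, chosen only to show how an 8/6/2 profile can average near 2.76 bpw; the real split is not given in the source.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8.0


def average_bits(tier_bits: dict, tier_fractions: dict) -> float:
    """Weighted average bit width over quantization tiers."""
    total = sum(tier_fractions.values())
    return sum(tier_bits[t] * tier_fractions[t] for t in tier_bits) / total


# Bit widths from the JANG_2L profile in the Specs table.
JANG_2L_BITS = {"CRITICAL": 8, "IMPORTANT": 6, "COMPRESS": 2}

# HYPOTHETICAL share of weights per tier, for illustration only.
EXAMPLE_FRACTIONS = {"CRITICAL": 0.05, "IMPORTANT": 0.10, "COMPRESS": 0.85}

if __name__ == "__main__":
    print(f"{weight_size_gb(120, 2.76):.1f} GB at 2.76 bpw")
    print(f"{average_bits(JANG_2L_BITS, EXAMPLE_FRACTIONS):.2f} bpw (example split)")
```

The estimate lands near, but below, the reported 43 GB on disk, which is consistent with the table as long as some overhead is not counted in the estimate.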
Requirements
- Apple Silicon Mac with 64+ GB unified memory
- MLX Studio (recommended) or pip install "jang[mlx]>=2.1.5"
- Requires patched mlx-lm with latent MoE support (included in JANG loader)
Quick Start
pip install "jang[mlx]>=2.1.5"
from jang_tools.loader import load_jang_model
from mlx_lm import generate

# Load the JANG model and its tokenizer
model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-3-Super-120B-A12B-JANG_2L")

# With reasoning (recommended for hard questions)
messages = [{"role": "user", "content": "Explain the difference between SSM and attention."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
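In reasoning mode the generated text is described as containing a <think>...</think> span. A minimal sketch for separating that span from the final answer follows. The helper name `split_reasoning` is hypothetical, and the exact output layout is an assumption: some chat templates place the opening tag in the prompt, so the case of a closing tag with no opening tag is handled as well.

```python
import re

# Matches the first <think>...</think> span, including newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer) extracted from generated text."""
    match = THINK_RE.search(text)
    if match is not None:
        reasoning = match.group(1).strip()
        answer = (text[:match.start()] + text[match.end():]).strip()
        return reasoning, answer
    # Assumed fallback: only the closing tag appears in the output.
    head, sep, tail = text.partition("</think>")
    if sep:
        return head.strip(), tail.strip()
    # No reasoning span at all (for example, enable_thinking=False).
    return "", text.strip()


if __name__ == "__main__":
    reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
    print(reasoning)
    print(answer)
```

Passing `result` from the Quick Start through this helper would give the answer without the reasoning trace, provided the output follows the assumed layout.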
Technical Notes
- Latent MoE: Nemotron-H compresses hidden states from 4096→1024 before expert routing, then decompresses back. This is unique among MoE models and requires a patched mlx-lm. The JANG loader handles this automatically.
- bfloat16: 512-expert models overflow float16 (max 65,504). JANG auto-detects and uses bfloat16 (max 3.4×10^38). Zero quality impact.
- Mamba-2 SSM: 40 of 88 layers use state-space models instead of attention, enabling efficient long-context processing.
- trust_remote_code: Model requires custom Python files (modeling_nemotron_h.py, configuration_nemotron_h.py) which are included.
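The float16 range limit mentioned in the notes above can be illustrated with the Python standard library alone. This sketch does not use MLX or JANG: float16 is exercised through the `struct` module's half-precision format, and bfloat16 is emulated by keeping the top 16 bits of a float32 (truncation rather than round-to-nearest, which is an assumption made for simplicity).

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_float16(x: float) -> bool:
    """True if x can be packed as a finite half-precision float."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def to_bfloat16_truncated(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    high_bytes = struct.pack(">f", x)[:2]  # sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", high_bytes + b"\x00\x00")[0]


if __name__ == "__main__":
    value = 70000.0  # just above the float16 maximum
    print("fits float16:", fits_float16(value))
    print("bfloat16 (truncated):", to_bfloat16_truncated(value))
```

bfloat16 keeps the 8-bit exponent of float32, so it trades mantissa precision for range: the value stays finite but is stored only approximately.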
JANG, created by Jinho Jang ([email protected]) · @dealignai
GitHub · PyPI · HuggingFace
Korean
The first quantization that can run Nemotron-3-Super-120B on a 64 GB Mac. 43 GB, 52 tok/s, 86% MMLU.
| Metric | JANG_2L | MLX 4-bit |
|---|---|---|
| MMLU (no reasoning) | 75.0% | 71.0% |
| MMLU (with reasoning) | 86.0% | 93.5% |
| Size | 43 GB | 63 GB |
| Speed | 51.6 tok/s | 59.8 tok/s |
pip install "jang[mlx]>=2.1.5"
Incomplete Data
Some information about this model is not available. Use with caution: verify details from the original source before relying on this data.
Limitations & Considerations
- Benchmark scores may vary based on evaluation methodology and hardware configuration.
- VRAM requirements are estimates; actual usage depends on quantization and batch size.
- FNI scores are relative rankings and may change as new models are added.
- License Unknown: Verify licensing terms before commercial use.
AI Summary: Based on Hugging Face metadata. Not a recommendation.
Model Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-model--jangq-ai--nemotron-3-super-120b-a12b-jang_2l
- slug: jangq-ai--nemotron-3-super-120b-a12b-jang_2l
- source: huggingface
- author: Jangq Ai
- license: Other
- tags: mlx, safetensors, nemotron_h, jang, quantized, mixed-precision, apple-silicon, reasoning, thinking, moe, mamba, nemotron, text-generation, conversational, custom_code, en, zh, ko, license:other, region:us
Technical Specs
- architecture: null
- params billions: 120
- context length: 4,096
- pipeline tag: text-generation
- vram gb: 92.5
- vram is estimated: true
- vram formula: VRAM ≈ (params * 0.75) + 2GB (KV) + 0.5GB (OS)
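The index's static VRAM formula can be evaluated directly. The sketch below reproduces the listed 92.5 GB figure from the 120B parameter count; the function name and default arguments are illustrative. The formula does not account for the model's actual 2.76 bpw quantization, which helps explain why it is far above the 42.4 GB GPU RAM reported in the model card.

```python
def estimated_vram_gb(params_billion: float,
                      gb_per_billion: float = 0.75,
                      kv_gb: float = 2.0,
                      os_gb: float = 0.5) -> float:
    """Static estimate: (params * 0.75) + 2 GB (KV) + 0.5 GB (OS)."""
    return params_billion * gb_per_billion + kv_gb + os_gb


if __name__ == "__main__":
    print(f"{estimated_vram_gb(120):.1f} GB")
```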
Engagement & Metrics
- downloads: 1,160
- stars: 0
- forks: 0
Data indexed from public sources. Updated daily.

