Nemotron Cascade 8b Thinking

Name: Nemotron Cascade 8b Thinking
Author: nvidia


Free2AITools Nexus Index

39.9 Top 100%

S: Semantic 50

A: Authority 0

P: Popularity 39

R: Recency 79

Q: Quality 50

Tech Context

8.19B Params

32.768K Ctx

Vital Performance

6.2K DL / 30D



24G GPU ~9GB Est. VRAM

Dense QWEN3FORCAUSALLM Architecture

Restricted OTHER License

Model Information Summary
Entity Passport
Registry ID	hf-model--nvidia--nemotron-cascade-8b-thinking
License	Other
Provider	huggingface

Compute Threshold

~8.6GB VRAM (static estimation for 4-bit quantization)

📜

Cite this model

Academic & Research Attribution

BibTeX

@misc{hf_model__nvidia__nemotron_cascade_8b_thinking,
  author = {nvidia},
  title = {Nemotron Cascade 8b Thinking Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/nvidia/Nemotron-Cascade-8B-Thinking}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

nvidia. (2026). Nemotron Cascade 8b Thinking [Model]. Free2AITools. https://huggingface.co/nvidia/Nemotron-Cascade-8B-Thinking

🔬Technical Deep Dive


Quick Commands

🦙 Ollama Run

ollama run nemotron-cascade-8b-thinking

🤗 HF Download

huggingface-cli download nvidia/Nemotron-Cascade-8B-Thinking

📦 Install Lib

pip install -U transformers

⚖️ Free2AITools Nexus Index V2.0


💬 Index Insight

FNI V2.0 for Nemotron Cascade 8b Thinking: Semantic (S:50), Authority (A:0), Popularity (P:39), Recency (R:79), Quality (Q:50).


---


Nemotron-Cascade-8B-Thinking

Introduction

We're excited to introduce Nemotron-Cascade-8B-Thinking, a powerful general-purpose model trained through sequential and domain-wise reinforcement learning. It is post-trained from the Qwen3-8B-Base model and achieves best-in-class performance across a wide range of benchmarks. Unlike Nemotron-Cascade-8B, Nemotron-Cascade-8B-Thinking is designed exclusively for the thinking mode.

Training Pipeline

The training pipeline for Nemotron-Cascade begins with a multi-stage SFT phase to equip the model with foundational skills. Subsequently, Cascade RL is applied across multiple domains to further enhance the model’s performance in these areas.

Notably, RLHF for alignment, when used as a pre-step, boosts the model’s complex reasoning ability far beyond mere preference optimization. Subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it, as the figure below illustrates.

Figure: LiveCodeBench v6 (08/24–05/25) performance of the Nemotron-Cascade-14B-Thinking model throughout the Cascade RL process.

Results

We evaluate our model against competitive reasoning models on a diverse set of benchmarks, covering general-knowledge reasoning, alignment and instruction following, mathematical reasoning, competitive programming, software engineering, and tool-use proficiency.
For Nemotron-Cascade models, we use a maximum generation length of 64K tokens and set the temperature to 0.6 and top-p to 0.95 for reasoning tasks.
Our Nemotron-Cascade models achieve best-in-class performance across almost all benchmarks. Remarkably, Nemotron-Cascade-8B and Nemotron-Cascade-8B-Thinking achieve LiveCodeBench (LCB) and LCB Pro scores comparable to those of DeepSeek-R1-0528 (671B).
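
The decoding settings above can be collected into one set of keyword arguments. This is only an illustrative sketch: `REASONING_SAMPLING` and `generation_kwargs` are hypothetical names, the keys follow the argument names of the Hugging Face `generate()` method, and "64K tokens" is assumed to mean 64 × 1024.

```python
# Sampling settings stated for reasoning tasks (temperature 0.6, top-p 0.95).
REASONING_SAMPLING = {"do_sample": True, "temperature": 0.6, "top_p": 0.95}

# Assumption: "64K tokens" is read as 64 * 1024.
MAX_GENERATION_TOKENS = 64 * 1024


def generation_kwargs(max_new_tokens: int = MAX_GENERATION_TOKENS) -> dict:
    """Return keyword arguments for a sampling-based generate() call."""
    if not 0 < max_new_tokens <= MAX_GENERATION_TOKENS:
        raise ValueError("max_new_tokens must be between 1 and 65536")
    return {**REASONING_SAMPLING, "max_new_tokens": max_new_tokens}


print(generation_kwargs())
```

With a loaded model, these could be passed along the lines of `model.generate(**inputs, **generation_kwargs())`; a smaller `max_new_tokens` may be preferable on memory-constrained hardware.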

Benchmark (metric: Pass@1)	Qwen3-8B	Nemotron-Nano-9B-v2	DeepSeek-R1-0528 (671B)	Gemini-2.5-Flash-Thinking	Nemotron-Cascade-8B-Thinking	Nemotron-Cascade-8B
Knowledge Reasoning
MMLU	83.0	82.6	89.9	-	84.0	83.7
MMLU Pro	75.1	73.3	85.0	81.9	75.5	75.7
GPQA-Diamond	62.0	64.0	81.0	82.8	66.7	66.5
Alignment
ArenaHard	85.8	74.6	95.1	95.7	85.8	87.9
IFEval (Strict Prompt)	85.0	86.1	84.1	89.8	83.7	90.2
IFBench	34.4	37.4	38.0	36.1	41.4	40.8
Math
AIME 2024	76.0	81.9	91.4	82.3	88.8	89.5
AIME 2025	67.3	72.0	87.5	72.0	81.4	80.1
Code
LCB v5 (08/24-02/25)	61.2	68.2	74.8	63.4	74.5	74.3
LCB v6 (08/24-05/25)	58.3	65.3	73.3	61.9	71.4	71.1
LCB Pro 25Q2 (Easy)	46.1	59.3	63.9	47.4	64.8	65.7
LCB Pro 25Q2 (Med)	2.2	4.8	7.0	1.8	6.1	6.4
SWE Verified (Agentless)	20.5	-	57.6	48.9	38.5	37.2
Tool Calling
BFCL V3	68.1	66.9	67.9	68.6	67.0	64.4
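
To make the table easier to read, the sketch below computes the Pass@1 difference between Nemotron-Cascade-8B-Thinking and Qwen3-8B for a few rows. The scores are copied from the table above; the names `SCORES` and `gain_over_baseline` are illustrative only.

```python
# (Qwen3-8B, Nemotron-Cascade-8B-Thinking) Pass@1 scores copied from the table.
SCORES = {
    "AIME 2024": (76.0, 88.8),
    "AIME 2025": (67.3, 81.4),
    "LCB v6 (08/24-05/25)": (58.3, 71.4),
    "SWE Verified (Agentless)": (20.5, 38.5),
    "BFCL V3": (68.1, 67.0),
}


def gain_over_baseline(scores: dict) -> dict:
    """Return the score difference (model minus baseline), rounded to 0.1."""
    return {name: round(model - base, 1) for name, (base, model) in scores.items()}


for name, delta in gain_over_baseline(SCORES).items():
    print(f"{name}: {delta:+.1f}")
```

A negative value, as for BFCL V3, means the baseline scores higher on that row.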

Usage Recommendations

For local deployment, we recommend setting the sampling parameters to temperature = 0.6, top_p = 0.95. We recommend using RoPE scaling with the YaRN method for better long-context support. This can be enabled by updating the model’s config.json as shown below:

json

  {
    ...,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 2.0,
        "original_max_position_embeddings": 32768
    }
  }

Nemotron-Cascade-14B-Thinking: use factor: 3.0 to extend the context length to 90K tokens for SWE Verified (Agentless), and factor: 2.0 to extend the context length to 64K tokens for other benchmarks.
Nemotron-Cascade-8B and Nemotron-Cascade-8B-Thinking: use factor: 2.0 across all benchmarks.
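
As a rough illustration of the config change above, the sketch below adds the `rope_scaling` block to an in-memory config dictionary and prints the nominal context window implied by the factor. The helpers `with_yarn` and `nominal_context` are hypothetical, and `base_config` is a placeholder rather than the model's real `config.json`.

```python
import json


def with_yarn(config: dict, factor: float = 2.0, original_max: int = 32768) -> dict:
    """Return a copy of `config` with the YaRN rope_scaling block added."""
    patched = dict(config)
    patched["rope_scaling"] = {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }
    return patched


def nominal_context(config: dict) -> int:
    """Nominal window implied by the scaling: factor * original length."""
    scaling = config["rope_scaling"]
    return int(scaling["factor"] * scaling["original_max_position_embeddings"])


# Placeholder config; a real config.json contains many more keys.
base_config = {"architectures": ["Qwen3ForCausalLM"], "max_position_embeddings": 32768}
patched = with_yarn(base_config)
print(json.dumps(patched["rope_scaling"], indent=2))
print(nominal_context(patched))  # 2.0 * 32768 = 65536
```

To apply this to a downloaded checkpoint, one could read `config.json` with `json.load`, pass the result through `with_yarn`, and write it back with `json.dump`. The original dictionary is left unmodified, which makes it easy to compare the two versions.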

Evaluation Toolkit

To reproduce our results, see the evaluation code, scripts, and cached prediction files at https://huggingface.co/nvidia/Nemotron-Cascade-8B-Thinking/blob/main/evaluation/README.md

Chat Template

Nemotron-Cascade-8B-Thinking follows the Qwen3-style ChatML template and is designed exclusively for the thinking mode. To align with the template used in Nemotron-Cascade-8B, the " /think" tag should be appended to the end of the user input. Note that a leading space is included in this tag to ensure correct tokenization.

To reduce the context length in a multi-turn conversation, we include only the final summary of the model’s output in the conversation history and change the user turn’s " /think" tag to " /no_think".

A brief example is shown below:

python

from transformers import AutoTokenizer

model_name = 'nvidia/Nemotron-Cascade-8B-Thinking'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Single-turn example
messages = [
    {"role": "user", "content": "calculate 1+1?"}
]

# only thinking mode is supported (enable_thinking=True)
prompt_thinking = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
# prompt_thinking = '<|im_start|>system\nYou are a helpful and harmless assistant.<|im_end|>\n<|im_start|>user\ncalculate 1+1? /think<|im_end|>\n<|im_start|>assistant\n'


# Multi-turn example
messages = [
    {"role": "user", "content": "calculate 1+1?"},
    {"role": "assistant", "content": "THINKING_CONTENT\nTo calculate \\(1 + 1\\):\n\n1. **Identify the operation**: This is a basic addition problem involving two integers.\n2. **Perform the addition**:  \n   \\(1 + 1 = 2\\).\n\n**Result**: \\(\\boxed{2}\\)",},
    {"role": "user", "content": "what about 2+2"}
]

# only thinking mode is supported (enable_thinking=True)
prompt_thinking = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
# prompt_thinking = '<|im_start|>system\nYou are a helpful and harmless assistant.<|im_end|>\n<|im_start|>user\ncalculate 1+1? /no_think<|im_end|>\n<|im_start|>assistant\nTo calculate \\(1 + 1\\):\n\n1. **Identify the operation**: This is a basic addition problem involving two integers.\n2. **Perform the addition**:  \n   \\(1 + 1 = 2\\).\n\n**Result**: \\(\\boxed{2}\\)<|im_end|>\n<|im_start|>user\nwhat about 2+2 /think<|im_end|>\n<|im_start|>assistant\n'
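
The tag handling described above can be mimicked without loading the tokenizer. The following is a rough sketch, not the model's actual chat template: `render_prompt` is a hypothetical helper, the system prompt is taken from the example output above, and assistant turns in the history are assumed to already contain only the final summary.

```python
SYSTEM_PROMPT = "You are a helpful and harmless assistant."


def render_prompt(messages: list[dict]) -> str:
    """Build a ChatML prompt: last user turn gets ' /think', earlier ones ' /no_think'."""
    # Index of the final user turn (assumes at least one user message).
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    parts = [f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"]
    for i, message in enumerate(messages):
        content = message["content"]
        if message["role"] == "user":
            content += " /think" if i == last_user else " /no_think"
        parts.append(f"<|im_start|>{message['role']}\n{content}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)


history = [
    {"role": "user", "content": "calculate 1+1?"},
    {"role": "assistant", "content": "**Result**: 2"},  # summary only, thinking removed
    {"role": "user", "content": "what about 2+2"},
]
print(render_prompt(history))
```

In practice `tokenizer.apply_chat_template` should be preferred, since it applies the template shipped with the model; the sketch is only meant to show where each tag ends up.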

Release Date

Dec 08, 2025

License

Your use of this model is governed by the NVIDIA Open Model License.

Citation

text

@article{Nemotron_Cascade_Scaling_Cascaded_Reinforcement_Learning,
  title={Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models},
  author={Wang, Boxin and Lee, Chankyu and Lee, Nayeon and Lin, Sheng-Chieh and Dai, Wenliang and Chen, Yang and Chen, Yangyi and Yang, Zhuolin and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  year={2025}
}


📝 Limitations & Considerations

• Benchmark scores may vary based on evaluation methodology and hardware configuration.
• VRAM requirements are estimates; actual usage depends on quantization and batch size.
• FNI scores are relative rankings and may change as new models are added.
• The registry lists the license as "Other"; verify the NVIDIA Open Model License terms before commercial use.


🛡️ Model Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-model--nvidia--nemotron-cascade-8b-thinking
slug: nvidia--nemotron-cascade-8b-thinking
source: huggingface
author: nvidia
license: Other
tags: transformers, safetensors, qwen3, text-generation, nvidia, nemotron-cascade, reasoning, general-purpose, sft, rl, pytorch, conversational, en, arxiv:2512.13607, arxiv:2309.00071, license:other, text-generation-inference, endpoints_compatible, region:us

⚙️ Technical Specs

architecture: Qwen3ForCausalLM
params billions: 8.19
context length: 32,768
pipeline tag: text-generation
vram gb: 8.6
vram is estimated: true
vram formula: VRAM ≈ (params * 0.75) + 2GB (KV) + 0.5GB (OS)
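
The listed VRAM formula can be checked with a few lines of arithmetic. The function name `estimate_vram_gb` is illustrative, and the constants are simply those given in the formula above rather than measured values.

```python
def estimate_vram_gb(params_billions: float) -> float:
    """Apply the listed formula: params * 0.75 + 2 GB (KV cache) + 0.5 GB (OS)."""
    weights_gb = params_billions * 0.75
    kv_cache_gb = 2.0
    os_overhead_gb = 0.5
    return weights_gb + kv_cache_gb + os_overhead_gb


# 8.19 * 0.75 + 2 + 0.5 = 8.6425, which rounds to the listed 8.6 GB.
print(round(estimate_vram_gb(8.19), 1))
```

Actual usage depends on quantization, context length, and batch size, so this number is best treated as a lower-bound planning figure.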

📊 Engagement & Metrics

downloads: 6,235
stars: 0
forks: 0

Data indexed from public sources. Updated daily.