Dataset

Nvidia Nemotron Progress Prize

Name: Nvidia Nemotron Progress Prize
Creator: Naribow


Free2AITools Nexus Index

59.1 Top 100%

S: Semantic 50

A: Authority 61

P: Popularity 51

R: Recency 91

Q: Quality 50

Dataset Information Summary
Entity Passport
Registry ID	hf-dataset--naribow--nvidia-nemotron-progress-prize
Provider	huggingface

📜

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__naribow__nvidia_nemotron_progress_prize,
  author = {Naribow},
  title = {Nvidia Nemotron Progress Prize Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/Naribow/nvidia-nemotron-progress-prize}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

Naribow. (2026). Nvidia Nemotron Progress Prize [Dataset]. Free2AITools. https://huggingface.co/datasets/Naribow/nvidia-nemotron-progress-prize


Downloads: 31,601


Dataset Specification

NVIDIA Progress Prize submission

This is the GitHub repository for the Progress Prize-winning submission to the NVIDIA Nemotron Model Reasoning Challenge.

Resources on Kaggle

Tabs on nemotron.huikang.dev

Base — Grid of competition problems colored by how the base model (pre-fine-tuning) does on each: solved / partially solved / unsolved across its generation runs. Click a problem for its prompt, parsed transformation table, answer, per-run extracted answer, and the token-level generation trace colored by logprob.
Synthetic — Same problem set as Base, but colored by investigation status (rule found / hypothesis formed / rule unknown). Click a problem for its prompt, parsed transformation, answer, submission, reasoning text, and investigation notes.
Corpus — Sortable table of training corpus entries with masked, unmasked, and total token counts per row. Filter by category or problem ID; open a row to see the token-level trace with masking highlighted.
Training — Per-problem table of step, loss-token count, and minimum logprob across training epochs. Select an epoch and a row to see token-level logprob changes against the base model.
Metrics — Index of training runs (LR, backend, epochs, batch, LoRA rank, examples, tokens, steps). Click a run to see its per-step charts: loss per token (overall and by category), min logprob by category, gradient norm, learning rate, and step time. Cmd+click a legend entry to isolate that category.
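The Training and Metrics tabs report a loss-token count, a loss per token, and a minimum logprob. As a rough illustration of how such numbers relate to a token-level trace, here is a minimal sketch. The function name `summarize_trace`, the 0/1 `loss_mask` convention (1 means the token counts toward the loss), and the sample values are assumptions for illustration, not the repository's actual code or data format.

```python
import math


def summarize_trace(logprobs, loss_mask):
    """Summarize one token-level trace.

    logprobs:  per-token log-probabilities (floats <= 0).
    loss_mask: 1 where the token counts toward the loss, 0 where it does not
               (an assumed convention, not taken from the repository).
    """
    if len(logprobs) != len(loss_mask):
        raise ValueError("logprobs and loss_mask must have the same length")
    scored = [lp for lp, m in zip(logprobs, loss_mask) if m]
    if not scored:
        return {"loss_tokens": 0, "loss_per_token": math.nan, "min_logprob": math.nan}
    return {
        "loss_tokens": len(scored),                    # tokens that contribute to the loss
        "loss_per_token": -sum(scored) / len(scored),  # mean negative log-likelihood
        "min_logprob": min(scored),                    # the least likely scored token
    }


# Made-up values: the first token is excluded from the loss.
stats = summarize_trace([-0.1, -2.0, -0.5, -0.3], [0, 1, 1, 1])
print(stats)
```

A very negative minimum logprob points at a single token the model finds surprising, even when the average loss per token looks healthy.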

Running the webpage locally

./serve.sh

Serves the static site at http://localhost:33304/.

Executing training

bash

uv run python3 reasoning.py
uv run python3 augmentation.py
uv run python3 corpus.py
uv run python3 train_sft.py         # Requires tinker API key
uv run modal run upload_adapter.py

PyTorch Direct Training (for vast.ai)

bash

uv run python3 train_sft_pytorch.py  # No tinker needed; trains directly with PyTorch

# Run the tests locally (no GPU required)
bash run_all_tests.sh

Running on vast.ai (or other cloud GPU providers)

This repository is also available on HuggingFace: https://huggingface.co/datasets/Naribow/nvidia-nemotron-progress-prize

Uploading to Hugging Face

The repository contains a very large number of files (corpus: 16,365 files; training logprobs/tokens: 200,000+ files), so they are zipped before uploading.

bash

# 1. Create the archives (corpus.zip, training_logprobs_tokens.zip)
uv run python prepare_archives.py

# 2. Upload to Hugging Face
uv run python upload_to_hf.py

Exclusion patterns are managed in .hfignore.
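The idea of packing many small files into one archive before uploading can be sketched with the standard library alone. This is a minimal illustration, not the contents of prepare_archives.py, which are not shown here; the helper name `zip_directory` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir, archive_path):
    """Pack every file under src_dir into one zip archive; return the file count.

    Hypothetical helper for illustration. Keep archive_path outside src_dir,
    otherwise the archive would try to include itself.
    """
    src = Path(src_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store names relative to the parent so extraction recreates src_dir/.
                zf.write(path, arcname=path.relative_to(src.parent).as_posix())
                count += 1
    return count
```

Calling `zip_directory("corpus", "corpus.zip")` from the repository root would produce a single archive whose entries all start with `corpus/`, so one large file is uploaded instead of thousands of small ones.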

Setup from Hugging Face

bash

# 1. Clone the repository
git clone https://huggingface.co/datasets/Naribow/nvidia-nemotron-progress-prize
cd nvidia-nemotron-progress-prize

# 2. Extract the archives
./extract_archives.sh

# 3. Install dependencies
uv sync

# 4. Run training
uv run python train_sft_pytorch.py

📖 Setup guides:

🚀 Quick start: VASTAI_QUICKSTART.md - get started right after connecting over SSH
📚 Detailed guide: VASTAI_SETUP.md - the full version, including troubleshooting
🛠️ Local helper: local_vastai_helper.sh - a script for driving the instance from your local machine

Quick start:

bash

# 1. Download dataset from HuggingFace
git clone https://huggingface.co/datasets/Naribow/nvidia-nemotron-progress-prize .

# 2. Create .env file in current directory
cp .env.example .env
# Edit .env and add your HF_TOKEN and WANDB_API_KEY

# 3. Run setup script
./setup_vastai.sh

# 4. Start training (PyTorch direct training version)
uv run python3 train_sft_pytorch.py
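Before launching a long run it can help to confirm that .env really defines both keys mentioned in step 2. The sketch below uses only the standard library and a deliberately simple KEY=VALUE parser; the repository itself lists python-dotenv as a dependency, and how the training script reads the file is not shown here. The function names and the sample token value are hypothetical.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=("HF_TOKEN", "WANDB_API_KEY")):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Sample content with a placeholder token and an empty WANDB_API_KEY.
sample = "HF_TOKEN=hf_example\n# WANDB_API_KEY not set yet\nWANDB_API_KEY=\n"
print(missing_keys(parse_env_text(sample)))  # ['WANDB_API_KEY']
```

To check a real file, pass `open(".env").read()` to `parse_env_text` instead of the sample string.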

Requirements

Python: 3.11 or higher
uv package manager: Required for dependency management
GPU: NVIDIA GPU with CUDA support
Memory: The model nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a 30B parameter model. LoRA training requires significant GPU memory (estimated 40GB+ VRAM).
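For a sense of scale, the memory needed just to hold the weights follows from the parameter count and the bytes stored per parameter. The sketch below is back-of-the-envelope arithmetic only: it ignores LoRA adapter weights, activations, optimizer state, and framework overhead, and whether the training script loads the model in 4-bit is an assumption, not something stated above.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 30e9  # "30B" taken from the model name

print(weight_memory_gb(PARAMS, 2))    # BF16, 2 bytes per parameter: 60.0
print(weight_memory_gb(PARAMS, 0.5))  # 4-bit quantized, 0.5 bytes per parameter: 15.0
```

The gap between the two results shows why the precision used to load the base model dominates the VRAM requirement.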

Key Dependencies

Original (tinker backend)

tinker>=0.16.1 and tinker-cookbook>=0.3.0 (LoRA training framework) - Requires API key
modal>=1.4.1 (optional, for modal backend)

PyTorch Direct Training (train_sft_pytorch.py)

unsloth (unsloth + PEFT for efficient LoRA training)
peft>=0.15.0 (Parameter-Efficient Fine-Tuning)
accelerate>=1.0.0
bitsandbytes>=0.45.0

Common

torch>=2.11.0
transformers==4.57.6
wandb (for experiment tracking)
python-dotenv (for environment variables)

Required Data Files

Both train_sft.py (tinker) and train_sft_pytorch.py expect the following files:

corpus.jsonl (training corpus index)
corpus/ directory (pre-tokenized training data)

These should be included in the HuggingFace dataset.
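A quick pre-flight check for these two paths might look like the following. It is a minimal sketch: the function name `missing_required_data` is hypothetical, and it only tests that the paths exist, not that their contents are valid.

```python
from pathlib import Path


def missing_required_data(root="."):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    problems = []
    if not (root / "corpus.jsonl").is_file():
        problems.append("corpus.jsonl not found")
    corpus_dir = root / "corpus"
    if not corpus_dir.is_dir():
        problems.append("corpus/ directory not found")
    elif not any(corpus_dir.iterdir()):
        problems.append("corpus/ directory is empty")
    return problems
```

If corpus/ is reported missing after cloning from Hugging Face, the archives probably still need to be extracted with ./extract_archives.sh.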

Setup on vast.ai

bash

# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Download dataset from HuggingFace
# (using git-lfs or hf command)
git clone https://huggingface.co/datasets/Naribow/nvidia-nemotron-progress-prize
cd nvidia-nemotron-progress-prize

# 3. Install dependencies
uv sync

# 4. Set up environment variables (for WandB tracking)
# Create a .env file at the repository root (../../.env from references/nemotron/) with:
# WANDB_API_KEY=your_wandb_api_key_here
# Example:
#   cd ../..  # Go to repository root
#   echo "WANDB_API_KEY=your_wandb_api_key_here" > .env
#   cd references/nemotron  # Return to training directory

# 5. Run training (PyTorch direct training version)
uv run python3 train_sft_pytorch.py

Training Options

PyTorch Direct Training (train_sft_pytorch.py) - Recommended for vast.ai
- Uses unsloth + PEFT for LoRA training
- No API key required (neither tinker nor modal is needed)
- Runs directly on your GPU
- Based on the Kaggle notebook implementation
Tinker backend (train_sft.py with backend="tinker")
- Requires Tinker API key
- Offloads training to remote Tinker service
Modal backend (train_sft.py with backend="modal")
- Requires Modal API key
- Runs training on Modal's cloud GPU infrastructure

Important Notes

Verify that all required files (corpus.jsonl, corpus/ directory) are present in the HuggingFace dataset before running training.
PyTorch version (train_sft_pytorch.py):
- Filter categories by setting filter_categories in Cfg class (e.g., filter_categories=["spelling"])
- Training checkpoints and logs are saved to ./training/sft/<timestamp>/
- Adapter is saved as adapter_model.safetensors with lm_head key renaming
Tinker version (train_sft.py):
- Filter categories by modifying filter_training_examples() function
- Checkpoints: Set save_checkpoint_every_epoch=True in Cfg to enable
WandB Integration: Training metrics are logged to Weights & Biases for experiment tracking. The run will resume if interrupted (resume="allow").
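Category filtering of the kind described above can be pictured as a simple list filter. This sketch assumes each training example is a dict with a `category` field; the field name, the `problem_id` field, and the "arithmetic" category are hypothetical and used only for illustration, and the actual handling of `filter_categories` in train_sft_pytorch.py may differ.

```python
def filter_by_category(examples, filter_categories=None):
    """Keep examples whose 'category' is listed; None or an empty list keeps everything."""
    if not filter_categories:
        return list(examples)
    wanted = set(filter_categories)
    return [ex for ex in examples if ex.get("category") in wanted]


# Hypothetical examples; only "spelling" is a category named in the notes above.
examples = [
    {"problem_id": "p1", "category": "spelling"},
    {"problem_id": "p2", "category": "arithmetic"},
]
print(filter_by_category(examples, ["spelling"]))
```

Treating an unset filter as "keep everything" means the default configuration trains on the whole corpus.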

Resuming Training from a Checkpoint

If training is interrupted, you can resume from the last saved checkpoint:

Find the checkpoint directory in ./training/sft/<timestamp>/
Identify the epoch checkpoint you want to resume from (e.g., epoch_2)
Edit train_sft.py and modify the Cfg class:

python

cfg = Cfg(
    resume_from_checkpoint="training/sft/05-06-12-34/epoch_2",
    # Keep the same log_path to continue writing to the same directory
    log_path="05-06-12-34",
    # Adjust num_epochs to account for already completed epochs
    # e.g., if you completed 3 epochs and want 5 total, set num_epochs=5
    # and the training will continue from epoch 3
)

Run training again: uv run python3 train_sft.py

Notes:

The WandB run will automatically resume if you use the same log_path (run name).
Set num_epochs to the total number of epochs you want, not the number remaining. Epochs are counted from 0, so if you have completed epochs 0-2 and want 5 epochs in total, keep num_epochs=5.
All metrics and logprobs will continue to be appended to the existing log files.
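The epoch bookkeeping can be made concrete with a small sketch. It assumes that a checkpoint directory named `epoch_<N>` means epoch N has finished, which matches the example above but is not confirmed by the training code; the helper name `remaining_epochs` is hypothetical.

```python
import re


def remaining_epochs(resume_from_checkpoint, num_epochs):
    """Epoch indices still to run, given a checkpoint path ending in epoch_<N>."""
    if resume_from_checkpoint is None:
        return list(range(num_epochs))
    match = re.search(r"epoch_(\d+)/?$", resume_from_checkpoint)
    if match is None:
        raise ValueError(f"cannot find an epoch number in {resume_from_checkpoint!r}")
    start = int(match.group(1)) + 1  # epoch_<N> is assumed to mean epoch N finished
    return list(range(start, num_epochs))


print(remaining_epochs("training/sft/05-06-12-34/epoch_2", 5))  # [3, 4]
```

With num_epochs=5 and a checkpoint from epoch_2, two epochs remain. Setting num_epochs=2 by mistake would leave nothing to run.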

Uploading Training Results to Hugging Face

After training completes, upload the results (including checkpoints and logs) to Hugging Face:

bash

uv run python3 upload_to_hf.py

This uploads all training results in training/sft/<timestamp>/, which:

Allows visualization via metrics.html and training.html on Hugging Face
Preserves checkpoints for future use or resumption

Note: Training results can be large (500MB-1GB per epoch). The upload script uses upload_large_folder to handle this efficiently.


🛡️ Dataset Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-dataset--naribow--nvidia-nemotron-progress-prize
slug: naribow--nvidia-nemotron-progress-prize
source: huggingface
author: Naribow
license
tags: region:us

⚙️ Technical Specs

architecture: null
params billions: null
context length: null
pipeline tag

📊 Engagement & Metrics

downloads: 31,601
stars: 0
forks: null

Data indexed from public sources. Updated daily.
