nemotron-pretraining-specialized-v1
by nvidia · Dataset

"--- license: cc-by-4.0 task_categories: - text-generation configs: - config_name: Nemotron-Pretraining-Wiki-Rewrite data_files: - split: train path: Nemotron-Pretraining-Wiki-Rewrite/*.parquet - config_name: Nemotron-Pretraining-Math-Textbooks data_files: - split: train path: Nemotron-Pretraining-Ma..."


🛠️ Technical Profile

⚡ Hardware & Scale
Size: -
Total Rows: -
Files: 332

🧠 Training & Env
Format: Parquet
Cleaning: Raw

🌐 Cloud & Rights
Source: huggingface
License: CC-BY-4.0
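
Since the profile lists the data as 332 Parquet files, individual shards can also be inspected directly. Below is a minimal sketch with pyarrow, assuming a locally downloaded shard; the filename is a placeholder, since the front matter only specifies a *.parquet glob per config.

```python
import pyarrow.parquet as pq

# Placeholder path: the front matter only gives a per-config "*.parquet" glob,
# so the actual shard names may differ.
SHARD = "Nemotron-Pretraining-Wiki-Rewrite/part-00000.parquet"

table = pq.read_table(SHARD)
print(table.schema)                    # column names and types of this shard
print(f"{table.num_rows} rows in this shard")
print(table.slice(0, 3).to_pylist())   # first three records as Python dicts
```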

Dataset Card

Nemotron-Pre-Training-Dataset-v2.1

Dataset Description

The Nemotron-Pre-Training-Dataset-v2.1 extends the previously released Nemotron pretraining datasets with refreshed, higher-quality, and more diverse data across math, code, English Common Crawl, and large-scale synthetic corpora. Designed for the NVIDIA Nemotron 3 family of LLMs, the dataset introduces new Common Crawl code extraction, 2.5T new English web tokens, updated GitHub-sourced source-code corpora, and specialized STEM reasoning datasets. These additions are intended to be used together with, not as replacements for, existing Nemotron Pretraining datasets (see the Nemotron Nano 2 technical report), providing an expanded, modern foundation for training leading LLMs.

Our dataset comes in 4 main categories (the stated token counts are tallied in the short sketch after this list):

- 427.9B-token high-quality Code pretraining dataset obtained by processing Common Crawl Code pages using the Nemotron-CC-Math Lynx + LLM pipeline. The pipeline preserves equations and code often lost by other pipelines, standardizes math equations to LaTeX, and removes noise.

- 2.5T English tokens from Common Crawl: organic, translated, and synthetically rephrased. These are new tokens that are intended to be used in conjunction with the previously released 6.6T tokens of Nemotron-CC-v2.

- Comprehensive update to and expansion of Nemotron-Pretraining-Code-v1. Adds metadata for 377M more filtered and deduplicated files from GitHub, or around 340B tokens. It also includes synthetic data generated with five different techniques: Question Answering, Code Review, Student Teacher, Rewriting, and Transpilation.

- Collection of synthetic datasets for specialized areas like STEM reasoning and scientific coding. It is an extension of the previously released Nemotron-Pretraining-SFT-v1 with updated naming to better reflect the nature of the dataset.
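
As noted above, a rough tally of the sizes stated for the first three categories is sketched below; the fourth category (specialized synthetic datasets) has no stated token count and is left out.

```python
# Rough tally of the token counts stated above, in billions of tokens.
cc_code    = 427.9   # Common Crawl Code pretraining dataset
cc_english = 2500.0  # new English Common Crawl tokens (2.5T)
github     = 340.0   # updated GitHub-sourced code corpora (~340B)

total_b = cc_code + cc_english + github
print(f"~{total_b / 1000:.2f}T tokens across the three sized categories")  # ~3.27T
```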

This dataset is ready for commercial use.
