⚠️

This is a Dataset, not a Model

The following metrics do not apply: FNI Score, Deployment Options, Model Architecture

openmathinstruct-2

FNI 19.9
by nvidia Dataset

"--- language: - en license: cc-by-4.0 size_categories: - 10M"

Best Scenarios

✨ Data Science

Technical Constraints

Generic Use
- Size
- Rows
Parquet Format
224 Likes

Capabilities

  • Data Science

Deep Dive

Technical Profile

Hardware & Scale

Size: -
Total Rows: -
Files: 59

Training & Env

Format: Parquet
Cleaning: Raw

Cloud & Rights

Source: huggingface
License: CC-BY-4.0

🧬 Schema & Configs

Fields

feature: string
label: int64
split: string

Dataset Card

OpenMathInstruct-2

OpenMathInstruct-2 is a math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model.

The training set problems of GSM8K and MATH are used for constructing the dataset in the following ways:

  • Solution augmentation: Generating chain-of-thought solutions for training set problems in GSM8K and MATH.
  • Problem-Solution augmentation: Generating new problems, followed by solutions for these new problems.

The OpenMathInstruct-2 dataset contains the following fields:

  • problem: Original problem from either the GSM8K or MATH training set or augmented problem from these training sets.
  • generated_solution: Synthetically generated solution.
  • expected_answer: For problems in the training set, it is the ground-truth answer provided in the datasets. For augmented problems, it is the majority-voting answer.
  • problem_source: Whether the problem is taken directly from GSM8K or MATH or is an augmented version derived from either dataset.
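
The majority-voting rule used for expected_answer on augmented problems can be illustrated with a minimal sketch. It assumes the final answers have already been extracted from several sampled solutions as strings; the helper name majority_vote, the whitespace-only normalization, and the sample answers are hypothetical and are not taken from the dataset's actual pipeline.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none."""
    # Assumption: only surrounding whitespace is normalized here; a real
    # pipeline would likely canonicalize mathematical expressions as well.
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


# Hypothetical final answers extracted from several sampled solutions.
sampled_answers = ["12", "12 ", "15", None, "12", "15"]
expected_answer = majority_vote(sampled_answers)
print(expected_answer)
```

Solutions whose answer is missing are ignored rather than counted as a vote, which is one possible design choice among several.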

We also release 1M, 2M, and 5M fair-downsampled versions of the entire training set, corresponding to points in the above scaling plot. These splits are named train_1M, train_2M, and train_5M. To use one of these subsets, pass its name as the split argument when downloading the data:

```python
from datasets import load_dataset

# Download only the 1M training split
dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train_1M', streaming=True)
```
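
With streaming=True, load_dataset returns an iterable dataset that is consumed record by record rather than indexed by position. The following sketch shows one way to inspect the first few records of such a stream. The helper peek is hypothetical, and a generator of invented placeholder records stands in for the real split so that the example runs without network access.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(stream: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records of an iterable without reading the rest."""
    return list(islice(stream, n))


# Stand-in for the streamed split; these records are invented placeholders,
# not real rows from OpenMathInstruct-2.
fake_stream = (
    {"problem": f"placeholder problem {i}", "problem_source": "placeholder"}
    for i in range(1_000_000)
)
first_rows = peek(fake_stream, n=2)
print(len(first_rows), first_rows[0]["problem"])
```

The same helper should work if the streamed dataset object is passed in place of fake_stream, since it only relies on iteration.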

To download the entire training set and convert it to JSONL format, use the following code snippet. This might take 20-30 minutes (or more, depending on your network connection) and will use ~20 GB of RAM.

```python
import json

from datasets import load_dataset
from tqdm import tqdm

dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train')

print("Converting dataset to jsonl format")
output_file = "openmathinstruct2.jsonl"
with open(output_file, 'w', encoding='utf-8') as f:
    for item in tqdm(dataset):
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print(f"Conversion complete. Output saved as {output_file}")
```
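
As a complement to the conversion above, here is a minimal sketch of writing records to JSONL from any iterable and reading them back lazily, one line at a time. The helpers write_jsonl and iter_jsonl and the sample records are hypothetical. The field names follow the dataset card, but the values are invented for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, Iterator


def write_jsonl(records: Iterable[Dict[str, Any]], path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield records one at a time so the whole file is never held in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Invented sample records using the field names from the dataset card.
sample = [
    {"problem": "What is 2 + 3?", "generated_solution": "2 + 3 = 5",
     "expected_answer": "5", "problem_source": "placeholder"},
    {"problem": "Area of a circle of radius 1?", "generated_solution": "π · 1² = π",
     "expected_answer": "π", "problem_source": "placeholder"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    written = write_jsonl(sample, path)
    restored = list(iter_jsonl(path))

print(written, restored == sample)
```

Because write_jsonl accepts any iterable, passing a dataset loaded with streaming=True may avoid holding the full split in memory. That is an assumption and has not been measured here.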

Apart from the dataset, we also release the contamination explorer for looking at problems in the OpenMathInstruct-2 dataset that are similar to the GSM8K, MATH, AMC 2023, AIME 2024, and [Omni-MA
