Dataset Card for UltiMath
UltiMath is a large-scale synthetic dataset containing ~33 billion math reasoning examples, designed to enhance arithmetic and symbolic reasoning in large language models (LLMs).
Dataset Details
Dataset Description
- Curated by: [Roman]
- Funded by: [No funding used]
- Shared by [Roman]: [Uploads via API]
- License: [CC BY-SA 4.0]
Dataset Sources [Code Generated]
Uses
UltiMath is designed to improve multi-step arithmetic, algebraic manipulation, equation solving, and symbolic reasoning during pretraining or continued pretraining of LLMs.
Direct Use
It is intended to be included in the pretraining corpus.
It can also be used later for fine-tuning.
Out-of-Scope Use
Redistribution is permitted under CC BY-SA 4.0, provided proper attribution is given to the original creator ([Roman]).
Mirroring without credit or claiming authorship is not allowed.
Dataset Structure
Each shard contains 5,000,000 rows.
Each row follows this pattern:
- "problem": "What is 42 + 17?",
- "steps": "Add 42 and 17.",
- "explanation": "Addition sums two integers.",
- "answer": "59",
- "difficulty": "easy"
Dataset Creation
The dataset totals ~3.7T tokens.
Curation Rationale
I made this dataset because I don't think there is enough synthetic math reasoning data in the open-source ML community.
Source Data
I used a Python script to generate the data automatically.
Data Collection and Processing
No filtering is applied to the data; it is a raw upload of the generated shards.
Who are the source data producers?
I am the creator of this dataset.
Unless math is personally offended that I mass-produced variants of equations and uploaded them, there is no sensitive or personal information.
Bias, Risks, and Limitations
The data is synthetically generated with uniform sampling across templates, but it may underrepresent advanced topics (e.g., calculus, proofs) and non-English contexts.
Overuse during training may lead to arithmetic overfitting or reduced fluency in non-math tasks.
Beyond these, there are currently no known limitations for the dataset's intended purpose.
Recommendations
Given the points above, I think this dataset is best suited to research. If you want to use it with a smaller model, use only a few shards, since the dataset is very large.
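The advice to use only a few shards can be made concrete with some rough arithmetic. The sketch below takes the approximate figures stated in this card (~33 billion examples, 5,000,000 rows per shard, ~3.7T tokens) and assumes tokens are spread evenly across shards, which the card does not state. The function names and the 2-billion-token budget are hypothetical, chosen only for illustration.

```python
import math

# Approximate figures quoted in this card; treat them as estimates.
TOTAL_EXAMPLES = 33_000_000_000
ROWS_PER_SHARD = 5_000_000
TOTAL_TOKENS = 3_700_000_000_000


def estimated_shard_count() -> int:
    """Estimate the number of shards, rounding up for a partial last shard."""
    return math.ceil(TOTAL_EXAMPLES / ROWS_PER_SHARD)


def shards_for_token_budget(token_budget: int) -> int:
    """Estimate how many shards are needed to cover a token budget.

    Assumes tokens are evenly distributed across shards (an assumption,
    not something the card guarantees).
    """
    tokens_per_shard = TOTAL_TOKENS / estimated_shard_count()
    needed = math.ceil(token_budget / tokens_per_shard)
    # Never ask for more shards than are estimated to exist.
    return min(needed, estimated_shard_count())


print(estimated_shard_count())
print(shards_for_token_budget(2_000_000_000))
```

With the card's figures this works out to roughly 6,600 shards in total, and about 4 shards for a 2-billion-token budget. The real numbers depend on the tokenizer and on how evenly examples are distributed.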
email - [[email protected]]
Please be respectful and avoid spamming my email.
Thank you in advance.
BibTeX
@dataset{ultimath2025,
title={UltiMath: Large-Scale Synthetic Math Reasoning Dataset},
author={DataMuncher-Labs},
year={2025},
license={CC BY-SA 4.0},
url={https://huggingface.co/datasets/DataMuncher-Labs/UltiMath}
}