Dolma 3 Mix 6T 1025 7B
Dataset Specification

```yaml
license: odc-by
task_categories:
- text-generation
language:
- en
configs:
- config_name: default
  data_files:
  - split: train
    path: data/**/*.jsonl.zst
features:
- name: id
  dtype: string
- name: text
  dtype: string
- name: metadata
  dtype: string
- name: source
  dtype: string
- name: version
  dtype: string
- name: created
  dtype: string
- name: added
  dtype: string
- name: doc
  dtype: string
- name: attributes
  dtype: string
```
⚠️ WARNING: This dataset is intended ONLY for reproducing Olmo 3 7B ⚠️
For all other training use cases, including training from scratch, please use our primary Dolma 3 data mix: https://huggingface.co/datasets/allenai/dolma3_mix-6T.
Note: Some olmOCR science PDFs in this dataset were redacted after the training of Olmo 3 7B. These texts are marked with [REMOVED] in the text field, which affects exact reproducibility of Olmo 3 7B.
For this reason, please use our 32B training mix, which uses the same sampling strategy and includes the complete set of olmOCR science PDFs.
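Given the [REMOVED] marker described in the note above, one way to skip redacted documents is a plain text filter. A minimal sketch, assuming the marker appears verbatim in the `text` field and reusing the streaming handle `ds` from the loading sketch:

```python
def is_intact(record):
    # Drop documents redacted after Olmo 3 7B training; assumes the
    # "[REMOVED]" marker appears verbatim in the text field.
    return "[REMOVED]" not in record["text"]

intact = ds.filter(is_intact)  # filter() also works on streaming datasets
```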
Dolma 3 Mix (6T)
The Dolma 3 Mix (6T) is the collection of data used during the pretraining stage to train the Olmo-3-1025-7B model. This dataset is made up of ~6 trillion tokens from a diverse mix of web content, academic publications, code, and more. The majority of this dataset comes from Common Crawl.
For more information on Dolma, please see our original release here.
Dataset Sources
Source Sizes
This dataset contains the full mix of documents used to train Olmo 3 7B.
| Source | Doc Type | Tokens | Bytes (uncompressed) | Documents | License |
|---|---|---|---|---|---|
| common_crawl | web pages | 4.51T | 18.0TB | 3.15B | ODC-BY |
| olmocr_science_pdfs | academic papers | 805B | 3.22TB | 83.8M | ODC-BY |
| stack_edu | code | 409B | 1.64TB | 525.8M | ODC-BY |
| finemath-3plus | mathematics | 151B | 607GB | 95.5M | ODC-BY |
| rpj-proofpile-arxiv | research papers | 50.9B | 203GB | 9.10M | ODC-BY |
| dolma1_7-wiki-en | encyclopedic | 2.51B | 10.0GB | 4.24M | ODC-BY |
| Total | | 5.93T | 23.7TB | 3.87B | ODC-BY |
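The `source` feature in each record should map onto the rows above. A quick way to see the mix in practice is to tally sources over a small streamed sample; a sketch reusing `ds` from the loading example, under the assumption that `source` values match the names in this table (a full pass over 3.87B documents is impractical):

```python
from collections import Counter

# Sample proportions will only roughly track the full-corpus counts;
# this just illustrates the source-field-to-table-row mapping.
counts = Counter(record["source"] for record in ds.take(10_000))
for source, n in counts.most_common():
    print(f"{source:25s} {n}")
```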
Mix Compositions
| Source | Source % (6T) | Mix % (6T) |
|---|---|---|
| common_crawl | 76.07% | 76.07% |
| olmocr_science_pdfs | 13.57% | 13.57% |
| stack_edu | 6.89% | 6.89% |
| finemath-3plus | 2.56% | 2.56% |
| rpj-proofpile-arxiv | 0.86% | 0.86% |
| dolma1_7-wiki-en | 0.04% | 0.04% |
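As a sanity check, the Mix % column should equal each source's share of the 5.93T token total. A short recomputation from the rounded token counts in the Source Sizes table, which reproduces the column to within the last displayed digit:

```python
# Inputs are the rounded token counts from the Source Sizes table, so
# results can differ from the Mix % column by a hundredth of a percent.
tokens = {
    "common_crawl": 4.51e12,
    "olmocr_science_pdfs": 805e9,
    "stack_edu": 409e9,
    "finemath-3plus": 151e9,
    "rpj-proofpile-arxiv": 50.9e9,
    "dolma1_7-wiki-en": 2.51e9,
}
total = sum(tokens.values())  # ~5.93e12
for source, n in tokens.items():
    print(f"{source:25s} {100 * n / total:6.2f}%")
```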
Licensing Information
Dolma 3 Mix is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our Responsible Use Guidelines.
Citation
```bibtex
@misc{olmo2025olmo3,
  title={Olmo 3},
  author={Team Olmo and Allyson Ettinger and Amanda Bertsch and Bailey Kuehl and David Graham and David Heineman and Dirk Groeneveld and Faeze Brahman and Finbarr Timbers and Hamish Ivison and Jacob Morrison and Jake Poznanski and Kyle Lo and Luca Soldaini and Matt Jordan and Mayee Chen and Michael Noukhovitch and Nathan Lambert and Pete Walsh and Pradeep Dasigi and Robert Berry and Saumya Malik and Saurabh Shah and Scott Geng and Shane Arora and Shashank Gupta and Taira Anderson and Teng Xiao and Tyler Murray and Tyler Romero and Victoria Graf and Akari Asai and Akshita Bhagia and Alexander Wettig and Alisa Liu and Aman Rangapur and Chloe Anastasiades and Costa Huang and Dustin Schwenk and Harsh Trivedi and Ian Magnusson and Jaron Lochner and Jiacheng Liu and Lester James V. Miranda and Maarten Sap and Malia Morgan and Michael Schmitz and Michal Guerquin and Michael Wilson and Regan Huff and Ronan Le Bras and Rui Xin and Rulin Shao and Sam Skjonsberg and Shannon Zejiang Shen and Shuyue Stella Li and Tucker Wilde and Valentina Pyatkin and Will Merrill and Yapei Chang and Yuling Gu and Zhiyuan Zeng and Ashish Sabharwal and Luke Zettlemoyer and Pang Wei Koh and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
  year={2025},
  eprint={2512.13961},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.13961},
}
```
Find the paper at: https://allenai.org/papers/olmo3