Openthoughts 1k Sample
| Entity Passport | |
| Registry ID | hf-dataset--ryanmarten--openthoughts-1k-sample |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__ryanmarten__openthoughts_1k_sample,
author = {ryanmarten},
title = {Openthoughts 1k Sample Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/ryanmarten/openthoughts-1k-sample}},
note = {Accessed via Free2AITools Knowledge Fortress}
} đŦTechnical Deep Dive
Full Specifications [+]âž
âī¸ Nexus Index V2.0
đŦ Index Insight
FNI V2.0 for Openthoughts 1k Sample: Semantic (S:50), Authority (A:0), Popularity (P:66), Recency (R:30), Quality (Q:30).
Verification Authority
đī¸ Data Preview
Row-level preview not available for this dataset.
Schema structure is shown in the Field Logic panel when available.
đ Explore Full Dataset âđ§Ŧ Field Logic
Schema not yet indexed for this dataset.
Dataset Specification
[!NOTE] We have released a paper for OpenThoughts! See our paper here.
Open-Thoughts-1k-sample
Dataset Description
- Homepage: https://www.open-thoughts.ai/
- Repository: https://github.com/open-thoughts/open-thoughts
- Point of Contact: Open Thoughts Team
This is a 1k sample of the OpenThoughts-114k dataset.
Open synthetic reasoning dataset with high-quality examples covering math, science, code, and puzzles!
Inspect the content with rich formatting with Curator Viewer.
Available Subsets
default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models:
ds = load_dataset("ryanmarten/OpenThoughts-1k-sample", split="train")
metadata subset containing extra columns used in dataset construction:
problemground_truth_solutiondeepseek_reasoningdeepseek_solutiondomainsourcetest_cases(code only)starter_code(code only)
ds = load_dataset("ryanmarten/OpenThoughts-1k-sample", "metadata", split="train")
OpenThinker Models
The numbers reported in the tables below are evaluated with our open-source tool Evalchemy.
| AIME24 | MATH500 | GPQA-Diamond | LCBv2 Easy | LCBv2 Medium | LCBv2 Hard | LCBv2 All | |
|---|---|---|---|---|---|---|---|
| OpenThinker-32B | 66 | 90.6 | 61.6 | 95.1 | 70.9 | 26.8 | 68.9 |
| OpenThinker-7B | 31.3 | 83.0 | 42.4 | 75.3 | 28.6 | 6.5 | 39.9 |
| Bespoke-Stratos-7B | 22.7 | 79.6 | 38.9 | 71.4 | 25.2 | 0.8 | 35.8 |
| DeepSeek-R1-Distill-Qwen-7B | 60 | 88.2 | 46.9 | 79.7 | 45.1 | 14.6 | 50.1 |
| gpt-4o-0513 | 8.7 | 75.8 | 46.5 | 87.4 | 42.7 | 8.9 | 50.5 |
| o1-mini | 64 | 85.6 | 60 | 92.8 | 74.7 | 39.8 | 72.8 |
We are fully open-source. Our model weights, datasets, data generation code, evaluation code, and training code are all publicly available.
| Open Weights | Open Data | Open Code | |
|---|---|---|---|
| OpenThinker-32B | â | â | â |
| OpenThinker-7B | â | â | â |
| Bespoke-Stratos-7B | â | â | â |
| DeepSeek-R1-Distill models | â | â | â |
| OpenAI/Gemini | â | â | â |
We are actively working towards improving the dataset, so please stay tuned!
Data Curation Recipe
Code
Math
Science
Puzzle
Using a curated mix of the datasets above, we generate reasoning traces from DeepSeek-R1 and verify correctness to construct the final dataset.

The full code for the data generation pipeline is publicly available in our github repo.
Links
- đ OpenThoughts Paper
- đ OpenThinker-32B Blog Post
- đ Measuing Reasoning with Evalchemy Blog Post
- đ Open Thoughts Launch Blog Post
- đģ Open Thoughts GitHub Repository
- đ§ OpenThoughts-114k dataset - this dataset.
- đ¤ OpenThinker-32B model
- đ¤ OpenThinker-7B model
- đ Bespoke-Stratos Blog Post
- đ§ Bespoke-Stratos-17k dataset
- đ¤ Bespoke-Stratos-32B model
- đ¤ Bespoke-Stratos-7B model
- đģ Curator Viewer
Citation
@misc{guha2025openthoughtsdatarecipesreasoning,
title={OpenThoughts: Data Recipes for Reasoning Models},
author={Etash Guha and Ryan Marten and Sedrick Keh and Negin Raoof and Georgios Smyrnis and Hritik Bansal and Marianna Nezhurina and Jean Mercat and Trung Vu and Zayne Sprague and Ashima Suvarna and Benjamin Feuer and Liangyu Chen and Zaid Khan and Eric Frankel and Sachin Grover and Caroline Choi and Niklas Muennighoff and Shiye Su and Wanjia Zhao and John Yang and Shreyas Pimpalgaonkar and Kartik Sharma and Charlie Cheng-Jie Ji and Yichuan Deng and Sarah Pratt and Vivek Ramanujan and Jon Saad-Falcon and Jeffrey Li and Achal Dave and Alon Albalak and Kushal Arora and Blake Wulfe and Chinmay Hegde and Greg Durrett and Sewoong Oh and Mohit Bansal and Saadia Gabriel and Aditya Grover and Kai-Wei Chang and Vaishaal Shankar and Aaron Gokaslan and Mike A. Merrill and Tatsunori Hashimoto and Yejin Choi and Jenia Jitsev and Reinhard Heckel and Maheswaran Sathiamoorthy and Alexandros G. Dimakis and Ludwig Schmidt},
year={2025},
eprint={2506.04178},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.04178},
}
đ Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
|---|---|
system |
string |
conversations |
unknown |
Estimated Rows: 1,000
Social Proof
AI Summary: Based on Hugging Face metadata. Not a recommendation.
đĄī¸ Dataset Transparency Report
Technical metadata sourced from upstream repositories.
đ Identity & Source
- id
- hf-dataset--ryanmarten--openthoughts-1k-sample
- slug
- ryanmarten--openthoughts-1k-sample
- source
- huggingface
- author
- ryanmarten
- license
- tags
- size_categories:1k<n<10k, format:parquet, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, arxiv:2506.04178, region:us
âī¸ Technical Specs
- architecture
- null
- params billions
- null
- context length
- 1,024
- pipeline tag
đ Engagement & Metrics
- downloads
- 574,597
- stars
- 8
- forks
- 0
Data indexed from public sources. Updated daily.