Openthoughts 1k Sample
Pillar scores are computed during the next indexing cycle.
--- configs: - config_name: default data_files: - split: train path: data/train-* - config_name: metadata data_files: - split: train path: metadata/train-* dataset_info: - config_name: default features: - name: system dtype: string - name: conversations list: - name: from dtype: string - name: value dtype: string splits: - name: train num_bytes: 34160692.0 num_examples: 1000 download_size: 13994266 dataset_size: 34160692.0 - config_name: metadata features: - name: problem dtype: string - name...
| Entity Passport | |
| Registry ID | hf-dataset--ryanmarten--openthoughts-1k-sample |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__ryanmarten__openthoughts_1k_sample,
author = {ryanmarten},
title = {Openthoughts 1k Sample Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/ryanmarten/OpenThoughts-1k-sample}},
note = {Accessed via Free2AITools Knowledge Fortress}
} đŦTechnical Deep Dive
Full Specifications [+]âž
âī¸ Nexus Index V2.0
đŦ Index Insight
FNI V2.0 for Openthoughts 1k Sample: Semantic (S:50), Authority (A:0), Popularity (P:0), Recency (R:0), Quality (Q:0).
Verification Authority
đī¸ Data Preview
Row-level preview not available for this dataset.
Schema structure is shown in the Field Logic panel when available.
đ Explore Full Dataset âđ§Ŧ Field Logic
Schema not yet indexed for this dataset.
Dataset Specification
configs:
- config_name: default
data_files:- split: train
path: data/train-*
- split: train
- config_name: metadata
data_files:- split: train
path: metadata/train-*
dataset_info:
- split: train
- config_name: default
features:- name: system
dtype: string - name: conversations
list:- name: from
dtype: string - name: value
dtype: string
splits:
- name: from
- name: train
num_bytes: 34160692.0
num_examples: 1000
download_size: 13994266
dataset_size: 34160692.0
- name: system
- config_name: metadata
features:- name: problem
dtype: string - name: deepseek_reasoning
dtype: string - name: deepseek_solution
dtype: string - name: ground_truth_solution
dtype: string - name: domain
dtype: string - name: source
dtype: string - name: test_cases
dtype: string - name: starter_code
dtype: string
splits: - name: train
num_bytes: 43816917.80232895
num_examples: 1000
download_size: 13308003
dataset_size: 43816917.80232895
- name: problem
[!NOTE]
We have released a paper for OpenThoughts! See our paper here.
Open-Thoughts-1k-sample
Dataset Description
- Homepage: https://www.open-thoughts.ai/
- Repository: https://github.com/open-thoughts/open-thoughts
- Point of Contact: Open Thoughts Team
This is a 1k sample of the OpenThoughts-114k dataset.
Open synthetic reasoning dataset with high-quality examples covering math, science, code, and puzzles!
Inspect the content with rich formatting with Curator Viewer.
Available Subsets
default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models:
ds = load_dataset("ryanmarten/OpenThoughts-1k-sample", split="train")
metadata subset containing extra columns used in dataset construction:
problemground_truth_solutiondeepseek_reasoningdeepseek_solutiondomainsourcetest_cases(code only)starter_code(code only)
ds = load_dataset("ryanmarten/OpenThoughts-1k-sample", "metadata", split="train")
OpenThinker Models
The numbers reported in the tables below are evaluated with our open-source tool Evalchemy.
| AIME24 | MATH500 | GPQA-Diamond | LCBv2 Easy | LCBv2 Medium | LCBv2 Hard | LCBv2 All | |
|---|---|---|---|---|---|---|---|
| OpenThinker-32B | 66 | 90.6 | 61.6 | 95.1 | 70.9 | 26.8 | 68.9 |
| OpenThinker-7B | 31.3 | 83.0 | 42.4 | 75.3 | 28.6 | 6.5 | 39.9 |
| Bespoke-Stratos-7B | 22.7 | 79.6 | 38.9 | 71.4 | 25.2 | 0.8 | 35.8 |
| DeepSeek-R1-Distill-Qwen-7B | 60 | 88.2 | 46.9 | 79.7 | 45.1 | 14.6 | 50.1 |
| gpt-4o-0513 | 8.7 | 75.8 | 46.5 | 87.4 | 42.7 | 8.9 | 50.5 |
| o1-mini | 64 | 85.6 | 60 | 92.8 | 74.7 | 39.8 | 72.8 |
We are fully open-source. Our model weights, datasets, data generation code, evaluation code, and training code are all publicly available.
| Open Weights | Open Data | Open Code | |
|---|---|---|---|
| OpenThinker-32B | â | â | â |
| OpenThinker-7B | â | â | â |
| Bespoke-Stratos-7B | â | â | â |
| DeepSeek-R1-Distill models | â | â | â |
| OpenAI/Gemini | â | â | â |
We are actively working towards improving the dataset, so please stay tuned!
Data Curation Recipe
Code
Math
Science
Puzzle
Using a curated mix of the datasets above, we generate reasoning traces from DeepSeek-R1 and verify correctness to construct the final dataset.

The full code for the data generation pipeline is publicly available in our github repo.
Links
- đ OpenThoughts Paper
- đ OpenThinker-32B Blog Post
- đ Measuing Reasoning with Evalchemy Blog Post
- đ Open Thoughts Launch Blog Post
- đģ Open Thoughts GitHub Repository
- đ§ OpenThoughts-114k dataset - this dataset.
- đ¤ OpenThinker-32B model
- đ¤ OpenThinker-7B model
- đ Bespoke-Stratos Blog Post
- đ§ Bespoke-Stratos-17k dataset
- đ¤ Bespoke-Stratos-32B model
- đ¤ Bespoke-Stratos-7B model
- đģ Curator Viewer
Citation
@misc{guha2025openthoughtsdatarecipesreasoning,
title={OpenThoughts: Data Recipes for Reasoning Models},
author={Etash Guha and Ryan Marten and Sedrick Keh and Negin Raoof and Georgios Smyrnis and Hritik Bansal and Marianna Nezhurina and Jean Mercat and Trung Vu and Zayne Sprague and Ashima Suvarna and Benjamin Feuer and Liangyu Chen and Zaid Khan and Eric Frankel and Sachin Grover and Caroline Choi and Niklas Muennighoff and Shiye Su and Wanjia Zhao and John Yang and Shreyas Pimpalgaonkar and Kartik Sharma and Charlie Cheng-Jie Ji and Yichuan Deng and Sarah Pratt and Vivek Ramanujan and Jon Saad-Falcon and Jeffrey Li and Achal Dave and Alon Albalak and Kushal Arora and Blake Wulfe and Chinmay Hegde and Greg Durrett and Sewoong Oh and Mohit Bansal and Saadia Gabriel and Aditya Grover and Kai-Wei Chang and Vaishaal Shankar and Aaron Gokaslan and Mike A. Merrill and Tatsunori Hashimoto and Yejin Choi and Jenia Jitsev and Reinhard Heckel and Maheswaran Sathiamoorthy and Alexandros G. Dimakis and Ludwig Schmidt},
year={2025},
eprint={2506.04178},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.04178},
}
Social Proof
AI Summary: Based on Hugging Face metadata. Not a recommendation.
đĄī¸ Dataset Transparency Report
Verified data manifest for traceability and transparency.
đ Identity & Source
- id
- hf-dataset--ryanmarten--openthoughts-1k-sample
- source
- huggingface
- author
- ryanmarten
- tags
- size_categories:1k
format:parquetmodality:textlibrary:datasetslibrary:pandaslibrary:mlcroissantlibrary:polarsarxiv:2506.04178region:us
âī¸ Technical Specs
- architecture
- null
- params billions
- null
- context length
- 1,024
đ Engagement & Metrics
- likes
- 0
- downloads
- 154,617
Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)