AI Datasets Catalog

Browse and discover open-source datasets for training AI models

999 datasets loaded from R2-ENTITIES. Updated Mon/Thu.
Dataset

hf-dataset--deepmind--code_contests

--- annotations_creators: - found language_creators: - found language: - en license: - cc-by-4.0 multilinguality: - monolingual size_categories: - 10K

📥 1848462 ❤️ 206
Dataset

hf-dataset--openai--gsm8k

--- annotations_creators: - crowdsourced language_creators: - crowdsourced language: - en license: - mit multilinguality: - monolingual size_categories: - 1K

📥 422226 ❤️ 1103
Dataset

hf-dataset--google--fleurs

--- annotations_creators: - expert-generated - crowdsourced - machine-generated language_creators: - crowdsourced - expert-generated language: - afr - amh - ara

📥 39947 ❤️ 364
Dataset

hf-dataset--deepmind--narrativeqa

--- annotations_creators: - crowdsourced language_creators: - found language: - en license: - apache-2.0 multilinguality: - monolingual size_categories: - 10K

📥 12592 ❤️ 60
Dataset

hf-dataset--nvidia--openmathreasoning

--- language: - en license: cc-by-4.0 size_categories: - 1M

📥 13849 ❤️ 391
Dataset

hf-dataset--nvidia--nemotron-pretraining-specialized-v1

--- license: cc-by-4.0 task_categories: - text-generation configs: - config_name: Nemotron-Pretraining-Wiki-Rewrite data_files: - split: train path: Nemotron-Pr

📥 12507 ❤️ 65
Dataset

hf-dataset--nvidia--helpsteer2

--- license: cc-by-4.0 language: - en pretty_name: HelpSteer2 size_categories: - 10K

📥 13836 ❤️ 435
Dataset

hf-dataset--google--xquad

--- annotations_creators: - expert-generated language_creators: - expert-generated language: - ar - de - el - en - es - hi - ro - ru - th - tr - vi - zh license

📥 18008 ❤️ 38
Dataset

hf-dataset--deepmind--math_dataset

--- pretty_name: Mathematics Dataset language: - en paperswithcode_id: mathematics dataset_info: - config_name: algebra__linear_1d features: - name: question dt

📥 10831 ❤️ 136
Dataset

hf-dataset--allenai--c4

--- pretty_name: C4 annotations_creators: - no-annotation language_creators: - found language: - af - am - ar - az - be - bg - bn - ca - ceb - co - cs - cy - da

📥 582815 ❤️ 506
Dataset

hf-dataset--cais--mmlu

--- annotations_creators: - no-annotation language_creators: - expert-generated language: - en license: - mit multilinguality: - monolingual size_categories: -

📥 290481 ❤️ 615
Dataset

hf-dataset--nvidia--physicalai-robotics-gr00t-x-embodiment-sim

--- license: cc-by-4.0 task_categories: - robotics tags: - robotics --- GitHub Repo: Isaac GR00T N1 We provide a set of datasets used for post-traini

📥 847906 ❤️ 185
Dataset

hf-dataset--openai--openai_humaneval

--- annotations_creators: - expert-generated language_creators: - expert-generated language: - en license: - mit multilinguality: - monolingual size_categories:

📥 132072 ❤️ 361
Dataset

hf-dataset--nvidia--physicalai-autonomous-vehicle-cosmos-drive-dreams

--- language: - en license: cc-by-4.0 size_categories: - n>1T task_categories: - robotics tags: - Video - physicalAI - AV github: https://github.com/nv-tlabs/Co

📥 88672 ❤️ 35
Dataset

hf-dataset--microsoft--dayhoff

--- configs: - config_name: dayhoffref data_files: dayhoffref/arrow/data*.arrow - config_name: backboneref data_files: - split: BRn path: backboneref/arrow/BRn/

📥 86298 ❤️ 6
Dataset

hf-dataset--salesforce--wikitext

--- annotations_creators: - no-annotation language_creators: - crowdsourced language: - en license: - cc-by-sa-3.0 - gfdl multilinguality: - monolingual size_ca

📥 791719 ❤️ 611
Dataset

hf-dataset--google-research-datasets--mbpp

--- annotations_creators: - crowdsourced - expert-generated language_creators: - crowdsourced - expert-generated language: - en license: - cc-by-4.0 multilingua

📥 1753258 ❤️ 204
Dataset

hf-dataset--google--ifeval

--- license: apache-2.0 task_categories: - text-generation language: - en pretty_name: IFEval --- - **Repositor

📥 46164 ❤️ 118
Dataset

hf-dataset--anthropic--hh-rlhf

--- license: mit tags: - human-feedback --- This repository provides access to two different kinds of data: 1. Human preference data about helpfulness and harml

📥 22091 ❤️ 1630
Dataset

hf-dataset--nvidia--physicalai-smartspaces

--- license: cc-by-4.0 --- !Demo of MTMC_Tracking_2025 Comprehensive, annotated dataset for multi-camera tracking and 2D/3D object detection. This dataset is sy

📥 22372 ❤️ 59
Dataset

hf-dataset--google--boolq

--- annotations_creators: - crowdsourced language_creators: - found language: - en license: - cc-by-sa-3.0 multilinguality: - monolingual size_categories: - 10K

📥 17917 ❤️ 91
Dataset

hf-dataset--microsoft--ms_marco

--- language: - en paperswithcode_id: ms-marco pretty_name: Microsoft Machine Reading Comprehension Dataset dataset_info: - config_name: v1.1 features: - name:

📥 12949 ❤️ 220
Dataset

hf-dataset--deepmind--aqua_rat

--- annotations_creators: - crowdsourced language_creators: - crowdsourced - expert-generated language: - en license: - apache-2.0 multilinguality: - monolingua

📥 15491 ❤️ 72
Dataset

hf-dataset--nvidia--openmathinstruct-2

--- language: - en license: cc-by-4.0 size_categories: - 10M

📥 14938 ❤️ 224
// Force Rebuild at 01/10/2026 19:59:41 // Force Rebuild fetcher fix at 01/10/2026 20:46:45