AI Datasets Catalog
Browse and discover open-source datasets for training AI models
hf-dataset--deepmind--code_contests
--- annotations_creators: - found language_creators: - found language: - en license: - cc-by-4.0 multilinguality: - monolingual size_categories: - 10K
hf-dataset--openai--gsm8k
--- annotations_creators: - crowdsourced language_creators: - crowdsourced language: - en license: - mit multilinguality: - monolingual size_categories: - 1K
hf-dataset--google--fleurs
--- annotations_creators: - expert-generated - crowdsourced - machine-generated language_creators: - crowdsourced - expert-generated language: - afr - amh - ara
hf-dataset--deepmind--narrativeqa
--- annotations_creators: - crowdsourced language_creators: - found language: - en license: - apache-2.0 multilinguality: - monolingual size_categories: - 10K
hf-dataset--nvidia--openmathreasoning
--- language: - en license: cc-by-4.0 size_categories: - 1M
hf-dataset--nvidia--nemotron-pretraining-specialized-v1
--- license: cc-by-4.0 task_categories: - text-generation configs: - config_name: Nemotron-Pretraining-Wiki-Rewrite data_files: - split: train path: Nemotron-Pr
hf-dataset--nvidia--helpsteer2
--- license: cc-by-4.0 language: - en pretty_name: HelpSteer2 size_categories: - 10K
hf-dataset--google--xquad
--- annotations_creators: - expert-generated language_creators: - expert-generated language: - ar - de - el - en - es - hi - ro - ru - th - tr - vi - zh license
hf-dataset--deepmind--math_dataset
--- pretty_name: Mathematics Dataset language: - en paperswithcode_id: mathematics dataset_info: - config_name: algebra__linear_1d features: - name: question dt
hf-dataset--allenai--c4
--- pretty_name: C4 annotations_creators: - no-annotation language_creators: - found language: - af - am - ar - az - be - bg - bn - ca - ceb - co - cs - cy - da
hf-dataset--cais--mmlu
--- annotations_creators: - no-annotation language_creators: - expert-generated language: - en license: - mit multilinguality: - monolingual size_categories: -
hf-dataset--nvidia--physicalai-robotics-gr00t-x-embodiment-sim
--- license: cc-by-4.0 task_categories: - robotics tags: - robotics --- GitHub Repo: Isaac GR00T N1 We provide a set of datasets used for post-traini
hf-dataset--openai--openai_humaneval
--- annotations_creators: - expert-generated language_creators: - expert-generated language: - en license: - mit multilinguality: - monolingual size_categories:
hf-dataset--nvidia--physicalai-autonomous-vehicle-cosmos-drive-dreams
--- language: - en license: cc-by-4.0 size_categories: - n>1T task_categories: - robotics tags: - Video - physicalAI - AV github: https://github.com/nv-tlabs/Co
hf-dataset--microsoft--dayhoff
--- configs: - config_name: dayhoffref data_files: dayhoffref/arrow/data*.arrow - config_name: backboneref data_files: - split: BRn path: backboneref/arrow/BRn/
hf-dataset--salesforce--wikitext
--- annotations_creators: - no-annotation language_creators: - crowdsourced language: - en license: - cc-by-sa-3.0 - gfdl multilinguality: - monolingual size_ca
hf-dataset--google-research-datasets--mbpp
--- annotations_creators: - crowdsourced - expert-generated language_creators: - crowdsourced - expert-generated language: - en license: - cc-by-4.0 multilingua
hf-dataset--google--ifeval
--- license: apache-2.0 task_categories: - text-generation language: - en pretty_name: IFEval --- - **Repositor
hf-dataset--anthropic--hh-rlhf
--- license: mit tags: - human-feedback --- This repository provides access to two different kinds of data: 1. Human preference data about helpfulness and harml
hf-dataset--nvidia--physicalai-smartspaces
--- license: cc-by-4.0 --- Comprehensive, annotated dataset for multi-camera tracking and 2D/3D object detection. This dataset is sy
hf-dataset--google--boolq
--- annotations_creators: - crowdsourced language_creators: - found language: - en license: - cc-by-sa-3.0 multilinguality: - monolingual size_categories: - 10K
hf-dataset--microsoft--ms_marco
--- language: - en paperswithcode_id: ms-marco pretty_name: Microsoft Machine Reading Comprehension Dataset dataset_info: - config_name: v1.1 features: - name:
hf-dataset--deepmind--aqua_rat
--- annotations_creators: - crowdsourced language_creators: - crowdsourced - expert-generated language: - en license: - apache-2.0 multilinguality: - monolingua
hf-dataset--nvidia--openmathinstruct-2
--- language: - en license: cc-by-4.0 size_categories: - 10M
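The entry names in this catalog appear to follow the pattern hf-dataset--<owner>--<name>. Assuming that pattern holds (it is inferred from the listing, not documented here), a minimal sketch for turning an entry name into an owner/name identifier could look like the following. The function name slug_to_repo_id is hypothetical, and the catalog names are lowercased, so the result may differ in letter case from the canonical repository name.

```python
def slug_to_repo_id(slug: str) -> str:
    """Convert a catalog entry name such as 'hf-dataset--openai--gsm8k'
    into an 'owner/name' identifier such as 'openai/gsm8k'.

    Assumption: entries use the 'hf-dataset--<owner>--<name>' pattern
    seen in this listing, with '--' as the only field separator.
    """
    prefix = "hf-dataset--"
    if not slug.startswith(prefix):
        raise ValueError(f"unexpected entry name: {slug!r}")

    # Split on the double hyphen only, so single hyphens inside the
    # owner or dataset name (e.g. 'google-research-datasets') survive.
    parts = slug[len(prefix):].split("--")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<owner>--<name>' after prefix: {slug!r}")

    owner, name = parts
    return f"{owner}/{name}"


if __name__ == "__main__":
    for entry in (
        "hf-dataset--openai--gsm8k",
        "hf-dataset--google-research-datasets--mbpp",
    ):
        print(entry, "->", slug_to_repo_id(entry))
```

Splitting on the double hyphen rather than on every hyphen is what keeps names such as openai_humaneval and hh-rlhf intact; entries that do not match the assumed pattern raise an error instead of being guessed at.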