MMMU
| Entity Passport | |
| Registry ID | hf-dataset--mmmu--mmmu |
| License | Apache-2.0 |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__mmmu__mmmu,
author = {MMMU},
title = {MMMU Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/mmmu/mmmu}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Technical Deep Dive
Nexus Index V2.0
Index Insight
FNI V2.0 for MMMU: Semantic (S:50), Authority (A:0), Popularity (P:59), Recency (R:97), Quality (Q:30).
Verification Authority
Data Preview
Row-level preview not available for this dataset.
Schema structure is shown in the Field Logic panel when available.
Field Logic
Schema not yet indexed for this dataset.
Dataset Specification
MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI)
Homepage | Leaderboard | Dataset | Paper | arXiv | GitHub
News
- [2026-04-21]: Fixed an option issue in test_Psychology_15.
- [2026-02-12]: We have released the answers for the test set! You can now evaluate your models on the test set locally!
- [2024-05-30]: Fixed duplicate option issues in Materials dataset items (validation_Materials_25; test_Materials_17, 242) and a content error in validation_Materials_25.
- [2024-04-30]: Fixed missing "-" or "^" signs in Math dataset items (dev_Math_2, validation_Math_11, 12, 16; test_Math_8, 23, 43, 113, 164, 223, 236, 287, 329, 402, 498) and corrected option errors in validation_Math_2. If you encounter any issues with the dataset, please contact us promptly!
- [2024-01-31]: We added Human Expert performance on the Leaderboard!
- [2023-12-04]: Our evaluation server for the test set is now available on EvalAI. We welcome all submissions and look forward to your participation!
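The item identifiers quoted in the news entries (for example test_Psychology_15 or validation_Materials_25) appear to follow a `<split>_<Subject>_<number>` pattern. The sketch below splits such an ID into its parts. The pattern is inferred from the IDs listed above rather than from a documented specification, and `parse_item_id` and `ItemId` are hypothetical helper names.

```python
from typing import NamedTuple


class ItemId(NamedTuple):
    split: str    # "dev", "validation", or "test"
    subject: str  # e.g. "Math" or "Materials"
    index: int    # running number within the split and subject


def parse_item_id(item_id: str) -> ItemId:
    """Split an ID such as 'test_Psychology_15' into its three parts.

    Assumes the pattern <split>_<Subject>_<number>. A subject name may
    itself contain underscores, so only the first and the last underscore
    are treated as separators.
    """
    split, rest = item_id.split("_", 1)
    subject, number = rest.rsplit("_", 1)
    if split not in {"dev", "validation", "test"}:
        raise ValueError(f"unexpected split prefix in {item_id!r}")
    return ItemId(split, subject, int(number))


# IDs taken from the news entries above.
fixed = parse_item_id("test_Psychology_15")
print(fixed.split, fixed.subject, fixed.index)
```

Splitting on the first and last underscore only keeps multi-word subject names intact, which a plain `split("_")` would break apart.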
Dataset Details
Dataset Description
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence (AGI).
We have released a full set comprising 150 development samples, 900 validation samples, and 10,500 test samples.
The development set is used for few-shot/in-context learning, and the validation set is used for debugging models, selecting hyperparameters, or quick evaluations. The answers and explanations for the test set were originally withheld, with predictions submitted to EvalAI for scoring; they are now released, so you can evaluate your models locally!
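With the test answers available, local scoring can be as simple as comparing predicted answers with the `answer` field. The following is a minimal sketch of exact-match accuracy, not the benchmark's official evaluation code: the `accuracy` function, the example IDs, and the example answers are all hypothetical, and the sketch assumes every answer is a single option letter. The `question_type` field in the schema suggests that not every item is multiple-choice, so real evaluation likely needs extra answer parsing.

```python
def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Return the fraction of gold items answered correctly.

    Comparison is case-insensitive and ignores surrounding whitespace.
    Items that are missing from `predictions` count as wrong.
    """
    if not gold:
        raise ValueError("gold answers must not be empty")
    correct = sum(
        1
        for item_id, answer in gold.items()
        if predictions.get(item_id, "").strip().upper() == answer.strip().upper()
    )
    return correct / len(gold)


# Hypothetical IDs and answers, for illustration only.
gold = {"test_Demo_1": "B", "test_Demo_2": "D", "test_Demo_3": "A", "test_Demo_4": "C"}
predictions = {"test_Demo_1": "b", "test_Demo_2": "A", "test_Demo_3": "A"}
score = accuracy(predictions, gold)
print(f"{score:.1%}")
```

Counting missing predictions as wrong keeps the denominator fixed at the size of the gold set, so scores stay comparable between runs that skip different items.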

Dataset Creation
MMMU was created to challenge multimodal models with tasks that demand college-level subject knowledge and deliberate reasoning, pushing the boundaries of what these models can achieve in terms of expert-level perception and reasoning. The data for the MMMU dataset was manually collected by a team of college students from various disciplines, using online sources, textbooks, and lecture materials.
- Content: The dataset contains 11.5K college-level problems across six broad disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, Tech & Engineering) and 30 college subjects.
- Image Types: The dataset includes 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures, interleaved with text.

Mini-Leaderboard
We show a mini-leaderboard here; more information is available in our paper and on the homepage.
| Model | Val (900) | Test (10.5K) |
|---|---|---|
| Expert (Best) | 88.6 | - |
| Expert (Medium) | 82.6 | - |
| Expert (Worst) | 76.2 | - |
| GPT-4o* | 69.1 | - |
| Gemini 1.5 Pro* | 62.2 | - |
| InternVL2-Pro* | 62.0 | 55.7 |
| Gemini 1.0 Ultra* | 59.4 | - |
| Claude 3 Opus* | 59.4 | - |
| GPT-4V(ision) (Playground) | 56.8 | 55.7 |
| Reka Core* | 56.3 | - |
| Gemini 1.5 Flash* | 56.1 | - |
| SenseChat-Vision-0423-Preview* | 54.6 | 50.3 |
| Reka Flash* | 53.3 | - |
| Claude 3 Sonnet* | 53.1 | - |
| HPT Pro* | 52.0 | - |
| VILA1.5* | 51.9 | 46.9 |
| Qwen-VL-MAX* | 51.4 | 46.8 |
| InternVL-Chat-V1.2* | 51.6 | 46.2 |
| Skywork-VL* | 51.4 | 46.2 |
| LLaVA-1.6-34B* | 51.1 | 44.7 |
| Claude 3 Haiku* | 50.2 | - |
| Adept Fuyu-Heavy* | 48.3 | - |
| Gemini 1.0 Pro* | 47.9 | - |
| Marco-VL-Plus* | 46.2 | 44.3 |
| Yi-VL-34B* | 45.9 | 41.6 |
| Qwen-VL-PLUS* | 45.2 | 40.8 |
| HPT Air* | 44.0 | - |
| Reka Edge* | 42.8 | - |
| Marco-VL* | 41.2 | 40.4 |
| OmniLMM-12B* | 41.1 | 40.4 |
| Bunny-8B* | 43.3 | 39.0 |
| Bunny-4B* | 41.4 | 38.4 |
| Weitu-VL-1.0-15B* | - | 38.4 |
| InternLM-XComposer2-VL* | 43.0 | 38.2 |
| Yi-VL-6B* | 39.1 | 37.8 |
| InfiMM-Zephyr-7B* | 39.4 | 35.5 |
| InternVL-Chat-V1.1* | 39.1 | 35.3 |
| Math-LLaVA-13B* | 38.3 | 34.6 |
| SVIT* | 38.0 | 34.1 |
| MiniCPM-V* | 37.2 | 34.1 |
| MiniCPM-V-2* | 37.1 | - |
| Emu2-Chat* | 36.3 | 34.1 |
| BLIP-2 FLAN-T5-XXL | 35.4 | 34.0 |
| InstructBLIP-T5-XXL | 35.7 | 33.8 |
| LLaVA-1.5-13B | 36.4 | 33.6 |
| Bunny-3B* | 38.2 | 33.0 |
| Qwen-VL-7B-Chat | 35.9 | 32.9 |
| SPHINX* | 32.9 | 32.9 |
| mPLUG-OWL2* | 32.7 | 32.1 |
| BLIP-2 FLAN-T5-XL | 34.4 | 31.0 |
| InstructBLIP-T5-XL | 32.9 | 30.6 |
| Gemini Nano2* | 32.6 | - |
| CogVLM | 32.1 | 30.1 |
| Otter | 32.2 | 29.1 |
| LLaMA-Adapter2-7B | 29.8 | 27.7 |
| MiniGPT4-Vicuna-13B | 26.8 | 27.6 |
| Adept Fuyu-8B | 27.9 | 27.4 |
| Kosmos2 | 24.4 | 26.6 |
| OpenFlamingo2-9B | 28.7 | 26.3 |
| Frequent Choice | 22.1 | 23.9 |
| Random Choice | 26.8 | 25.8 |
*: results provided by the authors.
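For models that report both columns, the difference between validation and test accuracy gives a rough sense of how well the 900-sample validation score predicts the 10.5K-sample test score. The sketch below computes that gap for a few rows copied from the table; the `val_test_gap` helper is a hypothetical name, and the selection of rows is arbitrary.

```python
# (model, val accuracy, test accuracy), copied from the table above.
ROWS = [
    ("GPT-4V(ision) (Playground)", 56.8, 55.7),
    ("Qwen-VL-MAX*", 51.4, 46.8),
    ("LLaVA-1.6-34B*", 51.1, 44.7),
    ("Kosmos2", 24.4, 26.6),
    ("Random Choice", 26.8, 25.8),
]


def val_test_gap(rows):
    """Map each model name to (val - test), rounded to one decimal place."""
    return {name: round(val - test, 1) for name, val, test in rows}


gaps = val_test_gap(ROWS)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}")
```

A positive gap means the model scored higher on validation than on test; a negative gap means the reverse.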
Limitations
Despite its comprehensive nature, MMMU, like any benchmark, is not without limitations. The manual curation process, albeit thorough, may carry biases, and the focus on college-level subjects might not be a fully sufficient test for Expert AGI. However, we believe strong performance on MMMU should be necessary for an Expert AGI to demonstrate its broad and deep subject knowledge as well as its expert-level understanding and reasoning capabilities. In future work, we plan to incorporate human evaluations into MMMU. This will provide a more grounded comparison between model capabilities and expert performance, shedding light on how close current AI systems are to achieving Expert AGI.
Disclaimers
The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution. Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to notify us. Upon verification, such samples will be promptly removed.
Contact
- Xiang Yue: [email protected]
- Yu Su: [email protected]
- Wenhu Chen: [email protected]
Citation
BibTeX:
@inproceedings{yue2023mmmu,
title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
booktitle={Proceedings of CVPR},
year={2024},
}
Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
|---|---|
| id | string |
| question | string |
| options | string |
| explanation | string |
| image_1 | Image |
| image_2 | Image |
| image_3 | Image |
| image_4 | Image |
| image_5 | Image |
| image_6 | Image |
| image_7 | Image |
| img_type | string |
| answer | string |
| topic_difficulty | string |
| question_type | string |
| subfield | string |
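The schema stores `options` as a single string and spreads images over the columns `image_1` to `image_7`. The sketch below shows one way to turn a record into a lettered multiple-choice prompt and to collect the images that are present. It assumes that `options` holds a Python-style list literal and that unused image columns are `None`; neither detail is confirmed by the schema above. The function names and the sample record are hypothetical.

```python
import ast
import string


def format_question(record: dict) -> str:
    """Render the question followed by lettered options, one per line."""
    # Assumption: `options` is a stringified list such as "['x', 'y']".
    options = ast.literal_eval(record["options"])
    lines = [record["question"]]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    return "\n".join(lines)


def present_images(record: dict, max_images: int = 7) -> list:
    """Collect the non-empty image_1 ... image_7 values in order."""
    keys = (f"image_{i}" for i in range(1, max_images + 1))
    return [record[k] for k in keys if record.get(k) is not None]


# Hypothetical record; a plain string stands in for the image object.
sample = {
    "question": "Which chart type is shown?",
    "options": "['A bar chart', 'A map']",
    "image_1": "<image object>",
    "image_2": None,
}
prompt = format_question(sample)
print(prompt)
print(len(present_images(sample)))
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, which is safer when parsing strings that come from a downloaded dataset.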
Estimated Rows: 415
Social Proof
AI Summary: Based on Hugging Face metadata. Not a recommendation.
Dataset Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-dataset--mmmu--mmmu
- slug: mmmu--mmmu
- source: huggingface
- author: MMMU
- license: Apache-2.0
- tags: task_categories:question-answering, task_categories:visual-question-answering, task_categories:multiple-choice, language:en, license:apache-2.0, size_categories:10k<n<100k, format:parquet, modality:image, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2311.16502, region:us, biology, medical, finance, chemistry, music, art, art_theory, design, business, accounting, economics, manage, marketing, health, medicine, basic_medical_science, clinical, pharmacy,
Technical Specs
- architecture: null
- params billions: null
- context length: null
- pipeline tag
Engagement & Metrics
- downloads: 97,919
- stars: 325
- forks: 0
Data indexed from public sources. Updated daily.