🏆 AI Model Benchmark Leaderboard

Compare the world's best open-source AI models by standardized benchmark scores

📊 Updated: Dec 18, 2025, 10:35 PM UTC

📊 Full Rankings

| Rank | Model | MMLU | HumanEval | HellaSwag | ARC | Average |
|------|-------|------|-----------|-----------|-----|---------|
| 🥇 | Qwen/Qwen2.5-72B-Instruct | 85.3 | 87.2 | 87.9 | 72.5 | 81.5 |
| 🥈 | meta-llama/Llama-3.3-70B-Instruct | 83.4 | 82.0 | 87.2 | 71.8 | 80.6 |
| 🥉 | meta-llama/Llama-3.1-70B-Instruct | 82.0 | 80.5 | 86.4 | 70.2 | 79.4 |
| #4 | mistralai/Mistral-Large-Instruct-2411 | 81.2 | 83.0 | 85.1 | 68.9 | 78.4 |
| #5 | deepseek-ai/DeepSeek-V2.5 | 79.8 | 85.3 | 84.6 | 67.5 | 77.8 |
| #6 | CohereForAI/c4ai-command-r-plus | 75.6 | 74.3 | 80.5 | 64.2 | 73.4 |
| #7 | 01-ai/Yi-1.5-34B-Chat | 76.2 | 73.8 | 81.9 | 65.1 | 73.2 |
| #8 | Qwen/Qwen2.5-7B-Instruct | 74.2 | 75.8 | 81.3 | 63.8 | 72.9 |
| #9 | internlm/internlm2_5-20b-chat | 74.9 | 72.1 | 80.2 | 63.5 | 71.7 |
| #10 | meta-llama/Llama-3.1-8B-Instruct | 72.8 | 72.5 | 79.6 | 61.5 | 70.8 |
| #11 | microsoft/Phi-3-medium-128k-instruct | 73.8 | 71.5 | 77.8 | 62.7 | 70.6 |
| #12 | google/gemma-2-9b-it | 71.5 | 68.9 | 78.2 | 60.4 | 69.4 |
| #13 | mistralai/Mistral-7B-Instruct-v0.3 | 68.5 | 65.2 | 76.4 | 58.9 | 66.9 |
| #14 | Nexusflow/Starling-LM-7B-beta | 65.8 | 62.4 | 74.5 | 57.3 | 64.7 |
| #15 | openchat/openchat-3.5-0106 | 64.2 | 61.8 | 73.8 | 56.1 | 63.5 |
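
Note that the published Average values are not exactly the mean of the four displayed columns, so the composite score likely also draws on benchmarks not shown in this table. As a reference point, the sketch below ranks a few of the listed models by a plain unweighted mean of the displayed scores; it is a minimal illustration, not the leaderboard's actual aggregation method.

```python
# Minimal sketch: rank models by the unweighted mean of the benchmark scores
# shown in the table above. The leaderboard's own Average column may aggregate
# additional benchmarks, so these composites will not match it exactly.

scores = {
    "Qwen/Qwen2.5-72B-Instruct":         {"MMLU": 85.3, "HumanEval": 87.2, "HellaSwag": 87.9, "ARC": 72.5},
    "meta-llama/Llama-3.3-70B-Instruct": {"MMLU": 83.4, "HumanEval": 82.0, "HellaSwag": 87.2, "ARC": 71.8},
    "deepseek-ai/DeepSeek-V2.5":         {"MMLU": 79.8, "HumanEval": 85.3, "HellaSwag": 84.6, "ARC": 67.5},
}

def composite(benchmarks):
    """Unweighted mean over whichever benchmark scores are present."""
    return sum(benchmarks.values()) / len(benchmarks)

ranked = sorted(scores.items(), key=lambda item: composite(item[1]), reverse=True)
for rank, (model, benchmarks) in enumerate(ranked, start=1):
    print(f"#{rank} {model}: {composite(benchmarks):.1f}")
```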

Benchmark Metrics Explained

MMLU: Massive Multitask Language Understanding - multiple-choice questions testing general knowledge and problem solving across 57 subjects
HumanEval: Python coding ability, typically reported as the pass@1 rate, i.e. the fraction of problems whose generated solution passes all unit tests (see the sketch below)
HellaSwag: commonsense reasoning, scored as sentence-completion accuracy
ARC: AI2 Reasoning Challenge - grade-school science questions
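
When more than one sample is generated per HumanEval problem, results are usually summarized with the unbiased pass@k estimator introduced with the benchmark. The sketch below is a straightforward implementation of that estimator; the sample counts in the example call are made up for illustration.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    (drawn from n generated samples, of which c pass the tests) is correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 20 samples generated for one problem, 3 pass the unit tests
print(f"pass@1  = {pass_at_k(n=20, c=3, k=1):.3f}")    # 0.150
print(f"pass@10 = {pass_at_k(n=20, c=3, k=10):.3f}")   # 0.895
```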