🏆 AI Model Benchmark Leaderboard

Compare the world's best open-source AI models by standardized benchmark scores

📊 Updated: Dec 18, 2025, 10:35 PM UTC

📊 Full Rankings

| Rank | Model | MMLU | HumanEval | HellaSwag | ARC | Average |
|------|-------|------|-----------|-----------|-----|---------|
| 🥇 | Qwen/Qwen2.5-72B-Instruct | 85.3 | 87.2 | 87.9 | 72.5 | 81.5 |
| 🥈 | meta-llama/Llama-3.3-70B-Instruct | 83.4 | 82.0 | 87.2 | 71.8 | 80.6 |
| 🥉 | meta-llama/Llama-3.1-70B-Instruct | 82.0 | 80.5 | 86.4 | 70.2 | 79.4 |
| #4 | mistralai/Mistral-Large-Instruct-2411 | 81.2 | 83.0 | 85.1 | 68.9 | 78.4 |
| #5 | deepseek-ai/DeepSeek-V2.5 | 79.8 | 85.3 | 84.6 | 67.5 | 77.8 |
| #6 | CohereForAI/c4ai-command-r-plus | 75.6 | 74.3 | 80.5 | 64.2 | 73.4 |
| #7 | 01-ai/Yi-1.5-34B-Chat | 76.2 | 73.8 | 81.9 | 65.1 | 73.2 |
| #8 | Qwen/Qwen2.5-7B-Instruct | 74.2 | 75.8 | 81.3 | 63.8 | 72.9 |
| #9 | internlm/internlm2_5-20b-chat | 74.9 | 72.1 | 80.2 | 63.5 | 71.7 |
| #10 | meta-llama/Llama-3.1-8B-Instruct | 72.8 | 72.5 | 79.6 | 61.5 | 70.8 |
| #11 | microsoft/Phi-3-medium-128k-instruct | 73.8 | 71.5 | 77.8 | 62.7 | 70.6 |
| #12 | google/gemma-2-9b-it | 71.5 | 68.9 | 78.2 | 60.4 | 69.4 |
| #13 | mistralai/Mistral-7B-Instruct-v0.3 | 68.5 | 65.2 | 76.4 | 58.9 | 66.9 |
| #14 | Nexusflow/Starling-LM-7B-beta | 65.8 | 62.4 | 74.5 | 57.3 | 64.7 |
| #15 | openchat/openchat-3.5-0106 | 64.2 | 61.8 | 73.8 | 56.1 | 63.5 |
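
Note that the published Average values are not exactly the mean of the four displayed columns, so the composite score likely also draws on benchmarks not shown in this table. As a reference point, the sketch below ranks a few of the listed models by a plain unweighted mean of the displayed scores; it is a minimal illustration, not the leaderboard's actual aggregation method.

```python
# Minimal sketch: rank models by the unweighted mean of the benchmark scores
# shown in the table above. The leaderboard's own Average column may aggregate
# additional benchmarks, so these composites will not match it exactly.

scores = {
    "Qwen/Qwen2.5-72B-Instruct":         {"MMLU": 85.3, "HumanEval": 87.2, "HellaSwag": 87.9, "ARC": 72.5},
    "meta-llama/Llama-3.3-70B-Instruct": {"MMLU": 83.4, "HumanEval": 82.0, "HellaSwag": 87.2, "ARC": 71.8},
    "deepseek-ai/DeepSeek-V2.5":         {"MMLU": 79.8, "HumanEval": 85.3, "HellaSwag": 84.6, "ARC": 67.5},
}

def composite(benchmarks):
    """Unweighted mean over whichever benchmark scores are present."""
    return sum(benchmarks.values()) / len(benchmarks)

ranked = sorted(scores.items(), key=lambda item: composite(item[1]), reverse=True)
for rank, (model, benchmarks) in enumerate(ranked, start=1):
    print(f"#{rank} {model}: {composite(benchmarks):.1f}")
```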

Benchmark Metrics Explained

MMLU: Massive Multitask Language Understanding - multiple-choice questions testing general knowledge and problem solving across 57 subjects
HumanEval: Python coding ability, typically reported as the pass@1 rate, i.e. the fraction of problems whose generated solution passes all unit tests (see the sketch below)
HellaSwag: commonsense reasoning, scored as sentence-completion accuracy
ARC: AI2 Reasoning Challenge - grade-school science questions
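
When more than one sample is generated per HumanEval problem, results are usually summarized with the unbiased pass@k estimator introduced with the benchmark. The sketch below is a straightforward implementation of that estimator; the sample counts in the example call are made up for illustration.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    (drawn from n generated samples, of which c pass the tests) is correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 20 samples generated for one problem, 3 pass the unit tests
print(f"pass@1  = {pass_at_k(n=20, c=3, k=1):.3f}")    # 0.150
print(f"pass@10 = {pass_at_k(n=20, c=3, k=10):.3f}")   # 0.895
```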