Compare the world's best open-source AI models by standardized benchmark scores. All scores are percentages, higher is better: MMLU (broad knowledge), HumanEval (code generation), HellaSwag (commonsense completion), and ARC (grade-school science reasoning). The Average column is the unweighted mean of the four benchmark scores.
| Rank | Model | MMLU | HumanEval | HellaSwag | ARC | Average |
|---|---|---|---|---|---|---|
| 🥇 | Qwen/Qwen2.5-72B-Instruct | 85.3 | 87.2 | 87.9 | 72.5 | 83.2 |
| 🥈 | meta-llama/Llama-3.3-70B-Instruct | 83.4 | 82.0 | 87.2 | 71.8 | 81.1 |
| 🥉 | meta-llama/Llama-3.1-70B-Instruct | 82.0 | 80.5 | 86.4 | 70.2 | 79.8 |
| #4 | mistralai/Mistral-Large-Instruct-2411 | 81.2 | 83.0 | 85.1 | 68.9 | 79.6 |
| #5 | deepseek-ai/DeepSeek-V2.5 | 79.8 | 85.3 | 84.6 | 67.5 | 79.3 |
| #6 | 01-ai/Yi-1.5-34B-Chat | 76.2 | 73.8 | 81.9 | 65.1 | 74.3 |
| #7 | Qwen/Qwen2.5-7B-Instruct | 74.2 | 75.8 | 81.3 | 63.8 | 73.8 |
| #8 | CohereForAI/c4ai-command-r-plus | 75.6 | 74.3 | 80.5 | 64.2 | 73.7 |
| #9 | internlm/internlm2_5-20b-chat | 74.9 | 72.1 | 80.2 | 63.5 | 72.7 |
| #10 | meta-llama/Llama-3.1-8B-Instruct | 72.8 | 72.5 | 79.6 | 61.5 | 71.6 |
| #11 | microsoft/Phi-3-medium-128k-instruct | 73.8 | 71.5 | 77.8 | 62.7 | 71.5 |
| #12 | google/gemma-2-9b-it | 71.5 | 68.9 | 78.2 | 60.4 | 69.8 |
| #13 | mistralai/Mistral-7B-Instruct-v0.3 | 68.5 | 65.2 | 76.4 | 58.9 | 67.3 |
| #14 | Nexusflow/Starling-LM-7B-beta | 65.8 | 62.4 | 74.5 | 57.3 | 65.0 |
| #15 | openchat/openchat-3.5-0106 | 64.2 | 61.8 | 73.8 | 56.1 | 64.0 |
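To make the aggregation reproducible, here is a minimal Python sketch that recomputes the Average column and the resulting rank order from the raw scores in the table. The equal-weight mean is an assumption of this sketch (it matches how the Average column above is defined), not an official leaderboard methodology, and the script structure is illustrative.

```python
from statistics import mean

# Scores transcribed from the table above:
# (MMLU, HumanEval, HellaSwag, ARC), all percentages, higher is better.
SCORES = {
    "Qwen/Qwen2.5-72B-Instruct":             (85.3, 87.2, 87.9, 72.5),
    "meta-llama/Llama-3.3-70B-Instruct":     (83.4, 82.0, 87.2, 71.8),
    "meta-llama/Llama-3.1-70B-Instruct":     (82.0, 80.5, 86.4, 70.2),
    "mistralai/Mistral-Large-Instruct-2411": (81.2, 83.0, 85.1, 68.9),
    "deepseek-ai/DeepSeek-V2.5":             (79.8, 85.3, 84.6, 67.5),
    "01-ai/Yi-1.5-34B-Chat":                 (76.2, 73.8, 81.9, 65.1),
    "Qwen/Qwen2.5-7B-Instruct":              (74.2, 75.8, 81.3, 63.8),
    "CohereForAI/c4ai-command-r-plus":       (75.6, 74.3, 80.5, 64.2),
    "internlm/internlm2_5-20b-chat":         (74.9, 72.1, 80.2, 63.5),
    "meta-llama/Llama-3.1-8B-Instruct":      (72.8, 72.5, 79.6, 61.5),
    "microsoft/Phi-3-medium-128k-instruct":  (73.8, 71.5, 77.8, 62.7),
    "google/gemma-2-9b-it":                  (71.5, 68.9, 78.2, 60.4),
    "mistralai/Mistral-7B-Instruct-v0.3":    (68.5, 65.2, 76.4, 58.9),
    "Nexusflow/Starling-LM-7B-beta":         (65.8, 62.4, 74.5, 57.3),
    "openchat/openchat-3.5-0106":            (64.2, 61.8, 73.8, 56.1),
}

# Equal-weight mean across the four benchmarks, sorted best-first.
ranking = sorted(
    ((model, mean(scores)) for model, scores in SCORES.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

# Print the leaderboard; averages shown to two decimals to make ties visible.
for rank, (model, avg) in enumerate(ranking, start=1):
    print(f"#{rank:<2} {model:<40} {avg:6.2f}")
```

Swapping in a different aggregation (for example, weighting HumanEval more heavily for code-focused use cases) only requires changing the `mean(scores)` expression, so the same script can produce rankings tailored to a specific workload.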