Top Model Performance Overview

Top Model Benchmark: NormScore - Livebench

| Rank | Model Name | NormScore - Livebench | Agentic Coding | Coding | Data Analysis | IF | Language | Mathematics | Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Claude 4 Opus | 65.926 | 61.854 | 64.279 | 68.221 | 63.611 | 69.585 | 67.974 | 66.945 |
| 2 | Claude 4 Sonnet | 59.511 | 30.926 | 68.342 | 65.945 | 62.716 | 61.693 | 65.923 | 65.050 |
| 3 | Claude 3.7 Sonnet | 57.830 | 45.838 | 64.900 | 61.713 | 62.087 | 58.365 | 55.784 | 58.352 |
| 4 | GPT-4.5 Preview | 55.833 | 27.061 | 66.462 | 60.747 | 58.694 | 58.954 | 58.757 | 64.380 |
| 5 | Grok 3 Beta | 53.041 | 19.329 | 64.279 | 58.205 | 68.786 | 49.599 | 54.187 | 57.118 |
| 6 | DeepSeek V3.1 | 52.191 | 20.986 | 60.216 | 61.200 | 66.141 | 43.521 | 61.935 | 52.367 |
| 7 | GPT-4.1 | 52.055 | 19.329 | 63.960 | 66.856 | 62.524 | 49.943 | 54.166 | 53.017 |
| 8 | ChatGPT-4o | 50.828 | 19.329 | 67.704 | 67.074 | 58.360 | 45.549 | 48.402 | 57.907 |
| 9 | Claude 3.5 Sonnet | 49.348 | 27.061 | 64.581 | 58.090 | 56.260 | 50.788 | 43.491 | 50.849 |
| 10 | Qwen2.5 Max | 48.310 | 9.388 | 58.351 | 66.332 | 61.182 | 53.474 | 48.996 | 45.510 |
| 11 | GPT-4.1 Mini | 48.048 | 12.702 | 63.020 | 59.642 | 57.066 | 34.557 | 50.924 | 64.275 |
| 12 | Mistral Medium 3 | 47.190 | 20.986 | 53.683 | 56.340 | 57.942 | 41.271 | 51.866 | 49.873 |
| 13 | Llama 4 Maverick 17B 128E Instruct | 45.985 | 9.388 | 47.403 | 51.480 | 61.473 | 44.940 | 52.278 | 52.078 |
| 14 | GPT-4o | 43.537 | 14.359 | 60.534 | 62.607 | 52.721 | 40.975 | 35.926 | 46.600 |
| 15 | Gemini 2.0 Flash Lite | 42.450 | 4.970 | 51.803 | 66.102 | 62.212 | 30.878 | 47.523 | 38.076 |
| 16 | Hunyuan Turbos | 40.985 | 3.314 | 43.995 | 47.302 | 61.802 | 30.953 | 49.634 | 46.068 |
| 17 | Gemma 3 27B | 40.269 | 9.388 | 42.752 | 38.143 | 60.831 | 37.435 | 45.258 | 41.343 |
| 18 | Mistral Large | 39.234 | 1.657 | 54.926 | 53.176 | 55.130 | 37.634 | 36.244 | 40.448 |
| 19 | Qwen2.5 72B Instruct Turbo | 38.957 | 3.314 | 49.955 | 51.395 | 52.285 | 33.752 | 44.800 | 40.568 |
| 20 | Mistral Small | 38.475 | 14.359 | 43.373 | 52.796 | 51.680 | 31.664 | 33.235 | 44.220 |
| 21 | GPT-4.1 Nano | 36.307 | 9.388 | 55.228 | 41.440 | 46.686 | 27.019 | 36.438 | 42.509 |
| 22 | Claude 3.5 Haiku | 35.509 | 6.627 | 46.479 | 53.032 | 50.222 | 35.685 | 30.253 | 30.858 |
| 23 | Command R Plus | 27.517 | 1.657 | 23.710 | 44.996 | 46.747 | 28.105 | 19.835 | 25.715 |
| 24 | Command R | 24.988 | 1.657 | 22.786 | 36.477 | 45.116 | 25.156 | 15.836 | 24.272 |

Category Performance Comparison

Category Scores by Model

| Model | Agentic Coding | Coding | Data Analysis | IF | Language | Mathematics | Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 4 Opus | 31.667 | 73.576 | 66.510 | 78.379 | 76.114 | 78.790 | 56.444 |
| Claude 4 Sonnet | 25.000 | 78.245 | 64.684 | 77.246 | 67.181 | 76.390 | 54.861 |
| Claude 3.7 Sonnet | 21.667 | 74.280 | 59.965 | 76.492 | 63.194 | 64.654 | 49.111 |
| GPT-4.5 Preview | 15.000 | 76.072 | 60.070 | 72.325 | 64.759 | 67.940 | 54.417 |
| Grok 3 Beta | 13.333 | 73.576 | 55.629 | 84.738 | 53.797 | 62.752 | 48.528 |
| DeepSeek V3.1 | 15.000 | 68.907 | 64.019 | 81.471 | 46.823 | 71.437 | 44.278 |
| GPT-4.1 | 13.333 | 73.194 | 66.404 | 77.046 | 54.551 | 62.386 | 44.389 |
| ChatGPT-4o | 13.333 | 77.480 | 66.520 | 71.921 | 49.428 | 55.717 | 48.806 |
| Claude 3.5 Sonnet | 15.000 | 73.898 | 56.186 | 69.296 | 54.477 | 50.543 | 43.222 |
| Qwen2.5 Max | 3.333 | 66.794 | 64.271 | 75.346 | 58.369 | 56.868 | 38.528 |

Categories and Benchmarks

| Category | Benchmarks |
| --- | --- |
| Agentic Coding | javascript, python, typescript |
| Coding | code_completion, code_generation |
| Data Analysis | tablejoin, tablereformat |
| IF | paraphrase, simplify, story_generation, summarize |
| Language | connections, plot_unscrambling, typos |
| Mathematics | AMPS_Hard, math_comp, olympiad |
| Reasoning | spatial, web_of_lies_v3, zebra_puzzle |


NormScore - Livebench Calculation Method and Advantages

NormScore - LiveBench is a normalized score for comparing model performance, built on LiveBench.ai data. By normalizing each benchmark, it corrects for the varying difficulty of the individual benchmark metrics and yields more discriminative comparisons between models.

The calculation proceeds as follows:

1. Collect each model's raw score on every benchmark.

2. Normalize each benchmark so that its highest score becomes 100 points; every other model's score is expressed as a ratio of that maximum.

Normalization Formula
\( S_{norm} = \frac{S_{raw}}{S_{max}} \times 100 \)

3. Average each model's normalized benchmark scores to obtain its primary NormScore.

Primary NormScore Formula
\( NormScore_{primary} = \frac{\sum S_{norm}}{N_{benchmarks}} \)

4. Identify the model with the highest primary NormScore.

5. Look up that top model's average score in the original LiveBench.ai data.

6. Compute the adjustment ratio by dividing the top model's original average score by its primary NormScore.

Adjustment Ratio Formula
\( Adjustment\_Ratio = \frac{Avg\_Score_{top}}{NormScore_{primary,top}} \)

7. Multiply every model's primary NormScore and category average scores by this adjustment ratio to obtain the final NormScore - Livebench and the adjusted category scores.

Final NormScore Formula
\( NormScore_{Livebench} = NormScore_{primary} \times Adjustment\_Ratio \)

This approach reflects each model's relative performance within every benchmark while preserving the average-score scale of the original LiveBench.ai data, giving a more accurate picture of actual model capability.
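The seven steps above can be sketched in a few lines of Python. This is a minimal illustration of the procedure only; the model names and raw scores below are hypothetical, not LiveBench data.

```python
def normscore(raw_scores):
    """Compute final NormScores from {model: {benchmark: raw_score}}."""
    benchmarks = next(iter(raw_scores.values())).keys()

    # Step 2: normalize each benchmark so its top score becomes 100.
    maxes = {b: max(m[b] for m in raw_scores.values()) for b in benchmarks}
    norm = {
        model: {b: scores[b] / maxes[b] * 100 for b in benchmarks}
        for model, scores in raw_scores.items()
    }

    # Step 3: primary NormScore = mean of a model's normalized scores.
    primary = {m: sum(s.values()) / len(s) for m, s in norm.items()}

    # Steps 4-6: find the top model and rescale so it keeps its
    # original LiveBench average score.
    top = max(primary, key=primary.get)
    top_raw_avg = sum(raw_scores[top].values()) / len(raw_scores[top])
    ratio = top_raw_avg / primary[top]

    # Step 7: apply the adjustment ratio to every model.
    return {m: p * ratio for m, p in primary.items()}


scores = {
    "model_a": {"coding": 80.0, "math": 60.0},
    "model_b": {"coding": 40.0, "math": 90.0},
}
print(normscore(scores))
```

Note that the top-ranked model's final NormScore equals its raw average by construction (the ratio cancels), while every other model is scaled by the same factor.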

License Information

The LiveBench.ai data provided on this website is made available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Terms of Use:

Attribution

You must give appropriate credit, provide a link to the license, and indicate if changes were made.

ShareAlike

If you remix, transform, or build upon this material, you must distribute the resulting work under the same license.

Terms of Reuse:

When reusing, redistributing, or creating derivative works from this data, you must include the following attribution:

1. Source Data: LiveBench.ai
2. Processed and Provided by: Topllms.com

The same license (CC BY-SA 4.0) must also be applied to the derivative work.