Top Model Performance Overview

Top Model Benchmark: NormScore - Livebench

Rank Model Name NormScore - Livebench Agentic Coding Coding Data Analysis IF Language Mathematics Reasoning
1 gpt-5 75.837 70.963 67.891 78.433 77.992 76.211 78.433 78.433
2 Claude 4 Opus 63.987 50.421 72.452 70.052 68.721 74.555 68.948 46.025
3 Qwen 3 235B A22B Instruct 2507 60.115 21.476 65.412 67.828 66.381 62.454 68.861 70.643
4 Claude 4 Sonnet 59.221 29.879 77.032 67.801 67.716 66.653 66.713 44.719
5 DeepSeek V3.1 58.863 28.946 67.512 67.198 73.991 58.449 68.515 48.046
6 Kimi K2 Instruct 58.453 24.277 70.692 65.559 72.329 62.145 64.667 51.327
7 Qwen 3 Coder 480B A35B Instruct 57.743 41.084 72.093 66.850 65.000 61.704 58.255 44.614
8 Claude 3.7 Sonnet 57.062 40.150 73.152 63.327 67.074 63.244 56.615 39.983
9 GPT-5 Chat 56.065 17.741 75.613 66.497 64.016 61.045 63.446 51.439
10 GPT-4.5 Preview 54.922 23.343 74.912 62.561 63.405 63.255 58.930 44.430
11 Grok 3 Beta 52.748 16.807 72.452 59.526 74.288 53.439 54.589 39.781
12 GPT-4.1 52.076 19.608 72.093 68.918 67.538 53.796 53.616 36.096
13 DeepSeek V3 0324 52.035 18.675 67.872 63.787 71.469 47.044 61.969 36.148
14 ChatGPT-4o 50.672 19.608 76.313 69.120 63.046 49.641 47.784 39.761
15 Claude 3.5 Sonnet 49.116 23.343 72.793 59.553 60.729 55.280 44.373 35.245
16 Qwen2.5 Max 48.269 5.602 65.771 68.027 66.085 57.852 49.577 31.446
17 Mistral Medium 3 46.889 18.675 60.509 58.997 62.563 44.758 51.926 34.146
18 GPT-4.1 Mini 46.757 9.337 71.033 61.936 61.633 37.479 50.445 43.628
19 Llama 4 Maverick 17B 128E Instruct 45.576 8.403 53.430 52.195 66.418 48.271 52.427 35.768
20 GPT-4o 43.314 11.205 68.231 64.829 56.915 44.516 35.902 32.544
21 Gemini 2.0 Flash Lite 43.045 5.602 58.390 68.081 67.194 33.658 47.484 26.314
22 Command A 41.887 7.470 53.450 48.854 72.705 37.640 39.425 29.570
23 Hunyuan Turbos 40.963 3.735 49.589 48.978 66.801 34.069 49.604 30.898
24 Gemma 3 27B 39.775 5.602 48.188 39.517 65.642 40.951 44.747 27.874
25 Mistral Large 39.765 1.867 61.910 55.114 59.565 41.106 37.171 27.518
26 Qwen2.5 72B Instruct Turbo 39.174 3.735 56.307 52.786 56.468 36.685 44.957 27.766
27 Mistral Small 37.922 11.205 48.888 54.359 55.793 34.642 33.630 30.115
28 Claude 3.5 Haiku 36.259 7.470 52.389 54.980 54.206 39.031 30.089 21.281
29 GPT-4.1 Nano 35.791 5.602 62.951 44.596 50.469 29.263 35.936 28.819
30 Gemma 3 12B 34.817 1.867 41.507 31.821 64.687 31.012 41.041 23.061
31 Command R Plus 28.038 1.867 26.725 47.369 50.488 30.769 19.982 17.587
32 Gemma 3n E4B IT 26.893 1.867 30.965 16.394 56.741 25.164 27.430 17.597
33 Command R 25.565 1.867 25.684 38.369 48.774 27.808 16.265 16.758
34 Gemma 3 4B 23.128 0.000 15.463 18.080 55.733 14.992 26.631 15.888
35 Gemma 3n E2B IT 21.258 1.867 16.182 13.220 50.274 15.200 22.323 15.694

Category Performance Comparison

Category Scores by Model (Top 10 Models)

Model Agentic Coding Coding Data Analysis IF Language Mathematics Reasoning
gpt-5 35.000 68.968 72.375 88.988 78.988 89.954 96.583
Claude 4 Opus 31.667 73.576 66.510 78.379 76.114 78.790 56.444
Qwen 3 235B A22B Instruct 2507 13.333 66.411 65.241 75.704 66.292 79.179 86.889
Claude 4 Sonnet 25.000 78.245 64.684 77.246 67.181 76.390 54.861
DeepSeek V3.1 21.667 68.524 65.424 84.375 59.801 78.868 59.167
Kimi K2 Instruct 20.000 71.785 63.405 82.467 63.853 74.414 62.972
Qwen 3 Coder 480B A35B Instruct 25.000 73.194 64.683 74.163 64.262 67.282 54.583
Claude 3.7 Sonnet 21.667 74.280 59.965 76.492 63.194 64.654 49.111
GPT-5 Chat 11.667 76.776 64.482 73.004 62.963 73.456 63.139
GPT-4.5 Preview 15.000 76.072 60.070 72.325 64.759 67.940 54.417

Categories and Benchmarks

Category Benchmark
Agentic Coding javascript, python, typescript
Coding code_completion, code_generation
Data Analysis tablejoin, tablereformat
IF paraphrase, simplify, story_generation, summarize
Language connections, plot_unscrambling, typos
Mathematics AMPS_Hard, math_comp, olympiad
Reasoning spatial, web_of_lies_v3, zebra_puzzle
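
As a rough illustration of how this grouping can be used, the sketch below aggregates per-task LiveBench scores into category averages. The mapping mirrors the table above; the function name and the example scores are hypothetical assumptions for illustration, not part of LiveBench.ai or Topllms.com.

# Minimal sketch: aggregate per-task scores into category averages.
CATEGORY_BENCHMARKS = {
    "Agentic Coding": ["javascript", "python", "typescript"],
    "Coding": ["code_completion", "code_generation"],
    "Data Analysis": ["tablejoin", "tablereformat"],
    "IF": ["paraphrase", "simplify", "story_generation", "summarize"],
    "Language": ["connections", "plot_unscrambling", "typos"],
    "Mathematics": ["AMPS_Hard", "math_comp", "olympiad"],
    "Reasoning": ["spatial", "web_of_lies_v3", "zebra_puzzle"],
}

def category_averages(task_scores):
    """Average a model's per-task scores within each category it has data for."""
    averages = {}
    for category, tasks in CATEGORY_BENCHMARKS.items():
        present = [task_scores[task] for task in tasks if task in task_scores]
        if present:
            averages[category] = sum(present) / len(present)
    return averages

# Hypothetical partial per-task scores for a single model.
example_scores = {"python": 42.0, "javascript": 38.0, "tablejoin": 71.5, "tablereformat": 66.0}
print(category_averages(example_scores))
# -> {'Agentic Coding': 40.0, 'Data Analysis': 68.75}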

NormScore - Livebench Calculation Method and Advantages

NormScore - Livebench is a score for comparing model performance, computed by normalizing LiveBench.ai data. By correcting for the varying difficulty of the individual benchmarks, the normalization yields a more discriminative comparison between models.

The calculation method is as follows:

1. Check the scores of the models within each benchmark.

2. Normalize the highest score in each benchmark to 100 points, and express every other model's score as a ratio of that highest score.

Normalization Formula
\( S_{norm} = \frac{S_{raw}}{S_{max}} \times 100 \)

3. Average each model's normalized benchmark scores to produce its primary NormScore.

Primary NormScore Formula
\( NormScore_{1st} = \frac{\sum S_{norm}}{N_{benchmarks}} \)

4. Find the model with the highest primary NormScore.

5. Look up that top-ranked model's average score in the original LiveBench.ai data.

6. Calculate the adjustment ratio by dividing the top-ranked model's original average score by its primary NormScore.

Adjustment Ratio Formula
\( Adjustment\_Ratio = \frac{Avg\_Score_{top}}{NormScore_{1st,\,top}} \), where the subscript \(top\) denotes the model found in step 4.

7. Obtain the final NormScore - Livebench and the adjusted category scores by multiplying every model's primary NormScore and category average scores by this adjustment ratio.

Final NormScore Formula
\( NormScore_{Livebench} = NormScore_{1st} \times Adjustment\_Ratio \)

Through this approach, NormScore - Livebench can more accurately evaluate the actual capabilities of models by reflecting relative performance within each benchmark while maintaining the average score scale of the original LiveBench.ai data.
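
The sketch below is a minimal Python rendering of the seven steps above, written directly from this description rather than taken from LiveBench.ai or Topllms.com; the function name, data layout, and example numbers are assumptions for illustration only.

def normscore_livebench(raw_scores, original_averages):
    """Compute NormScore - Livebench for every model.

    raw_scores: {model: {benchmark: raw LiveBench score}}
    original_averages: {model: average score in the original LiveBench.ai data}
    """
    benchmarks = {b for scores in raw_scores.values() for b in scores}

    # Step 2: normalize each benchmark so that the best model scores 100.
    best = {b: max(s[b] for s in raw_scores.values() if b in s) for b in benchmarks}
    normalized = {m: {b: s[b] / best[b] * 100 for b in s} for m, s in raw_scores.items()}

    # Step 3: primary NormScore = mean of a model's normalized benchmark scores.
    primary = {m: sum(n.values()) / len(n) for m, n in normalized.items()}

    # Steps 4-6: take the model with the highest primary NormScore and derive the
    # adjustment ratio from its original LiveBench.ai average score.
    top = max(primary, key=primary.get)
    adjustment_ratio = original_averages[top] / primary[top]

    # Step 7: scale every model's primary NormScore by the adjustment ratio.
    return {m: p * adjustment_ratio for m, p in primary.items()}

# Hypothetical two-benchmark, three-model example.
raw = {
    "model_a": {"math_comp": 90.0, "zebra_puzzle": 80.0},
    "model_b": {"math_comp": 45.0, "zebra_puzzle": 40.0},
    "model_c": {"math_comp": 60.0, "zebra_puzzle": 80.0},
}
orig_avg = {"model_a": 85.0, "model_b": 42.5, "model_c": 70.0}
print(normscore_livebench(raw, orig_avg))
# By construction, the top-ranked model's final NormScore equals its original
# average (85.0 here); every other model is scaled by the same ratio.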

License Information

The LiveBench.ai data provided on this website is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Terms of Use:

Attribution

You must give appropriate credit, provide a link to the license, and indicate if changes were made.

ShareAlike

If you remix, transform, or build upon this material, you must distribute the resulting work under the same license.

Terms of Reuse:

When reusing, redistributing, or creating derivative works from this data, you must include the following attribution:

1. Source Data: LiveBench.ai
2. Processed and Provided by: Topllms.com

In addition, the same license (CC BY-SA 4.0) must be applied to the resulting work.