Top Model Performance Overview

Top Model Benchmark: NormScore - Livebench

| Rank | Model Name | NormScore - Livebench | Agentic Coding | Coding | Data Analysis | IF | Language | Mathematics | Reasoning |
|---|---|---|---|---|---|---|---|---|---|
| 1 | o3 High | 74.416 | 75.198 | 74.690 | 70.842 | 77.254 | 72.556 | 72.775 | 75.551 |
| 2 | Gemini 2.5 Pro Preview | 71.968 | 64.295 | 70.992 | 75.861 | 74.844 | 70.194 | 75.954 | 71.649 |
| 3 | Claude 4 Opus Thinking | 71.033 | 57.902 | 71.379 | 76.350 | 72.377 | 71.600 | 75.683 | 73.378 |
| 4 | o3 Medium | 70.102 | 55.646 | 75.849 | 71.949 | 75.564 | 69.613 | 69.069 | 73.733 |
| 5 | Claude 4 Sonnet Thinking | 69.497 | 50.759 | 71.673 | 74.800 | 72.068 | 68.579 | 73.044 | 77.194 |
| 6 | o4-Mini High | 68.986 | 52.262 | 77.891 | 73.109 | 76.158 | 61.483 | 72.566 | 71.387 |
| 7 | Gemini 2.5 Pro Preview (2025-06-05 Max Thinking) | 68.074 | 39.479 | 71.967 | 77.684 | 69.335 | 73.707 | 72.119 | 76.307 |
| 8 | Claude 3.7 Sonnet Thinking | 65.873 | 48.503 | 71.286 | 74.460 | 72.832 | 66.876 | 67.699 | 61.800 |
| 9 | Gemini 2.5 Pro Preview | 65.074 | 21.807 | 68.858 | 77.851 | 70.432 | 72.865 | 71.500 | 75.940 |
| 10 | DeepSeek R1 | 64.970 | 31.207 | 69.538 | 76.825 | 71.665 | 62.045 | 72.998 | 73.755 |
| 11 | Claude 4 Opus | 64.257 | 52.639 | 71.673 | 69.958 | 70.254 | 74.036 | 67.806 | 45.810 |
| 12 | o4-Mini Medium | 62.919 | 34.592 | 72.262 | 72.432 | 73.291 | 57.839 | 69.208 | 63.635 |
| 13 | Gemini 2.5 Flash Preview | 61.360 | 30.832 | 61.187 | 74.817 | 71.268 | 56.912 | 72.059 | 63.568 |
| 14 | DeepSeek R1 | 60.850 | 29.327 | 74.101 | 73.507 | 72.104 | 52.611 | 66.648 | 62.539 |
| 15 | Qwen 3 235B A22B | 60.569 | 18.423 | 64.682 | 73.075 | 78.621 | 57.033 | 68.491 | 63.181 |
| 16 | Grok 3 Mini Beta (High) | 59.890 | 30.455 | 53.037 | 67.533 | 70.540 | 57.643 | 65.659 | 71.078 |
| 17 | Gemini 2.5 Flash Preview | 59.555 | 29.327 | 58.759 | 68.258 | 70.826 | 59.024 | 69.985 | 59.586 |
| 18 | Claude 4 Sonnet | 59.423 | 31.583 | 76.236 | 67.717 | 69.255 | 66.067 | 65.692 | 44.504 |
| 19 | Qwen 3 32B | 59.283 | 14.663 | 62.549 | 72.116 | 76.345 | 53.180 | 68.412 | 67.395 |
| 20 | Claude 3.7 Sonnet | 56.673 | 37.975 | 72.354 | 63.237 | 68.547 | 62.597 | 55.647 | 39.808 |
| 21 | Qwen 3 30B A3B | 55.454 | 16.543 | 46.231 | 70.108 | 74.630 | 52.942 | 65.304 | 57.838 |
| 22 | GPT-4.5 Preview | 54.959 | 23.687 | 74.101 | 62.494 | 64.766 | 62.856 | 58.234 | 44.198 |
| 23 | Grok 3 Beta | 53.018 | 18.423 | 71.673 | 59.423 | 75.952 | 52.945 | 53.885 | 39.534 |
| 24 | DeepSeek V3.1 | 52.294 | 20.303 | 67.110 | 63.786 | 73.092 | 46.518 | 61.130 | 35.955 |
| 25 | GPT-4.1 | 51.961 | 18.423 | 71.286 | 68.847 | 68.998 | 53.374 | 53.200 | 35.988 |
| 26 | ChatGPT-4o | 50.521 | 18.423 | 75.463 | 69.048 | 64.410 | 49.122 | 47.469 | 39.575 |
| 27 | Claude 3.5 Sonnet | 49.100 | 23.687 | 71.967 | 59.464 | 62.123 | 54.588 | 43.610 | 34.998 |
| 28 | Qwen2.5 Max | 48.522 | 7.144 | 65.069 | 67.927 | 67.586 | 57.361 | 48.933 | 31.267 |
| 29 | Mistral Medium 3 | 47.138 | 20.303 | 59.917 | 59.021 | 63.941 | 44.307 | 51.104 | 33.995 |
| 30 | GPT-4.1 Mini | 47.079 | 10.904 | 70.220 | 61.915 | 62.995 | 37.203 | 50.177 | 43.494 |
| 31 | Llama 4 Maverick 17B 128E Instruct | 45.467 | 7.144 | 52.742 | 52.063 | 67.869 | 48.015 | 51.967 | 35.620 |
| 32 | Phi-4 Reasoning Plus | 44.739 | 5.640 | 58.961 | 55.110 | 65.560 | 29.314 | 53.040 | 46.802 |
| 33 | DeepSeek R1 Distill Llama 70B | 44.629 | 7.520 | 45.366 | 62.851 | 62.660 | 36.017 | 49.866 | 48.434 |
| 34 | GPT-4o | 43.572 | 12.783 | 67.496 | 64.790 | 58.202 | 44.108 | 35.496 | 32.300 |
| 35 | Gemini 2.0 Flash Lite | 43.164 | 5.640 | 57.784 | 68.007 | 68.731 | 33.389 | 47.072 | 26.159 |
| 36 | Hunyuan Turbos | 41.116 | 3.760 | 49.045 | 48.948 | 68.308 | 33.843 | 49.234 | 30.865 |
| 37 | Gemma 3 27B | 40.190 | 7.144 | 47.684 | 39.495 | 67.205 | 40.639 | 44.605 | 27.823 |
| 38 | Mistral Large | 39.770 | 1.880 | 61.279 | 55.086 | 60.862 | 40.592 | 36.492 | 27.444 |
| 39 | Qwen2.5 72B Instruct Turbo | 39.238 | 3.760 | 55.834 | 52.716 | 57.767 | 36.282 | 44.498 | 27.661 |
| 40 | Mistral Small | 38.191 | 12.783 | 48.365 | 54.298 | 57.068 | 34.309 | 32.974 | 30.006 |
| 41 | DeepSeek R1 Distill Qwen 32B | 38.140 | 5.640 | 45.752 | 51.763 | 49.938 | 29.706 | 51.188 | 36.138 |
| 42 | Claude 3.5 Haiku | 36.339 | 7.520 | 51.768 | 54.953 | 55.439 | 38.798 | 29.747 | 21.131 |
| 43 | GPT-4.1 Nano | 36.153 | 7.144 | 61.573 | 44.721 | 51.557 | 29.293 | 36.258 | 28.721 |
| 44 | Command R Plus | 28.108 | 1.880 | 26.418 | 47.410 | 51.556 | 30.501 | 19.539 | 17.507 |
| 45 | Command R | 25.641 | 1.880 | 25.443 | 38.400 | 49.807 | 27.597 | 15.845 | 16.648 |

Category Performance Comparison

Category Scores by Model

| Model | Agentic Coding | Coding | Data Analysis | IF | Language | Mathematics | Reasoning |
|---|---|---|---|---|---|---|---|
| o3 High | 36.667 | 76.715 | 67.020 | 86.175 | 75.996 | 85.004 | 93.333 |
| Gemini 2.5 Pro Preview | 30.000 | 72.872 | 68.848 | 83.504 | 71.811 | 88.628 | 88.250 |
| Claude 4 Opus Thinking | 33.333 | 73.255 | 70.731 | 80.742 | 73.721 | 88.247 | 90.472 |
| o3 Medium | 28.333 | 77.863 | 68.193 | 84.321 | 73.481 | 80.657 | 91.000 |
| Claude 4 Sonnet Thinking | 30.000 | 73.576 | 69.837 | 80.434 | 70.188 | 85.250 | 95.250 |
| o4-Mini High | 28.333 | 79.976 | 68.328 | 84.958 | 66.055 | 84.895 | 88.111 |
| Gemini 2.5 Pro Preview (2025-06-05 Max Thinking) | 20.000 | 73.898 | 71.501 | 77.354 | 75.440 | 84.193 | 94.278 |
| Claude 3.7 Sonnet Thinking | 25.000 | 73.194 | 69.107 | 81.254 | 68.269 | 78.999 | 76.167 |
| Gemini 2.5 Pro Preview | 13.333 | 70.698 | 71.597 | 78.538 | 74.522 | 83.329 | 93.722 |
| DeepSeek R1 | 21.667 | 71.402 | 71.539 | 79.954 | 64.823 | 85.258 | 91.083 |

Categories and Benchmarks

| Category | Benchmarks |
|---|---|
| Agentic Coding | javascript, python, typescript |
| Coding | code_completion, code_generation |
| Data Analysis | tablejoin, tablereformat |
| IF | paraphrase, simplify, story_generation, summarize |
| Language | connections, plot_unscrambling, typos |
| Mathematics | AMPS_Hard, math_comp, olympiad |
| Reasoning | spatial, web_of_lies_v3, zebra_puzzle |
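
For anyone scripting against this data, the category-to-benchmark mapping above can be written down directly. The sketch below is one possible encoding: the dictionary name and structure are choices made here for illustration, and only the category and benchmark names come from the table.

```python
# Category -> benchmark mapping, copied from the table above.
# LIVEBENCH_CATEGORIES is an illustrative name, not an official identifier.
LIVEBENCH_CATEGORIES = {
    "Agentic Coding": ["javascript", "python", "typescript"],
    "Coding": ["code_completion", "code_generation"],
    "Data Analysis": ["tablejoin", "tablereformat"],
    "IF": ["paraphrase", "simplify", "story_generation", "summarize"],
    "Language": ["connections", "plot_unscrambling", "typos"],
    "Mathematics": ["AMPS_Hard", "math_comp", "olympiad"],
    "Reasoning": ["spatial", "web_of_lies_v3", "zebra_puzzle"],
}
```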


NormScore - Livebench Calculation Method and Advantages

NormScore - Livebench is a score for comparing model performance, computed by normalizing LiveBench.ai data. Because the individual benchmarks vary in difficulty, normalizing each one before averaging corrects for those differences and yields more discriminative comparisons between models.

The calculation method is as follows:

1. Check the scores of the models within each benchmark.

2. The highest score on each benchmark is normalized to 100 points, and the scores of the other models are calculated as ratios relative to that highest score.

   Normalization Formula
   \( S_{norm} = \frac{S_{raw}}{S_{max}} \times 100 \)
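
   For example, with purely illustrative numbers (not taken from the tables above): if the best model scores 80.0 on a benchmark and another model scores 72.0, the second model's normalized score is
   \( S_{norm} = \frac{72.0}{80.0} \times 100 = 90.0 \)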
3. Calculate the average of the normalized benchmark scores for each model to produce the primary NormScore.

   Primary NormScore Formula
   \( NormScore_{1st} = \frac{\sum S_{norm}}{N_{benchmarks}} \)
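
   Continuing the illustration: a model whose normalized scores on three hypothetical benchmarks are 90.0, 100.0, and 85.0 would have
   \( NormScore_{1st} = \frac{90.0 + 100.0 + 85.0}{3} \approx 91.7 \)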
4. Find the model with the highest primary NormScore.

5. Look up that top-ranking model's average score in the original LiveBench.ai data.

6. Calculate the adjustment ratio by dividing the top model's original average score by its primary NormScore.

   Adjustment Ratio Formula
   \( Adjustment\_Ratio = \frac{Avg\_Score_{top\ model}}{NormScore_{1st,\ top\ model}} \)
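
   With illustrative numbers again: if the top-ranked model's original LiveBench.ai average were 74.4 and its primary NormScore were 96.8, then
   \( Adjustment\_Ratio = \frac{74.4}{96.8} \approx 0.769 \)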
7. The final NormScore - Livebench and adjusted category scores are calculated by multiplying the primary NormScore and category average scores of all models by this adjustment ratio.

   Final NormScore Formula
   \( NormScore_{Livebench} = NormScore_{1st} \times Adjustment\_Ratio \)

Through this approach, NormScore - Livebench reflects relative performance within each benchmark while preserving the average score scale of the original LiveBench.ai data, which gives a more accurate picture of each model's actual capabilities.
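
To make the seven steps concrete, the sketch below recomputes the score from a small table of raw per-benchmark scores. It is a minimal reimplementation written for this page, not code from LiveBench.ai or Topllms.com; the variable names and the placeholder scores are assumptions made here.

```python
# Minimal sketch of the NormScore - Livebench calculation described above.
# The per-benchmark scores are placeholders, not real LiveBench.ai data.
raw_scores = {
    "Model A": {"Coding": 78.0, "Mathematics": 85.0, "Reasoning": 93.0},
    "Model B": {"Coding": 72.0, "Mathematics": 88.0, "Reasoning": 88.0},
    "Model C": {"Coding": 61.0, "Mathematics": 72.0, "Reasoning": 63.0},
}
benchmarks = sorted({b for scores in raw_scores.values() for b in scores})

# Step 2: normalize each benchmark so the best-scoring model gets 100 points.
max_per_benchmark = {b: max(s[b] for s in raw_scores.values()) for b in benchmarks}
normalized = {
    model: {b: scores[b] / max_per_benchmark[b] * 100 for b in benchmarks}
    for model, scores in raw_scores.items()
}

# Step 3: primary NormScore = mean of a model's normalized benchmark scores.
primary = {model: sum(s.values()) / len(s) for model, s in normalized.items()}

# Steps 4-6: adjustment ratio from the top model's original average score.
top_model = max(primary, key=primary.get)
top_original_avg = sum(raw_scores[top_model].values()) / len(benchmarks)
adjustment_ratio = top_original_avg / primary[top_model]

# Step 7: final NormScore - Livebench for every model.
norm_score = {model: p * adjustment_ratio for model, p in primary.items()}

for model, score in sorted(norm_score.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.3f}")
```

Applied to the full LiveBench.ai score table instead of the placeholder data, the same steps should reproduce the NormScore - Livebench column shown at the top of the page.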

License Information

The LiveBench.ai data provided on this website is made available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Terms of Use:

Attribution

You must give appropriate credit, provide a link to the license, and indicate if changes were made.

ShareAlike

If you remix, transform, or build upon this material, you must distribute the resulting work under the same license.

Terms of Reuse:

When reusing, redistributing, or creating derivative works from this data, you must include the following attribution:

1. Source Data: LiveBench.ai
2. Processed and Provided by: Topllms.com

and the same license (CC BY-SA 4.0) must be applied.