Top Model Performance Overview

Top Model Benchmark: NormScore - LiveBench

| Rank | Model | NormScore - LiveBench | Agentic Coding | Coding | Data Analysis | Instruction Following | Language | Mathematics | Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | o3 High | 74.416 | 75.198 | 74.690 | 70.842 | 77.254 | 72.556 | 72.775 | 75.551 |
| 2 | Gemini 2.5 Pro Preview | 71.968 | 64.295 | 70.992 | 75.861 | 74.844 | 70.194 | 75.954 | 71.649 |
| 3 | Claude 4 Opus Thinking | 71.033 | 57.902 | 71.379 | 76.350 | 72.377 | 71.600 | 75.683 | 73.378 |
| 4 | o3 Medium | 70.102 | 55.646 | 75.849 | 71.949 | 75.564 | 69.613 | 69.069 | 73.733 |
| 5 | Claude 4 Sonnet Thinking | 69.497 | 50.759 | 71.673 | 74.800 | 72.068 | 68.579 | 73.044 | 77.194 |
| 6 | o4-Mini High | 68.986 | 52.262 | 77.891 | 73.109 | 76.158 | 61.483 | 72.566 | 71.387 |
| 7 | Gemini 2.5 Pro Preview (2025-06-05 Max Thinking) | 68.074 | 39.479 | 71.967 | 77.684 | 69.335 | 73.707 | 72.119 | 76.307 |
| 8 | Claude 3.7 Sonnet Thinking | 65.873 | 48.503 | 71.286 | 74.460 | 72.832 | 66.876 | 67.699 | 61.800 |
| 9 | Gemini 2.5 Pro Preview | 65.074 | 21.807 | 68.858 | 77.851 | 70.432 | 72.865 | 71.500 | 75.940 |
| 10 | DeepSeek R1 | 64.970 | 31.207 | 69.538 | 76.825 | 71.665 | 62.045 | 72.998 | 73.755 |
| 11 | Claude 4 Opus | 64.257 | 52.639 | 71.673 | 69.958 | 70.254 | 74.036 | 67.806 | 45.810 |
| 12 | o4-Mini Medium | 62.919 | 34.592 | 72.262 | 72.432 | 73.291 | 57.839 | 69.208 | 63.635 |
| 13 | Gemini 2.5 Flash Preview | 61.360 | 30.832 | 61.187 | 74.817 | 71.268 | 56.912 | 72.059 | 63.568 |
| 14 | DeepSeek R1 | 60.850 | 29.327 | 74.101 | 73.507 | 72.104 | 52.611 | 66.648 | 62.539 |
| 15 | Qwen 3 235B A22B | 60.569 | 18.423 | 64.682 | 73.075 | 78.621 | 57.033 | 68.491 | 63.181 |
| 16 | Grok 3 Mini Beta (High) | 59.890 | 30.455 | 53.037 | 67.533 | 70.540 | 57.643 | 65.659 | 71.078 |
| 17 | Gemini 2.5 Flash Preview | 59.555 | 29.327 | 58.759 | 68.258 | 70.826 | 59.024 | 69.985 | 59.586 |
| 18 | Claude 4 Sonnet | 59.423 | 31.583 | 76.236 | 67.717 | 69.255 | 66.067 | 65.692 | 44.504 |
| 19 | Qwen 3 32B | 59.283 | 14.663 | 62.549 | 72.116 | 76.345 | 53.180 | 68.412 | 67.395 |
| 20 | Claude 3.7 Sonnet | 56.673 | 37.975 | 72.354 | 63.237 | 68.547 | 62.597 | 55.647 | 39.808 |
| 21 | Qwen 3 30B A3B | 55.454 | 16.543 | 46.231 | 70.108 | 74.630 | 52.942 | 65.304 | 57.838 |
| 22 | GPT-4.5 Preview | 54.959 | 23.687 | 74.101 | 62.494 | 64.766 | 62.856 | 58.234 | 44.198 |
| 23 | Grok 3 Beta | 53.018 | 18.423 | 71.673 | 59.423 | 75.952 | 52.945 | 53.885 | 39.534 |
| 24 | DeepSeek V3.1 | 52.294 | 20.303 | 67.110 | 63.786 | 73.092 | 46.518 | 61.130 | 35.955 |
| 25 | GPT-4.1 | 51.961 | 18.423 | 71.286 | 68.847 | 68.998 | 53.374 | 53.200 | 35.988 |
| 26 | ChatGPT-4o | 50.521 | 18.423 | 75.463 | 69.048 | 64.410 | 49.122 | 47.469 | 39.575 |
| 27 | Claude 3.5 Sonnet | 49.100 | 23.687 | 71.967 | 59.464 | 62.123 | 54.588 | 43.610 | 34.998 |
| 28 | Qwen2.5 Max | 48.522 | 7.144 | 65.069 | 67.927 | 67.586 | 57.361 | 48.933 | 31.267 |
| 29 | Mistral Medium 3 | 47.138 | 20.303 | 59.917 | 59.021 | 63.941 | 44.307 | 51.104 | 33.995 |
| 30 | GPT-4.1 Mini | 47.079 | 10.904 | 70.220 | 61.915 | 62.995 | 37.203 | 50.177 | 43.494 |
| 31 | Llama 4 Maverick 17B 128E Instruct | 45.467 | 7.144 | 52.742 | 52.063 | 67.869 | 48.015 | 51.967 | 35.620 |
| 32 | Phi-4 Reasoning Plus | 44.739 | 5.640 | 58.961 | 55.110 | 65.560 | 29.314 | 53.040 | 46.802 |
| 33 | DeepSeek R1 Distill Llama 70B | 44.629 | 7.520 | 45.366 | 62.851 | 62.660 | 36.017 | 49.866 | 48.434 |
| 34 | GPT-4o | 43.572 | 12.783 | 67.496 | 64.790 | 58.202 | 44.108 | 35.496 | 32.300 |
| 35 | Gemini 2.0 Flash Lite | 43.164 | 5.640 | 57.784 | 68.007 | 68.731 | 33.389 | 47.072 | 26.159 |
| 36 | Hunyuan Turbos | 41.116 | 3.760 | 49.045 | 48.948 | 68.308 | 33.843 | 49.234 | 30.865 |
| 37 | Gemma 3 27B | 40.190 | 7.144 | 47.684 | 39.495 | 67.205 | 40.639 | 44.605 | 27.823 |
| 38 | Mistral Large | 39.770 | 1.880 | 61.279 | 55.086 | 60.862 | 40.592 | 36.492 | 27.444 |
| 39 | Qwen2.5 72B Instruct Turbo | 39.238 | 3.760 | 55.834 | 52.716 | 57.767 | 36.282 | 44.498 | 27.661 |
| 40 | Mistral Small | 38.191 | 12.783 | 48.365 | 54.298 | 57.068 | 34.309 | 32.974 | 30.006 |
| 41 | DeepSeek R1 Distill Qwen 32B | 38.140 | 5.640 | 45.752 | 51.763 | 49.938 | 29.706 | 51.188 | 36.138 |
| 42 | Claude 3.5 Haiku | 36.339 | 7.520 | 51.768 | 54.953 | 55.439 | 38.798 | 29.747 | 21.131 |
| 43 | GPT-4.1 Nano | 36.153 | 7.144 | 61.573 | 44.721 | 51.557 | 29.293 | 36.258 | 28.721 |
| 44 | Command R Plus | 28.108 | 1.880 | 26.418 | 47.410 | 51.556 | 30.501 | 19.539 | 17.507 |
| 45 | Command R | 25.641 | 1.880 | 25.443 | 38.400 | 49.807 | 27.597 | 15.845 | 16.648 |

Category Performance Comparison

Category Scores by Model

| Model | Agentic Coding | Coding | Data Analysis | Instruction Following | Language | Mathematics | Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| o3 High | 36.667 | 76.715 | 67.020 | 86.175 | 75.996 | 85.004 | 93.333 |
| Gemini 2.5 Pro Preview | 30.000 | 72.872 | 68.848 | 83.504 | 71.811 | 88.628 | 88.250 |
| Claude 4 Opus Thinking | 33.333 | 73.255 | 70.731 | 80.742 | 73.721 | 88.247 | 90.472 |
| o3 Medium | 28.333 | 77.863 | 68.193 | 84.321 | 73.481 | 80.657 | 91.000 |
| Claude 4 Sonnet Thinking | 30.000 | 73.576 | 69.837 | 80.434 | 70.188 | 85.250 | 95.250 |
| o4-Mini High | 28.333 | 79.976 | 68.328 | 84.958 | 66.055 | 84.895 | 88.111 |
| Gemini 2.5 Pro Preview (2025-06-05 Max Thinking) | 20.000 | 73.898 | 71.501 | 77.354 | 75.440 | 84.193 | 94.278 |
| Claude 3.7 Sonnet Thinking | 25.000 | 73.194 | 69.107 | 81.254 | 68.269 | 78.999 | 76.167 |
| Gemini 2.5 Pro Preview | 13.333 | 70.698 | 71.597 | 78.538 | 74.522 | 83.329 | 93.722 |
| DeepSeek R1 | 21.667 | 71.402 | 71.539 | 79.954 | 64.823 | 85.258 | 91.083 |

Categories and Benchmarks

| Category | Benchmarks |
| --- | --- |
| Agentic Coding | javascript, python, typescript |
| Coding | code_completion, code_generation |
| Data Analysis | tablejoin, tablereformat |
| Instruction Following | paraphrase, simplify, story_generation, summarize |
| Language | connections, plot_unscrambling, typos |
| Mathematics | AMPS_Hard, math_comp, olympiad |
| Reasoning | spatial, web_of_lies_v3, zebra_puzzle |
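
The category-to-benchmark mapping above can be written down as a small lookup structure. The sketch below is illustrative only: the name `CATEGORY_BENCHMARKS` and the helper `category_averages` are hypothetical, the category names are English renderings, and it assumes that a category score is the plain mean of that category's benchmark scores, which this page does not state explicitly.

```python
# Category -> benchmark names, copied from the table above.
CATEGORY_BENCHMARKS = {
    "Agentic Coding": ["javascript", "python", "typescript"],
    "Coding": ["code_completion", "code_generation"],
    "Data Analysis": ["tablejoin", "tablereformat"],
    "Instruction Following": ["paraphrase", "simplify", "story_generation", "summarize"],
    "Language": ["connections", "plot_unscrambling", "typos"],
    "Mathematics": ["AMPS_Hard", "math_comp", "olympiad"],
    "Reasoning": ["spatial", "web_of_lies_v3", "zebra_puzzle"],
}


def category_averages(benchmark_scores):
    """Average one model's per-benchmark scores within each category.

    Assumption: a category score is the unweighted mean of its benchmarks.
    `benchmark_scores` maps every benchmark name to a numeric score.
    """
    return {
        category: sum(benchmark_scores[name] for name in names) / len(names)
        for category, names in CATEGORY_BENCHMARKS.items()
    }
```

Calling `category_averages` with a dictionary that holds one score per benchmark returns one averaged score per category.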

How NormScore - LiveBench Is Calculated and Its Advantages

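NormScore - LiveBench is a score for comparing model performance on a standardized basis, using data from LiveBench.ai. Normalization corrects for differences in difficulty across the individual benchmarks, which makes comparisons between models more discriminating.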

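The calculation proceeds as follows.

1. Collect each model's score on every benchmark.

2. Normalize each benchmark so that its highest score becomes 100; every other model's score is scaled in proportion to that highest score.

Normalization formula
\( S_{norm} = \frac{S_{raw}}{S_{max}} \times 100 \)

3. Average each model's normalized benchmark scores to obtain its first-pass NormScore.

First-pass NormScore formula
\( NormScore_{1st} = \frac{\sum S_{norm}}{N_{benchmarks}} \)

4. Identify the model with the highest first-pass NormScore.

5. Look up that top-ranked model's average score in the original LiveBench.ai data.

6. Divide the top-ranked model's original average score by its own first-pass NormScore to obtain the adjustment ratio.

Adjustment ratio formula
\( Adjustment\_Ratio = \frac{Avg\_Score_{1st}}{NormScore_{1st\_1st}} \)

Here \( Avg\_Score_{1st} \) is the top-ranked model's original average score, and \( NormScore_{1st\_1st} \) is that same model's first-pass NormScore.

7. Multiply every model's first-pass NormScore and category average scores by this adjustment ratio to obtain the final NormScore - LiveBench and the adjusted category scores.

Final NormScore formula
\( NormScore_{Livebench} = NormScore_{1st} \times Adjustment\_Ratio \)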
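
The seven steps can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the function name `norm_scores`, the toy models and benchmarks, and their scores are all hypothetical. It assumes every model has a score on every benchmark, and that the top model's "original average score" is the plain mean of its raw benchmark scores.

```python
from statistics import mean


def norm_scores(raw):
    """Return (final NormScores, adjustment ratio).

    `raw` maps model name -> {benchmark name -> raw score}.
    """
    benchmarks = sorted({b for scores in raw.values() for b in scores})

    # Step 2: scale each benchmark so that its best score becomes 100.
    best = {b: max(scores[b] for scores in raw.values()) for b in benchmarks}
    normalized = {
        model: {b: scores[b] / best[b] * 100 for b in benchmarks}
        for model, scores in raw.items()
    }

    # Step 3: first-pass NormScore = mean of the normalized scores.
    first_pass = {model: mean(s.values()) for model, s in normalized.items()}

    # Step 4: model with the highest first-pass NormScore.
    top = max(first_pass, key=first_pass.get)

    # Step 5 (assumption): raw average = mean of the top model's raw scores.
    raw_avg_top = mean(raw[top].values())

    # Step 6: adjustment ratio.
    ratio = raw_avg_top / first_pass[top]

    # Step 7: rescale every first-pass NormScore by the same ratio.
    final = {model: score * ratio for model, score in first_pass.items()}
    return final, ratio


# Hypothetical toy data: two models, two benchmarks.
toy_raw = {
    "model_a": {"bench_x": 80.0, "bench_y": 40.0},
    "model_b": {"bench_x": 60.0, "bench_y": 50.0},
}
toy_final, toy_ratio = norm_scores(toy_raw)
print({model: round(score, 3) for model, score in toy_final.items()})
```

In this toy case the first-pass scores are 90 and 87.5, the ratio is 60 / 90, and the final scores come out at about 60.0 and 58.333. The top model lands back on its raw average of 60, while the second model's score reflects how close it came to the best result on each benchmark.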

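In this way, NormScore - LiveBench keeps the scale of the average scores in the original LiveBench.ai data while reflecting relative performance within each benchmark, which is intended to give a more accurate picture of each model's actual capability.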
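
One consequence of the adjustment step is that the top-ranked model's final score equals its original average, because multiplying its first-pass NormScore by the ratio cancels that NormScore out. The check below is a sketch under one assumption: that the original average is the plain mean of the seven category scores listed for o3 High in the per-model category table. The variable names are illustrative.

```python
# Category scores for o3 High, copied from the per-model category table
# (Agentic Coding, Coding, Data Analysis, Instruction Following,
#  Language, Mathematics, Reasoning).
o3_high_categories = [36.667, 76.715, 67.020, 86.175, 75.996, 85.004, 93.333]

# Assumption: the original average is the unweighted mean of these seven.
o3_high_average = sum(o3_high_categories) / len(o3_high_categories)

print(round(o3_high_average, 3))
```

The mean works out to about 74.416, which matches the NormScore - LiveBench reported for o3 High in the ranking table.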

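License Information

Terms of use:

Attribution

You must give appropriate credit, provide a link to the license, and indicate if changes were made.

ShareAlike

If you remix, transform, or build upon this material, you must distribute the resulting work under the same license.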

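Conditions for reuse:

When reusing, redistributing, or creating derivative works based on this data, you must credit the following sources:

1. Original data: LiveBench.ai
2. Processing and provision: Topllms.com

The same license (CC BY-SA 4.0) must also be applied.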