Top Model Performance Overview

Top Model Benchmark: NormScore - LiveBench

Rank Model NormScore - LiveBench Agentic Coding Coding Data Analysis IF Language Math Reasoning
1 o3 High 74.416 75.198 74.690 70.842 77.254 72.556 72.775 75.551
2 Gemini 2.5 Pro Preview 71.968 64.295 70.992 75.861 74.844 70.194 75.954 71.649
3 Claude 4 Opus Thinking 71.033 57.902 71.379 76.350 72.377 71.600 75.683 73.378
4 o3 Medium 70.102 55.646 75.849 71.949 75.564 69.613 69.069 73.733
5 Claude 4 Sonnet Thinking 69.497 50.759 71.673 74.800 72.068 68.579 73.044 77.194
6 o4-Mini High 68.986 52.262 77.891 73.109 76.158 61.483 72.566 71.387
7 Gemini 2.5 Pro Preview (2025-06-05 Max Thinking) 68.074 39.479 71.967 77.684 69.335 73.707 72.119 76.307
8 Claude 3.7 Sonnet Thinking 65.873 48.503 71.286 74.460 72.832 66.876 67.699 61.800
9 Gemini 2.5 Pro Preview 65.074 21.807 68.858 77.851 70.432 72.865 71.500 75.940
10 DeepSeek R1 64.970 31.207 69.538 76.825 71.665 62.045 72.998 73.755
11 Claude 4 Opus 64.257 52.639 71.673 69.958 70.254 74.036 67.806 45.810
12 o4-Mini Medium 62.919 34.592 72.262 72.432 73.291 57.839 69.208 63.635
13 Gemini 2.5 Flash Preview 61.360 30.832 61.187 74.817 71.268 56.912 72.059 63.568
14 DeepSeek R1 60.850 29.327 74.101 73.507 72.104 52.611 66.648 62.539
15 Qwen 3 235B A22B 60.569 18.423 64.682 73.075 78.621 57.033 68.491 63.181
16 Grok 3 Mini Beta (High) 59.890 30.455 53.037 67.533 70.540 57.643 65.659 71.078
17 Gemini 2.5 Flash Preview 59.555 29.327 58.759 68.258 70.826 59.024 69.985 59.586
18 Claude 4 Sonnet 59.423 31.583 76.236 67.717 69.255 66.067 65.692 44.504
19 Qwen 3 32B 59.283 14.663 62.549 72.116 76.345 53.180 68.412 67.395
20 Claude 3.7 Sonnet 56.673 37.975 72.354 63.237 68.547 62.597 55.647 39.808
21 Qwen 3 30B A3B 55.454 16.543 46.231 70.108 74.630 52.942 65.304 57.838
22 GPT-4.5 Preview 54.959 23.687 74.101 62.494 64.766 62.856 58.234 44.198
23 Grok 3 Beta 53.018 18.423 71.673 59.423 75.952 52.945 53.885 39.534
24 DeepSeek V3.1 52.294 20.303 67.110 63.786 73.092 46.518 61.130 35.955
25 GPT-4.1 51.961 18.423 71.286 68.847 68.998 53.374 53.200 35.988
26 ChatGPT-4o 50.521 18.423 75.463 69.048 64.410 49.122 47.469 39.575
27 Claude 3.5 Sonnet 49.100 23.687 71.967 59.464 62.123 54.588 43.610 34.998
28 Qwen2.5 Max 48.522 7.144 65.069 67.927 67.586 57.361 48.933 31.267
29 Mistral Medium 3 47.138 20.303 59.917 59.021 63.941 44.307 51.104 33.995
30 GPT-4.1 Mini 47.079 10.904 70.220 61.915 62.995 37.203 50.177 43.494
31 Llama 4 Maverick 17B 128E Instruct 45.467 7.144 52.742 52.063 67.869 48.015 51.967 35.620
32 Phi-4 Reasoning Plus 44.739 5.640 58.961 55.110 65.560 29.314 53.040 46.802
33 DeepSeek R1 Distill Llama 70B 44.629 7.520 45.366 62.851 62.660 36.017 49.866 48.434
34 GPT-4o 43.572 12.783 67.496 64.790 58.202 44.108 35.496 32.300
35 Gemini 2.0 Flash Lite 43.164 5.640 57.784 68.007 68.731 33.389 47.072 26.159
36 Hunyuan Turbos 41.116 3.760 49.045 48.948 68.308 33.843 49.234 30.865
37 Gemma 3 27B 40.190 7.144 47.684 39.495 67.205 40.639 44.605 27.823
38 Mistral Large 39.770 1.880 61.279 55.086 60.862 40.592 36.492 27.444
39 Qwen2.5 72B Instruct Turbo 39.238 3.760 55.834 52.716 57.767 36.282 44.498 27.661
40 Mistral Small 38.191 12.783 48.365 54.298 57.068 34.309 32.974 30.006
41 DeepSeek R1 Distill Qwen 32B 38.140 5.640 45.752 51.763 49.938 29.706 51.188 36.138
42 Claude 3.5 Haiku 36.339 7.520 51.768 54.953 55.439 38.798 29.747 21.131
43 GPT-4.1 Nano 36.153 7.144 61.573 44.721 51.557 29.293 36.258 28.721
44 Command R Plus 28.108 1.880 26.418 47.410 51.556 30.501 19.539 17.507
45 Command R 25.641 1.880 25.443 38.400 49.807 27.597 15.845 16.648

Category Performance Comparison

Category Scores by Model

Model Agentic Coding Coding Data Analysis IF Language Math Reasoning
o3 High 36.667 76.715 67.020 86.175 75.996 85.004 93.333
Gemini 2.5 Pro Preview 30.000 72.872 68.848 83.504 71.811 88.628 88.250
Claude 4 Opus Thinking 33.333 73.255 70.731 80.742 73.721 88.247 90.472
o3 Medium 28.333 77.863 68.193 84.321 73.481 80.657 91.000
Claude 4 Sonnet Thinking 30.000 73.576 69.837 80.434 70.188 85.250 95.250
o4-Mini High 28.333 79.976 68.328 84.958 66.055 84.895 88.111
Gemini 2.5 Pro Preview (2025-06-05 Max Thinking) 20.000 73.898 71.501 77.354 75.440 84.193 94.278
Claude 3.7 Sonnet Thinking 25.000 73.194 69.107 81.254 68.269 78.999 76.167
Gemini 2.5 Pro Preview 13.333 70.698 71.597 78.538 74.522 83.329 93.722
DeepSeek R1 21.667 71.402 71.539 79.954 64.823 85.258 91.083

Categories and Benchmarks

Category Benchmarks
Agentic Coding javascript, python, typescript
Coding code_completion, code_generation
Data Analysis tablejoin, tablereformat
IF (Instruction Following) paraphrase, simplify, story_generation, summarize
Language connections, plot_unscrambling, typos
Math AMPS_Hard, math_comp, olympiad
Reasoning spatial, web_of_lies_v3, zebra_puzzle
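
For readers who want to work with this grouping programmatically, the mapping can be written as a plain dictionary. This is a minimal sketch: the benchmark names are copied from the table, while the helper `category_averages` and the idea that a category score is the simple mean of its benchmarks are assumptions, not something the page states.

```python
from statistics import mean

# Category -> benchmark names, copied from the table above.
CATEGORY_BENCHMARKS = {
    "Agentic Coding": ["javascript", "python", "typescript"],
    "Coding": ["code_completion", "code_generation"],
    "Data Analysis": ["tablejoin", "tablereformat"],
    "IF": ["paraphrase", "simplify", "story_generation", "summarize"],
    "Language": ["connections", "plot_unscrambling", "typos"],
    "Math": ["AMPS_Hard", "math_comp", "olympiad"],
    "Reasoning": ["spatial", "web_of_lies_v3", "zebra_puzzle"],
}


def category_averages(benchmark_scores):
    """Average one model's benchmark scores within each category.

    benchmark_scores: {benchmark name: score}. Taking the unweighted mean
    is an assumption made for illustration.
    """
    return {
        category: mean(benchmark_scores[name] for name in names)
        for category, names in CATEGORY_BENCHMARKS.items()
    }


# Hypothetical input: every benchmark set to 50.0, only to exercise the helper.
demo_scores = {
    name: 50.0 for names in CATEGORY_BENCHMARKS.values() for name in names
}
demo_averages = category_averages(demo_scores)
```

The table lists 20 benchmarks across 7 categories, so `demo_scores` has 20 entries and `demo_averages` has 7.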

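How NormScore - LiveBench Is Calculated and Its Advantages

NormScore - LiveBench is a score for comparing model performance on a normalized basis, using data from LiveBench.ai. Normalization corrects for differences in difficulty between the individual benchmarks, which makes differences between models easier to distinguish.

The calculation proceeds as follows.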

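1. Collect each model's score on every benchmark.

2. Normalize each benchmark so that its highest score becomes 100, and scale every other model's score in proportion to that highest score.

Normalization formula
\( S_{norm} = \frac{S_{raw}}{S_{max}} \times 100 \)

3. Average each model's normalized benchmark scores to obtain its first-pass NormScore.

First-pass NormScore formula
\( NormScore_{1st} = \frac{\sum S_{norm}}{N_{benchmarks}} \)

4. Identify the model with the highest first-pass NormScore.

5. Look up that first-place model's average score in the raw LiveBench.ai data.

6. Divide the first-place model's raw average score by its first-pass NormScore to obtain the adjustment ratio. In the formula below, \( Avg\_Score_{1st} \) is the raw average of the first-place model and \( NormScore_{1st\_1st} \) is that same model's first-pass NormScore.

Adjustment ratio formula
\( Adjustment\_Ratio = \frac{Avg\_Score_{1st}}{NormScore_{1st\_1st}} \)

7. Multiply every model's first-pass NormScore and category average scores by this adjustment ratio to obtain the final NormScore - LiveBench and the adjusted category scores.

Final NormScore formula
\( NormScore_{Livebench} = NormScore_{1st} \times Adjustment\_Ratio \)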
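
The seven steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the site's actual implementation: the function name `normscore_livebench`, the input layout, and the toy models and scores are all hypothetical. The raw averages are passed in by the caller because the text does not say how LiveBench.ai computes them.

```python
from statistics import mean


def normscore_livebench(raw_scores, raw_averages):
    """Return {model: final NormScore} following steps 1-7.

    raw_scores:   {model: {benchmark: raw score}}; every model is assumed
                  to have a score on every benchmark, and every benchmark
                  is assumed to have a positive maximum.
    raw_averages: {model: raw average score}, supplied by the caller.
    """
    benchmarks = list(next(iter(raw_scores.values())))

    # Step 2: the best raw score on each benchmark is the 100-point reference.
    best = {
        b: max(scores[b] for scores in raw_scores.values()) for b in benchmarks
    }

    # Step 3: first-pass NormScore = mean of the normalized benchmark scores.
    first_pass = {
        model: mean(scores[b] / best[b] * 100 for b in benchmarks)
        for model, scores in raw_scores.items()
    }

    # Steps 4-6: adjustment ratio taken from the top first-pass model.
    top = max(first_pass, key=first_pass.get)
    ratio = raw_averages[top] / first_pass[top]

    # Step 7: rescale every model by the same ratio.
    return {model: score * ratio for model, score in first_pass.items()}


# Hypothetical toy data: two models, two benchmarks.
raw = {
    "model_a": {"bench_x": 80.0, "bench_y": 40.0},
    "model_b": {"bench_x": 60.0, "bench_y": 50.0},
}
# Assumption for the demo only: raw average = mean of the benchmark scores.
avg = {model: mean(scores.values()) for model, scores in raw.items()}

final = normscore_livebench(raw, avg)
print({model: round(score, 3) for model, score in final.items()})
```

In this toy case the first-pass scores are 90 and 87.5, the adjustment ratio is 60 / 90, and the final scores come out at about 60.0 for `model_a` (its raw average) and 58.333 for `model_b`.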

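In this way, NormScore - LiveBench keeps the scale of the average scores in the raw LiveBench.ai data while reflecting relative performance within each benchmark, which is intended to give a more accurate picture of each model's actual capability.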
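
The scale claim can be checked against the published numbers. This is a sketch under two assumptions: that the per-model category table lists raw LiveBench.ai category scores, and that a model's raw average is the mean of its seven category scores. For the first-place model the first-pass NormScore cancels, so its final score equals its raw average:

\( NormScore_{Livebench} = NormScore_{1st\_1st} \times \frac{Avg\_Score_{1st}}{NormScore_{1st\_1st}} = Avg\_Score_{1st} \)

For o3 High, the seven category scores give

\( \frac{36.667 + 76.715 + 67.020 + 86.175 + 75.996 + 85.004 + 93.333}{7} = \frac{520.910}{7} \approx 74.416 \)

which matches the 74.416 listed for o3 High in the ranking table.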

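License Information

Terms of use:

Attribution

You must give appropriate credit, provide a link to the license, and indicate if changes were made.

ShareAlike

If you remix, transform, or build upon this material, you must distribute the resulting work under the same license terms.

Conditions for reuse:

When reusing, redistributing, or creating derivative works based on this material, you must credit the following sources:

1. Source data: LiveBench.ai
2. Processing and presentation: Topllms.com

The same license terms (CC BY-SA 4.0) must also be applied.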