Top Model Performance Overview

Top Model Benchmark: NormScore - Livebench

| Rank | Model Name | NormScore - Livebench | Agentic Coding | Coding | Data Analysis | IF | Language | Mathematics | Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Claude 4 Opus | 65.926 | 61.854 | 64.279 | 68.221 | 63.611 | 69.585 | 67.974 | 66.945 |
| 2 | Claude 4 Sonnet | 59.511 | 30.926 | 68.342 | 65.945 | 62.716 | 61.693 | 65.923 | 65.050 |
| 3 | Claude 3.7 Sonnet | 57.830 | 45.838 | 64.900 | 61.713 | 62.087 | 58.365 | 55.784 | 58.352 |
| 4 | GPT-4.5 Preview | 55.833 | 27.061 | 66.462 | 60.747 | 58.694 | 58.954 | 58.757 | 64.380 |
| 5 | Grok 3 Beta | 53.041 | 19.329 | 64.279 | 58.205 | 68.786 | 49.599 | 54.187 | 57.118 |
| 6 | DeepSeek V3.1 | 52.191 | 20.986 | 60.216 | 61.200 | 66.141 | 43.521 | 61.935 | 52.367 |
| 7 | GPT-4.1 | 52.055 | 19.329 | 63.960 | 66.856 | 62.524 | 49.943 | 54.166 | 53.017 |
| 8 | ChatGPT-4o | 50.828 | 19.329 | 67.704 | 67.074 | 58.360 | 45.549 | 48.402 | 57.907 |
| 9 | Claude 3.5 Sonnet | 49.348 | 27.061 | 64.581 | 58.090 | 56.260 | 50.788 | 43.491 | 50.849 |
| 10 | Qwen2.5 Max | 48.310 | 9.388 | 58.351 | 66.332 | 61.182 | 53.474 | 48.996 | 45.510 |
| 11 | GPT-4.1 Mini | 48.048 | 12.702 | 63.020 | 59.642 | 57.066 | 34.557 | 50.924 | 64.275 |
| 12 | Mistral Medium 3 | 47.190 | 20.986 | 53.683 | 56.340 | 57.942 | 41.271 | 51.866 | 49.873 |
| 13 | Llama 4 Maverick 17B 128E Instruct | 45.985 | 9.388 | 47.403 | 51.480 | 61.473 | 44.940 | 52.278 | 52.078 |
| 14 | GPT-4o | 43.537 | 14.359 | 60.534 | 62.607 | 52.721 | 40.975 | 35.926 | 46.600 |
| 15 | Gemini 2.0 Flash Lite | 42.450 | 4.970 | 51.803 | 66.102 | 62.212 | 30.878 | 47.523 | 38.076 |
| 16 | Hunyuan Turbos | 40.985 | 3.314 | 43.995 | 47.302 | 61.802 | 30.953 | 49.634 | 46.068 |
| 17 | Gemma 3 27B | 40.269 | 9.388 | 42.752 | 38.143 | 60.831 | 37.435 | 45.258 | 41.343 |
| 18 | Mistral Large | 39.234 | 1.657 | 54.926 | 53.176 | 55.130 | 37.634 | 36.244 | 40.448 |
| 19 | Qwen2.5 72B Instruct Turbo | 38.957 | 3.314 | 49.955 | 51.395 | 52.285 | 33.752 | 44.800 | 40.568 |
| 20 | Mistral Small | 38.475 | 14.359 | 43.373 | 52.796 | 51.680 | 31.664 | 33.235 | 44.220 |
| 21 | GPT-4.1 Nano | 36.307 | 9.388 | 55.228 | 41.440 | 46.686 | 27.019 | 36.438 | 42.509 |
| 22 | Claude 3.5 Haiku | 35.509 | 6.627 | 46.479 | 53.032 | 50.222 | 35.685 | 30.253 | 30.858 |
| 23 | Command R Plus | 27.517 | 1.657 | 23.710 | 44.996 | 46.747 | 28.105 | 19.835 | 25.715 |
| 24 | Command R | 24.988 | 1.657 | 22.786 | 36.477 | 45.116 | 25.156 | 15.836 | 24.272 |

Category Performance Comparison

Category Scores by Model

| Model | Agentic Coding | Coding | Data Analysis | IF | Language | Mathematics | Reasoning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 4 Opus | 31.667 | 73.576 | 66.510 | 78.379 | 76.114 | 78.790 | 56.444 |
| Claude 4 Sonnet | 25.000 | 78.245 | 64.684 | 77.246 | 67.181 | 76.390 | 54.861 |
| Claude 3.7 Sonnet | 21.667 | 74.280 | 59.965 | 76.492 | 63.194 | 64.654 | 49.111 |
| GPT-4.5 Preview | 15.000 | 76.072 | 60.070 | 72.325 | 64.759 | 67.940 | 54.417 |
| Grok 3 Beta | 13.333 | 73.576 | 55.629 | 84.738 | 53.797 | 62.752 | 48.528 |
| DeepSeek V3.1 | 15.000 | 68.907 | 64.019 | 81.471 | 46.823 | 71.437 | 44.278 |
| GPT-4.1 | 13.333 | 73.194 | 66.404 | 77.046 | 54.551 | 62.386 | 44.389 |
| ChatGPT-4o | 13.333 | 77.480 | 66.520 | 71.921 | 49.428 | 55.717 | 48.806 |
| Claude 3.5 Sonnet | 15.000 | 73.898 | 56.186 | 69.296 | 54.477 | 50.543 | 43.222 |
| Qwen2.5 Max | 3.333 | 66.794 | 64.271 | 75.346 | 58.369 | 56.868 | 38.528 |

Categories and Benchmarks

| Category | Benchmarks |
| --- | --- |
| Agentic Coding | javascript, python, typescript |
| Coding | code_completion, code_generation |
| Data Analysis | tablejoin, tablereformat |
| IF | paraphrase, simplify, story_generation, summarize |
| Language | connections, plot_unscrambling, typos |
| Mathematics | AMPS_Hard, math_comp, olympiad |
| Reasoning | spatial, web_of_lies_v3, zebra_puzzle |


NormScore - Livebench Calculation Method and Advantages

NormScore - LiveBench is a normalized score for comparing model performance, built on LiveBench.ai data. By normalizing each benchmark, it corrects for the varying difficulty of the individual benchmark metrics and yields more discriminative comparisons between models.

The calculation proceeds as follows:

1. Collect each model's raw score on every benchmark.

2. Normalize each benchmark so that its highest score becomes 100 points; every other model's score is expressed as a ratio of that maximum.

Normalization Formula
\( S_{norm} = \frac{S_{raw}}{S_{max}} \times 100 \)

3. Average each model's normalized benchmark scores to obtain its primary NormScore.

Primary NormScore Formula
\( NormScore_{primary} = \frac{\sum S_{norm}}{N_{benchmarks}} \)

4. Identify the model with the highest primary NormScore.

5. Look up that top model's average score in the original LiveBench.ai data.

6. Compute the adjustment ratio by dividing the top model's original average score by its primary NormScore.

Adjustment Ratio Formula
\( Adjustment\_Ratio = \frac{Avg\_Score_{top}}{NormScore_{primary,top}} \)

7. Multiply every model's primary NormScore and category average scores by this adjustment ratio to obtain the final NormScore - Livebench and the adjusted category scores.

Final NormScore Formula
\( NormScore_{Livebench} = NormScore_{primary} \times Adjustment\_Ratio \)

This approach reflects each model's relative performance within every benchmark while preserving the average-score scale of the original LiveBench.ai data, giving a more accurate picture of actual model capability.
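The seven steps above can be sketched in a few lines of Python. This is a minimal illustration of the procedure only; the model names and raw scores below are hypothetical, not LiveBench data.

```python
def normscore(raw_scores):
    """Compute final NormScores from {model: {benchmark: raw_score}}."""
    benchmarks = next(iter(raw_scores.values())).keys()

    # Step 2: normalize each benchmark so its top score becomes 100.
    maxes = {b: max(m[b] for m in raw_scores.values()) for b in benchmarks}
    norm = {
        model: {b: scores[b] / maxes[b] * 100 for b in benchmarks}
        for model, scores in raw_scores.items()
    }

    # Step 3: primary NormScore = mean of a model's normalized scores.
    primary = {m: sum(s.values()) / len(s) for m, s in norm.items()}

    # Steps 4-6: find the top model and rescale so it keeps its
    # original LiveBench average score.
    top = max(primary, key=primary.get)
    top_raw_avg = sum(raw_scores[top].values()) / len(raw_scores[top])
    ratio = top_raw_avg / primary[top]

    # Step 7: apply the adjustment ratio to every model.
    return {m: p * ratio for m, p in primary.items()}


scores = {
    "model_a": {"coding": 80.0, "math": 60.0},
    "model_b": {"coding": 40.0, "math": 90.0},
}
print(normscore(scores))
```

Note that the top-ranked model's final NormScore equals its raw average by construction (the ratio cancels), while every other model is scaled by the same factor.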

License Information

The LiveBench.ai data provided on this website is made available under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Terms of Use:

Attribution

You must give appropriate credit, provide a link to the license, and indicate if changes were made.

ShareAlike

If you remix, transform, or build upon this material, you must distribute the resulting work under the same license.

Terms of Reuse:

When reusing, redistributing, or creating derivative works from this data, you must include the following attribution:

1. Source Data: LiveBench.ai
2. Processed and Provided by: Topllms.com

The same license (CC BY-SA 4.0) must also be applied to the derivative work.