Chatbot Arena

LMSYS • June 24, 2025

This leaderboard is based on the following benchmarks:

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 3.1M+ user votes to compute Elo ratings.
  • MMLU - a test to measure a model’s multitask accuracy on 57 tasks.
  • Arena-Hard-Auto - an automatic evaluation tool for instruction-tuned LLMs.


Best Open LM

| Model | Arena Elo | MMLU | License |
|---|---|---|---|
| DeepSeek DeepSeek-R1-0528 | 1424 | 90.8 | MIT |
| Qwen Qwen3-235B-A22B-no-thinking | 1388 | 88.5 | Apache 2.0 |
| DeepSeek DeepSeek-V3-0324 | 1385 | 88.5 | MIT |
| Minimax Minimax-M1 | 1373 | | Apache 2.0 |
| Qwen Qwen3-235B-A22B | 1365 | 88.5 | Apache 2.0 |
| Google Gemma-3-27B-it | 1357 | | Gemma |

Full Leaderboard
| Model | Arena Elo | Coding | Vision | Arena-Hard | MMLU | Votes | Organization | License |
|---|---|---|---|---|---|---|---|---|
| 🥇 Gemini-2.5-Pro | 1477 | 1492 | 1342 | 96.4 | | 12327 | Google | Proprietary |
| 🥇 o3-2025-04-16 | 1428 | 1443 | 1296 | | | 18205 | OpenAI | Proprietary |
| 🥇 ChatGPT-4o-latest (2025-03-26) | 1428 | 1440 | 1308 | | | 22488 | OpenAI | Proprietary |
| 🥇 DeepSeek-R1-0528 | 1424 | 1431 | | 93.2 | 90.8 | 11871 | DeepSeek | MIT |
| 🥇 Grok-3-Preview-02-24 | 1422 | 1433 | | 92.7 | | 24316 | xAI | Proprietary |
| 🥇 Gemini-2.5-Flash | 1420 | 1430 | 1299 | | | 17535 | Google | Proprietary |
| 🥇 GPT-4.5-Preview | 1415 | 1418 | 1253 | | | 15271 | OpenAI | Proprietary |
| 🥈 Gemini-2.0-Flash-Thinking-Exp-01-21 | 1398 | 1381 | 1275 | | | 27618 | Google | Proprietary |
| 🥈 Gemini-2.0-Pro-Exp-02-05 | 1397 | 1396 | 1239 | | | 20120 | Google | Proprietary |
| 🥈 Qwen3-235B-A22B-no-thinking | 1388 | 1410 | | 95.6 | 88.5 | 12320 | Alibaba | Apache 2.0 |
| 🥈 GPT-4.1-2025-04-14 | 1386 | 1392 | 1274 | | | 16362 | OpenAI | Proprietary |
| 🥈 DeepSeek-V3-0324 | 1385 | 1400 | | 85.5 | 88.5 | 19091 | DeepSeek | MIT |
| 🥈 Hunyuan-Turbos-20250416 | 1376 | 1380 | | | | 7816 | Tencent | Proprietary |
| 🥈 DeepSeek-R1 | 1375 | 1381 | | 93.2 | 90.8 | 19430 | DeepSeek | MIT |
| 🥈 Claude Opus 4 (20250514) | 1373 | 1415 | 1231 | | | 18287 | Anthropic | Proprietary |
| 🥈 Minimax-M1 | 1373 | 1382 | | | | 3895 | MiniMax | Apache 2.0 |
| 🥈 Mistral Medium 3 | 1369 | 1389 | 1192 | | | 16637 | Mistral | Proprietary |
| 🥈 o1-2024-12-17 | 1367 | 1376 | 1228 | 92.1 | 91.8 | 29038 | OpenAI | Proprietary |
| 🥈 Qwen3-235B-A22B | 1365 | 1390 | | 95.6 | 88.5 | 13002 | Alibaba | Apache 2.0 |
| 🥈 Gemini-2.0-Flash-001 | 1364 | 1363 | 1215 | | | 35894 | Google | Proprietary |
| 🥈 Grok-3-Mini-beta | 1363 | 1386 | | | | 8715 | xAI | Proprietary |
| 🥈 o4-mini-2025-04-16 | 1363 | 1381 | 1251 | | | 16112 | OpenAI | Proprietary |
| 🥈 Qwen2.5-Max | 1363 | 1367 | | | | 31170 | Alibaba | Proprietary |
| 🥈 Gemma-3-27B-it | 1357 | 1338 | 1201 | | | 25323 | Google | Gemma |
| 🥈 Claude Sonnet 4 (20250514) | 1345 | 1384 | 1222 | | | 14984 | Anthropic | Proprietary |
| 🥈 o3-mini-high | 1342 | 1380 | | | | 19404 | OpenAI | Proprietary |
| 🥈 GPT-4.1-mini-2025-04-14 | 1339 | 1375 | 1237 | | | 15337 | OpenAI | Proprietary |
| 🥉 Gemma-3-12B-it | 1338 | 1308 | | | | 3976 | Google | Gemma |
| 🥉 DeepSeek-V3 | 1336 | 1337 | | 85.5 | 88.5 | 22841 | DeepSeek | DeepSeek |
| 🥉 QwQ-32B | 1334 | 1349 | | | | 17462 | Alibaba | Apache 2.0 |
| 🥉 Gemini-2.0-Flash-Lite | 1330 | 1338 | 1156 | | | 26104 | Google | Proprietary |
| 🥉 Amazon-Nova-Experimental-Chat-05-14 | 1328 | 1342 | | | | 5213 | Amazon | Proprietary |
| 🥉 Qwen-Plus-0125 | 1328 | 1337 | | | | 6055 | Alibaba | Proprietary |
| 🥉 GLM-4-Plus-0111 | 1328 | 1307 | | | | 6028 | Zhipu | Proprietary |
| 🥉 Command A (03-2025) | 1327 | 1335 | | | | 22851 | Cohere | CC-BY-NC-4.0 |
| 🥉 o3-mini | 1323 | 1363 | | | | 35063 | OpenAI | Proprietary |
| 🥉 Step-2-16K-Exp | 1322 | 1312 | | | | 5126 | StepFun | Proprietary |
| 🥉 o1-mini | 1321 | 1370 | | 92 | | 54951 | OpenAI | Proprietary |
| 🥉 Gemini-1.5-Pro-002 | 1320 | 1307 | 1222 | | | 58645 | Google | Proprietary |
| 🥉 Claude 3.7 Sonnet (thinking-32k) | 1317 | 1349 | 1220 | | | 24159 | Anthropic | Proprietary |
| 🥉 Hunyuan-Turbo-0110 | 1314 | 1333 | | | | 2510 | Tencent | Proprietary |
| 🥉 Llama-3.3-Nemotron-Super-49B-v1 | 1314 | 1318 | | 88.3 | 86 | 2371 | Nvidia | Nvidia |
| 🥉 Claude 3.7 Sonnet | 1309 | 1346 | 1208 | | | 28664 | Anthropic | Proprietary |
| 🥉 Grok-2-08-13 | 1305 | 1299 | | | 87.5 | 67084 | xAI | Proprietary |
| 🥉 Yi-Lightning | 1304 | 1319 | | 81.5 | | 28968 | 01 AI | Proprietary |
| 🥉 Gemma-3n-e4b-it | 1303 | 1274 | | | | 5282 | Google | Gemma |
| 🥉 GPT-4o-2024-05-13 | 1302 | 1309 | 1206 | 79.21 | 88.7 | 117747 | OpenAI | Proprietary |
| 🥉 Claude 3.5 Sonnet (20241022) | 1301 | 1341 | 1187 | 85.2 | 88.7 | 75986 | Anthropic | Proprietary |
| Deepseek-v2.5-1210 | 1297 | 1314 | | | | 7243 | DeepSeek | DeepSeek |
| Athene-v2-Chat-72B | 1293 | 1317 | | 85 | | 26074 | NexusFlow | NexusFlow |
| Llama-4-Maverick-17B-128E-Instruct | 1292 | 1309 | 1164 | | | 15906 | Meta | Llama 4 |
| Gemma-3-4B-it | 1292 | 1264 | | | | 4321 | Google | Gemma |
| Hunyuan-Large-2025-02-10 | 1289 | 1310 | | | | 3856 | Tencent | Proprietary |
| GPT-4o-mini-2024-07-18 | 1289 | 1299 | 1123 | 74.94 | 82 | 72536 | OpenAI | Proprietary |
| Gemini-1.5-Flash-002 | 1289 | 1271 | 1206 | | | 37021 | Google | Proprietary |
| GPT-4.1-nano-2025-04-14 | 1288 | 1311 | 1116 | | | 6302 | OpenAI | Proprietary |
| Llama-3.1-405B-Instruct-bf16 | 1286 | 1297 | | | 88.6 | 43788 | Meta | Llama 3.1 |
| Llama-3.1-Nemotron-70B-Instruct | 1286 | 1288 | | 84.9 | | 7577 | Nvidia | Llama 3.1 |
| Llama-3.1-405B-Instruct-fp8 | 1285 | 1293 | | 69.3 | 88.6 | 63038 | Meta | Llama 3.1 |
| Grok-2-Mini-08-13 | 1284 | 1279 | | | | 55442 | xAI | Proprietary |
| Yi-Lightning-lite | 1282 | 1283 | | | | 17067 | 01 AI | Proprietary |
| Qwen-Max-0919 | 1281 | 1296 | | | | 17432 | Alibaba | Qwen |
| Hunyuan-Standard-2025-02-10 | 1278 | 1287 | | | | 4014 | Tencent | Proprietary |
| Qwen2.5-72B-Instruct | 1275 | 1299 | | 78 | | 41519 | Alibaba | Qwen |
| Llama-3.3-70B-Instruct | 1275 | 1275 | | | | 46558 | Meta | Llama-3.3 |
| GPT-4-Turbo-2024-04-09 | 1274 | 1279 | 1151 | 82.63 | | 102133 | OpenAI | Proprietary |
| Mistral-Small-3.1-24B-Instruct-2503 | 1271 | 1288 | 1162 | | | 4963 | Mistral | Apache 2.0 |
| Llama-4-Scout-17B-16E-Instruct | 1271 | 1283 | 1157 | | | 4998 | Meta | Llama |
| Athene-70B | 1268 | 1270 | | 77.6 | | 20580 | NexusFlow | CC-BY-NC-4.0 |
| GPT-4-1106-preview | 1267 | 1269 | | | | 103748 | OpenAI | Proprietary |
| Mistral-Large-2411 | 1266 | 1283 | | 70.42 | | 29633 | Mistral | MRL |
| Llama-3.1-70B-Instruct | 1265 | 1268 | | 55.73 | 86 | 58637 | Meta | Llama 3.1 |
| Claude 3 Opus | 1265 | 1267 | 1076 | 60.36 | 86.8 | 202641 | Anthropic | Proprietary |
| magistral-medium-2506 | 1263 | 1323 | | | | 3089 | Mistral | Proprietary |
| Amazon Nova Pro 1.0 | 1262 | 1280 | 1044 | | | 26371 | Amazon | Proprietary |
| GPT-4-0125-preview | 1262 | 1260 | | 77.96 | | 97079 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-70B | 1262 | 1250 | | | | 3010 | Ai2 | Llama 3.1 |
| Claude 3.5 Haiku (20241022) | 1256 | 1285 | 1156 | | | 47551 | Anthropic | Proprietary |
| Reka-Core-20240904 | 1253 | 1238 | | | | 7948 | Reka AI | Proprietary |
| Gemini-1.5-Flash-001 | 1244 | 1248 | 1072 | 49.61 | 78.9 | 65661 | Google | Proprietary |
| Jamba-1.5-Large | 1239 | 1244 | | | 81.2 | 9125 | AI21 Labs | Jamba Open |
| Deepseek-v2-API-0628 | 1237 | 1258 | | | | 19508 | DeepSeek AI | DeepSeek |
| Gemma-2-27B-it | 1237 | 1226 | | 57.51 | | 79538 | Google | Gemma license |
| Qwen2.5-Coder-32B-Instruct | 1235 | 1278 | | | | 5730 | Alibaba | Apache 2.0 |
| Mistral-Small-24B-Instruct-2501 | 1235 | 1249 | | | | 15321 | Mistral | Apache 2.0 |
| Amazon Nova Lite 1.0 | 1234 | 1252 | 1061 | | | 20646 | Amazon | Proprietary |
| Gemma-2-9B-it-SimPO | 1234 | 1213 | | | | 10548 | Princeton | MIT |
| Command R+ (08-2024) | 1233 | 1198 | | | | 10535 | Cohere | CC-BY-NC-4.0 |
| Deepseek-Coder-v2-0724 | 1232 | 1283 | | 62.3 | | 11725 | DeepSeek | Proprietary |
| Gemini-1.5-Flash-8B-001 | 1230 | 1225 | 1107 | | | 37697 | Google | Proprietary |
| Llama-3.1-Nemotron-51B-Instruct | 1229 | 1227 | | | | 3889 | Nvidia | Llama 3.1 |
| Nemotron-4-340B-Instruct | 1227 | 1215 | | | | 20608 | Nvidia | Nvidia |
| Aya-Expanse-32B | 1227 | 1209 | | | | 28768 | Cohere | CC-BY-NC-4.0 |
| GLM-4-0520 | 1224 | 1233 | | 63.84 | | 10221 | Zhipu AI | Proprietary |
| Llama-3-70B-Instruct | 1224 | 1216 | | 46.57 | 82 | 163629 | Meta | Llama 3 |
| Phi-4 | 1223 | 1239 | | | | 25213 | Microsoft | MIT |
| OLMo-2-0325-32B-Instruct | 1223 | 1214 | | | | 3460 | Allen AI | Apache-2.0 |
| Reka-Flash-20240904 | 1223 | 1208 | | | | 8132 | Reka AI | Proprietary |
| Hunyuan-Large-Vision | 1219 | 1233 | 1187 | | | 3478 | Tencent | Proprietary |
| Claude 3 Sonnet | 1218 | 1229 | 1048 | 46.8 | 79 | 113067 | Anthropic | Proprietary |
| Amazon Nova Micro 1.0 | 1215 | 1227 | | | | 20654 | Amazon | Proprietary |
| Gemma-2-9B-it | 1209 | 1190 | | | | 57197 | Google | Gemma license |
| Hunyuan-Standard-256K | 1206 | 1243 | | | | 2901 | Tencent | Proprietary |
| Qwen2-72B-Instruct | 1205 | 1203 | | 46.86 | 84.2 | 38872 | Alibaba | Qianwen LICENSE |
| GPT-4-0314 | 1204 | 1212 | | 50 | 86.4 | 55962 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-8B | 1203 | 1195 | | | | 3074 | Ai2 | Llama 3.1 |
| Ministral-8B-2410 | 1200 | 1218 | | | | 5111 | Mistral | MRL |
| Claude 3 Haiku | 1197 | 1206 | 1000 | 41.47 | 75.2 | 122309 | Anthropic | Proprietary |
| Aya-Expanse-8B | 1197 | 1182 | | | | 10391 | Cohere | CC-BY-NC-4.0 |
| Command R (08-2024) | 1197 | 1178 | | | | 10851 | Cohere | CC-BY-NC-4.0 |
| DeepSeek-Coder-V2-Instruct | 1196 | 1256 | | | | 15753 | DeepSeek AI | DeepSeek License |
| Llama-3.1-8B-Instruct | 1193 | 1203 | | 21.34 | 73 | 52578 | Meta | Llama 3.1 |
| Jamba-1.5-Mini | 1193 | 1197 | | | 69.7 | 9274 | AI21 Labs | Jamba Open |
| GPT-4-0613 | 1181 | 1183 | | 37.9 | | 91614 | OpenAI | Proprietary |
| Qwen1.5-110B-Chat | 1179 | 1191 | | | 80.4 | 27430 | Alibaba | Qianwen LICENSE |
| Yi-1.5-34B-Chat | 1175 | 1179 | | | 76.8 | 25135 | 01 AI | Apache-2.0 |
| Llama-3-8B-Instruct | 1169 | 1162 | | 20.56 | 68.4 | 109056 | Meta | Llama 3 |
| InternLM2.5-20B-chat | 1166 | 1175 | | | | 10599 | InternLM | Other |
| Claude-1 | 1166 | 1152 | | | 77 | 21149 | Anthropic | Proprietary |
| Qwen1.5-72B-Chat | 1165 | 1176 | | 36.12 | 77.5 | 40658 | Alibaba | Qianwen LICENSE |
| Mixtral-8x22b-Instruct-v0.1 | 1165 | 1169 | | 36.36 | 77.8 | 53751 | Mistral | Apache 2.0 |
| Mistral Medium | 1165 | 1169 | | 31.9 | 75.3 | 35556 | Mistral | Proprietary |
| Gemma-2-2b-it | 1161 | 1124 | | | 51.3 | 48892 | Google | Gemma license |
| Granite-3.1-8B-Instruct | 1160 | 1190 | | | | 3289 | IBM | Apache 2.0 |
| Claude-2.0 | 1149 | 1151 | | 23.99 | 78.5 | 12763 | Anthropic | Proprietary |
| Gemini-1.0-Pro-001 | 1149 | 1119 | | | 71.8 | 18800 | Google | Proprietary |
| Zephyr-ORPO-141b-A35b-v0.1 | 1145 | 1141 | | | | 4854 | HuggingFace | Apache 2.0 |
| Qwen1.5-32B-Chat | 1143 | 1166 | | | 73.4 | 22765 | Alibaba | Qianwen LICENSE |
| Phi-3-Medium-4k-Instruct | 1140 | 1142 | | 33.37 | 78 | 26105 | Microsoft | MIT |
| Granite-3.1-2B-Instruct | 1137 | 1164 | | | | 3380 | IBM | Apache 2.0 |
| Claude-2.1 | 1136 | 1148 | | 22.77 | | 37699 | Anthropic | Proprietary |
| Starling-LM-7B-beta | 1136 | 1146 | | 23.01 | | 16676 | Nexusflow | Apache-2.0 |
| GPT-3.5-Turbo-0613 | 1134 | 1151 | | 24.82 | | 38955 | OpenAI | Proprietary |
| Mixtral-8x7B-Instruct-v0.1 | 1131 | 1131 | | 23.4 | 70.6 | 76126 | Mistral | Apache 2.0 |
| Claude-Instant-1 | 1129 | 1125 | | | 73.4 | 20631 | Anthropic | Proprietary |
| Yi-34B-Chat | 1129 | 1123 | | 23.15 | 73.5 | 15917 | 01 AI | Yi License |
| Qwen1.5-14B-Chat | 1126 | 1142 | | | 67.6 | 18687 | Alibaba | Qianwen LICENSE |
| WizardLM-70B-v1.0 | 1124 | 1087 | | | 63.7 | 8383 | Microsoft | Llama 2 |
| DBRX-Instruct-Preview | 1121 | 1134 | | 24.63 | 73.7 | 33743 | Databricks | DBRX LICENSE |
| Llama-3.2-3B-Instruct | 1120 | 1097 | | | | 8390 | Meta | Llama 3.2 |
| Phi-3-Small-8k-Instruct | 1119 | 1124 | | 29.77 | 75.7 | 18476 | Microsoft | MIT |
| Tulu-2-DPO-70B | 1116 | 1110 | | 14.99 | | 6658 | AllenAI/UW | AI2 ImpACT Low-risk |
| Granite-3.0-8B-Instruct | 1111 | 1114 | | | | 7002 | IBM | Apache 2.0 |
| Llama-2-70B-chat | 1110 | 1089 | | 11.55 | 63 | 39595 | Meta | Llama 2 |
| OpenChat-3.5-0106 | 1109 | 1119 | | | 65.8 | 12990 | OpenChat | Apache-2.0 |
| Vicuna-33B | 1108 | 1084 | | 8.63 | 59.2 | 22936 | LMSYS | Non-commercial |
| Snowflake Arctic Instruct | 1107 | 1093 | | 17.61 | 67.3 | 34173 | Snowflake | Apache 2.0 |
| Starling-LM-7B-alpha | 1106 | 1096 | | 12.8 | 63.9 | 10415 | UC Berkeley | CC-BY-NC-4.0 |
| Nous-Hermes-2-Mixtral-8x7B-DPO | 1102 | 1096 | | | | 3836 | NousResearch | Apache-2.0 |
| Gemma-1.1-7B-it | 1101 | 1101 | | 12.09 | 64.3 | 25070 | Google | Gemma license |
| NV-Llama2-70B-SteerLM-Chat | 1098 | 1039 | | | 68.5 | 3636 | Nvidia | Llama 2 |
| pplx-70B-online | 1095 | 1044 | | | | 6898 | Perplexity AI | Proprietary |
| DeepSeek-LLM-67B-Chat | 1094 | 1096 | | | 71.3 | 4988 | DeepSeek AI | DeepSeek License |
| OpenChat-3.5 | 1094 | 1070 | | | 64.3 | 8106 | OpenChat | Apache-2.0 |
| OpenHermes-2.5-Mistral-7B | 1092 | 1074 | | | | 5088 | NousResearch | Apache-2.0 |
| Granite-3.0-2B-Instruct | 1091 | 1104 | | | | 7191 | IBM | Apache 2.0 |
| Mistral-7B-Instruct-v0.2 | 1090 | 1090 | | 12.57 | | 20067 | Mistral | Apache-2.0 |
| Phi-3-Mini-4K-Instruct-June-24 | 1088 | 1098 | | | 70.9 | 12808 | Microsoft | MIT |
| Qwen1.5-7B-Chat | 1087 | 1106 | | | 61 | 4872 | Alibaba | Qianwen LICENSE |
| Phi-3-Mini-4k-Instruct | 1084 | 1102 | | | 68.8 | 21097 | Microsoft | MIT |
| Llama-2-13b-chat | 1081 | 1068 | | | 53.6 | 19722 | Meta | Llama 2 |
| SOLAR-10.7B-Instruct-v1.0 | 1080 | 1064 | | | 66.2 | 4286 | Upstage AI | CC-BY-NC-4.0 |
| Dolphin-2.2.1-Mistral-7B | 1080 | 1042 | | | | 1714 | Cognitive Computations | Apache-2.0 |
| WizardLM-13b-v1.2 | 1076 | 1042 | | | 52.7 | 7176 | Microsoft | Llama 2 |
| Llama-3.2-1B-Instruct | 1071 | 1063 | | | | 8523 | Meta | Llama 3.2 |
| Gemini-2.5-Flash-Lite-Preview-06-17-Thinking | | | 1238 | | | 1442 | Google | Proprietary |
| Qwen2.5-VL-32B-Instruct | | | 1212 | | | 1505 | Alibaba | Apache 2.0 |
| Step-1o-Vision-32k (highres) | | | 1185 | | | 2891 | StepFun | Proprietary |
| Qwen2.5-VL-72B-Instruct | | | 1168 | | | 3884 | Alibaba | Qwen |
| Pixtral-Large-2411 | | | 1153 | | | 5546 | Mistral | MRL |
| Qwen-VL-Max-1119 | | | 1128 | | | 1449 | Alibaba | Proprietary |
| Step-1V-32K | | | 1112 | | | 1553 | StepFun | Proprietary |
| Qwen2-VL-72b-Instruct | | | 1111 | | | 6028 | Alibaba | Qwen |
| Molmo-72B-0924 | | | 1076 | | | 3092 | AI2 | Apache 2.0 |
| Pixtral-12B-2409 | | | 1072 | | | 7623 | Mistral | Apache 2.0 |
| Llama-3.2-90B-Vision-Instruct | | | 1070 | | | 8829 | Meta | Llama 3.2 |
| InternVL2-26B | | | 1067 | | | 5265 | OpenGVLab | MIT |
| Hunyuan-Standard-Vision-2024-12-31 | | | 1063 | | | 811 | Tencent | Proprietary |
| Aya-Vision-32B | | | 1058 | | | 849 | Cohere | CC-BY-NC-4.0 |
| Qwen2-VL-7B-Instruct | | | 1054 | | | 5854 | Alibaba | Apache 2.0 |
| Yi-Vision | | | 1046 | | | 12370 | 01 AI | Proprietary |
| Llama-3.2-11B-Vision-Instruct | | | 1032 | | | 4893 | Meta | Llama 3.2 |

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. The latest and detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It has been useful for transforming pairwise human preferences into Elo ratings that serve as a predictor of win rates between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}~.$$
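In code, the expected score is simply a logistic curve in the rating difference. A minimal sketch (`expected_score` is an illustrative name, not code from the Arena notebooks):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Equal ratings give a 50% win probability; a 400-point edge gives ~91%.
print(expected_score(1200, 1200))  # 0.5
print(expected_score(1600, 1200))  # ~0.909
```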

The Elo rating system has been used by the international chess community to rank players for over 60 years. Standard Elo rating systems assume a player's performance changes over time, so an online algorithm is needed to capture such dynamics, with recent games weighing more than older games. Specifically, after each game, a player's rating is updated according to the difference between the predicted outcome and the actual outcome.

$$R'_A = R_A + K \cdot (S_A - E_A)~.$$
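A minimal sketch of one online Elo step (`elo_update` is an illustrative name; the actual Arena computation lives in the linked notebook):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One online Elo step: returns the updated ratings after a single game.
    score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # expected score for A
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains

# An upset (lower-rated A beats B) moves ratings more than an expected win would.
print(elo_update(1200, 1400, 1.0))  # A gains ~24 points, B loses the same
```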

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows for players' performance to change dynamically – it does not assume a fixed unknown value for a player's rating.

This ability to adapt is controlled by the parameter K, which determines the magnitude of the rating change after each game. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance "converges", a smaller value of K is more appropriate. As a result, the USCF adopts a K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K to be small enough for stability while still allowing new models to move up the leaderboard quickly. We ultimately adopted a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly. The same effect was also observed in recent work by Cohere. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
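The shuffle-and-resample idea can be sketched as follows. This is a simplified illustration under assumed conventions (ties omitted, median taken per model; function names are ours), not the Arena's actual notebook:

```python
import random

def online_elo(battles, k=4.0, init=1000.0):
    """Run the online Elo update over (model_a, model_b, winner) battles in order."""
    ratings = {}
    for a, b, winner in battles:
        ra, rb = ratings.get(a, init), ratings.get(b, init)
        ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        sa = 1.0 if winner == a else 0.0
        ratings[a] = ra + k * (sa - ea)
        ratings[b] = rb - k * (sa - ea)
    return ratings

def bootstrap_elo(battles, rounds=1000, seed=0):
    """Median Elo per model over many shuffled orderings of the battle log."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        shuffled = battles[:]
        rng.shuffle(shuffled)  # order no longer matters after resampling
        samples.append(online_elo(shuffled))
    return {m: sorted(s[m] for s in samples)[rounds // 2] for m in samples[0]}
```

Because each permutation reorders the same battles, the median over permutations washes out the order-dependence that makes a single online pass noisy.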

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models, so we do not need a decentralized algorithm. Second, most models are static (we have access to the weights), so we do not expect their performance to change. However, it is worth noting that hosted proprietary models may not be static, and their behavior can change without notice. We try our best to pin specific model API versions where possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley-Terry (BT) model. This model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model under the assumption of a fixed but unknown pairwise win rate. Like Elo, the BT model derives player ratings from pairwise comparisons in order to estimate the win rate between players. The core differences between the BT model and the online Elo system are that the BT model assumes a player's performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
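To make the MLE connection concrete, here is a toy BT fit by gradient ascent on the log-likelihood, with scores placed on the familiar Elo 400/log10 scale. This is our own illustrative sketch (the Arena fit uses the linked notebook, typically via logistic regression), and `bradley_terry` is a hypothetical name:

```python
import math

def bradley_terry(battles, steps=2000, lr=20.0):
    """Fit BT strength scores from (model_a, model_b, winner) battles.
    Battle order does not matter: each step sums over the whole log."""
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    s = {m: 0.0 for m in models}
    scale = math.log(10) / 400  # put scores on the Elo 400/log10 scale
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for a, b, winner in battles:
            p_a = 1.0 / (1.0 + math.exp(-scale * (s[a] - s[b])))
            err = (1.0 if winner == a else 0.0) - p_a
            grad[a] += err
            grad[b] -= err
        for m in models:
            s[m] += lr * grad[m]
        mean = sum(s.values()) / len(s)
        for m in models:
            s[m] -= mean  # identifiability: only score differences matter
    return s
```

At the optimum, the implied win probability between two models matches their empirical win fraction, which is exactly the MLE property the text describes.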