Chatbot Arena +

This leaderboard is based on the following benchmarks.

Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 3.8M+ user votes to compute Elo ratings.
AAII - Artificial Analysis Intelligence Index aggregating 8 challenging evaluations.
ARC-AGI - Artificial General Intelligence benchmark v2 to measure fluid intelligence.

🔍

Open LM

	Model	Arena Elo	Coding	Vision	AAII	MMLU-Pro	ARC-AGI	Organization	License
🏆	Gemini-2.5-Pro	`1468`	`1470`	`1303`	`65`	`86.2`	`4.9`	Google	Proprietary
🏆	Grok-4-0709	`1433`	`1443`	`1271`	`68`	`86.6`	`16`	xAI	Proprietary
🏆	GPT-5	`1432`	`1462`	`1282`	`68`	`87.3`	`9.9`	OpenAI	Proprietary
🥇	DeepSeek-V3.1-thinking				`60`	`85.1`		DeepSeek	MIT
🥇	GLM-4.5	`1431`	`1451`		`56`	`83.5`		Z.ai	MIT
🥇	ChatGPT-4o-latest (2025-03-26)	`1429`	`1434`	`1299`	`40`	`80.3`		OpenAI	Proprietary
🥇	DeepSeek-R1-0528	`1425`	`1437`		`59`	`84.9`	`1.3`	DeepSeek	MIT
🥇	o3-2025-04-16	`1424`	`1441`	`1261`	`67`	`85.3`	`6.5`	OpenAI	Proprietary
🥇	Grok-3-Preview-02-24	`1423`	`1439`		`46`	`79.9`		xAI	Proprietary
🥇	Claude Opus 4.1 (thinking-16k)	`1421`	`1478`		`61`	`87.8`		Anthropic	Proprietary
🥇	Qwen3-235B-A22B-Instruct-2507	`1420`	`1461`		`51`	`82.8`	`1.3`	Alibaba	Apache 2.0
🥇	Claude Opus 4.1	`1419`	`1469`		`49`	`87.3`		Anthropic	Proprietary
🥇	Qwen3-235B-A22B-Thinking-2507	`1418`	`1447`		`64`	`84.3`		Alibaba	Apache 2.0
🥈	GPT-4.5-Preview	`1415`	`1419`	`1240`	`42`	`81`	`0.8`	OpenAI	Proprietary
🥈	GPT-5-chat	`1410`	`1417`	`1289`	`67`	`86.5`	`7.5`	OpenAI	Proprietary
🥈	Gemini-2.5-Flash	`1409`	`1419`	`1275`	`58`	`83.2`	`2.5`	Google	Proprietary
🥈	Gemini-2.0-Pro-Exp-02-05	`1398`	`1396`	`1220`	`38`	`80.5`		Google	Proprietary
🥈	Gemini-2.0-Flash-Thinking-Exp-01-21	`1397`	`1383`	`1258`	`42`	`79.8`		Google	Proprietary
🥈	DeepSeek-V3.1				`49`	`83.3`		DeepSeek	MIT
🥈	GLM-4.5-Air	`1391`	`1416`		`49`	`81.5`		Z.ai	MIT
🥈	Qwen3-30B-A3B-Instruct-2507	`1382`	`1425`		`46`	`77.7`		Alibaba	Apache 2.0
🥈	Qwen-VL-Max-2025-08-13	`1381`	`1440`	`1254`				Alibaba	Apache 2.0
🥈	kimi-k2-0711-preview	`1380`	`1402`		`49`	`82.4`		Moonshot	Modified MIT
🥈	GPT-4.1-2025-04-14	`1379`	`1396`	`1247`	`47`	`80.6`	`0.4`	OpenAI	Proprietary
🥈	DeepSeek-V3-0324	`1377`	`1391`		`44`	`81.9`		DeepSeek	MIT
🥈	Hunyuan-Turbos-20250416	`1377`	`1390`			`78`		Tencent	Proprietary
🥈	Claude Opus 4 (thinking-16k)	`1376`	`1434`	`1226`	`59`	`87.3`	`8.6`	Anthropic	Proprietary
🥈	GPT-5-mini	`1375`	`1415`	`1281`	`64`	`82.8`	`4.4`	OpenAI	Proprietary
🥈	DeepSeek-R1	`1373`	`1382`		`50`	`84.4`	`1.3`	DeepSeek	MIT
🥈	Gemini-2.0-Flash-Exp	`1370`	`1371`	`1241`	`38`	`78.2`	`1.3`	Google	Proprietary
🥈	Qwen3-235B-A22B	`1369`	`1394`		`48`	`82.8`		Alibaba	Apache 2.0
🥈	Mistral Medium 3	`1369`	`1387`	`1203`	`39`	`76`		Mistral	Proprietary
🥈	gpt-oss-120b	`1368`	`1409`		`61`	`80.8`		OpenAI	Apache 2.0
🥈	Qwen2.5-Max	`1367`	`1373`		`34`	`76.2`		Alibaba	Proprietary
🥈	Claude Opus 4 (20250514)	`1366`	`1405`	`1214`	`47`	`86`	`1.3`	Anthropic	Proprietary
🥈	Grok-3-mini-high	`1366`	`1380`		`58`	`82.8`		xAI	Proprietary
🥈	o1-2024-12-17	`1366`	`1378`	`1216`	`52`	`84.1`	`1.3`	OpenAI	Proprietary
🥈	o4-mini-2025-04-16	`1362`	`1385`	`1242`	`65`	`83.2`	`6.1`	OpenAI	Proprietary
🥈	Step-3	`1360`	`1400`	`1244`				StepFun	Proprietary
🥈	Qwen3-Coder-480B-A35B-Instruct	`1358`	`1406`		`45`	`78.8`		Alibaba	Apache 2.0
🥉	Gemma-3-27B-it	`1357`	`1350`	`1222`	`25`	`66.9`		Google	Gemma
🥉	Claude Sonnet 4 (thinking-32k)	`1351`	`1412`	`1225`	`59`	`84.2`	`5.9`	Anthropic	Proprietary
🥉	Minimax-M1	`1351`	`1369`		`53`	`81.6`		MiniMax	Apache 2.0
🥉	Qwen3-32B	`1342`	`1376`		`44`	`79.8`		Alibaba	Apache 2.0
🥉	Llama-3.3-Nemotron-Super-49B-v1.5	`1340`	`1359`		`52`	`81.4`		Nvidia	Nvidia Open
🥉	Step-1o-Turbo-202506	`1339`	`1361`	`1229`				StepFun	Proprietary
🥉	o3-mini-high	`1338`	`1380`		`55`	`80.2`	`3`	OpenAI	Proprietary
🥉	GPT-4.1-mini-2025-04-14	`1338`	`1370`	`1231`	`42`	`78.1`		OpenAI	Proprietary
🥉	Gemini-2.5-Flash-Lite				`44`	`75.9`		Google	Proprietary
🥉	Mistral-Small-3.2-2506	`1337`	`1361`	`1195`	`32`	`68.1`		Mistral	Apache 2.0
🥉	Claude Sonnet 4 (20250514)	`1335`	`1384`	`1213`	`46`	`83.7`	`1.3`	Anthropic	Proprietary
🥉	Gemma-3-12B-it	`1335`	`1310`		`24`	`59.5`		Google	Gemma
🥉	DeepSeek-V3	`1334`	`1337`		`35`	`75.2`		DeepSeek	DeepSeek
🥉	GPT-5-nano	`1333`	`1363`	`1246`	`54`	`77.2`	`2.6`	OpenAI	Proprietary
🥉	QwQ-32B	`1332`	`1351`		`48`	`76.4`		Alibaba	Apache 2.0
🥉	GLM-4-Plus-0111	`1332`	`1310`			`78.6`		Z.ai	Proprietary
🥉	Gemini-2.0-Flash-Lite	`1330`	`1338`	`1147`	`30`	`72.4`		Google	Proprietary
🥉	Qwen-Plus-0125	`1327`	`1339`					Alibaba	Proprietary
🥉	Command A (03-2025)	`1327`	`1336`		`32`	`71.2`		Cohere	CC-BY-NC-4.0
🥉	Amazon-Nova-Chat-05-14	`1324`	`1337`		`35`	`73.3`		Amazon	Proprietary
🥉	Llama-3.1-Nemotron-Ultra-253B-v1	`1321`	`1345`		`46`	`82.5`		Nvidia	Nvidia Open
🥉	Step-2-16K-Exp	`1321`	`1313`					StepFun	Proprietary
🥉	Qwen3-30B-A3B	`1320`	`1346`		`42`	`77.7`		Alibaba	Apache 2.0
🥉	Gemini-1.5-Pro-002	`1320`	`1311`	`1208`	`34`	`75`	`0.8`	Google	Proprietary
🥉	o1-mini	`1318`	`1366`		`43`	`74.2`	`0.8`	OpenAI	Proprietary
🥉	o3-mini	`1318`	`1361`		`53`	`79.1`	`2.1`	OpenAI	Proprietary
🥉	Claude 3.7 Sonnet (thinking-32k)	`1316`	`1355`	`1206`	`47`	`83.7`	`0.9`	Anthropic	Proprietary
🥉	gpt-oss-20b	`1315`	`1371`		`49`	`73.6`		OpenAI	Apache 2.0
🪙	Hunyuan-Turbo-0110	`1314`	`1335`					Tencent	Proprietary
🪙	Llama-3.3-Nemotron-Super-49B-v1	`1310`	`1320`		`40`	`78.5`		Nvidia	Nvidia Open
🪙	Grok-2-08-13	`1306`	`1298`		`28`	`70.9`		xAI	Proprietary
🪙	Gemma-3n-e4b-it	`1304`	`1297`		`18`	`48.8`		Google	Gemma
🪙	Yi-Lightning	`1303`	`1321`					01 AI	Proprietary
🪙	GPT-4o-2024-05-13	`1302`	`1307`	`1184`	`30`	`74.8`		OpenAI	Proprietary
🪙	Claude 3.7 Sonnet	`1301`	`1341`	`1195`	`37`	`80.3`		Anthropic	Proprietary
🪙	Claude 3.5 Sonnet (20241022)	`1299`	`1340`	`1172`	`33`	`77.2`		Anthropic	Proprietary
🪙	Deepseek-v2.5-1210	`1296`	`1316`		`24`	`67.2`		DeepSeek	DeepSeek
🪙	Athene-v2-Chat-72B	`1294`	`1320`					NexusFlow	NexusFlow
🪙	Gemma-3-4B-it	`1293`	`1265`		`14`	`41.7`		Google	Gemma
🪙	Llama-4-Maverick-17B-128E-Instruct	`1292`	`1312`	`1185`	`42`	`80.9`		Meta	Llama 4
🪙	GLM-4-Plus	`1292`	`1301`			`70.2`		Z.ai	Proprietary
🪙	Hunyuan-Large-2025-02-10	`1291`	`1311`					Tencent	Proprietary
🪙	Gemini-1.5-Flash-002	`1290`	`1273`	`1187`	`28`	`68`		Google	Proprietary
🪙	GPT-4o-mini-2024-07-18	`1289`	`1300`	`1113`	`24`	`64.8`		OpenAI	Proprietary
🪙	GPT-4.1-nano-2025-04-14	`1287`	`1312`	`1110`	`30`	`65.7`		OpenAI	Proprietary
🪙	Llama-3.1-405B-Instruct-bf16	`1286`	`1299`		`29`	`73.2`		Meta	Llama 3.1
🪙	Llama-3.1-Nemotron-70B-Instruct	`1285`	`1289`		`26`	`69`		Nvidia	Llama 3.1
🪙	Qwen-Max-0919	`1284`	`1296`					Alibaba	Qwen
🪙	Llama-3.1-405B-Instruct-fp8	`1284`	`1292`		`29`	`73.2`		Meta	Llama 3.1
🪙	Yi-Lightning-lite	`1284`	`1286`					01 AI	Proprietary
🪙	Claude 3.5 Sonnet (20240620)	`1283`	`1309`	`1167`	`29`	`75.1`		Anthropic	Proprietary
🪙	Grok-2-mini-08-13	`1283`	`1279`					xAI	Proprietary
🪙	Llama-4-Scout-17B-16E-Instruct	`1276`	`1290`	`1176`	`33`	`75.2`		Meta	Llama 4
🪙	Hunyuan-Standard-2025-02-10	`1276`	`1289`					Tencent	Proprietary
🪙	Llama-3.3-70B-Instruct	`1276`	`1279`		`31`	`71.3`		Meta	Llama 3.3
🪙	Deepseek-v2.5	`1275`	`1306`		`23`	`66.2`		DeepSeek	DeepSeek
🪙	GPT-4-Turbo-2024-04-09	`1275`	`1280`	`1137`	`28`	`69.4`		OpenAI	Proprietary
🪙	Qwen2.5-72B-Instruct	`1272`	`1302`		`29`	`72`		Alibaba	Qwen
🪙	Hunyuan-Large-Vision	`1270`	`1298`	`1239`				Tencent	Proprietary
🪙	Mistral-Small-3.1-24B-Instruct-2503	`1269`	`1295`	`1171`	`24`	`65.9`		Mistral	Apache 2.0
🪙	Mistral-Large-2411	`1269`	`1284`		`27`	`69.7`		Mistral	MRL
🪙	Athene-70B	`1268`	`1274`					NexusFlow	CC-BY-NC-4.0
🪙	GPT-4-1106-preview	`1267`	`1269`		`25`	`63.7`		OpenAI	Proprietary
🪙	GPT-4-0125-preview	`1266`	`1261`					OpenAI	Proprietary
🪙	Claude 3 Opus	`1265`	`1269`	`1070`	`24`	`69.6`		Anthropic	Proprietary
🪙	Llama-3.1-70B-Instruct	`1265`	`1268`		`24`	`67.6`		Meta	Llama 3.1
🪙	Amazon Nova Pro 1.0	`1262`	`1282`	`1029`	`29`	`69.1`		Amazon	Proprietary
🪙	Llama-3.1-Tulu-3-70B	`1260`	`1251`					Ai2	Llama 3.1
🪙	Claude 3.5 Haiku (20241022)	`1256`	`1287`	`1145`	`23`	`63.4`		Anthropic	Proprietary
🪙	magistral-medium-2506	`1253`	`1307`		`38`	`75.3`		Mistral	Proprietary
🪙	Reka-Core-20240904	`1252`	`1238`		`22`			Reka AI	Proprietary
🪙	Reka-Core-20240722	`1250`	`1226`					Reka AI	Proprietary
🪙	Qwen-Plus-0828	`1242`	`1263`					Alibaba	Proprietary
🪙	Jamba-1.5-Large	`1242`	`1244`		`18`	`57.2`		AI21 Labs	Jamba Open
🪙	Deepseek-v2-API-0628	`1240`	`1260`					DeepSeek	DeepSeek
🪙	Mistral-Small-3-24B-Instruct-2501	`1238`	`1251`		`24`	`65.2`		Mistral	Apache 2.0
🪙	Deepseek-Coder-v2-0724	`1237`	`1286`		`17`	`58.5`		DeepSeek	DeepSeek
🪙	Yi-Large	`1236`	`1238`		`16`	`58.6`		01 AI	Proprietary
🪙	Gemma-2-27B-it	`1236`	`1226`		`20`	`57.5`		Google	Gemma
🪙	Qwen2.5-Coder-32B-Instruct	`1235`	`1279`		`25`	`63.5`		Alibaba	Apache 2.0
🪙	Amazon Nova Lite 1.0	`1233`	`1253`	`1039`	`25`	`59`		Amazon	Proprietary
🪙	Gemma-2-9B-it-SimPO	`1233`	`1211`					Princeton	MIT
🪙	Command R+ (08-2024)	`1233`	`1200`		`9`	`43.2`		Cohere	CC-BY-NC-4.0
🪙	Gemini-1.5-Flash-8B-001	`1231`	`1228`	`1092`	`19`	`56.9`		Google	Proprietary
🪙	Llama-3.1-Nemotron-51B-Instruct	`1231`	`1227`					Nvidia	Llama 3.1
🪙	GLM-4-0520	`1230`	`1237`					Z.ai	Proprietary
🪙	Nemotron-4-340B-Instruct	`1229`	`1220`					Nvidia	Nvidia Open
🪙	Aya-Expanse-32B	`1229`	`1211`		`8`	`37.7`		Cohere	CC-BY-NC-4.0
🪙	Reka-Flash-20240904	`1225`	`1208`		`22`			Reka AI	Proprietary
🪙	Llama-3-70B-Instruct	`1224`	`1216`		`16`	`57.4`		Meta	Llama 3
🪙	Claude 3 Sonnet	`1223`	`1232`	`1033`	`16`	`57.9`		Anthropic	Proprietary
🪙	OLMo-2-0325-32B-Instruct	`1223`	`1215`					Ai2	Apache-2.0
🪙	Phi-4	`1222`	`1242`		`28`	`71.4`		Microsoft	MIT
🪙	Reka-Flash-20240722	`1218`	`1201`					Reka AI	Proprietary
🪙	Amazon Nova Micro 1.0	`1215`	`1228`		`20`	`53.1`		Amazon	Proprietary
🪙	Gemma-2-9B-it	`1213`	`1194`		`10`	`49.5`		Google	Gemma
🪙	Hunyuan-Standard-256K	`1209`	`1244`					Tencent	Proprietary
🪙	Command R+ (04-2024)	`1209`	`1184`		`8`	`42.7`		Cohere	CC-BY-NC-4.0
🪙	Qwen2-72B-Instruct	`1208`	`1206`		`21`	`62.2`		Alibaba	Qianwen
🪙	Claude 3 Haiku	`1200`	`1208`	`1000`	`12`	`50`		Anthropic	Proprietary
🪙	Llama-3.1-Tulu-3-8B	`1200`	`1197`					Ai2	Llama 3.1
🪙	Qwen-Max-0428	`1199`	`1208`					Alibaba	Proprietary
🪙	Ministral-8B-2410	`1198`	`1219`		`10`	`38.9`		Mistral	MRL
🪙	GLM-4-0116	`1198`	`1209`					Z.ai	Proprietary
🪙	DeepSeek-Coder-V2-Instruct	`1196`	`1259`					DeepSeek	DeepSeek
🪙	Command R (08-2024)	`1195`	`1180`		`3`	`33.8`		Cohere	CC-BY-NC-4.0
🪙	Llama-3.1-8B-Instruct	`1193`	`1203`		`12`	`47.6`		Meta	Llama 3.1
🪙	Jamba-1.5-Mini	`1193`	`1197`					AI21 Labs	Jamba Open
🪙	Aya-Expanse-8B	`1193`	`1184`		`4`	`31.2`		Cohere	CC-BY-NC-4.0
🪙	Qwen1.5-110B-Chat	`1180`	`1192`		`13`			Alibaba	Qianwen
🪙	Yi-1.5-34B-Chat	`1178`	`1181`					01 AI	Apache-2.0
🪙	Claude-1	`1178`	`1161`					Anthropic	Proprietary
🪙	Qwen1.5-72B-Chat	`1172`	`1175`					Alibaba	Qianwen
🪙	Mistral Medium	`1171`	`1172`		`11`	`49.1`		Mistral	Proprietary
🪙	Llama-3-8B-Instruct	`1171`	`1164`		`9`	`40.5`		Meta	Llama 3
🪙	Command R (04-2024)	`1169`	`1141`		`2`	`33.7`		Cohere	CC-BY-NC-4.0
🪙	InternLM2.5-20B-chat	`1168`	`1179`					InternLM	Other
🪙	Mixtral-8x22b-Instruct-v0.1	`1168`	`1175`		`14`	`53.7`		Mistral	Apache 2.0
🪙	Gemma-2-2b-it	`1163`	`1130`					Google	Gemma
🪙	Granite-3.1-8B-Instruct	`1158`	`1191`					IBM	Apache 2.0
🪙	Claude-2.0	`1158`	`1160`		`11`	`48.6`		Anthropic	Proprietary
🪙	Gemini-1.0-Pro-001	`1155`	`1125`					Google	Proprietary
🪙	Zephyr-ORPO-141b-A35b-v0.1	`1150`	`1144`					HuggingFace	Apache 2.0
🪙	Claude-2.1	`1146`	`1158`		`12`	`49.5`		Anthropic	Proprietary
🪙	GPT-3.5-Turbo-0613	`1145`	`1164`		`11`	`46.2`		OpenAI	Proprietary
🪙	Qwen1.5-32B-Chat	`1144`	`1163`					Alibaba	Qianwen
🪙	Phi-3-Medium-4k-Instruct	`1144`	`1146`		`13`	`54.3`		Microsoft	MIT
🪙	Starling-LM-7B-beta	`1139`	`1151`					Nexusflow	Apache-2.0
🪙	Mixtral-8x7B-Instruct-v0.1	`1138`	`1136`		`5`	`38.7`		Mistral	Apache 2.0
🪙	GPT-3.5-Turbo-0314	`1138`	`1136`					OpenAI	Proprietary
🪙	Granite-3.1-2B-Instruct	`1136`	`1166`					IBM	Apache 2.0
🪙	Qwen1.5-14B-Chat	`1135`	`1144`					Alibaba	Qianwen
🪙	Claude-Instant-1	`1135`	`1136`		`2`	`43.4`		Anthropic	Proprietary
🪙	Yi-34B-Chat	`1134`	`1129`					01 AI	Yi
🪙	Tulu-2-DPO-70B	`1127`	`1120`					Ai2	Ai2 ImpACT
🪙	DBRX-Instruct-Preview	`1126`	`1141`					Databricks	DBRX
🪙	WizardLM-70B-v1.0	`1126`	`1093`					Microsoft	Llama 2
🪙	Llama-2-70B-chat	`1122`	`1099`		`8`	`40.7`		Meta	Llama 2
🪙	Nous-Hermes-2-Mixtral-8x7B-DPO	`1119`	`1103`					NousResearch	Apache-2.0
🪙	Llama-3.2-3B-Instruct	`1118`	`1097`		`7`	`34.7`		Meta	Llama 3.2
🪙	Phi-3-Small-8k-Instruct	`1117`	`1123`					Microsoft	MIT
🪙	OpenChat-3.5-0106	`1114`	`1119`					OpenChat	Apache-2.0
🪙	Starling-LM-7B-alpha	`1114`	`1104`					UC Berkeley	CC-BY-NC-4.0
🪙	Vicuna-33B	`1113`	`1091`					LMSYS	Non-commercial
🪙	DeepSeek-LLM-67B-Chat	`1111`	`1106`		`8`			DeepSeek	DeepSeek
🪙	Snowflake Arctic Instruct	`1109`	`1101`					Snowflake	Apache 2.0
🪙	Granite-3.0-8B-Instruct	`1108`	`1115`					IBM	Apache 2.0
🪙	NV-Llama2-70B-SteerLM-Chat	`1106`	`1047`					Nvidia	Llama 2
🪙	OpenChat-3.5	`1103`	`1077`					OpenChat	Apache-2.0
🪙	Gemma-1.1-7B-it	`1102`	`1105`					Google	Gemma
🪙	OpenHermes-2.5-Mistral-7B	`1100`	`1083`					NousResearch	Apache-2.0
🪙	pplx-70B-online	`1099`	`1055`					Perplexity AI	Proprietary
🪙	Mistral-7B-Instruct-v0.2	`1097`	`1094`		`1`	`24.5`		Mistral	Apache-2.0
🪙	Llama-2-13b-chat	`1093`	`1077`		`8`	`40.6`		Meta	Llama 2
🪙	Granite-3.0-2B-Instruct	`1091`	`1104`					IBM	Apache 2.0
🪙	SOLAR-10.7B-Instruct-v1.0	`1091`	`1073`					Upstage AI	CC-BY-NC-4.0
🪙	Qwen1.5-7B-Chat	`1090`	`1110`					Alibaba	Qianwen
🪙	Phi-3-Mini-4K-Instruct-June-24	`1088`	`1098`					Microsoft	MIT
🪙	Dolphin-2.2.1-Mistral-7B	`1088`	`1049`					Cognitive	Apache-2.0
🪙	WizardLM-13b-v1.2	`1084`	`1048`					Microsoft	Llama 2
🪙	Phi-3-Mini-4k-Instruct	`1082`	`1102`					Microsoft	MIT
🪙	MPT-30B-chat	`1076`	`1055`					MosaicML	CC-BY-NC-SA-4.0
🪙	Zephyr-7B-beta	`1076`	`1053`					HuggingFace	MIT
🪙	CodeLlama-34B-instruct	`1073`	`1065`					Meta	Llama 2
🪙	Llama-3.2-1B-Instruct	`1067`	`1063`		`1`	`20`		Meta	Llama 3.2
🪙	Qwen2.5-VL-32B-Instruct			`1199`				Alibaba	Apache 2.0
🪙	Step-1o-Vision-32k (highres)			`1169`				StepFun	Proprietary
🪙	Qwen2.5-VL-72B-Instruct			`1154`				Alibaba	Qwen
🪙	Pixtral-Large-2411			`1138`	`26`	`70.1`		Mistral	MRL
🪙	Qwen-VL-Max-1119			`1106`				Alibaba	Proprietary
🪙	Qwen2-VL-72b-Instruct			`1094`				Alibaba	Qwen
🪙	Step-1V-32K			`1093`				StepFun	Proprietary
🪙	Molmo-72B-0924			`1060`				Ai2	Apache 2.0
🪙	Pixtral-12B-2409			`1056`	`11`	`47.3`		Mistral	Apache 2.0
🪙	InternVL2-26B			`1053`				OpenGVLab	MIT
🪙	Llama-3.2-90B-Vision-Instruct			`1047`	`22`	`67.1`		Meta	Llama 3.2
🪙	Hunyuan-Standard-Vision-2024-12-31			`1046`				Tencent	Proprietary
🪙	Aya-Vision-32B			`1042`				Cohere	CC-BY-NC-4.0
🪙	Qwen2-VL-7B-Instruct			`1038`				Alibaba	Apache 2.0
🪙	Yi-Vision			`1028`				01 AI	Proprietary
🪙	Llama-3.2-11B-Vision-Instruct			`1014`	`13`	`46.4`		Meta	Llama 3.2
🪙	Molmo-7B-D-0924			`1007`				Ai2	Apache 2.0

OproAI

SWE-bench + | GitHub | Stats

💡 AAII v2.2 incorporates 8 evaluations: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench, AA-LCR.

💻 Arena Elo ratings are computed by this notebook. Higher values are better for all benchmarks. Empty cells mean not available.

Transition from online Elo rating system to Bradley-Terry model

We adopted the Elo rating system for ranking models since the launch of the Arena. It has been useful to transform pairwise human preference to Elo ratings that serve as a predictor of winrate between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

ELO rating has been used to rank chess players by the international community for over 60 years. Standard Elo rating systems assume a player’s performance changes overtime. So an online algorithm is needed to capture such dynamics, meaning recent games should weigh more than older games. Specifically, after each game, a player’s rating is updated according to the difference between predicted outcome and actual outcome.

This algorithm has two distinct features:

It can be computed asynchronously by players around the world.
It allows for players performance to change dynamically – it does not assume a fixed unknown value for the players rating.

This ability to adapt is determined by the parameter K which controls the magnitude of rating changes that can affect the overall result. A larger K essentially put more weight on the recent games, which may make sense for new players whose performance improves quickly. However as players become more senior and their performance “converges” then a smaller value of K is more appropriate. As a result, USCF adopted K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes slower than a new player.

When we launched the Arena, we noticed considerable variability in the ratings using the classic online algorithm. We tried to tune the K to be sufficiently stable while also allowing new models to move up quickly in the leaderboard. We ultimately decided to adopt a bootstrap-like technique to shuffle the data and sample Elo scores from 1000 permutations of the online plays. This provided consistent stable scores and allowed us to incorporate new models quickly.

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models and so we don’t need a decentralized algorithm. Second, most models are static (we have access to the weights) and so we don’t expect their performance to change. However, it is worth noting that the hosted proprietary models may not be static and their behavior can change without notice. We try our best to pin specific model API versions if possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley–Terry (BT) model. This model actually is the maximum likelihood (MLE) estimate of the underlying Elo model assuming a fixed but unknown pairwise win-rate. Similar to Elo rating, BT model is also based on pairwise comparison to derive ratings of players to estimate win rate between each other. The core difference between BT model vs the online Elo system is the assumption that player’s performance does not change (i.e., game order does not matter) and the computation takes place in a centralized fashion.

Chatbot Arena +

❖ OproAI 2025Aug 21

OproAI

Transition from online Elo rating system to Bradley-Terry model

❖ OproAI