Chatbot Arena +

❖ OproAI 2025 Aug 21

This leaderboard is based on the following benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 3.8M+ user votes to compute Elo ratings.
  • AAII - Artificial Analysis Intelligence Index aggregating 8 challenging evaluations.
  • ARC-AGI - Artificial General Intelligence benchmark v2 to measure fluid intelligence.

Model | Arena Elo | Coding | Vision | AAII | MMLU-Pro | ARC-AGI | Organization | License
🏆 Gemini-2.5-Pro | 1468 | 1470 | 1303 | 65 | 86.2 | 4.9 | Google | Proprietary
🏆 Grok-4-0709 | 1433 | 1443 | 1271 | 68 | 86.6 | 16 | xAI | Proprietary
🏆 GPT-5 | 1432 | 1462 | 1282 | 68 | 87.3 | 9.9 | OpenAI | Proprietary
🥇 DeepSeek-V3.1-thinking | | | | 60 | 85.1 | | DeepSeek | MIT
🥇 GLM-4.5 | 1431 | 1451 | | 56 | 83.5 | | Z.ai | MIT
🥇 ChatGPT-4o-latest (2025-03-26) | 1429 | 1434 | 1299 | 40 | 80.3 | | OpenAI | Proprietary
🥇 DeepSeek-R1-0528 | 1425 | 1437 | | 59 | 84.9 | 1.3 | DeepSeek | MIT
🥇 o3-2025-04-16 | 1424 | 1441 | 1261 | 67 | 85.3 | 6.5 | OpenAI | Proprietary
🥇 Grok-3-Preview-02-24 | 1423 | 1439 | | 46 | 79.9 | | xAI | Proprietary
🥇 Claude Opus 4.1 (thinking-16k) | 1421 | 1478 | | 61 | 87.8 | | Anthropic | Proprietary
🥇 Qwen3-235B-A22B-Instruct-2507 | 1420 | 1461 | | 51 | 82.8 | 1.3 | Alibaba | Apache 2.0
🥇 Claude Opus 4.1 | 1419 | 1469 | | 49 | 87.3 | | Anthropic | Proprietary
🥇 Qwen3-235B-A22B-Thinking-2507 | 1418 | 1447 | | 64 | 84.3 | | Alibaba | Apache 2.0
🥈 GPT-4.5-Preview | 1415 | 1419 | 1240 | 42 | 81 | 0.8 | OpenAI | Proprietary
🥈 GPT-5-chat | 1410 | 1417 | 1289 | 67 | 86.5 | 7.5 | OpenAI | Proprietary
🥈 Gemini-2.5-Flash | 1409 | 1419 | 1275 | 58 | 83.2 | 2.5 | Google | Proprietary
🥈 Gemini-2.0-Pro-Exp-02-05 | 1398 | 1396 | 1220 | 38 | 80.5 | | Google | Proprietary
🥈 Gemini-2.0-Flash-Thinking-Exp-01-21 | 1397 | 1383 | 1258 | 42 | 79.8 | | Google | Proprietary
🥈 DeepSeek-V3.1 | | | | 49 | 83.3 | | DeepSeek | MIT
🥈 GLM-4.5-Air | 1391 | 1416 | | 49 | 81.5 | | Z.ai | MIT
🥈 Qwen3-30B-A3B-Instruct-2507 | 1382 | 1425 | | 46 | 77.7 | | Alibaba | Apache 2.0
🥈 Qwen-VL-Max-2025-08-13 | 1381 | 1440 | 1254 | | | | Alibaba | Apache 2.0
🥈 kimi-k2-0711-preview | 1380 | 1402 | | 49 | 82.4 | | Moonshot | Modified MIT
🥈 GPT-4.1-2025-04-14 | 1379 | 1396 | 1247 | 47 | 80.6 | 0.4 | OpenAI | Proprietary
🥈 DeepSeek-V3-0324 | 1377 | 1391 | | 44 | 81.9 | | DeepSeek | MIT
🥈 Hunyuan-Turbos-20250416 | 1377 | 1390 | | | 78 | | Tencent | Proprietary
🥈 Claude Opus 4 (thinking-16k) | 1376 | 1434 | 1226 | 59 | 87.3 | 8.6 | Anthropic | Proprietary
🥈 GPT-5-mini | 1375 | 1415 | 1281 | 64 | 82.8 | 4.4 | OpenAI | Proprietary
🥈 DeepSeek-R1 | 1373 | 1382 | | 50 | 84.4 | 1.3 | DeepSeek | MIT
🥈 Gemini-2.0-Flash-Exp | 1370 | 1371 | 1241 | 38 | 78.2 | 1.3 | Google | Proprietary
🥈 Qwen3-235B-A22B | 1369 | 1394 | | 48 | 82.8 | | Alibaba | Apache 2.0
🥈 Mistral Medium 3 | 1369 | 1387 | 1203 | 39 | 76 | | Mistral | Proprietary
🥈 gpt-oss-120b | 1368 | 1409 | | 61 | 80.8 | | OpenAI | Apache 2.0
🥈 Qwen2.5-Max | 1367 | 1373 | | 34 | 76.2 | | Alibaba | Proprietary
🥈 Claude Opus 4 (20250514) | 1366 | 1405 | 1214 | 47 | 86 | 1.3 | Anthropic | Proprietary
🥈 Grok-3-mini-high | 1366 | 1380 | | 58 | 82.8 | | xAI | Proprietary
🥈 o1-2024-12-17 | 1366 | 1378 | 1216 | 52 | 84.1 | 1.3 | OpenAI | Proprietary
🥈 o4-mini-2025-04-16 | 1362 | 1385 | 1242 | 65 | 83.2 | 6.1 | OpenAI | Proprietary
🥈 Step-3 | 1360 | 1400 | 1244 | | | | StepFun | Proprietary
🥈 Qwen3-Coder-480B-A35B-Instruct | 1358 | 1406 | | 45 | 78.8 | | Alibaba | Apache 2.0
🥉 Gemma-3-27B-it | 1357 | 1350 | 1222 | 25 | 66.9 | | Google | Gemma
🥉 Claude Sonnet 4 (thinking-32k) | 1351 | 1412 | 1225 | 59 | 84.2 | 5.9 | Anthropic | Proprietary
🥉 Minimax-M1 | 1351 | 1369 | | 53 | 81.6 | | MiniMax | Apache 2.0
🥉 Qwen3-32B | 1342 | 1376 | | 44 | 79.8 | | Alibaba | Apache 2.0
🥉 Llama-3.3-Nemotron-Super-49B-v1.5 | 1340 | 1359 | | 52 | 81.4 | | Nvidia | Nvidia Open
🥉 Step-1o-Turbo-202506 | 1339 | 1361 | 1229 | | | | StepFun | Proprietary
🥉 o3-mini-high | 1338 | 1380 | | 55 | 80.2 | 3 | OpenAI | Proprietary
🥉 GPT-4.1-mini-2025-04-14 | 1338 | 1370 | 1231 | 42 | 78.1 | | OpenAI | Proprietary
🥉 Gemini-2.5-Flash-Lite | | | | 44 | 75.9 | | Google | Proprietary
🥉 Mistral-Small-3.2-2506 | 1337 | 1361 | 1195 | 32 | 68.1 | | Mistral | Apache 2.0
🥉 Claude Sonnet 4 (20250514) | 1335 | 1384 | 1213 | 46 | 83.7 | 1.3 | Anthropic | Proprietary
🥉 Gemma-3-12B-it | 1335 | 1310 | | 24 | 59.5 | | Google | Gemma
🥉 DeepSeek-V3 | 1334 | 1337 | | 35 | 75.2 | | DeepSeek | DeepSeek
🥉 GPT-5-nano | 1333 | 1363 | 1246 | 54 | 77.2 | 2.6 | OpenAI | Proprietary
🥉 QwQ-32B | 1332 | 1351 | | 48 | 76.4 | | Alibaba | Apache 2.0
🥉 GLM-4-Plus-0111 | 1332 | 1310 | | | 78.6 | | Z.ai | Proprietary
🥉 Gemini-2.0-Flash-Lite | 1330 | 1338 | 1147 | 30 | 72.4 | | Google | Proprietary
🥉 Qwen-Plus-0125 | 1327 | 1339 | | | | | Alibaba | Proprietary
🥉 Command A (03-2025) | 1327 | 1336 | | 32 | 71.2 | | Cohere | CC-BY-NC-4.0
🥉 Amazon-Nova-Chat-05-14 | 1324 | 1337 | | 35 | 73.3 | | Amazon | Proprietary
🥉 Llama-3.1-Nemotron-Ultra-253B-v1 | 1321 | 1345 | | 46 | 82.5 | | Nvidia | Nvidia Open
🥉 Step-2-16K-Exp | 1321 | 1313 | | | | | StepFun | Proprietary
🥉 Qwen3-30B-A3B | 1320 | 1346 | | 42 | 77.7 | | Alibaba | Apache 2.0
🥉 Gemini-1.5-Pro-002 | 1320 | 1311 | 1208 | 34 | 75 | 0.8 | Google | Proprietary
🥉 o1-mini | 1318 | 1366 | | 43 | 74.2 | 0.8 | OpenAI | Proprietary
🥉 o3-mini | 1318 | 1361 | | 53 | 79.1 | 2.1 | OpenAI | Proprietary
🥉 Claude 3.7 Sonnet (thinking-32k) | 1316 | 1355 | 1206 | 47 | 83.7 | 0.9 | Anthropic | Proprietary
🥉 gpt-oss-20b | 1315 | 1371 | | 49 | 73.6 | | OpenAI | Apache 2.0
🪙 Hunyuan-Turbo-0110 | 1314 | 1335 | | | | | Tencent | Proprietary
🪙 Llama-3.3-Nemotron-Super-49B-v1 | 1310 | 1320 | | 40 | 78.5 | | Nvidia | Nvidia Open
🪙 Grok-2-08-13 | 1306 | 1298 | | 28 | 70.9 | | xAI | Proprietary
🪙 Gemma-3n-e4b-it | 1304 | 1297 | | 18 | 48.8 | | Google | Gemma
🪙 Yi-Lightning | 1303 | 1321 | | | | | 01 AI | Proprietary
🪙 GPT-4o-2024-05-13 | 1302 | 1307 | 1184 | 30 | 74.8 | | OpenAI | Proprietary
🪙 Claude 3.7 Sonnet | 1301 | 1341 | 1195 | 37 | 80.3 | | Anthropic | Proprietary
🪙 Claude 3.5 Sonnet (20241022) | 1299 | 1340 | 1172 | 33 | 77.2 | | Anthropic | Proprietary
🪙 Deepseek-v2.5-1210 | 1296 | 1316 | | 24 | 67.2 | | DeepSeek | DeepSeek
🪙 Athene-v2-Chat-72B | 1294 | 1320 | | | | | NexusFlow | NexusFlow
🪙 Gemma-3-4B-it | 1293 | 1265 | | 14 | 41.7 | | Google | Gemma
🪙 Llama-4-Maverick-17B-128E-Instruct | 1292 | 1312 | 1185 | 42 | 80.9 | | Meta | Llama 4
🪙 GLM-4-Plus | 1292 | 1301 | | | 70.2 | | Z.ai | Proprietary
🪙 Hunyuan-Large-2025-02-10 | 1291 | 1311 | | | | | Tencent | Proprietary
🪙 Gemini-1.5-Flash-002 | 1290 | 1273 | 1187 | 28 | 68 | | Google | Proprietary
🪙 GPT-4o-mini-2024-07-18 | 1289 | 1300 | 1113 | 24 | 64.8 | | OpenAI | Proprietary
🪙 GPT-4.1-nano-2025-04-14 | 1287 | 1312 | 1110 | 30 | 65.7 | | OpenAI | Proprietary
🪙 Llama-3.1-405B-Instruct-bf16 | 1286 | 1299 | | 29 | 73.2 | | Meta | Llama 3.1
🪙 Llama-3.1-Nemotron-70B-Instruct | 1285 | 1289 | | 26 | 69 | | Nvidia | Llama 3.1
🪙 Qwen-Max-0919 | 1284 | 1296 | | | | | Alibaba | Qwen
🪙 Llama-3.1-405B-Instruct-fp8 | 1284 | 1292 | | 29 | 73.2 | | Meta | Llama 3.1
🪙 Yi-Lightning-lite | 1284 | 1286 | | | | | 01 AI | Proprietary
🪙 Claude 3.5 Sonnet (20240620) | 1283 | 1309 | 1167 | 29 | 75.1 | | Anthropic | Proprietary
🪙 Grok-2-mini-08-13 | 1283 | 1279 | | | | | xAI | Proprietary
🪙 Llama-4-Scout-17B-16E-Instruct | 1276 | 1290 | 1176 | 33 | 75.2 | | Meta | Llama 4
🪙 Hunyuan-Standard-2025-02-10 | 1276 | 1289 | | | | | Tencent | Proprietary
🪙 Llama-3.3-70B-Instruct | 1276 | 1279 | | 31 | 71.3 | | Meta | Llama 3.3
🪙 Deepseek-v2.5 | 1275 | 1306 | | 23 | 66.2 | | DeepSeek | DeepSeek
🪙 GPT-4-Turbo-2024-04-09 | 1275 | 1280 | 1137 | 28 | 69.4 | | OpenAI | Proprietary
🪙 Qwen2.5-72B-Instruct | 1272 | 1302 | | 29 | 72 | | Alibaba | Qwen
🪙 Hunyuan-Large-Vision | 1270 | 1298 | 1239 | | | | Tencent | Proprietary
🪙 Mistral-Small-3.1-24B-Instruct-2503 | 1269 | 1295 | 1171 | 24 | 65.9 | | Mistral | Apache 2.0
🪙 Mistral-Large-2411 | 1269 | 1284 | | 27 | 69.7 | | Mistral | MRL
🪙 Athene-70B | 1268 | 1274 | | | | | NexusFlow | CC-BY-NC-4.0
🪙 GPT-4-1106-preview | 1267 | 1269 | | 25 | 63.7 | | OpenAI | Proprietary
🪙 GPT-4-0125-preview | 1266 | 1261 | | | | | OpenAI | Proprietary
🪙 Claude 3 Opus | 1265 | 1269 | 1070 | 24 | 69.6 | | Anthropic | Proprietary
🪙 Llama-3.1-70B-Instruct | 1265 | 1268 | | 24 | 67.6 | | Meta | Llama 3.1
🪙 Amazon Nova Pro 1.0 | 1262 | 1282 | 1029 | 29 | 69.1 | | Amazon | Proprietary
🪙 Llama-3.1-Tulu-3-70B | 1260 | 1251 | | | | | Ai2 | Llama 3.1
🪙 Claude 3.5 Haiku (20241022) | 1256 | 1287 | 1145 | 23 | 63.4 | | Anthropic | Proprietary
🪙 magistral-medium-2506 | 1253 | 1307 | | 38 | 75.3 | | Mistral | Proprietary
🪙 Reka-Core-20240904 | 1252 | 1238 | | 22 | | | Reka AI | Proprietary
🪙 Reka-Core-20240722 | 1250 | 1226 | | | | | Reka AI | Proprietary
🪙 Qwen-Plus-0828 | 1242 | 1263 | | | | | Alibaba | Proprietary
🪙 Jamba-1.5-Large | 1242 | 1244 | | 18 | 57.2 | | AI21 Labs | Jamba Open
🪙 Deepseek-v2-API-0628 | 1240 | 1260 | | | | | DeepSeek | DeepSeek
🪙 Mistral-Small-3-24B-Instruct-2501 | 1238 | 1251 | | 24 | 65.2 | | Mistral | Apache 2.0
🪙 Deepseek-Coder-v2-0724 | 1237 | 1286 | | 17 | 58.5 | | DeepSeek | DeepSeek
🪙 Yi-Large | 1236 | 1238 | | 16 | 58.6 | | 01 AI | Proprietary
🪙 Gemma-2-27B-it | 1236 | 1226 | | 20 | 57.5 | | Google | Gemma
🪙 Qwen2.5-Coder-32B-Instruct | 1235 | 1279 | | 25 | 63.5 | | Alibaba | Apache 2.0
🪙 Amazon Nova Lite 1.0 | 1233 | 1253 | 1039 | 25 | 59 | | Amazon | Proprietary
🪙 Gemma-2-9B-it-SimPO | 1233 | 1211 | | | | | Princeton | MIT
🪙 Command R+ (08-2024) | 1233 | 1200 | | 9 | 43.2 | | Cohere | CC-BY-NC-4.0
🪙 Gemini-1.5-Flash-8B-001 | 1231 | 1228 | 1092 | 19 | 56.9 | | Google | Proprietary
🪙 Llama-3.1-Nemotron-51B-Instruct | 1231 | 1227 | | | | | Nvidia | Llama 3.1
🪙 GLM-4-0520 | 1230 | 1237 | | | | | Z.ai | Proprietary
🪙 Nemotron-4-340B-Instruct | 1229 | 1220 | | | | | Nvidia | Nvidia Open
🪙 Aya-Expanse-32B | 1229 | 1211 | | 8 | 37.7 | | Cohere | CC-BY-NC-4.0
🪙 Reka-Flash-20240904 | 1225 | 1208 | | 22 | | | Reka AI | Proprietary
🪙 Llama-3-70B-Instruct | 1224 | 1216 | | 16 | 57.4 | | Meta | Llama 3
🪙 Claude 3 Sonnet | 1223 | 1232 | 1033 | 16 | 57.9 | | Anthropic | Proprietary
🪙 OLMo-2-0325-32B-Instruct | 1223 | 1215 | | | | | Ai2 | Apache-2.0
🪙 Phi-4 | 1222 | 1242 | | 28 | 71.4 | | Microsoft | MIT
🪙 Reka-Flash-20240722 | 1218 | 1201 | | | | | Reka AI | Proprietary
🪙 Amazon Nova Micro 1.0 | 1215 | 1228 | | 20 | 53.1 | | Amazon | Proprietary
🪙 Gemma-2-9B-it | 1213 | 1194 | | 10 | 49.5 | | Google | Gemma
🪙 Hunyuan-Standard-256K | 1209 | 1244 | | | | | Tencent | Proprietary
🪙 Command R+ (04-2024) | 1209 | 1184 | | 8 | 42.7 | | Cohere | CC-BY-NC-4.0
🪙 Qwen2-72B-Instruct | 1208 | 1206 | | 21 | 62.2 | | Alibaba | Qianwen
🪙 Claude 3 Haiku | 1200 | 1208 | 1000 | 12 | 50 | | Anthropic | Proprietary
🪙 Llama-3.1-Tulu-3-8B | 1200 | 1197 | | | | | Ai2 | Llama 3.1
🪙 Qwen-Max-0428 | 1199 | 1208 | | | | | Alibaba | Proprietary
🪙 Ministral-8B-2410 | 1198 | 1219 | | 10 | 38.9 | | Mistral | MRL
🪙 GLM-4-0116 | 1198 | 1209 | | | | | Z.ai | Proprietary
🪙 DeepSeek-Coder-V2-Instruct | 1196 | 1259 | | | | | DeepSeek | DeepSeek
🪙 Command R (08-2024) | 1195 | 1180 | | 3 | 33.8 | | Cohere | CC-BY-NC-4.0
🪙 Llama-3.1-8B-Instruct | 1193 | 1203 | | 12 | 47.6 | | Meta | Llama 3.1
🪙 Jamba-1.5-Mini | 1193 | 1197 | | | | | AI21 Labs | Jamba Open
🪙 Aya-Expanse-8B | 1193 | 1184 | | 4 | 31.2 | | Cohere | CC-BY-NC-4.0
🪙 Qwen1.5-110B-Chat | 1180 | 1192 | | 13 | | | Alibaba | Qianwen
🪙 Yi-1.5-34B-Chat | 1178 | 1181 | | | | | 01 AI | Apache-2.0
🪙 Claude-1 | 1178 | 1161 | | | | | Anthropic | Proprietary
🪙 Qwen1.5-72B-Chat | 1172 | 1175 | | | | | Alibaba | Qianwen
🪙 Mistral Medium | 1171 | 1172 | | 11 | 49.1 | | Mistral | Proprietary
🪙 Llama-3-8B-Instruct | 1171 | 1164 | | 9 | 40.5 | | Meta | Llama 3
🪙 Command R (04-2024) | 1169 | 1141 | | 2 | 33.7 | | Cohere | CC-BY-NC-4.0
🪙 InternLM2.5-20B-chat | 1168 | 1179 | | | | | InternLM | Other
🪙 Mixtral-8x22b-Instruct-v0.1 | 1168 | 1175 | | 14 | 53.7 | | Mistral | Apache 2.0
🪙 Gemma-2-2b-it | 1163 | 1130 | | | | | Google | Gemma
🪙 Granite-3.1-8B-Instruct | 1158 | 1191 | | | | | IBM | Apache 2.0
🪙 Claude-2.0 | 1158 | 1160 | | 11 | 48.6 | | Anthropic | Proprietary
🪙 Gemini-1.0-Pro-001 | 1155 | 1125 | | | | | Google | Proprietary
🪙 Zephyr-ORPO-141b-A35b-v0.1 | 1150 | 1144 | | | | | HuggingFace | Apache 2.0
🪙 Claude-2.1 | 1146 | 1158 | | 12 | 49.5 | | Anthropic | Proprietary
🪙 GPT-3.5-Turbo-0613 | 1145 | 1164 | | 11 | 46.2 | | OpenAI | Proprietary
🪙 Qwen1.5-32B-Chat | 1144 | 1163 | | | | | Alibaba | Qianwen
🪙 Phi-3-Medium-4k-Instruct | 1144 | 1146 | | 13 | 54.3 | | Microsoft | MIT
🪙 Starling-LM-7B-beta | 1139 | 1151 | | | | | Nexusflow | Apache-2.0
🪙 Mixtral-8x7B-Instruct-v0.1 | 1138 | 1136 | | 5 | 38.7 | | Mistral | Apache 2.0
🪙 GPT-3.5-Turbo-0314 | 1138 | 1136 | | | | | OpenAI | Proprietary
🪙 Granite-3.1-2B-Instruct | 1136 | 1166 | | | | | IBM | Apache 2.0
🪙 Qwen1.5-14B-Chat | 1135 | 1144 | | | | | Alibaba | Qianwen
🪙 Claude-Instant-1 | 1135 | 1136 | | 2 | 43.4 | | Anthropic | Proprietary
🪙 Yi-34B-Chat | 1134 | 1129 | | | | | 01 AI | Yi
🪙 Tulu-2-DPO-70B | 1127 | 1120 | | | | | Ai2 | Ai2 ImpACT
🪙 DBRX-Instruct-Preview | 1126 | 1141 | | | | | Databricks | DBRX
🪙 WizardLM-70B-v1.0 | 1126 | 1093 | | | | | Microsoft | Llama 2
🪙 Llama-2-70B-chat | 1122 | 1099 | | 8 | 40.7 | | Meta | Llama 2
🪙 Nous-Hermes-2-Mixtral-8x7B-DPO | 1119 | 1103 | | | | | NousResearch | Apache-2.0
🪙 Llama-3.2-3B-Instruct | 1118 | 1097 | | 7 | 34.7 | | Meta | Llama 3.2
🪙 Phi-3-Small-8k-Instruct | 1117 | 1123 | | | | | Microsoft | MIT
🪙 OpenChat-3.5-0106 | 1114 | 1119 | | | | | OpenChat | Apache-2.0
🪙 Starling-LM-7B-alpha | 1114 | 1104 | | | | | UC Berkeley | CC-BY-NC-4.0
🪙 Vicuna-33B | 1113 | 1091 | | | | | LMSYS | Non-commercial
🪙 DeepSeek-LLM-67B-Chat | 1111 | 1106 | | 8 | | | DeepSeek | DeepSeek
🪙 Snowflake Arctic Instruct | 1109 | 1101 | | | | | Snowflake | Apache 2.0
🪙 Granite-3.0-8B-Instruct | 1108 | 1115 | | | | | IBM | Apache 2.0
🪙 NV-Llama2-70B-SteerLM-Chat | 1106 | 1047 | | | | | Nvidia | Llama 2
🪙 OpenChat-3.5 | 1103 | 1077 | | | | | OpenChat | Apache-2.0
🪙 Gemma-1.1-7B-it | 1102 | 1105 | | | | | Google | Gemma
🪙 OpenHermes-2.5-Mistral-7B | 1100 | 1083 | | | | | NousResearch | Apache-2.0
🪙 pplx-70B-online | 1099 | 1055 | | | | | Perplexity AI | Proprietary
🪙 Mistral-7B-Instruct-v0.2 | 1097 | 1094 | | 1 | 24.5 | | Mistral | Apache-2.0
🪙 Llama-2-13b-chat | 1093 | 1077 | | 8 | 40.6 | | Meta | Llama 2
🪙 Granite-3.0-2B-Instruct | 1091 | 1104 | | | | | IBM | Apache 2.0
🪙 SOLAR-10.7B-Instruct-v1.0 | 1091 | 1073 | | | | | Upstage AI | CC-BY-NC-4.0
🪙 Qwen1.5-7B-Chat | 1090 | 1110 | | | | | Alibaba | Qianwen
🪙 Phi-3-Mini-4K-Instruct-June-24 | 1088 | 1098 | | | | | Microsoft | MIT
🪙 Dolphin-2.2.1-Mistral-7B | 1088 | 1049 | | | | | Cognitive | Apache-2.0
🪙 WizardLM-13b-v1.2 | 1084 | 1048 | | | | | Microsoft | Llama 2
🪙 Phi-3-Mini-4k-Instruct | 1082 | 1102 | | | | | Microsoft | MIT
🪙 MPT-30B-chat | 1076 | 1055 | | | | | MosaicML | CC-BY-NC-SA-4.0
🪙 Zephyr-7B-beta | 1076 | 1053 | | | | | HuggingFace | MIT
🪙 CodeLlama-34B-instruct | 1073 | 1065 | | | | | Meta | Llama 2
🪙 Llama-3.2-1B-Instruct | 1067 | 1063 | | 1 | 20 | | Meta | Llama 3.2
🪙 Qwen2.5-VL-32B-Instruct | | | 1199 | | | | Alibaba | Apache 2.0
🪙 Step-1o-Vision-32k (highres) | | | 1169 | | | | StepFun | Proprietary
🪙 Qwen2.5-VL-72B-Instruct | | | 1154 | | | | Alibaba | Qwen
🪙 Pixtral-Large-2411 | | | 1138 | 26 | 70.1 | | Mistral | MRL
🪙 Qwen-VL-Max-1119 | | | 1106 | | | | Alibaba | Proprietary
🪙 Qwen2-VL-72b-Instruct | | | 1094 | | | | Alibaba | Qwen
🪙 Step-1V-32K | | | 1093 | | | | StepFun | Proprietary
🪙 Molmo-72B-0924 | | | 1060 | | | | Ai2 | Apache 2.0
🪙 Pixtral-12B-2409 | | | 1056 | 11 | 47.3 | | Mistral | Apache 2.0
🪙 InternVL2-26B | | | 1053 | | | | OpenGVLab | MIT
🪙 Llama-3.2-90B-Vision-Instruct | | | 1047 | 22 | 67.1 | | Meta | Llama 3.2
🪙 Hunyuan-Standard-Vision-2024-12-31 | | | 1046 | | | | Tencent | Proprietary
🪙 Aya-Vision-32B | | | 1042 | | | | Cohere | CC-BY-NC-4.0
🪙 Qwen2-VL-7B-Instruct | | | 1038 | | | | Alibaba | Apache 2.0
🪙 Yi-Vision | | | 1028 | | | | 01 AI | Proprietary
🪙 Llama-3.2-11B-Vision-Instruct | | | 1014 | 13 | 46.4 | | Meta | Llama 3.2
🪙 Molmo-7B-D-0924 | | | 1007 | | | | Ai2 | Apache 2.0

💡 AAII v2.2 incorporates 8 evaluations: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME, IFBench, AA-LCR.

💻 Arena Elo ratings are computed by this notebook. Higher values are better for all benchmarks. Empty cells mean not available.

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It has been useful for transforming pairwise human preferences into Elo ratings that serve as a predictor of the win rate between models. Specifically, if player A has a rating of $R_A$ and player B a rating of $R_B$, the probability of player A winning is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}.$$
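As a quick numerical check of the win-probability formula, here is a minimal Python sketch. The function name `expected_score` is an illustrative choice, not something taken from the Arena notebook; the two ratings are the Arena Elo values listed in the leaderboard for Gemini-2.5-Pro (1468) and Grok-4-0709 (1433).

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


# A 35-point gap translates into only a modest edge.
p = expected_score(1468, 1433)
print(f"{p:.3f}")  # about 0.55

# Equal ratings give an even chance, and the two sides always sum to 1.
print(expected_score(1500, 1500))                               # 0.5
print(expected_score(1468, 1433) + expected_score(1433, 1468))
```

Under this formula a 400-point gap corresponds to 10:1 odds, so small differences near the top of the leaderboard imply win rates close to 50%.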

The Elo rating has been used by the international community to rank chess players for over 60 years. Standard Elo rating systems assume that a player’s performance changes over time, so an online algorithm is needed to capture such dynamics, meaning recent games should weigh more than older games. Specifically, after each game, a player’s rating is updated according to the difference between the predicted outcome and the actual outcome.

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows players’ performance to change dynamically: it does not assume a fixed unknown value for a player’s rating.

This ability to adapt is determined by the parameter K, which controls the magnitude of the rating change after each game. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance “converges”, a smaller value of K is more appropriate. As a result, the USCF sets K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.
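The role of K can be illustrated with a minimal sketch of the standard online update, in which each rating moves by K times the gap between the actual and predicted outcome. The function names and the K values used here (4 and 32) are illustrative assumptions, not the settings used by the Arena or the USCF.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one game.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k is an illustrative default, not a value taken from the source.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Same game, different K: A beats an equally rated B.
print(elo_update(1000, 1000, 1.0, k=4))   # (1002.0, 998.0)
print(elo_update(1000, 1000, 1.0, k=32))  # (1016.0, 984.0)
```

With the larger K a single result moves the ratings eight times as far, which is why a large K tracks fast-improving players well but makes established ratings noisy.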

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K so that ratings were sufficiently stable while still allowing new models to move up the leaderboard quickly. We ultimately decided to adopt a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. This provided consistent, stable scores and allowed us to incorporate new models quickly.
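A rough sketch of that shuffling idea follows. It replays the same battles in many random orders with the online update and summarizes each model's scores. The battle format `(model_a, model_b, score_a)`, the use of the median as the summary, the K value, and the toy data are all assumptions made for illustration; the source only states that scores were sampled from 1000 permutations.

```python
import random
from collections import defaultdict
from statistics import median


def online_elo(battles, k=4.0, init=1000.0):
    """Run the online Elo update over battles in the given order."""
    ratings = defaultdict(lambda: init)
    for a, b, score_a in battles:
        e_a = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = k * (score_a - e_a)
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)


def shuffled_elo(battles, n_rounds=1000, k=4.0, seed=0):
    """Median online-Elo score per model over n_rounds random orderings."""
    rng = random.Random(seed)
    order = list(battles)
    samples = defaultdict(list)
    for _ in range(n_rounds):
        rng.shuffle(order)  # a new permutation of the same games
        for model, rating in online_elo(order, k=k).items():
            samples[model].append(rating)
    return {model: median(values) for model, values in samples.items()}


# Hypothetical data: model "A" beats model "B" in 7 of 10 battles.
battles = [("A", "B", 1.0)] * 7 + [("A", "B", 0.0)] * 3
scores = shuffled_elo(battles)
print(scores)
```

Because the summary is taken across many orderings, the result no longer depends on which games happened to come last, which is the instability the single-pass online algorithm suffers from.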

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models and so we don’t need a decentralized algorithm. Second, most models are static (we have access to the weights) and so we don’t expect their performance to change. However, it is worth noting that the hosted proprietary models may not be static and their behavior can change without notice. We try our best to pin specific model API versions if possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system, the Bradley–Terry (BT) model. The BT model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model under the assumption of a fixed but unknown pairwise win rate. Like the Elo rating, the BT model uses pairwise comparisons to derive player ratings, which in turn estimate the win rate between any two players. The core difference between the BT model and the online Elo system is that the BT model assumes a player’s performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
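To make the contrast with the online update concrete, here is a small self-contained sketch of a Bradley–Terry fit. It uses the classic minorization-maximization iteration, which is one standard way to compute the MLE and is not necessarily the method used in the Arena notebook. The function name, the `(winner, loser)` input format, the 400/1000 Elo-style scaling, and the toy data are assumptions; ties are ignored, and every model is assumed to have at least one win and one loss.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, n_iter=200, scale=400.0, base=1000.0):
    """Fit BT strengths from (winner, loser) pairs; return Elo-scale ratings."""
    wins = defaultdict(float)        # total wins per model
    pair_games = defaultdict(float)  # games played per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        pair_games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n_games in pair_games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n_games / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Fix the scale: normalize so the geometric mean strength is 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}

    # p_i / (p_i + p_j) equals the Elo formula when R = base + scale * log10(p).
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical data: "A" beats "B" three times out of four.
battles = [("A", "B")] * 3 + [("B", "A")]
ratings = fit_bradley_terry(battles)
print(ratings["A"] - ratings["B"])  # close to 400 * log10(3)
```

Since the fit depends only on win counts per pair, reordering `battles` yields the same ratings, which mirrors the assumption that game order does not matter.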