Chatbot Arena +

Attribution OproAI   2025 Aug 1

This leaderboard is based on the following benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 3.5M+ user votes to compute Elo ratings.
  • AAI - Artificial Analysis Intelligence Index aggregating 7 challenging evaluations.
  • MMLU-Pro - an enhanced version of MMLU with 12,000 graduate-level questions across 14 subject areas.

Model | Arena Elo | Coding | Vision | AAI | MMLU-Pro | Votes | Organization | License
🥇 Gemini-2.5-Pro | 1470 | 1474 | 1318 | 70.5 | 86.2 | 25480 | Google | Proprietary
🥇 Grok-4-0709 | 1435 | 1442 | 1272 | 73.2 | 86.6 | 12591 | xAI | Proprietary
🥇 Qwen3-235B-A22B-Instruct-2507 | 1431 | 1464 | | 60.4 | 82.8 | 3386 | Alibaba | Apache 2.0
🥇 ChatGPT-4o-latest (2025-03-26) | 1429 | 1435 | 1297 | 50.3 | 80.3 | 30344 | OpenAI | Proprietary
🥇 o3-2025-04-16 | 1428 | 1445 | 1272 | 70 | 85.3 | 31450 | OpenAI | Proprietary
🥇 DeepSeek-R1-0528 | 1425 | 1436 | | 68.3 | 84.9 | 17934 | DeepSeek | MIT
🥇 Grok-3-Preview-02-24 | 1424 | 1439 | | 56.1 | 79.9 | 31348 | xAI | Proprietary
🥇 GPT-4.5-Preview | 1415 | 1419 | 1239 | 53 | 81 | 15271 | OpenAI | Proprietary
🥇 Qwen3-235B-A22B-Thinking-2507 | 1413 | 1442 | | 68.9 | 84.3 | 2777 | Alibaba | Apache 2.0
🥇 Gemini-2.5-Flash | 1413 | 1421 | 1281 | 65.1 | 83.2 | 30765 | Google | Proprietary
🥈 Gemini-2.0-Pro-Exp-02-05 | 1397 | 1398 | 1220 | 49.2 | 80.5 | 20120 | Google | Proprietary
🥈 Gemini-2.0-Flash-Thinking-Exp-01-21 | 1397 | 1383 | 1258 | 52.3 | 79.8 | 27555 | Google | Proprietary
🥈 Qwen3-235B-A22B-no-thinking | 1389 | 1411 | | 60.4 | 82.8 | 24015 | Alibaba | Apache 2.0
🥈 kimi-k2-0711-preview | 1381 | 1407 | | 57.6 | 82.4 | 10934 | Moonshot | Modified MIT
🥈 GPT-4.1-2025-04-14 | 1381 | 1397 | 1254 | 52.9 | 80.6 | 24381 | OpenAI | Proprietary
🥈 DeepSeek-V3-0324 | 1376 | 1391 | | 53.2 | 81.9 | 26999 | DeepSeek | MIT
🥈 Hunyuan-Turbos-20250416 | 1376 | 1388 | | | 78 | 11682 | Tencent | Proprietary
🥈 Claude Opus 4 (thinking-16k) | 1375 | 1433 | 1223 | 64.4 | 87.3 | 17512 | Anthropic | Proprietary
🥈 DeepSeek-R1 | 1373 | 1382 | | 60.2 | 84.4 | 19430 | DeepSeek | MIT
🥈 Mistral Medium 3 | 1370 | 1387 | 1195 | 49 | 76 | 27612 | Mistral | Proprietary
🥈 Gemini-2.0-Flash-Exp | 1370 | 1371 | 1241 | 48.1 | 78.2 | 22500 | Google | Proprietary
🥈 Claude Opus 4 (20250514) | 1367 | 1406 | 1212 | 57.7 | 86 | 25729 | Anthropic | Proprietary
🥈 Qwen2.5-Max | 1367 | 1373 | | 45.3 | 76.2 | 35504 | Alibaba | Proprietary
🥈 o1-2024-12-17 | 1366 | 1378 | 1216 | 61.9 | 84.1 | 29038 | OpenAI | Proprietary
🥈 Qwen3-235B-A22B | 1365 | 1394 | | 62.3 | 82.8 | 20118 | Alibaba | Apache 2.0
🥈 Grok-3-mini-high | 1362 | 1376 | | 66.7 | 82.8 | 9780 | xAI | Proprietary
🥈 o4-mini-2025-04-16 | 1361 | 1385 | 1239 | 69.8 | 83.2 | 24037 | OpenAI | Proprietary
🥈 Qwen3-Coder-480B-A35B-Instruct | 1357 | 1402 | | 56.7 | 78.8 | 5547 | Alibaba | Apache 2.0
🥈 Gemma-3-27B-it | 1357 | 1348 | 1212 | 37.6 | 66.9 | 31799 | Google | Gemma
🥈 Minimax-M1 | 1355 | 1369 | | 63 | 81.6 | 16941 | MiniMax | Apache 2.0
🥈 Claude Sonnet 4 (thinking-32k) | 1351 | 1413 | 1224 | 62.9 | 84.2 | 16700 | Anthropic | Proprietary
🥈 Qwen3-32B | 1342 | 1376 | | 59.2 | 79.8 | 4074 | Alibaba | Apache 2.0
🥈 Claude Sonnet 4 (20250514) | 1340 | 1387 | 1212 | 53 | 83.7 | 22210 | Anthropic | Proprietary
🥈 o3-mini-high | 1338 | 1380 | | 65.5 | 80.2 | 19404 | OpenAI | Proprietary
🥉 GPT-4.1-mini-2025-04-14 | 1336 | 1368 | 1229 | 52.6 | 78.1 | 23459 | OpenAI | Proprietary
🥉 Mistral-Small-2506 | 1335 | 1362 | 1186 | 42.3 | 68.1 | 8714 | Mistral | Apache 2.0
🥉 Gemma-3-12B-it | 1335 | 1310 | | 33.8 | 59.5 | 3976 | Google | Gemma
🥉 DeepSeek-V3 | 1334 | 1337 | | 45.6 | 75.2 | 22841 | DeepSeek | DeepSeek
🥉 GLM-4-Plus-0111 | 1332 | 1310 | | | 78.6 | 6028 | Zhipu | Proprietary
🥉 QwQ-32B | 1331 | 1349 | | 58.1 | 76.4 | 21864 | Alibaba | Apache 2.0
🥉 Gemini-2.0-Flash-Lite | 1331 | 1338 | 1147 | 41.4 | 72.4 | 26104 | Google | Proprietary
🥉 Qwen-Plus-0125 | 1328 | 1339 | | | | 6055 | Alibaba | Proprietary
🥉 Command A (03-2025) | 1326 | 1334 | | 40 | 71.2 | 30558 | Cohere | CC-BY-NC-4.0
🥉 Amazon-Nova-Chat-05-14 | 1322 | 1334 | | 42.6 | 73.3 | 13462 | Amazon | Proprietary
🥉 Qwen3-30B-A3B | 1321 | 1346 | | 55.6 | 77.7 | 19956 | Alibaba | Apache 2.0
🥉 Llama-3.1-Nemotron-Ultra-253B-v1 | 1321 | 1345 | | 60.8 | 82.5 | 2656 | Nvidia | Nvidia Open Model
🥉 Step-2-16K-Exp | 1321 | 1313 | | | | 5126 | StepFun | Proprietary
🥉 Gemini-1.5-Pro-002 | 1320 | 1310 | 1208 | 44.6 | 75 | 58645 | Google | Proprietary
🥉 o1-mini | 1319 | 1366 | | 53.8 | 74.2 | 54951 | OpenAI | Proprietary
🥉 o3-mini | 1318 | 1360 | | 62.9 | 79.1 | 42792 | OpenAI | Proprietary
🥉 Claude 3.7 Sonnet (thinking-32k) | 1318 | 1355 | 1205 | 57.4 | 83.7 | 32207 | Anthropic | Proprietary
🥉 Hunyuan-Turbo-0110 | 1314 | 1335 | | | | 2510 | Tencent | Proprietary
🥉 Llama-3.3-Nemotron-Super-49B-v1 | 1310 | 1320 | | 51.2 | 78.5 | 2371 | Nvidia | Nvidia Open Model
🥉 Grok-2-08-13 | 1306 | 1298 | | 39.2 | 70.9 | 67084 | xAI | Proprietary
🥉 Gemma-3n-e4b-it | 1304 | 1297 | | 28 | 48.8 | 12786 | Google | Gemma
🥉 Yi-Lightning | 1303 | 1321 | | | | 28968 | 01 AI | Proprietary
🥉 Claude 3.7 Sonnet | 1302 | 1342 | 1195 | 48.2 | 80.3 | 37199 | Anthropic | Proprietary
🥉 GPT-4o-2024-05-13 | 1302 | 1307 | 1184 | 41.5 | 74.8 | 117747 | OpenAI | Proprietary
🥉 Claude 3.5 Sonnet (20241022) | 1299 | 1341 | 1172 | 44.4 | 77.2 | 84101 | Anthropic | Proprietary
🪙 Deepseek-v2.5-1210 | 1296 | 1316 | | 35.3 | 67.2 | 7243 | DeepSeek | DeepSeek
🪙 Athene-v2-Chat-72B | 1294 | 1320 | | | | 26074 | NexusFlow | NexusFlow
🪙 Gemma-3-4B-it | 1293 | 1265 | | 25.4 | 41.7 | 4321 | Google | Gemma
🪙 Llama-4-Maverick-17B-128E-Instruct | 1292 | 1312 | 1185 | 50.5 | 80.9 | 24088 | Meta | Llama 4
🪙 GLM-4-Plus | 1292 | 1301 | | | 70.2 | 27788 | Zhipu | Proprietary
🪙 Hunyuan-Large-2025-02-10 | 1291 | 1311 | | | | 3856 | Tencent | Proprietary
🪙 Gemini-1.5-Flash-002 | 1290 | 1273 | 1187 | 39 | 68 | 37021 | Google | Proprietary
🪙 GPT-4o-mini-2024-07-18 | 1289 | 1300 | 1113 | 35.7 | 64.8 | 72429 | OpenAI | Proprietary
🪙 GPT-4.1-nano-2025-04-14 | 1287 | 1312 | 1109 | 41 | 65.7 | 6302 | OpenAI | Proprietary
🪙 Llama-3.1-405B-Instruct-bf16 | 1286 | 1299 | | 40.5 | 73.2 | 43788 | Meta | Llama 3.1
🪙 Llama-3.1-Nemotron-70B-Instruct | 1285 | 1289 | | 37.3 | 69 | 7577 | Nvidia | Llama 3.1
🪙 Qwen-Max-0919 | 1284 | 1296 | | | | 17432 | Alibaba | Qwen
🪙 Llama-3.1-405B-Instruct-fp8 | 1284 | 1292 | | 40.5 | 73.2 | 63038 | Meta | Llama 3.1
🪙 Yi-Lightning-lite | 1284 | 1286 | | | | 17067 | 01 AI | Proprietary
🪙 Claude 3.5 Sonnet (20240620) | 1283 | 1309 | 1167 | 40 | 75.1 | 86159 | Anthropic | Proprietary
🪙 Grok-2-Mini-08-13 | 1283 | 1279 | | | | 55442 | xAI | Proprietary
🪙 Hunyuan-Standard-2025-02-10 | 1276 | 1289 | | | | 4014 | Tencent | Proprietary
🪙 Deepseek-v2.5 | 1275 | 1306 | | 34.7 | 66.2 | 26344 | DeepSeek | DeepSeek
🪙 Llama-4-Scout-17B-16E-Instruct | 1275 | 1289 | 1169 | 43 | 75.2 | 13727 | Meta | Llama 4
🪙 GPT-4-Turbo-2024-04-09 | 1275 | 1280 | 1137 | 38.8 | 69.4 | 102133 | OpenAI | Proprietary
🪙 Llama-3.3-70B-Instruct | 1275 | 1279 | | 41.1 | 71.3 | 51606 | Meta | Llama 3.3
🪙 Qwen2.5-72B-Instruct | 1272 | 1302 | | 40.4 | 72 | 41519 | Alibaba | Qwen
🪙 Hunyuan-Large-Vision | 1270 | 1298 | 1238 | | | 6115 | Tencent | Proprietary
🪙 Mistral-Small-3.1-24B-Instruct-2503 | 1269 | 1293 | 1162 | 35.3 | 65.9 | 13698 | Mistral | Apache 2.0
🪙 Mistral-Large-2411 | 1269 | 1284 | | 38.3 | 69.7 | 29633 | Mistral | MRL
🪙 Athene-70B | 1268 | 1274 | | | | 20580 | NexusFlow | CC-BY-NC-4.0
🪙 GPT-4-1106-preview | 1267 | 1268 | | 36 | 63.7 | 103748 | OpenAI | Proprietary
🪙 GPT-4-0125-preview | 1266 | 1261 | | | | 97079 | OpenAI | Proprietary
🪙 Claude 3 Opus | 1265 | 1269 | 1070 | 35.1 | 69.6 | 202641 | Anthropic | Proprietary
🪙 Llama-3.1-70B-Instruct | 1265 | 1268 | | 35.4 | 67.6 | 58637 | Meta | Llama 3.1
🪙 Amazon Nova Pro 1.0 | 1262 | 1282 | 1029 | 37.1 | 69.1 | 26371 | Amazon | Proprietary
🪙 Llama-3.1-Tulu-3-70B | 1260 | 1251 | | | | 3010 | Ai2 | Llama 3.1
🪙 Claude 3.5 Haiku (20241022) | 1256 | 1287 | 1144 | 34.7 | 63.4 | 55498 | Anthropic | Proprietary
🪙 magistral-medium-2506 | 1252 | 1307 | | 56 | 75.3 | 8240 | Mistral | Proprietary
🪙 Reka-Core-20240904 | 1252 | 1238 | | 33.9 | | 7948 | Reka AI | Proprietary
🪙 Reka-Core-20240722 | 1250 | 1226 | | | | 13279 | Reka AI | Proprietary
🪙 Qwen-Plus-0828 | 1242 | 1263 | | | | 14626 | Alibaba | Proprietary
🪙 Jamba-1.5-Large | 1242 | 1244 | | 29.3 | 57.2 | 9125 | AI21 Labs | Jamba Open
🪙 Deepseek-v2-API-0628 | 1240 | 1260 | | | | 19508 | DeepSeek | DeepSeek
🪙 Mistral-Small-24B-Instruct-2501 | 1238 | 1250 | | 35.3 | 65.2 | 15321 | Mistral | Apache 2.0
🪙 Deepseek-Coder-v2-0724 | 1237 | 1286 | | 29 | 58.5 | 11725 | DeepSeek | DeepSeek
🪙 Yi-Large | 1236 | 1238 | | 27.9 | 58.6 | 16624 | 01 AI | Proprietary
🪙 Gemma-2-27B-it | 1236 | 1226 | | 31.7 | 57.5 | 79538 | Google | Gemma
🪙 Qwen2.5-Coder-32B-Instruct | 1235 | 1279 | | 36.3 | 63.5 | 5730 | Alibaba | Apache 2.0
🪙 Amazon Nova Lite 1.0 | 1233 | 1253 | 1039 | 32.5 | 59 | 20646 | Amazon | Proprietary
🪙 Gemma-2-9B-it-SimPO | 1233 | 1211 | | | | 10548 | Princeton | MIT
🪙 Command R+ (08-2024) | 1233 | 1200 | | 21.5 | 43.2 | 10535 | Cohere | CC-BY-NC-4.0
🪙 GLM-4-0520 | 1231 | 1236 | | | | 10221 | Zhipu | Proprietary
🪙 Gemini-1.5-Flash-8B-001 | 1231 | 1228 | 1092 | 30.8 | 56.9 | 37697 | Google | Proprietary
🪙 Llama-3.1-Nemotron-51B-Instruct | 1231 | 1227 | | | | 3889 | Nvidia | Llama 3.1
🪙 Nemotron-4-340B-Instruct | 1229 | 1220 | | | | 20608 | Nvidia | Nvidia Open Model
🪙 Aya-Expanse-32B | 1229 | 1211 | | 20.1 | 37.7 | 28768 | Cohere | CC-BY-NC-4.0
🪙 Llama-3-70B-Instruct | 1225 | 1216 | | 27.5 | 57.4 | 163629 | Meta | Llama 3
🪙 Reka-Flash-20240904 | 1225 | 1208 | | 33.6 | | 8132 | Reka AI | Proprietary
🪙 Claude 3 Sonnet | 1223 | 1232 | 1033 | 27.8 | 57.9 | 113067 | Anthropic | Proprietary
🪙 OLMo-2-0325-32B-Instruct | 1223 | 1215 | | | | 3460 | Ai2 | Apache-2.0
🪙 Phi-4 | 1222 | 1242 | | 40.2 | 71.4 | 25213 | Microsoft | MIT
🪙 Reka-Flash-20240722 | 1218 | 1201 | | | | 13725 | Reka AI | Proprietary
🪙 Amazon Nova Micro 1.0 | 1215 | 1228 | | 28.3 | 53.1 | 20654 | Amazon | Proprietary
🪙 Gemma-2-9B-it | 1213 | 1194 | | 22.2 | 49.5 | 57197 | Google | Gemma
🪙 Hunyuan-Standard-256K | 1209 | 1244 | | | | 2901 | Tencent | Proprietary
🪙 Command R+ (04-2024) | 1209 | 1184 | | 19.9 | 42.7 | 80846 | Cohere | CC-BY-NC-4.0
🪙 Qwen2-72B-Instruct | 1208 | 1206 | | 32.6 | 62.2 | 38872 | Alibaba | Qianwen
🪙 Claude 3 Haiku | 1200 | 1208 | 1000 | 24.1 | 50 | 122309 | Anthropic | Proprietary
🪙 Llama-3.1-Tulu-3-8B | 1200 | 1197 | | | | 3074 | Ai2 | Llama 3.1
🪙 Qwen-Max-0428 | 1199 | 1208 | | | | 25696 | Alibaba | Proprietary
🪙 Ministral-8B-2410 | 1198 | 1219 | | 22.3 | 38.9 | 5111 | Mistral | MRL
🪙 GLM-4-0116 | 1198 | 1209 | | | | 7579 | Zhipu | Proprietary
🪙 DeepSeek-Coder-V2-Instruct | 1197 | 1259 | | | | 15753 | DeepSeek | DeepSeek
🪙 Command R (08-2024) | 1195 | 1180 | | 14.8 | 33.8 | 10851 | Cohere | CC-BY-NC-4.0
🪙 Llama-3.1-8B-Instruct | 1193 | 1203 | | 23.7 | 47.6 | 52578 | Meta | Llama 3.1
🪙 Jamba-1.5-Mini | 1193 | 1196 | | | | 9274 | AI21 Labs | Jamba Open
🪙 Aya-Expanse-8B | 1193 | 1184 | | 16 | 31.2 | 10391 | Cohere | CC-BY-NC-4.0
🪙 Qwen1.5-110B-Chat | 1180 | 1192 | | 25 | | 27430 | Alibaba | Qianwen
🪙 Claude-1 | 1179 | 1161 | | | | 21149 | Anthropic | Proprietary
🪙 Yi-1.5-34B-Chat | 1178 | 1181 | | | | 25135 | 01 AI | Apache-2.0
🪙 Qwen1.5-72B-Chat | 1172 | 1175 | | | | 40658 | Alibaba | Qianwen
🪙 Mistral Medium | 1171 | 1172 | | 22.8 | 49.1 | 35556 | Mistral | Proprietary
🪙 Llama-3-8B-Instruct | 1171 | 1164 | | 21.5 | 40.5 | 109056 | Meta | Llama 3
🪙 Command R (04-2024) | 1170 | 1141 | | 14.7 | 33.7 | 56398 | Cohere | CC-BY-NC-4.0
🪙 InternLM2.5-20B-chat | 1168 | 1179 | | | | 10599 | InternLM | Other
🪙 Mixtral-8x22b-Instruct-v0.1 | 1168 | 1175 | | 26.2 | 53.7 | 53751 | Mistral | Apache 2.0
🪙 Gemma-2-2b-it | 1163 | 1130 | | | | 48892 | Google | Gemma
🪙 Granite-3.1-8B-Instruct | 1158 | 1191 | | | | 3289 | IBM | Apache 2.0
🪙 Claude-2.0 | 1158 | 1160 | | 23 | 48.6 | 12763 | Anthropic | Proprietary
🪙 Gemini-1.0-Pro-001 | 1155 | 1125 | | | | 18800 | Google | Proprietary
🪙 Zephyr-ORPO-141b-A35b-v0.1 | 1150 | 1144 | | | | 4854 | HuggingFace | Apache 2.0
🪙 GPT-3.5-Turbo-0613 | 1146 | 1164 | | 22.7 | 46.2 | 38955 | OpenAI | Proprietary
🪙 Claude-2.1 | 1146 | 1158 | | 24 | 49.5 | 37699 | Anthropic | Proprietary
🪙 Qwen1.5-32B-Chat | 1144 | 1163 | | | | 22765 | Alibaba | Qianwen
🪙 Phi-3-Medium-4k-Instruct | 1144 | 1145 | | 24.5 | 54.3 | 26105 | Microsoft | MIT
🪙 Starling-LM-7B-beta | 1139 | 1151 | | | | 16676 | Nexusflow | Apache-2.0
🪙 Mixtral-8x7B-Instruct-v0.1 | 1138 | 1136 | | 17 | 38.7 | 76126 | Mistral | Apache 2.0
🪙 GPT-3.5-Turbo-0314 | 1138 | 1136 | | | | 5640 | OpenAI | Proprietary
🪙 Granite-3.1-2B-Instruct | 1136 | 1166 | | | | 3380 | IBM | Apache 2.0
🪙 Qwen1.5-14B-Chat | 1135 | 1144 | | | | 18687 | Alibaba | Qianwen
🪙 Claude-Instant-1 | 1135 | 1136 | | | | 20631 | Anthropic | Proprietary
🪙 Yi-34B-Chat | 1134 | 1129 | | | | 15917 | 01 AI | Yi
🪙 Tulu-2-DPO-70B | 1127 | 1120 | | | | 6658 | Ai2 | Ai2 ImpACT
🪙 DBRX-Instruct-Preview | 1126 | 1141 | | | | 33743 | Databricks | DBRX
🪙 WizardLM-70B-v1.0 | 1126 | 1093 | | | | 8383 | Microsoft | Llama 2
🪙 Llama-2-70B-chat | 1122 | 1099 | | 20 | 40.7 | 39595 | Meta | Llama 2
🪙 Nous-Hermes-2-Mixtral-8x7B-DPO | 1119 | 1103 | | | | 3836 | NousResearch | Apache-2.0
🪙 Llama-3.2-3B-Instruct | 1118 | 1097 | | 19.5 | 34.7 | 8390 | Meta | Llama 3.2
🪙 Phi-3-Small-8k-Instruct | 1117 | 1122 | | | | 18476 | Microsoft | MIT
🪙 OpenChat-3.5-0106 | 1114 | 1119 | | | | 12990 | OpenChat | Apache-2.0
🪙 Starling-LM-7B-alpha | 1114 | 1104 | | | | 10415 | UC Berkeley | CC-BY-NC-4.0
🪙 Vicuna-33B | 1113 | 1091 | | | | 22936 | LMSYS | Non-commercial
🪙 DeepSeek-LLM-67B-Chat | 1111 | 1106 | | 20 | | 4988 | DeepSeek | DeepSeek
🪙 Snowflake Arctic Instruct | 1109 | 1101 | | | | 34173 | Snowflake | Apache 2.0
🪙 Granite-3.0-8B-Instruct | 1108 | 1115 | | | | 7002 | IBM | Apache 2.0
🪙 NV-Llama2-70B-SteerLM-Chat | 1106 | 1047 | | | | 3636 | Nvidia | Llama 2
🪙 Gemma-1.1-7B-it | 1103 | 1105 | | | | 25070 | Google | Gemma
🪙 OpenChat-3.5 | 1103 | 1077 | | | | 8106 | OpenChat | Apache-2.0
🪙 OpenHermes-2.5-Mistral-7B | 1100 | 1083 | | | | 5088 | NousResearch | Apache-2.0
🪙 pplx-70B-online | 1099 | 1054 | | | | 6898 | Perplexity AI | Proprietary
🪙 Mistral-7B-Instruct-v0.2 | 1097 | 1094 | | 10.1 | 24.5 | 20067 | Mistral | Apache-2.0
🪙 Llama-2-13b-chat | 1093 | 1077 | | 19.9 | 40.6 | 19722 | Meta | Llama 2
🪙 SOLAR-10.7B-Instruct-v1.0 | 1092 | 1073 | | | | 4286 | Upstage AI | CC-BY-NC-4.0
🪙 Granite-3.0-2B-Instruct | 1091 | 1104 | | | | 7191 | IBM | Apache 2.0
🪙 Qwen1.5-7B-Chat | 1090 | 1110 | | | | 4872 | Alibaba | Qianwen
🪙 Phi-3-Mini-4K-Instruct-June-24 | 1088 | 1098 | | | | 12808 | Microsoft | MIT
🪙 Dolphin-2.2.1-Mistral-7B | 1088 | 1049 | | | | 1714 | Cognitive | Apache-2.0
🪙 WizardLM-13b-v1.2 | 1084 | 1048 | | | | 7176 | Microsoft | Llama 2
🪙 Phi-3-Mini-4k-Instruct | 1082 | 1102 | | | | 21097 | Microsoft | MIT
🪙 MPT-30B-chat | 1076 | 1054 | | | | 2644 | MosaicML | CC-BY-NC-SA-4.0
🪙 Zephyr-7B-beta | 1076 | 1053 | | | | 11321 | HuggingFace | MIT
🪙 CodeLlama-34B-instruct | 1073 | 1065 | | | | 7509 | Meta | Llama 2
🪙 Llama-3.2-1B-Instruct | 1067 | 1063 | | 9.7 | 20 | 8523 | Meta | Llama 3.2
🪙 Qwen2.5-VL-32B-Instruct | | | 1198 | | | 1505 | Alibaba | Apache 2.0
🪙 Step-1o-Vision-32k (highres) | | | 1169 | | | 2891 | StepFun | Proprietary
🪙 Qwen2.5-VL-72B-Instruct | | | 1154 | | | 3884 | Alibaba | Qwen
🪙 Pixtral-Large-2411 | | | 1137 | 37.4 | 70.1 | 5546 | Mistral | MRL
🪙 Qwen-VL-Max-1119 | | | 1106 | | | 1449 | Alibaba | Proprietary
🪙 Qwen2-VL-72b-Instruct | | | 1094 | | | 6028 | Alibaba | Qwen
🪙 Step-1V-32K | | | 1093 | | | 1553 | StepFun | Proprietary
🪙 Molmo-72B-0924 | | | 1060 | | | 3092 | Ai2 | Apache 2.0
🪙 Pixtral-12B-2409 | | | 1056 | 23.4 | 47.3 | 7623 | Mistral | Apache 2.0
🪙 InternVL2-26B | | | 1053 | | | 5265 | OpenGVLab | MIT
🪙 Llama-3.2-90B-Vision-Instruct | | | 1047 | 33.4 | 67.1 | 8829 | Meta | Llama 3.2
🪙 Hunyuan-Standard-Vision-2024-12-31 | | | 1046 | | | 811 | Tencent | Proprietary
🪙 Aya-Vision-32B | | | 1042 | | | 849 | Cohere | CC-BY-NC-4.0
🪙 Qwen2-VL-7B-Instruct | | | 1038 | | | 5854 | Alibaba | Apache 2.0
🪙 Yi-Vision | | | 1028 | | | 1237 | 01 AI | Proprietary
🪙 Llama-3.2-11B-Vision-Instruct | | | 1015 | 25 | 46.4 | 4893 | Meta | Llama 3.2
🪙 Molmo-7B-D-0924 | | | 1007 | | | 2854 | Ai2 | Apache 2.0

🔗 Arena Statistics

Vote | Blog | GitHub | Paper | Dataset

💻 Code: The Arena Elo ratings are computed by this notebook. Higher values are better for all benchmarks. Empty cells mean not available.

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It has been useful for transforming pairwise human preferences into Elo ratings, which serve as a predictor of the win rate between models. Specifically, if player A has a rating of $R_A$ and player B a rating of $R_B$, the probability of player A winning is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}.$$
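As a quick check of the formula, here is a minimal Python sketch. The function name `expected_score` is our own choice for illustration; the two ratings are the Arena Elo values of the top two rows of the leaderboard above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Gemini-2.5-Pro (1470) vs Grok-4-0709 (1435), ratings from the leaderboard.
p = expected_score(1470, 1435)
print(f"{p:.3f}")  # about 0.550
```

A 35-point gap therefore corresponds to only about a 55% predicted win rate, which is why small rating differences near the top of the table should not be over-interpreted.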

The Elo rating system has been used to rank chess players internationally for over 60 years. Standard Elo rating systems assume that a player's performance changes over time, so an online algorithm is needed to capture these dynamics, meaning recent games should weigh more than older games. Specifically, after each game, a player's rating is updated according to the difference between the predicted outcome and the actual outcome.

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows a player's performance to change dynamically; it does not assume a fixed unknown value for the player's rating.

This ability to adapt is determined by the parameter K, which controls the magnitude of rating changes that can affect the overall result. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance “converges”, a smaller value of K is more appropriate. As a result, the USCF adopted a K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.
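The role of K can be illustrated with a short sketch of the standard online update, in which each rating moves by K times the gap between the actual and predicted outcome. The function names and the two K values (4 and 32) are illustrative assumptions, not the settings used by the Arena or the USCF.

```python
def expected_score(rating_a, rating_b):
    """Predicted probability that A beats B (Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k):
    """Return both ratings after one game.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated players; A wins. A larger K moves the ratings further.
for k in (4, 32):  # illustrative values only
    print(k, update_elo(1000.0, 1000.0, 1.0, k))
# 4 (1002.0, 998.0)
# 32 (1016.0, 984.0)
```

With equal ratings the predicted outcome is 0.5, so a win shifts each rating by exactly K/2.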

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K so that ratings were sufficiently stable while still allowing new models to move up the leaderboard quickly. We ultimately decided to adopt a bootstrap-like technique: shuffle the data and sample Elo scores from 1,000 permutations of the online games. This provided consistent, stable scores and allowed us to incorporate new models quickly.
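A minimal sketch of this shuffling idea follows. It is not the Arena's actual code: the battle format `(model_a, model_b, winner)`, the function names, the choice K = 4, the initial rating of 1000, and the use of the median to summarise the sampled scores are all assumptions made for illustration.

```python
import random
import statistics


def online_elo(battles, k=4.0, init=1000.0):
    """Run the online Elo update once over battles in the given order."""
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, init)
        rb = ratings.setdefault(model_b, init)
        ea = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        sa = {"a": 1.0, "b": 0.0}.get(winner, 0.5)  # anything else is a tie
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb - k * (sa - ea)
    return ratings


def shuffled_elo(battles, rounds=1000, seed=0):
    """Median rating per model over `rounds` random orderings of the games."""
    rng = random.Random(seed)
    battles = list(battles)  # copy so the caller's list is not reordered
    samples = {}
    for _ in range(rounds):
        rng.shuffle(battles)
        for model, rating in online_elo(battles).items():
            samples.setdefault(model, []).append(rating)
    return {m: statistics.median(v) for m, v in samples.items()}


# Toy data: hypothetical model "x" wins 60 of 100 games against "y".
toy = [("x", "y", "a")] * 60 + [("x", "y", "b")] * 40
print(shuffled_elo(toy, rounds=200))
```

Because each permutation replays the same games in a different order, the spread of the sampled scores also gives a rough sense of how much the online result depends on game order.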

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models, so we don't need a decentralized algorithm. Second, most models are static (we have access to the weights), so we don't expect their performance to change. However, it is worth noting that hosted proprietary models may not be static, and their behavior can change without notice. We try our best to pin specific model API versions where possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley-Terry (BT) model. The BT model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model, assuming a fixed but unknown pairwise win rate. Like Elo rating, the BT model uses pairwise comparisons to derive player ratings, from which the win rate between any two players can be estimated. The core difference between the BT model and the online Elo system is that the BT model assumes a player's performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
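To make the contrast with the online update concrete, here is a small sketch of a centralized BT fit. It uses the classic minorization-maximization iteration for Bradley-Terry strengths, which is a technique we have substituted for illustration and not necessarily how the Arena notebook computes its ratings. The battle format, function name, tie handling (half a win each), and the assumption that every model has at least one win are all ours.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iters=200, scale=400.0, base=10.0, init=1000.0):
    """Fit BT strengths by MM iteration and map them to an Elo-like scale.

    Assumes every model has at least one win (or tie); otherwise its
    strength goes to zero and the logarithm below is undefined.
    """
    wins = defaultdict(float)   # wins[i]: wins of model i (a tie counts 0.5)
    games = defaultdict(float)  # games[(i, j)]: games played between i and j
    models = set()
    for a, b, winner in battles:
        models.update((a, b))
        sa = {"a": 1.0, "b": 0.0}.get(winner, 0.5)
        wins[a] += sa
        wins[b] += 1.0 - sa
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(n / (strength[i] + strength[j])
                        for (x, j), n in games.items() if x == i)
            new[i] = wins[i] / denom
        # Rescale so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(s) for s in new.values()) / len(new)
        strength = {m: s / math.exp(log_mean) for m, s in new.items()}

    # P(i beats j) = s_i / (s_i + s_j) matches the Elo formula on this scale.
    return {m: init + scale * math.log(s, base) for m, s in strength.items()}


toy = [("x", "y", "a")] * 60 + [("x", "y", "b")] * 40
ratings = fit_bradley_terry(toy)
print(round(ratings["x"] - ratings["y"], 1))  # about 70.4, i.e. a 60% win rate
```

Unlike the online update, the fit depends only on the win counts between each pair, so reordering the games leaves the ratings unchanged.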