Chatbot Arena

LMSYS • June 24, 2025

This leaderboard is based on the following benchmarks:

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 3.1M+ user votes to compute Elo ratings.
  • MMLU - a test to measure a model’s multitask accuracy on 57 tasks.
  • Arena-Hard-Auto - an automatic evaluation tool for instruction-tuned LLMs.


Best Open LM

| Model | Arena Elo | MMLU | License |
|---|---|---|---|
| DeepSeek DeepSeek-R1-0528 | 1424 | 90.8 | MIT |
| Qwen Qwen3-235B-A22B-no-thinking | 1388 | 88.5 | Apache 2.0 |
| DeepSeek DeepSeek-V3-0324 | 1385 | 88.5 | MIT |
| Minimax Minimax-M1 | 1373 | | Apache 2.0 |
| Qwen Qwen3-235B-A22B | 1365 | 88.5 | Apache 2.0 |
| Google Gemma-3-27B-it | 1357 | | Gemma |

Full Leaderboard
| Model | Arena Elo | Coding | Vision | Arena-Hard | MMLU | Votes | Organization | License |
|---|---|---|---|---|---|---|---|---|
| 🥇 Gemini-2.5-Pro | 1477 | 1492 | 1342 | 96.4 | | 12327 | Google | Proprietary |
| 🥇 o3-2025-04-16 | 1428 | 1443 | 1296 | | | 18205 | OpenAI | Proprietary |
| 🥇 ChatGPT-4o-latest (2025-03-26) | 1428 | 1440 | 1308 | | | 22488 | OpenAI | Proprietary |
| 🥇 DeepSeek-R1-0528 | 1424 | 1431 | | 93.2 | 90.8 | 11871 | DeepSeek | MIT |
| 🥇 Grok-3-Preview-02-24 | 1422 | 1433 | | 92.7 | | 24316 | xAI | Proprietary |
| 🥇 Gemini-2.5-Flash | 1420 | 1430 | 1299 | | | 17535 | Google | Proprietary |
| 🥇 GPT-4.5-Preview | 1415 | 1418 | 1253 | | | 15271 | OpenAI | Proprietary |
| 🥈 Gemini-2.0-Flash-Thinking-Exp-01-21 | 1398 | 1381 | 1275 | | | 27618 | Google | Proprietary |
| 🥈 Gemini-2.0-Pro-Exp-02-05 | 1397 | 1396 | 1239 | | | 20120 | Google | Proprietary |
| 🥈 Qwen3-235B-A22B-no-thinking | 1388 | 1410 | | 95.6 | 88.5 | 12320 | Alibaba | Apache 2.0 |
| 🥈 GPT-4.1-2025-04-14 | 1386 | 1392 | 1274 | | | 16362 | OpenAI | Proprietary |
| 🥈 DeepSeek-V3-0324 | 1385 | 1400 | | 85.5 | 88.5 | 19091 | DeepSeek | MIT |
| 🥈 Hunyuan-Turbos-20250416 | 1376 | 1380 | | | | 7816 | Tencent | Proprietary |
| 🥈 DeepSeek-R1 | 1375 | 1381 | | 93.2 | 90.8 | 19430 | DeepSeek | MIT |
| 🥈 Claude Opus 4 (20250514) | 1373 | 1415 | 1231 | | | 18287 | Anthropic | Proprietary |
| 🥈 Minimax-M1 | 1373 | 1382 | | | | 3895 | MiniMax | Apache 2.0 |
| 🥈 Mistral Medium 3 | 1369 | 1389 | 1192 | | | 16637 | Mistral | Proprietary |
| 🥈 o1-2024-12-17 | 1367 | 1376 | 1228 | 92.1 | 91.8 | 29038 | OpenAI | Proprietary |
| 🥈 Qwen3-235B-A22B | 1365 | 1390 | | 95.6 | 88.5 | 13002 | Alibaba | Apache 2.0 |
| 🥈 Gemini-2.0-Flash-001 | 1364 | 1363 | 1215 | | | 35894 | Google | Proprietary |
| 🥈 Grok-3-Mini-beta | 1363 | 1386 | | | | 8715 | xAI | Proprietary |
| 🥈 o4-mini-2025-04-16 | 1363 | 1381 | 1251 | | | 16112 | OpenAI | Proprietary |
| 🥈 Qwen2.5-Max | 1363 | 1367 | | | | 31170 | Alibaba | Proprietary |
| 🥈 Gemma-3-27B-it | 1357 | 1338 | 1201 | | | 25323 | Google | Gemma |
| 🥈 Claude Sonnet 4 (20250514) | 1345 | 1384 | 1222 | | | 14984 | Anthropic | Proprietary |
| 🥈 o3-mini-high | 1342 | 1380 | | | | 19404 | OpenAI | Proprietary |
| 🥈 GPT-4.1-mini-2025-04-14 | 1339 | 1375 | 1237 | | | 15337 | OpenAI | Proprietary |
| 🥉 Gemma-3-12B-it | 1338 | 1308 | | | | 3976 | Google | Gemma |
| 🥉 DeepSeek-V3 | 1336 | 1337 | | 85.5 | 88.5 | 22841 | DeepSeek | DeepSeek |
| 🥉 QwQ-32B | 1334 | 1349 | | | | 17462 | Alibaba | Apache 2.0 |
| 🥉 Gemini-2.0-Flash-Lite | 1330 | 1338 | 1156 | | | 26104 | Google | Proprietary |
| 🥉 Amazon-Nova-Experimental-Chat-05-14 | 1328 | 1342 | | | | 5213 | Amazon | Proprietary |
| 🥉 Qwen-Plus-0125 | 1328 | 1337 | | | | 6055 | Alibaba | Proprietary |
| 🥉 GLM-4-Plus-0111 | 1328 | 1307 | | | | 6028 | Zhipu | Proprietary |
| 🥉 Command A (03-2025) | 1327 | 1335 | | | | 22851 | Cohere | CC-BY-NC-4.0 |
| 🥉 o3-mini | 1323 | 1363 | | | | 35063 | OpenAI | Proprietary |
| 🥉 Step-2-16K-Exp | 1322 | 1312 | | | | 5126 | StepFun | Proprietary |
| 🥉 o1-mini | 1321 | 1370 | | 92 | | 54951 | OpenAI | Proprietary |
| 🥉 Gemini-1.5-Pro-002 | 1320 | 1307 | 1222 | | | 58645 | Google | Proprietary |
| 🥉 Claude 3.7 Sonnet (thinking-32k) | 1317 | 1349 | 1220 | | | 24159 | Anthropic | Proprietary |
| 🥉 Hunyuan-Turbo-0110 | 1314 | 1333 | | | | 2510 | Tencent | Proprietary |
| 🥉 Llama-3.3-Nemotron-Super-49B-v1 | 1314 | 1318 | | 88.3 | 86 | 2371 | Nvidia | Nvidia |
| 🥉 Claude 3.7 Sonnet | 1309 | 1346 | 1208 | | | 28664 | Anthropic | Proprietary |
| 🥉 Grok-2-08-13 | 1305 | 1299 | | | 87.5 | 67084 | xAI | Proprietary |
| 🥉 Yi-Lightning | 1304 | 1319 | | 81.5 | | 28968 | 01 AI | Proprietary |
| 🥉 Gemma-3n-e4b-it | 1303 | 1274 | | | | 5282 | Google | Gemma |
| 🥉 GPT-4o-2024-05-13 | 1302 | 1309 | 1206 | 79.21 | 88.7 | 117747 | OpenAI | Proprietary |
| 🥉 Claude 3.5 Sonnet (20241022) | 1301 | 1341 | 1187 | 85.2 | 88.7 | 75986 | Anthropic | Proprietary |
| Deepseek-v2.5-1210 | 1297 | 1314 | | | | 7243 | DeepSeek | DeepSeek |
| Athene-v2-Chat-72B | 1293 | 1317 | | 85 | | 26074 | NexusFlow | NexusFlow |
| Llama-4-Maverick-17B-128E-Instruct | 1292 | 1309 | 1164 | | | 15906 | Meta | Llama 4 |
| Gemma-3-4B-it | 1292 | 1264 | | | | 4321 | Google | Gemma |
| Hunyuan-Large-2025-02-10 | 1289 | 1310 | | | | 3856 | Tencent | Proprietary |
| GPT-4o-mini-2024-07-18 | 1289 | 1299 | 1123 | 74.94 | 82 | 72536 | OpenAI | Proprietary |
| Gemini-1.5-Flash-002 | 1289 | 1271 | 1206 | | | 37021 | Google | Proprietary |
| GPT-4.1-nano-2025-04-14 | 1288 | 1311 | 1116 | | | 6302 | OpenAI | Proprietary |
| Llama-3.1-405B-Instruct-bf16 | 1286 | 1297 | | | 88.6 | 43788 | Meta | Llama 3.1 |
| Llama-3.1-Nemotron-70B-Instruct | 1286 | 1288 | | 84.9 | | 7577 | Nvidia | Llama 3.1 |
| Llama-3.1-405B-Instruct-fp8 | 1285 | 1293 | | 69.3 | 88.6 | 63038 | Meta | Llama 3.1 |
| Grok-2-Mini-08-13 | 1284 | 1279 | | | | 55442 | xAI | Proprietary |
| Yi-Lightning-lite | 1282 | 1283 | | | | 17067 | 01 AI | Proprietary |
| Qwen-Max-0919 | 1281 | 1296 | | | | 17432 | Alibaba | Qwen |
| Hunyuan-Standard-2025-02-10 | 1278 | 1287 | | | | 4014 | Tencent | Proprietary |
| Qwen2.5-72B-Instruct | 1275 | 1299 | | 78 | | 41519 | Alibaba | Qwen |
| Llama-3.3-70B-Instruct | 1275 | 1275 | | | | 46558 | Meta | Llama-3.3 |
| GPT-4-Turbo-2024-04-09 | 1274 | 1279 | 1151 | 82.63 | | 102133 | OpenAI | Proprietary |
| Mistral-Small-3.1-24B-Instruct-2503 | 1271 | 1288 | 1162 | | | 4963 | Mistral | Apache 2.0 |
| Llama-4-Scout-17B-16E-Instruct | 1271 | 1283 | 1157 | | | 4998 | Meta | Llama |
| Athene-70B | 1268 | 1270 | | 77.6 | | 20580 | NexusFlow | CC-BY-NC-4.0 |
| GPT-4-1106-preview | 1267 | 1269 | | | | 103748 | OpenAI | Proprietary |
| Mistral-Large-2411 | 1266 | 1283 | | 70.42 | | 29633 | Mistral | MRL |
| Llama-3.1-70B-Instruct | 1265 | 1268 | | 55.73 | 86 | 58637 | Meta | Llama 3.1 |
| Claude 3 Opus | 1265 | 1267 | 1076 | 60.36 | 86.8 | 202641 | Anthropic | Proprietary |
| magistral-medium-2506 | 1263 | 1323 | | | | 3089 | Mistral | Proprietary |
| Amazon Nova Pro 1.0 | 1262 | 1280 | 1044 | | | 26371 | Amazon | Proprietary |
| GPT-4-0125-preview | 1262 | 1260 | | 77.96 | | 97079 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-70B | 1262 | 1250 | | | | 3010 | Ai2 | Llama 3.1 |
| Claude 3.5 Haiku (20241022) | 1256 | 1285 | 1156 | | | 47551 | Anthropic | Proprietary |
| Reka-Core-20240904 | 1253 | 1238 | | | | 7948 | Reka AI | Proprietary |
| Gemini-1.5-Flash-001 | 1244 | 1248 | 1072 | 49.61 | 78.9 | 65661 | Google | Proprietary |
| Jamba-1.5-Large | 1239 | 1244 | | | 81.2 | 9125 | AI21 Labs | Jamba Open |
| Deepseek-v2-API-0628 | 1237 | 1258 | | | | 19508 | DeepSeek AI | DeepSeek |
| Gemma-2-27B-it | 1237 | 1226 | | 57.51 | | 79538 | Google | Gemma license |
| Qwen2.5-Coder-32B-Instruct | 1235 | 1278 | | | | 5730 | Alibaba | Apache 2.0 |
| Mistral-Small-24B-Instruct-2501 | 1235 | 1249 | | | | 15321 | Mistral | Apache 2.0 |
| Amazon Nova Lite 1.0 | 1234 | 1252 | 1061 | | | 20646 | Amazon | Proprietary |
| Gemma-2-9B-it-SimPO | 1234 | 1213 | | | | 10548 | Princeton | MIT |
| Command R+ (08-2024) | 1233 | 1198 | | | | 10535 | Cohere | CC-BY-NC-4.0 |
| Deepseek-Coder-v2-0724 | 1232 | 1283 | | 62.3 | | 11725 | DeepSeek | Proprietary |
| Gemini-1.5-Flash-8B-001 | 1230 | 1225 | 1107 | | | 37697 | Google | Proprietary |
| Llama-3.1-Nemotron-51B-Instruct | 1229 | 1227 | | | | 3889 | Nvidia | Llama 3.1 |
| Nemotron-4-340B-Instruct | 1227 | 1215 | | | | 20608 | Nvidia | Nvidia |
| Aya-Expanse-32B | 1227 | 1209 | | | | 28768 | Cohere | CC-BY-NC-4.0 |
| GLM-4-0520 | 1224 | 1233 | | 63.84 | | 10221 | Zhipu AI | Proprietary |
| Llama-3-70B-Instruct | 1224 | 1216 | | 46.57 | 82 | 163629 | Meta | Llama 3 |
| Phi-4 | 1223 | 1239 | | | | 25213 | Microsoft | MIT |
| OLMo-2-0325-32B-Instruct | 1223 | 1214 | | | | 3460 | Allen AI | Apache-2.0 |
| Reka-Flash-20240904 | 1223 | 1208 | | | | 8132 | Reka AI | Proprietary |
| Hunyuan-Large-Vision | 1219 | 1233 | 1187 | | | 3478 | Tencent | Proprietary |
| Claude 3 Sonnet | 1218 | 1229 | 1048 | 46.8 | 79 | 113067 | Anthropic | Proprietary |
| Amazon Nova Micro 1.0 | 1215 | 1227 | | | | 20654 | Amazon | Proprietary |
| Gemma-2-9B-it | 1209 | 1190 | | | | 57197 | Google | Gemma license |
| Hunyuan-Standard-256K | 1206 | 1243 | | | | 2901 | Tencent | Proprietary |
| Qwen2-72B-Instruct | 1205 | 1203 | | 46.86 | 84.2 | 38872 | Alibaba | Qianwen LICENSE |
| GPT-4-0314 | 1204 | 1212 | | 50 | 86.4 | 55962 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-8B | 1203 | 1195 | | | | 3074 | Ai2 | Llama 3.1 |
| Ministral-8B-2410 | 1200 | 1218 | | | | 5111 | Mistral | MRL |
| Claude 3 Haiku | 1197 | 1206 | 1000 | 41.47 | 75.2 | 122309 | Anthropic | Proprietary |
| Aya-Expanse-8B | 1197 | 1182 | | | | 10391 | Cohere | CC-BY-NC-4.0 |
| Command R (08-2024) | 1197 | 1178 | | | | 10851 | Cohere | CC-BY-NC-4.0 |
| DeepSeek-Coder-V2-Instruct | 1196 | 1256 | | | | 15753 | DeepSeek AI | DeepSeek License |
| Llama-3.1-8B-Instruct | 1193 | 1203 | | 21.34 | 73 | 52578 | Meta | Llama 3.1 |
| Jamba-1.5-Mini | 1193 | 1197 | | | 69.7 | 9274 | AI21 Labs | Jamba Open |
| GPT-4-0613 | 1181 | 1183 | | 37.9 | | 91614 | OpenAI | Proprietary |
| Qwen1.5-110B-Chat | 1179 | 1191 | | | 80.4 | 27430 | Alibaba | Qianwen LICENSE |
| Yi-1.5-34B-Chat | 1175 | 1179 | | | 76.8 | 25135 | 01 AI | Apache-2.0 |
| Llama-3-8B-Instruct | 1169 | 1162 | | 20.56 | 68.4 | 109056 | Meta | Llama 3 |
| InternLM2.5-20B-chat | 1166 | 1175 | | | | 10599 | InternLM | Other |
| Claude-1 | 1166 | 1152 | | | 77 | 21149 | Anthropic | Proprietary |
| Qwen1.5-72B-Chat | 1165 | 1176 | | 36.12 | 77.5 | 40658 | Alibaba | Qianwen LICENSE |
| Mixtral-8x22b-Instruct-v0.1 | 1165 | 1169 | | 36.36 | 77.8 | 53751 | Mistral | Apache 2.0 |
| Mistral Medium | 1165 | 1169 | | 31.9 | 75.3 | 35556 | Mistral | Proprietary |
| Gemma-2-2b-it | 1161 | 1124 | | | 51.3 | 48892 | Google | Gemma license |
| Granite-3.1-8B-Instruct | 1160 | 1190 | | | | 3289 | IBM | Apache 2.0 |
| Claude-2.0 | 1149 | 1151 | | 23.99 | 78.5 | 12763 | Anthropic | Proprietary |
| Gemini-1.0-Pro-001 | 1149 | 1119 | | | 71.8 | 18800 | Google | Proprietary |
| Zephyr-ORPO-141b-A35b-v0.1 | 1145 | 1141 | | | | 4854 | HuggingFace | Apache 2.0 |
| Qwen1.5-32B-Chat | 1143 | 1166 | | | 73.4 | 22765 | Alibaba | Qianwen LICENSE |
| Phi-3-Medium-4k-Instruct | 1140 | 1142 | | 33.37 | 78 | 26105 | Microsoft | MIT |
| Granite-3.1-2B-Instruct | 1137 | 1164 | | | | 3380 | IBM | Apache 2.0 |
| Claude-2.1 | 1136 | 1148 | | 22.77 | | 37699 | Anthropic | Proprietary |
| Starling-LM-7B-beta | 1136 | 1146 | | 23.01 | | 16676 | Nexusflow | Apache-2.0 |
| GPT-3.5-Turbo-0613 | 1134 | 1151 | | 24.82 | | 38955 | OpenAI | Proprietary |
| Mixtral-8x7B-Instruct-v0.1 | 1131 | 1131 | | 23.4 | 70.6 | 76126 | Mistral | Apache 2.0 |
| Claude-Instant-1 | 1129 | 1125 | | | 73.4 | 20631 | Anthropic | Proprietary |
| Yi-34B-Chat | 1129 | 1123 | | 23.15 | 73.5 | 15917 | 01 AI | Yi License |
| Qwen1.5-14B-Chat | 1126 | 1142 | | | 67.6 | 18687 | Alibaba | Qianwen LICENSE |
| WizardLM-70B-v1.0 | 1124 | 1087 | | | 63.7 | 8383 | Microsoft | Llama 2 |
| DBRX-Instruct-Preview | 1121 | 1134 | | 24.63 | 73.7 | 33743 | Databricks | DBRX LICENSE |
| Llama-3.2-3B-Instruct | 1120 | 1097 | | | | 8390 | Meta | Llama 3.2 |
| Phi-3-Small-8k-Instruct | 1119 | 1124 | | 29.77 | 75.7 | 18476 | Microsoft | MIT |
| Tulu-2-DPO-70B | 1116 | 1110 | | 14.99 | | 6658 | AllenAI/UW | AI2 ImpACT Low-risk |
| Granite-3.0-8B-Instruct | 1111 | 1114 | | | | 7002 | IBM | Apache 2.0 |
| Llama-2-70B-chat | 1110 | 1089 | | 11.55 | 63 | 39595 | Meta | Llama 2 |
| OpenChat-3.5-0106 | 1109 | 1119 | | | 65.8 | 12990 | OpenChat | Apache-2.0 |
| Vicuna-33B | 1108 | 1084 | | 8.63 | 59.2 | 22936 | LMSYS | Non-commercial |
| Snowflake Arctic Instruct | 1107 | 1093 | | 17.61 | 67.3 | 34173 | Snowflake | Apache 2.0 |
| Starling-LM-7B-alpha | 1106 | 1096 | | 12.8 | 63.9 | 10415 | UC Berkeley | CC-BY-NC-4.0 |
| Nous-Hermes-2-Mixtral-8x7B-DPO | 1102 | 1096 | | | | 3836 | NousResearch | Apache-2.0 |
| Gemma-1.1-7B-it | 1101 | 1101 | | 12.09 | 64.3 | 25070 | Google | Gemma license |
| NV-Llama2-70B-SteerLM-Chat | 1098 | 1039 | | | 68.5 | 3636 | Nvidia | Llama 2 |
| pplx-70B-online | 1095 | 1044 | | | | 6898 | Perplexity AI | Proprietary |
| DeepSeek-LLM-67B-Chat | 1094 | 1096 | | | 71.3 | 4988 | DeepSeek AI | DeepSeek License |
| OpenChat-3.5 | 1094 | 1070 | | | 64.3 | 8106 | OpenChat | Apache-2.0 |
| OpenHermes-2.5-Mistral-7B | 1092 | 1074 | | | | 5088 | NousResearch | Apache-2.0 |
| Granite-3.0-2B-Instruct | 1091 | 1104 | | | | 7191 | IBM | Apache 2.0 |
| Mistral-7B-Instruct-v0.2 | 1090 | 1090 | | 12.57 | | 20067 | Mistral | Apache-2.0 |
| Phi-3-Mini-4K-Instruct-June-24 | 1088 | 1098 | | | 70.9 | 12808 | Microsoft | MIT |
| Qwen1.5-7B-Chat | 1087 | 1106 | | | 61 | 4872 | Alibaba | Qianwen LICENSE |
| Phi-3-Mini-4k-Instruct | 1084 | 1102 | | | 68.8 | 21097 | Microsoft | MIT |
| Llama-2-13b-chat | 1081 | 1068 | | | 53.6 | 19722 | Meta | Llama 2 |
| SOLAR-10.7B-Instruct-v1.0 | 1080 | 1064 | | | 66.2 | 4286 | Upstage AI | CC-BY-NC-4.0 |
| Dolphin-2.2.1-Mistral-7B | 1080 | 1042 | | | | 1714 | Cognitive Computations | Apache-2.0 |
| WizardLM-13b-v1.2 | 1076 | 1042 | | | 52.7 | 7176 | Microsoft | Llama 2 |
| Llama-3.2-1B-Instruct | 1071 | 1063 | | | | 8523 | Meta | Llama 3.2 |
| Gemini-2.5-Flash-Lite-Preview-06-17-Thinking | | | 1238 | | | 1442 | Google | Proprietary |
| Qwen2.5-VL-32B-Instruct | | | 1212 | | | 1505 | Alibaba | Apache 2.0 |
| Step-1o-Vision-32k (highres) | | | 1185 | | | 2891 | StepFun | Proprietary |
| Qwen2.5-VL-72B-Instruct | | | 1168 | | | 3884 | Alibaba | Qwen |
| Pixtral-Large-2411 | | | 1153 | | | 5546 | Mistral | MRL |
| Qwen-VL-Max-1119 | | | 1128 | | | 1449 | Alibaba | Proprietary |
| Step-1V-32K | | | 1112 | | | 1553 | StepFun | Proprietary |
| Qwen2-VL-72b-Instruct | | | 1111 | | | 6028 | Alibaba | Qwen |
| Molmo-72B-0924 | | | 1076 | | | 3092 | AI2 | Apache 2.0 |
| Pixtral-12B-2409 | | | 1072 | | | 7623 | Mistral | Apache 2.0 |
| Llama-3.2-90B-Vision-Instruct | | | 1070 | | | 8829 | Meta | Llama 3.2 |
| InternVL2-26B | | | 1067 | | | 5265 | OpenGVLab | MIT |
| Hunyuan-Standard-Vision-2024-12-31 | | | 1063 | | | 811 | Tencent | Proprietary |
| Aya-Vision-32B | | | 1058 | | | 849 | Cohere | CC-BY-NC-4.0 |
| Qwen2-VL-7B-Instruct | | | 1054 | | | 5854 | Alibaba | Apache 2.0 |
| Yi-Vision | | | 1046 | | | 12370 | 01 AI | Proprietary |
| Llama-3.2-11B-Vision-Instruct | | | 1032 | | | 4893 | Meta | Llama 3.2 |

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. The latest and detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It has been useful for transforming pairwise human preferences into Elo ratings that serve as a predictor of win rates between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}~.$$
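In code, the expected score is simply a logistic curve in the rating difference. A minimal sketch (`expected_score` is an illustrative name, not code from the Arena notebooks):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Equal ratings give a 50% win probability; a 400-point edge gives ~91%.
print(expected_score(1200, 1200))  # 0.5
print(expected_score(1600, 1200))  # ~0.909
```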

The Elo rating system has been used by the international chess community to rank players for over 60 years. Standard Elo rating systems assume a player's performance changes over time, so an online algorithm is needed to capture such dynamics, with recent games weighing more than older games. Specifically, after each game, a player's rating is updated according to the difference between the predicted outcome and the actual outcome.

$$R'_A = R_A + K \cdot (S_A - E_A)~.$$
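A minimal sketch of one online Elo step (`elo_update` is an illustrative name; the actual Arena computation lives in the linked notebook):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One online Elo step: returns the updated ratings after a single game.
    score_a is 1.0 for an A win, 0.0 for a loss, 0.5 for a tie."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # expected score for A
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains

# An upset (lower-rated A beats B) moves ratings more than an expected win would.
print(elo_update(1200, 1400, 1.0))  # A gains ~24 points, B loses the same
```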

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows for players' performance to change dynamically – it does not assume a fixed unknown value for a player's rating.

This ability to adapt is controlled by the parameter K, which determines the magnitude of the rating change after each game. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance "converges", a smaller value of K is more appropriate. As a result, the USCF adopts a K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K to be small enough for stability while still allowing new models to move up the leaderboard quickly. We ultimately adopted a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly. The same effect was also observed in recent work by Cohere. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
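The shuffle-and-resample idea can be sketched as follows. This is a simplified illustration under assumed conventions (ties omitted, median taken per model; function names are ours), not the Arena's actual notebook:

```python
import random

def online_elo(battles, k=4.0, init=1000.0):
    """Run the online Elo update over (model_a, model_b, winner) battles in order."""
    ratings = {}
    for a, b, winner in battles:
        ra, rb = ratings.get(a, init), ratings.get(b, init)
        ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        sa = 1.0 if winner == a else 0.0
        ratings[a] = ra + k * (sa - ea)
        ratings[b] = rb - k * (sa - ea)
    return ratings

def bootstrap_elo(battles, rounds=1000, seed=0):
    """Median Elo per model over many shuffled orderings of the battle log."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        shuffled = battles[:]
        rng.shuffle(shuffled)  # order no longer matters after resampling
        samples.append(online_elo(shuffled))
    return {m: sorted(s[m] for s in samples)[rounds // 2] for m in samples[0]}
```

Because each permutation reorders the same battles, the median over permutations washes out the order-dependence that makes a single online pass noisy.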

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models, so we do not need a decentralized algorithm. Second, most models are static (we have access to the weights), so we do not expect their performance to change. However, it is worth noting that hosted proprietary models may not be static, and their behavior can change without notice. We try our best to pin specific model API versions where possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley-Terry (BT) model. This model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model under the assumption of a fixed but unknown pairwise win rate. Like Elo, the BT model derives player ratings from pairwise comparisons in order to estimate the win rate between players. The core differences between the BT model and the online Elo system are that the BT model assumes a player's performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
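To make the MLE connection concrete, here is a toy BT fit by gradient ascent on the log-likelihood, with scores placed on the familiar Elo 400/log10 scale. This is our own illustrative sketch (the Arena fit uses the linked notebook, typically via logistic regression), and `bradley_terry` is a hypothetical name:

```python
import math

def bradley_terry(battles, steps=2000, lr=20.0):
    """Fit BT strength scores from (model_a, model_b, winner) battles.
    Battle order does not matter: each step sums over the whole log."""
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    s = {m: 0.0 for m in models}
    scale = math.log(10) / 400  # put scores on the Elo 400/log10 scale
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for a, b, winner in battles:
            p_a = 1.0 / (1.0 + math.exp(-scale * (s[a] - s[b])))
            err = (1.0 if winner == a else 0.0) - p_a
            grad[a] += err
            grad[b] -= err
        for m in models:
            s[m] += lr * grad[m]
        mean = sum(s.values()) / len(s)
        for m in models:
            s[m] -= mean  # identifiability: only score differences matter
    return s
```

At the optimum, the implied win probability between two models matches their empirical win fraction, which is exactly the MLE property the text describes.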