Chatbot Arena

Attribution: LMSYS • May 22, 2025

This leaderboard is based on the following benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 3M+ user votes to compute Elo ratings.
  • MMLU - a test to measure a model's multitask accuracy on 57 tasks.
  • Arena-Hard-Auto - an automatic evaluation tool for instruction-tuned LLMs.


Best Open LM

| Model | Arena Elo | MMLU | License |
|---|---|---|---|
| DeepSeek-V3-0324 | 1368 | 88.5 | MIT |
| DeepSeek-R1 | 1354 | 90.8 | MIT |
| Gemma-3-27B-it | 1339 | | Gemma |
| Qwen3-235B-A22B | 1337 | 88.5 | Apache 2.0 |
| Qwen3-32B | 1324 | | Apache 2.0 |

Full Leaderboard
| Model | Arena Elo | Coding | Vision | Arena Hard | MMLU | Votes | Organization | License |
|---|---|---|---|---|---|---|---|---|
| 🥇 Gemini-2.5-Pro-Preview-05-06 | 1446 | 1446 | 1375 | 96.4 | | 6115 | Google | Proprietary |
| 🥇 Gemini-2.5-Flash-Preview-05-20 | 1418 | 1431 | 1318 | | | 3892 | Google | Proprietary |
| 🥇 o3-2025-04-16 | 1409 | 1419 | 1303 | | | 7921 | OpenAI | Proprietary |
| 🥇 ChatGPT-4o-latest (2025-03-26) | 1405 | 1395 | 1310 | | | 10280 | OpenAI | Proprietary |
| 🥇 Grok-3-Preview-02-24 | 1399 | 1396 | | 92.7 | | 14840 | xAI | Proprietary |
| 🥇 GPT-4.5-Preview | 1394 | 1389 | 1256 | | | 15276 | OpenAI | Proprietary |
| 🥈 Gemini-2.0-Flash-Thinking-Exp-01-21 | 1377 | 1353 | 1278 | | | 26885 | Google | Proprietary |
| 🥈 Gemini-2.0-Pro-Exp-02-05 | 1376 | 1368 | 1240 | | | 20124 | Google | Proprietary |
| 🥈 DeepSeek-V3-0324 | 1368 | 1382 | | 85.5 | 88.5 | 9741 | DeepSeek | MIT |
| 🥈 GPT-4.1-2025-04-14 | 1365 | 1358 | 1283 | | | 6094 | OpenAI | Proprietary |
| 🥈 Hunyuan-Turbos-20250416 | 1356 | 1351 | | | | 5111 | Tencent | Proprietary |
| 🥈 DeepSeek-R1 | 1354 | 1352 | | 93.2 | 90.8 | 19339 | DeepSeek | MIT |
| 🥈 Gemini-2.0-Flash-001 | 1351 | 1341 | 1224 | | | 24928 | Google | Proprietary |
| 🥈 o1-2024-12-17 | 1346 | 1347 | 1231 | 92.1 | 91.8 | 29041 | OpenAI | Proprietary |
| 🥈 o4-mini-2025-04-16 | 1343 | 1357 | 1270 | | | 6102 | OpenAI | Proprietary |
| 🥈 Mistral Medium 3 | 1343 | 1346 | | | | 3327 | Mistral | Proprietary |
| 🥈 Gemma-3-27B-it | 1339 | 1304 | | | | 12989 | Google | Gemma |
| 🥈 Qwen3-235B-A22B | 1337 | 1351 | | 95.6 | 88.5 | 4942 | Alibaba | Apache 2.0 |
| 🥈 Qwen2.5-Max | 1337 | 1331 | | | | 23170 | Alibaba | Proprietary |
| 🥈 Qwen3-32B | 1324 | 1345 | | | | 3960 | Alibaba | Apache 2.0 |
| 🥈 o3-mini-high | 1321 | 1351 | | | | 19403 | OpenAI | Proprietary |
| 🥈 GPT-4.1-mini-2025-04-14 | 1319 | 1349 | 1236 | | | 5929 | OpenAI | Proprietary |
| 🥉 Gemma-3-12B-it | 1317 | 1278 | | | | 3882 | Google | Gemma |
| 🥉 QwQ-32B | 1310 | 1307 | | | | 9936 | Alibaba | Apache 2.0 |
| 🥉 Gemini-2.0-Flash-Lite | 1309 | 1309 | 1159 | | | 25984 | Google | Proprietary |
| 🥉 Qwen-Plus-0125 | 1307 | 1309 | | | | 6056 | Alibaba | Proprietary |
| 🥉 GLM-4-Plus-0111 | 1307 | 1279 | | | | 6028 | Zhipu | Proprietary |
| 🥉 Command A (03-2025) | 1303 | 1301 | | | | 11530 | Cohere | CC-BY-NC-4.0 |
| 🥉 o3-mini | 1302 | 1338 | | | | 24916 | OpenAI | Proprietary |
| 🥉 Step-2-16K-Exp | 1301 | 1284 | | | | 5128 | StepFun | Proprietary |
| 🥉 o1-mini | 1300 | 1342 | | 92 | | 54953 | OpenAI | Proprietary |
| 🥉 Llama-3.1-Nemotron-Ultra-253B-v1 | 1300 | 1314 | | | | 2658 | Nvidia | Nvidia |
| 🥉 Gemini-1.5-Pro-002 | 1299 | 1280 | 1222 | | | 58639 | Google | Proprietary |
| 🥉 Claude 3.7 Sonnet (thinking-32k) | 1296 | 1324 | | | | 13027 | Anthropic | Proprietary |
| 🥉 Hunyuan-Turbo-0110 | 1293 | 1304 | | | | 2513 | Tencent | Proprietary |
| 🥉 Llama-3.3-Nemotron-Super-49B-v1 | 1293 | 1288 | | 88.3 | 86 | 2368 | Nvidia | Nvidia |
| 🥉 Claude 3.7 Sonnet | 1287 | 1315 | 1208 | | | 18395 | Anthropic | Proprietary |
| 🥉 Qwen3-30B-A3B | 1286 | 1290 | | | | 4159 | Alibaba | Apache 2.0 |
| 🥉 Gemma-3n-e4b-it | 1286 | 1255 | | | | 3583 | Google | Gemma |
| 🥉 Yi-Lightning | 1284 | 1292 | | 81.5 | | 28971 | 01 AI | Proprietary |
| 🥉 Grok-2-08-13 | 1284 | 1271 | | | 87.5 | 67081 | xAI | Proprietary |
| 🥉 GPT-4o-2024-05-13 | 1281 | 1282 | 1206 | 79.21 | 88.7 | 117749 | OpenAI | Proprietary |
| 🥉 Claude 3.5 Sonnet (20241022) | 1280 | 1315 | 1183 | 85.2 | 88.7 | 65436 | Anthropic | Proprietary |
| Deepseek-v2.5-1210 | 1276 | 1286 | | | | 7245 | DeepSeek | DeepSeek |
| Athene-v2-Chat-72B | 1272 | 1289 | | 85 | | 26074 | NexusFlow | NexusFlow |
| Gemma-3-4B-it | 1272 | 1233 | | | | 4206 | Google | Gemma |
| GPT-4.1-nano-2025-04-14 | 1269 | 1282 | 1123 | | | 6181 | OpenAI | Proprietary |
| GPT-4o-mini-2024-07-18 | 1269 | 1272 | 1124 | 74.94 | 82 | 71356 | OpenAI | Proprietary |
| Hunyuan-Large-2025-02-10 | 1268 | 1282 | | | | 3858 | Tencent | Proprietary |
| Gemini-1.5-Flash-002 | 1268 | 1243 | 1206 | | | 37020 | Google | Proprietary |
| Llama-4-Maverick-17B-128E-Instruct | 1266 | 1278 | 1181 | | | 8797 | Meta | Llama 4 |
| Llama-3.1-405B-Instruct-bf16 | 1265 | 1269 | | | 88.6 | 43790 | Meta | Llama 3.1 |
| Llama-3.1-Nemotron-70B-Instruct | 1265 | 1260 | | 84.9 | | 7579 | Nvidia | Llama 3.1 |
| Llama-3.1-405B-Instruct-fp8 | 1264 | 1265 | | 69.3 | 88.6 | 63035 | Meta | Llama 3.1 |
| Grok-2-Mini-08-13 | 1263 | 1251 | | | | 55442 | xAI | Proprietary |
| Yi-Lightning-lite | 1261 | 1256 | | | | 17066 | 01 AI | Proprietary |
| Hunyuan-Standard-2025-02-10 | 1257 | 1259 | | | | 4013 | Tencent | Proprietary |
| Qwen2.5-72B-Instruct | 1254 | 1272 | | 78 | | 41520 | Alibaba | Qwen |
| GPT-4-Turbo-2024-04-09 | 1253 | 1252 | 1151 | 82.63 | | 102136 | OpenAI | Proprietary |
| Llama-3.3-70B-Instruct | 1253 | 1247 | | | | 38078 | Meta | Llama-3.3 |
| Athene-70B | 1247 | 1242 | | 77.6 | | 20581 | NexusFlow | CC-BY-NC-4.0 |
| Mistral-Small-3.1-24B-Instruct-2503 | 1246 | 1251 | | | | 2352 | Mistral | Apache 2.0 |
| GPT-4-1106-preview | 1246 | 1242 | | | | 103743 | OpenAI | Proprietary |
| Mistral-Large-2411 | 1245 | 1254 | | 70.42 | | 29641 | Mistral | MRL |
| Llama-3.1-70B-Instruct | 1244 | 1240 | | 55.73 | 86 | 58638 | Meta | Llama 3.1 |
| Claude 3 Opus | 1244 | 1239 | 1076 | 60.36 | 86.8 | 202635 | Anthropic | Proprietary |
| Amazon Nova Pro 1.0 | 1241 | 1252 | 1044 | | | 26298 | Amazon | Proprietary |
| GPT-4-0125-preview | 1241 | 1233 | | 77.96 | | 97077 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-70B | 1240 | 1222 | | | | 3012 | Ai2 | Llama 3.1 |
| Llama-4-Scout-17B-16E-Instruct | 1239 | 1240 | 1161 | | | 3397 | Meta | Llama |
| Claude 3.5 Haiku (20241022) | 1234 | 1255 | | | | 37402 | Anthropic | Proprietary |
| Reka-Core-20240904 | 1232 | 1210 | | | | 7950 | Reka AI | Proprietary |
| Gemini-1.5-Flash-001 | 1223 | 1221 | 1072 | 49.61 | 78.9 | 65662 | Google | Proprietary |
| Jamba-1.5-Large | 1218 | 1216 | | | 81.2 | 9125 | AI21 Labs | Jamba Open |
| Deepseek-v2-API-0628 | 1216 | 1231 | | | | 19511 | DeepSeek AI | DeepSeek |
| Gemma-2-27B-it | 1216 | 1198 | | 57.51 | | 79529 | Google | Gemma license |
| Qwen2.5-Coder-32B-Instruct | 1214 | 1250 | | | | 5731 | Alibaba | Apache 2.0 |
| Mistral-Small-24B-Instruct-2501 | 1214 | 1220 | | | | 15322 | Mistral | Apache 2.0 |
| Amazon Nova Lite 1.0 | 1213 | 1224 | 1061 | | | 20643 | Amazon | Proprietary |
| Gemma-2-9B-it-SimPO | 1213 | 1185 | | | | 10547 | Princeton | MIT |
| Command R+ (08-2024) | 1212 | 1170 | | | | 10536 | Cohere | CC-BY-NC-4.0 |
| Deepseek-Coder-v2-0724 | 1211 | 1256 | | 62.3 | | 11725 | DeepSeek | Proprietary |
| Gemini-1.5-Flash-8B-001 | 1209 | 1197 | 1106 | | | 37699 | Google | Proprietary |
| Llama-3.1-Nemotron-51B-Instruct | 1208 | 1200 | | | | 3888 | Nvidia | Llama 3.1 |
| Nemotron-4-340B-Instruct | 1206 | 1187 | | | | 20610 | Nvidia | Nvidia |
| Aya-Expanse-32B | 1206 | 1182 | | | | 28760 | Cohere | CC-BY-NC-4.0 |
| GLM-4-0520 | 1203 | 1205 | | 63.84 | | 10222 | Zhipu AI | Proprietary |
| Llama-3-70B-Instruct | 1203 | 1189 | | 46.57 | 82 | 163623 | Meta | Llama 3 |
| Phi-4 | 1202 | 1210 | | | | 25213 | Microsoft | MIT |
| OLMo-2-0325-32B-Instruct | 1202 | 1184 | | | | 3456 | Allen AI | Apache-2.0 |
| Reka-Flash-20240904 | 1202 | 1180 | | | | 8135 | Reka AI | Proprietary |
| Claude 3 Sonnet | 1198 | 1202 | 1048 | 46.8 | 79 | 113058 | Anthropic | Proprietary |
| Amazon Nova Micro 1.0 | 1194 | 1199 | | | | 20656 | Amazon | Proprietary |
| Gemma-2-9B-it | 1189 | 1162 | | | | 57200 | Google | Gemma license |
| Hunyuan-Standard-256K | 1185 | 1216 | | | | 2900 | Tencent | Proprietary |
| Qwen2-72B-Instruct | 1184 | 1176 | | 46.86 | 84.2 | 38873 | Alibaba | Qianwen LICENSE |
| GPT-4-0314 | 1183 | 1184 | | 50 | 86.4 | 55961 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-8B | 1182 | 1168 | | | | 3075 | Ai2 | Llama 3.1 |
| Ministral-8B-2410 | 1179 | 1190 | | | | 5109 | Mistral | MRL |
| Qwen-Max-0428 | 1179 | 1178 | | | | 25693 | Alibaba | Proprietary |
| Claude 3 Haiku | 1176 | 1178 | 1000 | 41.47 | 75.2 | 122311 | Anthropic | Proprietary |
| Aya-Expanse-8B | 1176 | 1154 | | | | 10391 | Cohere | CC-BY-NC-4.0 |
| Command R (08-2024) | 1176 | 1150 | | | | 10852 | Cohere | CC-BY-NC-4.0 |
| DeepSeek-Coder-V2-Instruct | 1175 | 1228 | | | | 15753 | DeepSeek AI | DeepSeek License |
| Jamba-1.5-Mini | 1173 | 1170 | | | 69.7 | 9273 | AI21 Labs | Jamba Open |
| Llama-3.1-8B-Instruct | 1172 | 1175 | | 21.34 | 73 | 52585 | Meta | Llama 3.1 |
| GPT-4-0613 | 1160 | 1156 | | 37.9 | | 91619 | OpenAI | Proprietary |
| Qwen1.5-110B-Chat | 1158 | 1164 | | | 80.4 | 27431 | Alibaba | Qianwen LICENSE |
| Yi-1.5-34B-Chat | 1154 | 1151 | | | 76.8 | 25136 | 01 AI | Apache-2.0 |
| Llama-3-8B-Instruct | 1148 | 1135 | | 20.56 | 68.4 | 109061 | Meta | Llama 3 |
| InternLM2.5-20B-chat | 1146 | 1147 | | | | 10599 | InternLM | Other |
| Claude-1 | 1145 | 1125 | | | 77 | 21149 | Anthropic | Proprietary |
| Qwen1.5-72B-Chat | 1144 | 1149 | | 36.12 | 77.5 | 40655 | Alibaba | Qianwen LICENSE |
| Mixtral-8x22b-Instruct-v0.1 | 1144 | 1142 | | 36.36 | 77.8 | 53756 | Mistral | Apache 2.0 |
| Mistral Medium | 1144 | 1142 | | 31.9 | 75.3 | 35562 | Mistral | Proprietary |
| Gemma-2-2b-it | 1140 | 1096 | | | 51.3 | 48894 | Google | Gemma license |
| Granite-3.1-8B-Instruct | 1139 | 1162 | | | | 3289 | IBM | Apache 2.0 |
| Claude-2.0 | 1128 | 1124 | | 23.99 | 78.5 | 12764 | Anthropic | Proprietary |
| Gemini-1.0-Pro-001 | 1128 | 1092 | | | 71.8 | 18801 | Google | Proprietary |
| Zephyr-ORPO-141b-A35b-v0.1 | 1124 | 1113 | | | | 4854 | HuggingFace | Apache 2.0 |
| Qwen1.5-32B-Chat | 1122 | 1138 | | | 73.4 | 22765 | Alibaba | Qianwen LICENSE |
| Phi-3-Medium-4k-Instruct | 1119 | 1114 | | 33.37 | 78 | 26106 | Microsoft | MIT |
| Granite-3.1-2B-Instruct | 1116 | 1136 | | | | 3380 | IBM | Apache 2.0 |
| Claude-2.1 | 1115 | 1121 | | 22.77 | | 37697 | Anthropic | Proprietary |
| Starling-LM-7B-beta | 1115 | 1118 | | 23.01 | | 16675 | Nexusflow | Apache-2.0 |
| GPT-3.5-Turbo-0613 | 1113 | 1124 | | 24.82 | | 38955 | OpenAI | Proprietary |
| Mixtral-8x7B-Instruct-v0.1 | 1111 | 1103 | | 23.4 | 70.6 | 76134 | Mistral | Apache 2.0 |
| Claude-Instant-1 | 1108 | 1098 | | | 73.4 | 20630 | Anthropic | Proprietary |
| Yi-34B-Chat | 1108 | 1095 | | 23.15 | 73.5 | 15916 | 01 AI | Yi License |
| Qwen1.5-14B-Chat | 1105 | 1115 | | | 67.6 | 18682 | Alibaba | Qianwen LICENSE |
| GPT-3.5-Turbo-0314 | 1104 | 1104 | | 18.05 | 70 | 5640 | OpenAI | Proprietary |
| WizardLM-70B-v1.0 | 1103 | 1060 | | | 63.7 | 8383 | Microsoft | Llama 2 |
| GPT-3.5-Turbo-0125 | 1102 | 1113 | | 23.34 | | 68874 | OpenAI | Proprietary |
| DBRX-Instruct-Preview | 1100 | 1107 | | 24.63 | 73.7 | 33743 | Databricks | DBRX LICENSE |
| Llama-3.2-3B-Instruct | 1099 | 1069 | | | | 8390 | Meta | Llama 3.2 |
| Phi-3-Small-8k-Instruct | 1098 | 1096 | | 29.77 | 75.7 | 18473 | Microsoft | MIT |
| Tulu-2-DPO-70B | 1096 | 1082 | | 14.99 | | 6659 | AllenAI/UW | AI2 ImpACT Low-risk |
| Granite-3.0-8B-Instruct | 1090 | 1086 | | | | 7002 | IBM | Apache 2.0 |
| Llama-2-70B-chat | 1089 | 1061 | | 11.55 | 63 | 39594 | Meta | Llama 2 |
| OpenChat-3.5-0106 | 1088 | 1091 | | | 65.8 | 12990 | OpenChat | Apache-2.0 |
| Vicuna-33B | 1087 | 1056 | | 8.63 | 59.2 | 22936 | LMSYS | Non-commercial |
| Snowflake Arctic Instruct | 1086 | 1066 | | 17.61 | 67.3 | 34175 | Snowflake | Apache 2.0 |
| Starling-LM-7B-alpha | 1085 | 1069 | | 12.8 | 63.9 | 10415 | UC Berkeley | CC-BY-NC-4.0 |
| Nous-Hermes-2-Mixtral-8x7B-DPO | 1081 | 1068 | | | | 3838 | NousResearch | Apache-2.0 |
| Gemma-1.1-7B-it | 1080 | 1073 | | 12.09 | 64.3 | 25072 | Google | Gemma license |
| NV-Llama2-70B-SteerLM-Chat | 1077 | 1012 | | | 68.5 | 3636 | Nvidia | Llama 2 |
| pplx-70B-online | 1074 | 1017 | | | | 6896 | Perplexity AI | Proprietary |
| DeepSeek-LLM-67B-Chat | 1073 | 1068 | | | 71.3 | 4988 | DeepSeek AI | DeepSeek License |
| OpenChat-3.5 | 1073 | 1043 | | | 64.3 | 8107 | OpenChat | Apache-2.0 |
| OpenHermes-2.5-Mistral-7B | 1071 | 1047 | | | | 5088 | NousResearch | Apache-2.0 |
| Granite-3.0-2B-Instruct | 1070 | 1077 | | | | 7189 | IBM | Apache 2.0 |
| Mistral-7B-Instruct-v0.2 | 1069 | 1063 | | 12.57 | | 20071 | Mistral | Apache-2.0 |
| Phi-3-Mini-4K-Instruct-June-24 | 1067 | 1071 | | | 70.9 | 12808 | Microsoft | MIT |
| Qwen1.5-7B-Chat | 1066 | 1078 | | | 61 | 4872 | Alibaba | Qianwen LICENSE |
| GPT-3.5-Turbo-1106 | 1064 | 1084 | | 18.87 | | 17035 | OpenAI | Proprietary |
| Phi-3-Mini-4k-Instruct | 1063 | 1075 | | | 68.8 | 21097 | Microsoft | MIT |
| Llama-2-13b-chat | 1060 | 1040 | | | 53.6 | 19725 | Meta | Llama 2 |
| SOLAR-10.7B-Instruct-v1.0 | 1059 | 1036 | | | 66.2 | 4286 | Upstage AI | CC-BY-NC-4.0 |
| Dolphin-2.2.1-Mistral-7B | 1059 | 1014 | | | | 1714 | Cognitive Computations | Apache-2.0 |
| WizardLM-13b-v1.2 | 1055 | 1015 | | | 52.7 | 7176 | Microsoft | Llama 2 |
| Llama-3.2-1B-Instruct | 1050 | 1035 | | | | 8523 | Meta | Llama 3.2 |
| Qwen2.5-VL-32B-Instruct | | | 1214 | | | | Alibaba | Apache 2.0 |
| Step-1o-Vision-32k (highres) | | | 1187 | | | | StepFun | Proprietary |
| Qwen2.5-VL-72B-Instruct | | | 1171 | | | | Alibaba | Qwen |
| Pixtral-Large-2411 | | | 1154 | | | | Mistral | MRL |
| Qwen-VL-Max-1119 | | | 1128 | | | | Alibaba | Proprietary |
| Qwen2-VL-72b-Instruct | | | 1111 | | | | Alibaba | Qwen |
| Step-1V-32K | | | 1111 | | | | StepFun | Proprietary |
| Molmo-72B-0924 | | | 1076 | | | | AI2 | Apache 2.0 |
| Pixtral-12B-2409 | | | 1072 | | | | Mistral | Apache 2.0 |
| Llama-3.2-90B-Vision-Instruct | | | 1070 | | | | Meta | Llama 3.2 |
| Aya-Vision-8B | | | 1069 | | | | Cohere | CC-BY-NC-4.0 |
| InternVL2-26B | | | 1067 | | | | OpenGVLab | MIT |
| Hunyuan-Standard-Vision-2024-12-31 | | | 1066 | | | | Tencent | Proprietary |
| Aya-Vision-32B | | | 1060 | | | | Cohere | CC-BY-NC-4.0 |
| Qwen2-VL-7B-Instruct | | | 1054 | | | | Alibaba | Apache 2.0 |
| Yi-Vision | | | 1045 | | | | 01 AI | Proprietary |
| Llama-3.2-11B-Vision-Instruct | | | 1032 | | | | Meta | Llama 3.2 |

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. The latest and detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It transforms pairwise human preferences into Elo ratings that serve as a predictor of the win rate between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}.$$
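
For intuition, a 100-point rating gap corresponds to an expected win rate of roughly 64%. Below is a minimal sketch of the formula in Python; it is an illustration, not code from the Arena notebook:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score (win probability) of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

print(elo_expected_score(1100, 1000))  # ~0.64: a 100-point gap predicts about a 64% win rate
```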

The Elo rating system has been used by the international chess community to rank players for over 60 years. Standard Elo rating systems assume a player's performance changes over time, so an online algorithm is needed to capture such dynamics, meaning recent games should weigh more than older games. Specifically, after each game, a player's rating is updated according to the difference between the predicted outcome and the actual outcome:

$$R'_A = R_A + K \cdot (S_A - E_A).$$
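
As a sketch, one online update step looks like the following; the symmetric update for player B and the default K are conventions assumed for this example, not the Arena notebook's exact code:

```python
def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32.0) -> tuple[float, float]:
    """One online Elo update. s_a is the actual outcome for A:
    1.0 for a win, 0.0 for a loss, 0.5 for a tie."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # expected score of A
    r_a_new = r_a + k * (s_a - e_a)
    # B's expected score is 1 - e_a, so B receives the opposite adjustment.
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new
```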

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows for a player's performance to change dynamically; it does not assume a fixed, unknown value for the player's rating.

This ability to adapt is determined by the parameter K, which controls the magnitude of each rating update. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance "converges", a smaller value of K is more appropriate. As a result, the USCF adopted a K that depends on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.

When we launched the Arena, we noticed considerable variability in the ratings computed by the classic online algorithm. We tried to tune K to be sufficiently stable while still allowing new models to move up the leaderboard quickly. We ultimately decided to adopt a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly. A recent work by Cohere made a similar observation. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
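
To illustrate the idea, the sketch below reruns online Elo on many shuffled copies of a battle log and aggregates per-model medians; the battle-log format, the K value, and the median aggregation are assumptions for the example rather than the notebook's exact choices:

```python
import random

def bootstrap_online_elo(battles, rounds=1000, k=4.0, base=1000.0):
    """battles: list of (model_a, model_b, s_a) tuples with s_a in {0, 0.5, 1}.
    Runs the online Elo algorithm on `rounds` random permutations of the log
    and returns the median rating per model, smoothing out order effects."""
    samples = {}
    for _ in range(rounds):
        ratings = {}
        order = battles[:]
        random.shuffle(order)  # each round sees the battles in a new order
        for a, b, s_a in order:
            r_a, r_b = ratings.get(a, base), ratings.get(b, base)
            e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
            ratings[a] = r_a + k * (s_a - e_a)
            ratings[b] = r_b - k * (s_a - e_a)  # opposite adjustment for B
        for model, rating in ratings.items():
            samples.setdefault(model, []).append(rating)
    # Report the median rating per model across the permutations.
    return {m: sorted(rs)[len(rs) // 2] for m, rs in samples.items()}
```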

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models, and so we don't need a decentralized algorithm. Second, most models are static (we have access to the weights) and so we don't expect their performance to change. However, it is worth noting that the hosted proprietary models may not be static and their behavior can change without notice. We try our best to pin specific model API versions if possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley-Terry (BT) model. This model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model assuming a fixed but unknown pairwise win rate. Like Elo ratings, the BT model is based on pairwise comparisons and derives ratings of players that estimate their win rates against each other. The core differences between the BT model and the online Elo system are that the BT model assumes a player's performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
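
Because the BT model is a logistic model in the rating difference, its MLE can be computed with an off-the-shelf logistic regression on signed indicator features. The sketch below shows the idea on the familiar Elo scale; it omits the tie handling and per-battle weighting used in the actual Arena computation, and the battle-log format and the near-zero regularization are assumptions for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bradley_terry(battles, models, scale=400.0, init=1000.0):
    """battles: list of (model_a, model_b, s_a) tuples with s_a in {0, 1}.
    Fits Bradley-Terry strengths by MLE and rescales them to Elo-like ratings."""
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    for row, (a, b, s_a) in enumerate(battles):
        # +log(10) / -log(10) encode "A minus B", so that after rescaling by 400
        # the fitted ratings reproduce the Elo win-rate formula above.
        X[row, idx[a]] = np.log(10)
        X[row, idx[b]] = -np.log(10)
        y[row] = s_a
    # A very large C makes the L2 penalty negligible (approximately the plain MLE)
    # while keeping the optimization well-posed.
    clf = LogisticRegression(fit_intercept=False, C=1e6)
    clf.fit(X, y)
    return {m: scale * clf.coef_[0][idx[m]] + init for m in models}
```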