Chatbot Arena

LMSYS • March 25, 2025

This leaderboard is based on the following benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 2.8M+ user votes to compute Elo ratings.
  • MMLU - a test to measure a model's multitask accuracy on 57 tasks.
  • Arena-Hard-Auto - an automatic evaluation tool for instruction-tuned LLMs.


Best Open LM

| Model | Arena Elo | MMLU | License |
|---|---|---|---|
| DeepSeek-R1 | 1360 | 90.8 | MIT |
| Gemma-3-27B-it | 1340 | | Gemma |
| DeepSeek-V3 | 1318 | 88.5 | DeepSeek |
| QwQ-32B | 1315 | | Apache 2.0 |
| Llama-3.3-Nemotron-Super-49B-v1 | 1296 | 86 | Nvidia |

Full Leaderboard
| Model | Arena Elo | Coding | Vision | Arena Hard | MMLU | Votes | Organization | License |
|---|---|---|---|---|---|---|---|---|
| 🥇 Gemini-2.5-Pro-Exp-03-25 | 1443 | 1427 | 1327 | | | 2540 | Google | Proprietary |
| 🥇 Grok-3-Preview-02-24 | 1404 | 1414 | | 92.7 | | 10398 | xAI | Proprietary |
| 🥇 GPT-4.5-Preview | 1398 | 1405 | 1269 | | | 10615 | OpenAI | Proprietary |
| 🥈 Gemini-2.0-Flash-Thinking-Exp-01-21 | 1381 | 1364 | 1276 | | | 22659 | Google | Proprietary |
| 🥈 Gemini-2.0-Pro-Exp-02-05 | 1380 | 1380 | 1241 | | | 20293 | Google | Proprietary |
| 🥈 ChatGPT-4o-latest (2025-01-29) | 1374 | 1366 | 1280 | | | 22517 | OpenAI | Proprietary |
| 🥈 DeepSeek-R1 | 1360 | 1368 | | | 90.8 | 12772 | DeepSeek | MIT |
| 🥈 Gemini-2.0-Flash-Exp | 1355 | 1353 | 1257 | | | 22520 | Google | Proprietary |
| 🥈 o1-2024-12-17 | 1351 | 1359 | 1230 | 90.4 | 91.8 | 25044 | OpenAI | Proprietary |
| 🥈 Qwen2.5-Max | 1340 | 1345 | | | | 17124 | Alibaba | Proprietary |
| 🥈 Gemma-3-27B-it | 1340 | 1307 | | | | 6974 | Google | Gemma |
| 🥈 o3-mini-high | 1326 | 1365 | | | | 14274 | OpenAI | Proprietary |
| 🥉 DeepSeek-V3 | 1318 | 1320 | | | 88.5 | 22845 | DeepSeek | DeepSeek |
| 🥉 QwQ-32B | 1315 | 1327 | | | | 4050 | Alibaba | Apache 2.0 |
| 🥉 Command A (03-2025) | 1311 | 1325 | | | | 3415 | Cohere | CC-BY-NC-4.0 |
| 🥉 Qwen-Plus-0125 | 1310 | 1320 | | | | 6059 | Alibaba | Proprietary |
| 🥉 Gemini-2.0-Flash-Lite | 1310 | 1318 | 1149 | | | 18090 | Google | Proprietary |
| 🥉 GLM-4-Plus-0111 | 1310 | 1290 | | | | 6034 | Zhipu | Proprietary |
| 🥉 o1-mini | 1304 | 1353 | | 92 | | 54975 | OpenAI | Proprietary |
| 🥉 o3-mini | 1304 | 1349 | | | | 20765 | OpenAI | Proprietary |
| 🥉 Claude 3.7 Sonnet (thinking-32k) | 1304 | 1335 | | | | 4917 | Anthropic | Proprietary |
| 🥉 Step-2-16K-Exp | 1304 | 1295 | | | | 5131 | StepFun | Proprietary |
| 🥉 Hunyuan-TurboS-20250226 | 1303 | 1328 | | | | 2460 | Tencent | Proprietary |
| 🥉 Gemini-1.5-Pro-002 | 1302 | 1291 | 1222 | | | 58693 | Google | Proprietary |
| 🥉 Claude 3.7 Sonnet | 1296 | 1338 | 1220 | | | 10197 | Anthropic | Proprietary |
| 🥉 Hunyuan-Turbo-0110 | 1296 | 1315 | | | | 2512 | Tencent | Proprietary |
| 🥉 Llama-3.3-Nemotron-Super-49B-v1 | 1296 | 1302 | | 88.3 | | 2381 | Nvidia | Nvidia |
| 🥉 Grok-2-08-13 | 1288 | 1282 | | | | 67107 | xAI | Proprietary |
| 🥉 Yi-Lightning | 1287 | 1302 | | 81.5 | | 2897 | 01 AI | Proprietary |
| 🥉 GPT-4o-2024-05-13 | 1285 | 1293 | 1206 | 79.21 | 88.7 | 117760 | OpenAI | Proprietary |
| 🥉 Claude 3.5 Sonnet (20241022) | 1283 | 1326 | 1183 | 85.2 | 88.7 | 62603 | Anthropic | Proprietary |
| Deepseek-v2.5-1210 | 1279 | 1297 | | | | 7249 | DeepSeek | DeepSeek |
| Athene-v2-Chat-72B | 1275 | 1300 | | 85 | | 26110 | NexusFlow | NexusFlow |
| Hunyuan-Large-2025-02-10 | 1272 | 1293 | | | | 3859 | Tencent | Proprietary |
| GPT-4o-mini-2024-07-18 | 1272 | 1283 | 1124 | 74.94 | 82 | 69408 | OpenAI | Proprietary |
| Gemini-1.5-Flash-002 | 1271 | 1254 | 1205 | | | 37012 | Google | Proprietary |
| Llama-3.1-405B-Instruct-bf16 | 1268 | 1279 | | | 88.6 | 39742 | Meta | Llama 3.1 |
| Llama-3.1-Nemotron-70B-Instruct | 1268 | 1271 | | 84.9 | | 7578 | Nvidia | Llama 3.1 |
| Llama-3.1-405B-Instruct-fp8 | 1267 | 1276 | | 69.3 | 88.6 | 63067 | Meta | Llama 3.1 |
| Grok-2-Mini-08-13 | 1266 | 1262 | | | | 55449 | xAI | Proprietary |
| Yi-Lightning-lite | 1264 | 1267 | | | | 17071 | 01 AI | Proprietary |
| Hunyuan-Standard-2025-02-10 | 1259 | 1268 | | | | 4021 | Tencent | Proprietary |
| Qwen2.5-72B-Instruct | 1257 | 1283 | | 78 | | 41552 | Alibaba | Qwen |
| Llama-3.3-70B-Instruct | 1257 | 1259 | | | | 34033 | Meta | Llama-3.3 |
| GPT-4-Turbo-2024-04-09 | 1256 | 1263 | 1151 | 82.63 | | 102184 | OpenAI | Proprietary |
| Mistral-Large-2407 | 1251 | 1269 | | 70.42 | | 48233 | Mistral | Mistral Research |
| GPT-4-1106-preview | 1250 | 1253 | | | | 103764 | OpenAI | Proprietary |
| Athene-70B | 1250 | 1253 | | 77.6 | | 20597 | NexusFlow | CC-BY-NC-4.0 |
| Mistral-Large-2411 | 1248 | 1265 | | | | 26946 | Mistral | MRL |
| Llama-3.1-70B-Instruct | 1248 | 1251 | | 55.73 | 86 | 58683 | Meta | Llama 3.1 |
| Claude 3 Opus | 1247 | 1250 | 1076 | 60.36 | 86.8 | 202710 | Anthropic | Proprietary |
| Amazon Nova Pro 1.0 | 1246 | 1262 | 1044 | | | 22197 | Amazon | Proprietary |
| GPT-4-0125-preview | 1245 | 1243 | | 77.96 | | 97094 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-70B | 1244 | 1233 | | | | 3015 | Ai2 | Llama 3.1 |
| Yi-Large-preview | 1240 | 1245 | | 71.48 | | 5165 | 01 AI | Proprietary |
| Claude 3.5 Haiku (20241022) | 1237 | 1264 | | | | 29154 | Anthropic | Proprietary |
| Reka-Core-20240904 | 1235 | 1222 | | | | 7941 | Reka AI | Proprietary |
| Reka-Core-20240722 | 1231 | 1208 | | | | 13290 | Reka AI | Proprietary |
| Qwen-Plus-0828 | 1227 | 1245 | | | | 14623 | Alibaba | Proprietary |
| Gemini-1.5-Flash-001 | 1227 | 1232 | 1072 | 49.61 | 78.9 | 65683 | Google | Proprietary |
| Jamba-1.5-Large | 1221 | 1227 | | | 81.2 | 9129 | AI21 Labs | Jamba Open |
| Deepseek-v2-API-0628 | 1220 | 1242 | | | | 19514 | DeepSeek AI | DeepSeek |
| Gemma-2-27B-it | 1220 | 1209 | | 57.51 | | 79537 | Google | Gemma license |
| Qwen2.5-Coder-32B-Instruct | 1217 | 1261 | | | | 5730 | Alibaba | Apache 2.0 |
| Amazon Nova Lite 1.0 | 1217 | 1236 | 1061 | | | 20235 | Amazon | Proprietary |
| Gemma-2-9B-it-SimPO | 1216 | 1196 | | | | 10554 | Princeton | MIT |
| Command R+ (08-2024) | 1215 | 1181 | | | | 10540 | Cohere | CC-BY-NC-4.0 |
| Deepseek-Coder-v2-0724 | 1214 | 1266 | | 62.3 | | 11733 | DeepSeek | Proprietary |
| Mistral-Small-24B-Instruct-2501 | 1214 | 1227 | | | | 12426 | Mistral | Apache 2.0 |
| Yi-Large | 1213 | 1220 | | 63.7 | | 16631 | 01 AI | Proprietary |
| Gemini-1.5-Flash-8B-001 | 1212 | 1208 | 1106 | | | 37695 | Google | Proprietary |
| Llama-3.1-Nemotron-51B-Instruct | 1211 | 1211 | | | | 3886 | Nvidia | Llama 3.1 |
| Nemotron-4-340B-Instruct | 1209 | 1198 | | | | 20608 | Nvidia | NVIDIA Open Model |
| Aya-Expanse-32B | 1209 | 1193 | | | | 28759 | Cohere | CC-BY-NC-4.0 |
| Gemini App (2024-01-24) | 1208 | 1171 | | | | 11839 | Google | Proprietary |
| Llama-3-70B-Instruct | 1207 | 1199 | | 46.57 | 82 | 163682 | Meta | Llama 3 |
| GLM-4-0520 | 1206 | 1216 | | 63.84 | | 10218 | Zhipu AI | Proprietary |
| Reka-Flash-20240904 | 1205 | 1191 | | | | 8136 | Reka AI | Proprietary |
| Gemini-1.5-Flash-8B-Exp-0827 | 1205 | 1189 | 1112 | | | 25357 | Google | Proprietary |
| Phi-4 | 1204 | 1223 | | | | 21497 | Microsoft | MIT |
| Claude 3 Sonnet | 1201 | 1213 | 1048 | 46.8 | 79 | 113083 | Anthropic | Proprietary |
| Reka-Flash-20240722 | 1201 | 1187 | | | | 13731 | Reka AI | Proprietary |
| Reka-Core-20240501 | 1199 | 1190 | 1015 | | 83.2 | 62586 | Reka AI | Proprietary |
| Amazon Nova Micro 1.0 | 1198 | 1211 | | | | 20253 | Amazon | Proprietary |
| Gemma-2-9B-it | 1192 | 1173 | | | | 57201 | Google | Gemma license |
| Command R+ (04-2024) | 1190 | 1164 | | 33.07 | | 80880 | Cohere | CC-BY-NC-4.0 |
| Hunyuan-Standard-256K | 1189 | 1227 | | | | 2902 | Tencent | Proprietary |
| Qwen2-72B-Instruct | 1187 | 1187 | | 46.86 | 84.2 | 38888 | Alibaba | Qianwen LICENSE |
| GPT-4-0314 | 1186 | 1195 | | 50 | 86.4 | 55981 | OpenAI | Proprietary |
| Llama-3.1-Tulu-3-8B | 1185 | 1179 | | | | 3073 | Ai2 | Llama 3.1 |
| GLM-4-0116 | 1183 | 1191 | | 55.7 | | 27580 | Zhipu AI | Proprietary |
| Qwen-Max-0428 | 1183 | 1189 | | | | 25693 | Alibaba | Proprietary |
| Ministral-8B-2410 | 1182 | 1201 | | | | 5115 | Mistral | MRL |
| Aya-Expanse-8B | 1180 | 1165 | | | | 10390 | Cohere | CC-BY-NC-4.0 |
| Claude 3 Haiku | 1179 | 1189 | 1000 | 41.47 | 75.2 | 122349 | Anthropic | Proprietary |
| Command R (08-2024) | 1179 | 1161 | | | | 10849 | Cohere | CC-BY-NC-4.0 |
| DeepSeek-Coder-V2-Instruct | 1178 | 1239 | | | | 15753 | DeepSeek AI | DeepSeek License |
| Llama-3.1-8B-Instruct | 1176 | 1186 | | 21.34 | 73 | 52607 | Meta | Llama 3.1 |
| Jamba-1.5-Mini | 1176 | 1181 | | | 69.7 | 9272 | AI21 Labs | Jamba Open |
| Reka-Flash-Preview-20240611 | 1165 | 1155 | 1024 | | | 20430 | Reka AI | Proprietary |
| GPT-4-0613 | 1163 | 1167 | | 37.9 | | 91641 | OpenAI | Proprietary |
| Qwen1.5-110B-Chat | 1161 | 1174 | | | 80.4 | 27440 | Alibaba | Qianwen LICENSE |
| Mistral-Large-2402 | 1157 | 1169 | | 37.71 | 81.2 | 64931 | Mistral | Proprietary |
| Yi-1.5-34B-Chat | 1157 | 1162 | | | 76.8 | 25138 | 01 AI | Apache-2.0 |
| Reka-Flash-21B-online | 1156 | 1147 | | | | 16022 | Reka AI | Proprietary |
| QwQ-32B-Preview | 1153 | 1147 | | | | 3409 | Alibaba | Apache 2.0 |
| Llama-3-8B-Instruct | 1152 | 1146 | | 20.56 | 68.4 | 109093 | Meta | Llama 3 |
| InternLM2.5-20B-chat | 1149 | 1158 | | | | 10604 | InternLM | Other |
| Claude-1 | 1149 | 1135 | | | 77 | 21151 | Anthropic | Proprietary |
| Command R (04-2024) | 1149 | 1123 | | 17.02 | | 56386 | Cohere | CC-BY-NC-4.0 |
| Mistral Medium | 1148 | 1152 | | 31.9 | 75.3 | 35561 | Mistral | Proprietary |
| Qwen1.5-72B-Chat | 1147 | 1160 | | 36.12 | 77.5 | 40666 | Alibaba | Qianwen LICENSE |
| Mixtral-8x22b-Instruct-v0.1 | 1147 | 1152 | | 36.36 | 77.8 | 53769 | Mistral | Apache 2.0 |
| Reka-Flash-21B | 1147 | 1141 | | | 73.5 | 25807 | Reka AI | Proprietary |
| Gemma-2-2b-it | 1144 | 1107 | | | 51.3 | 48921 | Google | Gemma license |
| Granite-3.1-8B-Instruct | 1142 | 1172 | | | | 3299 | IBM | Apache 2.0 |
| Claude-2.0 | 1132 | 1135 | | 23.99 | 78.5 | 12758 | Anthropic | Proprietary |
| Gemini-1.0-Pro-001 | 1131 | 1103 | | | 71.8 | 18809 | Google | Proprietary |
| Zephyr-ORPO-141b-A35b-v0.1 | 1127 | 1124 | | | | 4862 | HuggingFace | Apache 2.0 |
| Qwen1.5-32B-Chat | 1125 | 1149 | | | 73.4 | 22769 | Alibaba | Qianwen LICENSE |
| Mistral-Next | 1124 | 1132 | | 27.37 | | 12377 | Mistral | Proprietary |
| Phi-3-Medium-4k-Instruct | 1123 | 1125 | | 33.37 | 78 | 26107 | Microsoft | MIT |
| Granite-3.1-2B-Instruct | 1119 | 1147 | | | | 3384 | IBM | Apache 2.0 |
| Starling-LM-7B-beta | 1119 | 1129 | | 23.01 | | 16676 | Nexusflow | Apache-2.0 |
| Claude-2.1 | 1118 | 1132 | | 22.77 | | 37697 | Anthropic | Proprietary |
| GPT-3.5-Turbo-0613 | 1117 | 1135 | | 24.82 | | 38958 | OpenAI | Proprietary |
| Mixtral-8x7B-Instruct-v0.1 | 1114 | 1114 | | 23.4 | 70.6 | 76141 | Mistral | Apache 2.0 |
| Claude-Instant-1 | 1111 | 1108 | | | 73.4 | 20625 | Anthropic | Proprietary |
| Yi-34B-Chat | 1111 | 1106 | | 23.15 | 73.5 | 15925 | 01 AI | Yi License |
| Gemini Pro | 1111 | 1092 | | 17.8 | 71.8 | 6559 | Google | Proprietary |
| Qwen1.5-14B-Chat | 1109 | 1126 | | | 67.6 | 18686 | Alibaba | Qianwen LICENSE |
| GPT-3.5-Turbo-0314 | 1107 | 1115 | | 18.05 | 70 | 5639 | OpenAI | Proprietary |
| GPT-3.5-Turbo-0125 | 1106 | 1124 | | 23.34 | | 68884 | OpenAI | Proprietary |
| WizardLM-70B-v1.0 | 1106 | 1071 | | | 63.7 | 8382 | Microsoft | Llama 2 |
| DBRX-Instruct-Preview | 1103 | 1118 | | 24.63 | 73.7 | 33739 | Databricks | DBRX LICENSE |
| Llama-3.2-3B-Instruct | 1103 | 1080 | | | | 8394 | Meta | Llama 3.2 |
| Phi-3-Small-8k-Instruct | 1102 | 1107 | | 29.77 | 75.7 | 18476 | Microsoft | MIT |
| Tulu-2-DPO-70B | 1099 | 1093 | | 14.99 | | 6659 | AllenAI/UW | AI2 ImpACT Low-risk |
| Granite-3.0-8B-Instruct | 1093 | 1097 | | | | 7002 | IBM | Apache 2.0 |
| Llama-2-70B-chat | 1093 | 1072 | | 11.55 | 63 | 39617 | Meta | Llama 2 |
| OpenChat-3.5-0106 | 1091 | 1102 | | | 65.8 | 12990 | OpenChat | Apache-2.0 |
| Snowflake Arctic Instruct | 1090 | 1077 | | 17.61 | 67.3 | 34175 | Snowflake | Apache 2.0 |
| Vicuna-33B | 1090 | 1067 | | 8.63 | 59.2 | 22941 | LMSYS | Non-commercial |
| Starling-LM-7B-alpha | 1088 | 1080 | | 12.8 | 63.9 | 10420 | UC Berkeley | CC-BY-NC-4.0 |
| Gemma-1.1-7B-it | 1084 | 1084 | | 12.09 | 64.3 | 25066 | Google | Gemma license |
| Nous-Hermes-2-Mixtral-8x7B-DPO | 1084 | 1079 | | | | 3837 | NousResearch | Apache-2.0 |
| NV-Llama2-70B-SteerLM-Chat | 1081 | 1022 | | | 68.5 | 3637 | Nvidia | Llama 2 |
| pplx-70B-online | 1078 | 1028 | | | | 6897 | Perplexity AI | Proprietary |
| DeepSeek-LLM-67B-Chat | 1077 | 1079 | | | 71.3 | 4988 | DeepSeek AI | DeepSeek License |
| OpenChat-3.5 | 1076 | 1054 | | | 64.3 | 8110 | OpenChat | Apache-2.0 |
| Granite-3.0-2B-Instruct | 1074 | 1088 | | | | 7190 | IBM | Apache 2.0 |
| OpenHermes-2.5-Mistral-7B | 1074 | 1057 | | | | 5089 | NousResearch | Apache-2.0 |
| Mistral-7B-Instruct-v0.2 | 1072 | 1074 | | 12.57 | | 20066 | Mistral | Apache-2.0 |
| Phi-3-Mini-4K-Instruct-June-24 | 1071 | 1082 | | | 70.9 | 12811 | Microsoft | MIT |
| Qwen1.5-7B-Chat | 1070 | 1089 | | | 61 | 4872 | Alibaba | Qianwen LICENSE |
| GPT-3.5-Turbo-1106 | 1068 | 1095 | | 18.87 | | 17037 | OpenAI | Proprietary |
| Phi-3-Mini-4k-Instruct | 1066 | 1086 | | | 68.8 | 21099 | Microsoft | MIT |
| Llama-2-13b-chat | 1063 | 1051 | | | 53.6 | 19724 | Meta | Llama 2 |
| Dolphin-2.2.1-Mistral-7B | 1063 | 1025 | | | | 1713 | Cognitive Computations | Apache-2.0 |
| SOLAR-10.7B-Instruct-v1.0 | 1062 | 1047 | | | 66.2 | 4289 | Upstage AI | CC-BY-NC-4.0 |
| WizardLM-13b-v1.2 | 1059 | 1025 | | | 52.7 | 7175 | Microsoft | Llama 2 |
| Llama-3.2-1B-Instruct | 1054 | 1046 | | | | 8524 | Meta | Llama 3.2 |
| Step-1o-Vision-32k (highres) | | | 1186 | | | | StepFun | Proprietary |
| Qwen2.5-VL-72B-Instruct | | | 1168 | | | | Alibaba | Qwen |
| Pixtral-Large-2411 | | | 1154 | | | | Mistral | MRL |
| Qwen-VL-Max-1119 | | | 1128 | | | | Alibaba | Proprietary |
| Step-1V-32K | | | 1111 | | | | StepFun | Proprietary |
| Qwen2-VL-72b-Instruct | | | 1110 | | | | Alibaba | Qwen |
| Molmo-72B-0924 | | | 1075 | | | | AI2 | Apache 2.0 |
| Pixtral-12B-2409 | | | 1072 | | | | Mistral | Apache 2.0 |
| Llama-3.2-90B-Vision-Instruct | | | 1069 | | | | Meta | Llama 3.2 |
| InternVL2-26B | | | 1067 | | | | OpenGVLab | MIT |
| Hunyuan-Standard-Vision-2024-12-31 | | | 1064 | | | | Tencent | Proprietary |
| Qwen2-VL-7B-Instruct | | | 1054 | | | | Alibaba | Apache 2.0 |
| Yi-Vision | | | 1045 | | | | 01 AI | Proprietary |
| Llama-3.2-11B-Vision-Instruct | | | 1032 | | | | Meta | Llama 3.2 |

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. The latest and detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It transforms pairwise human preferences into Elo ratings that serve as a predictor of the win rate between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

E_A = 1 / (1 + 10^{(R_B − R_A)/400})

The Elo rating system has been used by the international chess community to rank players for over 60 years. Standard Elo rating systems assume a player's performance changes over time, so an online algorithm is needed to capture such dynamics: recent games should weigh more than older games. Specifically, after each game, a player's rating is updated according to the difference between the predicted outcome and the actual outcome.

R'_A = R_A + K · (S_A − E_A)

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows for players' performance to change dynamically – it does not assume a fixed unknown value for a player's rating.

This ability to adapt is controlled by the parameter K, which determines the magnitude of each rating change. A larger K puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance “converges”, a smaller value of K is more appropriate. As a result, the USCF adopted a K factor based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.
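The two formulas above can be sketched in a few lines of Python (a toy illustration, not the Arena's production code; the K value and starting ratings are arbitrary choices):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One online update; score_a is 1 if A won, 0.5 for a tie, 0 if A lost.
    A larger k weighs this game more heavily relative to the rating history."""
    e_a = expected_score(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta  # zero-sum: B loses what A gains
```

With equal ratings the expected score is 0.5, so a win moves each player by K/2 points in opposite directions.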

When we launched the Arena, we noticed considerable variability in the ratings computed by the classic online algorithm. We tried to tune K to be sufficiently stable while also allowing new models to move up the leaderboard quickly. We ultimately adopted a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly; the same effect was observed in recent work by Cohere. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
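A minimal sketch of that permutation idea (hypothetical helper names and a toy scale; the actual computation lives in the linked notebook):

```python
import random

def online_elo(battles, k=4.0, init=1000.0):
    """Replay a battle log in order; battles is a list of (model_a, model_b, score_a)."""
    ratings = {}
    for a, b, s in battles:
        ra, rb = ratings.setdefault(a, init), ratings.setdefault(b, init)
        ea = 1 / (1 + 10 ** ((rb - ra) / 400))
        ratings[a] = ra + k * (s - ea)
        ratings[b] = rb - k * (s - ea)
    return ratings

def bootstrap_elo(battles, rounds=1000, seed=0):
    """Rerun online Elo on shuffled copies of the log; report the median per model."""
    rng = random.Random(seed)
    samples = {m: [] for b in battles for m in b[:2]}
    for _ in range(rounds):
        shuffled = battles[:]
        rng.shuffle(shuffled)
        for m, r in online_elo(shuffled).items():
            samples[m].append(r)
    return {m: sorted(rs)[len(rs) // 2] for m, rs in samples.items()}
```

Because each permutation erases the original game order, the median over many replays washes out the order sensitivity of the online algorithm.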

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models, so we do not need a decentralized algorithm. Second, most models are static (we have access to the weights), so we do not expect their performance to change. That said, hosted proprietary models may not be static and their behavior can change without notice, so we try our best to pin specific model API versions where possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley–Terry (BT) model. The BT model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model under the assumption of a fixed but unknown pairwise win rate. Like Elo, the BT model derives ratings from pairwise comparisons in order to estimate the win rate between players. The core difference between the BT model and the online Elo system is that BT assumes a player's performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
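To make the contrast concrete, a BT fit consumes the aggregated win counts all at once, with no notion of game order. Below is a minimal sketch using the classic minorization–maximization (Zermelo) iteration for the BT MLE, mapped back onto the Elo scale; it is illustrative only (the Arena's actual fit also handles ties and produces proper confidence intervals), and the function name and 400/1000 scaling constants are our own choices:

```python
import math

def fit_bradley_terry(wins: dict, iters: int = 200) -> dict:
    """wins[(i, j)] = number of times i beat j (order matters).
    Returns ratings on the Elo scale via R = 400 * log10(strength) + 1000.
    Players with zero wins need smoothing in practice (log10(0) diverges)."""
    players = {p for pair in wins for p in pair}
    strength = {m: 1.0 for m in players}
    for _ in range(iters):
        new = {}
        for i in players:
            w_i = sum(w for (a, _), w in wins.items() if a == i)  # total wins of i
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in players if j != i
            )
            new[i] = w_i / denom if denom else strength[i]
        mean = sum(new.values()) / len(new)  # fix the scale (identifiability)
        strength = {m: v / mean for m, v in new.items()}
    return {m: 400 * math.log10(v) + 1000 for m, v in strength.items()}
```

For example, if A beats B 3 times out of 4, the fitted strengths satisfy P(A beats B) = 3/4, i.e. a rating gap of 400·log10(3) ≈ 191 points. The same win counts give the same ratings no matter how the games are ordered, which is exactly the order-independence described above.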