Chatbot Arena

Attribution: LMSYS, February 21, 2025

This leaderboard is based on the following benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 2.7M+ user votes to compute Elo ratings.
  • MMLU - a test to measure a model’s multitask accuracy on 57 tasks.
  • Arena-Hard-Auto - an automatic evaluation tool for instruction-tuned LLMs.


Best Open LM

| Model | Arena Elo | MMLU | License |
|---|---|---|---|
| DeepSeek DeepSeek-R1 | 1362 | 90.8 | MIT |
| DeepSeek DeepSeek-V3 | 1318 | 88.5 | DeepSeek |
| Qwen Qwen2.5-72B-Instruct | 1257 | 86.8 | Qwen |
| Meta Llama-3.3-70B-Instruct | 1256 | 86 | Llama 3.3 |

Full Leaderboard
Model | Arena Elo | Coding | Vision | Arena Hard | MMLU | Votes | Organization | License
🥇🏆chocolate (Early Grok-3)140314039992xAIProprietary
🥈 Gemini-2.0-Flash-Thinking-Exp-01-2113851368128215083GoogleProprietary
🥈 Gemini-2.0-Pro-Exp-02-0513801373125113000GoogleProprietary
🥈 ChatGPT-4o-latest (2025-01-29)13771364127613470OpenAIProprietary
🥈 DeepSeek-R1136213696581DeepSeekMIT
🥈 Gemini-2.0-Flash-00113581355123210862GoogleProprietary
🥈 o1-2024-12-1713521360121690.417248OpenAIProprietary
🥈 Qwen2.5-Max133413379282AlibabaProprietary
🥈 o3-mini-high133213735954OpenAIProprietary
🥉 DeepSeek-V31318131619461DeepSeekDeepSeek
🥉 Qwen-Plus-0125131113175112AlibabaProprietary
🥉 GLM-4-Plus-0111131012865134ZhipuProprietary
🥉 Gemini-2.0-Flash-Lite-Preview-02-0513091323115410262GoogleProprietary
🥉 o3-mini1306135512179OpenAIProprietary
🥉 o1-mini130413539254944OpenAIProprietary
🥉 Step-2-16K-Exp130412955130StepFunProprietary
🥉 Gemini-1.5-Pro-00213021291122054970GoogleProprietary
🥉 Grok-2-08-131288128267045xAIProprietary
🥉 Yi-Lightning1287130281.52895801 AIProprietary
🥉 Claude 3.5 Sonnet (20241022)12831326118485.288.756346AnthropicProprietary
Deepseek-v2.5-1210127912967245DeepSeekDeepSeek
Athene-v2-Chat-72B127513008526092NexusFlowNexusFlow
GPT-4o-mini-2024-07-1812731283112474.948265335OpenAIProprietary
Gemini-1.5-Flash-00212711254120536993GoogleProprietary
Llama-3.1-405B-Instruct-bf161269128088.631499MetaLlama 3.1
Llama-3.1-Nemotron-70B-Instruct1268127184.97601NvidiaLlama 3.1
Grok-2-Mini-08-131266126255424xAIProprietary
Yi-Lightning-lite126412671705901 AIProprietary
Qwen2.5-72B-Instruct125712837841543AlibabaQwen
GPT-4-Turbo-2024-04-0912561263115182.63102119OpenAIProprietary
Llama-3.3-70B-Instruct1256125825878MetaLlama-3.3
Mistral-Large-24071251126970.4248201MistralMistral Research
GPT-4-1106-preview12501253103725OpenAIProprietary
Athene-70B1250125377.620604NexusFlowCC-BY-NC-4.0
Llama-3.1-70B-Instruct1248125155.738658750MetaLlama 3.1
Mistral-Large-24111247126620816MistralMRL
Claude 3 Opus12471250107660.3686.8202650AnthropicProprietary
Amazon Nova Pro 1.012461259104318006AmazonProprietary
GPT-4-0125-preview1245124477.9697041OpenAIProprietary
Llama-3.1-Tulu-3-70B124412323014Ai2Llama 3.1
Yi-Large-preview1240124571.485165201 AIProprietary
Claude 3.5 Haiku (20241022)1236126321000AnthropicProprietary
Reka-Core-20240904123512227935Reka AIProprietary
Reka-Core-202407221231120813286Reka AIProprietary
Qwen-Plus-08281227124514612AlibabaProprietary
Gemini-1.5-Flash-00112271232107249.6178.965657GoogleProprietary
Jamba-1.5-Large1221122781.29120AI21 LabsJamba Open
Deepseek-v2-API-06281220124219499DeepSeek AIDeepSeek
Gemma-2-27B-it1220121057.5177543GoogleGemma license
Qwen2.5-Coder-32B-Instruct121712615724AlibabaApache 2.0
Amazon Nova Lite 1.012161234106015984AmazonProprietary
Gemma-2-9B-it-SimPO1216119610549PrincetonMIT
Command R+ (08-2024)1215118110539CohereCC-BY-NC-4.0
Deepseek-Coder-v2-07241214126662.311723DeepSeekProprietary
Gemini-1.5-Flash-8B-00112131208110637680GoogleProprietary
Yi-Large1212122063.71662901 AIProprietary
Llama-3.1-Nemotron-51B-Instruct121112113891NvidiaLlama 3.1
Mistral-Small-24B-Instruct-2501121012156280MistralApache 2.0
Nemotron-4-340B-Instruct1209119820608NvidiaNVIDIA Open Model
Aya-Expanse-32B1209119328750CohereCC-BY-NC-4.0
Gemini App (2024-01-24)1208117111827GoogleProprietary
GLM-4-05201207121663.8410215Zhipu AIProprietary
Llama-3-70B-Instruct1206120046.5782163745MetaLlama 3
Reka-Flash-20240904120511918127Reka AIProprietary
Gemini-1.5-Flash-8B-Exp-082712051189111225350GoogleProprietary
Phi-41204122613156MicrosoftMIT
Claude 3 Sonnet12011213104846.879113016AnthropicProprietary
Reka-Flash-202407221201118713722Reka AIProprietary
Reka-Core-2024050112001190101583.262563Reka AIProprietary
Amazon Nova Micro 1.01198121016043AmazonProprietary
Gemma-2-9B-it1191117355285GoogleGemma license
Command R+ (04-2024)1190116433.0780868CohereCC-BY-NC-4.0
Hunyuan-Standard-256K118912272898TencentProprietary
Qwen2-72B-Instruct1187118746.8684.238875AlibabaQianwen LICENSE
GPT-4-0314118611955086.455951OpenAIProprietary
Llama-3.1-Tulu-3-8B118511783073Ai2Llama 3.1
GLM-4-01161183119155.727580Zhipu AIProprietary
Qwen-Max-04281183118925676AlibabaProprietary
Ministral-8B-2410118212015108MistralMRL
Aya-Expanse-8B1180116510404CohereCC-BY-NC-4.0
Command R (08-2024)1180116210845CohereCC-BY-NC-4.0
Claude 3 Haiku11791189100041.4775.2122289AnthropicProprietary
DeepSeek-Coder-V2-Instruct1178123915749DeepSeek AIDeepSeek License
Llama-3.1-8B-Instruct1176118621.347352633MetaLlama 3.1
Jamba-1.5-Mini1176118169.79271AI21 LabsJamba Open
Reka-Flash-Preview-2024061111651155102420411Reka AIProprietary
GPT-4-06131163116737.991608OpenAIProprietary
Qwen1.5-110B-Chat1161117580.427453AlibabaQianwen LICENSE
Mistral-Large-24021157117037.7181.264903MistralProprietary
Yi-1.5-34B-Chat1157116276.82512101 AIApache-2.0
Reka-Flash-21B-online1156114716016Reka AIProprietary
QwQ-32B-Preview115311463413AlibabaApache 2.0
Llama-3-8B-Instruct1152114620.5668.4109218MetaLlama 3
InternLM2.5-20B-chat1149115810595InternLMOther
Claude-1114911367721158AnthropicProprietary
Command R (04-2024)1149112317.0256356CohereCC-BY-NC-4.0
Mixtral-8x22b-Instruct-v0.11148115236.3677.853773MistralApache 2.0
Mistral Medium1148115231.975.335552MistralProprietary
Reka-Flash-21B1148114173.525807Reka AIProprietary
Qwen1.5-72B-Chat1147116036.1277.540638AlibabaQianwen LICENSE
Gemma-2-2b-it1144110851.347032GoogleGemma license
Granite-3.1-8B-Instruct114311713302IBMApache 2.0
Claude-2.01132113523.9978.512758AnthropicProprietary
Gemini-1.0-Pro-0011131110371.818789GoogleProprietary
Zephyr-ORPO-141b-A35b-v0.1112711244859HuggingFaceApache 2.0
Qwen1.5-32B-Chat1125114973.422766AlibabaQianwen LICENSE
Mistral-Next1124113227.3712372MistralProprietary
Phi-3-Medium-4k-Instruct1123112533.377826092MicrosoftMIT
Granite-3.1-2B-Instruct112011473385IBMApache 2.0
Starling-LM-7B-beta1119112923.0116668NexusflowApache-2.0
Claude-2.11118113222.7737688AnthropicProprietary
GPT-3.5-Turbo-06131117113524.8238940OpenAIProprietary
Mixtral-8x7B-Instruct-v0.11114111423.470.676103MistralApache 2.0
Claude-Instant-11111110973.420618AnthropicProprietary
Yi-34B-Chat1111110623.1573.51592401 AIYi License
Gemini Pro1110109117.871.86559GoogleProprietary
Qwen1.5-14B-Chat1109112667.618667AlibabaQianwen LICENSE
GPT-3.5-Turbo-01251106112423.3468856OpenAIProprietary
GPT-3.5-Turbo-03141106111518.05705648OpenAIProprietary
WizardLM-70B-v1.01106107163.78380MicrosoftLlama 2
DBRX-Instruct-Preview1103111824.6373.733720DatabricksDBRX LICENSE
Llama-3.2-3B-Instruct110310808405MetaLlama 3.2
Phi-3-Small-8k-Instruct1102110729.7775.718479MicrosoftMIT
Tulu-2-DPO-70B1099109314.996656AllenAI/UWAI2 ImpACT Low-risk
Granite-3.0-8B-Instruct109310977003IBMApache 2.0
Llama-2-70B-chat1093107211.556339614MetaLlama 2
OpenChat-3.5-01061091110265.812981OpenChatApache-2.0
Vicuna-33B109110678.6359.222944LMSYSNon-commercial
Snowflake Arctic Instruct1090107767.334176SnowflakeApache 2.0
Starling-LM-7B-alpha1088108012.863.910415UC BerkeleyCC-BY-NC-4.0
Gemma-1.1-7B-it1084108464.325071GoogleGemma license
Nous-Hermes-2-Mixtral-8x7B-DPO108410793834NousResearchApache-2.0
NV-Llama2-70B-SteerLM-Chat1081102368.53637NvidiaLlama 2
pplx-70B-online107810286896Perplexity AIProprietary
DeepSeek-LLM-67B-Chat1077107971.34984DeepSeek AIDeepSeek License
OpenChat-3.51076105464.38106OpenChatApache-2.0
Granite-3.0-2B-Instruct107410887185IBMApache 2.0
OpenHermes-2.5-Mistral-7B107410585088NousResearchApache-2.0
Mistral-7B-Instruct-v0.21072107420050MistralApache-2.0
Qwen1.5-7B-Chat10701089614863AlibabaQianwen LICENSE
Phi-3-Mini-4K-Instruct-June-241070108270.912807MicrosoftMIT
GPT-3.5-Turbo-11061068109517031OpenAIProprietary
Phi-3-Mini-4k-Instruct1066108668.821094MicrosoftMIT
Llama-2-13b-chat1063105153.619730MetaLlama 2
Dolphin-2.2.1-Mistral-7B106310251714Cognitive ComputationsApache-2.0
SOLAR-10.7B-Instruct-v1.01062104766.24288Upstage AICC-BY-NC-4.0
WizardLM-13b-v1.21058102652.77174MicrosoftLlama 2
Llama-3.2-1B-Instruct105410478529MetaLlama 3.2
Step-1o-Vision-32k (highres)1183StepFunProprietary
Qwen2.5-VL-72B-Instruct1166AlibabaQwen
Pixtral-Large-24111153MistralMRL
Qwen-VL-Max-11191127AlibabaProprietary
Qwen2-VL-72b-Instruct1111AlibabaQwen
Step-1V-32K1110StepFunProprietary
Molmo-72B-09241075AI2Apache 2.0
Pixtral-12B-24091072MistralApache 2.0
Llama-3.2-90B-Vision-Instruct1068MetaLlama 3.2
InternVL2-26B1067OpenGVLabMIT
Qwen2-VL-7B-Instruct1052AlibabaApache 2.0
Yi-Vision104501 AIProprietary
Llama-3.2-11B-Vision-Instruct1031MetaLlama 3.2

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean the value is not available. The latest, detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It has been useful for transforming pairwise human preferences into Elo ratings that serve as a predictor of the win rate between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

$$ E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}. $$
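
As a quick illustration, here is a minimal Python sketch of that expected score; the function name and example ratings are ours, not from the Arena notebook:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Example: a model rated 1350 is expected to beat a model rated 1250
# in roughly 64% of battles.
print(elo_expected_score(1350, 1250))  # ~0.64
```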

The Elo rating system has been used by the international chess community to rank players for over 60 years. Standard Elo rating systems assume that a player's performance changes over time, so an online algorithm is needed to capture such dynamics: recent games should weigh more than older games. Specifically, after each game, a player's rating is updated according to the difference between the actual outcome S_A and the predicted outcome E_A.

$$ R'_A = R_A + K \cdot (S_A - E_A). $$

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows a player's performance to change dynamically: it does not assume a fixed, unknown value for the player's rating.

This ability to adapt is determined by the parameter K, which controls the magnitude of the rating changes and thus how strongly each game can affect the overall result. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance "converges", a smaller value of K is more appropriate. As a result, the USCF adopted a K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.
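
Below is a minimal sketch of such an online update over a stream of battles. The (model_a, model_b, score_a) battle format, the K = 4 default, and the 1000 starting rating are illustrative assumptions, not necessarily the Arena's exact settings:

```python
from collections import defaultdict

def online_elo(battles, k=4, base_rating=1000):
    """Sequential (online) Elo over a stream of battles.

    `battles` is an iterable of (model_a, model_b, score_a) tuples, where
    score_a is 1.0 if model A won, 0.0 if model B won, and 0.5 for a tie.
    Because each update only nudges the current rating, later battles
    effectively carry more weight than earlier ones.
    """
    ratings = defaultdict(lambda: base_rating)
    for model_a, model_b, score_a in battles:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        ratings[model_a] += k * (score_a - expected_a)  # R'_A = R_A + K * (S_A - E_A)
        ratings[model_b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)
```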

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K so that the ratings were sufficiently stable while still allowing new models to move up the leaderboard quickly. We ultimately decided to adopt a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly. The same effect is also observed in a recent work by Cohere. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
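
Here is a rough sketch of that permutation idea. It reuses the online_elo sketch above via a compute_ratings callable; the 1000 permutations follow the text, while the median and 95% percentile summary are our illustrative choices:

```python
import numpy as np

def bootstrap_ratings(battles, compute_ratings, num_permutations=1000, seed=0):
    """Recompute ratings over many random orderings of the same battles and
    summarize each model's distribution as (median, 2.5th, 97.5th percentile).

    `compute_ratings` is any function mapping a list of battles to a
    {model: rating} dict, e.g. the online_elo sketch above.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(num_permutations):
        order = rng.permutation(len(battles))
        samples.append(compute_ratings([battles[i] for i in order]))
    models = {model for sample in samples for model in sample}
    return {
        model: (
            np.median([s[model] for s in samples]),
            np.percentile([s[model] for s in samples], 2.5),
            np.percentile([s[model] for s in samples], 97.5),
        )
        for model in models
    }
```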

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models and so we don’t need a decentralized algorithm. Second, most models are static (we have access to the weights) and so we don’t expect their performance to change. However, it is worth noting that the hosted proprietary models may not be static and their behavior can change without notice. We try our best to pin specific model API versions if possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley–Terry (BT) model. This model is actually the maximum likelihood estimate (MLE) of the underlying Elo model under the assumption of a fixed but unknown pairwise win rate. Like the Elo rating system, the BT model derives player ratings from pairwise comparisons in order to estimate the win rates between players. The core differences between the BT model and the online Elo system are that the BT model assumes a player's performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
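
As a sketch of what that centralized MLE can look like, assuming the same (model_a, model_b, score_a) battle format as above; the scipy optimizer, the 400 / ln 10 scaling, and the mean-1000 anchoring are our illustrative choices, not necessarily what the leaderboard notebook uses:

```python
import numpy as np
from scipy.optimize import minimize

def bradley_terry_mle(battles):
    """Maximum-likelihood Bradley-Terry strengths from (model_a, model_b, score_a) outcomes,
    rescaled to an Elo-like scale (400 / ln 10 points per unit of log-strength) and
    anchored so that the mean rating is 1000."""
    models = sorted({m for a, b, _ in battles for m in (a, b)})
    index = {m: i for i, m in enumerate(models)}

    def neg_log_likelihood(beta):
        nll = 0.0
        for a, b, score_a in battles:
            # P(A beats B) under Bradley-Terry with a logistic link.
            p_a = 1.0 / (1.0 + np.exp(beta[index[b]] - beta[index[a]]))
            # A tie (score_a == 0.5) counts as half a win for each side.
            nll -= score_a * np.log(p_a) + (1.0 - score_a) * np.log(1.0 - p_a)
        return nll

    beta = minimize(neg_log_likelihood, np.zeros(len(models)), method="L-BFGS-B").x
    scale = 400.0 / np.log(10.0)
    return {m: scale * (beta[index[m]] - beta.mean()) + 1000.0 for m in models}
```

Because the battle order never enters the likelihood, shuffling the data no longer changes the point estimates; resampling is then only needed to estimate confidence intervals.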