Chatbot Arena

Attribution LMSYS December 18, 2024

This leaderboard is based on the following benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 2.5M+ user votes to compute Elo ratings.
  • MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade model responses.
  • MMLU (5-shot) - a test to measure a model’s multitask accuracy on 57 tasks.
  • Arena-Hard-Auto - an automatic evaluation tool for instruction-tuned LLMs.


Best Open LM

Model | Arena Elo | MMLU | License
Meta Llama-3.3-70B-Instruct | 1256 | 86 | Llama 3.3
Qwen Qwen2.5-72B-Instruct | 1257 | 86.8 | Qwen
DeepSeek Deepseek-v2.5 | 1258 | 80.4 | DeepSeek

Full Leaderboard
Model | Arena Elo | Coding Elo | Arena Hard | MT-bench | MMLU | Votes | Organization | License
🥇 Gemini-Exp-12061372136913175GoogleProprietary
🥇 Gemini-2.0-Flash-Thinking-Exp136913435499GoogleProprietary
🥇 ChatGPT-4o-latest (2024-11-20)1364135126458OpenAIProprietary
🥇 Gemini-2.0-Flash-Exp1355134312190GoogleProprietary
🥇 o1-preview1335135590.432685OpenAIProprietary
🥈 o1-mini130613609241393OpenAIProprietary
🥈 Gemini-1.5-Pro-0021301128937673GoogleProprietary
🥈 Grok-2-08-131288128459987xAIProprietary
🥈 Yi-Lightning1287130381.52918501 AIProprietary
🥈 Claude 3.5 Sonnet (20241022)1283132385.288.739879AnthropicProprietary
🥉 Athene-v2-Chat-72B127712958513497NexusFlowNexusFlow
🥉 GLM-4-Plus1274128327995Zhipu AIProprietary
🥉 GPT-4o-mini-2024-07-181273128474.948256867OpenAIProprietary
🥉 Gemini-1.5-Flash-0021271125230754GoogleProprietary
🥉 Llama-3.1-Nemotron-70B-Instruct1269127184.97669NvidiaLlama 3.1
🥉 Llama-3.1-405B-Instruct1267127769.388.662431MetaLlama 3.1
🥉 Grok-2-Mini-08-131266126251515xAIProprietary
🥉 Yi-Lightning-lite126412671719001 AIProprietary
🥉 Deepseek-v2.51258128826510DeepSeekDeepSeek
🥉 Qwen2.5-72B-Instruct125712827836199AlibabaQwen
🥉 GPT-4-Turbo-2024-04-091256126382.63102236OpenAIProprietary
🥉 Llama-3.3-70B-Instruct125612588089MetaLlama-3.3
Mistral-Large-24071251126970.4248375MistralMistral Research
Athene-70B1250125477.620644NexusFlowCC-BY-NC-4.0
GPT-4-1106-preview125012539.32103822OpenAIProprietary
Mistral-Large-2411124912664386MistralMRL
Llama-3.1-70B-Instruct1248125155.738658389MetaLlama 3.1
Claude 3 Opus1248125060.3686.8197260AnthropicProprietary
GPT-4-0125-preview1245124477.9697155OpenAIProprietary
Amazon Nova Pro 1.0124412606655AmazonProprietary
Yi-Large-preview1240124571.485170401 AIProprietary
Claude 3.5 Haiku (20241022)123812654535AnthropicProprietary
Reka-Core-20240904123512227973Reka AIProprietary
Reka-Core-202407221231120813311Reka AIProprietary
Qwen-Plus-08281227124514694AlibabaProprietary
Gemini-1.5-Flash-0011227123249.6178.965724GoogleProprietary
Jamba-1.5-Large1221122881.29143AI21 LabsJamba Open
Deepseek-v2-API-06281220124219563DeepSeek AIDeepSeek
Gemma-2-27B-it1220121157.5168050GoogleGemma license
Amazon Nova Lite 1.0121912356731AmazonProprietary
Qwen2.5-Coder-32B-Instruct121712605745AlibabaApache 2.0
Gemma-2-9B-it-SimPO1216119710570PrincetonMIT
Command R+ (08-2024)1215118110577CohereCC-BY-NC-4.0
Deepseek-Coder-v2-07241214126662.311744DeepSeekProprietary
Yi-Large1212122063.71667001 AIProprietary
Llama-3.1-Nemotron-51B-Instruct121212113926NvidiaLlama 3.1
Gemini-1.5-Flash-8B-0011212120632411GoogleProprietary
Nemotron-4-340B-Instruct1209119820648NvidiaNVIDIA Open Model
Aya-Expanse-32B1208119123033CohereCC-BY-NC-4.0
Gemini App (2024-01-24)1208117111833GoogleProprietary
GLM-4-05201207121663.8410227Zhipu AIProprietary
Llama-3-70B-Instruct1206120046.5782163892MetaLlama 3
Reka-Flash-20240904120511918158Reka AIProprietary
Gemini-1.5-Flash-8B-Exp-08271205118925413GoogleProprietary
Claude 3 Sonnet1201121346.879113102AnthropicProprietary
Reka-Flash-202407221201118713747Reka AIProprietary
Reka-Core-202405011200119083.262615Reka AIProprietary
Amazon Nova Micro 1.0119712186763AmazonProprietary
Gemma-2-9B-it1191117146390GoogleGemma license
Command R+ (04-2024)1190116433.0780916CohereCC-BY-NC-4.0
Hunyuan-Standard-256K118912262928TencentProprietary
Qwen2-72B-Instruct1187118746.869.1284.238956AlibabaQianwen LICENSE
GPT-4-031411861196508.9686.455978OpenAIProprietary
GLM-4-01161183119155.727583Zhipu AIProprietary
Qwen-Max-04281183119025711AlibabaProprietary
Ministral-8B-2410118212015144MistralMRL
Aya-Expanse-8B118011684485CohereCC-BY-NC-4.0
Claude 3 Haiku1179119041.4775.2122415AnthropicProprietary
Command R (08-2024)1179116210874CohereCC-BY-NC-4.0
DeepSeek-Coder-V2-Instruct1178123915782DeepSeek AIDeepSeek License
Llama-3.1-8B-Instruct1176118621.347352288MetaLlama 3.1
Jamba-1.5-Mini1176118169.79279AI21 LabsJamba Open
Reka-Flash-Preview-202406111165115520453Reka AIProprietary
GPT-4-06131163116737.99.1891645OpenAIProprietary
Qwen1.5-110B-Chat116111758.8880.427470AlibabaQianwen LICENSE
Mistral-Large-24021157117037.7181.264955MistralProprietary
Yi-1.5-34B-Chat1157116276.82515701 AIApache-2.0
Reka-Flash-21B-online1156114716036Reka AIProprietary
Llama-3-8B-Instruct1152114620.5668.4109274MetaLlama 3
QwQ-32B-Preview115011452823AlibabaApache 2.0
InternLM2.5-20B-chat1149115810674InternLMOther
Claude-1114911367.97721169AnthropicProprietary
Command R (04-2024)1149112317.0256400CohereCC-BY-NC-4.0
Mistral Medium1148115331.98.6175.335547MistralProprietary
Mixtral-8x22b-Instruct-v0.11148115336.3677.853816MistralApache 2.0
Reka-Flash-21B1148114173.525820Reka AIProprietary
Qwen1.5-72B-Chat1147116036.128.6177.540651AlibabaQianwen LICENSE
Gemma-2-2b-it1142110551.337951GoogleGemma license
Claude-2.01132113523.998.0678.512763AnthropicProprietary
Gemini-1.0-Pro-0011131110371.818795GoogleProprietary
Zephyr-ORPO-141b-A35b-v0.1112711244861HuggingFaceApache 2.0
Qwen1.5-32B-Chat112511498.373.422764AlibabaQianwen LICENSE
Mistral-Next1124113227.3712378MistralProprietary
Phi-3-Medium-4k-Instruct1123112533.377826132MicrosoftMIT
Starling-LM-7B-beta1119112923.018.1216674NexusflowApache-2.0
Claude-2.11118113222.778.1837704AnthropicProprietary
GPT-3.5-Turbo-06131117113524.828.3938965OpenAIProprietary
Mixtral-8x7B-Instruct-v0.11114111423.48.370.676148MistralApache 2.0
Claude-Instant-1111111097.8573.420627AnthropicProprietary
Yi-34B-Chat1111110623.1573.51593001 AIYi License
Gemini Pro1111109217.871.86560GoogleProprietary
Qwen1.5-14B-Chat110911267.9167.618680AlibabaQianwen LICENSE
GPT-3.5-Turbo-03141107111518.057.94705647OpenAIProprietary
GPT-3.5-Turbo-01251106112423.3468902OpenAIProprietary
WizardLM-70B-v1.0110610717.7163.78382MicrosoftLlama 2
DBRX-Instruct-Preview1103111824.6373.733735DatabricksDBRX LICENSE
Llama-3.2-3B-Instruct110310808426MetaLlama 3.2
Phi-3-Small-8k-Instruct1102110729.7775.718502MicrosoftMIT
Tulu-2-DPO-70B1099109314.997.896662AllenAI/UWAI2 ImpACT Low-risk
Granite-3.0-8B-Instruct109410977066IBMApache 2.0
Llama-2-70B-chat1093107211.556.866339645MetaLlama 2
OpenChat-3.5-0106109111027.865.812986OpenChatApache-2.0
Vicuna-33B109110678.637.1259.222954LMSYSNon-commercial
Snowflake Arctic Instruct1090107717.6167.334183SnowflakeApache 2.0
Starling-LM-7B-alpha1089108012.88.0963.910414UC BerkeleyCC-BY-NC-4.0
Gemma-1.1-7B-it1084108412.0964.325079GoogleGemma license
Nous-Hermes-2-Mixtral-8x7B-DPO108410793835NousResearchApache-2.0
NV-Llama2-70B-SteerLM-Chat108110237.5468.53637NvidiaLlama 2
pplx-70B-online107810286891Perplexity AIProprietary
DeepSeek-LLM-67B-Chat1077107971.34986DeepSeek AIDeepSeek License
OpenChat-3.5107710547.8164.38110OpenChatApache-2.0
Granite-3.0-2B-Instruct107410887238IBMApache 2.0
OpenHermes-2.5-Mistral-7B107410585089NousResearchApache-2.0
Mistral-7B-Instruct-v0.21072107412.577.620067MistralApache-2.0
Phi-3-Mini-4K-Instruct-June-241071108270.912856MicrosoftMIT
Qwen1.5-7B-Chat107010897.6614869AlibabaQianwen LICENSE
GPT-3.5-Turbo-11061068109518.878.3217031OpenAIProprietary
Phi-3-Mini-4k-Instruct1066108668.821099MicrosoftMIT
Llama-2-13b-chat106310516.6553.619739MetaLlama 2
Dolphin-2.2.1-Mistral-7B106310261713Cognitive ComputationsApache-2.0
SOLAR-10.7B-Instruct-v1.0106210477.5866.24288Upstage AICC-BY-NC-4.0
WizardLM-13b-v1.2105910267.252.77182MicrosoftLlama 2
Llama-3.2-1B-Instruct105410478547MetaLlama 3.2
Zephyr-7B-beta105310307.3461.411332HuggingFaceMIT
MPT-30B-chat104510316.3950.42649MosaicMLCC-BY-NC-SA-4.0
pplx-7B-online104410156336Perplexity AIProprietary
CodeLlama-34B-instruct1043104253.77512MetaLlama 2
CodeLlama-70B-instruct104210481194MetaLlama 2
Zephyr-7B-alpha104210346.881814HuggingFaceMIT
Vicuna-13B104210326.5755.819794LMSYSLlama 2
Gemma-7B-it103710477.4764.39181GoogleGemma license
Phi-3-Mini-128k-Instruct1037102915.4368.121634MicrosoftMIT
Llama-2-7B-chat103710026.2745.814549MetaLlama 2
Qwen-14B-Chat103510566.9666.55070AlibabaQianwen LICENSE
falcon-180b-chat10341018681327TIIFalcon-180B TII License
Guanaco-33B10329666.5357.63000UWNon-commercial
Gemma-1.1-2b-it102110363.3764.311354GoogleGemma license
StripedHyena-Nous-7B10189995270Together AIApache 2.0
OLMo-7B-instruct101610176505Allen AIApache-2.0
Mistral-7B-Instruct-v0.1100810086.8455.49144MistralApache 2.0
Vicuna-7B10059816.1749.87014LMSYSLlama 2
PaLM-Chat-Bison-00110049906.48744GoogleProprietary
Gemma-2B-it989100042.34924GoogleGemma license
Qwen1.5-4B-Chat98899056.17814AlibabaQianwen LICENSE
Koala-13B9649375.3544.77036UC BerkeleyNon-commercial
ChatGLM3-6B9559534765TsinghuaApache-2.0
GPT4All-13B-Snoozy9329105.41431787Nomic AINon-commercial
MPT-7B-Chat9289005.42324013MosaicMLCC-BY-NC-SA-4.0
ChatGLM2-6B9248924.9645.52710TsinghuaApache-2.0
RWKV-4-Raven-14B9228973.9825.64934RWKVApache 2.0
Alpaca-13B9027894.5348.15875StanfordNon-commercial
OpenAssistant-Pythia-12B8938734.32276379OpenAssistantApache 2.0
ChatGLM-6B8798844.536.14992TsinghuaNon-commercial
FastChat-T5-3B8687593.0447.74303LMSYSApache 2.0
StableLM-Tuned-Alpha-7B8408582.7524.43340Stability AICC-BY-NC-SA-4.0
Dolly-V2-12B8227463.2825.73485DatabricksMIT
LLaMA-13B8006692.61472445MetaNon-commercial

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. The latest and detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

Since the launch of the Arena, we have used the Elo rating system to rank models. It has been useful for transforming pairwise human preferences into Elo ratings that serve as a predictor of the win rate between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

\[ E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} \]
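As a rough worked example, take two Arena Elo values from the leaderboard above, R_A = 1372 (Gemini-Exp-1206) and R_B = 1256 (Llama-3.3-70B-Instruct). Assuming the formula can be applied directly to the published scores, the predicted win probability for the higher-rated model is about two in three:

```latex
\[
E_A = \frac{1}{1 + 10^{(1256 - 1372)/400}}
    = \frac{1}{1 + 10^{-0.29}}
    \approx \frac{1}{1 + 0.513}
    \approx 0.66
\]
```

A gap of roughly 116 points therefore corresponds to an expected win rate of about 66%, ignoring ties and the uncertainty in each rating.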

The Elo rating system has been used by the international chess community to rank players for over 60 years. Standard Elo rating systems assume that a player’s performance changes over time, so an online algorithm is needed to capture such dynamics, meaning recent games should weigh more than older games. Specifically, after each game, a player’s rating is updated according to the difference between the predicted outcome E_A and the actual outcome S_A (1 for a win, 0.5 for a tie, 0 for a loss):

\[ R_A' = R_A + K \cdot (S_A - E_A) \]

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows for players’ performance to change dynamically – it does not assume a fixed unknown value for a player’s rating.

This ability to adapt is determined by the parameter K, which controls the magnitude of the rating change after each game. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance “converges”, a smaller value of K is more appropriate. As a result, USCF adopted a K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.
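The effect of K can be seen in a minimal sketch of the online update. The function names and the K values (4 and 32) are illustrative assumptions. They are not taken from the Arena code or from the USCF rules.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, s_a: float, k: float):
    """Return the updated (r_a, r_b) after one game.

    s_a is the actual outcome for A: 1 for a win, 0.5 for a tie, 0 for a loss.
    B's change mirrors A's because S_B = 1 - S_A and E_B = 1 - E_A.
    """
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Two equally rated players; A wins. Illustrative K values only.
for k in (4, 32):
    new_a, new_b = elo_update(1000, 1000, s_a=1.0, k=k)
    print(k, new_a, new_b)
# prints:
# 4 1002.0 998.0
# 32 1016.0 984.0
```

With the same single result, the larger K moves both ratings eight times as far. This is why a large K tracks recent games closely but makes ratings noisier.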

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K so that ratings were sufficiently stable while still allowing new models to move up the leaderboard quickly. We ultimately adopted a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly. This is also observed in a recent work by Cohere. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
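The permutation idea can be sketched as follows. This is not the notebook's code. The battle format `(model_a, model_b, s_a)`, the value K = 4, the initial rating of 1000, and the choice of the median to summarize the samples are all assumptions made for illustration.

```python
import random
from statistics import median


def online_elo(battles, k=4, init=1000):
    """Run the classic online Elo update over battles in the given order."""
    ratings = {}
    for a, b, s_a in battles:  # s_a: 1 if a wins, 0.5 for a tie, 0 if b wins
        r_a = ratings.setdefault(a, init)
        r_b = ratings.setdefault(b, init)
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
        ratings[a] = r_a + k * (s_a - e_a)
        ratings[b] = r_b - k * (s_a - e_a)
    return ratings


def permuted_elo(battles, rounds=1000, seed=0):
    """Shuffle the battle order `rounds` times and summarize each model's
    online Elo samples by their median (the summary is an assumption)."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        shuffled = battles[:]
        rng.shuffle(shuffled)
        for model, rating in online_elo(shuffled).items():
            samples.setdefault(model, []).append(rating)
    return {model: median(values) for model, values in samples.items()}


# Hypothetical toy data: "strong" wins 30 of 40 battles against "weak".
toy = [("strong", "weak", 1.0)] * 30 + [("strong", "weak", 0.0)] * 10
print(permuted_elo(toy))
```

Each permutation gives a slightly different final rating because the online update depends on game order. Aggregating over many orders removes most of that order dependence.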

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models and so we don’t need a decentralized algorithm. Second, most models are static (we have access to the weights) and so we don’t expect their performance to change. However, it is worth noting that the hosted proprietary models may not be static and their behavior can change without notice. We try our best to pin specific model API versions if possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley–Terry (BT) model. The BT model is effectively the maximum likelihood estimate (MLE) of the underlying Elo model, assuming a fixed but unknown pairwise win rate. Like the Elo rating system, the BT model uses pairwise comparisons to derive player ratings, which in turn estimate the win rate between any two players. The core difference between the BT model and the online Elo system is that the BT model assumes a player’s performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
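A minimal sketch of a BT fit follows. It uses the standard minorization-maximization (Zermelo) iteration in pure Python, which is not necessarily the procedure used for the leaderboard. The function name, the `(winner, loser)` input format, and the mapping to an Elo-like scale (400 · log10 of the strength plus 1000) are illustrative assumptions. Ties are ignored, and every model is assumed to have at least one win.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iters=200, scale=400, base=1000):
    """Fit BT strengths from (winner, loser) pairs; return Elo-scale ratings."""
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # number of games per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            new[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so fix the
        # geometric mean at 1 to keep the scale stable across iterations.
        log_mean = sum(math.log(p) for p in new.values()) / len(new)
        strength = {m: p / math.exp(log_mean) for m, p in new.items()}

    # P(i beats j) = p_i / (p_i + p_j) matches the Elo win formula when
    # R = base + scale * log10(p).
    return {m: base + scale * math.log10(p) for m, p in strength.items()}


# Hypothetical toy data: A beats B three times, B beats A once.
toy = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "B")]
ratings = fit_bradley_terry(toy)
print(round(ratings["A"] - ratings["B"], 1))  # about 190.8, i.e. 400 * log10(3)
```

Reordering `toy` leaves the fitted ratings unchanged, which is the property the paragraph above describes: the result depends only on the win counts, not on the order of the games.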

MT-Bench Effectively Distinguishes Among Chatbots

We observe a clear distinction among chatbots of varying abilities, with scores showing a high correlation with the Chatbot Arena Elo rating. In particular, MT-Bench reveals noticeable performance gaps between GPT-4 and GPT-3.5, and between open and proprietary models.

To delve deeper into the distinguishing factors among chatbots, we select a few representative chatbots and break down their performance per category. GPT-4 shows superior performance in Coding and Reasoning compared to GPT-3.5.

Figure 5: The comparison of 6 representative LLMs regarding their abilities in 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities