Chatbot Arena

Attribution LMSYS July 22, 2024

This leaderboard is based on the following three benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 1.5M+ user votes to compute Elo ratings.
  • MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade model responses.
  • MMLU (5-shot) - a test to measure a model’s multitask accuracy on 57 tasks.


Best for Model Size Class

| Model | Size | Arena Elo | MMLU | Context Window | License |
|---|---|---|---|---|---|
| OpenAI GPT-4o-2024-05-13 | 72B + | 1287 | 88.7 | 128K | Proprietary |
| Meta Llama-3.1-405b-Instruct | 72B + | | 88.6 | 128K | Llama 3 Community |
| Meta Llama-3.1-70b-Instruct | 40B - 72B | 1207 | 86 | 128K | Llama 3 Community |
| Google Gemma-2-27B-it | 24B - 40B | 1217 | 75.2 | 8K | Gemma license |
| Meta Llama-3.1-8b-Instruct | 4B - 24B | 1152 | 73 | 128K | Llama 3 Community |
| Microsoft Phi-3-Mini-4k-Instruct | 1B - 4B | 1066 | 68.8 | 4K | MIT |

Full Leaderboard
| Model | 🏆 Arena Elo | Coding Elo | MT-bench | MMLU | Votes | Organization | License |
|---|---|---|---|---|---|---|---|
| GPT-4o-2024-05-13 | 1287 | 1297 | | 88.7 | 64700 | OpenAI | Proprietary |
| GPT-4o-mini-2024-07-18 | 1280 | 1276 | | 82 | 4449 | OpenAI | Proprietary |
| Claude 3.5 Sonnet | 1272 | 1301 | | 88.7 | 34591 | Anthropic | Proprietary |
| Gemini-Advanced-0514 | 1267 | 1256 | | | 48001 | Google | Proprietary |
| Gemini-1.5-Pro-API-0514 | 1261 | 1264 | | 85.9 | 57448 | Google | Proprietary |
| GPT-4-Turbo-2024-04-09 | 1257 | 1264 | | | 76597 | OpenAI | Proprietary |
| Gemini-1.5-Pro-API-0409 | 1257 | 1232 | | 81.9 | 55681 | Google | Proprietary |
| GPT-4-1106-preview | 1251 | 1255 | 9.32 | | 88475 | OpenAI | Proprietary |
| Claude 3 Opus | 1248 | 1251 | | 86.8 | 147947 | Anthropic | Proprietary |
| GPT-4-0125-preview | 1245 | 1245 | | | 81807 | OpenAI | Proprietary |
| Yi-Large-preview | 1240 | 1247 | | | 50499 | 01 AI | Proprietary |
| Gemini-1.5-Flash-API-0514 | 1228 | 1233 | | 78.9 | 48048 | Google | Proprietary |
| Deepseek-v2-API-0628 | 1222 | 1245 | | | 8089 | DeepSeek AI | Proprietary |
| Gemma-2-27B-it | 1217 | 1207 | | | 19624 | Google | Gemma license |
| Yi-Large | 1214 | 1221 | | | 15276 | 01 AI | Proprietary |
| Nemotron-4-340B-Instruct | 1209 | 1199 | | | 20685 | Nvidia | NVIDIA Open Model |
| Bard (Gemini Pro) | 1209 | 1172 | | | 11830 | Google | Proprietary |
| GLM-4-0520 | 1207 | 1217 | | | 10253 | Zhipu AI | Proprietary |
| Llama-3-70b-Instruct | 1207 | 1200 | | 82 | 154914 | Meta | Llama 3 Community |
| Claude 3 Sonnet | 1201 | 1214 | | 79 | 112955 | Anthropic | Proprietary |
| Reka-Core-20240501 | 1200 | 1191 | | 83.2 | 61757 | Reka AI | Proprietary |
| Command R+ | 1190 | 1165 | | | 79583 | Cohere | CC-BY-NC-4.0 |
| Gemma-2-9B-it | 1188 | 1167 | | | 19774 | Google | Gemma license |
| Qwen2-72B-Instruct | 1187 | 1187 | 9.12 | 84.2 | 30310 | Alibaba | Qianwen LICENSE |
| GPT-4-0314 | 1186 | 1196 | 8.96 | 86.4 | 55984 | OpenAI | Proprietary |
| GLM-4-0116 | 1183 | 1192 | | | 7589 | Zhipu AI | Proprietary |
| Qwen-Max-0428 | 1183 | 1191 | | | 25775 | Alibaba | Proprietary |
| DeepSeek-Coder-V2-Instruct | 1179 | 1241 | | | 15644 | DeepSeek AI | DeepSeek License |
| Claude 3 Haiku | 1179 | 1190 | | 75.2 | 104266 | Anthropic | Proprietary |
| Reka-Flash-Preview-20240611 | 1165 | 1156 | | | 19489 | Reka AI | Proprietary |
| Qwen1.5-110B-Chat | 1162 | 1176 | 8.88 | 80.4 | 27497 | Alibaba | Qianwen LICENSE |
| GPT-4-0613 | 1161 | 1166 | 9.18 | | 85431 | OpenAI | Proprietary |
| Mistral-Large-2402 | 1157 | 1170 | | 81.2 | 64444 | Mistral | Proprietary |
| Yi-1.5-34B-Chat | 1157 | 1163 | | 76.8 | 25171 | 01 AI | Apache-2.0 |
| Reka-Flash-21B-online | 1156 | 1148 | | | 16041 | Reka AI | Proprietary |
| Llama-3-8b-Instruct | 1152 | 1147 | | 68.4 | 102317 | Meta | Llama 3 Community |
| Claude-1 | 1149 | 1136 | 7.9 | 77 | 21179 | Anthropic | Proprietary |
| Command R | 1149 | 1124 | | | 55724 | Cohere | CC-BY-NC-4.0 |
| Mistral Medium | 1148 | 1153 | 8.61 | 75.3 | 35568 | Mistral | Proprietary |
| Qwen1.5-72B-Chat | 1147 | 1161 | 8.61 | 77.5 | 40662 | Alibaba | Qianwen LICENSE |
| Reka-Flash-21B | 1147 | 1142 | | 73.5 | 25801 | Reka AI | Proprietary |
| Mixtral-8x22b-Instruct-v0.1 | 1146 | 1152 | | 77.8 | 45772 | Mistral | Apache 2.0 |
| Claude-2.0 | 1131 | 1135 | 8.06 | 78.5 | 12783 | Anthropic | Proprietary |
| Gemini Pro (Dev API) | 1131 | 1103 | | 71.8 | 18794 | Google | Proprietary |
| Zephyr-ORPO-141b-A35b-v0.1 | 1127 | 1126 | | | 4865 | HuggingFace | Apache 2.0 |
| Qwen1.5-32B-Chat | 1125 | 1150 | 8.3 | 73.4 | 22770 | Alibaba | Qianwen LICENSE |
| Mistral-Next | 1124 | 1133 | | | 12383 | Mistral | Proprietary |
| Phi-3-Medium-4k-Instruct | 1122 | 1126 | | 78 | 18222 | Microsoft | MIT |
| Starling-LM-7B-beta | 1119 | 1130 | 8.12 | | 16663 | Nexusflow | Apache-2.0 |
| Claude-2.1 | 1118 | 1133 | 8.18 | | 37713 | Anthropic | Proprietary |
| GPT-3.5-Turbo-0613 | 1117 | 1136 | 8.39 | | 38965 | OpenAI | Proprietary |
| Mixtral-8x7b-Instruct-v0.1 | 1114 | 1114 | 8.3 | 70.6 | 73266 | Mistral | Apache 2.0 |
| Claude-Instant-1 | 1111 | 1110 | 7.85 | 73.4 | 20631 | Anthropic | Proprietary |
| Yi-34B-Chat | 1111 | 1107 | | 73.5 | 15947 | 01 AI | Yi License |
| Gemini Pro | 1111 | 1093 | | 71.8 | 6568 | Google | Proprietary |
| Qwen1.5-14B-Chat | 1109 | 1127 | 7.91 | 67.6 | 18673 | Alibaba | Qianwen LICENSE |
| GPT-3.5-Turbo-0314 | 1106 | 1115 | 7.94 | 70 | 5663 | OpenAI | Proprietary |
| WizardLM-70B-v1.0 | 1106 | 1072 | 7.71 | 63.7 | 8392 | Microsoft | Llama 2 Community |
| GPT-3.5-Turbo-0125 | 1105 | 1124 | | | 67470 | OpenAI | Proprietary |
| DBRX-Instruct-Preview | 1103 | 1119 | | 73.7 | 33738 | Databricks | DBRX LICENSE |
| Phi-3-Small-8k-Instruct | 1102 | 1108 | | 75.7 | 18517 | Microsoft | MIT |
| Tulu-2-DPO-70B | 1099 | 1094 | 7.89 | | 6669 | AllenAI/UW | AI2 ImpACT Low-risk |
| Llama-2-70b-chat | 1093 | 1073 | 6.86 | 63 | 39635 | Meta | Llama 2 Community |
| OpenChat-3.5-0106 | 1091 | 1103 | 7.8 | 65.8 | 12985 | OpenChat | Apache-2.0 |
| Vicuna-33B | 1091 | 1068 | 7.12 | 59.2 | 22954 | LMSYS | Non-commercial |
| Snowflake Arctic Instruct | 1090 | 1078 | | 67.3 | 34214 | Snowflake | Apache 2.0 |
| Starling-LM-7B-alpha | 1088 | 1081 | 8.09 | 63.9 | 10424 | UC Berkeley | CC-BY-NC-4.0 |
| Gemma-1.1-7B-it | 1084 | 1086 | | 64.3 | 25112 | Google | Gemma license |
| Nous-Hermes-2-Mixtral-8x7B-DPO | 1084 | 1080 | | | 3843 | NousResearch | Apache-2.0 |
| NV-Llama2-70B-SteerLM-Chat | 1080 | 1024 | 7.54 | 68.5 | 3636 | Nvidia | Llama 2 Community |
| pplx-70b-online | 1078 | 1029 | | | 6893 | Perplexity AI | Proprietary |
| DeepSeek-LLM-67B-Chat | 1077 | 1080 | | 71.3 | 4984 | DeepSeek AI | DeepSeek License |
| OpenChat-3.5 | 1076 | 1054 | 7.81 | 64.3 | 8121 | OpenChat | Apache-2.0 |
| OpenHermes-2.5-Mistral-7b | 1074 | 1060 | | | 5090 | NousResearch | Apache-2.0 |
| Mistral-7B-Instruct-v0.2 | 1072 | 1074 | 7.6 | | 20068 | Mistral | Apache-2.0 |
| Qwen1.5-7B-Chat | 1070 | 1089 | 7.6 | 61 | 4869 | Alibaba | Qianwen LICENSE |
| GPT-3.5-Turbo-1106 | 1068 | 1096 | 8.32 | | 17028 | OpenAI | Proprietary |
| Phi-3-Mini-4k-Instruct | 1066 | 1087 | | 68.8 | 21159 | Microsoft | MIT |
| Llama-2-13b-chat | 1063 | 1052 | 6.65 | 53.6 | 19757 | Meta | Llama 2 Community |
| Phi-3-Mini-4k-Instruct-June-24 | 1062 | 1071 | | 70.9 | 6268 | Microsoft | MIT |
| SOLAR-10.7B-Instruct-v1.0 | 1062 | 1048 | 7.58 | 66.2 | 4293 | Upstage AI | CC-BY-NC-4.0 |
| Dolphin-2.2.1-Mistral-7B | 1062 | 1026 | | | 1714 | Cognitive Computations | Apache-2.0 |
| WizardLM-13b-v1.2 | 1058 | 1027 | 7.2 | 52.7 | 7195 | Microsoft | Llama 2 Community |
| Zephyr-7b-beta | 1053 | 1031 | 7.34 | 61.4 | 11334 | HuggingFace | MIT |
| MPT-30B-chat | 1045 | 1031 | 6.39 | 50.4 | 2649 | MosaicML | CC-BY-NC-SA-4.0 |
| pplx-7b-online | 1045 | 1016 | | | 6338 | Perplexity AI | Proprietary |
| CodeLlama-70B-instruct | 1042 | 1049 | | | 1191 | Meta | Llama 2 Community |
| CodeLlama-34B-instruct | 1042 | 1043 | | 53.7 | 7514 | Meta | Llama 2 Community |
| Vicuna-13B | 1042 | 1033 | 6.57 | 55.8 | 19798 | LMSYS | Llama 2 Community |
| Zephyr-7b-alpha | 1041 | 1035 | 6.88 | | 1817 | HuggingFace | MIT |
| Gemma-7B-it | 1037 | 1048 | | 64.3 | 9177 | Google | Gemma license |
| Phi-3-Mini-128k-Instruct | 1037 | 1030 | | 68.1 | 21620 | Microsoft | MIT |
| Llama-2-7b-chat | 1037 | 1003 | 6.27 | 45.8 | 14568 | Meta | Llama 2 Community |
| Qwen-14B-Chat | 1035 | 1056 | 6.96 | 66.5 | 5071 | Alibaba | Qianwen LICENSE |
| falcon-180b-chat | 1034 | 1018 | | 68 | 1326 | TII | Falcon-180B TII License |
| Guanaco-33B | 1032 | 966 | 6.53 | 57.6 | 3003 | UW | Non-commercial |
| Gemma-1.1-2B-it | 1021 | 1037 | | 64.3 | 11375 | Google | Gemma license |
| StripedHyena-Nous-7B | 1017 | 1001 | | | 5266 | Together AI | Apache 2.0 |
| OLMo-7B-instruct | 1015 | 1017 | | | 6500 | Allen AI | Apache-2.0 |
| Mistral-7B-Instruct-v0.1 | 1008 | 1009 | 6.84 | 55.4 | 9151 | Mistral | Apache 2.0 |
| Vicuna-7B | 1005 | 983 | 6.17 | 49.8 | 7023 | LMSYS | Llama 2 Community |
| PaLM-Chat-Bison-001 | 1003 | 991 | 6.4 | | 8747 | Google | Proprietary |
| Gemma-2B-it | 989 | 1001 | | 42.3 | 4919 | Google | Gemma license |
| Qwen1.5-4B-Chat | 988 | 991 | | 56.1 | 7811 | Alibaba | Qianwen LICENSE |
| Koala-13B | 964 | 938 | 5.35 | 44.7 | 7040 | UC Berkeley | Non-commercial |
| ChatGLM3-6B | 955 | 954 | | | 4763 | Tsinghua | Apache-2.0 |
| GPT4All-13B-Snoozy | 931 | 910 | 5.41 | 43 | 1787 | Nomic AI | Non-commercial |
| MPT-7B-Chat | 927 | 900 | 5.42 | 32 | 4019 | MosaicML | CC-BY-NC-SA-4.0 |
| ChatGLM2-6B | 924 | 893 | 4.96 | 45.5 | 2708 | Tsinghua | Apache-2.0 |
| RWKV-4-Raven-14B | 921 | 897 | 3.98 | 25.6 | 4942 | RWKV | Apache 2.0 |
| Alpaca-13B | 901 | 791 | 4.53 | 48.1 | 5874 | Stanford | Non-commercial |
| OpenAssistant-Pythia-12B | 893 | 874 | 4.32 | 27 | 6385 | OpenAssistant | Apache 2.0 |
| ChatGLM-6B | 879 | 884 | 4.5 | 36.1 | 4999 | Tsinghua | Non-commercial |
| FastChat-T5-3B | 869 | 760 | 3.04 | 47.7 | 4309 | LMSYS | Apache 2.0 |
| StableLM-Tuned-Alpha-7B | 840 | 859 | 2.75 | 24.4 | 3337 | Stability AI | CC-BY-NC-SA-4.0 |
| Dolly-V2-12B | 822 | 747 | 3.28 | 25.7 | 3488 | Databricks | MIT |
| LLaMA-13B | 799 | 669 | 2.61 | 47 | 2444 | Meta | Non-commercial |

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. The latest and detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It has been useful for transforming pairwise human preferences into Elo ratings that serve as a predictor of the win rate between models. Specifically, if player A has a rating of $R_A$ and player B a rating of $R_B$, the probability of player A winning is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}.$$
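As a quick illustration of the win-probability formula, here is a minimal Python sketch. The function name `expected_score` and the `scale` parameter are our own choices for this example; the two ratings in the second call are taken from the leaderboard only to show the arithmetic.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that player A beats player B under the Elo model (E_A)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Equal ratings give an even match-up.
print(expected_score(1200, 1200))  # 0.5

# An 80-point gap (for example 1287 vs. 1207) predicts a win rate of about 61%.
print(round(expected_score(1287, 1207), 3))  # 0.613
```

Note that the two players' expected scores always sum to 1, so only the rating difference matters, not the absolute values.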

The Elo rating has been used by the international community to rank chess players for over 60 years. Standard Elo rating systems assume that a player’s performance changes over time, so an online algorithm is needed to capture such dynamics, meaning recent games should weigh more than older games. Specifically, after each game, a player’s rating is updated according to the difference between the predicted outcome and the actual outcome.

$$R_A' = R_A + K \cdot (S_A - E_A).$$

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows players’ performance to change dynamically: it does not assume a fixed unknown value for a player’s rating.

This ability to adapt is determined by the parameter K, which controls the magnitude of the rating change after each game. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance “converges”, a smaller value of K is more appropriate. As a result, USCF adopted a K based on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.
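The online update rule and the effect of K can be sketched in a few lines of Python. This is only an illustration: the function names and the K values (32 and 8) are assumptions picked for the example, not the settings used by the Arena or by USCF.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Elo win probability of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return the updated ratings (R_A', R_B') after one game.

    score_a is 1.0 if A won, 0.5 for a tie, and 0.0 if A lost.
    k is a hypothetical default chosen for this sketch.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # B's update is the mirror image of A's, so the rating total is preserved.
    return rating_a + delta, rating_b - delta


# Two equally rated players; A wins.
print(update_elo(1200, 1200, 1.0, k=32))  # (1216.0, 1184.0)
print(update_elo(1200, 1200, 1.0, k=8))   # (1204.0, 1196.0)
```

With the larger K, a single game moves the ratings four times as far, which is why a large K tracks recent results closely but also makes the ratings noisier.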

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K so that ratings were sufficiently stable while still allowing new models to move up the leaderboard quickly. We ultimately decided to adopt a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly. A similar effect is also observed in recent work by Cohere. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
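The shuffling idea can be sketched as follows. This is not the notebook's code: the battle format `(model_a, model_b, winner)`, the values `k=4` and `init=1000`, the use of the median as the summary, and the toy data are all assumptions made for illustration.

```python
import random
import statistics


def online_elo(battles, k=4.0, init=1000.0, scale=400.0):
    """Run the classic online Elo update over battles in the given order."""
    ratings = {}
    for model_a, model_b, winner in battles:
        r_a = ratings.setdefault(model_a, init)
        r_b = ratings.setdefault(model_b, init)
        e_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))
        # winner is "a" or "b"; anything else is treated as a tie.
        s_a = {"a": 1.0, "b": 0.0}.get(winner, 0.5)
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b - k * (s_a - e_a)
    return ratings


def permuted_elo(battles, rounds=1000, seed=0):
    """Median online-Elo rating per model over random orderings of the battles."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        shuffled = list(battles)
        rng.shuffle(shuffled)
        for model, rating in online_elo(shuffled).items():
            samples.setdefault(model, []).append(rating)
    return {model: statistics.median(vals) for model, vals in samples.items()}


# Synthetic toy data: "strong" wins 70 of its 100 battles against "weak".
battles = [("strong", "weak", "a")] * 70 + [("strong", "weak", "b")] * 30
scores = permuted_elo(battles, rounds=200)
print(scores)
```

Because each permutation yields a different online estimate, summarizing over many permutations removes most of the dependence on game order.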

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models and so we don’t need a decentralized algorithm. Second, most models are static (we have access to the weights) and so we don’t expect their performance to change. However, it is worth noting that the hosted proprietary models may not be static and their behavior can change without notice. We try our best to pin specific model API versions if possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley–Terry (BT) model. The BT model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model, assuming a fixed but unknown pairwise win rate. Like Elo, the BT model derives player ratings from pairwise comparisons in order to estimate the win rate between any two players. The core difference between the BT model and the online Elo system is that BT assumes a player’s performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
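To make the connection concrete, here is a minimal sketch that fits BT strengths with the classic minorization-maximization (Zermelo) iteration and maps them onto an Elo-like scale. It is one simple way to compute the MLE, not necessarily how the Arena computes it; ties are ignored, and the `anchor` of 1000, the function names, and the toy data are assumptions for this example.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Fit BT strengths p_i from (winner, loser) pairs via MM iteration.

    Assumes every model has at least one win; otherwise the MLE is not finite.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(i, j)] = number of battles between i and j
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        new = {}
        for i in models:
            denom = sum(
                games[(i, j)] / (strength[i] + strength[j])
                for j in models
                if j != i and games[(i, j)] > 0
            )
            new[i] = wins[i] / denom
        total = sum(new.values())
        # Rescale so the strengths average to 1 (they are only defined up to a factor).
        strength = {m: v * len(models) / total for m, v in new.items()}
    return strength


def to_elo_scale(strength, scale=400.0, anchor=1000.0):
    """Map strengths to ratings so that P(i beats j) matches the Elo formula."""
    return {m: anchor + scale * math.log10(s) for m, s in strength.items()}


# Toy data: A beats B three times out of four.
battles = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "A")]
ratings = to_elo_scale(fit_bradley_terry(battles))
print(round(ratings["A"] - ratings["B"], 2))  # 190.85, i.e. 400 * log10(3)

# Game order does not matter: shuffling the battles gives the same fit.
shuffled = list(battles)
random.Random(0).shuffle(shuffled)
ratings_shuffled = to_elo_scale(fit_bradley_terry(shuffled))
```

A 190.85-point gap corresponds to a 75% predicted win rate under the Elo formula, which is exactly the observed 3-out-of-4 win rate.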

MT-Bench Effectively Distinguishes Among Chatbots

We observe a clear distinction among chatbots of varying abilities, with scores showing a high correlation with the Chatbot Arena Elo rating. In particular, MT-Bench reveals noticeable performance gaps between GPT-4 and GPT-3.5, and between open and proprietary models.

To delve deeper into the distinguishing factors among chatbots, we select a few representative chatbots and break down their performance per category. GPT-4 shows superior performance in Coding and Reasoning compared to GPT-3.5.

Figure 5: The comparison of 6 representative LLMs regarding their abilities in 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities