Chatbot Arena

Attribution LMSYS May 1, 2024

This leaderboard is based on the following three benchmarks.

  • Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 900K+ user votes to compute Elo ratings.
  • MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade model responses.
  • MMLU (5-shot) - a test to measure a model’s multitask accuracy on 57 tasks.


Best for Model Size Class

| Model | Size | Arena Elo | MMLU | License |
|---|---|---|---|---|
| OpenAI GPT-4-Turbo-2024-04-09 | > 72B | 1259 | 86.7 | Proprietary |
| Meta Llama-3-70b-Instruct | 32.1B - 72B | 1210 | 82 | Llama 3 Community |
| Anthropic Claude 3 Haiku | 8.1B - 32B | 1181 | 75.2 | Proprietary |
| Meta Llama-3-8b-Instruct | 4.1B - 8B | 1153 | 68.4 | Llama 3 Community |
| Microsoft Phi-3-Mini-128k-Instruct | ≤ 4B | 1050 | 68.1 | MIT |

Full Leaderboard
| Model | Arena Elo | MT-bench | MMLU | Votes | Organization | License |
|---|---|---|---|---|---|---|
| GPT-4-Turbo-2024-04-09 | 1259 | | | 35931 | OpenAI | Proprietary |
| GPT-4-1106-preview | 1253 | 9.32 | | 73547 | OpenAI | Proprietary |
| Claude 3 Opus | 1251 | | 86.8 | 80997 | Anthropic | Proprietary |
| Gemini 1.5 Pro API-0409-Preview | 1250 | | 81.9 | 39482 | Google | Proprietary |
| GPT-4-0125-preview | 1247 | | | 67354 | OpenAI | Proprietary |
| Llama-3-70b-Instruct | 1210 | | 82 | 53404 | Meta | Llama 3 Community |
| Bard (Gemini Pro) | 1209 | | | 12387 | Google | Proprietary |
| Claude 3 Sonnet | 1201 | | 79 | 78956 | Anthropic | Proprietary |
| Command R+ | 1191 | | | 44988 | Cohere | CC-BY-NC-4.0 |
| GPT-4-0314 | 1190 | 8.96 | 86.4 | 52079 | OpenAI | Proprietary |
| Claude 3 Haiku | 1181 | | 75.2 | 69660 | Anthropic | Proprietary |
| GPT-4-0613 | 1165 | 9.18 | | 70726 | OpenAI | Proprietary |
| Mistral-Large-2402 | 1157 | | 81.2 | 47755 | Mistral | Proprietary |
| Llama-3-8b-Instruct | 1153 | | 68.4 | 52702 | Meta | Llama 3 Community |
| Reka-Flash-21B-online | 1153 | | | 12307 | Reka AI | Proprietary |
| Qwen1.5-72B-Chat | 1152 | 8.61 | 77.5 | 34812 | Alibaba | Qianwen LICENSE |
| Claude-1 | 1151 | 7.9 | 77 | 21768 | Anthropic | Proprietary |
| Mistral Medium | 1148 | 8.61 | 75.3 | 33625 | Mistral | Proprietary |
| Command R | 1148 | | | 39140 | Cohere | CC-BY-NC-4.0 |
| Reka-Flash-21B | 1147 | | 73.5 | 16999 | Reka AI | Proprietary |
| Mixtral-8x22b-Instruct-v0.1 | 1146 | | 77.8 | 25341 | Mistral | Apache 2.0 |
| Gemini Pro (Dev API) | 1136 | | 71.8 | 20086 | Google | Proprietary |
| Qwen1.5-32B-Chat | 1134 | 8.3 | 73.4 | 17407 | Alibaba | Qianwen LICENSE |
| Claude-2.0 | 1132 | 8.06 | 78.5 | 13413 | Anthropic | Proprietary |
| Zephyr-ORPO-141b-A35b-v0.1 | 1129 | | | 5207 | HuggingFace | Apache 2.0 |
| Mistral-Next | 1127 | | | 13059 | Mistral | Proprietary |
| GPT-3.5-Turbo-0613 | 1120 | 8.39 | | 40845 | OpenAI | Proprietary |
| Claude-2.1 | 1119 | 8.18 | | 39731 | Anthropic | Proprietary |
| Starling-LM-7B-beta | 1119 | 8.12 | | 18020 | Nexusflow | Apache-2.0 |
| Qwen1.5-14B-Chat | 1118 | 7.91 | 67.6 | 20605 | Alibaba | Qianwen LICENSE |
| Gemini Pro | 1115 | | 71.8 | 6818 | Google | Proprietary |
| Mixtral-8x7b-Instruct-v0.1 | 1114 | 8.3 | 70.6 | 64921 | Mistral | Apache 2.0 |
| Yi-34B-Chat | 1111 | | 73.5 | 13177 | 01 AI | Yi License |
| Claude-Instant-1 | 1110 | 7.85 | 73.4 | 21559 | Anthropic | Proprietary |
| GPT-3.5-Turbo-0314 | 1109 | 7.94 | 70 | 5878 | OpenAI | Proprietary |
| WizardLM-70B-v1.0 | 1109 | 7.71 | 63.7 | 8867 | Microsoft | Llama 2 Community |
| GPT-3.5-Turbo-0125 | 1107 | | | 47220 | OpenAI | Proprietary |
| Tulu-2-DPO-70B | 1103 | 7.89 | | 6935 | AllenAI/UW | AI2 ImpACT Low-risk |
| DBRX-Instruct-Preview | 1103 | | 73.7 | 31953 | Databricks | DBRX LICENSE |
| OpenChat-3.5-0106 | 1099 | 7.8 | 65.8 | 14159 | OpenChat | Apache-2.0 |
| Snowflake Arctic Instruct | 1098 | | 67.3 | 28605 | Snowflake | Apache 2.0 |
| Vicuna-33B | 1094 | 7.12 | 59.2 | 24375 | LMSYS | Non-commercial |
| Starling-LM-7B-alpha | 1092 | 8.09 | 63.9 | 10962 | UC Berkeley | CC-BY-NC-4.0 |
| Llama-2-70b-chat | 1089 | 6.86 | 63 | 39678 | Meta | Llama 2 Community |
| Nous-Hermes-2-Mixtral-8x7B-DPO | 1087 | | | 3980 | NousResearch | Apache-2.0 |
| Gemma-1.1-7B-it | 1083 | | 64.3 | 9351 | Google | Gemma license |
| NV-Llama2-70B-SteerLM-Chat | 1082 | 7.54 | 68.5 | 3759 | Nvidia | Llama 2 Community |
| DeepSeek-LLM-67B-Chat | 1080 | | 71.3 | 5177 | DeepSeek AI | DeepSeek License |
| OpenChat-3.5 | 1079 | 7.81 | 64.3 | 8440 | OpenChat | Apache-2.0 |
| OpenHermes-2.5-Mistral-7b | 1079 | | | 5275 | NousResearch | Apache-2.0 |
| Qwen1.5-7B-Chat | 1077 | 7.6 | 61 | 5025 | Alibaba | Qianwen LICENSE |
| pplx-70b-online | 1075 | | | 7234 | Perplexity AI | Proprietary |
| Mistral-7B-Instruct-v0.2 | 1074 | 7.6 | | 20413 | Mistral | Apache-2.0 |
| GPT-3.5-Turbo-1106 | 1073 | 8.32 | | 17819 | OpenAI | Proprietary |
| Dolphin-2.2.1-Mistral-7B | 1066 | | | 1775 | Cognitive Computations | Apache-2.0 |
| SOLAR-10.7B-Instruct-v1.0 | 1065 | 7.58 | 66.2 | 4472 | Upstage AI | CC-BY-NC-4.0 |
| WizardLM-13b-v1.2 | 1062 | 7.2 | 52.7 | 7611 | Microsoft | Llama 2 Community |
| Llama-2-13b-chat | 1057 | 6.65 | 53.6 | 18530 | Meta | Llama 2 Community |
| Zephyr-7b-beta | 1054 | 7.34 | 61.4 | 11924 | HuggingFace | MIT |
| Phi-3-Mini-128k-Instruct | 1050 | | 68.1 | 19804 | Microsoft | MIT |
| Vicuna-13B | 1048 | 6.57 | 55.8 | 20695 | LMSYS | Llama 2 Community |
| MPT-30B-chat | 1048 | 6.39 | 50.4 | 2791 | MosaicML | CC-BY-NC-SA-4.0 |
| CodeLlama-70B-instruct | 1048 | | | 1321 | Meta | Llama 2 Community |
| CodeLlama-34B-instruct | 1047 | | 53.7 | 8036 | Meta | Llama 2 Community |
| Gemma-7B-it | 1044 | | 64.3 | 9859 | Google | Gemma license |
| Zephyr-7b-alpha | 1043 | 6.88 | | 1901 | HuggingFace | MIT |
| pplx-7b-online | 1043 | | | 6603 | Perplexity AI | Proprietary |
| Llama-2-7b-chat | 1041 | 6.27 | 45.8 | 15427 | Meta | Llama 2 Community |
| Qwen-14B-Chat | 1039 | 6.96 | 66.5 | 5295 | Alibaba | Qianwen LICENSE |
| falcon-180b-chat | 1037 | | 68 | 1393 | TII | Falcon-180B TII License |
| Guanaco-33B | 1034 | 6.53 | 57.6 | 3181 | UW | Non-commercial |
| StripedHyena-Nous-7B | 1023 | | | 5505 | Together AI | Apache 2.0 |
| OLMo-7B-instruct | 1021 | | | 7023 | Allen AI | Apache-2.0 |
| Gemma-1.1-2B-it | 1015 | | 64.3 | 4693 | Google | Gemma license |
| Mistral-7B-Instruct-v0.1 | 1012 | 6.84 | 55.4 | 9618 | Mistral | Apache 2.0 |
| PaLM-Chat-Bison-001 | 1010 | 6.4 | | 9222 | Google | Proprietary |
| Vicuna-7B | 1009 | 6.17 | 49.8 | 7373 | LMSYS | Llama 2 Community |
| Qwen1.5-4B-Chat | 1003 | | 56.1 | 8491 | Alibaba | Qianwen LICENSE |
| Gemma-2B-it | 1000 | | 42.3 | 5323 | Google | Gemma license |
| Koala-13B | 969 | 5.35 | 44.7 | 7300 | UC Berkeley | Non-commercial |
| ChatGLM3-6B | 961 | | | 4944 | Tsinghua | Apache-2.0 |
| GPT4All-13B-Snoozy | 939 | 5.41 | 43 | 1907 | Nomic AI | Non-commercial |
| MPT-7B-Chat | 933 | 5.42 | 32 | 4190 | MosaicML | CC-BY-NC-SA-4.0 |
| ChatGLM2-6B | 933 | 4.96 | 45.5 | 2880 | Tsinghua | Apache-2.0 |
| RWKV-4-Raven-14B | 928 | 3.98 | 25.6 | 5129 | RWKV | Apache 2.0 |
| Alpaca-13B | 908 | 4.53 | 48.1 | 6111 | Stanford | Non-commercial |
| OpenAssistant-Pythia-12B | 900 | 4.32 | 27 | 6623 | OpenAssistant | Apache 2.0 |
| ChatGLM-6B | 887 | 4.5 | 36.1 | 5195 | Tsinghua | Non-commercial |
| FastChat-T5-3B | 877 | 3.04 | 47.7 | 4521 | LMSYS | Apache 2.0 |
| StableLM-Tuned-Alpha-7B | 849 | 2.75 | 24.4 | 3461 | Stability AI | CC-BY-NC-SA-4.0 |
| Dolly-V2-12B | 827 | 3.28 | 25.7 | 3666 | Databricks | MIT |
| LLaMA-13B | 804 | 2.61 | 47 | 2538 | Meta | Non-commercial |
| WizardLM-30B | | 7.01 | 58.7 | | Microsoft | Non-commercial |
| Vicuna-13B-16k | | 6.92 | 54.5 | | LMSYS | Llama 2 Community |
| WizardLM-13B-v1.1 | | 6.76 | 50 | | Microsoft | Non-commercial |
| Tulu-30B | | 6.43 | 58.1 | | AllenAI/UW | Non-commercial |
| Guanaco-65B | | 6.41 | 62.1 | | UW | Non-commercial |
| OpenAssistant-LLaMA-30B | | 6.41 | 56 | | OpenAssistant | Non-commercial |
| WizardLM-13B-v1.0 | | 6.35 | 52.3 | | Microsoft | Non-commercial |
| Vicuna-7B-16k | | 6.22 | 48.5 | | LMSYS | Llama 2 Community |
| Baize-v2-13B | | 5.75 | 48.9 | | UCSD | Non-commercial |
| XGen-7B-8K-Inst | | 5.55 | 42.1 | | Salesforce | Non-commercial |
| Nous-Hermes-13B | | 5.51 | 49.3 | | NousResearch | Non-commercial |
| MPT-30B-Instruct | | 5.22 | 47.8 | | MosaicML | CC-BY-SA 3.0 |
| Falcon-40B-Instruct | | 5.17 | 54.7 | | TII | Apache 2.0 |

If you want to see more models, please help us add them.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. The latest and detailed leaderboard is here.

More Statistics for Chatbot Arena

🔗 Arena Statistics

Transition from online Elo rating system to Bradley-Terry model

We have used the Elo rating system to rank models since the launch of the Arena. It has been useful for transforming pairwise human preferences into Elo ratings that predict the win rate between models. Specifically, if player A has a rating of R_A and player B a rating of R_B, the probability of player A winning is

$$E_{\mathsf{A}} = \frac{1}{1 + 10^{(R_{\mathsf{B}} - R_{\mathsf{A}})/400}}.$$
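As a quick illustration of the formula above, here is a minimal Python sketch. The function name `expected_score` is our own choice for this example, and the two ratings are the Arena Elo values listed for GPT-4-Turbo-2024-04-09 and Llama-3-70b-Instruct in the leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# A 49-point gap (1259 vs. 1210) gives a predicted win rate of roughly 57%.
p = expected_score(1259, 1210)
print(f"P(A wins) = {p:.3f}")

# Equal ratings always give 0.5, and the two probabilities sum to 1.
print(expected_score(1000, 1000))
```

Note how flat the curve is near zero: even a 49-point gap leaves the lower-rated model winning more than 4 out of every 10 battles.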

The Elo rating system has been used by the international community to rank chess players for over 60 years. Standard Elo rating systems assume that a player’s performance changes over time, so an online algorithm is needed to capture such dynamics, meaning recent games should weigh more than older games. Specifically, after each game, a player’s rating is updated according to the difference between the predicted outcome E_A and the actual outcome S_A (1 for a win, 0.5 for a tie, 0 for a loss):

$$R_{\mathsf{A}}' = R_{\mathsf{A}} + K \cdot (S_{\mathsf{A}} - E_{\mathsf{A}}).$$

This algorithm has two distinct features:

  1. It can be computed asynchronously by players around the world.
  2. It allows players’ performance to change dynamically; it does not assume a fixed unknown value for a player’s rating.

This ability to adapt is determined by the parameter K, which controls the magnitude of each rating change. A larger K essentially puts more weight on recent games, which may make sense for new players whose performance improves quickly. However, as players become more senior and their performance “converges”, a smaller value of K is more appropriate. As a result, USCF adopted a K that depends on the number of games and tournaments completed by the player (reference). That is, the Elo rating of a senior player changes more slowly than that of a new player.
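To make the role of K concrete, the sketch below applies a single online update for two different values of K. The function names and the specific K values (32 and 4) are illustrative assumptions, not the settings used by the Arena or by USCF.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float):
    """Return both updated ratings after one game.

    score_a is the actual outcome for A: 1 for a win, 0.5 for a tie, 0 for a loss.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two equally rated players; A wins. Illustrative K values only.
print(elo_update(1000, 1000, 1.0, k=32))  # (1016.0, 984.0)
print(elo_update(1000, 1000, 1.0, k=4))   # (1002.0, 998.0)
```

With K = 32 one game moves each rating by 16 points; with K = 4 the same game moves them by only 2, which is why a small K gives stable but slow-moving ratings.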

When we launched the Arena, we noticed considerable variability in the ratings produced by the classic online algorithm. We tried to tune K so that ratings were sufficiently stable while still allowing new models to move up the leaderboard quickly. We ultimately decided to adopt a bootstrap-like technique: shuffle the data and sample Elo scores from 1000 permutations of the online plays. You can find the details in this notebook. This provided consistent, stable scores and allowed us to incorporate new models quickly. A similar effect was also observed in a recent work by Cohere. However, we used the same samples to estimate confidence intervals, which were therefore too wide (effectively CIs for the original online Elo estimates).
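The shuffle-and-replay idea can be sketched as follows. This is a simplified illustration rather than the code in the notebook: the function names, the use of the median to aggregate the samples, K = 4, the initial rating of 1000, and the toy battle data are all assumptions made for this example.

```python
import random
from statistics import median


def online_elo(battles, k=4.0, init=1000.0):
    """Replay (model_a, model_b, score_a) battles in order with the online Elo update."""
    ratings = {}
    for a, b, score_a in battles:
        ra = ratings.setdefault(a, init)
        rb = ratings.setdefault(b, init)
        ea = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings


def shuffled_elo(battles, rounds=1000, seed=0):
    """Median online-Elo rating per model over random permutations of the battles."""
    rng = random.Random(seed)
    battles = list(battles)
    samples = {}
    for _ in range(rounds):
        rng.shuffle(battles)  # each round replays the same games in a new order
        for model, rating in online_elo(battles).items():
            samples.setdefault(model, []).append(rating)
    return {model: median(values) for model, values in samples.items()}


# Toy data: "strong" beats "weak" in 70 of 100 battles.
toy = [("strong", "weak", 1.0)] * 70 + [("strong", "weak", 0.0)] * 30
print(shuffled_elo(toy, rounds=200))
```

A single pass of `online_elo` depends on the order in which the battles arrive; aggregating over many permutations removes that dependence, which is where the added stability comes from.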

In the context of LLM ranking, there are two important differences from the classic Elo chess ranking system. First, we have access to the entire history of all games for all models and so we don’t need a decentralized algorithm. Second, most models are static (we have access to the weights) and so we don’t expect their performance to change. However, it is worth noting that the hosted proprietary models may not be static and their behavior can change without notice. We try our best to pin specific model API versions if possible.

To improve the quality of our rankings and their confidence estimates, we are adopting another widely used rating system called the Bradley–Terry (BT) model. The BT model is in fact the maximum likelihood estimate (MLE) of the underlying Elo model, assuming a fixed but unknown pairwise win rate. Like the Elo rating system, the BT model uses pairwise comparisons to derive player ratings, which in turn estimate the win rate between any two players. The core difference between the BT model and the online Elo system is that the BT model assumes a player’s performance does not change (i.e., game order does not matter) and that the computation takes place in a centralized fashion.
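For intuition, the BT maximum likelihood fit can be sketched with the classic iterative (minorization-maximization) update. This is an illustrative stand-in, not the Arena's implementation; the function names, the 1000-point anchor, and the assumptions that there are no ties and that every model has at least one win are ours.

```python
import math


def fit_bradley_terry(battles, iters=100):
    """Fit BT strengths from (winner, loser) pairs with the MM update.

    Assumes no ties, winner != loser, and at least one win per model.
    """
    wins, games = {}, {}
    for winner, loser in battles:
        wins[winner] = wins.get(winner, 0) + 1
        wins.setdefault(loser, 0)
        pair = frozenset((winner, loser))
        games[pair] = games.get(pair, 0) + 1

    strength = {m: 1.0 for m in wins}
    for _ in range(iters):
        updated = {}
        for i in strength:
            # Sum n_ij / (p_i + p_j) over every opponent j that i has faced.
            denom = sum(
                n / (strength[i] + strength[j])
                for pair, n in games.items() if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1000.0):
    """Map strengths to the Elo scale; the mean rating is pinned to `anchor`."""
    raw = {m: 400.0 * math.log10(s) for m, s in strength.items()}
    shift = anchor - sum(raw.values()) / len(raw)
    return {m: r + shift for m, r in raw.items()}


# Toy data: A beats B three times out of four.
battles = [("A", "B")] * 3 + [("B", "A")]
ratings = to_elo_scale(fit_bradley_terry(battles))
print(ratings["A"] - ratings["B"])  # about 190.8, i.e. 400 * log10(3)

# Game order does not matter: the reversed history gives the same fit.
reversed_ratings = to_elo_scale(fit_bradley_terry(battles[::-1]))
print(abs(ratings["A"] - reversed_ratings["A"]) < 1e-9)
```

Because a strength ratio p_A / p_B maps to a rating gap of 400·log10(p_A / p_B), the fitted ratings plug directly into the Elo win-probability formula: a 3:1 record corresponds to a gap of about 191 points and a predicted win rate of 75%.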

MT-Bench Effectively Distinguishes Among Chatbots

We observe a clear distinction among chatbots of varying abilities, with scores showing a high correlation with the Chatbot Arena Elo rating. In particular, MT-Bench reveals noticeable performance gaps between GPT-4 and GPT-3.5, and between open and proprietary models.
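As a rough check of this relationship, the sketch below computes the Pearson correlation between Arena Elo and MT-bench score for a handful of rows copied from the Full Leaderboard. The choice of these seven rows is ours, and this is not the analysis behind the statement above; it only illustrates how such a correlation could be computed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# (model, Arena Elo, MT-bench) rows taken from the Full Leaderboard.
rows = [
    ("GPT-4-1106-preview", 1253, 9.32),
    ("GPT-4-0314", 1190, 8.96),
    ("GPT-3.5-Turbo-0613", 1120, 8.39),
    ("Vicuna-33B", 1094, 7.12),
    ("Llama-2-13b-chat", 1057, 6.65),
    ("Koala-13B", 969, 5.35),
    ("LLaMA-13B", 804, 2.61),
]
elo = [r[1] for r in rows]
mt_bench = [r[2] for r in rows]
r = pearson(elo, mt_bench)
print(f"Pearson r = {r:.3f}")
```

On this small sample the coefficient is well above 0.9, in line with the high correlation described above, though seven hand-picked rows are far too few to draw conclusions from.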

To delve deeper into the distinguishing factors among chatbots, we select a few representative chatbots and break down their performance per category. GPT-4 shows superior performance in Coding and Reasoning compared to GPT-3.5.

Figure 5: The comparison of 6 representative LLMs regarding their abilities in 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities