Statistics for Chatbot Arena

We added several figures that show further statistics. The code for generating them is included in this notebook. Note that different ranking methods may order the models differently. This is expected for models that perform similarly, as demonstrated by the confidence intervals in the bootstrap figure (Figure 1). Going forward, we prefer the classical Elo calculation because of its scalability and interpretability.
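
To make the Elo calculation concrete, here is a minimal sketch of an online Elo update. It is not taken from the notebook: the record layout `(model_a, model_b, winner)`, the winner labels, and the default parameters (`k`, `scale`, `base`, `init_rating`) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def compute_elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Run a sequential Elo update over (model_a, model_b, winner) records.

    winner is "model_a", "model_b", or anything else for a tie
    (hypothetical labels, assumed for this sketch).
    """
    rating = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        ra, rb = rating[model_a], rating[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        if winner == "model_a":
            score_a = 1.0
        elif winner == "model_b":
            score_a = 0.0
        else:
            score_a = 0.5  # a tie counts as half a win for each side
        delta = k * (score_a - expected_a)
        # The update is zero-sum: what model_a gains, model_b loses.
        rating[model_a] = ra + delta
        rating[model_b] = rb - delta
    return dict(rating)
```

Each battle costs one constant-time update, and every rating sits on a single scale where a fixed gap maps to a fixed expected win probability, which is one way to read the scalability and interpretability mentioned above.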

Figure 1: Confidence Intervals on Model Strength (via Bootstrapping)
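
A figure like this can be produced by resampling the battles with replacement and recomputing the ratings each time. The sketch below assumes the same hypothetical `(model_a, model_b, winner)` records and a simple sequential Elo update; the number of rounds, the percentile bounds, and the nearest-rank percentile rule are illustrative choices, not the notebook's settings.

```python
import random


def elo(battles, k=4.0, scale=400.0, base=10.0, init=1000.0):
    """Sequential Elo update; parameters are assumed defaults."""
    rating = {}
    for a, b, winner in battles:
        ra, rb = rating.get(a, init), rating.get(b, init)
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        score_a = {"model_a": 1.0, "model_b": 0.0}.get(winner, 0.5)
        delta = k * (score_a - expected_a)
        rating[a], rating[b] = ra + delta, rb - delta
    return rating


def bootstrap_intervals(battles, rounds=200, lower=2.5, upper=97.5, seed=0):
    """Return {model: (low, high)} percentile bounds over bootstrap rounds."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        # Resample the battle log with replacement, keeping its size.
        resampled = rng.choices(battles, k=len(battles))
        for model, value in elo(resampled).items():
            samples.setdefault(model, []).append(value)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        last = len(values) - 1
        # Crude nearest-rank percentiles, clamped to the valid index range.
        low = values[min(last, int(len(values) * lower / 100))]
        high = values[min(last, int(len(values) * upper / 100))]
        intervals[model] = (low, high)
    return intervals
```

When two models' intervals overlap substantially, their relative order is not well determined by the data, so different ranking methods can reasonably disagree about it.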

Figure 2: Average Win Rate Against All Other Models (Assuming Uniform Sampling and No Ties)
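
One way to read "uniform sampling" is that every opponent carries equal weight, however many battles were actually played against it. The sketch below assumes a hypothetical nested mapping `win_fraction[a][b]` holding the fraction of non-tied a-vs-b battles won by a, and takes the unweighted mean of each row.

```python
def average_win_rate(win_fraction):
    """Average each model's pairwise win fraction over all other models.

    win_fraction[a][b] is assumed to be the fraction of non-tied
    battles between a and b that a won.
    """
    averages = {}
    for model, row in win_fraction.items():
        rates = [rate for opponent, rate in row.items() if opponent != model]
        if rates:
            # Unweighted mean: each opponent counts once, regardless of
            # how many battles were played against it.
            averages[model] = sum(rates) / len(rates)
    return averages
```

For example, a model that wins 75% of its battles against one opponent and 25% against another gets an average win rate of 50% under this weighting.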

Figure 3: Fraction of Model A Wins for All Non-tied A vs. B Battles
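
The pairwise fractions behind such a heatmap can be sketched as follows. The `(model_a, model_b, winner)` record layout and the winner labels are assumptions; a model is credited with a win whether it appeared in the first or the second position.

```python
from collections import defaultdict


def pairwise_win_fraction(battles):
    """Return {(a, b): fraction of non-tied a-vs-b battles won by a}."""
    wins = defaultdict(int)    # wins[(a, b)]: battles between a and b won by a
    totals = defaultdict(int)  # totals[(a, b)]: non-tied battles between a and b
    for model_a, model_b, winner in battles:
        if winner not in ("model_a", "model_b"):
            continue  # ties are excluded from both numerator and denominator
        if winner == "model_a":
            wins[(model_a, model_b)] += 1
        else:
            wins[(model_b, model_a)] += 1
        # Count the battle for both orderings so the table is symmetric.
        totals[(model_a, model_b)] += 1
        totals[(model_b, model_a)] += 1
    return {pair: wins[pair] / count for pair, count in totals.items()}
```

By construction the entries for `(a, b)` and `(b, a)` sum to 1, and pairs that never met (or only tied) are absent from the result.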

Figure 4: Battle Count for Each Combination of Models (without Ties)
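
Counting battles per pair is a small aggregation. This sketch again assumes hypothetical `(model_a, model_b, winner)` records, drops ties, and treats a pair as unordered so that a-vs-b and b-vs-a land in the same cell.

```python
from collections import Counter


def battle_counts(battles):
    """Count non-tied battles per unordered model pair."""
    counts = Counter()
    for model_a, model_b, winner in battles:
        if winner in ("model_a", "model_b"):
            # Sort the names so the key does not depend on display position.
            counts[tuple(sorted((model_a, model_b)))] += 1
    return counts
```

Pairs with few battles have noisier win fractions, so a count table like this is useful context when reading the pairwise results.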

Arena Statistics

LMSYS

LMSYS 2025 Aug 6
