Leaderboard

Chatbot Arena

This leaderboard is based on the following benchmarks. Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 3.1M+ user votes to compute Elo ratings. MMLU - a test to measure a model’s multitask accuracy on 57 tasks. Arena-Hard-Auto - an automatic evaluation tool for instruction-tuned LLMs.

Coder EvalPlus

EvalPlus is a rigorous evaluation framework for LLM4Code, with: ✨ HumanEval+: 80x more tests than the original HumanEval! ✨ MBPP+: 35x more tests than the original MBPP! ✨ EvalPerf: evaluating the efficiency of LLM-generated code! ✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on above benchmarks.

Text2SQL Leaderboard

Text-to-SQL (or Text2SQL), as the name implies, is to convert text into SQL. A more academic definition is to convert natural language problems in the database field into structured query languages that can be executed in relational databases. Therefore, Text-to-SQL can also be abbreviated as NL2SQL. Input: natural language questions, such as Query the relevant information of the table t_user, and the results are sorted in descending order by id. Output: SQL, such as SELECT * FROM t_user ORDER BY id DESC.

Leaderboard

⌂ • Engineering • Research • Leaderboard • Learning

Chatbot Arena

Coder EvalPlus

Text2SQL Leaderboard