Leaderboard

Chatbot Arena +

This leaderboard is based on the following benchmarks:
Chatbot Arena: a crowdsourced, randomized battle platform for large language models (LLMs). Elo ratings are computed from 3.5M+ user votes.
AAII: the Artificial Analysis Intelligence Index, which aggregates 8 challenging evaluations.
ARC-AGI: an artificial general intelligence benchmark (v2) that measures fluid intelligence.
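To illustrate how pairwise votes can be turned into Elo ratings, here is a minimal sketch of the classic online Elo update. It is not necessarily the exact procedure this leaderboard uses. The model names, the K-factor of 32, and the starting rating of 1000 are all assumptions chosen for illustration.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats a player rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str,
               k: float = 32.0, initial: float = 1000.0) -> dict:
    """Apply one vote (winner beat loser) to the ratings table in place.

    k and initial are illustrative defaults, not values taken from the leaderboard.
    """
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains more when the win was unexpected (low expected score).
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes: each tuple is (winning model, losing model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update moves the same number of points from loser to winner, the total rating across all models stays constant. The online update depends on the order of the votes.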

Coder EvalPlus

EvalPlus is a rigorous evaluation framework for LLM4Code, with:
✨ HumanEval+: 80x more tests than the original HumanEval!
✨ MBPP+: 35x more tests than the original MBPP!
✨ EvalPerf: evaluating the efficiency of LLM-generated code!
✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Text2SQL Leaderboard

Text-to-SQL (or Text2SQL), as the name implies, converts text into SQL. More formally, it translates natural-language questions about a database into structured query language (SQL) statements that can be executed on a relational database. For this reason, Text-to-SQL is also called NL2SQL.
Input: a natural-language question, such as "Query the relevant information of the table t_user, sorted in descending order by id."
Output: SQL, such as SELECT * FROM t_user ORDER BY id DESC.
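A Text-to-SQL output is only useful if it executes and returns the intended rows. The sketch below runs the example query against an in-memory SQLite database using Python's standard sqlite3 module. The table schema (an id and a name column) and the sample rows are assumptions made for illustration; the source only names the table t_user and the id column.

```python
import sqlite3


def run_query(sql: str) -> list:
    """Execute sql against a small in-memory t_user table and return all rows."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: only `t_user` and `id` appear in the original example.
        conn.execute("CREATE TABLE t_user (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO t_user (id, name) VALUES (?, ?)",
            [(1, "alice"), (2, "bob"), (3, "carol")],
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# The SQL a Text-to-SQL model would be expected to produce for the example question.
generated_sql = "SELECT * FROM t_user ORDER BY id DESC"
rows = run_query(generated_sql)
print(rows)
```

Comparing the rows returned by a generated query with those returned by a reference query is one common way to check a Text-to-SQL result.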
