Text2SQL Leaderboard

Text-to-SQL (or Text2SQL), as the name implies, is to convert text into SQL. A more academic definition is to convert natural language problems in the database field into structured query languages that can be executed in relational databases. Therefore, Text-to-SQL can also be abbreviated as NL2SQL.

Input: natural language questions, such as Query the relevant information of the table t_user, and the results are sorted in descending order by id.
Output: SQL, such as SELECT * FROM t_user ORDER BY id DESC.

🔔 Leaderboard

	WikiSQL	Spider Exact Match(EM)	Spider Exact Execution(EX)	BIRD Reward-based Valid Efficiency Score (R-VES)	BIRD Execution Accuracy (EX)
🏆1	93.0 (2021/05-SeaD+Execution-Guided Decoding)	81.5 (2023/11-MiniSeek)	91.2 (2023/11-MiniSeek)	69.36 (2024/08-OpenSearch-SQL, v2 + GPT-4o)	73.00 (2024/09-CHASE-SQL + Gemini)
🥈2	92.7 (2021/03-SDSQL+Execution-Guided Decoding)	74.0 (2022/09-Graphix-3B + PICARD)	86.6 (2023/08-DAIL-SQL + GPT-4 + Self-Consistency)	68.79 (2024/08-ExSL + granite-34b-code)	72.39 (2024/09-AskData + GPT-4o)
🥉3	92.5 (2020/11-IE-SQL+Execution-Guided Decoding)	73.9 (2022/09-CatSQL + GraPPa)	86.2 (2023/08-DAIL-SQL + GPT-4)	68.44 (2024/09-CHASE-SQL + Gemini)	72.28 (2024/08-OpenSearch-SQL, v2 + GPT-4o)
4	92.2 (2020/03-HydraNet+Execution-Guided Decoding)	73.1 (2022/09-SHiP + PICARD)	85.6 (2023/10-DPG-SQL + GPT-4 + Self-Correction)	67.41 (2024/07-Distillery + GPT-4o)	71.83 (2024/07-Distillery + GPT-4o)
5	91.9 (2020/12-BRIDGE+Execution-Guided Decoding)	72.9 (2022/05-G³R + LGESQL + ELECTRA)	85.3 (2023/04-DIN-SQL + GPT-4)	66.92 (2024/09-AskData + GPT-4o)	70.37 (2024/08-ExSL + granite-34b-code)
6	91.8 (2019/08-X-SQL+Execution-Guided Decoding)	72.4 (2022/08-RESDSQL+T5-1.1-lm100k-xl)	83.9 (2023/07-Hindsight Chain of Thought with GPT-4)	66.39 (2024/08-Insights AI)	70.26 (2024/08-Insights AI)
7	91.4 (2021/03-SDSQL)	72.4 (2022/05-T5-SR)	82.3 (2023/06-C3 + ChatGPT + Zero-Shot)	66.25 (2024/05-ExSL + granite-20b-code)	70.21 (2024/07-PURPLE + RED + GPT-4o）
8	91.1 (2020/12-BRIDGE)	72.2 (2022/12-N-best List Rerankers + PICARD)	80.8 (2023/07-Hindsight Chain of Thought with GPT-4 and Instructions)	65.70 (2024/07-RECAP + Gemini)	69.03 (2024/07-RECAP + Gemini)
9	91.0 (2021/04-Text2SQLGen + EG)	72.1 (2021/09-S²SQL + ELECTRA )	79.9 (2023/02-RESDSQL-3B + NatSQ)	65.62 (2024/07-PURPLE + RED + GPT-4o）	68.87 (2024/07-ByteBrain)
10	90.5 (2020/11-SeqGenSQL+EG)	72.0 (2023/02-RESDSQL-3B + NatSQL)	78.5 (2022/11-SeaD + PQL)	63.68 (2024/08-Arcwise + GPT-4o）	67.86 (2024/05-ExSL + granite-20b-code)

📖 Survey

(2025-TKDE, CCF-A) Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL [paper] [code]
(2025-TKDE, CCF-A) A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? [paper] [code]
(2024-arXiv) From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems [paper]
(2024-arXiv) Large Language Model Enhanced Text-to-SQL Generation: A Survey [paper]
(2024-arXiv) A Survey on Employing Large Language Models for Text-to-SQL Tasks [paper]
(2023-VLDB, CCF-A）A survey on deep learning approaches for text-to-SQL [paper]
(2022-TKDE, CCF-A) A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions [paper]
(2022-COLOING, CCF-B) Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect [paper]
(2022-arXiv)Deep Learning Driven Natural Languages Text to SQL Query Conversion: A Survey [paper]

💬 Classic Model

(2025-NAACL, CCF-B) You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL [paper]
(2025-EDBT, CCF-B) DBCᴏᴘɪʟᴏᴛ: Natural Language Querying over Massive Databases via Schema Routing [paper] [code]
(2025-ACL, CCF-A) MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL [paper] [code]
(2024-NeurIPS, CCF-A) Towards Optimizing SQL Generation via LLM Routing [paper]
(2024-arXiv, None) CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL [paper]
(2024-arXiv, None) E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL [paper] [code]
(2024-NeurIPS, CCF-A) Distillery: The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models [paper]
(2024-arXiv, None) DB-GPT-Hub: Towards Open Benchmarking Text-to-SQL Empowered by Large Language Models [paper] [code]
(2024-VLDB, CCF-A) SuperSQL: The Dawn of Natural Language to SQL: Are We Fully Ready? [paper] [code]
(2024-arXiv, None) CHESS: Contextual Harnessing for Efficient SQL Synthesis [paper] [code]
(2023-arXiv, None) Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation [paper] [code]
(2023-AAAI 2023, CCF-A) RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL [paper] [code]
(2023-arXiv, None) Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs [paper] [code]
(2023-arXiv, None) DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [paper] [code]
(2023-arXiv, None) A comprehensive evaluation of ChatGPT’s zero-shot Text-to-SQL capability [paper] [code]
(2023-ICLR, CCF-A) Binding Language Models in Symbolic Languages [paper] [code]
(2023-SIGMOD, CCF-A) Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning [paper] [code]
(2023-ICASSP, CCF-B) T5-SR: A Unified Seq-to-Seq Decoding Strategy for Semantic Parsing [paper]
(2022-ACL, CCF-A) S²SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers [paper]
(2022-NAACL, CCF-B) SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising [paper]
(2022-EMNLP, CCF-B) STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [paper] [code]
(2022-EMNLP, CCF-B) RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL [paper] [code]
(2022-EMNLP, CCF-B) CQR-SQL: Conversational Question Reformulation Enhanced Context-Dependent Text-to-SQL Parsers [paper]
(2022-ACL, CCF-A) HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing [paper]
(2022-arXiv, None) Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [paper]
(2021-ACL, CCF-A) Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL [paper]
(2021-arXiv, None) Pay More Attention to History: A Context Modelling Strategy for Conversational Text-to-SQL [paper] [code]
(2021-ICLR, CCF-A) SCORE: Pre-training for Context Representation in Conversational Semantic Parsing [paper]
(2021-DASFAA, CCF-B) An Interactive NL2SQL Approach with Reuse Strategy [paper]
(2021-NAACL, CCF-B) Structure-Grounded Pretraining for Text-to-SQL [paper]
(2021-EMNLP, CCF-B) PICARD:Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models [paper] [code]
(2021-ICLR, CCF-A) GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [paper] [code]
(2021-ACL, CCF-A) LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations [paper] [code]
(2020-EMNLP, CCF-B) Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [paper] [code]
(2020-ACL, CCF-A) TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [paper] [code]
(2020-ACL, CCF-A) RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers [paper] [code]
(2020-EMNLP, CCF-B) Mention Extraction and Linking for SQL Query Generation [paper]
(2020-EMNLP, CCF-B) IGSQL: Database Schema Interaction Graph Based Neural Model for Context-Dependent Text-to-SQL Generation [paper] [code]
(2020-arXiv, None) Hybrid Ranking Network for Text-to-SQL [paper] [code]
(2019-arXiv, None) X-SQL: reinforce schema representation with context [paper]
(2019-EMNLP, CCF-B) Global Reasoning over Database Structures for Text-to-SQL Parsing [paper] [code]
(2019-EMNLP, CCF-B) Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions [paper] [code]
(2019-ACL, CCF-A) Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing [paper] [code]
(2019-ACL, CCF-A) Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation [paper] [code]
(2018-EMNLP, CCF-B) SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-DomainText-to-SQL Task [paper] [code]
(2018-NAACL, CCF-B) TypeSQL: Knowledge-based Type-Aware Neural Text-to-SQL Generation [paper] [code]
(2017-arXiv, None) SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning [paper] [code]

🔥 Base Model

Llama [paper] [code] [model]
- 2023/02, Meta AI proposes the open source LLM Llama, which has four scales: 7b, 13b, 33b, and 65b.
ChatGLM [paper] [code] [model]
- 2023/03, Tsinghua University proposes the open bilingual language model ChatGLM, based on General Language Model framework, with the specification of 7b.
Alpaca [paper] [code] [model]
- 2023/03, Stanford University proposes Alpaca, an open source LLM fine-tuned based on the Llama 7b model.
Vicuna [paper] [code] [model]
- 2023/03, UC Berkeley University, CMU and Stanford University propose Vicuna, an open source LLM based on the Llama model, with two specifications: 7b and 13b.
WizardLM [paper] [code] [model]
- 2023/04, Peking University and Microsoft propose WizardLM, an LLM of evolutionary instructions, with three specifications of 7b, 13b, and 30b. 2023/06, They propose WizardMath, an LLM in the field of mathematics. 2023/08, They propose WizardCoder, an LLM in the field of code.
Falcon [paper] [code] [model]
- 2023/06, United Arab Emirates proposes Falcon, an open source LLM trained solely on refinedweb datasets, with four parameter specifications of 1b, 7b, 40b and 180b. It is worth noting that the performance on model 40B exceeds that of 65B LLaMA.
ChatGLM2[paper] [code] [model]
- 2023/06, Tsinghua University proposes the second-generation version of ChatGLM, with the specification of 7b, which has stronger performance, longer context, more efficient inference and more open license.
Baichuan-7b [code] [model]
- 2023/06, Baichuan Intelligent Technology proposes the Baichuan-7B, an open-source, large-scale pre-trained language model based on Transformer architecture, which contains 7 billion parameters and was trained on approximately 1.2 trillion tokens. It supports both Chinese and English languages with a context window length of 4096.
Baichuan-13b [code] [model]
- 2023/07, Baichuan Intelligent Technology proposes the Baichuan-13B, an open-source, commercially available large-scale language model, following Baichuan-7B, which has two versions: pre-training (Baichuan-13B-Base) and alignment (Baichuan-13B-Chat).
InternLM [paper] [code] [model]
- 2023/07, Shanghai AI Laboratory and SenseTime propose the InternLM, which has open-sourced 7b and 20b parameter base models and chat models tailored for practical scenarios and the training system.
Llama 2 [paper] [code] [model]
- 2023/07, Meta AI proposes the second-generation Llama series open-source LLM Llama 2. Compared with Llama 1, the training data is 40% more, and the context length is doubled. The model has four specifications: 7b, 13b, 34b, and 70b, but 34b is not open source.
Code Llama [paper] [code] [model]
- 2023/08, Meta AI proposes Code LLama, based on Llama 2. Code Llama reaches state-of-the-art performance among open models on several code benchmarks. There are foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models, with 7B, 13B and 34B parameters each. 2024/01, Meta AI open sourced CodeLlama-70b, CodeLlama-70b-Python and CodeLlama-70b-Instruct.
Qwen [paper] [code] [model]
- 2023/08, Alibaba Cloud proposes the 7b-parameter version of the large language model series Qwen-7B (abbr. Tongyi Qianwen), pretrained on a large volume of data, including web texts, books, codes, etc. 2023/09, Alibaba Cloud updated the Qwen-7B and Qwen-7B-Chat and open sourced Qwen-14B and Qwen-14B-Chat. 2023/11, they open sourced Qwen-1.8B, Qwen-1.8B-Chat, Qwen-72B and Qwen-72B-Chat.
Baichuan 2 [paper] [code] [model]
- 2023/09, Baichuan Intelligent Technology proposes the new generation of open-source large language models Baichuan 2, trained on a high-quality corpus with 2.6 trillion tokens, which has base and chat versions for 7B and 13B, and a 4bits quantized version for the chat model.
Phi-1.5 [paper] [model]
- 2023/09, Microsoft Research proposes the open source language model phi-1.5, a Transformer with 1.3 billion parameters, which was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, phi-1.5 demonstrates a nearly state-of-the-art performance among models with less than 10 billion parameters. 2023/12, They propose Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters.
Mistral-7B [paper] [code] [model]
- 2023/10, Mistral-AI company proposes the open source LLM Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. They also provide a model fine-tuned to follow instructions, Mistral 7B – Instruct, that surpasses Llama 2 13B–chat model both on human and automated benchmarks. 2023/12, They propose the open source LLM Mixtral-8x7B, a pretrained generative Sparse Mixture of Experts, which outperforms Llama 2 70B on most benchmarks.
Deepseek [paper] [code] [model]
- 2023/11, DeepSeek-AI company proposes the open source LLM deepseek, which has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese. Similarly, the deepseek LLM mainly has two categories: base and chat, with two parameter formats of 7b and 67b respectively. Data from its paper shows that deepSeek LLM 67b surpasses LLaMA-2 70b across a range of benchmarks, especially in the domains of code, mathematics, and reasoning. Furthermore, DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
MiniCPM [paper] [code] [model]
- 2024/02, ModelBest Inc. and TsinghuaNLP propose the open source LLM MiniCPM, which is an End-Side LLM, with only 2.4B parameters excluding embeddings (2.7B in total). It is worth that MiniCPM has very close performance compared with Mistral-7B on open-sourced general benchmarks with better ability on Chinese, Mathematics and Coding after SFT. The overall performance exceeds Llama2-13B, MPT-30B, Falcon-40B, etc.
Mixtral-8x22B [paper] [code] [model]
- 2024/04, Mistral AI proposed the latest open model Mixtral 8x22B. It sets a new standard for performance and efficiency within the AI community. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size.
Phi-3 [paper] [model]
- 2024/04, Microsoft proposed the Phi-3 models, which are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and next size up across a variety of language, reasoning, coding, and math benchmarks. Phi-3-mini is available in two context-length variants—4K and 128K tokens. It is the first model in its class to support a context window of up to 128K tokens, with little impact on quality. Phi-3-small (7B) and Phi-3-medium (14B) will be available in the Azure AI model catalog and other model gardens shortly.  
Llama 3 [paper] [code] [model]
- 2024/04, Meta AI proposed the third generation Llama series open source large model Llama 3. The model has 2 parameter specifications, 8b and 70b, with base and instruct versions respectively. Excitingly, Llama 3 models are a major leap over Llama 2 and establish a new state-of-the-art for LLM models at those scales.
Qwen-1.5-110B [paper] [code] [model]
- 2024/04, Alibaba Cloud proposed the first 100B+ model of the Qwen1.5 series, Qwen1.5-110B, which achieves comparable performance with Meta-Llama3-70B in the base model evaluation, and outstanding performance in the chat evaluation, including MT-Bench and AlpacaEval 2.0. Qwen1.5 is the beta version of Qwen2, which has 9 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B dense models, and an MoE model of 14B with 2.7B activated.
Qwen2 [paper] [code] [model]
- 2024/06, Alibaba Cloud proposed the evolution from Qwen1.5 to Qwen2, which has 5 model sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Qwen2-72B exhibits superior performance compared to leading models such as Llama-3-70B. Notably, it surpasses the performance of its predecessor, Qwen1.5-110B, despite having fewer parameters.
Llama 3.1 [paper] [code] [model]
- 2024/07, Meta AI proposed the Llama 3.1 405B, which is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. As part of this latest release, they’re introducing upgraded versions of the 8B and 70B models. These are multilingual and have a significantly longer context length of 128K, state-of-the-art tool use, and overall stronger reasoning capabilities.
Qwen2.5 [paper] [code] [model]
- 2024/09, Alibaba Cloud proposed the latest addition to the Qwen family: Qwen2.5, along with specialized models for coding, Qwen2.5-Coder, and mathematics, Qwen2.5-Math. All open-weight models are dense, decoder-only language models, available in various sizes, including: Qwen2.5(0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B), Qwen2.5-Coder (1.5B, 7B, and 32B on the way) and Qwen2.5-Math (1.5B, 7B, and 72B). They benchmarked their largest open-source model, Qwen2.5-72B-Instruct against leading open-source models like Llama-3.1-70B-Instrct and Mistral-Large-V2-Instruct and achieved the best results in multiple indicators.
Llama 3.2 [paper] [code] [model]
- 2024/09, Meta AI proposed the Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions. The Llama 3.2 1B and 3B models support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge. These models are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors.

💡 Fine-tuning

P-Tuning [paper] [code]
- 2021/03, Tsinghua University and others propose P-Tuning, a fine-tuning method for LLM, which uses trainable continuous prompt word embeddings to reduce the cost of fine-tuning.
LoRA [paper] [code]
- 2021/06, Microsoft proposes the Low-Rank Adaptation method for fine-tuning LLM by freezing the pre-training weights.
P-Tuning V2 [paper] [code]
- 2021/10, Tsinghua University proposes P-Tuning V2, an improved version of P-Tuning with better performance.
RLHF [paper] [code]
- 2022/12, OpenAI uses the RLHF (Reinforcement Learning from Human Feedback) method to train ChatGPT, and uses human feedback signals to directly optimize the language model, with excellent performance.
RRHF [paper] [code]
- 2023/04, Alibaba proposes a novel learning paradigm called RRHF（Rank Responses to Align Language Models with Human Feedback without tears), which can be tuned as easily as fine-tuning and achieve a similar performance as PPO in HH dataset.
QLoRA [paper] [code]
- 2023/05, Washington University proposes the qlora method, based on the frozen 4bit quantization model, combined with LoRA method training, which further reduces the cost of fine-tuning.
RLTF [paper] [code]
- 2023/07, Tencent proposes RLTF（Reinforcement Learning from Unit Test Feedback), a novel online RL framework with unit test feedback of multi-granularity for refining code LLMs.
RRTF [paper]
- 2023/07, Huawei proposes RRTF（Rank Responses to align Test&Teacher Feedback). Compared with RLHF, RRHF can efficiently align the output probabilities of a language model with human preferences, with only 1-2 models required during the tuning period, and it is simpler than PPO in terms of implementation, hyperparameter tuning, and training.
RLAIF [paper]
- 2023/09, Google proposes RLAIF (RL from AI Feedback), a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans. They find that the RLHF and RLAIF methods achieve similar results on the task of summarization.

💪 Dataset

WikiSQL [paper] [code] [dataset]
- 2017/09, Salesforce proposes a large Text-to-SQL dataset WikiSQL, the data comes from Wikipedia, which belongs to a single domain, contains 80,654 natural language questions, and 77,840 SQL statements. The form of SQL statements is relatively simple, and does not include sorting, grouping, and subqueries and other complex operations.
Spider 1.0 [paper] [code] [dataset]
- 2018/09, Yale University proposes the Text-to-SQL dataset Spider with multiple databases, multiple tables, and single-round query. It is also recognized as the most difficult large-scale cross-domain evaluation list in the industry. It contains 10,181 natural language questions and 5,693 SQL statements. Involving more than 200 databases in 138 different fields, the difficulty level is divided into: easy, medium, difficult, and extremely difficult. 2024/02, Yale University has open sourced the test collection of Spider 1.0 leaderboard, and they will open source the Spider 2.0 data set in March.
SParC [paper] [code] [dataset]
- 2019/06, Yale University proposes a large dataset SParC for complex, cross-domain, and context-dependent(multi-turn) semantic parsing and text-to-SQL task, which consists of 4,298 coherent question sequences (12k+ unique individual questions annotated with SQL queries annotated by 14 Yale students), obtained from user interactions with 200 complex databases over 138 domains.
CSpider [paper] [code] [dataset]
- 2019/09, Westlake University proposes a large Chinese dataset CSpider for complex and cross-domain semantic parsing and text-to-SQL task, translated from Spider by 2 NLP researchers and 1 computer science student, which consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains.
CoSQL [paper] [code] [dataset]
- 2019/09, Yale University and Salesforce Research propose a cross-domain database CoSQL, which consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz (WOZ) collection of 3k dialogues querying 200 complex DBs spanning 138 domains.
TableQA [paper] [dataset]
- 2020/06, Zhuiyi Technology proposes a large-scale cross-domain Natural Language to SQL dataset TableQA in Chinese language consisting 64,891 questions and 20,311 unique SQL queries on over 6,000 tables.
DuSQL [paper] [dataset]
- 2020/11, Baidu proposes a large-scale and pragmatic Chinese dataset DuSQL for the cross-domain text-toSQL task, containing 200 databases, 813 tables, and 23,797 question/SQL pairs.
KaggleDBQA [paper] [code] [dataset]
- 2021/06, University of Washington and Microsoft Research propose KaggleDBQA, a cross-domain evaluation dataset of real Web databases with domain-specific data types, original formatting, and unrestricted questions. It includes 272 examples across 8 databases, with an average of 2.25 tables per database. The dataset is notable for its real-world data sources, natural question authoring environments, and database documentation with rich domain knowledge. Key statistics: 8.7% WHERE clauses, 73.5% VAL, 24.6% SELECT, and 6.8% NON-SELECT.
CHASE [paper] [code] [dataset]
- 2021/08, Xi’an Jiaotong University and Microsoft propose the first cross-domain, multi-round Text-to-SQL Chinese dataset, which contains a list of 5459 multi-round questions and 17940 <query, SQL> binary groups.
BIRD-SQL [paper] [code] [dataset]
- 2023/05, the University of Hong Kong and Alibaba propose a large-scale cross-domain dataset BIRD, which contains over 12,751 unique question-SQL pairs, 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education, etc.
BIRD-SQL Mini-Dev [paper] [code] [dataset]
- 2024/06, the collaboration between the University of Hong Kong and Alibaba continues with the release of BIRD-SQL Mini-Dev, a lite version of their development dataset designed for efficient and cost-effective SQL model testing. This dataset compiles 500 high-quality text2SQL pairs from 11 distinct databases and supports both MySQL and PostgreSQL formats. It features the introduction of two new evaluation metrics: the Reward-based Valid Efficiency Score (R-VES) and the Soft F1-Score, both currently in beta and specifically developed to enhance the accuracy and efficiency of text-to-SQL models in a development setting.
Spider 2.0 [paper] [code] [dataset]
- 2024/08, Spider 2.0, proposed by XLang AI, serves as an advanced evaluation framework for text-to-SQL tasks within real-world enterprise-level workflows. It contains 600 complex text-to-SQL workflow problems, derived from various enterprise database use cases. The dataset includes databases sourced from actual data applications, often containing over 1,000 columns, and stored in cloud or local systems like BigQuery, Snowflake, or PostgreSQL. Problems in Spider 2.0 require understanding and searching through database metadata, dialect documentation, and project-level codebases. The challenges include processing long contexts, performing intricate reasoning, and generating multiple SQL queries with diverse operations, often exceeding 100 lines. Current state-of-the-art models, like GPT-4, solve only 6.0% of the questions, highlighting the dataset’s difficulty and the need for more advanced, autonomous LLM-based code agents.

🌈 Evaluation Index

Execution Accuracy (EX) [paper]
- Definition: Calculate the proportion of the correct number of SQL execution results in the data set, and the result may be overestimated.
Exact Match (EM) [paper]
- Definition: Calculate the matching degree between the SQL generated by the model and the marked SQL, and the result may be underestimated.

📦 Libraries

mindsql
- MindSQL is a Python RAG (Retrieval-Augmented Generation) Library designed to streamline the interaction between users and their databases using just a few lines of code. With seamless integration for renowned databases such as PostgreSQL, MySQL, and SQLite, MindSQL also extends its capabilities to major databases like Snowflake and BigQuery by extending the core class. The library utilizes large language models (LLM) like GPT-4, Llama 2, Google Gemini, and supports knowledge bases like ChromaDB and Fais
PremSQL
- PremSQL is a lightweight library that aims to provide modular and extendible tools for natural language to SQL applications. This would help developers build more controllable RAG solutions based on different databases. It seeks to offer toolings for evaluation, fine-tuning, connectors, and agentic frameworks. PremSQL also comes up with Prem-1B-SQL a 1B parameter Text to SQL model with 51.54 % on BirdBench private test set. It is super light weight and works for fully local text to SQL generation.
QueryWeaver
- QueryWeaver is an open-source graph-powered Text-to-SQL engine that leverages a semantic layer to bridge the gap between natural language and complex relational databases. By mapping database schemas into knowledge graphs using FalkorDB, it provides superior contextual understanding for intricate joins and business logic. The library supports PostgreSQL and MySQL and includes a built-in MCP (Model Context Protocol) server, making it easily integrable into agentic workflows and LLM-based tools like Claude or Cursor.

🔧 Practice Project

DB-GPT-Hub
- The eosphoros organization proposes an open source project focusing on Text-to-SQL fine-tuning based on LLM, including large-scale model download, dataset preprocessing, fine-tuning technologies such as LoRA and QLoRA, model prediction, model evaluation and other steps.
sqlcoder
- The Defog organization proposes an advanced Text-to-SQL LLM, which has outstanding performance and is better than GPT3.5, wizardcoder and starcoder, etc., second only to GPT4.
modal_finetune_sql
- This project is based on the LLaMa 2 7b model for Text-to-SQL fine-tuning, which includes a complete training, fine-tuning, and evaluation process.
LLaMA-Efficient-Tuning
- Easy-to-use LLM fine-tuning framework (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Chat
Datrics Text2SQL
- Datrics Text2SQL is an open-source framework that uses a tuned RAG pipeline to generate SQL queries from natural language. It focuses on accurate schema retrieval, resolving ambiguous documentation, and improving generalization on real-world queries.

🔗 Citation

If you find Text2SQL useful for your research or development, please cite the following paper:

@misc{zhou2024dbgpthub,
      title={DB-GPT-Hub: Towards Open Benchmarking Text-to-SQL Empowered by Large Language Models}, 
      author={Fan Zhou and Siqiao Xue and Danrui Qi and Wenhui Shi and Wang Zhao and Ganglin Wei and Hongyang Zhang and Caigai Jiang and Gangwei Jiang and Zhixuan Chu and Faqiang Chen},
      year={2024},
      eprint={2406.11434},
      archivePrefix={arXiv},
      primaryClass={id='cs.DB' full_name='Databases' is_active=True alt_name=None in_archive='cs' is_general=False description='Covers database management, datamining, and data processing. Roughly includes material in ACM Subject Classes E.2, E.5, H.0, H.2, and J.1.'}
}

🤝 Friendship Links

eosphoros
- They are a team of technology enthusiasts from internet companies and NLP graduate students who are passionate about open source projects. Their focus is on developing solutions that protect the privacy and security of databases and large language models. Their aim is to ensure that the abilities of these models remain absolutely private, secure, and under control.
Awesome-AIGC-Tutorials
- Awesome AIGC Tutorials houses a curated collection of tutorials and resources spanning across Large Language Models, AI Painting, and related fields. Discover in-depth insights and knowledge catered for both beginners and advanced AI enthusiasts.
AI2sql
- AI2sql is a Text-to-SQL tool that enables users to write SQL queries using natural language. It supports multiple databases including MySQL, PostgreSQL, SQL Server, and more. Perfect for data analysts, developers, and anyone who needs to query databases without deep SQL knowledge.