Zhongyuan Wang1,Richong Zhang1,2 ,Zhijie Nie1,3,Jaein Kim1Corresponding author
Abstract
Recent Text-to-SQL methods leverage large language models (LLMs) by incorporating feedback from the database management system. While these methods effectively address execution errors in SQL queries, they struggle with database mismatches—errors that do not trigger execution exceptions. Database mismatches include issues such as condition mismatches and stricter constraint mismatches, both of which are more prevalent in real-world scenarios. To address these challenges, we propose a tool-assisted agent framework for SQL inspection and refinement, equipping the LLM-based agent with two specialized tools: a retriever and a detector, designed to diagnose and correct SQL queries with database mismatches. These tools enhance the capability of LLMs to handle real-world queries more effectively. We also introduce Spider-Mismatch, a new dataset specifically constructed to reflect the condition mismatch problems encountered in real-world scenarios. Experimental results demonstrate that our method achieves the highest performance on the averaged results of the Spider and Spider-Realistic datasets in few-shot settings, and it significantly outperforms baseline methods on the more realistic dataset, Spider-Mismatch.
Introduction
The Text-to-SQL task (Zhong, Xiong, and Socher 2017; Yu etal. 2018) aims to automatically convert natural language questions from users into Structured Query Language (SQL) queries with the database schema, enabling non-expert users to more easily access data from databases.
Previous works on the Text-to-SQL task (Wang etal. 2020; Scholak, Schucher, and Bahdanau 2021; Gan etal. 2021) focus on training models using various frameworks and strategies, which typically require a large number of labeled data. Recent research explores the power of large language models (LLMs) and applies the in-context learning (ICL) paradigm to this task. In-context learning is an emerging capability of LLMs that allows them to perform comparably to fine-tuned models on many tasks simply by giving them a few examples. Initial approaches to in-context learning (Zhang etal. 2023; Dong etal. 2023) focus on creating better prompts to leverage the single-step reasoning capabilities of LLMs. Later approaches (Pourreza and Rafiei 2024; Wang etal. 2024) introduce multi-step processes to assist LLMs in SQL query generation. Given the challenges in producing entirely correct SQL queries, existing methods incorporate a refinement stage, which can be divided into two types: self-correction and refinement based on execution feedback. The self-correction approach (Pourreza and Rafiei 2024) guides LLMs to generate revised SQL queries based on predefined correction guidelines, though it can only address a limited range of errors. In contrast, the execution feedback approach (Wang etal. 2024) refines SQL queries by leveraging feedback from executing these queries on a database management system (DBMS), ensuring excitability and improving the results.
Although existing methods address execution errors in SQL queries by leveraging feedback from the DBMS, they struggle with issues that do not trigger execution exceptions. We focus on a specific type of error within this category, which we term database mismatches. This error type encompasses two common and challenging problems in real-world scenarios. (1) Mismatch of Conditions:The mismatch of conditional clauses in SQL queries can lead to either empty or incorrect results.In real-world scenarios, the diverse and irregular nature of user questions makes it difficult for LLMs to accurately align questions with the database and generate correct SQL condition clauses.Figure 1 illustrates the challenges of translating a natural language query into SQL in real-world scenarios, emphasizing the ambiguity often present in user questions. Such ambiguity complicates the LLMs to accurately determine whether the user is referring to specific columns like “Written_by” or “Directed_by” due to the lack of clarity in the question. Furthermore, even if the correct columns are identified, inconsistencies between the values mentioned in the user question and the actual data in the database can lead to empty results (e.g., “todd casey” vs. “Todd Casey”). Although existing methods attempt to assist LLMs by providing example values for each column, which is insufficient in real-world scenarios.(2) Mismatch of Stricter Constraints:In real-world scenarios, SQL queries often need to adhere to stricter constraints, which may stem from the inherent features of SQL or user-defined rules. For instance, the former might involve restrictions related to foreign key relationships or column data types, and the latter could include mandatory processes such as “NULL” values or specific data formats. These mismatches of stricter constraints are not reflected during execution, but SQL queries that do not satisfy these constraints may not yield the expected results.
These challenges make it difficult for LLMs to generate accurate SQL queries in a single process, requiring a multi-step refinement to accomplish SQL generation.To enhance the SQL generation capabilities of LLMs in real-world scenarios, we propose Tool-SQL, a tool-assisted agent framework designed to continuously inspect and correct errors in SQL queries. Our framework employs various tools to diagnose problems within SQL queries and leverages an LLM-based agent to purposefully refine these queries based on the feedback provided by these tools.We design two specific tools to address the aforementioned problems: (1) Database Retriever, which helps LLM-based agents by retrieving similar database cells as feedback when SQL conditional clauses do not match any entries in the databases. (2) Error Detector, which diagnoses a wider range of errors, including execution errors and mismatches of stricter constraints defined by SQL rules or domain experts.
Additionally, we observe that the mainstream Spider dataset and its variants (Yu etal. 2018; Gan etal. 2021; Deng etal. 2021; Gan, Chen, and Purver 2021) rarely reflect the mismatches of conditions in SQL queries, which usually requires verification in databases.Meanwhile, the values mentioned in the conditions of most questions in existing datasets (including the Bird dataset (Li etal. 2024)) are identical to corresponding cells in databases, which is rare in the real world.To bridge the gap between existing datasets and real-world scenarios, we introduce Spider-Mismatch, a new dataset specifically designed to highlight the mismatch problem in SQL conditional clauses. We modify the questions and corresponding SQL queries in Spider and its variants by applying specific disturbances, which challenge models to generate accurate SQL queries.
Our main contributions can be summarized as follows:
- •
We propose Tool-SQL, a tool-assisted agent framework aiming to inspect problems in SQL queries with the assistance of specialized tools and refine these queries using an LLM-based agent.
- •
We introduce Spider-Mismatch, a new dataset designed specifically to bridge the gap between existing benchmarks and real-world scenarios, where the semantics of the questions are more ambiguous.
- •
Tool-SQLachieves the highest execution accuracy on the overall results of Spider and Spider Realistic in few-shot settings. Experiments on Spider-Mismatchdemonstrate that our method maintains strong performance despite the increased ambiguity in question narratives encountered in real-world scenarios.
Related Work
LLM for Text-to-SQL
LLMs have been widely used in the Text-to-SQL task, with various approaches proposed to enhance their ability.Some research focuses on designing higher-quality prompts for LLMs to explore the potential of LLMs in Text-to-SQL parsing. For example, ACT-SQL (Zhang etal. 2023) and (Tai etal. 2023) enhance LLM’s reasoning capabilities through chain-of-thought prompts. The DAIL-SQL (Gao etal. 2024) systematically investigates prompt engineering in LLM Text-to-SQL, including question representation, demonstration selection, and demonstration organization.Most recent research employs multi-stage frameworks, aiming to enhance the performance of LLMs by decomposing the Text-to-SQL task into smaller sub-tasks and designing different prompts for each sub-task.For instance, DIN-SQL (Pourreza and Rafiei 2024) decomposes the Text-to-SQL task into schema linking, question classification, SQL generation, and self-correlation to reduce the overall difficulty. DEA-SQL (Xie etal. 2024) enhances the process of DIN-SQL and introduces an additional active learning module.To reduce errors in generated SQL queries, existing multi-stage methods often introduce error correction modules. DIN-SQL and DEA-SQL adopt self-correction, guiding LLMs to correct the SQL based on the static guidelines in prompts. MAC-SQL (Wang etal. 2024) leverage the feedback from the database management system to guide the LLM, solving the execution errors in SQL queries.
LLM-based Agents
With the rise of LLM, the potential of LLM-based agents is continuously being explored. Invoking tools is a crucial capability for LLM agents, which bridge the gap between LLM agents and the external world. AutoGPT (Team 2023) is an open-source implementation of the AI agent, with many useful tools to augment a single agent. OpenAgents (Xie etal. 2023) develops three agents, each specializing in different domains and equipped with domain-specific tools.ToolLLM (Qin etal. 2024) and API-Bank (Li etal. 2023) focus on LLM Agent interacting with a wide range of open-domain real-world applications with RESTful APIs.
In the Text-to-SQL task, MAC-SQL (Wang etal. 2024) proposed a multi-agent framework that separately addresses various sub-tasks of Text-to-SQL, including SQL refinement through execution exceptions. However, leveraging tools to diagnose the other types of errors in SQL queries and provide feedback to assist LLM-based agents in performing SQL refinement is under-explored. We fill this gap and explore the use of tools to inspect and address the database mismatches in SQL queries.
Method
LLM-based Text-to-SQL Task
LLM-based methods typically adopt the in-context learning paradigm, treating Text-to-SQL as a generation task. The generation process can be formulated as:
(1) |
where the input to the large language models includes a task instruction prompt , a set of demonstration examples , a database schema of the database , and a new query . The demonstrations consists of examples from the training set, each with expected output . The output from the LLM can be either an SQL query or intermediate results in other forms.
Framework
We introduce Tool-SQLin Figure 3, a tool-assisted agent framework designed to continuously inspect and refine SQL queries using multiple tools guided by an LLM-based agent. Following previous works (Jiang etal. 2023; Gu etal. 2024), we define a set of Python functions as the action space for the LLM-based agent. These functions correspond to different SQL clauses. Therefore, in our method, the output is a sequence of actions that represent the SQL query, rather than the SQL query itself.With the Python interpreter executing the sequence of actions, each tool in the toolset is invoked to inspect different errors in the function calls based on the question and the database . If an error is detected, each tool provides specific feedback to the LLM-based agent, helping the agent to refine specific SQL clauses rather than blindly modifying the SQL query. The inspection process can be formulated as
(2) |
The inspection and refinement process is iterative. After the LLM-based agent generates a sequence of actions, all the tools are invoked to inspect potential problems. If all the tools approve the action sequence, it will be used to assemble the final SQL query. In contrast, if any tool detects an issue, the agent will generate a new action sequence based on the original sequence and the feedback from the tool.This process may be repeated several times until all the tools approve the sequence or the maximum number of attempts is reached. The refinement process can be formulated as
(3) |
In the following subsections, we will introduce the details of our method as follows:
- •
The Python function calls we design for the LLM-based agent to make a sequence of actions.
- •
The two tools integrated into our method: a database retriever and an error detector.The former checks the validity of SQL conditional clauses and assists the agent by exploring database content, while the latter detects errors in queries based on SQL execution syntax, database schema, and stricter constraints defined by SQL features or domain experts.
- •
The process to obtain a final SQL query, where the Python interpreter executes an action sequence, and an LLM is employed to supplement the missing information in the SQL query.
Function Calls Design
Figure 2 illustrates our function design. We define eight Python functions based on the primary clauses of the SQL, which can be used to assemble the SQL query after refinement. Each SQL clause has a corresponding function action, such as the “WHERE” clause, which is represented by the “add_where” function. To reduce the action space for the LLM-based agent, we merge SQL query concatenation operators (e.g., “UNION”, “INTERSECT”, “EXCEPT”) into a single “add_merge” function. Additionally, conjunction operators like “AND” and “OR” within “WHERE” or “HAVING” clauses are hidden in the function design and resolved by LLM at the end, simplifying the action-making process. Since each SQL clause has distinct structural characteristics, the content of each clause is passed as parameters to the corresponding functions. For example, the “WHERE” clause “A = B” is represented as “add_where(A, =, B)”. This parameterized approach allows the tools in our framework to more effectively diagnose errors in SQL clauses and reduces the complexity of string parsing.
Besides these functions, we introduce the “QA()” function to better address sub-questions, enhancing the reasoning ability of the agent. When this function is executed, the agent generates separate actions for sub-questions. These nine functions form the action space for the LLM Agent.
For a new question in the reasoning stage, since the context provides the LLM-based agent with a complete database schema that does not change immediately, the agent generates the sequence of actions in one go, rather than proceeding step-by-step as in other works (Jiang etal. 2023; Cen etal. 2024). This approach avoids excessive and unnecessary interactions between the agent and the environment. For most database table cells not covered in the context, the database retrieval tool within our framework inspects the conditional clauses and provides feedback to the agent.
Inspection Tools
In this section, we define two tools—a database retriever and an error detector, which inspect problems in SQL queries and assist the LLM-based agent in refining SQL queries.
Database Retriever
The primary responsibility of the database retriever is to assist the LLM-based agent in verifying the correctness of SQL conditional clauses. As shown in Figure 3, the retriever inspects whether the parameters in conditional actions (e.g., “add_where” and “add_having”) match any entry in the database and provides the agent with references to similar cells if no match is found. By using the retriever, the agent can align the value in the SQL query with the corresponding cells in the database or decide to exclude the columns from the conditional clauses, which is crucial for the Text-to-SQL task in real-world scenarios. In real-world settings, user questions often contain irregular values that differ from standardized values in the database, requiring validation before executing the query. Additionally, the ambiguity of user questions may make it challenging, even for an advanced agent, to locate the correct column names in the conditional clauses.
We employ SimCSE (Gao, Yao, and Chen 2021) as the retrieval model because it effectively captures semantic information between the constraint value and database cells, which is particularly useful for handling cases with significant variations, such as abbreviation. Based on the cells returned by the retriever, the LLM-based agent evaluates the correctness of the conditional clauses. If the condition is deemed correct, the agent selects the ground truth cell(s) to replace the mention of value; otherwise, the agent generates a new conditional action. Each time the agent makes a new action, we remove the previously used conditional columns from the database schema in the context to prevent duplicate actions. This process is repeated several times until the correct constraint is generated, or the maximum number of attempts is reached. If the maximum attempts are reached without success, we use the initial conditional action as the final answer, as it is likely to be the most accurate.
Error Detector
The role of the error detector is to inspect the mismatches of stricter constraints and detect execution errors in SQL through indirect access to the database. SQL generated by LLM may contain errors due to factors like unfamiliarity with domain-specific SQL or hallucinations, making error detection essential. We parse the parameters of our Python functions and design verification programs to perform extensive detection with the help of databases. Different from MAC-SQL (Wang etal. 2024), we do not directly execute SQL queries in the DBMS to gather feedback, as this approach has limited error detection capabilities, typically only catching execution exceptions like syntax errors and database schema errors.Additionally, although privilege control strategies can prevent potentially harmful SQL from damaging the important data in databases, executing unchecked SQL queries is still risky, as it may lead to slow response times in large real-world databases.
In the inspection process, we first extract the schema information of the database, including all table names, column names and types, foreign key relationships, and more. We then design verification programs to check whether the parameters of the functions satisfy the SQL operations and the database schema. For stricter constraints, our diagnostic process focuses on detecting the following errors based on SQL features: 1) mismatch of foreign key relationships, 2) redundancy or absence of “JOIN” operations, 3) mismatch of column types in conditional clauses, 4) absence or improper usage of “GROUP BY” clauses. We also emphasize that our tool is extensible and can be easily adapted to detect user-defined constraints in real-world scenarios by analyzing parameters in function calls. For example, in real-world scenarios where there are specific requirements on column data processing, the tool can be expanded to detect whether “NULL” values have been excluded or whether columns with specific formats have been appropriately processed.
SQL Generation
In the final stage, we generate the SQL query using the corrected action sequences. We use the Python Interpreter to execute these function calls and extract the main components of the SQL query. For the missing logical operators ‘AND’ and ‘OR’ in ‘WHERE’ or ‘HAVING’ clauses, which are not included in the action sequences, we rely on the LLM to predict them. With all the components in place, we can then assemble the complete SQL query.
Spider-MismatchDataset
Dataset Construction
User questions in real-world scenarios exhibit diversity and may differ significantly from database content, especially for cells that cannot all be seen in the context of LLMs, which is not well highlighted by existing datasets. Spider (Yu etal. 2018) is a widely used benchmark for evaluating models’ generalization ability across domains. Since it was built before the advent of LLMs, the utterances used in Spider questions closely resemble their paired SQL queries, i.e. the column and the values mentioned in the question are almost the same as those in gold SQL queries. Some works address the deficiencies of Spider and derive new datasets from the Spider validation set. Spider-SYN (Gan etal. 2021) modifies the schema-related utterances with corresponding synonyms. Spider-DK (Gan, Chen, and Purver 2021) incorporates domain knowledge reflecting real-world question narratives. Spider-Realistic (Deng etal. 2021) paraphrases or removes explicit mentions of column names. The Bird dataset was constructed after the emergence of LLMs and focuses on more complex database content, external knowledge reasoning between problems and databases, and SQL query efficiency.In these datasets, values mentioned in questions are typically highlighted with initial capitalization or within quotation marks. The differences between the mention of value in questions and the ground-truth cells in databases are minimal. In most cases, they are identical, which is unlikely in real-world scenarios. In a few cases, the LLMs can infer the correct cell statements by simply seeing the first three rows of tables. Therefore, existing approaches do not account for the differences between values mentioned in questions and cells in databases, most approaches simply use the strategy of looking at a few cell examples.
Disturbance Example Original question When did the episode “A Love of a Lifetime” air? Remove column When did the episode “A Love of a Lifetime” air? Remove highlight When did “A Love of a Lifetime” a love of lifetime air? Replace common value When did a love of lifetime air double down air?
In light of the differences between value mentions in questions and cells in databases brought by the diversity of user questions in real-world scenarios, we introduce Spider-Mismatch, a new realistic evaluation set derived from Spider, Spider-SYN, and Spider-DK. We first extract examples from these datasets which contain string-type cells in gold SQL queries, and remove duplicates and simple examples. We then manually modify both the questions and the gold SQL queries to add disturbances that reflect the real-world scenarios. The disturbances adopted are illustrated inTable 1. While these disturbances do not introduce new SQL structures, they are expected to increase the difficulty for models in generating correct conditional clauses.
Condition Post-processing Module
Since existing LLM Text-to-SQL methods do not have specialized handling for values in SQL queries, generating correct conditional clauses can be challenging. To address this, we propose a simple module called Condition Post-processing. This module extracts value mentions from the predicted SQL and replaces each value with the most similar cell (retrieved using SimCSE) in the corresponding column. The condition post-process module is applied to all the methods in our experiments to provide a fair comparison.
Experiments
Experiment Setup
Dataset
In addition to our dataset, we also evaluate our framework on the popular benchmark Spider (Yu etal. 2018) and Spider-Realistic (Deng etal. 2021).
- •
Spider is a large-scale dataset across 200 databases from 138 domains, aiming to assess model generalization on the unfamiliar database schema. It contains 8,659 examples in the training set, 1,034 examples in the development set, and 2,147 examples in the test set.
- •
Spider-Realistic is a variant dataset of Spider, which contains 508 examples derived from the Spider development set. It disturbs the natural language questions by removing or paraphrasing the explicit mentions of column names in these questions to make them more realistic.
LLM
Following previous works, we use two publicly accessible LLMs: ChatGPT (gpt-3.5-turbo-0125) and GPT-4 (gpt-4). In all experiments, we obtain model results using the official API with the temperature set to zero to ensure stable output, and max_tokens set to 300. All other parameters of API are kept at their default settings.
Evaluation Metrics
We use execution accuracy (EX) and exact match accuracy (EM), two commonly used evaluation metrics in the Text-to-SQL task, to evaluate the performance of our framework. Execution Accuracy (EX) evaluates whether the execution results of a predicted SQL query are identical to those of the corresponding gold query. Exact Match Accuracy (EM) requires each component of a predicted SQL query to match exactly with the gold query, though it ignores differences in the values within the SQL queries. However, since multiple correct SQL queries can exist for a given question, the EM metric may mark some correct SQL queries as incorrect. Therefore, we use execution accuracy as our primary evaluation metric. Following previous work, we utilize the evaluation script proposed by Zhong, Yu, and Klein (2020).
Baseline
For Spider and Spider-Realistic, we mainly choose the few-shot SOTA methods as baselines to ensure a fair comparison. Few-shot methods use only a few static examples in the context of LLMs, which are different from the demonstration selection methods that select examples from the entire training set. For Spider-Mismatch, we compare our method with the following few-shot baselines:
- •
DIN-SQL: A multi-stage method that employs a self-correction approach to refine SQL queries.
- •
MAC-SQL: A multi-agent collaboration method that refines SQL based on feedback from the DBMS.
- •
ACT-SQL: A single-stage method that introduces the Chain-of-Thought paradigm for SQL generation, which achieves excellent results on the Spider-Realistic dataset compared to other methods using ChatGPT.
Main results
Results on Spider and Spider-Realistic
As shown in Table 2, Tool-SQLachieves the highest execution accuracy on the Spider development set and the average results of the Spider development and test set. On Spider-Realistic, we achieved a larger performance gap compared to the baselines. As shown in Table 3, Tool-SQL+ GPT-4 outperforms other methods that perform well on Spider by at least 4.8%, indicating that our method is more effective at handling column disturbances in Spider-Realistic. The consistent performance of Tool-SQLacross both Spider and Spider-Realistic demonstrates its robustness in addressing challenges across different scenarios.
Method Dev Test Avg ChatGPT (Ouyang etal. 2022) 74.4 - - GPT-4 (OpenAI 2023) 72.3 - - C3 + ChatGPT (Dong etal. 2023) 81.2 82.3 81.9 ACT-SQL + ChatGPT (Zhang etal. 2023) 80.4 - - ACT-SQL + GPT-4 (Zhang etal. 2023) 82.9 - - DIN-SQL + GPT-4 (Pourreza and Rafiei 2024) 82.8 85.3 84.5 DAIL-SQL + GPT-4 (Gao etal. 2023) 83.1 86.2 85.2 DAIL-SQL + GPT-4 + SC (Gao etal. 2023) 83.6 86.6 85.6 MAC-SQL + GPT-4 (Wang etal. 2024) 86.8 82.8 84.1 Tool-SQL+ GPT-4 86.9 85.6 86.0
Method | EX |
---|---|
C3 + ChatGPT (Dong etal. 2023) | 75.4 |
ACT-SQL + ChatGPT (Zhang etal. 2023) | 75.8 |
DIN-SQL + GPT-4 (Pourreza and Rafiei 2024) | 78.1 |
DAIL-SQL + GPT-4 (Gao etal. 2023) | 75.6 |
DAIL-SQL + GPT-4 + SC (Gao etal. 2023) | 75.2 |
Tool-SQL+ ChatGPT | 76.8 |
Tool-SQL+ GPT-4 | 82.9 |
Results on Spider-Mismatch
Table 4 presents the results on Spider-Mismatch. Under the few-shot setting, Tool-SQLexceeds the baselines by 9.6% and 7.1% using ChatGPT and GPT-4 as agents, respectively. Spider-Mismatchtargets user questions in real-world scenarios, focusing on the mismatch of conditions. Existing methods fail to extract sufficient value information from tables, making it challenging for them to generate accurate conditional clauses. Tool-SQLmaintains high performance on Spider-Mismatch, indicating that our tools significantly enhance the capability of the LLM-based agents to handle real-world questions.
Method | EX |
---|---|
ACT-SQL + ChatGPT (Zhang etal. 2023) | 65.4 |
DIN-SQL + ChatGPT (Pourreza and Rafiei 2024) | 63.5 |
MAC-SQL + ChatGPT (Wang etal. 2024) | 64.7 |
ACT-SQL + GPT-4 (Zhang etal. 2023) | 73.1 |
DIN-SQL + GPT-4 (Pourreza and Rafiei 2024) | 78.2 |
MAC-SQL + GPT-4 (Wang etal. 2024) | 74.4 |
Tool-SQL+ ChatGPT | 75.0 |
Tool-SQL+ GPT-4 | 85.3 |
Ablation Study
To evaluate the effectiveness of each verification tool in our framework, we conducted an ablation study of each tool on Spider-Mismatch, where the questions are more reflective of real-world scenarios. As shown in Figure 4, excluding each tool gradually decreases the performance of the LLM agent.When the Database Retriever is removed, the execution accuracy of ChatGPT and GPT-4 decreases by 4.1% and 3.2% respectively, indicating that the ambiguity of user questions makes it difficult for LLMs to generate correct SQL conditional clauses. This issue is likely to be more pronounced in real-world scenarios, where the presence of many similar cells in the database complicates the selection of the correct cell by condition post-processing methods.Moreover, the blind application of condition post-processing methods can lead to inaccurate execution results, especially for user questions that cannot be answered by the database, highlighting the necessity of exploring cells within the database.Further degradation in LLM performance is observed when error detection tools are also removed, with ChatGPT showing a greater decrease. This suggests that weaker LLMs are more prone to errors without the help of the error detector.
In addition, we replaced our error detector with database verification (corresponding to “w/ DBMS” in Figure 4). Compared to our error detector, the execution accuracy of ChatGPT and GPT-4 decreased by 1.6% and 1.0% respectively when obtaining feedback from the databases. This highlights the effectiveness of our error detector in inspecting stricter constraints for SQL queries while diagnosing the execution errors in SQL queries.
Discussion
Analysis of Refinement Effectiveness
Figure 5 presents the impact of the maximum number of iterations in the refinement process on Spider-Mismatchusing ChatGPT and GPT-4 as the agents. The results indicate that most errors in SQL queries can be solved with a single correction, mainly addressing execution errors and stricter constraint mismatches. However, there are still many errors that require multiple iterations to refine, most of which are for the refinement of conditional clauses. This suggests that LLM-based agents may need several attempts to find the correct conditions when faced with challenging user questions in real-world scenarios.
According to the results on Spider-Mismatch, the average number of iterations required for the refinement process using Tool-SQL+ ChatGPT is 0.74, while for Tool-SQL+ GPT-4, the average is 0.44. Our method only initiates refinement when an error is detected, thereby avoiding the additional cost of introducing static processes.
Analysis of Condition Post-processing Module
We also explored the impact of removing the condition post-processing module on the results. As shown in Table 5, the performance of all baseline methods significantly decreases after removing the post-processing module, highlighting the discrepancy between the values predicted by the LLM and the ground-truth cells in the database.Since our method assists LLM-based agents generate correct conditional clauses through the database detector, removing the post-processing module does not lead to a performance decline. On the contrary, introducing the module can lead to incorrect answers for the no-answer questions in real-world scenarios because the post-processing module forces the value in the conditional clauses to be replaced by the most similar cells.
Method | Post-Process | ChatGPT | GPT-4 |
---|---|---|---|
ACT-SQL | ✓ | 65.4 | 73.1 |
✗ | 51.9 (13.5 ) | 55.8 (17.3 ) | |
DIN-SQL | ✓ | 63.5 | 78.2 |
✗ | 33.3 (30.2 ) | 37.8 (40.4 ) | |
MAC-SQL | ✓ | 64.7 | 74.4 |
✗ | 56.1 (8.6 ) | 58.7 (15.7 ) | |
Tool-SQL | ✓ | 75.0 | 85.3 |
(Ours) | ✗ | 75.3 (0.3 ) | 85.9 (0.6 ) |
Conclusion
In this paper, we proposed the Tool-SQLframework designed for SQL generation in more realistic scenarios. This framework focuses on using an LLM-based agent to refine SQL queries with targeted feedback from various tools to inspect specific problems in SQL queries.We designed a database retriever and an error detector to address potential database mismatch problems that are common in real-world scenarios.The averaged experimental result on the Spider dataset and the Spider-Realistic dataset demonstrate that our method achieves the highest performance in few-shot settings. Additionally, thorough experiments on Spider-Mismatchdemonstrate that our method maintains high performance despite more realistic disturbances, which illustrates the effectiveness of our method in enhancing the SQL generation capabilities of LLM in real-world scenarios.
References
- Cen etal. (2024)Cen, J.; Liu, J.; Li, Z.; and Wang, J. 2024.SQLFixAgent: Towards Semantic-Accurate SQL Generation via Multi-Agent Collaboration.arXiv preprint arXiv:2406.13408.
- Deng etal. (2021)Deng, X.; Hassan, A.; Meek, C.; Polozov, O.; Sun, H.; and Richardson, M. 2021.Structure-Grounded Pretraining for Text-to-SQL.In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1337–1350.
- Dong etal. (2023)Dong, X.; Zhang, C.; Ge, Y.; Mao, Y.; Gao, Y.; Lin, J.; Lou, D.; etal. 2023.C3: Zero-shot Text-to-SQL with ChatGPT.arXiv preprint arXiv:2307.07306.
- Gan etal. (2021)Gan, Y.; Chen, X.; Huang, Q.; Purver, M.; Woodward, J.R.; Xie, J.; and Huang, P. 2021.Towards Robustness of Text-to-SQL Models against Synonym Substitution.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2505–2515.
- Gan, Chen, and Purver (2021)Gan, Y.; Chen, X.; and Purver, M. 2021.Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization.In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 8926–8931.
- Gao etal. (2023)Gao, D.; Wang, H.; Li, Y.; Sun, X.; Qian, Y.; Ding, B.; and Zhou, J. 2023.Text-to-sql empowered by large language models: A benchmark evaluation.arXiv preprint arXiv:2308.15363.
- Gao etal. (2024)Gao, D.; Wang, H.; Li, Y.; Sun, X.; Qian, Y.; Ding, B.; and Zhou, J. 2024.Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation.Proceedings of the VLDB Endowment, 17(5): 1132–1145.
- Gao, Yao, and Chen (2021)Gao, T.; Yao, X.; and Chen, D. 2021.SimCSE: Simple Contrastive Learning of Sentence Embeddings.In Moens, M.-F.; Huang, X.; Specia, L.; and Yih, S. W.-t., eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 6894–6910. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Gu etal. (2024)Gu, Y.; Shu, Y.; Yu, H.; Liu, X.; Dong, Y.; Tang, J.; Srinivasa, J.; Latapie, H.; and Su, Y. 2024.Middleware for llms: Tools are instrumental for language agents in complex environments.arXiv preprint arXiv:2402.14672.
- Jiang etal. (2023)Jiang, J.; Zhou, K.; Dong, Z.; Ye, K.; Zhao, X.; and Wen, J.-R. 2023.StructGPT: A General Framework for Large Language Model to Reason over Structured Data.In Bouamor, H.; Pino, J.; and Bali, K., eds., Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9237–9251. Singapore: Association for Computational Linguistics.
- Li etal. (2024)Li, J.; Hui, B.; Qu, G.; Yang, J.; Li, B.; Li, B.; Wang, B.; Qin, B.; Geng, R.; Huo, N.; etal. 2024.Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36.
- Li etal. (2023)Li, M.; Zhao, Y.; Yu, B.; Song, F.; Li, H.; Yu, H.; Li, Z.; Huang, F.; and Li, Y. 2023.API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 3102–3116.
- OpenAI (2023)OpenAI. 2023.GPT-4 Technical Report.arXiv preprint arXiv:2303.08774.
- Ouyang etal. (2022)Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; etal. 2022.Training language models to follow instructions with human feedback.In NeurIPS.
- Pourreza and Rafiei (2024)Pourreza, M.; and Rafiei, D. 2024.Din-sql: Decomposed in-context learning of text-to-sql with self-correction.Advances in Neural Information Processing Systems, 36.
- Qin etal. (2024)Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; Zhao, S.; Hong, L.; Tian, R.; Xie, R.; Zhou, J.; Gerstein, M.; dahai li; Liu, Z.; and Sun, M. 2024.ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.In The Twelfth International Conference on Learning Representations.
- Scholak, Schucher, and Bahdanau (2021)Scholak, T.; Schucher, N.; and Bahdanau, D. 2021.PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models.In Moens, M.-F.; Huang, X.; Specia, L.; and Yih, S. W.-t., eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 9895–9901. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Tai etal. (2023)Tai, C.-Y.; Chen, Z.; Zhang, T.; Deng, X.; and Sun, H. 2023.Exploring Chain of Thought Style Prompting for Text-to-SQL.In Bouamor, H.; Pino, J.; and Bali, K., eds., Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 5376–5393. Singapore: Association for Computational Linguistics.
- Team (2023)Team, A. 2023.AutoGPT: build and use AI agents.
- Wang etal. (2024)Wang, B.; Ren, C.; Yang, J.; Liang, X.; Bai, J.; Chai, L.; Yan, Z.; Zhang, Q.-W.; Yin, D.; Sun, X.; and Li, Z. 2024.MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL.arXiv:2312.11242.
- Wang etal. (2020)Wang, B.; Shin, R.; Liu, X.; Polozov, O.; and Richardson, M. 2020.RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers.In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7567–7578.
- Xie etal. (2023)Xie, T.; Zhou, F.; Cheng, Z.; Shi, P.; Weng, L.; Liu, Y.; Hua, T.J.; Zhao, J.; Liu, Q.; Liu, C.; etal. 2023.Openagents: An open platform for language agents in the wild.arXiv preprint arXiv:2310.10634.
- Xie etal. (2024)Xie, Y.; Jin, X.; Xie, T.; Lin, M.; Chen, L.; Yu, C.; Cheng, L.; Zhuo, C.; Hu, B.; and Li, Z. 2024.Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm.arXiv preprint arXiv:2402.10671.
- Yu etal. (2018)Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; etal. 2018.Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task.In 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, 3911–3921. Association for Computational Linguistics.
- Zhang etal. (2023)Zhang, H.; Cao, R.; Chen, L.; Xu, H.; and Yu, K. 2023.ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought.arXiv preprint arXiv:2310.17342.
- Zhong, Yu, and Klein (2020)Zhong, R.; Yu, T.; and Klein, D. 2020.Semantic Evaluation for Text-to-SQL with Distilled Test Suite.In The 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Zhong, Xiong, and Socher (2017)Zhong, V.; Xiong, C.; and Socher, R. 2017.Seq2sql: Generating structured queries from natural language using reinforcement learning.arXiv preprint arXiv:1709.00103.