GRAST-SQL: Scaling Text-to-SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers
GRAST-SQL is a lightweight, open-source schema-filtering framework that scales Text-to-SQL to real-world, very wide schemas by compacting prompts without sacrificing accuracy. It ranks columns with a query-aware LLM encoder enriched by values/metadata, reranks them via a graph transformer over a functional-dependency (FD) graph to capture inter-column structure, and then guarantees joinability with a Steiner-tree spanner to produce a small, connected sub-schema. This approach delivers near-perfect recall with substantially higher precision and maintains sub-second median latency while scaling to schemas with 23,000+ columns.
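The final connectivity step can be illustrated with a small, self-contained sketch. It is not the GRAST-SQL implementation: it assumes an unweighted, undirected column graph and uses a simple greedy heuristic (repeatedly attach the nearest unconnected selected column via a shortest path) as a stand-in for a Steiner-tree solver. The names `build_adjacency`, `connect_columns`, and the toy `EDGES` schema are hypothetical.

```python
from collections import deque

# Toy column graph (hypothetical): edges link keys to attributes and
# foreign keys to the keys they reference.
EDGES = [
    ("singer.singer_id", "singer.name"),
    ("singer.singer_id", "singer_in_concert.singer_id"),
    ("singer_in_concert.singer_id", "singer_in_concert.concert_id"),
    ("singer_in_concert.concert_id", "concert.concert_id"),
    ("concert.concert_id", "concert.concert_name"),
    ("concert.concert_id", "concert.year"),
]


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def connect_columns(adjacency, selected):
    """Return a connected node set that covers the selected columns.

    Greedy approximation: grow a tree from one selected column and keep
    attaching the closest remaining selected column via a BFS shortest path.
    """
    terminals = [c for c in selected if c in adjacency]
    if not terminals:
        return set()
    tree = {terminals[0]}
    remaining = set(terminals) - tree
    while remaining:
        # Multi-source BFS starting from every node already in the tree.
        parent = {node: None for node in tree}
        queue = deque(tree)
        hit = None
        while queue:
            node = queue.popleft()
            if node in remaining:
                hit = node
                break
            for neighbor in adjacency[node]:
                if neighbor not in parent:
                    parent[neighbor] = node
                    queue.append(neighbor)
        if hit is None:
            break  # the remaining columns are unreachable from the tree
        # Walk back to the tree, adding the bridging columns on the way.
        node = hit
        while node is not None:
            tree.add(node)
            node = parent[node]
        remaining -= tree
    return tree


sub_schema = connect_columns(
    build_adjacency(EDGES), ["singer.name", "concert.concert_name"]
)
print(sorted(sub_schema))
```

In this toy schema the two selected columns live in different tables, so the sketch pulls in the key columns that bridge them and leaves out unrelated columns such as `concert.year`.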
This model was presented in the paper: Scaling Text2SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers.
For more details, code, and further usage instructions, please visit the official GitHub repository.
Sample Usage
To apply GRAST-SQL to your own database and filter the most relevant columns for a given question, follow these two simple steps. Ensure your environment is set up as described in the GitHub repository.
Step 1: Initialize (ONE-TIME per database) - Functional Dependency Graph Construction & Metadata Completion
Extract schema information, generate table/column meanings, predict missing keys, and build the functional dependency graph. Make sure your OpenAI API key is set in .env if you are using an OpenAI model for meaning generation.
python init_schema.py \
--db-path /path/to/your/database.sqlite \
--output your_database.pkl \
--model gpt-4.1-mini
Arguments:
- --db-path: Path to your SQLite database file (required)
- --output: Output path for the graph pickle file (default: schema_graph.pkl)
- --model: OpenAI model to use for meaning generation and key prediction (default: gpt-4.1-mini)
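To make the idea of a functional dependency graph more concrete, the following is a minimal sketch that derives key-based dependency edges from a SQLite database using only the standard library. It is an illustration under stated assumptions, not the logic of init_schema.py: it reads only declared primary and foreign keys (no LLM-based key prediction or meaning generation), and the edge convention and the function name `build_fd_edges` are hypothetical.

```python
import sqlite3


def build_fd_edges(conn):
    """Collect (determinant, dependent) column pairs from declared keys.

    Assumed convention for this sketch:
      - a single-column primary key determines every other column in its table;
      - a foreign-key column points to the column it references.
    """
    edges = set()
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        pk_cols = [col[1] for col in columns if col[5] > 0]
        if len(pk_cols) == 1:
            for col in columns:
                if col[1] != pk_cols[0]:
                    edges.add((f"{table}.{pk_cols[0]}", f"{table}.{col[1]}"))
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall():
            if fk[4] is not None:  # "to" is NULL when the target column is implicit
                edges.add((f"{table}.{fk[3]}", f"{fk[2]}.{fk[4]}"))
    return edges


# Small in-memory example schema (hypothetical).
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE concert (
        concert_id INTEGER PRIMARY KEY,
        singer_id INTEGER REFERENCES singer(singer_id)
    );
    """
)
for determinant, dependent in sorted(build_fd_edges(demo)):
    print(f"{determinant} -> {dependent}")
```

Databases that lack declared keys would yield few or no edges with this sketch, which is the gap the key-prediction part of Step 1 is described as filling.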
Step 2: Filter Top-K Columns
Use the GRAST-SQL model to filter the most relevant columns for a given question:
python filter_columns.py \
--graph your_database.pkl \
--question "Show name, country, age for all singers ordered by age from the oldest to the youngest." \
--top-k 5
Arguments:
- --graph: Path to the graph pickle file from Step 1 (required)
- --question: Natural language question about the database (required)
- --top-k: Number of top columns to retrieve (default: 10)
- --checkpoint: Path to GNN checkpoint (default: griffith-bigdata/GRAST-SQL-0.6B-BIRD-Reranker/layer-3-hidden-2048.pt)
- --encoder-path: Path to encoder model (default: griffith-bigdata/GRAST-SQL-0.6B-BIRD-Reranker)
- --max-length: Maximum sequence length (default: 4096)
- --batch-size: Batch size for embedding generation (default: 32)
- --hidden-dim: Hidden dimension for GNN (default: 2048)
- --num-layers: Number of GNN layers (default: 3)
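When several questions target the same database, the Step 2 command can be driven from a script. The sketch below is one possible wrapper, using only the flags documented above. The helper names `build_filter_command` and `filter_many` are hypothetical, and because the output format of filter_columns.py is not described here, the wrapper simply returns the raw standard output for each question.

```python
import shlex
import subprocess
import sys


def build_filter_command(graph_path, question, top_k=10):
    """Build the argument list for one filter_columns.py call."""
    return [
        sys.executable,
        "filter_columns.py",
        "--graph", graph_path,
        "--question", question,
        "--top-k", str(top_k),
    ]


def filter_many(graph_path, questions, top_k=10):
    """Run filter_columns.py once per question and collect raw stdout.

    Assumes the script is in the current working directory and that the
    environment is already set up as described in the repository.
    """
    outputs = {}
    for question in questions:
        result = subprocess.run(
            build_filter_command(graph_path, question, top_k),
            capture_output=True,
            text=True,
            check=True,  # raise if the script exits with an error
        )
        outputs[question] = result.stdout
    return outputs


# Show the command that would be executed, without running it.
command = build_filter_command(
    "your_database.pkl", "How many singers are there?", top_k=5
)
print(shlex.join(command))
```

Passing the arguments as a list avoids shell-quoting problems with questions that contain quotes or other special characters.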
Citation
If you use GRAST-SQL in your research, please cite the following paper:
@misc{hoang2025scalingtext2sqlllmefficientschema,
title={Scaling Text2SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers},
author={Thanh Dat Hoang and Thanh Tam Nguyen and Thanh Trung Huynh and Hongzhi Yin and Quoc Viet Hung Nguyen},
year={2025},
eprint={2512.16083},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2512.16083},
}