GRAST-SQL: Scaling Text-to-SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers

GRAST-SQL is a lightweight, open-source schema-filtering framework that scales Text-to-SQL to very wide real-world schemas by compacting prompts without sacrificing accuracy. It works in three stages: it ranks columns with a query-aware LLM encoder enriched with values and metadata, reranks them with a graph transformer over a functional-dependency (FD) graph to capture inter-column structure, and then guarantees joinability with a Steiner-tree spanner, producing a small, connected sub-schema. The approach delivers near-perfect recall with substantially higher precision and maintains sub-second median latency while scaling to schemas with more than 23,000 columns.
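
To make the joinability stage more concrete, here is a minimal sketch of the general idea: given the tables that hold the top-ranked columns, add just enough bridge tables to keep the sub-schema connected. It is not the GRAST-SQL implementation. It uses a greedy shortest-path heuristic as a simple stand-in for a Steiner-tree computation, it works at table level rather than on the column-level FD graph, and the join graph and the names `shortest_path` and `connect_tables` are hypothetical.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns the node list from source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def connect_tables(graph, selected):
    """Greedily grow a connected set of tables that covers every selected table."""
    selected = sorted(selected)
    if not selected:
        return set()
    connected = {selected[0]}
    for table in selected[1:]:
        if table in connected:
            continue
        # Pick the shortest path from this table to any already-connected table.
        best = None
        for anchor in sorted(connected):
            path = shortest_path(graph, table, anchor)
            if path is not None and (best is None or len(path) < len(best)):
                best = path
        if best is None:
            raise ValueError(f"{table!r} cannot be joined to the rest of the selection")
        connected.update(best)
    return connected


# Hypothetical join graph: an edge means two tables share a key relationship.
join_graph = {
    "singer": {"singer_in_concert"},
    "singer_in_concert": {"singer", "concert"},
    "concert": {"singer_in_concert", "stadium"},
    "stadium": {"concert"},
}

# Only "singer" and "stadium" were selected; the bridge tables are added automatically.
sub_schema = connect_tables(join_graph, {"singer", "stadium"})
print(sorted(sub_schema))
```

In this toy graph the two selected tables are not directly joinable, so the sketch pulls in the intermediate tables that link them. A true Steiner-tree method aims to minimise the total size of that connecting structure rather than building it greedily.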

This model was presented in the paper: Scaling Text2SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers.

For more details, code, and further usage instructions, please visit the official GitHub repository.

Sample Usage

To filter the most relevant columns of your own database for a given question, first set up your environment as described in the GitHub repository, then follow the two steps below.

Step 1: Initialize (one-time per database): Functional Dependency Graph Construction and Metadata Completion

This step extracts schema information, generates table and column meanings, predicts missing keys, and builds the functional dependency graph. If you use an OpenAI model for meaning generation, make sure your OpenAI API key is set in .env.

python init_schema.py \
    --db-path /path/to/your/database.sqlite \
    --output your_database.pkl \
    --model gpt-4.1-mini

Arguments:

--db-path: Path to your SQLite database file (required)
--output: Output path for the graph pickle file (default: schema_graph.pkl)
--model: OpenAI model to use for meaning generation and key prediction (default: gpt-4.1-mini)
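
For intuition about the schema information this step starts from, the sketch below reads tables, columns, primary keys, and foreign keys from a SQLite database using only the standard `sqlite3` module, then derives key-based dependency edges. It is an illustrative assumption, not the logic of init_schema.py: the helper names `extract_schema` and `key_dependency_edges` are hypothetical, and it covers only declared keys, not LLM-generated meanings or predicted keys.

```python
import sqlite3


def extract_schema(conn):
    """Return {table: {"columns", "primary_key", "foreign_keys"}} for a SQLite connection."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        info = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute(f"PRAGMA foreign_key_list({quoted})").fetchall()
        schema[table] = {
            "columns": [row[1] for row in info],
            "primary_key": [row[1] for row in sorted(info, key=lambda r: r[5]) if row[5] > 0],
            "foreign_keys": [(row[3], row[2], row[4]) for row in fks],
        }
    return schema


def key_dependency_edges(schema):
    """Edges implied by single-column primary keys, plus declared foreign-key links."""
    edges = []
    for table, meta in schema.items():
        if len(meta["primary_key"]) == 1:
            key = meta["primary_key"][0]
            # A primary key functionally determines every other column of its table.
            edges += [(f"{table}.{key}", f"{table}.{col}") for col in meta["columns"] if col != key]
        for from_col, ref_table, to_col in meta["foreign_keys"]:
            edges.append((f"{table}.{from_col}", f"{ref_table}.{to_col}"))
    return edges


# Hypothetical two-table database, built in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT, age INTEGER);
    CREATE TABLE concert (concert_id INTEGER PRIMARY KEY,
                          singer_id INTEGER REFERENCES singer(singer_id));
""")
schema = extract_schema(conn)
edges = key_dependency_edges(schema)
for source, target in edges:
    print(f"{source} -> {target}")
```

Real databases often lack declared keys, which is presumably why Step 1 also predicts missing keys before building the graph.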

Step 2: Filter Top-K Columns

Use the GRAST-SQL model to filter the most relevant columns for a given question:

python filter_columns.py \
    --graph your_database.pkl \
    --question "Show name, country, age for all singers ordered by age from the oldest to the youngest." \
    --top-k 5

Arguments:

--graph: Path to the graph pickle file from Step 1 (required)
--question: Natural language question about the database (required)
--top-k: Number of top columns to retrieve (default: 10)
--checkpoint: Path to GNN checkpoint (default: griffith-bigdata/GRAST-SQL-0.6B-BIRD-Reranker/layer-3-hidden-2048.pt)
--encoder-path: Path to encoder model (default: griffith-bigdata/GRAST-SQL-0.6B-BIRD-Reranker)
--max-length: Maximum sequence length (default: 4096)
--batch-size: Batch size for embedding generation (default: 32)
--hidden-dim: Hidden dimension for GNN (default: 2048)
--num-layers: Number of GNN layers (default: 3)
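
If you need to filter columns for several questions against the same graph, one option is a small Python wrapper around the command above. The sketch below uses only the flags documented in this section. The helper names are hypothetical, it assumes you run it from the directory that contains filter_columns.py, and it returns the script's raw standard output because the output format is not specified here.

```python
import shlex
import subprocess
import sys


def build_filter_command(graph_path, question, top_k=10):
    """Assemble the filter_columns.py invocation from the documented flags."""
    return [
        sys.executable, "filter_columns.py",
        "--graph", graph_path,
        "--question", question,
        "--top-k", str(top_k),
    ]


def filter_columns(graph_path, question, top_k=10):
    """Run the script and return its raw stdout; raises CalledProcessError on failure."""
    command = build_filter_command(graph_path, question, top_k)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


questions = [
    "Show name, country, age for all singers ordered by age from the oldest to the youngest.",
    "How many singers are there?",
]

# Print the commands that would be executed; call filter_columns() to actually run them.
for question in questions:
    print(shlex.join(build_filter_command("your_database.pkl", question, top_k=5)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a question contains quotes or other special characters.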

Citation

If you use GRAST-SQL in your research, please cite the following paper:

@misc{hoang2025scalingtext2sqlllmefficientschema,
      title={Scaling Text2SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers}, 
      author={Thanh Dat Hoang and Thanh Tam Nguyen and Thanh Trung Huynh and Hongzhi Yin and Quoc Viet Hung Nguyen},
      year={2025},
      eprint={2512.16083},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2512.16083}, 
}

Model size: 0.6B params (Safetensors, tensor type F32)

Paper: Scaling Text2SQL via LLM-efficient Schema Filtering with Functional Dependency Graph Rerankers (arXiv 2512.16083, published Dec 18, 2025)