nl2sql-copilot / README.md
Melika Kheirieh
feat(api): expose executor result in NL2SQLResponse and return full result to UI
3ef53b4
|
raw
history blame
8.09 kB
---
title: NL2SQL Copilot โ€” Full Stack Demo
emoji: ๐Ÿงฉ
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
---
# ๐Ÿงฉ **NL2SQL Copilot โ€” Natural-Language โ†’ Safe SQL**
[![CI](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml/badge.svg)](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml)
[![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
**Modular Text-to-SQL Copilot built with FastAPI & Pydantic-AI.**
Generates *safe, verified, executable SQL* through a multi-stage agentic pipeline.
Includes: schema introspection, self-repair, Spider benchmarks, Prometheus metrics, and a full demo UI.
๐Ÿš€ **Live Demo:**
๐Ÿ‘‰ **[https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot](https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot)**
---
# **1) Quick Start**
```bash
git clone https://github.com/melika-kheirieh/nl2sql-copilot
cd nl2sql-copilot
make setup # install dependencies
make run # start API + Gradio UI
```
Open:
* [http://localhost:8000](http://localhost:8000) (FastAPI Swagger UI)
* [http://localhost:7860](http://localhost:7860) (Gradio Demo)
---
# **2) Demo (Gradio UI)**
The demo supports:
* Uploading any SQLite database
* Asking natural-language questions
* Viewing generated SQL
* Viewing query results
* Full multi-stage trace (detector โ†’ planner โ†’ generator โ†’ safety โ†’ executor โ†’ verifier โ†’ repair)
* Per-stage timings
* Example queries
* And a default demo DB (no upload required)
Everything runs on the same backend as the API.
---
# **3) Agentic Architecture**
```
user query
โ†“
detector (ambiguity, missing info)
planner (schema reasoning + task decomposition)
generator (SQL generation)
safety (SELECT-only validation)
executor (sandboxed DB execution)
verifier (semantic + execution checks)
repair (minimal-diff SQL repair loop)
โ†“
final SQL + result + traces
```
### โš™๏ธ Tech Stack
* FastAPI
* Pydantic-AI
* SQLiteAdapter
* Prometheus + Grafana
* pytest + mypy + Makefile
* Gradio UI
The pipeline is fully modular: each stage has a clean, swappable interface.
---
# **4) Evolution (Prototype โ†’ Copilot)**
This project is the **second-generation, production-grade** version of an earlier prototype:
๐Ÿ‘‰ [https://github.com/melika-kheirieh/nl2sql-copilot-prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype)
The prototype explored single-step, prompt-based SQL generation.
The current version is a **complete architectural redesign**, adding:
* multi-stage agentic pipeline
* schema introspection
* safety guardrails
* self-repair loop
* caching
* observability
* Spider benchmarks
* multi-DB support with upload + TTL handling
This repository is the first **end-to-end, production-oriented** version.
---
# **5) Key Features**
### โœ” Agentic Pipeline
Planner โ†’ Generator โ†’ Safety โ†’ Executor โ†’ Verifier โ†’ Repair.
### โœ” Schema-Aware
Automatic schema preview for any uploaded SQLite database.
### โœ” Safety by Design
* SELECT-only
* Column/table validation
* No multi-statement SQL
* Prevents schema hallucination
### โœ” Self-Repair
Automatic minimal-diff correction when SQL fails.
### โœ” Caching
TTL-based, with key = (db_id, normalized_query, schema_hash).
Hit/miss metrics included.
### โœ” Observability
* Per-stage latency
* Pipeline success ratio
* Repair success rate
* Cache hit ratio
* p95 latency
* Full Grafana dashboard
### โœ” Benchmarks
Reproducible Spider evaluation with plots + summary.
---
# **6) Benchmarks (Spider dev, 20 samples)**
[![Benchmarks](https://img.shields.io/badge/Benchmarks-Spider%20dev-blue)](#benchmarks)
Evaluated on a curated 20-sample subset of the Spider **dev** split
(focused on `concert_singer`), using the full production pipeline.
### ๐Ÿงฎ Summary
* **Total samples:** 20
* **Successful runs:** 20/20 (**100%**)
* **Exact Match (EM):** **0.10**
* **Structural Match (SM):** **0.70**
* **Execution Accuracy:** **0.725**
This reflects a *production-oriented* NL2SQL system:
the model optimizes for **executable SQL**, not literal gold-string alignment.
---
### โฑ Latency
* **Avg latency:** ~**8066 ms**
* **p50:** ~**9229 ms**
* **p95:** ~**14936 ms**
Latency is **bimodal**:
simple queries โ†’ fast, reasoning-heavy queries โ†’ planner-dominated.
---
### โš™๏ธ Per-Stage Latency
| Stage | Avg latency (ms) |
| --------- | ---------------- |
| detector | ~1 |
| planner | ~8360 |
| generator | ~1645 |
| safety | ~2 |
| executor | ~1 |
| verifier | ~1 |
| repair | ~1200 |
Planner is the main bottleneck (expected for schema-level reasoning).
Safety/executor/verifier stay **single-digit ms**.
---
### โŒ Failure Modes (Why EM is low)
Even when EM = 0, **SM and ExecAcc are often 1.0**.
Typical causes:
* Capitalization differences (`Age` vs `age`)
* Different column ordering
* LIMIT differences
* Alias mismatch
* Gold SQL is `EMPTY` but the model infers a valid SQL
In real-world systems, **execution correctness matters more than exact string match**.
---
### ๐Ÿ“‚ Reproducing the Benchmark
```bash
export SPIDER_ROOT="$PWD/data/spider"
PYTHONPATH=$PWD \
python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20 --debug
PYTHONPATH=$PWD \
python benchmarks/plot_results.py
```
Artifacts saved under:
```
benchmarks/results_pro/<timestamp>/
summary.json
eval.jsonl
metrics_overview.png
latency_histogram.png
latency_per_stage.png
errors_overview.png
```
---
# **7) API Usage**
## ๐Ÿ” NL โ†’ SQL
```bash
curl -X POST "http://localhost:8000/api/v1/nl2sql" \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key" \
-d '{
"query": "Top 5 customers by total invoice amount",
"db_id": null
}'
```
### โœ” Sample Response (accurate)
```json
{
"ambiguous": false,
"sql": "SELECT ...",
"rationale": "Explanation of why this SQL was generated.",
"result": {
"rows": 5,
"columns": ["CustomerId", "Total"],
"rows_data": [
[1, 39.6],
[2, 38.7],
[3, 35.4]
]
},
"traces": [
{"stage": "detector", "duration_ms": 1},
{"stage": "planner", "duration_ms": 8943},
{"stage": "generator","duration_ms": 1722},
{"stage": "safety", "duration_ms": 2},
{"stage": "executor", "duration_ms": 1},
{"stage": "verifier", "duration_ms": 1},
{"stage": "repair", "duration_ms": 522}
]
}
```
---
## ๐Ÿ“ Upload a SQLite DB
```bash
curl -X POST "http://localhost:8000/api/v1/nl2sql/upload_db" \
-H "X-API-Key: dev-key" \
-F "file=@/path/to/db.sqlite"
```
---
## ๐Ÿ“‘ Schema Preview
```bash
curl "http://localhost:8000/api/v1/nl2sql/schema?db_id=<uuid>" \
-H "X-API-Key: dev-key"
```
---
# **8) Environment Variables**
| Variable | Purpose |
| ---------------------- | ---------------------------------------- |
| `API_KEYS` | Comma-separated list of backend API keys |
| `API_KEY` | Used by Gradio UI to call the backend |
| `DEV_MODE` | Enables strict ambiguity detection |
| `NL2SQL_CACHE_TTL_SEC` | Cache TTL |
| `NL2SQL_CACHE_MAX` | Max cache entries |
| `SPIDER_ROOT` | Path to Spider dataset |
| `USE_MOCK` | Skip execution (for testing) |
> Gradio uses `API_KEY` โ†’ backend expects it as `X-API-Key`.
> Backend accepts multiple keys via `API_KEYS`.
---
# **9) Future Work**
### 1) Streaming SQL Generation (SSE)
### 2) Redis Distributed Cache
### 3) Multi-Model Planner/Generator
### 4) A/B Testing Framework
### 5) Schema Embeddings
### 6) Nightly CI Benchmarks
### 7) Advanced Repair (diff-based)
### 8) Helm / Compose Deployment Template
---
# **10) License**
MIT License.