Spaces:
Sleeping
Sleeping
Melika Kheirieh
feat(api): expose executor result in NL2SQLResponse and return full result to UI
3ef53b4
| title: NL2SQL Copilot โ Full Stack Demo | |
| emoji: ๐งฉ | |
| colorFrom: indigo | |
| colorTo: blue | |
| sdk: docker | |
| pinned: false | |
| # ๐งฉ **NL2SQL Copilot โ Natural-Language โ Safe SQL** | |
| [](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml) | |
| [](#) | |
| [](LICENSE) | |
| **Modular Text-to-SQL Copilot built with FastAPI & Pydantic-AI.** | |
| Generates *safe, verified, executable SQL* through a multi-stage agentic pipeline. | |
| Includes: schema introspection, self-repair, Spider benchmarks, Prometheus metrics, and a full demo UI. | |
| ๐ **Live Demo:** | |
| ๐ **[https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot](https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot)** | |
| --- | |
| # **1) Quick Start** | |
| ```bash | |
| git clone https://github.com/melika-kheirieh/nl2sql-copilot | |
| cd nl2sql-copilot | |
| make setup # install dependencies | |
| make run # start API + Gradio UI | |
| ``` | |
| Open: | |
| * [http://localhost:8000](http://localhost:8000) (FastAPI Swagger UI) | |
| * [http://localhost:7860](http://localhost:7860) (Gradio Demo) | |
| --- | |
| # **2) Demo (Gradio UI)** | |
| The demo supports: | |
| * Uploading any SQLite database | |
| * Asking natural-language questions | |
| * Viewing generated SQL | |
| * Viewing query results | |
| * Full multi-stage trace (detector โ planner โ generator โ safety โ executor โ verifier โ repair) | |
| * Per-stage timings | |
| * Example queries | |
| * And a default demo DB (no upload required) | |
| Everything runs on the same backend as the API. | |
| --- | |
| # **3) Agentic Architecture** | |
| ``` | |
| user query | |
| โ | |
| detector (ambiguity, missing info) | |
| planner (schema reasoning + task decomposition) | |
| generator (SQL generation) | |
| safety (SELECT-only validation) | |
| executor (sandboxed DB execution) | |
| verifier (semantic + execution checks) | |
| repair (minimal-diff SQL repair loop) | |
| โ | |
| final SQL + result + traces | |
| ``` | |
| ### โ๏ธ Tech Stack | |
| * FastAPI | |
| * Pydantic-AI | |
| * SQLiteAdapter | |
| * Prometheus + Grafana | |
| * pytest + mypy + Makefile | |
| * Gradio UI | |
| The pipeline is fully modular: each stage has a clean, swappable interface. | |
| --- | |
| # **4) Evolution (Prototype โ Copilot)** | |
| This project is the **second-generation, production-grade** version of an earlier prototype: | |
| ๐ [https://github.com/melika-kheirieh/nl2sql-copilot-prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype) | |
| The prototype explored single-step, prompt-based SQL generation. | |
| The current version is a **complete architectural redesign**, adding: | |
| * multi-stage agentic pipeline | |
| * schema introspection | |
| * safety guardrails | |
| * self-repair loop | |
| * caching | |
| * observability | |
| * Spider benchmarks | |
| * multi-DB support with upload + TTL handling | |
| This repository is the first **end-to-end, production-oriented** version. | |
| --- | |
| # **5) Key Features** | |
| ### โ Agentic Pipeline | |
| Planner โ Generator โ Safety โ Executor โ Verifier โ Repair. | |
| ### โ Schema-Aware | |
| Automatic schema preview for any uploaded SQLite database. | |
| ### โ Safety by Design | |
| * SELECT-only | |
| * Column/table validation | |
| * No multi-statement SQL | |
| * Prevents schema hallucination | |
| ### โ Self-Repair | |
| Automatic minimal-diff correction when SQL fails. | |
| ### โ Caching | |
| TTL-based, with key = (db_id, normalized_query, schema_hash). | |
| Hit/miss metrics included. | |
| ### โ Observability | |
| * Per-stage latency | |
| * Pipeline success ratio | |
| * Repair success rate | |
| * Cache hit ratio | |
| * p95 latency | |
| * Full Grafana dashboard | |
| ### โ Benchmarks | |
| Reproducible Spider evaluation with plots + summary. | |
| --- | |
| # **6) Benchmarks (Spider dev, 20 samples)** | |
| [](#benchmarks) | |
| Evaluated on a curated 20-sample subset of the Spider **dev** split | |
| (focused on `concert_singer`), using the full production pipeline. | |
| ### ๐งฎ Summary | |
| * **Total samples:** 20 | |
| * **Successful runs:** 20/20 (**100%**) | |
| * **Exact Match (EM):** **0.10** | |
| * **Structural Match (SM):** **0.70** | |
| * **Execution Accuracy:** **0.725** | |
| This reflects a *production-oriented* NL2SQL system: | |
| the model optimizes for **executable SQL**, not literal gold-string alignment. | |
| --- | |
| ### โฑ Latency | |
| * **Avg latency:** ~**8066 ms** | |
| * **p50:** ~**9229 ms** | |
| * **p95:** ~**14936 ms** | |
| Latency is **bimodal**: | |
| simple queries โ fast, reasoning-heavy queries โ planner-dominated. | |
| --- | |
| ### โ๏ธ Per-Stage Latency | |
| | Stage | Avg latency (ms) | | |
| | --------- | ---------------- | | |
| | detector | ~1 | | |
| | planner | ~8360 | | |
| | generator | ~1645 | | |
| | safety | ~2 | | |
| | executor | ~1 | | |
| | verifier | ~1 | | |
| | repair | ~1200 | | |
| Planner is the main bottleneck (expected for schema-level reasoning). | |
| Safety/executor/verifier stay **single-digit ms**. | |
| --- | |
| ### โ Failure Modes (Why EM is low) | |
| Even when EM = 0, **SM and ExecAcc are often 1.0**. | |
| Typical causes: | |
| * Capitalization differences (`Age` vs `age`) | |
| * Different column ordering | |
| * LIMIT differences | |
| * Alias mismatch | |
| * Gold SQL is `EMPTY` but the model infers a valid SQL | |
| In real-world systems, **execution correctness matters more than exact string match**. | |
| --- | |
| ### ๐ Reproducing the Benchmark | |
| ```bash | |
| export SPIDER_ROOT="$PWD/data/spider" | |
| PYTHONPATH=$PWD \ | |
| python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20 --debug | |
| PYTHONPATH=$PWD \ | |
| python benchmarks/plot_results.py | |
| ``` | |
| Artifacts saved under: | |
| ``` | |
| benchmarks/results_pro/<timestamp>/ | |
| summary.json | |
| eval.jsonl | |
| metrics_overview.png | |
| latency_histogram.png | |
| latency_per_stage.png | |
| errors_overview.png | |
| ``` | |
| --- | |
| # **7) API Usage** | |
| ## ๐ NL โ SQL | |
| ```bash | |
| curl -X POST "http://localhost:8000/api/v1/nl2sql" \ | |
| -H "Content-Type: application/json" \ | |
| -H "X-API-Key: dev-key" \ | |
| -d '{ | |
| "query": "Top 5 customers by total invoice amount", | |
| "db_id": null | |
| }' | |
| ``` | |
| ### โ Sample Response (accurate) | |
| ```json | |
| { | |
| "ambiguous": false, | |
| "sql": "SELECT ...", | |
| "rationale": "Explanation of why this SQL was generated.", | |
| "result": { | |
| "rows": 5, | |
| "columns": ["CustomerId", "Total"], | |
| "rows_data": [ | |
| [1, 39.6], | |
| [2, 38.7], | |
| [3, 35.4] | |
| ] | |
| }, | |
| "traces": [ | |
| {"stage": "detector", "duration_ms": 1}, | |
| {"stage": "planner", "duration_ms": 8943}, | |
| {"stage": "generator","duration_ms": 1722}, | |
| {"stage": "safety", "duration_ms": 2}, | |
| {"stage": "executor", "duration_ms": 1}, | |
| {"stage": "verifier", "duration_ms": 1}, | |
| {"stage": "repair", "duration_ms": 522} | |
| ] | |
| } | |
| ``` | |
| --- | |
| ## ๐ Upload a SQLite DB | |
| ```bash | |
| curl -X POST "http://localhost:8000/api/v1/nl2sql/upload_db" \ | |
| -H "X-API-Key: dev-key" \ | |
| -F "file=@/path/to/db.sqlite" | |
| ``` | |
| --- | |
| ## ๐ Schema Preview | |
| ```bash | |
| curl "http://localhost:8000/api/v1/nl2sql/schema?db_id=<uuid>" \ | |
| -H "X-API-Key: dev-key" | |
| ``` | |
| --- | |
| # **8) Environment Variables** | |
| | Variable | Purpose | | |
| | ---------------------- | ---------------------------------------- | | |
| | `API_KEYS` | Comma-separated list of backend API keys | | |
| | `API_KEY` | Used by Gradio UI to call the backend | | |
| | `DEV_MODE` | Enables strict ambiguity detection | | |
| | `NL2SQL_CACHE_TTL_SEC` | Cache TTL | | |
| | `NL2SQL_CACHE_MAX` | Max cache entries | | |
| | `SPIDER_ROOT` | Path to Spider dataset | | |
| | `USE_MOCK` | Skip execution (for testing) | | |
| > Gradio uses `API_KEY` โ backend expects it as `X-API-Key`. | |
| > Backend accepts multiple keys via `API_KEYS`. | |
| --- | |
| # **9) Future Work** | |
| ### 1) Streaming SQL Generation (SSE) | |
| ### 2) Redis Distributed Cache | |
| ### 3) Multi-Model Planner/Generator | |
| ### 4) A/B Testing Framework | |
| ### 5) Schema Embeddings | |
| ### 6) Nightly CI Benchmarks | |
| ### 7) Advanced Repair (diff-based) | |
| ### 8) Helm / Compose Deployment Template | |
| --- | |
| # **10) License** | |
| MIT License. | |