title: NL2SQL Copilot - Full Stack Demo
emoji: 🧩
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
🧩 NL2SQL Copilot
A production-grade Text-to-SQL Copilot that converts natural-language questions into safe, verified SQL. Built for analytics engineers who need accuracy, transparency, and control, and powered by FastAPI, LangGraph, and Pydantic-AI.
Overview
NL2SQL Copilot is an agentic, modular pipeline that plans, generates, verifies, and repairs SQL queries.
It ensures correctness and safety through structured stages, evaluation on the Spider dataset, and full observability support.
💡 Designed for read-only production databases, with self-repair, metrics, and CI/CD baked in.
🧠 Agentic Architecture
Natural Language
↓
[ Detector ]
↓
[ Planner ]
↓
[ Generator (LLM) ]
↓
[ Safety ]
↓
[ Executor ]
↓
[ Verifier ]
↓
[ Repair ]
Each stage is isolated, configurable via YAML, and observable through structured traces and Prometheus metrics.
| Stage | Responsibility |
|---|---|
| Detector | Identify whether a query is Text-to-SQL |
| Planner | Extract user intent and SQL plan |
| Generator | Call LLM to synthesize SQL |
| Safety | Block unsafe or non-SELECT queries |
| Executor | Execute query in read-only sandbox |
| Verifier | Compare results, detect mismatch |
| Repair | Self-healing loop triggered on failure |
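The stage flow in the table can be sketched as a bounded retry loop. This is a minimal illustration, not the project's implementation: `run_pipeline`, `is_select_only`, `Trace`, and `fake_llm` are hypothetical names, the Detector and Planner stages are omitted, the Safety check is a naive prefix test, and the Verifier simply accepts any query that executes.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Trace:
    """Per-request record of each attempt (hypothetical structure)."""
    events: list[str] = field(default_factory=list)


def is_select_only(sql: str) -> bool:
    """Naive stand-in for the Safety stage: one statement, starting with SELECT."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def run_pipeline(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    conn: sqlite3.Connection,
    max_repairs: int = 2,
) -> tuple[Optional[list[tuple]], Trace]:
    """Generator -> Safety -> Executor, retrying with feedback on failure."""
    trace = Trace()
    feedback: Optional[str] = None
    for attempt in range(max_repairs + 1):
        sql = generate(question, feedback)
        trace.events.append(f"attempt {attempt}: {sql}")
        if not is_select_only(sql):
            feedback = "only a single SELECT statement is allowed"
            continue
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # The error message becomes the repair hint for the next attempt.
            feedback = f"execution failed: {exc}"
            continue
        return rows, trace  # toy Verifier: a query that executes is accepted
    return None, trace


# Demo data and a fake generator that only gets it right after feedback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 41)])


def fake_llm(question: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "SELECT nam FROM singer"  # wrong column name on the first draft
    return "SELECT name FROM singer WHERE age > 35"


rows, trace = run_pipeline("Which singers are older than 35?", fake_llm, conn)
```

Capping the loop with `max_repairs` keeps latency bounded when the generator cannot recover; the cap value here is an arbitrary choice for the sketch.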
Benchmark (Spider dataset)
Dataset: Spider by Yale LILY Lab. Evaluated on the Spider dev subset (20 samples) using the reproducible evaluation toolkit.
| Metric | Value |
|---|---|
| EM (Exact Match) | 0.15 |
| SM (Structural Match) | 0.70 |
| ExecAcc (Execution Accuracy) | 0.73 |
| Avg Latency | 8.11 s |
| p50 Latency | 9.42 s |
| p95 Latency | 13.88 s |
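On this 20-sample subset, the higher Structural Match (0.70) and Execution Accuracy (0.73) suggest that most generated queries are semantically correct; the low EM (0.15) largely reflects formatting differences that do not change the results.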
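The gap between EM and ExecAcc is easier to see with a toy comparison. The sketch below is an assumption-laden illustration, not the project's evaluation toolkit: `normalize`, `exact_match`, and `execution_match` are hypothetical helpers, and the string-based match is cruder than the official Spider evaluator, which compares parsed SQL components.

```python
import sqlite3


def normalize(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.lower().rstrip(";").split())


def exact_match(pred: str, gold: str) -> bool:
    """String-level match after normalization (simplified EM)."""
    return normalize(pred) == normalize(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """Run both queries and compare result sets, ignoring row order."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to run cannot match
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 41)])

gold = "SELECT name FROM singer WHERE age > 35"
pred = "SELECT s.name FROM singer AS s WHERE s.age >= 36"

em = exact_match(pred, gold)
ex = execution_match(conn, pred, gold)
```

Here `pred` and `gold` are written differently, so the exact-match check fails, yet both return the same rows on this data, so the execution check passes.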
Run reproducible benchmarks:
export SPIDER_ROOT="$PWD/data/spider"
PYTHONPATH=$PWD python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20
PYTHONPATH=$PWD python benchmarks/plot_results.py
Results & plots: benchmarks/results_pro/20251109-171247/
⚙️ Key Features
- Agentic architecture: multi-stage pipeline with feedback loop
- Safety layer: SELECT-only guardrails and AST validation
- Self-repair: automatic retry when verification fails
- Reproducible evaluation: integrated Spider / Dr.Spider benchmarking
- Config-driven design: YAML pipeline factory
- Plug-and-play adapters: SQLite / PostgreSQL / OpenAI / Anthropic / Ollama
- FastAPI service + Streamlit UI: demo or API mode
- CI/CD ready: Makefile, Ruff, Mypy, Pytest, Docker, GitHub Actions
- Observability stack: Prometheus & Grafana metrics for latency and errors
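One way to enforce a SELECT-only guardrail on SQLite is the database's own authorizer hook. Note that this is a different technique from the AST validation listed above and is shown only as a dependency-free sketch; `select_only_authorizer` and `run_guarded` are hypothetical names, and the set of allowed actions is an assumption that a real deployment would need to review.

```python
import sqlite3

# Actions a plain read query needs: the SELECT itself, column reads, SQL functions.
ALLOWED_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    sqlite3.SQLITE_FUNCTION,
}


def select_only_authorizer(action, arg1, arg2, db_name, source):
    """Allow read actions; deny everything else (INSERT, DELETE, DROP, ...)."""
    if action in ALLOWED_ACTIONS:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def run_guarded(conn: sqlite3.Connection, sql: str):
    """Return the rows, or None if the statement is denied or invalid."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


# Populate the demo database first, then lock the connection down.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 41)])
conn.commit()
conn.set_authorizer(select_only_authorizer)

allowed = run_guarded(conn, "SELECT name FROM singer ORDER BY name")
blocked = run_guarded(conn, "DELETE FROM singer")
remaining = run_guarded(conn, "SELECT count(*) FROM singer")
```

Because the check runs inside the database engine when the statement is prepared, it does not depend on how the SQL text is formatted, which makes it a reasonable second line of defense behind a parser-based check.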
🧩 Observability & GenAIOps
Monitor every stage of the pipeline in real-time:
- /metrics endpoint exposed via FastAPI
- Prometheus + Grafana stack with make obs-up
Metrics tracked:
- nl2sql_stage_latency_ms
- nl2sql_stage_error_total
- nl2sql_query_exec_count
- nl2sql_repair_success_rate
make obs-up # start Prometheus + Grafana
make obs-down # stop the stack
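To make the per-stage metrics concrete, here is a dependency-free sketch of how stage timings and errors could be recorded and rendered in the Prometheus text exposition format. It reuses two metric names from the list above; the `stage` label, the `observe_stage` and `render_metrics` helpers, and the choice of a summary type are assumptions, and a real service would more likely rely on a Prometheus client library than on hand-rolled rendering.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stores keyed by stage name (the "stage" label is an assumption).
latency_ms: dict[str, list[float]] = defaultdict(list)
error_total: dict[str, int] = defaultdict(int)


@contextmanager
def observe_stage(stage: str):
    """Time one stage run and count it as an error if it raises."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        error_total[stage] += 1
        raise
    finally:
        latency_ms[stage].append((time.perf_counter() - start) * 1000.0)


def render_metrics() -> str:
    """Render the collected values as Prometheus text exposition lines."""
    lines = ["# TYPE nl2sql_stage_error_total counter"]
    for stage, count in sorted(error_total.items()):
        lines.append(f'nl2sql_stage_error_total{{stage="{stage}"}} {count}')
    lines.append("# TYPE nl2sql_stage_latency_ms summary")
    for stage, samples in sorted(latency_ms.items()):
        lines.append(f'nl2sql_stage_latency_ms_sum{{stage="{stage}"}} {sum(samples):.3f}')
        lines.append(f'nl2sql_stage_latency_ms_count{{stage="{stage}"}} {len(samples)}')
    return "\n".join(lines) + "\n"


# Demo: one successful stage and one failing stage.
with observe_stage("generator"):
    pass

try:
    with observe_stage("safety"):
        raise ValueError("non-SELECT statement")
except ValueError:
    pass

text = render_metrics()
```

Wrapping each stage in a context manager keeps the timing and error accounting in one place, so individual stages do not need their own instrumentation code.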
🧪 Quick Start
1️⃣ Clone & Run
git clone https://github.com/melika-kheirieh/nl2sql-copilot.git
cd nl2sql-copilot
make run
Or build with Docker:
docker build -t nl2sql-copilot .
docker run --rm -p 8000:8000 nl2sql-copilot
API available at http://localhost:8000/docs
Streamlit demo at http://localhost:7860
🧠 For Developers & CI/CD
make lint # Ruff
make typecheck # Mypy
make test # Pytest
make bench # Run benchmark suite
CI/CD Highlights
- Runs on GitHub Actions (make check)
- Enforces formatting, typing, tests, and Docker build
- Publishes Docker image to GHCR on successful merge
🎯 Why it matters
- Bridges natural language and databases with measurable reliability
- Provides reproducible evaluation for continuous model tracking
- Delivers production-level resilience via self-repair and observability
- Demonstrates AI software engineering beyond prompt design
👤 Author
Melika Kheirieh
AI Engineer & Researcher in Natural Language Interfaces for Databases
GitHub · LinkedIn
This project evolved from the NL2SQL Copilot Prototype, which was refactored into a production-grade, modular agent.
License
MIT © 2025 Melika Kheirieh
