nl2sql-copilot / README.md
Melika Kheirieh
feat(api): expose executor result in NL2SQLResponse and return full result to UI
3ef53b4
|
raw
history blame
8.09 kB
metadata
title: NL2SQL Copilot โ€” Full Stack Demo
emoji: ๐Ÿงฉ
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false

๐Ÿงฉ NL2SQL Copilot โ€” Natural-Language โ†’ Safe SQL

CI Docker License: MIT

Modular Text-to-SQL Copilot built with FastAPI & Pydantic-AI. Generates safe, verified, executable SQL through a multi-stage agentic pipeline. Includes: schema introspection, self-repair, Spider benchmarks, Prometheus metrics, and a full demo UI.

๐Ÿš€ Live Demo: ๐Ÿ‘‰ https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot


1) Quick Start

git clone https://github.com/melika-kheirieh/nl2sql-copilot
cd nl2sql-copilot
make setup       # install dependencies
make run         # start API + Gradio UI

Open:


2) Demo (Gradio UI)

The demo supports:

  • Uploading any SQLite database
  • Asking natural-language questions
  • Viewing generated SQL
  • Viewing query results
  • Full multi-stage trace (detector โ†’ planner โ†’ generator โ†’ safety โ†’ executor โ†’ verifier โ†’ repair)
  • Per-stage timings
  • Example queries
  • And a default demo DB (no upload required)

Everything runs on the same backend as the API.


3) Agentic Architecture

user query
    โ†“
detector      (ambiguity, missing info)
planner       (schema reasoning + task decomposition)
generator     (SQL generation)
safety        (SELECT-only validation)
executor      (sandboxed DB execution)
verifier      (semantic + execution checks)
repair        (minimal-diff SQL repair loop)
    โ†“
final SQL + result + traces

โš™๏ธ Tech Stack

  • FastAPI
  • Pydantic-AI
  • SQLiteAdapter
  • Prometheus + Grafana
  • pytest + mypy + Makefile
  • Gradio UI

The pipeline is fully modular: each stage has a clean, swappable interface.


4) Evolution (Prototype โ†’ Copilot)

This project is the second-generation, production-grade version of an earlier prototype: ๐Ÿ‘‰ https://github.com/melika-kheirieh/nl2sql-copilot-prototype

The prototype explored single-step, prompt-based SQL generation. The current version is a complete architectural redesign, adding:

  • multi-stage agentic pipeline
  • schema introspection
  • safety guardrails
  • self-repair loop
  • caching
  • observability
  • Spider benchmarks
  • multi-DB support with upload + TTL handling

This repository is the first end-to-end, production-oriented version.


5) Key Features

โœ” Agentic Pipeline

Planner โ†’ Generator โ†’ Safety โ†’ Executor โ†’ Verifier โ†’ Repair.

โœ” Schema-Aware

Automatic schema preview for any uploaded SQLite database.

โœ” Safety by Design

  • SELECT-only
  • Column/table validation
  • No multi-statement SQL
  • Prevents schema hallucination

โœ” Self-Repair

Automatic minimal-diff correction when SQL fails.

โœ” Caching

TTL-based, with key = (db_id, normalized_query, schema_hash). Hit/miss metrics included.

โœ” Observability

  • Per-stage latency
  • Pipeline success ratio
  • Repair success rate
  • Cache hit ratio
  • p95 latency
  • Full Grafana dashboard

โœ” Benchmarks

Reproducible Spider evaluation with plots + summary.


6) Benchmarks (Spider dev, 20 samples)

Benchmarks

Evaluated on a curated 20-sample subset of the Spider dev split (focused on concert_singer), using the full production pipeline.

๐Ÿงฎ Summary

  • Total samples: 20
  • Successful runs: 20/20 (100%)
  • Exact Match (EM): 0.10
  • Structural Match (SM): 0.70
  • Execution Accuracy: 0.725

This reflects a production-oriented NL2SQL system: the model optimizes for executable SQL, not literal gold-string alignment.


โฑ Latency

  • Avg latency: ~8066 ms
  • p50: ~9229 ms
  • p95: ~14936 ms

Latency is bimodal: simple queries โ†’ fast, reasoning-heavy queries โ†’ planner-dominated.


โš™๏ธ Per-Stage Latency

Stage Avg latency (ms)
detector ~1
planner ~8360
generator ~1645
safety ~2
executor ~1
verifier ~1
repair ~1200

Planner is the main bottleneck (expected for schema-level reasoning). Safety/executor/verifier stay single-digit ms.


โŒ Failure Modes (Why EM is low)

Even when EM = 0, SM and ExecAcc are often 1.0.

Typical causes:

  • Capitalization differences (Age vs age)
  • Different column ordering
  • LIMIT differences
  • Alias mismatch
  • Gold SQL is EMPTY but the model infers a valid SQL

In real-world systems, execution correctness matters more than exact string match.


๐Ÿ“‚ Reproducing the Benchmark

export SPIDER_ROOT="$PWD/data/spider"

PYTHONPATH=$PWD \
  python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20 --debug

PYTHONPATH=$PWD \
  python benchmarks/plot_results.py

Artifacts saved under:

benchmarks/results_pro/<timestamp>/
    summary.json
    eval.jsonl
    metrics_overview.png
    latency_histogram.png
    latency_per_stage.png
    errors_overview.png

7) API Usage

๐Ÿ” NL โ†’ SQL

curl -X POST "http://localhost:8000/api/v1/nl2sql" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-key" \
  -d '{
        "query": "Top 5 customers by total invoice amount",
        "db_id": null
      }'

โœ” Sample Response (accurate)

{
  "ambiguous": false,
  "sql": "SELECT ...",
  "rationale": "Explanation of why this SQL was generated.",
  "result": {
    "rows": 5,
    "columns": ["CustomerId", "Total"],
    "rows_data": [
      [1, 39.6],
      [2, 38.7],
      [3, 35.4]
    ]
  },
  "traces": [
    {"stage": "detector", "duration_ms": 1},
    {"stage": "planner",  "duration_ms": 8943},
    {"stage": "generator","duration_ms": 1722},
    {"stage": "safety",   "duration_ms": 2},
    {"stage": "executor", "duration_ms": 1},
    {"stage": "verifier", "duration_ms": 1},
    {"stage": "repair",   "duration_ms": 522}
  ]
}

๐Ÿ“ Upload a SQLite DB

curl -X POST "http://localhost:8000/api/v1/nl2sql/upload_db" \
  -H "X-API-Key: dev-key" \
  -F "file=@/path/to/db.sqlite"

๐Ÿ“‘ Schema Preview

curl "http://localhost:8000/api/v1/nl2sql/schema?db_id=<uuid>" \
  -H "X-API-Key: dev-key"

8) Environment Variables

Variable Purpose
API_KEYS Comma-separated list of backend API keys
API_KEY Used by Gradio UI to call the backend
DEV_MODE Enables strict ambiguity detection
NL2SQL_CACHE_TTL_SEC Cache TTL
NL2SQL_CACHE_MAX Max cache entries
SPIDER_ROOT Path to Spider dataset
USE_MOCK Skip execution (for testing)

Gradio uses API_KEY โ†’ backend expects it as X-API-Key. Backend accepts multiple keys via API_KEYS.


9) Future Work

1) Streaming SQL Generation (SSE)

2) Redis Distributed Cache

3) Multi-Model Planner/Generator

4) A/B Testing Framework

5) Schema Embeddings

6) Nightly CI Benchmarks

7) Advanced Repair (diff-based)

8) Helm / Compose Deployment Template


10) License

MIT License.