github-actions[bot] committed on
Commit 5900a5b · 1 Parent(s): e70c579

Sync from GitHub main @ 83abcf66d264b1bb94e95894191d576387b90086

Files changed (3)
  1. .dockerignore +35 -15
  2. Dockerfile +39 -16
  3. README.md +126 -256
.dockerignore CHANGED
@@ -1,23 +1,43 @@
- # Ignore all data except demo.db
- data/*
- !data/demo.db
-
- # Python cache
- __pycache__
- *.pyc
-
- # Git
+ # VCS
  .git
- .gitignore
+ .github

- # Local environments
- venv/
- .env/
+ # Secrets
+ .env
  .env.*

- # Other files you don't want in Docker context
+ # Python caches / test caches
+ __pycache__/
+ *.py[cod]
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+ .hypothesis/
+
+ # Coverage artifacts
+ .coverage
+ coverage.xml
+ htmlcov/
+
+ # IDE
  .vscode/
  .idea/
+
+ # Build outputs
  dist/
  build/
- coverage/
+ *.egg-info/
+
+ # Big / local-only data
+ data/spider/
+ data/uploads/
+ data/tmp/
+
+ # SQLite runtime side-files
+ **/*.db-wal
+ **/*.db-shm
+ **/*.db-journal
+
+ # Generated benchmark outputs
+ benchmarks/results*/
+ benchmarks/results_pro/
Dockerfile CHANGED
@@ -1,30 +1,53 @@
- FROM python:3.12-slim
+ # -------------------------------------
+ # Base image (runtime)
+ # -------------------------------------
+ FROM python:3.12-slim AS base

+ # Prevent Python from writing .pyc files and force stdout flush
  ENV PYTHONDONTWRITEBYTECODE=1 \
      PYTHONUNBUFFERED=1 \
-     PIP_NO_CACHE_DIR=1 \
-     PORT=7860 \
-     GRADIO_SERVER_NAME=0.0.0.0
+     PORT=7860

- WORKDIR /home/user/app
+ # Install tini (proper init process) + curl (for healthcheck)
+ RUN apt-get update && apt-get install -y --no-install-recommends tini curl \
+  && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ # -------------------------------------
+ # Builder stage (dependencies)
+ # -------------------------------------
+ FROM base AS builder
+
+ WORKDIR /app

- # Copy requirements first
  COPY requirements.txt .

  RUN apt-get update && apt-get install -y --no-install-recommends \
-     gcc build-essential && \
-     pip install --no-cache-dir -U pip && \
-     pip install --no-cache-dir -r requirements.txt && \
-     apt-get purge -y gcc build-essential && \
-     apt-get autoremove -y && apt-get clean -y
+     build-essential gcc \
+  && pip install --upgrade pip \
+  && pip install --no-cache-dir -r requirements.txt \
+  && rm -rf /var/lib/apt/lists/*

- # Copy full repo — but due to .dockerignore, ONLY demo.db from data/ is included
- COPY . .
-
- # Optional debug
- # RUN ls -R /home/user/app/data
+ # -------------------------------------
+ # Final stage (runtime)
+ # -------------------------------------
+ FROM base AS final
+
+ WORKDIR /app
+
+ # Copy installed dependencies from builder
+ COPY --from=builder /usr/local /usr/local
+
+ COPY . .

+ # Expose ports (FastAPI + Gradio)
  EXPOSE 7860
+ EXPOSE 8000

- ENTRYPOINT []
+ # tini handles PID1, zombie reaping, and signals
+ ENTRYPOINT ["tini", "--"]

  CMD ["python", "-u", "start.py"]
README.md CHANGED
@@ -6,335 +6,205 @@ colorTo: blue
  sdk: docker
  pinned: false
  ---
- # 🧩 **NL2SQL Copilot — Natural-Language → Safe SQL**
  [![CI](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml/badge.svg)](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml)
  [![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

- **Modular Text-to-SQL Copilot built with FastAPI & Pydantic-AI.**
- Generates *safe, verified, executable SQL* through a multi-stage agentic pipeline.
- Includes: schema introspection, self-repair, Spider benchmarks, Prometheus metrics, and a full demo UI.
-
- 🚀 **Live Demo:**
- 👉 **[https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot](https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot)**

  ---

- # **1) Quick Start**
-
- ```bash
- git clone https://github.com/melika-kheirieh/nl2sql-copilot
- cd nl2sql-copilot
- make setup   # install dependencies
- make run     # start API + Gradio UI
- ```
-
- Open:
-
- * [http://localhost:8000](http://localhost:8000) (FastAPI Swagger UI)
- * [http://localhost:7860](http://localhost:7860) (Gradio Demo)

  ---

- # **2) Demo (Gradio UI)**
-
- The demo supports:
-
- * Uploading any SQLite database
- * Asking natural-language questions
- * Viewing generated SQL
- * Viewing query results
- * Full multi-stage trace (detector → planner → generator → safety → executor → verifier → repair)
- * Per-stage timings
- * Example queries
- * And a default demo DB (no upload required)
-
- Everything runs on the same backend as the API.

  ---

- # **3) Agentic Architecture**
-
- ```
- user query
-
- detector (ambiguity, missing info)
- planner (schema reasoning + task decomposition)
- generator (SQL generation)
- safety (SELECT-only validation)
- executor (sandboxed DB execution)
- verifier (semantic + execution checks)
- repair (minimal-diff SQL repair loop)
-
- final SQL + result + traces
- ```
-
- ### ⚙️ Tech Stack
-
- * FastAPI
- * Pydantic-AI
- * SQLiteAdapter
- * Prometheus + Grafana
- * pytest + mypy + Makefile
- * Gradio UI
-
- The pipeline is fully modular: each stage has a clean, swappable interface.

  ---

- # **4) Evolution (Prototype → Copilot)**
-
- This project is the **second-generation, production-grade** version of an earlier prototype:
- 👉 [https://github.com/melika-kheirieh/nl2sql-copilot-prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype)
-
- The prototype explored single-step, prompt-based SQL generation.
- The current version is a **complete architectural redesign**, adding:
-
- * multi-stage agentic pipeline
- * schema introspection
- * safety guardrails
- * self-repair loop
- * caching
- * observability
- * Spider benchmarks
- * multi-DB support with upload + TTL handling
-
- This repository is the first **end-to-end, production-oriented** version.
-
- ---
-
- # **5) Key Features**
-
- ### ✔ Agentic Pipeline
-
- Planner → Generator → Safety → Executor → Verifier → Repair.
-
- ### ✔ Schema-Aware
-
- Automatic schema preview for any uploaded SQLite database.
-
- ### ✔ Safety by Design
-
- * SELECT-only
- * Column/table validation
- * No multi-statement SQL
- * Prevents schema hallucination
-
- ### ✔ Self-Repair
-
- Automatic minimal-diff correction when SQL fails.
-
- ### ✔ Caching
-
- TTL-based, with key = (db_id, normalized_query, schema_hash).
- Hit/miss metrics included.
-
- ### ✔ Observability
-
- * Per-stage latency
- * Pipeline success ratio
- * Repair success rate
- * Cache hit ratio
- * p95 latency
- * Full Grafana dashboard
-
- ### ✔ Benchmarks
-
- Reproducible Spider evaluation with plots + summary.

  ---

- # **6) Benchmarks (Spider dev, 20 samples)**
-
- [![Benchmarks](https://img.shields.io/badge/Benchmarks-Spider%20dev-blue)](#benchmarks)
-
- Evaluated on a curated 20-sample subset of the Spider **dev** split
- (focused on `concert_singer`), using the full production pipeline.
-
- ### 🧮 Summary
-
- * **Total samples:** 20
- * **Successful runs:** 20/20 (**100%**)
- * **Exact Match (EM):** **0.10**
- * **Structural Match (SM):** **0.70**
- * **Execution Accuracy:** **0.725**
-
- This reflects a *production-oriented* NL2SQL system:
- the model optimizes for **executable SQL**, not literal gold-string alignment.

  ---

- ### Latency
-
- * **Avg latency:** ~**8066 ms**
- * **p50:** ~**9229 ms**
- * **p95:** ~**14936 ms**
-
- Latency is **bimodal**:
- simple queries → fast, reasoning-heavy queries → planner-dominated.

  ---

- ### ⚙️ Per-Stage Latency
-
- | Stage     | Avg latency (ms) |
- | --------- | ---------------- |
- | detector  | ~1               |
- | planner   | ~8360            |
- | generator | ~1645            |
- | safety    | ~2               |
- | executor  | ~1               |
- | verifier  | ~1               |
- | repair    | ~1200            |
-
- Planner is the main bottleneck (expected for schema-level reasoning).
- Safety/executor/verifier stay **single-digit ms**.

  ---

- ### Failure Modes (Why EM is low)
-
- Even when EM = 0, **SM and ExecAcc are often 1.0**.
-
- Typical causes:
-
- * Capitalization differences (`Age` vs `age`)
- * Different column ordering
- * LIMIT differences
- * Alias mismatch
- * Gold SQL is `EMPTY` but the model infers a valid SQL
-
- In real-world systems, **execution correctness matters more than exact string match**.

  ---

- ### 📂 Reproducing the Benchmark
-
- ```bash
- export SPIDER_ROOT="$PWD/data/spider"
-
- PYTHONPATH=$PWD \
- python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20 --debug
-
- PYTHONPATH=$PWD \
- python benchmarks/plot_results.py
- ```
-
- Artifacts saved under:
-
- ```
- benchmarks/results_pro/<timestamp>/
-   summary.json
-   eval.jsonl
-   metrics_overview.png
-   latency_histogram.png
-   latency_per_stage.png
-   errors_overview.png
- ```

  ---

- # **7) API Usage**
-
- ## 🔍 NL → SQL
-
- ```bash
- curl -X POST "http://localhost:8000/api/v1/nl2sql" \
-   -H "Content-Type: application/json" \
-   -H "X-API-Key: dev-key" \
-   -d '{
-     "query": "Top 5 customers by total invoice amount",
-     "db_id": null
-   }'
- ```
-
- ### Sample Response (accurate)
-
- ```json
- {
-   "ambiguous": false,
-   "sql": "SELECT ...",
-   "rationale": "Explanation of why this SQL was generated.",
-   "result": {
-     "rows": 5,
-     "columns": ["CustomerId", "Total"],
-     "rows_data": [
-       [1, 39.6],
-       [2, 38.7],
-       [3, 35.4]
-     ]
-   },
-   "traces": [
-     {"stage": "detector",  "duration_ms": 1},
-     {"stage": "planner",   "duration_ms": 8943},
-     {"stage": "generator", "duration_ms": 1722},
-     {"stage": "safety",    "duration_ms": 2},
-     {"stage": "executor",  "duration_ms": 1},
-     {"stage": "verifier",  "duration_ms": 1},
-     {"stage": "repair",    "duration_ms": 522}
-   ]
- }
- ```
-
- ---
-
- ## 📁 Upload a SQLite DB
-
- ```bash
- curl -X POST "http://localhost:8000/api/v1/nl2sql/upload_db" \
-   -H "X-API-Key: dev-key" \
-   -F "file=@/path/to/db.sqlite"
- ```

  ---

- ## 📑 Schema Preview
-
  ```bash
- curl "http://localhost:8000/api/v1/nl2sql/schema?db_id=<uuid>" \
-   -H "X-API-Key: dev-key"
  ```

- ---
-
- # **8) Environment Variables**
-
- | Variable               | Purpose                                  |
- | ---------------------- | ---------------------------------------- |
- | `API_KEYS`             | Comma-separated list of backend API keys |
- | `API_KEY`              | Used by Gradio UI to call the backend    |
- | `DEV_MODE`             | Enables strict ambiguity detection       |
- | `NL2SQL_CACHE_TTL_SEC` | Cache TTL                                |
- | `NL2SQL_CACHE_MAX`     | Max cache entries                        |
- | `SPIDER_ROOT`          | Path to Spider dataset                   |
- | `USE_MOCK`             | Skip execution (for testing)             |
-
- > Gradio uses `API_KEY`; the backend expects it as `X-API-Key`.
- > Backend accepts multiple keys via `API_KEYS`.

  ---

- # **9) Future Work**
-
- ### 1) Streaming SQL Generation (SSE)
- ### 2) Redis Distributed Cache
- ### 3) Multi-Model Planner/Generator
- ### 4) A/B Testing Framework
- ### 5) Schema Embeddings
- ### 6) Nightly CI Benchmarks
- ### 7) Advanced Repair (diff-based)
- ### 8) Helm / Compose Deployment Template

  ---

- # **10) License**
-
- MIT License.
  sdk: docker
  pinned: false
  ---
+
+ # NL2SQL Copilot — Safety-First, Production-Grade Text-to-SQL
+
  [![CI](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml/badge.svg)](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml)
  [![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

+ A **production-oriented Natural Language → SQL system** built around **explicit safety guarantees, verification, evaluation, and observability**.
+
+ This project treats LLMs as **untrusted components** inside a constrained, measurable system — not as autonomous agents.

  ---

+ ## Demo (End-to-End)
+
+ A live interactive demo is available on Hugging Face Spaces: 👉 [**Try the Demo**](https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot)
+
+ <p align="center">
+   <img src="docs/assets/screenshots/demo_list_albums_total_sales.png" width="700">
+ </p>

  ---

+ ## Why this exists
+
+ Most Text-to-SQL demos answer:
+ > *“Can the model generate SQL?”*
+
+ This project answers a harder question:
+ > **“Can NL→SQL be operated safely as a production system?”**
+
+ That means:
+ - controlling **what the model sees** (context engineering),
+ - constraining **what it is allowed to execute** (safety),
+ - verifying results before returning them,
+ - and continuously measuring **accuracy, latency, and cost**.

  ---

+ ## What the system does
+
+ - Converts natural-language questions into **safe, verified SQL**
+ - Enforces **SELECT-only execution policies** (no DDL / DML)
+ - Uses **explicit context engineering** (schema packing + rules)
+ - Applies **execution and verification guardrails**
+ - Tracks **per-stage latency, errors, and cost signals**
+ - Evaluates accuracy on **Spider** with a structured error taxonomy
+ - Exposes **Prometheus metrics** and **Grafana dashboards**

  ---

+ ## Architecture & Pipeline
+
+ <p align="center">
+   <img src="docs/assets/architecture.png" width="720">
+ </p>
+
+ ```
+ Detector
+  → Planner
+  → Generator
+  → Safety Guard
+  → Executor
+  → Verifier
+  → Repair (bounded)
+ ```
+
+ Each stage:
+ - has a single responsibility,
+ - emits structured traces,
+ - is independently testable.
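The per-stage structured traces can be illustrated with a small helper. This is an editor's sketch only; the repository's actual trace schema and helper names are not shown in this diff:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_trace(traces: list, stage: str):
    """Time one pipeline stage and append a {stage, duration_ms} record.

    Illustrative sketch: the real pipeline's trace fields may differ.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        traces.append({"stage": stage, "duration_ms": elapsed_ms})

# Usage: wrap each stage so the final response can carry all traces.
traces: list = []
with stage_trace(traces, "planner"):
    pass  # ... run the planner here ...
```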

  ---

+ ## Core design principles
+
+ ### 1) Context engineering over prompt cleverness
+
+ The model never sees the raw database blindly.
+
+ Instead, it receives:
+ - a **deterministic schema pack**,
+ - explicit constraints (e.g. SELECT-only, LIMIT rules),
+ - and a bounded context budget.
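A deterministic, budget-bounded schema pack for SQLite could be sketched as follows. The function name, budget policy, and truncation strategy here are assumptions for illustration, not the project's API:

```python
import sqlite3

def schema_pack(db_path: str, max_chars: int = 2000) -> str:
    """Build a deterministic schema summary under a hard character budget.

    Sketch only: sorts tables by name so output is reproducible, then
    truncates at the budget. The real packer may summarize more cleverly.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT sql FROM sqlite_master "
            "WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    packed = "\n".join(r[0] for r in rows if r[0])
    return packed[:max_chars]  # enforce the context budget
```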
  ---

+ ### 2) Safety is enforced, not suggested
+
+ Safety policies are **system-level constraints**, not prompt instructions.
+
+ Current guarantees:
+ - Single-statement execution
+ - `SELECT` / `WITH` only
+ - No DDL / DML
+ - Execution time & result guards
+
+ Violations are **blocked**, not repaired.
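As a rough illustration of system-level enforcement, a guard of this shape can be written in a few lines. This is a simplified sketch, not the project's actual validator (which may use proper SQL parsing rather than pattern checks):

```python
import re

# Keywords that imply DDL/DML; a real guard would work on a parsed AST.
DENYLIST = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.I
)

def is_safe_select(sql: str) -> bool:
    """Allow only a single SELECT/WITH statement (illustrative sketch)."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:                              # multi-statement: block
        return False
    if not re.match(r"(select|with)\b", stripped, re.I):
        return False                                 # must read, not write
    if DENYLIST.search(stripped):                    # embedded DDL/DML: block
        return False
    return True
```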

  ---

+ ### 3) Verification before trust
+
+ Queries are executed in a controlled environment and verified for:
+ - structural validity,
+ - schema consistency,
+ - execution correctness.
+
+ Errors are surfaced explicitly and classified, not hidden.
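Error classification of this kind can be sketched with a tiny verifier over SQLite. The error-class names below are hypothetical examples, not the repository's actual taxonomy:

```python
import sqlite3

def verify(db: sqlite3.Connection, sql: str) -> dict:
    """Execute and classify failures instead of hiding them (sketch)."""
    try:
        cur = db.execute(sql)
        cols = [d[0] for d in cur.description] if cur.description else []
        return {"ok": True, "columns": cols, "error_class": None}
    except sqlite3.OperationalError as exc:
        msg = str(exc)
        # Hypothetical class names for illustration only.
        if "no such table" in msg or "no such column" in msg:
            klass = "schema_mismatch"
        else:
            klass = "execution_error"
        return {"ok": False, "columns": [], "error_class": klass}
```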

  ---

+ ### 4) Repair for reliability, not illusion
+
+ Repair exists to improve **system robustness**, not to chase accuracy at all costs.
+
+ - Triggered only for eligible error classes
+ - Disabled for safety violations
+ - Strictly bounded (no infinite loops)
  ---

+ ## Repository structure
+
+ ```text
+ app/            # FastAPI service (routes, schemas, wiring)
+ nl2sql/         # Core NL→SQL pipeline
+ adapters/       # Adapter implementations (DBs, LLMs)
+
+ benchmarks/     # Evaluation runners & outputs
+ tests/          # Unit & integration tests
+
+ prometheus/     # Prometheus configuration
+ grafana/        # Grafana provisioning
+ alertmanager/   # Alertmanager config
+ alert-receiver/ # Webhook receiver for alert testing
+
+ infra/          # Docker Compose & infra glue
+ configs/        # Runtime configs
+ scripts/        # Tooling & helpers
+
+ demo/           # Demo app
+ ui/             # UI surface
+ docs/           # Docs & screenshots
+ data/           # Local data & demo DBs
+ ```

  ---

+ ## Observability & GenAIOps
+
+ <p align="center">
+   <img src="docs/assets/grafana.png" width="720">
+ </p>
+
+ Tracked signals include:
+
+ * End-to-end latency (p50 / p95)
+ * Per-stage latency
+ * Success / failure counts
+ * Safety blocks
+ * Repair attempts & win-rate
+ * Cache hit / miss ratio
+ * Token usage (prompt / completion)
+
+ These metrics make **accuracy vs latency vs cost trade-offs** explicit.

  ---

+ ## Evaluation
+
+ The system is evaluated on the **Spider benchmark**.

  ```bash
+ make eval-spider
  ```

+ Metrics:
+
+ * Exact Match (EM)
+ * Execution Accuracy (ExecAcc)
+ * Semantic Match (SM)
+ * Latency distributions
+ * Error taxonomy breakdown
+
+ A **golden regression set** is used to detect accuracy regressions.
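Execution Accuracy of the kind listed above can be sketched as an order-insensitive comparison of result sets. This is illustrative only; Spider's official evaluator is considerably more involved:

```python
import sqlite3

def exec_accuracy(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """True if predicted and gold SQL return the same rows (sketch).

    Sorts rows so superficial ordering differences do not count as errors;
    a failing predicted query simply scores False.
    """
    def run(sql):
        return sorted(map(tuple, db.execute(sql).fetchall()))
    try:
        return run(pred_sql) == run(gold_sql)
    except sqlite3.Error:
        return False
```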

  ---

+ ## Roadmap
+
+ * AST-based SQL allowlisting
+ * Query cost heuristics (EXPLAIN-based)
+ * Cross-database adapters
+ * CI-level eval gating

  ---

+ ## What this project is *not*
+
+ * Not a prompt-only demo
+ * Not an autonomous agent playground
+ * Not optimized for leaderboard chasing
+
+ It is a **deliberately constrained, observable, and defendable AI system** —
+ built to be discussed seriously in production engineering interviews.