github-actions[bot] committed on
Commit 5900a5b · 1 Parent(s): e70c579

Sync from GitHub main @ 83abcf66d264b1bb94e95894191d576387b90086

Files changed (3)
  1. .dockerignore +35 -15
  2. Dockerfile +39 -16
  3. README.md +126 -256
.dockerignore CHANGED
@@ -1,23 +1,43 @@
- # Ignore all data except demo.db
- data/*
- !data/demo.db
-
- # Python cache
- __pycache__
- *.pyc
-
- # Git
+ # VCS
  .git
- .gitignore
+ .github

- # Local environments
- venv/
- .env/
+ # Secrets
+ .env
  .env.*

- # Other files you don't want in Docker context
+ # Python caches / test caches
+ __pycache__/
+ *.py[cod]
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+ .hypothesis/
+
+ # Coverage artifacts
+ .coverage
+ coverage.xml
+ htmlcov/
+
+ # IDE
  .vscode/
  .idea/
+
+ # Build outputs
  dist/
  build/
- coverage/
+ *.egg-info/
+
+ # Big / local-only data
+ data/spider/
+ data/uploads/
+ data/tmp/
+
+ # SQLite runtime side-files
+ **/*.db-wal
+ **/*.db-shm
+ **/*.db-journal
+
+ # Generated benchmark outputs
+ benchmarks/results*/
+ benchmarks/results_pro/
Dockerfile CHANGED
@@ -1,30 +1,53 @@
- FROM python:3.12-slim
+ # -------------------------------------
+ # Base image (runtime)
+ # -------------------------------------
+ FROM python:3.12-slim AS base

+ # Prevent Python from writing .pyc files and force stdout flush
  ENV PYTHONDONTWRITEBYTECODE=1 \
      PYTHONUNBUFFERED=1 \
-     PIP_NO_CACHE_DIR=1 \
-     PORT=7860 \
-     GRADIO_SERVER_NAME=0.0.0.0
+     PORT=7860

- WORKDIR /home/user/app
+ # Install tini (proper init process) + curl (for healthcheck)
+ RUN apt-get update && apt-get install -y --no-install-recommends tini curl \
+  && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ # -------------------------------------
+ # Builder stage (dependencies)
+ # -------------------------------------
+ FROM base AS builder
+
+ WORKDIR /app

- # Copy requirements first
  COPY requirements.txt .

  RUN apt-get update && apt-get install -y --no-install-recommends \
-     gcc build-essential && \
-     pip install --no-cache-dir -U pip && \
-     pip install --no-cache-dir -r requirements.txt && \
-     apt-get purge -y gcc build-essential && \
-     apt-get autoremove -y && apt-get clean -y
+     build-essential gcc \
+  && pip install --upgrade pip \
+  && pip install --no-cache-dir -r requirements.txt \
+  && rm -rf /var/lib/apt/lists/*

- # Copy full repo — but due to .dockerignore, ONLY demo.db from data/ is included
- COPY . .
-
- # Optional debug
- # RUN ls -R /home/user/app/data
+ # -------------------------------------
+ # Final stage (runtime)
+ # -------------------------------------
+ FROM base AS final
+
+ WORKDIR /app
+
+ # Copy installed dependencies from builder
+ COPY --from=builder /usr/local /usr/local
+
+ COPY . .

+ # Expose ports (FastAPI + Gradio)
  EXPOSE 7860
+ EXPOSE 8000

- ENTRYPOINT []
+ # tini handles PID1, zombie reaping, and signals
+ ENTRYPOINT ["tini", "--"]

  CMD ["python", "-u", "start.py"]
README.md CHANGED
@@ -6,335 +6,205 @@ colorTo: blue
  sdk: docker
  pinned: false
  ---
- # 🧩 **NL2SQL Copilot — Natural-Language → Safe SQL**
  [![CI](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml/badge.svg)](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml)
  [![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

- **Modular Text-to-SQL Copilot built with FastAPI & Pydantic-AI.**
- Generates *safe, verified, executable SQL* through a multi-stage agentic pipeline.
- Includes: schema introspection, self-repair, Spider benchmarks, Prometheus metrics, and a full demo UI.
-
- 🚀 **Live Demo:**
- 👉 **[https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot](https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot)**

  ---

- # **1) Quick Start**
-
- ```bash
- git clone https://github.com/melika-kheirieh/nl2sql-copilot
- cd nl2sql-copilot
- make setup   # install dependencies
- make run     # start API + Gradio UI
- ```
-
- Open:
-
- * [http://localhost:8000](http://localhost:8000) (FastAPI Swagger UI)
- * [http://localhost:7860](http://localhost:7860) (Gradio Demo)

  ---

- # **2) Demo (Gradio UI)**
-
- The demo supports:
-
- * Uploading any SQLite database
- * Asking natural-language questions
- * Viewing generated SQL
- * Viewing query results
- * Full multi-stage trace (detector → planner → generator → safety → executor → verifier → repair)
- * Per-stage timings
- * Example queries
- * And a default demo DB (no upload required)
-
- Everything runs on the same backend as the API.

  ---

- # **3) Agentic Architecture**
-
- ```
- user query
-
- detector (ambiguity, missing info)
- planner (schema reasoning + task decomposition)
- generator (SQL generation)
- safety (SELECT-only validation)
- executor (sandboxed DB execution)
- verifier (semantic + execution checks)
- repair (minimal-diff SQL repair loop)
-
- final SQL + result + traces
- ```
-
- ### ⚙️ Tech Stack
-
- * FastAPI
- * Pydantic-AI
- * SQLiteAdapter
- * Prometheus + Grafana
- * pytest + mypy + Makefile
- * Gradio UI
-
- The pipeline is fully modular: each stage has a clean, swappable interface.

  ---

- # **4) Evolution (Prototype → Copilot)**
-
- This project is the **second-generation, production-grade** version of an earlier prototype:
- 👉 [https://github.com/melika-kheirieh/nl2sql-copilot-prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype)
-
- The prototype explored single-step, prompt-based SQL generation.
- The current version is a **complete architectural redesign**, adding:
-
- * multi-stage agentic pipeline
- * schema introspection
- * safety guardrails
- * self-repair loop
- * caching
- * observability
- * Spider benchmarks
- * multi-DB support with upload + TTL handling
-
- This repository is the first **end-to-end, production-oriented** version.
-
- ---
-
- # **5) Key Features**
-
- ### ✔ Agentic Pipeline
-
- Planner → Generator → Safety → Executor → Verifier → Repair.
-
- ### ✔ Schema-Aware
-
- Automatic schema preview for any uploaded SQLite database.
-
- ### ✔ Safety by Design
-
- * SELECT-only
- * Column/table validation
- * No multi-statement SQL
- * Prevents schema hallucination
-
- ### ✔ Self-Repair
-
- Automatic minimal-diff correction when SQL fails.
-
- ### ✔ Caching
-
- TTL-based, with key = (db_id, normalized_query, schema_hash).
- Hit/miss metrics included.
-
- ### ✔ Observability
-
- * Per-stage latency
- * Pipeline success ratio
- * Repair success rate
- * Cache hit ratio
- * p95 latency
- * Full Grafana dashboard
-
- ### ✔ Benchmarks
-
- Reproducible Spider evaluation with plots + summary.

  ---

- # **6) Benchmarks (Spider dev, 20 samples)**
-
- [![Benchmarks](https://img.shields.io/badge/Benchmarks-Spider%20dev-blue)](#benchmarks)
-
- Evaluated on a curated 20-sample subset of the Spider **dev** split
- (focused on `concert_singer`), using the full production pipeline.
-
- ### 🧮 Summary
-
- * **Total samples:** 20
- * **Successful runs:** 20/20 (**100%**)
- * **Exact Match (EM):** **0.10**
- * **Structural Match (SM):** **0.70**
- * **Execution Accuracy:** **0.725**
-
- This reflects a *production-oriented* NL2SQL system:
- the model optimizes for **executable SQL**, not literal gold-string alignment.

  ---

- ### Latency
-
- * **Avg latency:** ~**8066 ms**
- * **p50:** ~**9229 ms**
- * **p95:** ~**14936 ms**
-
- Latency is **bimodal**:
- simple queries → fast, reasoning-heavy queries → planner-dominated.

  ---

- ### ⚙️ Per-Stage Latency
-
- | Stage     | Avg latency (ms) |
- | --------- | ---------------- |
- | detector  | ~1               |
- | planner   | ~8360            |
- | generator | ~1645            |
- | safety    | ~2               |
- | executor  | ~1               |
- | verifier  | ~1               |
- | repair    | ~1200            |
-
- Planner is the main bottleneck (expected for schema-level reasoning).
- Safety/executor/verifier stay **single-digit ms**.

  ---

- ### Failure Modes (Why EM is low)
-
- Even when EM = 0, **SM and ExecAcc are often 1.0**.
-
- Typical causes:
-
- * Capitalization differences (`Age` vs `age`)
- * Different column ordering
- * LIMIT differences
- * Alias mismatch
- * Gold SQL is `EMPTY` but the model infers a valid SQL
-
- In real-world systems, **execution correctness matters more than exact string match**.

  ---

- ### 📂 Reproducing the Benchmark
-
- ```bash
- export SPIDER_ROOT="$PWD/data/spider"
-
- PYTHONPATH=$PWD \
- python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20 --debug
-
- PYTHONPATH=$PWD \
- python benchmarks/plot_results.py
- ```
-
- Artifacts saved under:
-
- ```
- benchmarks/results_pro/<timestamp>/
-   summary.json
-   eval.jsonl
-   metrics_overview.png
-   latency_histogram.png
-   latency_per_stage.png
-   errors_overview.png
- ```

  ---

- # **7) API Usage**
-
- ## 🔍 NL → SQL
-
- ```bash
- curl -X POST "http://localhost:8000/api/v1/nl2sql" \
-   -H "Content-Type: application/json" \
-   -H "X-API-Key: dev-key" \
-   -d '{
-     "query": "Top 5 customers by total invoice amount",
-     "db_id": null
-   }'
- ```
-
- ### Sample Response (accurate)
-
- ```json
- {
-   "ambiguous": false,
-   "sql": "SELECT ...",
-   "rationale": "Explanation of why this SQL was generated.",
-   "result": {
-     "rows": 5,
-     "columns": ["CustomerId", "Total"],
-     "rows_data": [
-       [1, 39.6],
-       [2, 38.7],
-       [3, 35.4]
-     ]
-   },
-   "traces": [
-     {"stage": "detector",  "duration_ms": 1},
-     {"stage": "planner",   "duration_ms": 8943},
-     {"stage": "generator", "duration_ms": 1722},
-     {"stage": "safety",    "duration_ms": 2},
-     {"stage": "executor",  "duration_ms": 1},
-     {"stage": "verifier",  "duration_ms": 1},
-     {"stage": "repair",    "duration_ms": 522}
-   ]
- }
- ```
-
- ---
-
- ## 📁 Upload a SQLite DB
-
- ```bash
- curl -X POST "http://localhost:8000/api/v1/nl2sql/upload_db" \
-   -H "X-API-Key: dev-key" \
-   -F "file=@/path/to/db.sqlite"
- ```

  ---

- ## 📑 Schema Preview
-
  ```bash
- curl "http://localhost:8000/api/v1/nl2sql/schema?db_id=<uuid>" \
-   -H "X-API-Key: dev-key"
  ```

- ---
-
- # **8) Environment Variables**
-
- | Variable               | Purpose                                  |
- | ---------------------- | ---------------------------------------- |
- | `API_KEYS`             | Comma-separated list of backend API keys |
- | `API_KEY`              | Used by Gradio UI to call the backend    |
- | `DEV_MODE`             | Enables strict ambiguity detection       |
- | `NL2SQL_CACHE_TTL_SEC` | Cache TTL                                |
- | `NL2SQL_CACHE_MAX`     | Max cache entries                        |
- | `SPIDER_ROOT`          | Path to Spider dataset                   |
- | `USE_MOCK`             | Skip execution (for testing)             |
-
- > Gradio uses `API_KEY`; the backend expects it as `X-API-Key`.
- > Backend accepts multiple keys via `API_KEYS`.

  ---

- # **9) Future Work**
-
- ### 1) Streaming SQL Generation (SSE)
- ### 2) Redis Distributed Cache
- ### 3) Multi-Model Planner/Generator
- ### 4) A/B Testing Framework
- ### 5) Schema Embeddings
- ### 6) Nightly CI Benchmarks
- ### 7) Advanced Repair (diff-based)
- ### 8) Helm / Compose Deployment Template

  ---

- # **10) License**
-
- MIT License.
  sdk: docker
  pinned: false
  ---
+
+ # NL2SQL Copilot — Safety-First, Production-Grade Text-to-SQL
+
  [![CI](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml/badge.svg)](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml)
  [![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

+ A **production-oriented Natural Language → SQL system** built around **explicit safety guarantees, verification, evaluation, and observability**.
+
+ This project treats LLMs as **untrusted components** inside a constrained, measurable system — not as autonomous agents.

  ---

+ ## Demo (End-to-End)
+
+ A live interactive demo is available on Hugging Face Spaces: 👉 [**Try the Demo**](https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot)
+
+ <p align="center">
+   <img src="docs/assets/screenshots/demo_list_albums_total_sales.png" width="700">
+ </p>

  ---

+ ## Why this exists
+
+ Most Text-to-SQL demos answer:
+ > *“Can the model generate SQL?”*
+
+ This project answers a harder question:
+ > **“Can NL→SQL be operated safely as a production system?”**
+
+ That means:
+ - controlling **what the model sees** (context engineering),
+ - constraining **what it is allowed to execute** (safety),
+ - verifying results before returning them,
+ - and continuously measuring **accuracy, latency, and cost**.

  ---

+ ## What the system does
+
+ - Converts natural-language questions into **safe, verified SQL**
+ - Enforces **SELECT-only execution policies** (no DDL / DML)
+ - Uses **explicit context engineering** (schema packing + rules)
+ - Applies **execution and verification guardrails**
+ - Tracks **per-stage latency, errors, and cost signals**
+ - Evaluates accuracy on **Spider** with a structured error taxonomy
+ - Exposes **Prometheus metrics** and **Grafana dashboards**

  ---

+ ## Architecture & Pipeline
+
+ <p align="center">
+   <img src="docs/assets/architecture.png" width="720">
+ </p>
+
+ ```
+ Detector
+  → Planner
+  → Generator
+  → Safety Guard
+  → Executor
+  → Verifier
+  → Repair (bounded)
+ ```
+
+ Each stage:
+ - has a single responsibility,
+ - emits structured traces,
+ - is independently testable.
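The per-stage structured traces can be illustrated with a small helper. This is an editor's sketch only; the repository's actual trace schema and helper names are not shown in this diff:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_trace(traces: list, stage: str):
    """Time one pipeline stage and append a {stage, duration_ms} record.

    Illustrative sketch: the real pipeline's trace fields may differ.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = round((time.perf_counter() - start) * 1000)
        traces.append({"stage": stage, "duration_ms": elapsed_ms})

# Usage: wrap each stage so the final response can carry all traces.
traces: list = []
with stage_trace(traces, "planner"):
    pass  # ... run the planner here ...
```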

  ---

+ ## Core design principles
+
+ ### 1) Context engineering over prompt cleverness
+
+ The model never sees the raw database blindly.
+
+ Instead, it receives:
+ - a **deterministic schema pack**,
+ - explicit constraints (e.g. SELECT-only, LIMIT rules),
+ - and a bounded context budget.
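A deterministic, budget-bounded schema pack for SQLite could be sketched as follows. The function name, budget policy, and truncation strategy here are assumptions for illustration, not the project's API:

```python
import sqlite3

def schema_pack(db_path: str, max_chars: int = 2000) -> str:
    """Build a deterministic schema summary under a hard character budget.

    Sketch only: sorts tables by name so output is reproducible, then
    truncates at the budget. The real packer may summarize more cleverly.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT sql FROM sqlite_master "
            "WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    packed = "\n".join(r[0] for r in rows if r[0])
    return packed[:max_chars]  # enforce the context budget
```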
  ---

+ ### 2) Safety is enforced, not suggested
+
+ Safety policies are **system-level constraints**, not prompt instructions.
+
+ Current guarantees:
+ - Single-statement execution
+ - `SELECT` / `WITH` only
+ - No DDL / DML
+ - Execution time & result guards
+
+ Violations are **blocked**, not repaired.
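As a rough illustration of system-level enforcement, a guard of this shape can be written in a few lines. This is a simplified sketch, not the project's actual validator (which may use proper SQL parsing rather than pattern checks):

```python
import re

# Keywords that imply DDL/DML; a real guard would work on a parsed AST.
DENYLIST = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.I
)

def is_safe_select(sql: str) -> bool:
    """Allow only a single SELECT/WITH statement (illustrative sketch)."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:                              # multi-statement: block
        return False
    if not re.match(r"(select|with)\b", stripped, re.I):
        return False                                 # must read, not write
    if DENYLIST.search(stripped):                    # embedded DDL/DML: block
        return False
    return True
```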

  ---

+ ### 3) Verification before trust
+
+ Queries are executed in a controlled environment and verified for:
+ - structural validity,
+ - schema consistency,
+ - execution correctness.
+
+ Errors are surfaced explicitly and classified, not hidden.
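Error classification of this kind can be sketched with a tiny verifier over SQLite. The error-class names below are hypothetical examples, not the repository's actual taxonomy:

```python
import sqlite3

def verify(db: sqlite3.Connection, sql: str) -> dict:
    """Execute and classify failures instead of hiding them (sketch)."""
    try:
        cur = db.execute(sql)
        cols = [d[0] for d in cur.description] if cur.description else []
        return {"ok": True, "columns": cols, "error_class": None}
    except sqlite3.OperationalError as exc:
        msg = str(exc)
        # Hypothetical class names for illustration only.
        if "no such table" in msg or "no such column" in msg:
            klass = "schema_mismatch"
        else:
            klass = "execution_error"
        return {"ok": False, "columns": [], "error_class": klass}
```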

  ---

+ ### 4) Repair for reliability, not illusion
+
+ Repair exists to improve **system robustness**, not to chase accuracy at all costs.
+
+ - Triggered only for eligible error classes
+ - Disabled for safety violations
+ - Strictly bounded (no infinite loops)
  ---

+ ## Repository structure
+
+ ```text
+ app/            # FastAPI service (routes, schemas, wiring)
+ nl2sql/         # Core NL→SQL pipeline
+ adapters/       # Adapter implementations (DBs, LLMs)
+
+ benchmarks/     # Evaluation runners & outputs
+ tests/          # Unit & integration tests
+
+ prometheus/     # Prometheus configuration
+ grafana/        # Grafana provisioning
+ alertmanager/   # Alertmanager config
+ alert-receiver/ # Webhook receiver for alert testing
+
+ infra/          # Docker Compose & infra glue
+ configs/        # Runtime configs
+ scripts/        # Tooling & helpers
+
+ demo/           # Demo app
+ ui/             # UI surface
+ docs/           # Docs & screenshots
+ data/           # Local data & demo DBs
+ ```

  ---

+ ## Observability & GenAIOps
+
+ <p align="center">
+   <img src="docs/assets/grafana.png" width="720">
+ </p>
+
+ Tracked signals include:
+
+ * End-to-end latency (p50 / p95)
+ * Per-stage latency
+ * Success / failure counts
+ * Safety blocks
+ * Repair attempts & win-rate
+ * Cache hit / miss ratio
+ * Token usage (prompt / completion)
+
+ These metrics make **accuracy vs latency vs cost trade-offs** explicit.

  ---

+ ## Evaluation
+
+ The system is evaluated on the **Spider benchmark**.

  ```bash
+ make eval-spider
  ```

+ Metrics:
+
+ * Exact Match (EM)
+ * Execution Accuracy (ExecAcc)
+ * Semantic Match (SM)
+ * Latency distributions
+ * Error taxonomy breakdown
+
+ A **golden regression set** is used to detect accuracy regressions.
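Execution Accuracy of the kind listed above can be sketched as an order-insensitive comparison of result sets. This is illustrative only; Spider's official evaluator is considerably more involved:

```python
import sqlite3

def exec_accuracy(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """True if predicted and gold SQL return the same rows (sketch).

    Sorts rows so superficial ordering differences do not count as errors;
    a failing predicted query simply scores False.
    """
    def run(sql):
        return sorted(map(tuple, db.execute(sql).fetchall()))
    try:
        return run(pred_sql) == run(gold_sql)
    except sqlite3.Error:
        return False
```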

  ---

+ ## Roadmap
+
+ * AST-based SQL allowlisting
+ * Query cost heuristics (EXPLAIN-based)
+ * Cross-database adapters
+ * CI-level eval gating

  ---

+ ## What this project is *not*
+
+ * Not a prompt-only demo
+ * Not an autonomous agent playground
+ * Not optimized for leaderboard chasing
+
+ It is a **deliberately constrained, observable, and defendable AI system** —
+ built to be discussed seriously in production engineering interviews.