github-actions[bot] committed on
Commit 2eb4c72 · 1 parent: f0b4004

Sync from GitHub main @ d09d49d22ebb4eb7626a612fb689f1c69e543e0f

Files changed (2):
  1. README.md +164 -58
  2. demo/app.py +2 -2
README.md CHANGED
@@ -13,74 +13,103 @@ pinned: false
 [![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

- A **production-oriented Natural Language → SQL system** built around **explicit safety guarantees, verification, evaluation, and observability**.

- This project treats LLMs as **untrusted components** inside a constrained, measurable system — not as autonomous agents.
 
 ---

 ## Demo (End-to-End)
- A live interactive demo is available on Hugging Face Spaces: 👉 [**Try the Demo**](https://huggingface.co/spaces/melikakheirieh/nl2sql-copilot)

 <p align="center">
 <img src="docs/assets/screenshots/demo_list_albums_total_sales.png" width="700">
 </p>

-

 ## Quickstart (Local)

 ### 1) Install
 ```bash
 make install
 ```

 ### 2) Run API (Terminal 1)
 ```bash
 make demo-up
 ```

 ### 3) Smoke (Terminal 2)
 ```bash
 make demo-smoke
 ```

 ### 4) Observability stack (optional)
 ```bash
 make infra-up
 ```

 Then (optional Prometheus snapshot):
 ```bash
 make demo-metrics
 ```

 ---
- ## Why this exists
-
- Most Text-to-SQL demos answer:
- > *“Can the model generate SQL?”*
-
- This project answers a harder question:
- > **“Can NL→SQL be operated safely as a production system?”**
-
- That means:
- - controlling **what the model sees** (context engineering),
- - constraining **what it is allowed to execute** (safety),
- - verifying results before returning them,
- - and continuously measuring **accuracy, latency, and cost**.
-
- ---
-
 ## What the system does

- - Converts natural-language questions into **safe, verified SQL**
- - Enforces **SELECT-only execution policies** (no DDL / DML)
- - Uses **explicit context engineering** (schema packing + rules)
- - Applies **execution and verification guardrails**
- - Tracks **per-stage latency, errors, and cost signals**
- - Evaluates accuracy on **Spider** with a structured error taxonomy
- - Exposes **Prometheus metrics** and **Grafana dashboards**

 ---
@@ -91,7 +120,6 @@ That means:
 </p>

 ```
-
 Detector
 → Planner
 → Generator
@@ -99,57 +127,67 @@ Detector
 → Executor
 → Verifier
 → Repair (bounded)

- ````

 Each stage:
- - has a single responsibility,
- - emits structured traces,
- - is independently testable.

 ---
 ## Core design principles

 ### 1) Context engineering over prompt cleverness
 The model never sees the raw database blindly.

 Instead, it receives:
- - a **deterministic schema pack**,
- - explicit constraints (e.g. SELECT-only, LIMIT rules),
- - and a bounded context budget.

 ---

 ### 2) Safety is enforced, not suggested
 Safety policies are **system-level constraints**, not prompt instructions.

 Current guarantees:
- - Single-statement execution
- - `SELECT` / `WITH` only
- - No DDL / DML
- - Execution time & result guards

 Violations are **blocked**, not repaired.

 ---

 ### 3) Verification before trust
 Queries are executed in a controlled environment and verified for:
- - structural validity,
- - schema consistency,
- - execution correctness.

 Errors are surfaced explicitly and classified — not hidden.

 ---

 ### 4) Repair for reliability, not illusion
 Repair exists to improve **system robustness**, not to chase accuracy at all costs.

- - Triggered only for eligible error classes
- - Disabled for safety violations
- - Strictly bounded (no infinite loops)

 ---
@@ -163,8 +201,7 @@ adapters/ # Adapter implementations (DBs, LLMs)
 benchmarks/ # Evaluation runners & outputs
 tests/ # Unit & integration tests

-
- infra/ # Docker Compose + observability stack (Prometheus/Grafana/Alertmanager)
 configs/ # Runtime configs
 scripts/ # Tooling & helpers

@@ -172,7 +209,7 @@ demo/ # Demo app
 ui/ # UI surface
 docs/ # Docs & screenshots
 data/ # Local data & demo DBs
- ````

 ---
@@ -182,7 +219,12 @@ data/ # Local data & demo DBs
 <img src="docs/assets/grafana.png" width="720">
 </p>

- Tracked signals include:

 * End-to-end latency (p50 / p95)
 * Per-stage latency
@@ -192,27 +234,91 @@ Tracked signals include:
 * Cache hit / miss ratio
 * Token usage (prompt / completion)

- These metrics make **accuracy vs latency vs cost trade-offs** explicit.

 ---

 ## Evaluation

- The system is evaluated on the **Spider benchmark**.

 ```bash
- make eval-spider
 ```

- Metrics:

 * Exact Match (EM)
- * Execution Accuracy (ExecAcc)
- * Semantic Match (SM)
- * Latency distributions
- * Error taxonomy breakdown

- A **golden regression set** is used to detect accuracy regressions.

 ---
 [![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

+ A **production-oriented, multi-stage Natural Language → SQL system** built around
+ **explicit safety guarantees, verification, evaluation, and observability** — **not a prompt-only demo**.

+ This project treats LLMs as **untrusted components** inside a constrained, measurable system,
+ and operates them through an explicit pipeline with verification and a **bounded repair loop**.
+
+ ---
+
+ ## At a glance
+
+ * **Agentic NL2SQL pipeline** with explicit planning, verification, and a bounded repair loop
+ * **Safety-first execution** enforced at the system level (single-statement, SELECT-only)
+ * **Failure-aware design** with verifier-driven repair and a structured error taxonomy
+ * **Built-in evaluation**: lightweight smoke runs and Spider-based benchmarks with artifacts
+ * **Real observability**: Prometheus/Grafana metrics with CI-level drift guards
+
+ Reported metrics are **honest engineering baselines**, intended to validate system behavior
+ and debuggability — not to claim state-of-the-art accuracy.
+
+ ---
+
+ ## Why this exists
+
+ Most Text-to-SQL demos answer:
+
+ > *“Can the model generate SQL?”*
+
+ This project answers a harder question:
+
+ > **“Can NL→SQL be operated safely as a production system?”**
+
+ That means:
+
+ * controlling **what the model sees** (deterministic schema packing and context budgeting),
+ * constraining **what it is allowed to execute** (system-enforced safety, not prompt suggestions),
+ * verifying results before returning them,
+ * and continuously measuring **latency, failures, and repair behavior**.
+
+ This repository is intentionally scoped to answer that question with
+ **engineering guarantees** — not model cleverness or prompt tricks.
 ---

 ## Demo (End-to-End)
+
+ A live interactive demo is available on Hugging Face Spaces:
+ 👉 [**Try the Demo**](https://huggingface.co/spaces/melikakheirieh/nl2sql-copilot)
+
 <p align="center">
 <img src="docs/assets/screenshots/demo_list_albums_total_sales.png" width="700">
 </p>

+ ---

 ## Quickstart (Local)

 ### 1) Install
+
 ```bash
 make install
 ```

 ### 2) Run API (Terminal 1)
+
 ```bash
 make demo-up
 ```

 ### 3) Smoke (Terminal 2)
+
 ```bash
 make demo-smoke
 ```

 ### 4) Observability stack (optional)
+
 ```bash
 make infra-up
 ```

 Then (optional Prometheus snapshot):
+
 ```bash
 make demo-metrics
 ```

 ---
 ## What the system does

+ * Converts natural-language questions into **safe, verified SQL**
+ * Enforces **SELECT-only execution policies** (no DDL / DML)
+ * Uses **explicit context engineering** (schema packing + rules)
+ * Applies **execution and verification guardrails**
+ * Tracks **per-stage latency, errors, and cost signals**
+ * Evaluates behavior on **Spider** with a structured error taxonomy
+ * Exposes **Prometheus metrics** and **Grafana dashboards**

 ---
 </p>

 ```
 Detector
 → Planner
 → Generator

 → Executor
 → Verifier
 → Repair (bounded)
+ ```

+ The pipeline is designed so that **failures are explicit, classified, and observable** —
+ not hidden behind retries or prompt heuristics.

 Each stage:
+
+ * has a single responsibility,
+ * emits structured traces,
+ * and is independently testable.

 ---
 ## Core design principles

 ### 1) Context engineering over prompt cleverness
+
 The model never sees the raw database blindly.

 Instead, it receives:
+
+ * a **deterministic schema pack**,
+ * explicit constraints (e.g. SELECT-only, LIMIT rules),
+ * and a bounded context budget.
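A deterministic schema pack with a hard budget can be sketched in a few lines. This is an illustrative helper, not the repository's actual packer; `schema_pack` and its parameters are invented for the example:

```python
import sqlite3

def schema_pack(conn: sqlite3.Connection, max_chars: int = 2000) -> str:
    """Render tables and columns in a fixed order so the context is deterministic."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, ...); index 1 is the column name.
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        lines.append(f"{table}({', '.join(cols)})")
    # Enforce a bounded context budget instead of trusting the model with everything.
    return "\n".join(lines)[:max_chars]
```

Stable ordering plus a hard character budget keeps prompts reproducible across runs, which is the point of "deterministic" here.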
 ---

 ### 2) Safety is enforced, not suggested
+
 Safety policies are **system-level constraints**, not prompt instructions.

 Current guarantees:
+
+ * Single-statement execution
+ * `SELECT` / `WITH` only
+ * No DDL / DML
+ * Execution time & result guards

 Violations are **blocked**, not repaired.
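A minimal sketch of what such a system-level guard can look like (names are hypothetical and the repository's real checks are richer; a naive keyword scan like this over-blocks string literals, so real systems parse the statement instead):

```python
import re

ALLOWED_PREFIXES = ("select", "with")
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|attach|pragma)\b",
    re.IGNORECASE,
)

def check_sql(sql: str) -> None:
    """Raise on anything that is not a single SELECT/WITH statement."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        raise ValueError("blocked: multiple statements")
    if not stmt.lower().startswith(ALLOWED_PREFIXES):
        raise ValueError("blocked: only SELECT/WITH is allowed")
    if FORBIDDEN.search(stmt):
        raise ValueError("blocked: DDL/DML keyword detected")
```

The key property is where the check runs: in the executor, after generation, so a prompt-injected `DROP TABLE` is rejected no matter what the model was told.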
 ---

 ### 3) Verification before trust
+
 Queries are executed in a controlled environment and verified for:
+
+ * structural validity,
+ * schema consistency,
+ * execution correctness.

 Errors are surfaced explicitly and classified — not hidden.

 ---

 ### 4) Repair for reliability, not illusion
+
 Repair exists to improve **system robustness**, not to chase accuracy at all costs.

+ * Triggered only for eligible error classes
+ * Disabled for safety violations
+ * Strictly bounded (no infinite loops)
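This repair policy can be sketched as a bounded, eligibility-gated loop. This is a sketch under assumed interfaces: `generate` and `verify` are hypothetical stand-ins for the pipeline's generator and verifier, and the bound is made up:

```python
MAX_REPAIRS = 2  # hypothetical hard bound; a real limit would live in config

REPAIRABLE = {"syntax_error", "unknown_column"}  # eligible error classes only

def run_with_repair(question, generate, verify):
    sql = generate(question, feedback=None)
    for attempt in range(MAX_REPAIRS + 1):
        error = verify(sql)  # None means the query passed verification
        if error is None:
            return sql
        # Safety violations (and other ineligible classes) are never repaired,
        # and the loop terminates once the repair budget is spent.
        if error["class"] not in REPAIRABLE or attempt == MAX_REPAIRS:
            raise RuntimeError(f"unrecoverable: {error['class']}")
        sql = generate(question, feedback=error)  # one bounded retry
    raise RuntimeError("exhausted repair budget")  # defensive; loop returns or raises
```

The loop structure makes the three bullets mechanical: eligibility is a set membership test, safety violations short-circuit, and `MAX_REPAIRS` rules out infinite loops.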

 ---
 benchmarks/ # Evaluation runners & outputs
 tests/ # Unit & integration tests

+ infra/ # Docker Compose + observability stack
 configs/ # Runtime configs
 scripts/ # Tooling & helpers

 ui/ # UI surface
 docs/ # Docs & screenshots
 data/ # Local data & demo DBs
+ ```

 ---
 <img src="docs/assets/grafana.png" width="720">
 </p>

+ Observability in this system is a **first-class design concern**, not an afterthought.
+ Metrics are used to make **LLM behavior measurable, debuggable, and regressible**
+ across safety, latency, failures, and repair — both in live runs and during evaluation.
+
+ The following signals are tracked to expose **system-level trade-offs**
+ between accuracy, latency, safety, and cost:

 * End-to-end latency (p50 / p95)
 * Per-stage latency

 * Cache hit / miss ratio
 * Token usage (prompt / completion)

+ Repair behavior is intentionally observable.
+ Repair attempts, outcomes, and win-rates are tracked explicitly,
+ making it possible to distinguish genuine recovery from masked failures.
+
+ ### Metric drift guards (CI)
+
+ To prevent silent regressions, the repository includes CI-level guards
+ that validate metric wiring and naming consistency across code, dashboards, and rules.
+
+ This ensures that observability **cannot silently decay** as the system evolves.
+
+ ```bash
+ make metrics-check
+ ```
+
+ Observability and evaluation are intentionally aligned.
+ Latency distributions, failure classes, and repair outcomes observed during benchmarks
+ map directly to runtime metrics and dashboards,
+ closing the loop between offline evaluation and live operation.
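The p50/p95 signals above can be derived from per-stage trace durations; a stdlib sketch with made-up numbers (in practice these values come from Prometheus histograms, not application code like this):

```python
import statistics

# Per-stage trace for one request (hypothetical durations, in ms).
traces = [
    {"stage": "planner", "duration_ms": 12.0},
    {"stage": "generator", "duration_ms": 230.0},
    {"stage": "executor", "duration_ms": 45.0},
]
end_to_end_ms = sum(t["duration_ms"] for t in traces)  # 287.0

# Aggregate many end-to-end samples into p50 / p95 cut points.
samples = [180.0, 210.0, 260.0, 290.0, 950.0]
cuts = statistics.quantiles(samples, n=100)
p50, p95 = cuts[49], cuts[94]
```

Note how the single 950 ms outlier pulls p95 far above p50, which is exactly why both are tracked.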
 ---

 ## Evaluation

+ Evaluation results in this repository are reported as **engineering baselines**.
+ Their purpose is to validate **system behavior** — such as failure modes,
+ latency distributions, and repair effectiveness —
+ rather than to optimize leaderboard scores or claim state-of-the-art accuracy.
+
+ ### Benchmark dashboard
+
+ Evaluation runs can be inspected interactively via a Streamlit dashboard,
+ designed for **diagnostics and system-level inspection** rather than score reporting.
+
+ <p align="center">
+ <img src="docs/assets/benchmark_dashboard.png" width="720">
+ </p>
+
+ The dashboard exposes:
+ - end-to-end and per-stage latency distributions (p50 / p95),
+ - success and failure rates,
+ - and high-level signals for repair and system behavior during evaluation.
+
+ This view is intentionally scoped to **behavioral and operational signals**,
+ and complements the raw artifacts written to `benchmarks/results*/`.
+
+ ### Evaluation modes
+
+ #### 1) Eval Lite (Smoke & diagnostics)
+
+ Eval Lite focuses on **operational signals**, not gold accuracy:
+
+ * end-to-end and per-stage latency
+ * success vs failure rates
+ * repair attempts and outcomes
+ * error class distribution
+
+ Designed for **fast feedback**, regression detection, and system diagnostics.

 ```bash
+ make eval-smoke
 ```

+ Artifacts are written to `benchmarks/results/<timestamp>/`.
+
+ ---
+
+ #### 2) Eval Pro (Spider benchmark)
+
+ Eval Pro runs the pipeline against the **Spider benchmark** to evaluate:

 * Exact Match (EM)
+ * Execution Accuracy
+ * Semantic Match
+ * latency distributions
+ * structured error taxonomy
+
+ Intended for **reproducible, comparable runs**, not ad-hoc demos.
+
+ ```bash
+ make eval-pro-smoke # ~20 examples
+ make eval-pro # ~200 examples
+ ```

+ Artifacts are written to `benchmarks/results_pro/<timestamp>/`.

 ---
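As a concrete illustration of the Execution Accuracy idea, one common formulation treats a prediction as correct when it returns the same rows as the gold query. The sketch below uses a toy SQLite table and is not the benchmark's exact comparison logic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE albums(id INTEGER, title TEXT, sales INTEGER);"
    "INSERT INTO albums VALUES (1, 'A', 10), (2, 'B', 30);"
)

def exec_match(pred_sql: str, gold_sql: str) -> bool:
    """Execution accuracy: predicted and gold queries yield the same result set."""
    pred = sorted(conn.execute(pred_sql).fetchall())
    gold = sorted(conn.execute(gold_sql).fetchall())
    return pred == gold
```

Unlike Exact Match, this accepts syntactically different but semantically equivalent SQL, which is why the two metrics are reported separately.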
demo/app.py CHANGED
@@ -145,7 +145,7 @@ def query_to_sql(
     # Compute simple latency badge from traces (sum of duration_ms)
     badges_text = ""
-    if traces and all("duration_ms" in t for t in traces):
+    if _debug_flag and traces and all("duration_ms" in t for t in traces):
         total_ms = sum(float(t.get("duration_ms", 0.0)) for t in traces)
         badges_text = f"latency≈{int(total_ms)}ms"

@@ -200,7 +200,7 @@ def build_ui() -> gr.Blocks:
     )
     debug = gr.Checkbox(
         label="Debug (UI only)",
-        value=True,
+        value=False,
         scale=1,
     )
     run = gr.Button("Run")
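Taken together, the two changes gate the latency badge behind the debug flag, which is now off by default. The patched logic behaves like this in isolation (a standalone re-creation with `_debug_flag` passed explicitly):

```python
def latency_badge(traces, debug_flag):
    # Mirrors the patched demo/app.py logic: badge only in debug mode,
    # and only when every trace carries a duration_ms field.
    badges_text = ""
    if debug_flag and traces and all("duration_ms" in t for t in traces):
        total_ms = sum(float(t.get("duration_ms", 0.0)) for t in traces)
        badges_text = f"latency≈{int(total_ms)}ms"
    return badges_text

print(latency_badge([{"duration_ms": 120.5}, {"duration_ms": 80.2}], debug_flag=True))
# → latency≈200ms
print(latency_badge([{"duration_ms": 120.5}], debug_flag=False))
# → prints an empty string: the badge is hidden outside debug mode
```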