Melika Kheirieh committed on
Commit
8f50117
·
1 Parent(s): 7b9903c

docs(readme): revamp and polish README for production showcase

Files changed (34)
  1. README.md +126 -134
  2. benchmarks/results_pro/20251109-110149/eval.jsonl +0 -5
  3. benchmarks/results_pro/20251109-110149/latency_histogram.png +0 -0
  4. benchmarks/results_pro/20251109-110149/latency_per_stage.png +0 -0
  5. benchmarks/results_pro/20251109-110149/metrics_overview.png +0 -0
  6. benchmarks/results_pro/20251109-110149/results.csv +0 -6
  7. benchmarks/results_pro/20251109-110149/summary.json +0 -21
  8. benchmarks/results_pro/20251109-123149/eval.jsonl +0 -5
  9. benchmarks/results_pro/20251109-123149/latency_histogram.png +0 -0
  10. benchmarks/results_pro/20251109-123149/latency_per_stage.png +0 -0
  11. benchmarks/results_pro/20251109-123149/metrics_overview.png +0 -0
  12. benchmarks/results_pro/20251109-123149/results.csv +0 -6
  13. benchmarks/results_pro/20251109-123149/summary.json +0 -21
  14. benchmarks/results_pro/20251109-123424/eval.jsonl +0 -5
  15. benchmarks/results_pro/20251109-123424/results.csv +0 -6
  16. benchmarks/results_pro/20251109-123424/summary.json +0 -21
  17. benchmarks/results_pro/20251109-124602/eval.jsonl +0 -5
  18. benchmarks/results_pro/20251109-124602/latency_histogram.png +0 -0
  19. benchmarks/results_pro/20251109-124602/latency_per_stage.png +0 -0
  20. benchmarks/results_pro/20251109-124602/metrics_overview.png +0 -0
  21. benchmarks/results_pro/20251109-124602/results.csv +0 -6
  22. benchmarks/results_pro/20251109-124602/summary.json +0 -21
  23. benchmarks/results_pro/20251109-125209/eval.jsonl +0 -5
  24. benchmarks/results_pro/20251109-125209/latency_histogram.png +0 -0
  25. benchmarks/results_pro/20251109-125209/latency_per_stage.png +0 -0
  26. benchmarks/results_pro/20251109-125209/metrics_overview.png +0 -0
  27. benchmarks/results_pro/20251109-125209/results.csv +0 -6
  28. benchmarks/results_pro/20251109-125209/summary.json +0 -21
  29. benchmarks/results_pro/20251109-171247/eval.jsonl +20 -0
  30. benchmarks/results_pro/20251109-171247/latency_histogram.png +0 -0
  31. benchmarks/results_pro/20251109-171247/latency_per_stage.png +0 -0
  32. benchmarks/results_pro/20251109-171247/metrics_overview.png +0 -0
  33. benchmarks/results_pro/20251109-171247/results.csv +21 -0
  34. benchmarks/results_pro/20251109-171247/summary.json +21 -0
README.md CHANGED
@@ -1,198 +1,190 @@
  # 🧩 NL2SQL Copilot

- A modular **Text-to-SQL Copilot** that converts natural-language questions into **safe, verified SQL queries**.
- Built with **FastAPI**, **LangGraph**, and **SQLAlchemy**, designed for **read-only databases** and benchmarked on **Spider** and **Dr.Spider** datasets.

- ---
-
- > 💡 **Why it matters**
- > In real analytics teams, analysts often need quick insights without writing SQL.
- > **NL2SQL Copilot** bridges that gap by translating plain-English questions into validated, read-only SQL — reducing query errors and saving hours of analyst time.
- >
- > 🧬 **Evolution Note**
- > This repository is the next-generation version of [NL2SQL Copilot Prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype).
- > It refactors the original prototype into a **production-grade, modular architecture** —
- > adding configuration-driven pipelines, safety layers, benchmarks, and a Streamlit UI for evaluation.

  ---

- ## 🚀 Quick Start
-
- ### 1️⃣ Clone the repo
- ```bash
- git clone https://github.com/melika-kheirieh/nl2sql-copilot.git
- cd nl2sql-copilot
- ````
-
- ### 2️⃣ Build and run with Docker

- ```bash
- docker build -t nl2sql-copilot .
- docker run --rm -p 8000:8000 nl2sql-copilot
- ```

- Then open [http://localhost:8000/docs](http://localhost:8000/docs) 🚀
- Or launch the [Streamlit Demo](http://localhost:7860) to test it interactively.

  ---

- ## 🧠 Demo

- 🎯 **Live Demo:** [Try it on Hugging Face Spaces →](https://huggingface.co/spaces/melika-kheirieh/nl2sql-copilot)

- You can ask a question in plain English — the Copilot plans, generates, verifies, and safely executes an SQL query.

- **User Query**

- > show top 5 albums by total sales

- **Generated SQL**

- ```sql
- SELECT albums.Title, SUM(invoice_items.UnitPrice * invoice_items.Quantity) AS total_sales
- FROM albums
- JOIN tracks ON albums.AlbumId = tracks.AlbumId
- JOIN invoice_items ON invoice_items.TrackId = tracks.TrackId
- GROUP BY albums.Title
- ORDER BY total_sales DESC
- LIMIT 5;
- ```

- **Execution Result (preview)**

- | Album | Total Sales |
- | ----------------- | ----------- |
- | Greatest Hits | 155.34 |
- | Let There Be Rock | 133.09 |
- | Big Ones | 128.44 |

- **Trace**

- ```json
- [
- {"stage": "planner", "duration_ms": 38, "summary": "Identified SQL intent"},
- {"stage": "generator", "duration_ms": 201, "summary": "LLM generated SQL"},
- {"stage": "safety", "duration_ms": 6, "summary": "Validated SELECT-only"},
- {"stage": "executor", "duration_ms": 92, "summary": "Executed successfully"}
- ]
- ```

- ![Demo Screenshot](docs/demo-screenshot.png)

- ---

- ## 🧱 Project Structure

- ```
- nl2sql-copilot/
- │
- ├── app/ # FastAPI app, routers, schemas
- ├── nl2sql/ # Core pipeline (Planner → Generator → Safety → Executor → Verifier)
- ├── adapters/ # Database and LLM adapters
- ├── benchmarks/ # Evaluation scripts and results
- ├── ui/ # Streamlit dashboard (demo + benchmark)
- ├── configs/ # Pipeline configs (YAML-based)
- │
- ├── Dockerfile
- ├── requirements.in / .txt
- └── README.md
- ```

  ---

- ## ⚙️ How It Works

- The Copilot runs a **multi-stage pipeline** ensuring every SQL query is both correct and safe:

- ```
- Natural Language
- ↓
- [ Planner ] → [ Generator (LLM) ] → [ Safety ] → [ Executor ] → [ Verifier ] → [ Repair ]
- ```

- Each stage is modular and configurable via `configs/pipeline.yaml`.
- All queries execute inside a **read-only sandbox**.

- ---

- ## 🔒 Safety Layer

- Before execution, every SQL statement is validated:

- | Rule | Example Blocked |
- | ------------------ | ----------------------------- |
- | DML not allowed | `DELETE FROM users` |
- | Multi-statement | `SELECT *; DROP TABLE users` |
- | Forbidden keywords | `ALTER`, `TRUNCATE`, `UPDATE` |

- ✅ Only safe, single-statement `SELECT` queries are executed.

  ---

- ## 📊 Benchmark (sample)

- Evaluated on a subset of the [Spider](https://yale-lily.github.io/spider) dataset using `gpt-4o-mini`:

- | Query | Type | Correct | Latency (ms) | Model |
- | --------------------------- | ------------- | ------- | ------------ | ----------- |
- | list all artists | simple select | ✅ | 118 | gpt-4o-mini |
- | total invoices per country | aggregation | ✅ | 127 | gpt-4o-mini |
- | top 3 customers by spending | aggregation | ✅ | 141 | gpt-4o-mini |
- | albums released before 2000 | filter | ✅ | 122 | gpt-4o-mini |
- | top 5 sales by genre | join | ✅ | 149 | gpt-4o-mini |

- *(see `benchmarks/results.csv` for detailed results)*

  ---

- ## 🧩 Key Features

- * ✅ **Modular pipeline** (Planner → Generator → Safety → Executor → Verifier → Repair)
- * 🛡️ **SQL safety filters** (SELECT-only, blacklist, AST validation)
- * 🔁 **Self-repair loop** for failed executions
- * 🧠 **LLM-driven generator** (OpenAI / Ollama / Anthropic)
- * 📊 **Evaluation toolkit** for latency / accuracy / cost
- * ⚙️ **Config-driven architecture** (`Pipeline.from_config("configs/pipeline.yaml")`)
- * 🧰 **PostgreSQL + SQLite adapters**
- * 🎛️ **Streamlit UI** for interactive demo & benchmark
- * 🧩 Built with **FastAPI**, **LangGraph**, **Pydantic-AI**, **SQLAlchemy**

- ---

- ## 🧰 Tech Stack

- | Layer | Tools / Libraries |
- | ------------- | ----------------------------------------- |
- | Backend API | FastAPI, Uvicorn |
- | Pipeline Core | Python 3.12, Pydantic, SQLGlot |
- | LLM Interface | Pydantic-AI (OpenAI / Anthropic / Ollama) |
- | Database | SQLite (default), PostgreSQL |
- | Evaluation | Spider / Dr.Spider |
- | UI | Streamlit + Plotly |
- | CI/CD | GitHub Actions, Makefile, Docker |

  ---

- ## 🧪 Development

  ```bash
- pip install -r requirements.txt
- pytest -q
- ruff check .
- mypy .
  ```

  ---

- ## 🧭 Roadmap

- * [ ] Add multilingual query support (Persian / English)
- * [ ] Improve self-repair accuracy
- * [ ] Add cost tracking per query
- * [ ] Integrate Prometheus metrics

  ---

  ## 📄 License

- MIT © 2025 [Melika Kheirieh](https://github.com/melika-kheirieh)
 
  # 🧩 NL2SQL Copilot

+ [![CI](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml/badge.svg)](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml)
+ [![Docker](https://img.shields.io/badge/docker-ready-blue?logo=docker)](#)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

+ A production-grade **Text-to-SQL Copilot** that converts natural-language questions into **safe, verified SQL**.
+ Built for analytics engineers who need accuracy, transparency, and control — powered by **FastAPI**, **LangGraph**, and **Pydantic-AI**.

  ---

+ ## 🚀 Overview

+ `NL2SQL Copilot` is an **agentic, modular pipeline** that plans, generates, verifies, and repairs SQL queries.
+ It ensures correctness and safety through structured stages, evaluation on the **Spider** dataset, and full observability support.

+ > 💡 Designed for **read-only production databases** with **self-repair**, **metrics**, and **CI/CD** baked in.

  ---

+ ## 🧠 Agentic Architecture

+ ```

+ Natural Language
+ ↓
+ [ Detector ]
+ ↓
+ [ Planner ]
+ ↓
+ [ Generator (LLM) ]
+ ↓
+ [ Safety ]
+ ↓
+ [ Executor ]
+ ↓
+ [ Verifier ]
+ ↓
+ [ Repair ]

+ ````

+ Each stage is isolated, configurable via YAML, and observable through structured traces and Prometheus metrics.

+ | Stage | Responsibility |
+ |--------|----------------|
+ | **Detector** | Identify whether a query is Text-to-SQL |
+ | **Planner** | Extract user intent and SQL plan |
+ | **Generator** | Call LLM to synthesize SQL |
+ | **Safety** | Block unsafe or non-SELECT queries |
+ | **Executor** | Execute query in read-only sandbox |
+ | **Verifier** | Compare results, detect mismatch |
+ | **Repair** | Self-healing loop triggered on failure |

+ ---

+ ## 📊 Benchmark (Spider dataset)

+ Dataset: [Spider](https://yale-lily.github.io/spider) by Yale LILY Lab.
+ Evaluated on the **Spider dev subset (20 samples)** using the reproducible evaluation toolkit.

+ | Metric | Value |
+ |--------|--------|
+ | EM (Exact Match) | 0.15 |
+ | SM (Structural Match) | 0.70 |
+ | ExecAcc (Execution Accuracy) | 0.73 |
+ | Avg Latency | 8.11 s |
+ | p50 Latency | 9.42 s |
+ | p95 Latency | 13.88 s |

+ > High **Structural Match** and **Execution Accuracy** indicate strong semantic correctness;
+ > lower EM reflects harmless formatting differences.

+ Run reproducible benchmarks:

+ ```bash
+ export SPIDER_ROOT="$PWD/data/spider"
+ PYTHONPATH=$PWD python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20
+ PYTHONPATH=$PWD python benchmarks/plot_results.py
+ ````

+ Results & plots → `benchmarks/results_pro/20251109-171247/`

+ ![Metrics Overview](benchmarks/results_pro/20251109-171247/metrics_overview.png)

  ---

+ ## ⚙️ Key Features

+ ✅ **Agentic architecture** – multi-stage pipeline with feedback loop

+ 🛡️ **Safety layer** – SELECT-only guardrails and AST validation

+ 🔁 **Self-repair** – automatic retry when verification fails

+ 📊 **Reproducible evaluation** – integrated Spider / Dr.Spider benchmarking

+ 📦 **Config-driven design** – YAML pipeline factory

+ 🧩 **Plug-and-play adapters** – SQLite / PostgreSQL / OpenAI / Anthropic / Ollama

+ 🧠 **FastAPI service + Streamlit UI** – demo or API mode

+ 🧰 **CI/CD ready** – Makefile, Ruff, Mypy, Pytest, Docker, GitHub Actions
+
+ 📈 **Observability stack** – Prometheus & Grafana metrics for latency and errors

  ---

+ ## 🧩 Observability & GenAIOps
+
+ Monitor every stage of the pipeline in real-time:

+ * `/metrics` endpoint exposed via FastAPI
+ * Prometheus + Grafana stack with `make obs-up`
+ * Metrics tracked:

+ * `nl2sql_stage_latency_ms`
+ * `nl2sql_stage_error_total`
+ * `nl2sql_query_exec_count`
+ * `nl2sql_repair_success_rate`

+ ```bash
+ make obs-up # start Prometheus + Grafana
+ make obs-down # stop the stack
+ ```

  ---

+ ## 🧪 Quick Start

+ ### 1️⃣ Clone & Run

+ ```bash
+ git clone https://github.com/melika-kheirieh/nl2sql-copilot.git
+ cd nl2sql-copilot
+ make run
+ ```

+ Or build with Docker:

+ ```bash
+ docker build -t nl2sql-copilot .
+ docker run --rm -p 8000:8000 nl2sql-copilot
+ ```
+
+ API available at [http://localhost:8000/docs](http://localhost:8000/docs)
+ Streamlit demo at [http://localhost:7860](http://localhost:7860)

  ---

+ ## 🧭 For Developers & CI/CD

  ```bash
+ make lint # Ruff
+ make typecheck # Mypy
+ make test # Pytest
+ make bench # Run benchmark suite
  ```

+ ### CI/CD Highlights
+
+ * Runs on GitHub Actions (`make check`)
+ * Enforces formatting, typing, tests, and Docker build
+ * Publishes Docker image to GHCR on successful merge
+
+ ---
+
+ ## 🎯 Why it matters
+
+ * Bridges **natural language and databases** with measurable reliability
+ * Provides **reproducible evaluation** for continuous model tracking
+ * Delivers **production-level resilience** via self-repair and observability
+ * Demonstrates **AI software engineering** beyond prompt design
+
  ---

+ ## 👤 Author
+
+ **Melika Kheirieh**
+ AI Engineer & Researcher in Natural Language Interfaces for Databases
+ [GitHub](https://github.com/melika-kheirieh) · [LinkedIn](https://www.linkedin.com/in/melika-kheirieh-03a7b5176/)

+ > This project evolved from [NL2SQL Copilot Prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype), refactored into a production-grade, modular agent.

  ---

  ## 📄 License

+ MIT © 2025 Melika Kheirieh
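
Note on the Safety stage: the README describes it only as blocking unsafe or non-SELECT queries, with "SELECT-only guardrails and AST validation"; the implementation itself is not part of this diff. As a rough illustration of the rule set in the removed Safety Layer table (single statement, SELECT only, no forbidden keywords), a minimal string-level guard might look like the sketch below. The function name `is_safe_select` and the keyword list are assumptions made for this example, not the project's API.

```python
import re

# Hypothetical keyword list for illustration; the project's real rules are not shown in this diff.
FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "truncate", "create")
_FORBIDDEN_RE = re.compile(r"\b(" + "|".join(FORBIDDEN) + r")\b", re.IGNORECASE)
_SELECT_RE = re.compile(r"select\b", re.IGNORECASE)


def is_safe_select(sql: str) -> bool:
    """Return True only for a single statement that starts with SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        return False
    # Any semicolon left after trimming means more than one statement was supplied.
    if ";" in statement:
        return False
    if not _SELECT_RE.match(statement):
        return False
    # Naive scan: this also rejects forbidden words that appear inside string literals.
    return _FORBIDDEN_RE.search(statement) is None


print(is_safe_select("SELECT count(*) FROM singer"))   # True
print(is_safe_select("SELECT *; DROP TABLE users"))    # False (multi-statement)
print(is_safe_select("DELETE FROM users"))             # False (not a SELECT)
```

A keyword scan like this produces false positives (for example, a `WHERE` clause comparing against the literal `'update'`), which is one reason a parser-based check is usually preferred for anything beyond a demo.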
benchmarks/results_pro/20251109-110149/eval.jsonl DELETED
@@ -1,5 +0,0 @@
- {"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer limit 1", "ok": true, "latency_ms": 12419, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7389, "summary": "ok", "notes": {"len_plan": 1297}, "token_in": 265, "token_out": 305, "cost_usd": 0.00022274999999999997, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 3333, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 787, "token_out": 19, "cost_usd": 0.00012945, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 2, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 812, "summary": "ok", "notes": {"old_sql_len": 27, "new_sql_len": 35, "attempt": 1}, "token_in": 313, "token_out": 8, "cost_usd": 5.175e-05, "sql_length": null, "row_count": 
null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 867, "summary": "ok", "notes": {"old_sql_len": 35, "new_sql_len": 35, "attempt": 2}, "token_in": 316, "token_out": 8, "cost_usd": 5.2199999999999995e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, 
"repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer limit 1", "ok": true, "latency_ms": 13653, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 8492, "summary": "ok", "notes": {"len_plan": 1444}, "token_in": 266, "token_out": 343, "cost_usd": 0.0002457, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 3127, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 826, "token_out": 19, "cost_usd": 0.00013529999999999998, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 914, "summary": "ok", "notes": {"old_sql_len": 27, "new_sql_len": 35, "attempt": 1}, "token_in": 313, "token_out": 8, "cost_usd": 5.175e-05, "sql_length": null, 
"row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1108, "summary": "ok", "notes": {"old_sql_len": 35, "new_sql_len": 35, "attempt": 2}, "token_in": 316, "token_out": 8, "cost_usd": 5.2199999999999995e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, 
"error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc LIMIT 10", "ok": true, "latency_ms": 12306, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 6684, "summary": "ok", "notes": {"len_plan": 1253}, "token_in": 276, "token_out": 287, "cost_usd": 0.0002136, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 3456, "summary": "ok", "notes": {"rationale_len": 85}, "token_in": 780, "token_out": 37, "cost_usd": 0.0001392, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 911, "summary": "ok", "notes": {"old_sql_len": 55, "new_sql_len": 64, 
"attempt": 1}, "token_in": 320, "token_out": 21, "cost_usd": 6.0599999999999996e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 64}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1239, "summary": "ok", "notes": {"old_sql_len": 64, "new_sql_len": 64, "attempt": 2}, "token_in": 323, "token_out": 21, "cost_usd": 6.104999999999999e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 2, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 64}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": 
null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 14824, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 9466, "summary": "ok", "notes": {"len_plan": 1418}, "token_in": 274, "token_out": 352, "cost_usd": 0.00025229999999999995, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 2949, "summary": "ok", "notes": {"rationale_len": 87}, "token_in": 843, "token_out": 46, "cost_usd": 0.00015404999999999998, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["aggregation_without_group_by", "exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 
1139, "summary": "ok", "notes": {"old_sql_len": 72, "new_sql_len": 80, "attempt": 1}, "token_in": 328, "token_out": 24, "cost_usd": 6.36e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["aggregation_without_group_by", "exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1250, "summary": "ok", "notes": {"old_sql_len": 80, "new_sql_len": 72, "attempt": 2}, "token_in": 332, "token_out": 21, "cost_usd": 6.24e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 2, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, 
"summary": "ok", "notes": {"issues": ["aggregation_without_group_by", "exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
 
benchmarks/results_pro/20251109-110149/latency_histogram.png DELETED
Binary file (17.4 kB)
 
benchmarks/results_pro/20251109-110149/latency_per_stage.png DELETED
Binary file (30.1 kB)
 
benchmarks/results_pro/20251109-110149/metrics_overview.png DELETED
Binary file (20.2 kB)
 
benchmarks/results_pro/20251109-110149/results.csv DELETED
@@ -1,6 +0,0 @@
- db_id,query,ok,em,sm,exec_acc,latency_ms
- concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,12419
- concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,13653
- concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
- concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,12306
- concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,14824
 
benchmarks/results_pro/20251109-110149/summary.json DELETED
@@ -1,21 +0,0 @@
- {
- "timestamp": "2025-11-09T11:02:43",
- "split": "dev",
- "config": "configs/sqlite_pipeline.yaml",
- "total": 5,
- "success": 5,
- "success_rate": 1.0,
- "avg_latency_ms": 10640.4,
- "p50_latency_ms": 12419.0,
- "p95_latency_ms": 14589.8,
- "EM": 0.4,
- "SM": 0.8,
- "ExecAcc": 0.8,
- "detector_avg_ms": 0.0,
- "planner_avg_ms": 8007.75,
- "generator_avg_ms": 3216.25,
- "safety_avg_ms": 2.0,
- "executor_avg_ms": 1.25,
- "verifier_avg_ms": 0.58,
- "repair_avg_ms": 1030.0
- }
 
benchmarks/results_pro/20251109-123149/eval.jsonl DELETED
@@ -1,5 +0,0 @@
- {"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 11573, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7978, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 3588, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 796, "token_out": 19, "cost_usd": 0.0001308, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 3, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9087, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7653, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 1432, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 789, "token_out": 19, "cost_usd": 0.00012974999999999998, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 10200, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 8373, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 1824, "summary": "ok", "notes": {"rationale_len": 85}, "token_in": 801, "token_out": 37, "cost_usd": 0.00014235, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": false, "latency_ms": 20765, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 9562, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 4303, "summary": "ok", "notes": {"rationale_len": 67}, "token_in": 827, "token_out": 42, "cost_usd": 0.00014924999999999997, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 5379, "summary": "ok", "notes": {"old_sql_len": 72, "new_sql_len": 80}, 
"token_in": 328, "token_out": 24, "cost_usd": 6.36e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1516, "summary": "ok", "notes": {"old_sql_len": 80, "new_sql_len": 72}, "token_in": 332, "token_out": 25, "cost_usd": 6.48e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
 
benchmarks/results_pro/20251109-123149/latency_histogram.png DELETED
Binary file (15.6 kB)
 
benchmarks/results_pro/20251109-123149/latency_per_stage.png DELETED
Binary file (30.4 kB)
 
benchmarks/results_pro/20251109-123149/metrics_overview.png DELETED
Binary file (22 kB)
 
benchmarks/results_pro/20251109-123149/results.csv DELETED
@@ -1,6 +0,0 @@
- db_id,query,ok,em,sm,exec_acc,latency_ms
- concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,11573
- concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,9087
- concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
- concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,10200
- concert_singer,"What is the average, minimum, and maximum age of all singers from France?",❌,0.0,1.0,1.0,20765
 
benchmarks/results_pro/20251109-123149/summary.json DELETED
@@ -1,21 +0,0 @@
- {
- "timestamp": "2025-11-09T12:32:41",
- "split": "dev",
- "config": "configs/sqlite_pipeline.yaml",
- "total": 5,
- "success": 4,
- "success_rate": 0.8,
- "avg_latency_ms": 10325.0,
- "p50_latency_ms": 10200.0,
- "p95_latency_ms": 18926.6,
- "EM": 0.4,
- "SM": 0.8,
- "ExecAcc": 0.8,
- "detector_avg_ms": 0.0,
- "planner_avg_ms": 8391.5,
- "generator_avg_ms": 2786.75,
- "safety_avg_ms": 1.5,
- "executor_avg_ms": 1.33,
- "verifier_avg_ms": 0.0,
- "repair_avg_ms": 3447.5
- }
 
benchmarks/results_pro/20251109-123424/eval.jsonl DELETED
@@ -1,5 +0,0 @@
- {"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 10712, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 8455, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 2253, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 792, "token_out": 19, "cost_usd": 0.0001302, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 12981, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 9963, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 3018, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 800, "token_out": 19, "cost_usd": 0.0001314, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 0, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 0, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 9753, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7528, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 2224, "summary": "ok", "notes": {"rationale_len": 85}, "token_in": 816, "token_out": 37, "cost_usd": 0.0001446, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": false, "latency_ms": 12406, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7105, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 2892, "summary": "ok", "notes": {"rationale_len": 67}, "token_in": 854, "token_out": 42, "cost_usd": 0.00015329999999999999, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1162, "summary": "ok", "notes": {"old_sql_len": 72, "new_sql_len": 80}, 
"token_in": 328, "token_out": 24, "cost_usd": 6.36e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1241, "summary": "ok", "notes": {"old_sql_len": 80, "new_sql_len": 72}, "token_in": 332, "token_out": 21, "cost_usd": 6.24e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 0, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
 
benchmarks/results_pro/20251109-123424/results.csv DELETED
@@ -1,6 +0,0 @@
- db_id,query,ok,em,sm,exec_acc,latency_ms
- concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,10712
- concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,12981
- concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
- concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,9753
- concert_singer,"What is the average, minimum, and maximum age of all singers from France?",❌,0.0,1.0,1.0,12406
 
benchmarks/results_pro/20251109-123424/summary.json DELETED
@@ -1,21 +0,0 @@
- {
- "timestamp": "2025-11-09T12:35:10",
- "split": "dev",
- "config": "configs/sqlite_pipeline.yaml",
- "total": 5,
- "success": 4,
- "success_rate": 0.8,
- "avg_latency_ms": 9170.4,
- "p50_latency_ms": 10712.0,
- "p95_latency_ms": 12866.0,
- "EM": 0.4,
- "SM": 0.8,
- "ExecAcc": 0.8,
- "detector_avg_ms": 0.0,
- "planner_avg_ms": 8262.75,
- "generator_avg_ms": 2596.75,
- "safety_avg_ms": 1.25,
- "executor_avg_ms": 0.67,
- "verifier_avg_ms": 0.0,
- "repair_avg_ms": 1201.5
- }
 
benchmarks/results_pro/20251109-124602/eval.jsonl DELETED
@@ -1,5 +0,0 @@
- {"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9852, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8318, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1528, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 12321, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9326, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2994, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 8611, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6746, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1863, "summary": "failed", "notes": {"rationale_len": 85}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 9742, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 5959, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1603, "summary": "failed", "notes": {"rationale_len": 67}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1131, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1038, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, 
"sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
 
benchmarks/results_pro/20251109-124602/latency_histogram.png DELETED
Binary file (16.1 kB)
 
benchmarks/results_pro/20251109-124602/latency_per_stage.png DELETED
Binary file (28.9 kB)
 
benchmarks/results_pro/20251109-124602/metrics_overview.png DELETED
Binary file (19 kB)
 
benchmarks/results_pro/20251109-124602/results.csv DELETED
@@ -1,6 +0,0 @@
- db_id,query,ok,em,sm,exec_acc,latency_ms
- concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,9852
- concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,12321
- concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
- concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,8611
- concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,9742
 
benchmarks/results_pro/20251109-124602/summary.json DELETED
@@ -1,21 +0,0 @@
- {
- "timestamp": "2025-11-09T12:46:43",
- "split": "dev",
- "config": "configs/sqlite_pipeline.yaml",
- "total": 5,
- "success": 5,
- "success_rate": 1.0,
- "avg_latency_ms": 8105.2,
- "p50_latency_ms": 9742.0,
- "p95_latency_ms": 11827.2,
- "EM": 0.4,
- "SM": 0.8,
- "ExecAcc": 0.8,
- "detector_avg_ms": 1.0,
- "planner_avg_ms": 7587.25,
- "generator_avg_ms": 1997.0,
- "safety_avg_ms": 1.33,
- "executor_avg_ms": 1.0,
- "verifier_avg_ms": 1.0,
- "repair_avg_ms": 1084.5
- }
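
The latency and accuracy fields in this deleted `summary.json` can be cross-checked against the `results.csv` rows of the same run (`20251109-124602`). The sketch below is not the repository's benchmark script: `percentile` and `summarize` are hypothetical helpers, and linear interpolation between ranks is an assumption that happens to reproduce the recorded `p95_latency_ms`. The `ok` column is left out because the arithmetic does not need it.

```python
import csv
import io

# Rows copied from results.csv of run 20251109-124602 ("ok" column omitted).
RESULTS_CSV = """\
db_id,query,em,sm,exec_acc,latency_ms
concert_singer,"How many singers do we have?",1.0,1.0,1.0,9852
concert_singer,"What is the total number of singers?",1.0,1.0,1.0,12321
concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",0.0,0.0,0.0,0
concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",0.0,1.0,1.0,8611
concert_singer,"What is the average, minimum, and maximum age of all singers from France?",0.0,1.0,1.0,9742
"""


def percentile(values, q):
    """Percentile with linear interpolation between ranks (assumed method)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def mean(values):
    return sum(values) / len(values)


def summarize(csv_text):
    """Recompute the aggregate fields of summary.json from per-query rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text.strip())))
    latencies = [float(r["latency_ms"]) for r in rows]
    return {
        "total": len(rows),
        "avg_latency_ms": round(mean(latencies), 2),
        "p50_latency_ms": round(percentile(latencies, 50), 2),
        "p95_latency_ms": round(percentile(latencies, 95), 2),
        "EM": round(mean([float(r["em"]) for r in rows]), 2),
        "SM": round(mean([float(r["sm"]) for r in rows]), 2),
        "ExecAcc": round(mean([float(r["exec_acc"]) for r in rows]), 2),
    }


summary = summarize(RESULTS_CSV)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Note that the query flagged as ambiguous contributes a latency of 0 and scores of 0.0, so it lowers both the average latency and the accuracy figures even though `success_rate` is reported as 1.0.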
 
benchmarks/results_pro/20251109-125209/eval.jsonl DELETED
@@ -1,5 +0,0 @@
- {"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9129, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6689, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2430, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 10009, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9046, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 960, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 9560, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8063, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1494, "summary": "failed", "notes": {"rationale_len": 85}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
- {"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 12661, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7869, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2410, "summary": "failed", "notes": {"rationale_len": 67}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1179, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1191, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, 
"sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
 
benchmarks/results_pro/20251109-125209/latency_histogram.png DELETED
Binary file (16.1 kB)
 
benchmarks/results_pro/20251109-125209/latency_per_stage.png DELETED
Binary file (29.8 kB)
 
benchmarks/results_pro/20251109-125209/metrics_overview.png DELETED
Binary file (19.1 kB)
 
benchmarks/results_pro/20251109-125209/results.csv DELETED
@@ -1,6 +0,0 @@
- db_id,query,ok,em,sm,exec_acc,latency_ms
- concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,9129
- concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,10009
- concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
- concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,9560
- concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,12661
 
benchmarks/results_pro/20251109-125209/summary.json DELETED
@@ -1,21 +0,0 @@
- {
- "timestamp": "2025-11-09T12:52:50",
- "split": "dev",
- "config": "configs/sqlite_pipeline.yaml",
- "total": 5,
- "success": 5,
- "success_rate": 1.0,
- "avg_latency_ms": 8271.8,
- "p50_latency_ms": 9560.0,
- "p95_latency_ms": 12130.6,
- "EM": 0.4,
- "SM": 0.8,
- "ExecAcc": 0.8,
- "detector_avg_ms": 1.0,
- "planner_avg_ms": 7916.75,
- "generator_avg_ms": 1823.5,
- "safety_avg_ms": 1.5,
- "executor_avg_ms": 1.5,
- "verifier_avg_ms": 1.0,
- "repair_avg_ms": 1185.0
- }
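
The `*_avg_ms` fields of this `summary.json` are consistent with averaging `duration_ms` over every trace entry of a stage in the run's `eval.jsonl` (`20251109-125209`), including the repeated `safety`, `executor`, and `verifier` entries from repair loops. The sketch below illustrates that reading; `to_records` and `stage_averages` are hypothetical names, and whether the original script aggregates this way is an assumption.

```python
from collections import defaultdict

# (stage, duration_ms) pairs copied from the five trace arrays in eval.jsonl
# of run 20251109-125209; all other trace fields are omitted.
TRACES = [
    [("detector", 1), ("planner", 6689), ("generator", 2430), ("safety", 2),
     ("executor", 2), ("verifier", 1), ("pipeline", 1)],
    [("detector", 1), ("planner", 9046), ("generator", 960), ("safety", 1),
     ("executor", 2), ("verifier", 1), ("pipeline", 1)],
    [("detector", 1)],  # flagged ambiguous, so the pipeline stops here
    [("detector", 1), ("planner", 8063), ("generator", 1494), ("safety", 1),
     ("executor", 1), ("verifier", 1), ("pipeline", 1)],
    [("detector", 1), ("planner", 7869), ("generator", 2410), ("safety", 3),
     ("executor", 2), ("verifier", 1), ("repair", 1179), ("safety", 1),
     ("executor", 1), ("verifier", 1), ("repair", 1191), ("safety", 1),
     ("executor", 1), ("verifier", 1), ("pipeline", 1), ("pipeline", 1)],
]


def to_records(traces):
    """Rebuild the minimal eval.jsonl record shape used by stage_averages."""
    return [
        {"trace": [{"stage": stage, "duration_ms": ms} for stage, ms in trace]}
        for trace in traces
    ]


def stage_averages(records):
    """Mean duration per stage, counting every trace entry separately."""
    durations = defaultdict(list)
    for record in records:
        for step in record["trace"]:
            durations[step["stage"]].append(step["duration_ms"])
    return {
        stage: round(sum(values) / len(values), 2)
        for stage, values in durations.items()
    }


averages = stage_averages(to_records(TRACES))
for stage, avg_ms in averages.items():
    print(f"{stage}_avg_ms: {avg_ms}")
```

Under this reading the planner dominates end-to-end latency, while the `pipeline` bookkeeping entries are averaged here too even though `summary.json` does not report them.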
 
benchmarks/results_pro/20251109-171247/eval.jsonl ADDED
@@ -0,0 +1,20 @@
+ {"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9497, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7643, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1846, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 12602, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 10933, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1667, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 9097, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7143, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1951, "summary": "failed", "notes": {"rationale_len": 85}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 11809, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7605, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1247, "summary": "failed", "notes": {"rationale_len": 67}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1106, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1839, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", 
"duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age for all French singers?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 9349, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 5609, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1319, "summary": "failed", "notes": {"rationale_len": 84}}, {"stage": "safety", "duration_ms": 4, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1321, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1087, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", 
"duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "Show the name and the release year of the song by the youngest singer.", "gold_sql": "SELECT song_name , song_release_year FROM singer ORDER BY age LIMIT 1", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What are the names and release years for all the songs of the youngest singer?", "gold_sql": "SELECT song_name , song_release_year FROM singer ORDER BY age LIMIT 1", "pred_sql": "select Song_Name, Song_release_year from singer where Age = (select min(Age) from singer) LIMIT 1", "ok": true, "latency_ms": 13807, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9433, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1937, "summary": "failed", "notes": {"rationale_len": 105}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 89}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 89, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1154, "summary": "failed", "notes": {"old_sql_len": 89, "new_sql_len": 97}}, {"stage": "safety", "duration_ms": 4, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 97}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 97, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1271, "summary": "failed", "notes": {"old_sql_len": 97, "new_sql_len": 97}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": 
{}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 97}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 97, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What are all distinct countries where singers above age 20 are from?", "gold_sql": "SELECT DISTINCT country FROM singer WHERE age > 20", "pred_sql": "select distinct Country from singer where Age > 20", "ok": true, "latency_ms": 8790, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7544, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1244, "summary": "failed", "notes": {"rationale_len": 92}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 50}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 50, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": true, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What are the different countries with singers above age 20?", "gold_sql": "SELECT DISTINCT country FROM singer WHERE age > 20", "pred_sql": "select distinct Country from singer where Age > 20", "ok": true, "latency_ms": 7389, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6363, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1025, "summary": "failed", "notes": {"rationale_len": 81}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 50}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 50, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": true, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "Show all countries and the number of singers in each country.", "gold_sql": "SELECT country , count(*) FROM singer GROUP BY country", "pred_sql": "select Country, count(*) from singer group by Country", "ok": true, "latency_ms": 10349, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8715, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1630, "summary": "failed", "notes": {"rationale_len": 79}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 2, "sql_length": 53}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 53, "has_select": true, "has_from": true, "has_over": false, "has_group_by": true, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "How many singers are from each country?", "gold_sql": "SELECT country , count(*) FROM singer GROUP BY country", "pred_sql": "select country, count(*) from singer group by country", "ok": true, "latency_ms": 9137, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8036, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1099, "summary": "failed", "notes": {"rationale_len": 46}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 2, "sql_length": 53}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 53, "has_select": true, "has_from": true, "has_over": false, "has_group_by": true, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "List all song names by singers above the average age.", "gold_sql": "SELECT song_name FROM singer WHERE age > (SELECT avg(age) FROM singer)", "pred_sql": "select Song_Name from singer where Age > (select avg(Age) from singer)", "ok": true, "latency_ms": 10019, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8685, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1331, "summary": "failed", "notes": {"rationale_len": 87}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 70}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 70, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What are all the song names by singers who are older than average?", "gold_sql": "SELECT song_name FROM singer WHERE age > (SELECT avg(age) FROM singer)", "pred_sql": "select Song_Name from singer where Age > (select avg(Age) from singer)", "ok": true, "latency_ms": 15309, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 13022, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2284, "summary": "failed", "notes": {"rationale_len": 81}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 70}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 70, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "Show location and name for all stadiums with a capacity between 5000 and 10000.", "gold_sql": "SELECT LOCATION , name FROM stadium WHERE capacity BETWEEN 5000 AND 10000", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What are the locations and names of all stations with capacity between 5000 and 10000?", "gold_sql": "SELECT LOCATION , name FROM stadium WHERE capacity BETWEEN 5000 AND 10000", "pred_sql": "select Location, Name from stadium where Capacity >= 5000 and Capacity <= 10000", "ok": true, "latency_ms": 10709, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6901, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 3805, "summary": "failed", "notes": {"rationale_len": 82}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 0, "col_count": 2, "sql_length": 79}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 79, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What is the maximum capacity and the average of all stadiums ?", "gold_sql": "select max(capacity), average from stadium", "pred_sql": "SELECT max(Capacity), avg(Capacity) FROM stadium", "ok": true, "latency_ms": 11707, "em": 0.0, "sm": 0.0, "exec_acc": 0.5, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8334, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1027, "summary": "failed", "notes": {"rationale_len": 51}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1231, "summary": "failed", "notes": {"old_sql_len": 48, "new_sql_len": 56}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 56}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 56, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1105, "summary": "failed", "notes": {"old_sql_len": 56, "new_sql_len": 48}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, 
"col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What is the average and maximum capacities for all stadiums ?", "gold_sql": "select avg(capacity) , max(capacity) from stadium", "pred_sql": "select avg(Capacity), max(Capacity) from stadium", "ok": true, "latency_ms": 12690, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9463, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1258, "summary": "failed", "notes": {"rationale_len": 68}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 842, "summary": "failed", "notes": {"old_sql_len": 48, "new_sql_len": 48}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1119, "summary": "failed", "notes": {"old_sql_len": 48, "new_sql_len": 48}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What is the name and capacity for the stadium with highest average attendance?", "gold_sql": "SELECT name , capacity FROM stadium ORDER BY average DESC LIMIT 1", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
+ {"source": "spider", "db_id": "concert_singer", "query": "What is the name and capacity for the stadium with the highest average attendance?", "gold_sql": "SELECT name , capacity FROM stadium ORDER BY average DESC LIMIT 1", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
benchmarks/results_pro/20251109-171247/latency_histogram.png ADDED
benchmarks/results_pro/20251109-171247/latency_per_stage.png ADDED
benchmarks/results_pro/20251109-171247/metrics_overview.png ADDED
benchmarks/results_pro/20251109-171247/results.csv ADDED
@@ -0,0 +1,21 @@
+ db_id,query,ok,em,sm,exec_acc,latency_ms
+ concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,9497
+ concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,12602
+ concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
+ concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,9097
+ concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,11809
+ concert_singer,"What is the average, minimum, and maximum age for all French singers?",✅,0.0,1.0,1.0,9349
+ concert_singer,"Show the name and the release year of the song by the youngest singer.",✅,0.0,0.0,0.0,0
+ concert_singer,"What are the names and release years for all the songs of the youngest singer?",✅,0.0,1.0,1.0,13807
+ concert_singer,"What are all distinct countries where singers above age 20 are from?",✅,0.0,1.0,1.0,8790
+ concert_singer,"What are the different countries with singers above age 20?",✅,0.0,1.0,1.0,7389
+ concert_singer,"Show all countries and the number of singers in each country.",✅,0.0,1.0,1.0,10349
+ concert_singer,"How many singers are from each country?",✅,1.0,1.0,1.0,9137
+ concert_singer,"List all song names by singers above the average age.",✅,0.0,1.0,1.0,10019
+ concert_singer,"What are all the song names by singers who are older than average?",✅,0.0,1.0,1.0,15309
+ concert_singer,"Show location and name for all stadiums with a capacity between 5000 and 10000.",✅,0.0,0.0,0.0,0
+ concert_singer,"What are the locations and names of all stations with capacity between 5000 and 10000?",✅,0.0,1.0,1.0,10709
+ concert_singer,"What is the maximum capacity and the average of all stadiums ?",✅,0.0,0.0,0.5,11707
+ concert_singer,"What is the average and maximum capacities for all stadiums ?",✅,0.0,1.0,1.0,12690
+ concert_singer,"What is the name and capacity for the stadium with highest average attendance?",✅,0.0,0.0,0.0,0
+ concert_singer,"What is the name and capacity for the stadium with the highest average attendance?",✅,0.0,0.0,0.0,0
benchmarks/results_pro/20251109-171247/summary.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "timestamp": "2025-11-09T17:15:29",
+   "split": "dev",
+   "config": "configs/sqlite_pipeline.yaml",
+   "total": 20,
+   "success": 20,
+   "success_rate": 1.0,
+   "avg_latency_ms": 8113.0,
+   "p50_latency_ms": 9423.0,
+   "p95_latency_ms": 13882.1,
+   "EM": 0.15,
+   "SM": 0.7,
+   "ExecAcc": 0.725,
+   "detector_avg_ms": 1.0,
+   "planner_avg_ms": 8361.93,
+   "generator_avg_ms": 1644.67,
+   "safety_avg_ms": 1.92,
+   "executor_avg_ms": 1.04,
+   "verifier_avg_ms": 1.0,
+   "repair_avg_ms": 1207.5
+ }
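The headline numbers in this summary.json can be recomputed from the 20 rows of results.csv in the same run directory. The sketch below is not the repository's benchmark script: the `ROWS` literal, `percentile`, and `summarize` are hypothetical names introduced here. It assumes plain means over all 20 rows, including the five with `latency_ms` of 0, and linear-interpolation percentiles.

```python
import math

# (em, sm, exec_acc, latency_ms) copied from the 20 data rows of results.csv.
ROWS = [
    (1.0, 1.0, 1.0, 9497), (1.0, 1.0, 1.0, 12602), (0.0, 0.0, 0.0, 0),
    (0.0, 1.0, 1.0, 9097), (0.0, 1.0, 1.0, 11809), (0.0, 1.0, 1.0, 9349),
    (0.0, 0.0, 0.0, 0), (0.0, 1.0, 1.0, 13807), (0.0, 1.0, 1.0, 8790),
    (0.0, 1.0, 1.0, 7389), (0.0, 1.0, 1.0, 10349), (1.0, 1.0, 1.0, 9137),
    (0.0, 1.0, 1.0, 10019), (0.0, 1.0, 1.0, 15309), (0.0, 0.0, 0.0, 0),
    (0.0, 1.0, 1.0, 10709), (0.0, 0.0, 0.5, 11707), (0.0, 1.0, 1.0, 12690),
    (0.0, 0.0, 0.0, 0), (0.0, 0.0, 0.0, 0),
]


def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(rows):
    """Aggregate per-query metrics the way the summary appears to (assumption)."""
    n = len(rows)
    latencies = [r[3] for r in rows]
    return {
        "total": n,
        "EM": round(sum(r[0] for r in rows) / n, 3),
        "SM": round(sum(r[1] for r in rows) / n, 3),
        "ExecAcc": round(sum(r[2] for r in rows) / n, 3),
        "avg_latency_ms": round(sum(latencies) / n, 1),
        "p50_latency_ms": round(percentile(latencies, 50), 1),
        "p95_latency_ms": round(percentile(latencies, 95), 1),
    }


summary = summarize(ROWS)
print(summary)
```

Under those assumptions the recomputed values match the reported EM, SM, ExecAcc, and latency fields. The zero-latency rows therefore pull `avg_latency_ms` and `p50_latency_ms` down, and they count as misses in all three accuracy metrics.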