Commit 8f50117 by Melika Kheirieh · Parent: 7b9903c

docs(readme): revamp and polish README for production showcase

- README.md +126 -134
- benchmarks/results_pro/20251109-110149/eval.jsonl +0 -5
- benchmarks/results_pro/20251109-110149/latency_histogram.png +0 -0
- benchmarks/results_pro/20251109-110149/latency_per_stage.png +0 -0
- benchmarks/results_pro/20251109-110149/metrics_overview.png +0 -0
- benchmarks/results_pro/20251109-110149/results.csv +0 -6
- benchmarks/results_pro/20251109-110149/summary.json +0 -21
- benchmarks/results_pro/20251109-123149/eval.jsonl +0 -5
- benchmarks/results_pro/20251109-123149/latency_histogram.png +0 -0
- benchmarks/results_pro/20251109-123149/latency_per_stage.png +0 -0
- benchmarks/results_pro/20251109-123149/metrics_overview.png +0 -0
- benchmarks/results_pro/20251109-123149/results.csv +0 -6
- benchmarks/results_pro/20251109-123149/summary.json +0 -21
- benchmarks/results_pro/20251109-123424/eval.jsonl +0 -5
- benchmarks/results_pro/20251109-123424/results.csv +0 -6
- benchmarks/results_pro/20251109-123424/summary.json +0 -21
- benchmarks/results_pro/20251109-124602/eval.jsonl +0 -5
- benchmarks/results_pro/20251109-124602/latency_histogram.png +0 -0
- benchmarks/results_pro/20251109-124602/latency_per_stage.png +0 -0
- benchmarks/results_pro/20251109-124602/metrics_overview.png +0 -0
- benchmarks/results_pro/20251109-124602/results.csv +0 -6
- benchmarks/results_pro/20251109-124602/summary.json +0 -21
- benchmarks/results_pro/20251109-125209/eval.jsonl +0 -5
- benchmarks/results_pro/20251109-125209/latency_histogram.png +0 -0
- benchmarks/results_pro/20251109-125209/latency_per_stage.png +0 -0
- benchmarks/results_pro/20251109-125209/metrics_overview.png +0 -0
- benchmarks/results_pro/20251109-125209/results.csv +0 -6
- benchmarks/results_pro/20251109-125209/summary.json +0 -21
- benchmarks/results_pro/20251109-171247/eval.jsonl +20 -0
- benchmarks/results_pro/20251109-171247/latency_histogram.png +0 -0
- benchmarks/results_pro/20251109-171247/latency_per_stage.png +0 -0
- benchmarks/results_pro/20251109-171247/metrics_overview.png +0 -0
- benchmarks/results_pro/20251109-171247/results.csv +21 -0
- benchmarks/results_pro/20251109-171247/summary.json +21 -0
README.md
CHANGED
@@ -1,198 +1,190 @@

# 🧩 NL2SQL Copilot

> 💡 **Why it matters**
> In real analytics teams, analysts often need quick insights without writing SQL.
> **NL2SQL Copilot** bridges that gap by translating plain-English questions into validated, read-only SQL, reducing query errors and saving hours of analyst time.
>
> 🧬 **Evolution Note**
> This repository is the next-generation version of [NL2SQL Copilot Prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype).
> It refactors the original prototype into a **production-grade, modular architecture**, adding configuration-driven pipelines, safety layers, benchmarks, and a Streamlit UI for evaluation.

---

## 🚀

### 1️⃣ Clone the repo

```bash
git clone https://github.com/melika-kheirieh/nl2sql-copilot.git
cd nl2sql-copilot
```

### 2️⃣ Build and run with Docker

```bash
docker run --rm -p 8000:8000 nl2sql-copilot
```

Or launch the [Streamlit Demo](http://localhost:7860) to test it interactively.

---

## 🧠

```sql
SELECT albums.Title, SUM(invoice_items.UnitPrice * invoice_items.Quantity) AS total_sales
FROM albums
JOIN tracks ON albums.AlbumId = tracks.AlbumId
JOIN invoice_items ON invoice_items.TrackId = tracks.TrackId
GROUP BY albums.Title
ORDER BY total_sales DESC
LIMIT 5;
```

| Title | total_sales |
| --- | --- |
| Greatest Hits | 155.34 |
| Let There Be Rock | 133.09 |
| Big Ones | 128.44 |

```json
[
  {"stage": "planner", "duration_ms": 38, "summary": "Identified SQL intent"},
  {"stage": "generator", "duration_ms": 201, "summary": "LLM generated SQL"},
  {"stage": "safety", "duration_ms": 6, "summary": "Validated SELECT-only"},
  {"stage": "executor", "duration_ms": 92, "summary": "Executed successfully"}
]
```
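
Because each stage reports `duration_ms`, a trace in this shape is enough to see where a query spends its time. Below is a minimal Python sketch that assumes only the `stage` and `duration_ms` fields shown above; `latency_breakdown` is a hypothetical helper, not part of the project.

```python
import json

# The example trace above, copied verbatim as a JSON string.
TRACE_JSON = """[
  {"stage": "planner", "duration_ms": 38, "summary": "Identified SQL intent"},
  {"stage": "generator", "duration_ms": 201, "summary": "LLM generated SQL"},
  {"stage": "safety", "duration_ms": 6, "summary": "Validated SELECT-only"},
  {"stage": "executor", "duration_ms": 92, "summary": "Executed successfully"}
]"""


def latency_breakdown(trace: list[dict]) -> tuple[int, dict[str, int]]:
    """Return (total_ms, {stage: ms}) for one pipeline trace.

    Durations are summed per stage name, because a repair loop can run the
    same stage (e.g. safety or executor) more than once for a single query.
    """
    per_stage: dict[str, int] = {}
    for step in trace:
        per_stage[step["stage"]] = per_stage.get(step["stage"], 0) + step["duration_ms"]
    return sum(per_stage.values()), per_stage


if __name__ == "__main__":
    total, per_stage = latency_breakdown(json.loads(TRACE_JSON))
    # Print stages from slowest to fastest with their share of the total.
    for stage, ms in sorted(per_stage.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{stage:<10} {ms:>5} ms  {100 * ms / total:5.1f}%")
    print(f"{'total':<10} {total:>5} ms")
```

For the trace above it reports 337 ms in total, with the generator accounting for roughly 60% of it. Summing per stage name rather than per entry matters once repairs re-run Safety and Executor.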

```
nl2sql-copilot/
│
├── app/          # FastAPI app, routers, schemas
├── nl2sql/       # Core pipeline (Planner → Generator → Safety → Executor → Verifier)
├── adapters/     # Database and LLM adapters
├── benchmarks/   # Evaluation scripts and results
├── ui/           # Streamlit dashboard (demo + benchmark)
├── configs/      # Pipeline configs (YAML-based)
│
├── Dockerfile
├── requirements.in / .txt
└── README.md
```

---

## ⚙️

```
Natural Language
↓
[ Planner ] → [ Generator (LLM) ] → [ Safety ] → [ Executor ] → [ Verifier ] → [ Repair ]
```

All queries execute inside a **read-only sandbox**.
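
The README does not say how the sandbox is built. For SQLite, one common approach is to open the database through a read-only URI filename, so the engine itself rejects writes even if a statement slips past the safety checks. A minimal sketch, assuming a local SQLite file; `open_read_only` and `try_execute` are hypothetical helper names.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only via a URI filename (mode=ro).

    SQLite then rejects every write at the engine level, and refuses to
    create the file if it does not already exist.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def try_execute(conn: sqlite3.Connection, sql: str):
    """Return (rows, None) on success or (None, error_message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # Build a tiny demo database with an ordinary (writable) connection.
        setup = sqlite3.connect(path)
        setup.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
        setup.execute("INSERT INTO singer VALUES ('A', 30), ('B', 41)")
        setup.commit()
        setup.close()

        conn = open_read_only(path)
        print(try_execute(conn, "SELECT count(*) FROM singer"))  # ([(2,)], None)
        print(try_execute(conn, "DELETE FROM singer"))  # fails with a read-only error
        conn.close()
```

This is defense in depth, not a replacement for SQL validation: a read-only connection still allows arbitrarily expensive `SELECT`s. On PostgreSQL the usual equivalent is a database role that has only `SELECT` privileges.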

* **DML not allowed**: `DELETE FROM users`
* **Multi-statement**: `SELECT *; DROP TABLE users`
* **Forbidden keywords**: `ALTER`, `TRUNCATE`, `UPDATE`

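Rules like these can be approximated with a keyword-level guard. The sketch below is not the project's safety stage: the `FORBIDDEN` set, the reason strings, and `check_select_only` are assumptions, and it inspects tokens rather than a parsed AST. String literals, quoted identifiers, and comments are blanked out first so that a keyword or semicolon inside them is not mistaken for a statement.

```python
import re

# Hypothetical policy mirroring the rules above; not the project's actual rule set.
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE",
             "CREATE", "ATTACH", "DETACH", "PRAGMA", "GRANT", "REVOKE"}

# String literals, quoted identifiers and comments, matched in one left-to-right pass.
_NOISE = re.compile(
    r"'(?:[^']|'')*'"      # 'string literal'
    r'|"(?:[^"]|"")*"'     # "quoted identifier"
    r"|--[^\n]*"           # -- line comment
    r"|/\*.*?\*/",         # /* block comment */
    re.S,
)


def check_select_only(sql: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a single read-only SELECT/WITH statement."""
    body = _NOISE.sub(" ", sql).strip().rstrip(";").strip()  # tolerate trailing ';'
    if not body:
        return False, "empty query"
    if ";" in body:
        return False, "multi-statement"
    first = re.match(r"[A-Za-z]+", body)
    if first is None or first.group(0).upper() not in {"SELECT", "WITH"}:
        return False, "not a SELECT statement"
    blocked = sorted({w.upper() for w in re.findall(r"[A-Za-z_]+", body)} & FORBIDDEN)
    if blocked:
        return False, "forbidden keyword: " + ", ".join(blocked)
    return True, "ok"


if __name__ == "__main__":
    for query in ("SELECT name FROM singer;",
                  "DELETE FROM users",
                  "SELECT *; DROP TABLE users",
                  "WITH t AS (SELECT 1) UPDATE users SET name = 'x'"):
        print(f"{query!r:55} -> {check_select_only(query)}")
```

Token checks like this are easy to get subtly wrong, which is why the feature list also mentions AST validation; parsing the statement with a real SQL parser and inspecting the statement type is the more robust option.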
---

##

| top 3 customers by spending | aggregation | ✅ | 141 | gpt-4o-mini |
| albums released before 2000 | filter | ✅ | 122 | gpt-4o-mini |
| top 5 sales by genre | join | ✅ | 149 | gpt-4o-mini |

---

##

* 🛡️ **SQL safety filters** (SELECT-only, blacklist, AST validation)
* 🔁 **Self-repair loop** for failed executions
* 🧠 **LLM-driven generator** (OpenAI / Ollama / Anthropic)
* 📊 **Evaluation toolkit** for latency / accuracy / cost
* ⚙️ **Config-driven architecture** (`Pipeline.from_config("configs/pipeline.yaml")`)
* 🧰 **PostgreSQL + SQLite adapters**
* 🎛️ **Streamlit UI** for interactive demo & benchmark
* 🧩 Built with **FastAPI**, **LangGraph**, **Pydantic-AI**, **SQLAlchemy**

| UI | Streamlit + Plotly |
| CI/CD | GitHub Actions, Makefile, Docker |

---

##

---

##

* [ ] Improve self-repair accuracy
* [ ] Add cost tracking per query
* [ ] Integrate Prometheus metrics

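The per-stage traces in the benchmark output already record `token_in`, `token_out`, and `cost_usd` (see the `eval.jsonl` further down this commit), so per-query cost tracking can start as an aggregation over one trace. A minimal sketch: `cost_per_query` is hypothetical, and `TRACE` is an abridged copy of the first `eval.jsonl` record, keeping only the fields used here.

```python
from collections import defaultdict

# Abridged from the first eval.jsonl record: only LLM-backed stages report tokens/cost.
TRACE = [
    {"stage": "detector", "duration_ms": 0},
    {"stage": "planner", "token_in": 265, "token_out": 305, "cost_usd": 0.00022274999999999997},
    {"stage": "generator", "token_in": 787, "token_out": 19, "cost_usd": 0.00012945},
    {"stage": "safety", "token_in": None, "token_out": None, "cost_usd": None},
    {"stage": "repair", "token_in": 313, "token_out": 8, "cost_usd": 5.175e-05},
    {"stage": "repair", "token_in": 316, "token_out": 8, "cost_usd": 5.2199999999999995e-05},
]


def cost_per_query(trace: list[dict]) -> tuple[float, dict[str, dict[str, float]]]:
    """Sum tokens and USD cost per stage; missing or None fields are skipped."""
    per_stage: dict = defaultdict(lambda: {"token_in": 0, "token_out": 0, "cost_usd": 0.0})
    for step in trace:
        for key in ("token_in", "token_out", "cost_usd"):
            value = step.get(key)
            if value is not None:
                per_stage[step["stage"]][key] += value
    total_usd = sum(stats["cost_usd"] for stats in per_stage.values())
    return total_usd, dict(per_stage)


if __name__ == "__main__":
    total, per_stage = cost_per_query(TRACE)
    print(f"total: ${total:.6f}")
    for stage, stats in per_stage.items():
        print(stage, stats)
```

For that record the total comes to about $0.00046, with the planner as the most expensive stage.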
---

## 📄 License

MIT © 2025

# 🧩 NL2SQL Copilot

[CI](https://github.com/melika-kheirieh/nl2sql-copilot/actions/workflows/ci.yml) · [License](LICENSE)

A production-grade **Text-to-SQL Copilot** that converts natural-language questions into **safe, verified SQL**.
Built for analytics engineers who need accuracy, transparency, and control, powered by **FastAPI**, **LangGraph**, and **Pydantic-AI**.

---

## 🔍 Overview

`NL2SQL Copilot` is an **agentic, modular pipeline** that plans, generates, verifies, and repairs SQL queries.
It enforces safety through structured stages, measures correctness on the **Spider** dataset, and provides full observability support.

> 💡 Designed for **read-only production databases** with **self-repair**, **metrics**, and **CI/CD** baked in.

---

## 🧠 Agentic Architecture

```
Natural Language
↓
[ Detector ]
↓
[ Planner ]
↓
[ Generator (LLM) ]
↓
[ Safety ]
↓
[ Executor ]
↓
[ Verifier ]
↓
[ Repair ]
```

Each stage is isolated, configurable via YAML, and observable through structured traces and Prometheus metrics.

| Stage | Responsibility |
|--------|----------------|
| **Detector** | Identify whether a question is a clear Text-to-SQL request (flags ambiguous ones) |
| **Planner** | Extract user intent and SQL plan |
| **Generator** | Call LLM to synthesize SQL |
| **Safety** | Block unsafe or non-SELECT queries |
| **Executor** | Execute query in read-only sandbox |
| **Verifier** | Compare results, detect mismatch |
| **Repair** | Self-healing loop triggered on failure |

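The traces in this commit's `eval.jsonl` show how these stages chain at run time: an ambiguous question stops at the Detector, and after each Repair attempt (up to two per query) Safety, Executor, and Verifier run again. The sketch below illustrates that control flow in plain Python. It is hypothetical and not the project's LangGraph implementation: every stage is an injected callable, `max_repairs=2` mirrors the traces, and each call is appended to a trace using the same `stage`/`duration_ms` fields.

```python
import sqlite3
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class RunResult:
    sql: str = ""
    rows: Optional[list] = None
    verified: bool = False
    trace: List[dict] = field(default_factory=list)


@dataclass
class Pipeline:
    """Hypothetical wiring of the stages above; each stage is a plain callable."""
    detect: Callable[[str], bool]             # True if the question is clear enough
    plan: Callable[[str], str]                # question -> plan text
    generate: Callable[[str, str], str]       # (question, plan) -> SQL
    is_safe: Callable[[str], bool]            # SELECT-only guard
    execute: Callable[[str], list]            # SQL -> rows (read-only connection)
    verify: Callable[[str, list], List[str]]  # (SQL, rows) -> issues, [] means OK
    repair: Callable[[str, List[str]], str]   # (SQL, issues) -> new SQL
    max_repairs: int = 2

    def _step(self, res: RunResult, stage: str, fn: Callable[..., Any], *args: Any) -> Any:
        """Call one stage and record its duration, even if it raises."""
        start = time.perf_counter()
        try:
            return fn(*args)
        finally:
            elapsed_ms = round((time.perf_counter() - start) * 1000)
            res.trace.append({"stage": stage, "duration_ms": elapsed_ms})

    def run(self, question: str) -> RunResult:
        res = RunResult()
        if not self._step(res, "detector", self.detect, question):
            return res  # ambiguous: stop instead of guessing
        plan = self._step(res, "planner", self.plan, question)
        res.sql = self._step(res, "generator", self.generate, question, plan)
        for attempt in range(self.max_repairs + 1):
            if not self._step(res, "safety", self.is_safe, res.sql):
                return res  # never execute SQL that fails the guard
            try:
                res.rows = self._step(res, "executor", self.execute, res.sql)
                issues = self._step(res, "verifier", self.verify, res.sql, res.rows)
            except Exception as exc:  # execution errors feed the repair loop
                res.rows, issues = None, [f"exec_error:{exc}"]
            if not issues:
                res.verified = True
                return res
            if attempt < self.max_repairs:
                res.sql = self._step(res, "repair", self.repair, res.sql, issues)
        return res


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE singer (name TEXT)")
    db.executemany("INSERT INTO singer VALUES (?)", [("A",), ("B",)])
    pipe = Pipeline(
        detect=lambda q: q.strip().endswith("?"),
        plan=lambda q: "count rows in singer",
        generate=lambda q, p: "SELECT count(*) FROM singers",  # deliberate mistake
        is_safe=lambda sql: sql.lstrip().upper().startswith("SELECT"),
        execute=lambda sql: db.execute(sql).fetchall(),
        verify=lambda sql, rows: [] if rows else ["empty_result"],
        repair=lambda sql, issues: sql.replace("singers", "singer"),
    )
    result = pipe.run("How many singers do we have?")
    print(result.sql, result.rows, result.verified)
    print(" -> ".join(step["stage"] for step in result.trace))
```

In the demo the generator references a non-existent table, the execution error is fed to Repair, and the second pass verifies; the trace reads `detector -> planner -> generator -> safety -> executor -> repair -> safety -> executor -> verifier`.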
---

## 📊 Benchmark (Spider dataset)

Dataset: [Spider](https://yale-lily.github.io/spider) by Yale LILY Lab.
Evaluated on a **20-sample subset of the Spider dev split** using the reproducible evaluation toolkit.

| Metric | Value |
|--------|--------|
| EM (Exact Match) | 0.15 |
| SM (Structural Match) | 0.70 |
| ExecAcc (Execution Accuracy) | 0.73 |
| Avg Latency | 8.11 s |
| p50 Latency | 9.42 s |
| p95 Latency | 13.88 s |

> On this small sample, high **Structural Match** and **Execution Accuracy** suggest the generated SQL is usually semantically correct;
> the lower EM mostly reflects surface differences (e.g., identifier casing or an added `LIMIT`) that exact matching penalizes even when the query returns the same rows.
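
Execution accuracy can sit well above EM because it compares what the queries return rather than how they are written. A simplified sketch of that check, assuming a SQLite connection to the benchmark database; `execution_match` is hypothetical, and the official Spider evaluation script handles more edge cases than this. The gold/predicted pair is taken from the `eval.jsonl` further down this commit, where it scored EM 0.0 and ExecAcc 1.0; the table rows are toy data.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, gold_sql: str, pred_sql: str,
                    ordered: bool = False) -> bool:
    """True if both queries run and return the same rows.

    Rows are compared as a multiset, or as an ordered list when `ordered` is
    True (e.g. when the gold query has an ORDER BY clause).
    """
    try:
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run cannot match
    return gold == pred if ordered else Counter(gold) == Counter(pred)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (Name TEXT, Country TEXT, Age INTEGER)")
    # Toy rows for illustration only, not the Spider data.
    conn.executemany("INSERT INTO singer VALUES (?, ?, ?)",
                     [("A", "France", 43), ("B", "Netherlands", 52)])
    gold = "SELECT name , country , age FROM singer ORDER BY age DESC"
    pred = "select Name, Country, Age from singer order by Age desc LIMIT 10"
    print(gold == pred)                                     # False
    print(execution_match(conn, gold, pred, ordered=True))  # True
```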

Run reproducible benchmarks:

```bash
export SPIDER_ROOT="$PWD/data/spider"
PYTHONPATH=$PWD python benchmarks/evaluate_spider_pro.py --spider --split dev --limit 20
PYTHONPATH=$PWD python benchmarks/plot_results.py
```

Results & plots → `benchmarks/results_pro/20251109-171247/`

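The latency rows in the benchmark table are order statistics over per-query latencies. A standard-library sketch; `latency_summary` and the `p95_latency_ms` key are hypothetical, and linear interpolation between samples is an assumption, since the toolkit's percentile definition is not documented here.

```python
import statistics


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Mean, median and 95th percentile of per-query latencies, in ms.

    Percentiles use linear interpolation between sorted samples
    (statistics.quantiles with method="inclusive").
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    cuts = statistics.quantiles(latencies_ms, n=20, method="inclusive")  # p5, p10, ..., p95
    return {
        "avg_latency_ms": statistics.fmean(latencies_ms),
        "p50_latency_ms": float(statistics.median(latencies_ms)),
        "p95_latency_ms": cuts[-1],
    }


if __name__ == "__main__":
    # latency_ms column of the deleted results.csv (run 20251109-110149).
    print(latency_summary([12419, 13653, 0, 12306, 14824]))
```

Fed the five latencies from the deleted `results.csv` below, it returns an average of 10640.4 ms and a p50 of 12419.0 ms, matching that run's `summary.json`. The single 0 ms sample, a question the detector flagged as ambiguous, is what pulls the mean below the median there.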
---

## ⚙️ Key Features

✅ **Agentic architecture**: multi-stage pipeline with feedback loop

🛡️ **Safety layer**: SELECT-only guardrails and AST validation

🔁 **Self-repair**: automatic retry when verification fails

📊 **Reproducible evaluation**: integrated Spider / Dr.Spider benchmarking

📦 **Config-driven design**: YAML pipeline factory

🧩 **Plug-and-play adapters**: SQLite / PostgreSQL / OpenAI / Anthropic / Ollama

🧠 **FastAPI service + Streamlit UI**: demo or API mode

🧰 **CI/CD ready**: Makefile, Ruff, Mypy, Pytest, Docker, GitHub Actions

📈 **Observability stack**: Prometheus & Grafana metrics for latency and errors

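A YAML pipeline factory usually comes down to a registry that maps stage names in the config to factory functions. A minimal sketch of that pattern, assuming the YAML has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`); the registry, the `strip_semicolon` and `limit_guard` stages, and the config shape are hypothetical rather than the project's schema.

```python
import re
from typing import Callable, Dict, List

Stage = Callable[[dict], dict]

# Hypothetical registry mapping stage names (as written in a config) to factories.
STAGE_REGISTRY: Dict[str, Callable[..., Stage]] = {}


def register(name: str):
    """Decorator that exposes a stage factory under `name`."""
    def wrap(factory: Callable[..., Stage]) -> Callable[..., Stage]:
        STAGE_REGISTRY[name] = factory
        return factory
    return wrap


@register("strip_semicolon")
def make_strip_semicolon() -> Stage:
    def stage(state: dict) -> dict:
        return {**state, "sql": state["sql"].strip().rstrip(";").strip()}
    return stage


@register("limit_guard")
def make_limit_guard(max_rows: int = 100) -> Stage:
    # Matches a trailing "LIMIT n", "LIMIT n OFFSET m" or "LIMIT m, n".
    trailing_limit = re.compile(r"\blimit\s+\d+(?:\s*(?:,|\boffset\b)\s*\d+)?\s*$", re.IGNORECASE)

    def stage(state: dict) -> dict:
        sql = state["sql"]
        if not trailing_limit.search(sql):
            sql = f"{sql} LIMIT {max_rows}"  # cap result size if no LIMIT is present
        return {**state, "sql": sql}
    return stage


def build_pipeline(config: dict) -> List[Stage]:
    """Instantiate stages in config order; unknown names fail fast."""
    stages = []
    for spec in config["stages"]:
        factory = STAGE_REGISTRY.get(spec["name"])
        if factory is None:
            raise ValueError(f"unknown stage: {spec['name']!r}")
        stages.append(factory(**spec.get("params", {})))
    return stages


def run(stages: List[Stage], state: dict) -> dict:
    for stage in stages:
        state = stage(state)
    return state


# What yaml.safe_load might return for a small pipeline config (shape is assumed).
CONFIG = {
    "stages": [
        {"name": "strip_semicolon"},
        {"name": "limit_guard", "params": {"max_rows": 10}},
    ]
}

if __name__ == "__main__":
    pipeline = build_pipeline(CONFIG)
    print(run(pipeline, {"sql": "SELECT name FROM singer;"})["sql"])  # SELECT name FROM singer LIMIT 10
```

Failing fast on unknown stage names keeps a typo in the config from silently dropping a stage.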
---

## 🧩 Observability & GenAIOps

Monitor every stage of the pipeline in real time:

* `/metrics` endpoint exposed via FastAPI
* Prometheus + Grafana stack with `make obs-up`
* Metrics tracked:
  * `nl2sql_stage_latency_ms`
  * `nl2sql_stage_error_total`
  * `nl2sql_query_exec_count`
  * `nl2sql_repair_success_rate`

```bash
make obs-up     # start Prometheus + Grafana
make obs-down   # stop the stack
```

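Prometheus scrapes `/metrics` as plain text in its exposition format: `# HELP` and `# TYPE` lines followed by one `name{label="value"} number` sample per series. The sketch below renders a single labelled counter by hand to show that format. Only the metric name comes from the list above; the help text and `stage` label are assumptions, and a real service would normally use the official `prometheus_client` library instead.

```python
from collections import defaultdict


class LabelledCounter:
    """Minimal counter with one label, rendered in Prometheus text exposition format."""

    def __init__(self, name: str, help_text: str, label: str):
        self.name, self.help_text, self.label = name, help_text, label
        self._values = defaultdict(float)

    def inc(self, label_value: str, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only go up")
        self._values[label_value] += amount

    def render(self) -> str:
        # HELP text must escape backslash and newline.
        help_line = self.help_text.replace("\\", "\\\\").replace("\n", "\\n")
        lines = [f"# HELP {self.name} {help_line}", f"# TYPE {self.name} counter"]
        for value, count in sorted(self._values.items()):
            # Label values must escape backslash, double quote and newline.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            lines.append(f'{self.name}{{{self.label}="{escaped}"}} {count}')
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    errors = LabelledCounter("nl2sql_stage_error_total", "Errors raised per pipeline stage.", "stage")
    errors.inc("safety")
    errors.inc("safety")
    errors.inc("executor")
    print(errors.render(), end="")
```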
---

## 🧪 Quick Start

### 1️⃣ Clone & Run

```bash
git clone https://github.com/melika-kheirieh/nl2sql-copilot.git
cd nl2sql-copilot
make run
```

Or build with Docker:

```bash
docker build -t nl2sql-copilot .
docker run --rm -p 8000:8000 nl2sql-copilot
```

* API docs: [http://localhost:8000/docs](http://localhost:8000/docs)
* Streamlit demo: [http://localhost:7860](http://localhost:7860)

---

## 🧠 For Developers & CI/CD

```bash
make lint        # Ruff
make typecheck   # Mypy
make test        # Pytest
make bench       # Run benchmark suite
```

### CI/CD Highlights

* Runs on GitHub Actions (`make check`)
* Enforces formatting, typing, tests, and Docker build
* Publishes Docker image to GHCR on successful merge

---

## 🎯 Why it matters

* Bridges **natural language and databases** with measurable reliability
* Provides **reproducible evaluation** for continuous model tracking
* Delivers **production-level resilience** via self-repair and observability
* Demonstrates **AI software engineering** beyond prompt design

---

## 👤 Author

**Melika Kheirieh**
AI Engineer & Researcher in Natural Language Interfaces for Databases
[GitHub](https://github.com/melika-kheirieh) · [LinkedIn](https://www.linkedin.com/in/melika-kheirieh-03a7b5176/)

> This project evolved from [NL2SQL Copilot Prototype](https://github.com/melika-kheirieh/nl2sql-copilot-prototype), refactored into a production-grade, modular agent.

---

## 📄 License

MIT © 2025 Melika Kheirieh

benchmarks/results_pro/20251109-110149/eval.jsonl
DELETED
@@ -1,5 +0,0 @@
{"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer limit 1", "ok": true, "latency_ms": 12419, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7389, "summary": "ok", "notes": {"len_plan": 1297}, "token_in": 265, "token_out": 305, "cost_usd": 0.00022274999999999997, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 3333, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 787, "token_out": 19, "cost_usd": 0.00012945, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 2, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 812, "summary": "ok", "notes": {"old_sql_len": 27, "new_sql_len": 35, "attempt": 1}, "token_in": 313, "token_out": 8, "cost_usd": 5.175e-05, "sql_length": null, "row_count": null, 
"verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 867, "summary": "ok", "notes": {"old_sql_len": 35, "new_sql_len": 35, "attempt": 2}, "token_in": 316, "token_out": 8, "cost_usd": 5.2199999999999995e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, 
"repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer limit 1", "ok": true, "latency_ms": 13653, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 8492, "summary": "ok", "notes": {"len_plan": 1444}, "token_in": 266, "token_out": 343, "cost_usd": 0.0002457, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 3127, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 826, "token_out": 19, "cost_usd": 0.00013529999999999998, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 914, "summary": "ok", "notes": {"old_sql_len": 27, "new_sql_len": 35, "attempt": 1}, "token_in": 313, "token_out": 8, "cost_usd": 5.175e-05, "sql_length": null, 
"row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1108, "summary": "ok", "notes": {"old_sql_len": 35, "new_sql_len": 35, "attempt": 2}, "token_in": 316, "token_out": 8, "cost_usd": 5.2199999999999995e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 35}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, 
"error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc LIMIT 10", "ok": true, "latency_ms": 12306, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 6684, "summary": "ok", "notes": {"len_plan": 1253}, "token_in": 276, "token_out": 287, "cost_usd": 0.0002136, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 3456, "summary": "ok", "notes": {"rationale_len": 85}, "token_in": 780, "token_out": 37, "cost_usd": 0.0001392, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 911, "summary": "ok", "notes": {"old_sql_len": 55, "new_sql_len": 64, 
"attempt": 1}, "token_in": 320, "token_out": 21, "cost_usd": 6.0599999999999996e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 64}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1239, "summary": "ok", "notes": {"old_sql_len": 64, "new_sql_len": 64, "attempt": 2}, "token_in": 323, "token_out": 21, "cost_usd": 6.104999999999999e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 2, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 64}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["exec_error:preview_failed"]}, "token_in": 
null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 14824, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 9466, "summary": "ok", "notes": {"len_plan": 1418}, "token_in": 274, "token_out": 352, "cost_usd": 0.00025229999999999995, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "generator", "duration_ms": 2949, "summary": "ok", "notes": {"rationale_len": 87}, "token_in": 843, "token_out": 46, "cost_usd": 0.00015404999999999998, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["aggregation_without_group_by", "exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1139, 
"summary": "ok", "notes": {"old_sql_len": 72, "new_sql_len": 80, "attempt": 1}, "token_in": 328, "token_out": 24, "cost_usd": 6.36e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"issues": ["aggregation_without_group_by", "exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1250, "summary": "ok", "notes": {"old_sql_len": 80, "new_sql_len": 72, "attempt": 2}, "token_in": 332, "token_out": 21, "cost_usd": 6.24e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 2, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 1, "summary": 
"ok", "notes": {"issues": ["aggregation_without_group_by", "exec_error:preview_failed"]}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}, "skipped": false}, {"stage": "pipeline", "duration_ms": 0, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}, "skipped": false}]}
benchmarks/results_pro/20251109-110149/latency_histogram.png
DELETED
Binary file (17.4 kB)
benchmarks/results_pro/20251109-110149/latency_per_stage.png
DELETED
Binary file (30.1 kB)
benchmarks/results_pro/20251109-110149/metrics_overview.png
DELETED
Binary file (20.2 kB)
benchmarks/results_pro/20251109-110149/results.csv
DELETED
@@ -1,6 +0,0 @@
db_id,query,ok,em,sm,exec_acc,latency_ms
concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,12419
concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,13653
concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,12306
concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,14824
benchmarks/results_pro/20251109-110149/summary.json
DELETED
@@ -1,21 +0,0 @@
{
  "timestamp": "2025-11-09T11:02:43",
  "split": "dev",
  "config": "configs/sqlite_pipeline.yaml",
  "total": 5,
  "success": 5,
  "success_rate": 1.0,
  "avg_latency_ms": 10640.4,
  "p50_latency_ms": 12419.0,
  "p95_latency_ms": 14589.8,
  "EM": 0.4,
  "SM": 0.8,
  "ExecAcc": 0.8,
  "detector_avg_ms": 0.0,
  "planner_avg_ms": 8007.75,
  "generator_avg_ms": 3216.25,
  "safety_avg_ms": 2.0,
  "executor_avg_ms": 1.25,
  "verifier_avg_ms": 0.58,
  "repair_avg_ms": 1030.0
}
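The aggregate fields of this summary.json can be recomputed from the five per-query rows of the 20251109-110149 results.csv. The Python sketch below is a reconstruction, not the project's own benchmark script: the function names are hypothetical, each row is assumed to be a dict keyed by the CSV column names with `ok` already parsed to a bool, and percentiles are assumed to use linear interpolation between closest ranks (NumPy's default rule). Under those assumptions it reproduces the avg, p50 and p95 latencies of every run in this commit. Note that the ambiguous query, which stopped at the detector, still enters the latency statistics as 0 ms and pulls the average down.

```python
def percentile_linear(values, q):
    """q-th percentile (0-100), interpolating linearly between closest ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(rows):
    """Aggregate per-query rows into summary.json-style fields (hypothetical helper)."""
    n = len(rows)
    latencies = [r["latency_ms"] for r in rows]
    success = sum(1 for r in rows if r["ok"])
    return {
        "total": n,
        "success": success,
        "success_rate": success / n,
        # Skipped (ambiguous) queries are kept, contributing a latency of 0 ms.
        "avg_latency_ms": round(sum(latencies) / n, 2),
        "p50_latency_ms": round(percentile_linear(latencies, 50), 2),
        "p95_latency_ms": round(percentile_linear(latencies, 95), 2),
        "EM": sum(r["em"] for r in rows) / n,
        "SM": sum(r["sm"] for r in rows) / n,
        "ExecAcc": sum(r["exec_acc"] for r in rows) / n,
    }


# Rows of benchmarks/results_pro/20251109-110149/results.csv (query text omitted).
ROWS_110149 = [
    {"ok": True, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "latency_ms": 12419},
    {"ok": True, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "latency_ms": 13653},
    {"ok": True, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "latency_ms": 0},
    {"ok": True, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "latency_ms": 12306},
    {"ok": True, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "latency_ms": 14824},
]

print(summarize(ROWS_110149))
```

With five sorted latencies, p95 falls at rank position 0.95 × 4 = 3.8, so it lies 80% of the way from the fourth to the fifth value: 13653 + 0.8 × (14824 − 13653) = 14589.8.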
benchmarks/results_pro/20251109-123149/eval.jsonl
DELETED
@@ -1,5 +0,0 @@
{"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 11573, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7978, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 3588, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 796, "token_out": 19, "cost_usd": 0.0001308, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 3, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9087, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7653, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 1432, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 789, "token_out": 19, "cost_usd": 0.00012974999999999998, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 10200, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 8373, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 1824, "summary": "ok", "notes": {"rationale_len": 85}, "token_in": 801, "token_out": 37, "cost_usd": 0.00014235, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 2, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": false, "latency_ms": 20765, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 9562, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 4303, "summary": "ok", "notes": {"rationale_len": 67}, "token_in": 827, "token_out": 42, "cost_usd": 0.00014924999999999997, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 5379, "summary": "ok", "notes": {"old_sql_len": 72, "new_sql_len": 80}, 
"token_in": 328, "token_out": 24, "cost_usd": 6.36e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1516, "summary": "ok", "notes": {"old_sql_len": 80, "new_sql_len": 72}, "token_in": 332, "token_out": 25, "cost_usd": 6.48e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
benchmarks/results_pro/20251109-123149/latency_histogram.png
DELETED
Binary file (15.6 kB)
benchmarks/results_pro/20251109-123149/latency_per_stage.png
DELETED
Binary file (30.4 kB)
benchmarks/results_pro/20251109-123149/metrics_overview.png
DELETED
Binary file (22 kB)
benchmarks/results_pro/20251109-123149/results.csv
DELETED
@@ -1,6 +0,0 @@
db_id,query,ok,em,sm,exec_acc,latency_ms
concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,11573
concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,9087
concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,10200
concert_singer,"What is the average, minimum, and maximum age of all singers from France?",❌,0.0,1.0,1.0,20765
benchmarks/results_pro/20251109-123149/summary.json
DELETED
@@ -1,21 +0,0 @@
{
  "timestamp": "2025-11-09T12:32:41",
  "split": "dev",
  "config": "configs/sqlite_pipeline.yaml",
  "total": 5,
  "success": 4,
  "success_rate": 0.8,
  "avg_latency_ms": 10325.0,
  "p50_latency_ms": 10200.0,
  "p95_latency_ms": 18926.6,
  "EM": 0.4,
  "SM": 0.8,
  "ExecAcc": 0.8,
  "detector_avg_ms": 0.0,
  "planner_avg_ms": 8391.5,
  "generator_avg_ms": 2786.75,
  "safety_avg_ms": 1.5,
  "executor_avg_ms": 1.33,
  "verifier_avg_ms": 0.0,
  "repair_avg_ms": 3447.5
}
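The per-stage fields appear to be the mean `duration_ms` over every trace entry with that stage name, pooled across all queries and rounded to two decimals. That reading matches all three runs whose eval.jsonl appears in full in this commit, and it explains `executor_avg_ms` = 1.33 here: the France query's two repair rounds re-ran the executor, so it contributes three executor entries, and together with one entry from each of the other three executed queries the mean is 8 / 6 ≈ 1.33 ms. Entries with stage `pipeline`, which appear in some traces, are not reported. The sketch below is an inference from the deleted artifacts, not the repository's code; the function name and the stage list are assumptions.

```python
import json
from collections import defaultdict

# Stages reported in summary.json; other trace entries (e.g. "pipeline") are ignored.
REPORTED_STAGES = ("detector", "planner", "generator", "safety",
                   "executor", "verifier", "repair")


def stage_averages(jsonl_lines):
    """Mean duration_ms per stage, pooled over all trace entries of all queries."""
    durations = defaultdict(list)
    for line in jsonl_lines:
        record = json.loads(line)
        for step in record.get("trace", []):
            durations[step["stage"]].append(step["duration_ms"])
    return {
        f"{stage}_avg_ms": round(sum(durations[stage]) / len(durations[stage]), 2)
        for stage in REPORTED_STAGES
        if durations[stage]  # skip stages that never ran
    }


# (stage, duration_ms) pairs copied from the five traces of run 20251109-123149.
TRACES_123149 = [
    [("detector", 0), ("planner", 7978), ("generator", 3588), ("safety", 2),
     ("executor", 3), ("verifier", 0)],
    [("detector", 0), ("planner", 7653), ("generator", 1432), ("safety", 1),
     ("executor", 1), ("verifier", 0)],
    [("detector", 0)],  # ambiguous question: the pipeline stopped after the detector
    [("detector", 0), ("planner", 8373), ("generator", 1824), ("safety", 2),
     ("executor", 1), ("verifier", 0)],
    [("detector", 0), ("planner", 9562), ("generator", 4303), ("safety", 1),
     ("executor", 1), ("verifier", 0), ("repair", 5379), ("executor", 1),
     ("verifier", 0), ("repair", 1516), ("executor", 1), ("verifier", 0)],
]

lines = [
    json.dumps({"trace": [{"stage": s, "duration_ms": d} for s, d in trace]})
    for trace in TRACES_123149
]
print(stage_averages(lines))
```

Pooling per entry rather than per query means a stage that loops (repair, and the executor and verifier it re-triggers) is weighted by how often it ran, which is why these averages can differ noticeably from per-query latency breakdowns.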
benchmarks/results_pro/20251109-123424/eval.jsonl
DELETED
@@ -1,5 +0,0 @@
{"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 10712, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 8455, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 2253, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 792, "token_out": 19, "cost_usd": 0.0001302, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 12981, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 9963, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 3018, "summary": "ok", "notes": {"rationale_len": 30}, "token_in": 800, "token_out": 19, "cost_usd": 0.0001314, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 0, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 0, "summary": "ok", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 9753, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7528, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 2224, "summary": "ok", "notes": {"rationale_len": 85}, "token_in": 816, "token_out": 37, "cost_usd": 0.0001446, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 1, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": false, "latency_ms": 12406, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 0, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}, "skipped": false}, {"stage": "planner", "duration_ms": 7105, "summary": "ok", "notes": {}, "skipped": false}, {"stage": "generator", "duration_ms": 2892, "summary": "ok", "notes": {"rationale_len": 67}, "token_in": 854, "token_out": 42, "cost_usd": 0.00015329999999999999, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "safety", "duration_ms": 3, "summary": "ok", "notes": {}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1162, "summary": "ok", "notes": {"old_sql_len": 72, "new_sql_len": 80}, 
"token_in": 328, "token_out": 24, "cost_usd": 6.36e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 1, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "repair", "duration_ms": 1241, "summary": "ok", "notes": {"old_sql_len": 80, "new_sql_len": 72}, "token_in": 332, "token_out": 21, "cost_usd": 6.24e-05, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "executor", "duration_ms": 0, "summary": "ok", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}, {"stage": "verifier", "duration_ms": 0, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}, "token_in": null, "token_out": null, "cost_usd": null, "sql_length": null, "row_count": null, "verified": null, "error_type": null, "repair_attempts": null, "skipped": false}]}
benchmarks/results_pro/20251109-123424/results.csv
DELETED
@@ -1,6 +0,0 @@
db_id,query,ok,em,sm,exec_acc,latency_ms
concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,10712
concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,12981
concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,9753
concert_singer,"What is the average, minimum, and maximum age of all singers from France?",❌,0.0,1.0,1.0,12406
benchmarks/results_pro/20251109-123424/summary.json
DELETED
@@ -1,21 +0,0 @@
{
  "timestamp": "2025-11-09T12:35:10",
  "split": "dev",
  "config": "configs/sqlite_pipeline.yaml",
  "total": 5,
  "success": 4,
  "success_rate": 0.8,
  "avg_latency_ms": 9170.4,
  "p50_latency_ms": 10712.0,
  "p95_latency_ms": 12866.0,
  "EM": 0.4,
  "SM": 0.8,
  "ExecAcc": 0.8,
  "detector_avg_ms": 0.0,
  "planner_avg_ms": 8262.75,
  "generator_avg_ms": 2596.75,
  "safety_avg_ms": 1.25,
  "executor_avg_ms": 0.67,
  "verifier_avg_ms": 0.0,
  "repair_avg_ms": 1201.5
}
benchmarks/results_pro/20251109-124602/eval.jsonl
DELETED
@@ -1,5 +0,0 @@
{"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9852, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8318, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1528, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 12321, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9326, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2994, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
{"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
{"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 8611, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6746, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1863, "summary": "failed", "notes": {"rationale_len": 85}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 9742, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 5959, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1603, "summary": "failed", "notes": {"rationale_len": 67}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1131, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1038, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, 
"sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
benchmarks/results_pro/20251109-124602/latency_histogram.png
DELETED
Binary file (16.1 kB)
benchmarks/results_pro/20251109-124602/latency_per_stage.png
DELETED
Binary file (28.9 kB)
benchmarks/results_pro/20251109-124602/metrics_overview.png
DELETED
Binary file (19 kB)
benchmarks/results_pro/20251109-124602/results.csv
DELETED
@@ -1,6 +0,0 @@
-db_id,query,ok,em,sm,exec_acc,latency_ms
-concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,9852
-concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,12321
-concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
-concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,8611
-concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,9742
benchmarks/results_pro/20251109-124602/summary.json
DELETED
@@ -1,21 +0,0 @@
-{
-  "timestamp": "2025-11-09T12:46:43",
-  "split": "dev",
-  "config": "configs/sqlite_pipeline.yaml",
-  "total": 5,
-  "success": 5,
-  "success_rate": 1.0,
-  "avg_latency_ms": 8105.2,
-  "p50_latency_ms": 9742.0,
-  "p95_latency_ms": 11827.2,
-  "EM": 0.4,
-  "SM": 0.8,
-  "ExecAcc": 0.8,
-  "detector_avg_ms": 1.0,
-  "planner_avg_ms": 7587.25,
-  "generator_avg_ms": 1997.0,
-  "safety_avg_ms": 1.33,
-  "executor_avg_ms": 1.0,
-  "verifier_avg_ms": 1.0,
-  "repair_avg_ms": 1084.5
-}
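The headline numbers in this deleted summary.json can be reproduced from the five rows of the 20251109-124602 results.csv using plain aggregates. The Python sketch below is a reconstruction, not the repository's evaluation script. `summarize` and `percentile` are hypothetical helpers. The `ok` cells are read as a ✅ check mark, because every matching eval.jsonl record has `"ok": true`. The percentile rule is assumed to be linear interpolation between closest ranks (NumPy's default), which matches both p50 and p95.

```python
import csv
import io
import math

# Rows copied from the deleted 20251109-124602/results.csv (ok cells read as ✅).
RESULTS_CSV = """\
db_id,query,ok,em,sm,exec_acc,latency_ms
concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,9852
concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,12321
concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,8611
concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,9742
"""


def percentile(values, q):
    """Percentile (q in 0..100) with linear interpolation between closest ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(csv_text):
    """Recompute the headline fields of summary.json from results.csv text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n = len(rows)
    # The ambiguous question (0 ms, stopped at the detector) stays in the sample.
    latencies = [float(r["latency_ms"]) for r in rows]
    success = sum(r["ok"] == "✅" for r in rows)

    def mean(col):
        return round(sum(float(r[col]) for r in rows) / n, 2)

    return {
        "total": n,
        "success": success,
        "success_rate": success / n,
        "avg_latency_ms": round(sum(latencies) / n, 2),
        "p50_latency_ms": round(percentile(latencies, 50), 2),
        "p95_latency_ms": round(percentile(latencies, 95), 2),
        "EM": mean("em"),
        "SM": mean("sm"),
        "ExecAcc": mean("exec_acc"),
    }


summary = summarize(RESULTS_CSV)
print(summary)
```

The ambiguous question stops at the detector and is logged with a latency of 0 ms. It is counted both as a success and as a latency sample, and the reported mean of 8105.2 ms only comes out with it included. Leaving it out would raise the mean over the four answered questions to 10131.5 ms.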
benchmarks/results_pro/20251109-125209/eval.jsonl
DELETED
@@ -1,5 +0,0 @@
-
{"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9129, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6689, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2430, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
-
{"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 10009, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9046, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 960, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
-
{"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
-
{"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 9560, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8063, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1494, "summary": "failed", "notes": {"rationale_len": 85}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
-
{"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 12661, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7869, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2410, "summary": "failed", "notes": {"rationale_len": 67}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1179, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "repair", "duration_ms": 1191, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, 
"sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
benchmarks/results_pro/20251109-125209/latency_histogram.png
DELETED
Binary file (16.1 kB)
benchmarks/results_pro/20251109-125209/latency_per_stage.png
DELETED
Binary file (29.8 kB)
benchmarks/results_pro/20251109-125209/metrics_overview.png
DELETED
Binary file (19.1 kB)
benchmarks/results_pro/20251109-125209/results.csv
DELETED
@@ -1,6 +0,0 @@
-db_id,query,ok,em,sm,exec_acc,latency_ms
-concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,9129
-concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,10009
-concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
-concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,9560
-concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,12661
benchmarks/results_pro/20251109-125209/summary.json
DELETED
@@ -1,21 +0,0 @@
-{
-  "timestamp": "2025-11-09T12:52:50",
-  "split": "dev",
-  "config": "configs/sqlite_pipeline.yaml",
-  "total": 5,
-  "success": 5,
-  "success_rate": 1.0,
-  "avg_latency_ms": 8271.8,
-  "p50_latency_ms": 9560.0,
-  "p95_latency_ms": 12130.6,
-  "EM": 0.4,
-  "SM": 0.8,
-  "ExecAcc": 0.8,
-  "detector_avg_ms": 1.0,
-  "planner_avg_ms": 7916.75,
-  "generator_avg_ms": 1823.5,
-  "safety_avg_ms": 1.5,
-  "executor_avg_ms": 1.5,
-  "verifier_avg_ms": 1.0,
-  "repair_avg_ms": 1185.0
-}
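The per-stage fields of this summary.json match a plain mean over every trace entry for a stage in the 20251109-125209 eval.jsonl above. For example, planner_avg_ms is (6689 + 9046 + 8063 + 7869) / 4 = 7916.75. safety_avg_ms is 1.5 because the safety, executor and verifier stages re-run after each repair, and every pass counts as a separate sample.

The Python sketch below makes that assumption explicit. `stage_averages` and `trace` are hypothetical names, and the traces are trimmed to `stage` and `duration_ms`. Skipping the `pipeline` bookkeeping entries is inferred from their absence in summary.json.

```python
from collections import defaultdict


def stage_averages(records):
    """Mean duration_ms per stage, counting every trace entry as one sample."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        for step in record["trace"]:
            stage = step["stage"]
            if stage == "pipeline":  # assumption: bookkeeping steps are not reported
                continue
            totals[stage] += step["duration_ms"]
            counts[stage] += 1
    return {f"{stage}_avg_ms": round(totals[stage] / counts[stage], 2) for stage in totals}


def trace(*steps):
    """Build a record holding only (stage, duration_ms) pairs from a trace."""
    return {"trace": [{"stage": s, "duration_ms": d} for s, d in steps]}


# Durations copied from the five traces in the deleted 20251109-125209/eval.jsonl.
records = [
    trace(("detector", 1), ("planner", 6689), ("generator", 2430), ("safety", 2),
          ("executor", 2), ("verifier", 1), ("pipeline", 1)),
    trace(("detector", 1), ("planner", 9046), ("generator", 960), ("safety", 1),
          ("executor", 2), ("verifier", 1), ("pipeline", 1)),
    trace(("detector", 1)),  # ambiguous question: stops after the detector
    trace(("detector", 1), ("planner", 8063), ("generator", 1494), ("safety", 1),
          ("executor", 1), ("verifier", 1), ("pipeline", 1)),
    trace(("detector", 1), ("planner", 7869), ("generator", 2410), ("safety", 3),
          ("executor", 2), ("verifier", 1), ("repair", 1179), ("safety", 1),
          ("executor", 1), ("verifier", 1), ("repair", 1191), ("safety", 1),
          ("executor", 1), ("verifier", 1), ("pipeline", 1), ("pipeline", 1)),
]

print(stage_averages(records))
```

Counting each re-run as its own sample means that a question that triggers repairs has more weight in the safety, executor and verifier averages than one that passes on the first attempt.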
benchmarks/results_pro/20251109-171247/eval.jsonl
ADDED
@@ -0,0 +1,20 @@
+
{"source": "spider", "db_id": "concert_singer", "query": "How many singers do we have?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 9497, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7643, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1846, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 2, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What is the total number of singers?", "gold_sql": "SELECT count(*) FROM singer", "pred_sql": "select count(*) from singer", "ok": true, "latency_ms": 12602, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 10933, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1667, "summary": "failed", "notes": {"rationale_len": 30}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 1, "sql_length": 27}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 27, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "Show name, country, age for all singers ordered by age from the oldest to the youngest.", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What are the names, countries, and ages for every singer in descending order of age?", "gold_sql": "SELECT name , country , age FROM singer ORDER BY age DESC", "pred_sql": "select Name, Country, Age from singer order by Age desc", "ok": true, "latency_ms": 9097, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7143, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1951, "summary": "failed", "notes": {"rationale_len": 85}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 6, "col_count": 3, "sql_length": 55}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 55, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age of all singers from France?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 11809, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7605, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1247, "summary": "failed", "notes": {"rationale_len": 67}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1106, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 3, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1839, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", 
"duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What is the average, minimum, and maximum age for all French singers?", "gold_sql": "SELECT avg(age) , min(age) , max(age) FROM singer WHERE country = 'France'", "pred_sql": "select avg(Age), min(Age), max(Age) from singer where Country = 'France'", "ok": true, "latency_ms": 9349, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 5609, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1319, "summary": "failed", "notes": {"rationale_len": 84}}, {"stage": "safety", "duration_ms": 4, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1321, "summary": "failed", "notes": {"old_sql_len": 72, "new_sql_len": 80}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 80}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 80, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1087, "summary": "failed", "notes": {"old_sql_len": 80, "new_sql_len": 72}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", 
"duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 3, "sql_length": 72}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 72, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "Show the name and the release year of the song by the youngest singer.", "gold_sql": "SELECT song_name , song_release_year FROM singer ORDER BY age LIMIT 1", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What are the names and release years for all the songs of the youngest singer?", "gold_sql": "SELECT song_name , song_release_year FROM singer ORDER BY age LIMIT 1", "pred_sql": "select Song_Name, Song_release_year from singer where Age = (select min(Age) from singer) LIMIT 1", "ok": true, "latency_ms": 13807, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9433, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1937, "summary": "failed", "notes": {"rationale_len": 105}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 89}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 89, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1154, "summary": "failed", "notes": {"old_sql_len": 89, "new_sql_len": 97}}, {"stage": "safety", "duration_ms": 4, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 97}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 97, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1271, "summary": "failed", "notes": {"old_sql_len": 97, "new_sql_len": 97}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": 
{}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 97}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 97, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What are all distinct countries where singers above age 20 are from?", "gold_sql": "SELECT DISTINCT country FROM singer WHERE age > 20", "pred_sql": "select distinct Country from singer where Age > 20", "ok": true, "latency_ms": 8790, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 7544, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1244, "summary": "failed", "notes": {"rationale_len": 92}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 50}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 50, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": true, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What are the different countries with singers above age 20?", "gold_sql": "SELECT DISTINCT country FROM singer WHERE age > 20", "pred_sql": "select distinct Country from singer where Age > 20", "ok": true, "latency_ms": 7389, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6363, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1025, "summary": "failed", "notes": {"rationale_len": 81}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 50}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 50, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": true, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "Show all countries and the number of singers in each country.", "gold_sql": "SELECT country , count(*) FROM singer GROUP BY country", "pred_sql": "select Country, count(*) from singer group by Country", "ok": true, "latency_ms": 10349, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8715, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1630, "summary": "failed", "notes": {"rationale_len": 79}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 2, "sql_length": 53}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 53, "has_select": true, "has_from": true, "has_over": false, "has_group_by": true, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "How many singers are from each country?", "gold_sql": "SELECT country , count(*) FROM singer GROUP BY country", "pred_sql": "select country, count(*) from singer group by country", "ok": true, "latency_ms": 9137, "em": 1.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8036, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1099, "summary": "failed", "notes": {"rationale_len": 46}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 2, "sql_length": 53}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 53, "has_select": true, "has_from": true, "has_over": false, "has_group_by": true, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "List all song names by singers above the average age.", "gold_sql": "SELECT song_name FROM singer WHERE age > (SELECT avg(age) FROM singer)", "pred_sql": "select Song_Name from singer where Age > (select avg(Age) from singer)", "ok": true, "latency_ms": 10019, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8685, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1331, "summary": "failed", "notes": {"rationale_len": 87}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 70}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 70, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What are all the song names by singers who are older than average?", "gold_sql": "SELECT song_name FROM singer WHERE age > (SELECT avg(age) FROM singer)", "pred_sql": "select Song_Name from singer where Age > (select avg(Age) from singer)", "ok": true, "latency_ms": 15309, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 13022, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 2284, "summary": "failed", "notes": {"rationale_len": 81}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 3, "col_count": 1, "sql_length": 70}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 70, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "Show location and name for all stadiums with a capacity between 5000 and 10000.", "gold_sql": "SELECT LOCATION , name FROM stadium WHERE capacity BETWEEN 5000 AND 10000", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
+
{"source": "spider", "db_id": "concert_singer", "query": "What are the locations and names of all stations with capacity between 5000 and 10000?", "gold_sql": "SELECT LOCATION , name FROM stadium WHERE capacity BETWEEN 5000 AND 10000", "pred_sql": "select Location, Name from stadium where Capacity >= 5000 and Capacity <= 10000", "ok": true, "latency_ms": 10709, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 6901, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 3805, "summary": "failed", "notes": {"rationale_len": 82}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 0, "col_count": 2, "sql_length": 79}}, {"stage": "verifier", "duration_ms": 1, "summary": "ok", "notes": {"sql_length": 79, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": false, "mixes_cols": false, "verified": true, "reason": "ok"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the maximum capacity and the average of all stadiums ?", "gold_sql": "select max(capacity), average from stadium", "pred_sql": "SELECT max(Capacity), avg(Capacity) FROM stadium", "ok": true, "latency_ms": 11707, "em": 0.0, "sm": 0.0, "exec_acc": 0.5, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 8334, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1027, "summary": "failed", "notes": {"rationale_len": 51}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1231, "summary": "failed", "notes": {"old_sql_len": 48, "new_sql_len": 56}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 56}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 56, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1105, "summary": "failed", "notes": {"old_sql_len": 56, "new_sql_len": 48}}, {"stage": "safety", "duration_ms": 2, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the average and maximum capacities for all stadiums ?", "gold_sql": "select avg(capacity) , max(capacity) from stadium", "pred_sql": "select avg(Capacity), max(Capacity) from stadium", "ok": true, "latency_ms": 12690, "em": 0.0, "sm": 1.0, "exec_acc": 1.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "clear", "notes": {"ambiguous": false, "questions_len": 0}}, {"stage": "planner", "duration_ms": 9463, "summary": "ok", "notes": {}}, {"stage": "generator", "duration_ms": 1258, "summary": "failed", "notes": {"rationale_len": 68}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 842, "summary": "failed", "notes": {"old_sql_len": 48, "new_sql_len": 48}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "repair", "duration_ms": 1119, "summary": "failed", "notes": {"old_sql_len": 48, "new_sql_len": 48}}, {"stage": "safety", "duration_ms": 1, "summary": "failed", "notes": {}}, {"stage": "executor", "duration_ms": 1, "summary": "failed", "notes": {"row_count": 1, "col_count": 2, "sql_length": 48}}, {"stage": "verifier", "duration_ms": 1, "summary": "failed", "notes": {"sql_length": 48, "has_select": true, "has_from": true, "has_over": false, "has_group_by": false, "has_distinct": false, "has_aggregate": true, "mixes_cols": true, "verified": false, "reason": "aggregation-without-groupby"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "auto-verified", "notes": {"reason": "executor succeeded, verifier silent"}}, {"stage": "pipeline", "duration_ms": 1, "summary": "finalize", "notes": {"final_verified": true, "details_len": 0, "need_verification": false}}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the name and capacity for the stadium with highest average attendance?", "gold_sql": "SELECT name , capacity FROM stadium ORDER BY average DESC LIMIT 1", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
{"source": "spider", "db_id": "concert_singer", "query": "What is the name and capacity for the stadium with the highest average attendance?", "gold_sql": "SELECT name , capacity FROM stadium ORDER BY average DESC LIMIT 1", "pred_sql": "", "ok": true, "latency_ms": 0, "em": 0.0, "sm": 0.0, "exec_acc": 0.0, "error": null, "trace": [{"stage": "detector", "duration_ms": 1, "summary": "ambiguous", "notes": {"ambiguous": true, "questions_len": 1}}]}
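Each eval.jsonl record carries a `trace` array of `{"stage", "duration_ms", ...}` entries, and summary.json reports per-stage averages such as `planner_avg_ms` and `repair_avg_ms`. The evaluation script is not part of this commit excerpt, so the following is only a hypothetical sketch of how such averages could be derived from the traces. It assumes every occurrence of a stage counts as one sample (so the repeated safety/executor/verifier rounds after each repair each contribute), and `stage_averages` is an illustrative name, not the project's API.

```python
import json
from collections import defaultdict


def stage_averages(jsonl_lines):
    """Mean duration_ms per stage name over all trace entries.

    Hypothetical: each occurrence of a stage in a trace is one sample,
    so a query that went through two repair rounds adds two repair samples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in jsonl_lines:
        line = line.strip()
        if not line:  # tolerate blank lines in the JSONL file
            continue
        record = json.loads(line)
        for step in record.get("trace", []):
            totals[step["stage"]] += step["duration_ms"]
            counts[step["stage"]] += 1
    return {stage: round(totals[stage] / counts[stage], 2) for stage in totals}
```

Queries that the detector flags as ambiguous stop after a single `detector` entry, so under this counting they affect only `detector_avg_ms`.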
benchmarks/results_pro/20251109-171247/latency_histogram.png
ADDED
benchmarks/results_pro/20251109-171247/latency_per_stage.png
ADDED
benchmarks/results_pro/20251109-171247/metrics_overview.png
ADDED
benchmarks/results_pro/20251109-171247/results.csv
ADDED
@@ -0,0 +1,21 @@
db_id,query,ok,em,sm,exec_acc,latency_ms
concert_singer,"How many singers do we have?",✅,1.0,1.0,1.0,9497
concert_singer,"What is the total number of singers?",✅,1.0,1.0,1.0,12602
concert_singer,"Show name, country, age for all singers ordered by age from the oldest to the youngest.",✅,0.0,0.0,0.0,0
concert_singer,"What are the names, countries, and ages for every singer in descending order of age?",✅,0.0,1.0,1.0,9097
concert_singer,"What is the average, minimum, and maximum age of all singers from France?",✅,0.0,1.0,1.0,11809
concert_singer,"What is the average, minimum, and maximum age for all French singers?",✅,0.0,1.0,1.0,9349
concert_singer,"Show the name and the release year of the song by the youngest singer.",✅,0.0,0.0,0.0,0
concert_singer,"What are the names and release years for all the songs of the youngest singer?",✅,0.0,1.0,1.0,13807
concert_singer,"What are all distinct countries where singers above age 20 are from?",✅,0.0,1.0,1.0,8790
concert_singer,"What are the different countries with singers above age 20?",✅,0.0,1.0,1.0,7389
concert_singer,"Show all countries and the number of singers in each country.",✅,0.0,1.0,1.0,10349
concert_singer,"How many singers are from each country?",✅,1.0,1.0,1.0,9137
concert_singer,"List all song names by singers above the average age.",✅,0.0,1.0,1.0,10019
concert_singer,"What are all the song names by singers who are older than average?",✅,0.0,1.0,1.0,15309
concert_singer,"Show location and name for all stadiums with a capacity between 5000 and 10000.",✅,0.0,0.0,0.0,0
concert_singer,"What are the locations and names of all stations with capacity between 5000 and 10000?",✅,0.0,1.0,1.0,10709
concert_singer,"What is the maximum capacity and the average of all stadiums ?",✅,0.0,0.0,0.5,11707
concert_singer,"What is the average and maximum capacities for all stadiums ?",✅,0.0,1.0,1.0,12690
concert_singer,"What is the name and capacity for the stadium with highest average attendance?",✅,0.0,0.0,0.0,0
concert_singer,"What is the name and capacity for the stadium with the highest average attendance?",✅,0.0,0.0,0.0,0
benchmarks/results_pro/20251109-171247/summary.json
ADDED
@@ -0,0 +1,21 @@
{
  "timestamp": "2025-11-09T17:15:29",
  "split": "dev",
  "config": "configs/sqlite_pipeline.yaml",
  "total": 20,
  "success": 20,
  "success_rate": 1.0,
  "avg_latency_ms": 8113.0,
  "p50_latency_ms": 9423.0,
  "p95_latency_ms": 13882.1,
  "EM": 0.15,
  "SM": 0.7,
  "ExecAcc": 0.725,
  "detector_avg_ms": 1.0,
  "planner_avg_ms": 8361.93,
  "generator_avg_ms": 1644.67,
  "safety_avg_ms": 1.92,
  "executor_avg_ms": 1.04,
  "verifier_avg_ms": 1.0,
  "repair_avg_ms": 1207.5
}
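The aggregate fields in summary.json can be cross-checked against the 20 rows of results.csv. The actual evaluation script is not shown here, so this is a sketch under two assumptions: averages run over all 20 rows, including the five rows with `latency_ms` 0 where the detector short-circuited the pipeline, and percentiles use linear interpolation between closest ranks (the same rule as NumPy's default). `percentile` and `summarize` are hypothetical helper names; `ROWS` is copied from the `em`, `sm`, `exec_acc` and `latency_ms` columns of results.csv.

```python
# (em, sm, exec_acc, latency_ms) per query, copied from results.csv
ROWS = [
    (1.0, 1.0, 1.0, 9497),
    (1.0, 1.0, 1.0, 12602),
    (0.0, 0.0, 0.0, 0),
    (0.0, 1.0, 1.0, 9097),
    (0.0, 1.0, 1.0, 11809),
    (0.0, 1.0, 1.0, 9349),
    (0.0, 0.0, 0.0, 0),
    (0.0, 1.0, 1.0, 13807),
    (0.0, 1.0, 1.0, 8790),
    (0.0, 1.0, 1.0, 7389),
    (0.0, 1.0, 1.0, 10349),
    (1.0, 1.0, 1.0, 9137),
    (0.0, 1.0, 1.0, 10019),
    (0.0, 1.0, 1.0, 15309),
    (0.0, 0.0, 0.0, 0),
    (0.0, 1.0, 1.0, 10709),
    (0.0, 0.0, 0.5, 11707),
    (0.0, 1.0, 1.0, 12690),
    (0.0, 0.0, 0.0, 0),
    (0.0, 0.0, 0.0, 0),
]


def percentile(values, q):
    """Linear-interpolation percentile over the sorted values (q in 0..100)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def summarize(rows):
    """Recompute the accuracy and latency fields of summary.json (hypothetical)."""
    n = len(rows)
    lat = [r[3] for r in rows]
    return {
        "total": n,
        "avg_latency_ms": round(sum(lat) / n, 2),
        "p50_latency_ms": round(percentile(lat, 50), 2),
        "p95_latency_ms": round(percentile(lat, 95), 2),
        "EM": round(sum(r[0] for r in rows) / n, 3),
        "SM": round(sum(r[1] for r in rows) / n, 3),
        "ExecAcc": round(sum(r[2] for r in rows) / n, 3),
    }


if __name__ == "__main__":
    print(summarize(ROWS))
```

Under these assumptions the sketch reproduces the recorded values (avg 8113.0, p50 9423.0, p95 13882.1, EM 0.15, SM 0.7, ExecAcc 0.725). Because the zero-latency rows are included, the average and median sit below the 7389 to 15309 ms range of the queries that actually reached the planner.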