Zheyuan Zhao committed: Add design doc: pipe-sql-validation-loop-design-doc.md
docs/pipe-sql-validation-loop-design-doc.md
# Design Document: SQLite₁ → Pipe SQL → SQLite₂ Validation & Feedback Loop

## 1. Problem Statement

The decompiler transforms standard SQL into pipe SQL. We must prove that this transformation preserves semantics: the pipe SQL, when executed, returns exactly the same results as the original.

This document designs a closed-loop validation system using a **tiered benchmark strategy**: Spider 1.0 as the primary lightweight dataset (~1 GB), with BIRD Mini-Dev as a secondary stress test.

```
SQLite₁ (gold SQL) ──execute──► Result Set A
        │
        ▼
   [Decompiler]
        │
        ▼
 Pipe SQL (synthesized)
        │
        ▼
 [SQLGlot transpile pipe→sqlite]
        │
        ▼
SQLite₂ (round-tripped) ──execute──► Result Set B

Result Set A == Result Set B ?
```

**Goal**: For every query in the benchmark, `Result Set A == Result Set B`. Any mismatch triggers a diagnostic feedback loop that identifies the root cause and feeds corrections back into the decompiler.

---

## 2. Data Source: Tiered Benchmark Strategy

### 2.1 Why Not BIRD-SQL Directly?

The full BIRD-SQL benchmark is **33.4 GB**; most of that bulk comes from a few massive databases (financial, geographic datasets). This creates unnecessary friction for iterative decompiler development, where fast feedback cycles matter most.

### 2.2 Tier 1 (Primary): Spider 1.0

| Property | Value |
|---|---|
| Total queries | 10,181 (8,659 train + 1,034 dev + 2,147 test) |
| Database engine | SQLite |
| Number of databases | 200 (across 138 domains) |
| Database size | **~1 GB total** |
| Difficulty levels | easy, medium, hard, extra hard |
| Ground truth | Execution-verified SQL with known-correct result sets |
| Download | yale-lily.github.io/spider |

Spider 1.0 is ideal as the primary validation dataset because:
1. **Lightweight**: ~1 GB total, 200 small SQLite databases. Fast to download, fast to iterate.
2. All databases are **SQLite**: no external database setup required.
3. Ground truth SQL is **execution-verified**: we know the gold SQL produces correct results.
4. Queries cover JOINs, aggregations, subqueries, GROUP BY, ORDER BY, HAVING, nested queries, and set operations.
5. **Well-studied**: extensive prior work means known failure modes and edge cases are documented.
6. 200 databases across 138 domains provide broad schema diversity.

### 2.3 Tier 2 (Stress Test): BIRD Mini-Dev

| Property | Value |
|---|---|
| Total queries | 500 (curated high-quality subset) |
| Database engine | SQLite |
| Number of databases | 11 |
| Database size | ~a few GB (much smaller than full BIRD's 33.4 GB) |
| Difficulty levels | simple, moderate, challenging |
| Download | HuggingFace `birdsql/bird_mini_dev` |

BIRD Mini-Dev is used as a secondary stress test because:
1. Queries are harder than Spider's: more complex JOINs, domain-specific reasoning, challenging expressions.
2. 500 curated queries is a manageable size, yet it exercises edge cases Spider may miss.
3. It uses the same 11 dev databases as full BIRD, but without the massive train databases.

### 2.4 Tier 3 (Production Scale-Up): Full BIRD Train

Once the decompiler passes Tier 1 and Tier 2, apply it to the full BIRD train set (9,428 queries, 33.4 GB) for large-scale golden corpus generation. This is a one-time batch job; the 30+ GB download is justified only at this stage.

### 2.5 Data Format

**Spider 1.0:**
```
spider/
├── dev.json              # Question-SQL pairs
│     [
│       {
│         "db_id": "concert_singer",
│         "query": "SELECT count(*) FROM singer",
│         "query_toks": ["SELECT", "count", "(", "*", ")", ...],
│         "question": "How many singers do we have?",
│         "hardness": "easy"
│       },
│       ...
│     ]
├── database/
│   ├── concert_singer/
│   │   ├── concert_singer.sqlite
│   │   └── schema.sql
│   ├── pets_1/
│   │   ├── pets_1.sqlite
│   │   └── schema.sql
│   └── ... (200 databases)
└── dev_gold.sql          # Gold SQL per line
```

**BIRD Mini-Dev:**
```
bird_mini_dev/
├── mini_dev_sqlite.json      # Question-SQL pairs
│     [
│       {
│         "question_id": 7,
│         "db_id": "california_schools",
│         "question": "What is the phone number of ...",
│         "evidence": "",
│         "SQL": "SELECT T2.Phone FROM satscores AS T1 INNER JOIN ...",
│         "difficulty": "simple"
│       },
│       ...
│     ]
├── mini_dev_databases/
│   ├── california_schools/
│   │   └── california_schools.sqlite
│   └── ... (11 databases)
└── mini_dev_gold.sql
```

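The layout above maps each `db_id` to a SQLite file under `database/<db_id>/<db_id>.sqlite`. A minimal loader sketch for the Spider side (the helper name `load_spider_pairs` is ours, not part of the benchmark; the file layout follows this section):

```python
import json
from pathlib import Path


def load_spider_pairs(spider_root):
    """Yield (db_id, gold_sql, db_path) triples from the Spider layout above.

    Hypothetical helper: reads dev.json and resolves each entry's database
    path as database/<db_id>/<db_id>.sqlite, per Section 2.5.
    """
    root = Path(spider_root)
    entries = json.loads((root / "dev.json").read_text())
    for entry in entries:
        db_id = entry["db_id"]
        db_path = root / "database" / db_id / f"{db_id}.sqlite"
        yield db_id, entry["query"], db_path
```

The same shape works for BIRD Mini-Dev by swapping the JSON filename, the `SQL` key, and the `mini_dev_databases/` directory.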
### 2.6 Working Set

| Phase | Dataset | Queries | Purpose |
|---|---|---|---|
| Development & iteration | Spider 1.0 dev | 1,034 | Fast feedback loop (~1 GB, seconds to run) |
| Stress testing | BIRD Mini-Dev | 500 | Harder queries, edge case discovery |
| Production corpus | Spider 1.0 train | 8,659 | Scale-up of validated pipe SQL pairs |
| Production corpus | BIRD train | 9,428 | Maximum training data (download 33.4 GB only at this stage) |

---

## 3. Validation Pipeline Architecture

### 3.1 End-to-End Flow

```
For each (question_id, db_id, gold_sql):

  gold_sql (SQLite) ──execute on SQLite DB──► Result Set A (gold result)
        │
        ▼
  [Parse with SQLGlot (read=sqlite)]
        │
        ▼
  [Pre-process: qualify, unnest, simplify]
        │
        ▼
  [Pipe Emitter: AST → pipe operators]
        │
        ▼
  pipe_sql (canonical pipe syntax)
        │
        ▼
  [SQLGlot transpile pipe→sqlite]
        │
        ▼
  sqlite2_sql (round-trip) ──execute on same DB──► Result Set B (pipe result)

                        Compare A == B
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
            MATCH          MISMATCH           ERROR
        (log success)  (enter feedback)     (triage)
```

### 3.2 The Three Outcomes

| Outcome | Meaning | Action |
|---|---|---|
| **MATCH** | `set(Result A) == set(Result B)` | Query is validated. Add to golden corpus. |
| **MISMATCH** | Both execute, but result sets differ | Enter diagnostic feedback loop (Section 5). |
| **ERROR** | SQLite₂ fails to execute (syntax error, runtime error) | Enter error triage (Section 6). |

Two additional sub-outcomes exist:

| Sub-outcome | Meaning | Action |
|---|---|---|
| **DECOMPILE_FAIL** | The decompiler could not transform the query | Log with classification. Attempt fallback strategies. |
| **TIMEOUT** | Execution exceeds the 30-second limit | Use BIRD's standard 30 s timeout. Score as ERROR. |

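The outcome classification above can be sketched as a small helper. This is a sketch under assumptions: `exec_b` stands in for the execution result of the round-tripped SQL (a `(success, rows_or_error)` pair here; Section 6 uses a richer `ExecutionResult`):

```python
def classify_outcome(result_a, exec_b):
    """Map a gold result and a round-trip execution to MATCH/MISMATCH/ERROR.

    result_a: list of row tuples from the gold SQL.
    exec_b:   (success: bool, payload) pair for the round-tripped SQL —
              payload is the row list on success, an error string otherwise.
    Hypothetical shapes; the real pipeline would use ExecutionResult.
    """
    success_b, payload_b = exec_b
    if not success_b:
        return "ERROR"  # SQLite₂ failed to execute → triage (Section 6)
    # Set-based comparison, per Section 4.1
    return "MATCH" if set(result_a) == set(payload_b) else "MISMATCH"
```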
---

## 4. Result Comparison Logic

### 4.1 Set-Based Comparison

Following standard text-to-SQL evaluation methodology (used by both Spider and BIRD), comparison is **set-based**: row order does not matter.

```python
from typing import List, Tuple

def compare_results(result_a: List[Tuple], result_b: List[Tuple]) -> bool:
    """Compare two result sets using set equality."""
    return set(result_a) == set(result_b)
```

This is intentionally strict: every row must match exactly. No fuzzy matching, no type coercion.

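One caveat worth noting: `set()` collapses duplicate rows, so the set-based check cannot see a dropped or added DISTINCT. A multiset variant (a sketch, not part of the standard Spider/BIRD metric) catches the DUPLICATE_DIFF class from Section 5.2 directly:

```python
from collections import Counter

def compare_results_multiset(result_a, result_b):
    """Stricter variant: compare as multisets, so duplicate rows count.

    set-based comparison treats [(1,), (1,)] and [(1,)] as equal;
    a Counter-based check does not, which surfaces DISTINCT-related
    (DUPLICATE_DIFF) mismatches that set equality hides.
    """
    return Counter(result_a) == Counter(result_b)
```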
### 4.2 Enhanced Comparison (For Diagnostic Purposes)

When set comparison fails, we compute additional metrics to diagnose the mismatch:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ComparisonResult:
    match: bool                      # set(A) == set(B)
    result_a_rows: int
    result_b_rows: int
    result_a_cols: int
    result_b_cols: int

    # Diagnostic fields (only computed on mismatch)
    row_count_match: bool            # len(A) == len(B)
    col_count_match: bool            # same number of columns
    col_types_match: bool            # column types compatible
    sorted_match: bool               # match after sorting both
    subset_a_in_b: bool              # A ⊆ B
    subset_b_in_a: bool              # B ⊆ A
    symmetric_difference: int        # |A △ B|
    sample_diff_rows: List[Tuple]    # Up to 5 rows in A but not B
    f1_score: float                  # Cell-level F1 (soft metric)

    # Root cause classification
    mismatch_type: str               # See Section 5.2
```

### 4.3 Floating-Point Tolerance

SQLite may return slightly different floating-point results depending on expression evaluation order. We apply a tolerance layer:

```python
import math
from typing import Tuple

def normalize_row(row: Tuple, tolerance: float = 1e-6) -> Tuple:
    """Normalize a row for comparison, rounding floats to the tolerance."""
    # Derive the number of decimal places from the tolerance (1e-6 → 6)
    decimals = max(0, round(-math.log10(tolerance)))
    normalized = []
    for val in row:
        if isinstance(val, float):
            normalized.append(round(val, decimals))
        elif isinstance(val, str):
            normalized.append(val.strip())
        else:
            normalized.append(val)  # ints, None, bytes pass through unchanged
    return tuple(normalized)

def compare_with_tolerance(result_a, result_b, tolerance=1e-6):
    set_a = set(normalize_row(r, tolerance) for r in result_a)
    set_b = set(normalize_row(r, tolerance) for r in result_b)
    return set_a == set_b
```

### 4.4 Column Order Handling

Standard evaluation compares result sets as sets of tuples, which means column order matters (a row `(1, 'Alice')` ≠ `('Alice', 1)`). Since pipe SQL may reorder columns (e.g., `|> AGGREGATE` outputs grouping columns first, then aggregates), the decompiler must ensure the final `|> SELECT` matches the original column order.

If column reordering is the sole cause of a mismatch, it is classified as a **REORDER** mismatch (non-semantic, fixable by adjusting the final SELECT).

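Detecting that a mismatch is purely a column reordering can be done by brute force over column permutations (a sketch; the helper name is ours, and this assumes the small column counts typical of benchmark queries):

```python
from itertools import permutations

def is_column_permutation(result_a, result_b):
    """True if some reordering of B's columns makes set(B) equal set(A).

    Brute-force over column permutations: fine for benchmark queries,
    which rarely project more than a handful of columns.
    """
    if not result_a or not result_b:
        return result_a == result_b
    n = len(result_b[0])
    if len(result_a[0]) != n:
        return False
    set_a = set(result_a)
    for perm in permutations(range(n)):
        permuted_b = set(tuple(row[i] for i in perm) for row in result_b)
        if permuted_b == set_a:
            return True
    return False
```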
---

## 5. Diagnostic Feedback Loop (Mismatch Handling)

### 5.1 Feedback Loop Flow

```
MISMATCH detected
        │
        ▼
┌────────────────┐
│    Classify    │
│ mismatch type  │
└───────┬────────┘
        │
   ┌─────────────┼──────────────┬───────────────┬────────────────┐
   ▼             ▼              ▼               ▼                ▼
REORDER      EXTRA_ROWS    MISSING_ROWS    WRONG_VALUES      NULL_DIFF
   │             │              │               │                │
   ▼             ▼              ▼               ▼                ▼
Fix final    Diagnose       Diagnose        Diagnose         Diagnose
SELECT       filter/join    filter/join     expression       NULL handling
projection   logic          logic           rewriting        differences
   │             │              │               │                │
   ▼             ▼              ▼               ▼                ▼
┌──────────────────────────────────────────────────┐
│            Generate Fix Hypothesis               │
│     (which transformation rule is at fault?)     │
└────────────────────┬─────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────┐
│          Apply Fix to Decompiler Rule            │
│          (update rule, add edge case)            │
└────────────────────┬─────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────┐
│      Re-run Validation on Affected Queries       │
│               (regression test)                  │
└──────────────────────────────────────────────────┘
```

### 5.2 Mismatch Classification

| Type | Symptom | Likely Root Cause | Fix Strategy |
|---|---|---|---|
| **REORDER** | Same rows, different column order | Final `\|> SELECT` doesn't match original column order | Adjust projection rule to preserve original SELECT order |
| **EXTRA_ROWS** | B has rows not in A | WHERE filter was dropped or weakened during transformation | Check WHERE promotion rule; verify HAVING→WHERE conversion |
| **MISSING_ROWS** | A has rows not in B | WHERE filter is too aggressive, or JOIN type changed | Check JOIN linearization (INNER vs LEFT); verify subquery unnesting |
| **WRONG_VALUES** | Same row count, different values | Expression rewriting error (e.g., aggregate alias mismatch, CASE transformation) | Diff the two SQLs at the expression level; identify which column differs |
| **NULL_DIFF** | Mismatch only in NULL-containing rows | NULL ordering difference, or LEFT JOIN → INNER JOIN conversion | Check JOIN types and NULL handling in WHERE conditions |
| **TYPE_DIFF** | Values are "equal" but have different types (e.g., `1` vs `1.0`, `"1"` vs `1`) | Type coercion difference between original and CTE-based round-trip | Add type normalization in comparison or fix CAST in decompiler |
| **DUPLICATE_DIFF** | Set sizes differ but distinct values match | DISTINCT was added or dropped during transformation | Check DISTINCT handling in decompiler |

### 5.3 Automated Root Cause Analysis

For each mismatch, the system performs automated analysis:

```python
from typing import List, Tuple

def diagnose_mismatch(
    gold_sql: str,
    pipe_sql: str,
    sqlite2_sql: str,
    result_a: List[Tuple],
    result_b: List[Tuple],
    db_path: str
) -> "DiagnosticReport":
    # DiagnosticReport: a small record collecting the fields assigned below
    # (definition elided here)
    report = DiagnosticReport()

    # 0. Guard: an empty result set on one side only is a row-count issue
    if not result_a or not result_b:
        report.mismatch_type = "EXTRA_ROWS" if result_b else "MISSING_ROWS"
        return report

    # 1. Column count check
    if len(result_a[0]) != len(result_b[0]):
        report.mismatch_type = "REORDER" if sorted(result_a[0], key=repr) == sorted(result_b[0], key=repr) else "COL_COUNT"
        report.fix_hint = "Adjust final |> SELECT projection"
        return report

    # 2. Row count check
    if len(result_a) != len(result_b):
        extra = set(result_b) - set(result_a)
        missing = set(result_a) - set(result_b)
        if extra and not missing:
            report.mismatch_type = "EXTRA_ROWS"
            report.fix_hint = "WHERE filter dropped or weakened"
        elif missing and not extra:
            report.mismatch_type = "MISSING_ROWS"
            report.fix_hint = "WHERE filter too aggressive or JOIN type changed"
        else:
            report.mismatch_type = "ROW_SWAP"
            report.fix_hint = "Both extra and missing rows → logic error in transformation"
        report.sample_extra = list(extra)[:5]
        report.sample_missing = list(missing)[:5]
        return report

    # 3. Value-level diff (same row count):
    # sort both and compare position by position.
    # key=repr keeps heterogeneous / NULL-containing rows sortable.
    sorted_a = sorted(result_a, key=repr)
    sorted_b = sorted(result_b, key=repr)
    diff_positions = []
    for i, (ra, rb) in enumerate(zip(sorted_a, sorted_b)):
        if ra != rb:
            diff_positions.append((i, ra, rb))
    if diff_positions:
        # Check whether every differing row involves a NULL
        null_diffs = [(i, a, b) for i, a, b in diff_positions
                      if any(v is None for v in a + b)]
        if len(null_diffs) == len(diff_positions):
            report.mismatch_type = "NULL_DIFF"
        else:
            report.mismatch_type = "WRONG_VALUES"
        report.sample_diffs = diff_positions[:5]

    # 4. AST diff between gold_sql and sqlite2_sql
    report.ast_diff = compute_ast_diff(gold_sql, sqlite2_sql)

    # 5. Identify which transformation rule likely caused the issue
    report.suspected_rule = identify_suspected_rule(report.ast_diff, report.mismatch_type)

    return report
```

### 5.4 AST Diff for Root Cause Identification

Compare the AST of the original `gold_sql` with the AST of `sqlite2_sql` (the round-tripped version) to pinpoint structural differences:

```python
from typing import List

import sqlglot
from sqlglot import exp

def compute_ast_diff(sql_a: str, sql_b: str) -> List[str]:
    """Compute structural differences between two SQL ASTs."""
    ast_a = sqlglot.parse_one(sql_a, read="sqlite")
    ast_b = sqlglot.parse_one(sql_b, read="sqlite")

    diffs = []

    # Compare FROM clauses
    if ast_a.find(exp.From) and ast_b.find(exp.From):
        if ast_a.find(exp.From).sql() != ast_b.find(exp.From).sql():
            diffs.append(f"FROM changed: {ast_a.find(exp.From).sql()} → {ast_b.find(exp.From).sql()}")

    # Compare JOIN count and types
    joins_a = list(ast_a.find_all(exp.Join))
    joins_b = list(ast_b.find_all(exp.Join))
    if len(joins_a) != len(joins_b):
        diffs.append(f"JOIN count changed: {len(joins_a)} → {len(joins_b)}")
    for i, (ja, jb) in enumerate(zip(joins_a, joins_b)):
        if ja.side != jb.side or ja.kind != jb.kind:
            diffs.append(f"JOIN[{i}] type changed: {ja.side} {ja.kind} → {jb.side} {jb.kind}")

    # Compare WHERE conditions
    where_a = ast_a.find(exp.Where)
    where_b = ast_b.find(exp.Where)
    if (where_a is None) != (where_b is None):
        diffs.append(f"WHERE {'added' if where_b else 'dropped'}")
    elif where_a and where_b and where_a.sql() != where_b.sql():
        diffs.append(f"WHERE changed: {where_a.sql()} → {where_b.sql()}")

    # Compare GROUP BY
    group_a = ast_a.find(exp.Group)
    group_b = ast_b.find(exp.Group)
    if (group_a is None) != (group_b is None):
        diffs.append(f"GROUP BY {'added' if group_b else 'dropped'}")

    # Compare aggregate functions
    aggs_a = sorted(n.sql() for n in ast_a.find_all(exp.AggFunc))
    aggs_b = sorted(n.sql() for n in ast_b.find_all(exp.AggFunc))
    if aggs_a != aggs_b:
        diffs.append(f"Aggregates changed: {aggs_a} → {aggs_b}")

    # Compare SELECT expressions
    if isinstance(ast_a, exp.Select) and isinstance(ast_b, exp.Select):
        sels_a = [e.sql() for e in ast_a.expressions]
        sels_b = [e.sql() for e in ast_b.expressions]
        if sels_a != sels_b:
            diffs.append(f"SELECT changed: {sels_a} → {sels_b}")

    return diffs
```

### 5.5 Suspected Rule Identification

Map mismatch types to likely decompiler rules:

```python
from typing import List

RULE_SUSPECTS = {
    "REORDER": ["projection_rule"],
    "EXTRA_ROWS": ["where_rule", "aggregate_rule"],
    "MISSING_ROWS": ["where_rule", "join_rule", "subquery_rule"],
    "WRONG_VALUES": ["aggregate_rule", "window_rule", "projection_rule"],
    "NULL_DIFF": ["join_rule", "where_rule"],
    "TYPE_DIFF": ["projection_rule"],
    "DUPLICATE_DIFF": ["terminal_rule"],  # DISTINCT handling
    "COL_COUNT": ["projection_rule", "aggregate_rule"],
}

def identify_suspected_rule(ast_diffs: List[str], mismatch_type: str) -> List[str]:
    suspects = list(RULE_SUSPECTS.get(mismatch_type, []))

    # Refine based on AST diffs
    for diff in ast_diffs:
        if "JOIN" in diff:
            suspects.insert(0, "join_rule")
        if "WHERE" in diff:
            suspects.insert(0, "where_rule")
        if "GROUP BY" in diff:
            suspects.insert(0, "aggregate_rule")
        if "Aggregates" in diff:
            suspects.insert(0, "aggregate_rule")

    return list(dict.fromkeys(suspects))  # deduplicate, preserve order
```

---

## 6. Error Triage (Execution Failures)

### 6.1 Error Categories

When `sqlite2_sql` fails to execute, classify the error:

| Error Category | Example Error Message | Root Cause | Fix Strategy |
|---|---|---|---|
| **SYNTAX** | `near "\|>": syntax error` | Pipe syntax leaked into SQLite output (SQLGlot transpile failure) | Fix SQLGlot read/write dialect params |
| **NO_SUCH_TABLE** | `no such table: __tmp1` | CTE chain broken during transpilation | Ensure all CTEs are properly defined |
| **NO_SUCH_COLUMN** | `no such column: t1.name` | Column qualification error after pipe transformation | Fix qualify step or alias propagation |
| **NO_SUCH_FUNCTION** | `no such function: ARRAY_AGG` | BigQuery function leaked into SQLite output | Add function mapping or flag as unsupported |
| **AMBIGUOUS_COLUMN** | `ambiguous column name: id` | Self-join or multi-table query lost its aliases | Fix AS insertion and alias propagation |
| **TYPE_ERROR** | `cannot use aggregate in this context` | Aggregate/non-aggregate mixing in the wrong position | Fix expression classification in aggregate rule |
| **TIMEOUT** | Execution exceeded 30 s | Query produces a cartesian product or infinite recursion | Flag as decompiler bug (likely missing JOIN condition). Use the standard 30 s timeout. |
| **PARSE_FAIL** | SQLGlot cannot parse gold SQL | Source SQL uses SQLite-specific syntax SQLGlot doesn't support | Log and skip; count toward coverage metric |

| 524 |
+
### 6.2 Error Handling Flow

```python
import sqlite3
import threading

def execute_with_error_handling(sql: str, db_path: str, timeout: int = 30) -> ExecutionResult:
    conn = None
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("PRAGMA busy_timeout = 30000")
        cursor = conn.cursor()

        # Enforce the wall-clock timeout: a watchdog thread calls
        # conn.interrupt(), which aborts the running query and surfaces as
        # sqlite3.OperationalError("interrupted").
        watchdog = threading.Timer(timeout, conn.interrupt)
        watchdog.start()
        try:
            result = cursor.execute(sql).fetchall()
        finally:
            watchdog.cancel()

        col_names = [desc[0] for desc in cursor.description] if cursor.description else []
        return ExecutionResult(success=True, rows=result, columns=col_names)

    except sqlite3.OperationalError as e:
        error_msg = str(e)
        if "interrupted" in error_msg:
            return ExecutionResult(success=False, error_category="TIMEOUT", error_msg=error_msg)
        elif "no such table" in error_msg:
            return ExecutionResult(success=False, error_category="NO_SUCH_TABLE", error_msg=error_msg)
        elif "no such column" in error_msg:
            return ExecutionResult(success=False, error_category="NO_SUCH_COLUMN", error_msg=error_msg)
        elif "no such function" in error_msg:
            return ExecutionResult(success=False, error_category="NO_SUCH_FUNCTION", error_msg=error_msg)
        elif "ambiguous column" in error_msg:
            return ExecutionResult(success=False, error_category="AMBIGUOUS_COLUMN", error_msg=error_msg)
        elif "near" in error_msg:
            return ExecutionResult(success=False, error_category="SYNTAX", error_msg=error_msg)
        else:
            return ExecutionResult(success=False, error_category="OTHER", error_msg=error_msg)

    except Exception as e:
        return ExecutionResult(success=False, error_category="UNEXPECTED", error_msg=str(e))

    finally:
        if conn is not None:
            conn.close()
```

---

## 7. Known SQLGlot Round-Trip Issues

These are confirmed issues in SQLGlot v29.x that will cause false mismatches. The validation loop must account for them.

### 7.1 Issues That Cause Silent Wrong Answers

| Issue | Description | Impact | Mitigation |
|---|---|---|---|
| `LEAST(a, b)` bug | 2-argument `LEAST(a, b)` drops the second argument, outputs just `a` | Wrong values in result | Patch SQLGlot or post-process: detect LEAST/GREATEST with 2 args and rewrite to `MIN(a, b)` / `MAX(a, b)` |
| `IGNORE NULLS` dropped | SQLGlot silently drops `IGNORE NULLS` from window functions | Wrong NULL handling | Flag queries using IGNORE NULLS as KNOWN_ISSUE |
| `SAFE_CAST` → `CAST` | Safety semantics lost; runtime errors instead of NULL | Execution error or wrong values | Flag SAFE_CAST queries as KNOWN_ISSUE |
| `GROUP_CONCAT ... ORDER BY` dropped | ORDER BY inside GROUP_CONCAT is silently removed | Different string concatenation order | Flag or rewrite to subquery-based ordering |
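The 2-argument LEAST/GREATEST mitigation can be sketched as a regex post-process, in the same heuristic style as the pre-validation filters in Section 7.4. This is a hedged sketch: `rewrite_least_greatest` is a hypothetical helper, and the pattern assumes simple, comma- and paren-free arguments.

```python
import re

# Matches a 2-argument call with simple (comma- and paren-free) arguments.
_TWO_ARG = r"\b{fn}\s*\(\s*([^(),]+?)\s*,\s*([^(),]+?)\s*\)"

def rewrite_least_greatest(sql: str) -> str:
    """Rewrite LEAST(a, b) -> MIN(a, b) and GREATEST(a, b) -> MAX(a, b).

    SQLite's scalar MIN/MAX accept multiple arguments, so the rewrite is
    semantically equivalent for the 2-argument case that triggers the bug.
    """
    sql = re.sub(_TWO_ARG.format(fn="LEAST"), r"MIN(\1, \2)", sql, flags=re.IGNORECASE)
    sql = re.sub(_TWO_ARG.format(fn="GREATEST"), r"MAX(\1, \2)", sql, flags=re.IGNORECASE)
    return sql
```

Calls with three or more arguments (which SQLGlot handles correctly) fall through unchanged, since neither capture group can span a comma.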

### 7.2 Issues That Cause Execution Errors

| Issue | Description | Impact | Mitigation |
|---|---|---|---|
| Pipe operators SET/DROP/RENAME/DISTINCT/CALL/WITH | Crash with TypeError in v29.x | Cannot use these operators in pipe output | Avoid these operators in decompiler output; use SELECT/WHERE alternatives |
| `EXCEPT ALL` / `INTERSECT ALL` | SQLite does not support the `ALL` modifier | Runtime error | Convert to EXCEPT/INTERSECT (drop ALL if the source has no duplicates) |
| `TABLESAMPLE` | Not supported in SQLite | Runtime error | Avoid TABLESAMPLE in pipe output for SQLite targets |

### 7.3 Issues That Cause Cosmetic Differences (Not Semantic)

| Issue | Description | Impact |
|---|---|---|
| `IFNULL` → `COALESCE` | Function renamed during round-trip | None (semantically identical in SQLite) |
| `SUBSTR` → `SUBSTRING` | Function renamed | None (SQLite accepts both) |
| Identifier quoting: `[col]` → `"col"` | Quote style normalized | None |

### 7.4 Pre-Validation Filters

Before comparing results, apply these filters to exclude known-problematic queries:

```python
import re
from typing import List

KNOWN_ISSUE_PATTERNS = [
    # Heuristic regexes: the 2-arg patterns assume simple (comma-free) arguments.
    (r'\bLEAST\s*\([^,]+,[^,]+\)', "LEAST_2ARG_BUG"),
    (r'\bGREATEST\s*\([^,]+,[^,]+\)', "GREATEST_2ARG_BUG"),
    (r'IGNORE\s+NULLS', "IGNORE_NULLS_DROPPED"),
    (r'SAFE_CAST', "SAFE_CAST_LOSSY"),
    (r'GROUP_CONCAT\s*\([^)]*ORDER\s+BY', "GROUP_CONCAT_ORDER_DROPPED"),
    (r'EXCEPT\s+ALL|INTERSECT\s+ALL', "SET_OP_ALL_UNSUPPORTED"),
    (r'TABLESAMPLE', "TABLESAMPLE_UNSUPPORTED"),
]

def check_known_issues(sql: str) -> List[str]:
    """Return list of known issue tags for a query."""
    issues = []
    for pattern, tag in KNOWN_ISSUE_PATTERNS:
        if re.search(pattern, sql, re.IGNORECASE):
            issues.append(tag)
    return issues
```

Queries with known issues are tracked separately. They still go through the pipeline, but mismatches attributed to known issues are not counted against the decompiler's success rate.

---

## 8. Feedback Cycle: From Mismatch to Fix

### 8.1 The Iteration Loop

```
┌──────────────────────────────────────────────────────────────┐
│ ITERATION N                                                  │
│                                                              │
│ 1. Run validation pipeline on all dev queries                │
│                                                              │
│ 2. Collect results:                                          │
│    ├── MATCH: 1,200 queries          ✓                       │
│    ├── MISMATCH: 150 queries         ✗                       │
│    ├── ERROR: 80 queries             ✗                       │
│    ├── DECOMPILE_FAIL: 84 queries    ✗                       │
│    └── KNOWN_ISSUE: 20 queries       ~                       │
│                                                              │
│ 3. Analyze MISMATCHes by type:                               │
│    ├── REORDER: 45       (fix: projection_rule)              │
│    ├── EXTRA_ROWS: 30    (fix: where_rule)                   │
│    ├── MISSING_ROWS: 25  (fix: join_rule)                    │
│    ├── WRONG_VALUES: 35  (fix: aggregate_rule)               │
│    └── NULL_DIFF: 15     (fix: join_rule)                    │
│                                                              │
│ 4. Prioritize fixes by impact (most affected queries first)  │
│                                                              │
│ 5. Fix decompiler rule(s)                                    │
│                                                              │
│ 6. Regression test: re-run on ALL dev queries                │
│    ├── Verify: previous MATCHes still MATCH                  │
│    └── Verify: targeted MISMATCHes now MATCH                 │
│                                                              │
│ 7. If success rate < 95%: goto ITERATION N+1                 │
│    Else: proceed to train set production                     │
└──────────────────────────────────────────────────────────────┘
```

### 8.2 Fix Priority Ranking

Fixes are prioritized by:
1. **Blast radius**: How many queries are affected? Fix the rule that causes the most mismatches first.
2. **Severity**: WRONG_VALUES and MISSING_ROWS are more severe than REORDER.
3. **Fixability**: Some mismatches are caused by SQLGlot limitations (known issues) and cannot be fixed in the decompiler. Deprioritize these.
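A minimal sketch of this ranking, combining blast radius with severity weights. The numeric weights and the `rank_fixes` helper are illustrative assumptions, not values fixed by this doc.

```python
from collections import Counter
from typing import List, Tuple

# Illustrative severity weights (assumption): wrong data outranks ordering noise.
SEVERITY = {"WRONG_VALUES": 3, "MISSING_ROWS": 3, "EXTRA_ROWS": 2, "NULL_DIFF": 2, "REORDER": 1}

def rank_fixes(mismatch_types: List[str]) -> List[Tuple[str, int]]:
    """Sort mismatch types by count * severity, highest-impact first."""
    counts = Counter(mismatch_types)
    scored = {t: n * SEVERITY.get(t, 1) for t, n in counts.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

With the iteration counts from the diagram above, WRONG_VALUES (35 × 3 = 105) would outrank REORDER (45 × 1 = 45) despite affecting fewer queries.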

### 8.3 Regression Guard

Every decompiler change must pass the regression test:

```python
from typing import Dict

# RegressionReport is a small result dataclass defined alongside the harness.
def regression_test(
    previous_results: Dict[int, str],  # question_id -> "MATCH"/"MISMATCH"/"ERROR"
    current_results: Dict[int, str],
) -> RegressionReport:
    regressions = []
    improvements = []

    for qid in previous_results:
        prev = previous_results[qid]
        # A query missing from the current run counts as a regression if it
        # previously matched.
        curr = current_results.get(qid, "MISSING")

        if prev == "MATCH" and curr != "MATCH":
            regressions.append(qid)   # Previously passing, now failing
        elif prev != "MATCH" and curr == "MATCH":
            improvements.append(qid)  # Previously failing, now passing

    return RegressionReport(
        regressions=regressions,
        improvements=improvements,
        net_change=len(improvements) - len(regressions),
        accept=len(regressions) == 0,  # ZERO regressions policy
    )
```

**Policy**: A decompiler change is accepted only if it causes **zero regressions**. If a fix for one mismatch type causes regressions elsewhere, the fix must be refined.
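A self-contained sketch of the gate in action. The `RegressionReport` shape is assumed here (it is not spelled out in this doc), and the logic is restated so the example runs standalone; note that a net-positive change is still rejected if any previously passing query breaks.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RegressionReport:  # assumed shape
    regressions: List[int]
    improvements: List[int]
    net_change: int
    accept: bool

def regression_gate(previous: Dict[int, str], current: Dict[int, str]) -> RegressionReport:
    """Same logic as regression_test above, restated for a standalone demo."""
    regressions = [q for q, v in previous.items() if v == "MATCH" and current.get(q) != "MATCH"]
    improvements = [q for q, v in previous.items() if v != "MATCH" and current.get(q) == "MATCH"]
    return RegressionReport(regressions, improvements,
                            len(improvements) - len(regressions),
                            accept=not regressions)

# Two fixes, one break: net_change = +1, but the change is still rejected.
report = regression_gate({1: "MATCH", 2: "MISMATCH", 3: "ERROR"},
                         {1: "MISMATCH", 2: "MATCH", 3: "MATCH"})
```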

---

## 9. Execution Harness

### 9.1 Main Entry Point

```python
import json
import sqlite3
from pathlib import Path
from dataclasses import dataclass, field, asdict
from typing import List, Dict, Optional
import sqlglot

@dataclass
class ValidationRecord:
    question_id: int
    db_id: str
    difficulty: str
    gold_sql: str
    pipe_sql: Optional[str] = None
    sqlite2_sql: Optional[str] = None
    outcome: str = ""  # MATCH, MISMATCH, ERROR, DECOMPILE_FAIL, KNOWN_ISSUE
    mismatch_type: Optional[str] = None
    error_category: Optional[str] = None
    error_msg: Optional[str] = None
    known_issues: List[str] = field(default_factory=list)
    diagnostics: Optional[dict] = None
    result_a_rows: int = 0
    result_b_rows: int = 0

def normalize_entry(entry: dict, source: str = "spider") -> dict:
    """Normalize a Spider or BIRD JSON entry to a common format."""
    if source == "spider":
        return {
            # Spider entries carry no question_id; id(entry) is an arbitrary
            # but run-stable fallback.
            "question_id": entry.get("question_id", id(entry)),
            "db_id": entry["db_id"],
            "difficulty": entry.get("hardness", "unknown"),  # Spider uses "hardness"
            "gold_sql": entry["query"],  # Spider uses "query"
        }
    else:  # bird
        return {
            "question_id": entry["question_id"],
            "db_id": entry["db_id"],
            "difficulty": entry.get("difficulty", "unknown"),  # BIRD uses "difficulty"
            "gold_sql": entry["SQL"],  # BIRD uses "SQL"
        }

def run_validation(
    dev_json_path: str,
    db_dir: str,
    decompiler,  # The decompiler instance
    output_path: str,
    source: str = "spider",  # "spider" or "bird"
) -> Dict[str, int]:
    """Run the full validation pipeline on the benchmark dev set.

    Args:
        source: "spider" for Spider 1.0 format, "bird" for BIRD/BIRD Mini-Dev format.
    """

    with open(dev_json_path) as f:
        queries = json.load(f)

    records = []
    counters = {"MATCH": 0, "MISMATCH": 0, "ERROR": 0,
                "DECOMPILE_FAIL": 0, "KNOWN_ISSUE": 0, "TIMEOUT": 0}

    for raw_entry in queries:
        entry = normalize_entry(raw_entry, source)
        record = ValidationRecord(
            question_id=entry["question_id"],
            db_id=entry["db_id"],
            difficulty=entry["difficulty"],
            gold_sql=entry["gold_sql"],
        )

        db_path = Path(db_dir) / entry["db_id"] / f"{entry['db_id']}.sqlite"

        # Step 1: Check for known SQLGlot issues
        record.known_issues = check_known_issues(entry["gold_sql"])

        # Step 2: Execute gold SQL → Result Set A
        exec_a = execute_with_error_handling(entry["gold_sql"], str(db_path))
        if not exec_a.success:
            record.outcome = "ERROR"
            record.error_category = "GOLD_SQL_FAIL"
            record.error_msg = exec_a.error_msg
            records.append(record)
            counters["ERROR"] += 1
            continue
        record.result_a_rows = len(exec_a.rows)

        # Step 3: Decompile to pipe SQL
        try:
            result = decompiler.transform(entry["gold_sql"], dialect="sqlite")
            record.pipe_sql = result.pipe_sql
        except Exception as e:
            record.outcome = "DECOMPILE_FAIL"
            record.error_msg = str(e)
            records.append(record)
            counters["DECOMPILE_FAIL"] += 1
            continue

        # Step 4: Transpile pipe SQL back to SQLite
        try:
            sqlite2 = sqlglot.transpile(
                record.pipe_sql, read="bigquery", write="sqlite"
            )[0]
            record.sqlite2_sql = sqlite2
        except Exception as e:
            record.outcome = "ERROR"
            record.error_category = "TRANSPILE_FAIL"
            record.error_msg = str(e)
            records.append(record)
            counters["ERROR"] += 1
            continue

        # Step 5: Execute SQLite₂ → Result Set B
        exec_b = execute_with_error_handling(sqlite2, str(db_path))
        if not exec_b.success:
            record.outcome = "ERROR"
            record.error_category = exec_b.error_category
            record.error_msg = exec_b.error_msg
            records.append(record)
            # Timeouts are tallied separately from other execution errors.
            if exec_b.error_category == "TIMEOUT":
                counters["TIMEOUT"] += 1
            else:
                counters["ERROR"] += 1
            continue
        record.result_b_rows = len(exec_b.rows)

        # Step 6: Compare results
        if compare_with_tolerance(exec_a.rows, exec_b.rows):
            record.outcome = "MATCH"
            counters["MATCH"] += 1
        else:
            if record.known_issues:
                record.outcome = "KNOWN_ISSUE"
                counters["KNOWN_ISSUE"] += 1
            else:
                record.outcome = "MISMATCH"
                record.diagnostics = asdict(diagnose_mismatch(
                    entry["gold_sql"], record.pipe_sql, sqlite2,
                    exec_a.rows, exec_b.rows, str(db_path)
                ))
                record.mismatch_type = record.diagnostics.get("mismatch_type")
                counters["MISMATCH"] += 1

        records.append(record)

    # Write detailed results
    with open(output_path, "w") as f:
        json.dump([asdict(r) for r in records], f, indent=2)

    # Print summary
    total = len(records)
    print(f"\n{'='*60}")
    print(f"Validation Results: {total} queries")
    print(f"{'='*60}")
    for outcome, count in sorted(counters.items()):
        pct = count / total * 100
        print(f"  {outcome:20s}: {count:5d} ({pct:5.1f}%)")
    match_rate = counters["MATCH"] / total * 100
    print(f"{'='*60}")
    print(f"  Match rate: {match_rate:.1f}%")
    effective = (counters["MATCH"] + counters["KNOWN_ISSUE"]) / total * 100
    print(f"  Effective rate (excl. known issues): {effective:.1f}%")

    return counters
```

### 9.2 Difficulty-Stratified Reporting

Report results stratified by difficulty:

```python
from collections import defaultdict
from typing import List

def print_stratified_report(records: List[ValidationRecord], source: str = "spider"):
    """Print difficulty-stratified results."""
    by_difficulty = defaultdict(lambda: {"total": 0, "match": 0})

    for r in records:
        by_difficulty[r.difficulty]["total"] += 1
        if r.outcome == "MATCH":
            by_difficulty[r.difficulty]["match"] += 1

    # Spider uses: easy, medium, hard, extra
    # BIRD uses: simple, moderate, challenging
    if source == "spider":
        levels = ["easy", "medium", "hard", "extra"]
    else:
        levels = ["simple", "moderate", "challenging"]

    header = f"{'':15s}"
    for d in levels:
        header += f" {d:>12s}"
    header += f" {'total':>10s}"
    print(f"\n{header}")
    # Each level column is 13 chars wide (separator + 12), total column is 11.
    print("-" * (15 + 13 * len(levels) + 11))

    totals = {"total": 0, "match": 0}
    row = f"{'count':15s}"
    for d in levels:
        row += f" {by_difficulty[d]['total']:12d}"
        totals["total"] += by_difficulty[d]["total"]
        totals["match"] += by_difficulty[d]["match"]
    row += f" {totals['total']:10d}"
    print(row)

    row = f"{'match rate':15s}"
    for d in levels:
        rate = by_difficulty[d]["match"] / max(by_difficulty[d]["total"], 1) * 100
        row += f" {rate:11.1f}%"
    overall = totals["match"] / max(totals["total"], 1) * 100
    row += f" {overall:9.1f}%"
    print(row)
```

---

## 10. Intermediate Validation: Pipe SQL Syntax Check

Before transpiling pipe SQL back to SQLite, verify the pipe SQL is syntactically valid:

```python
import sqlglot
import sqlglot.errors
from typing import Optional, Tuple

def validate_pipe_syntax(pipe_sql: str) -> Tuple[bool, Optional[str]]:
    """Verify pipe SQL parses without error."""
    try:
        ast = sqlglot.parse_one(pipe_sql, read="bigquery")
        # Verify it round-trips to valid SQL (the generated string is
        # discarded; generation failing is what we are testing for).
        ast.sql(dialect="bigquery")
        return True, None
    except sqlglot.errors.ParseError as e:
        return False, str(e)
```

This catches decompiler bugs that produce syntactically invalid pipe SQL before they reach the execution stage. Syntax errors are cheaper to diagnose than execution mismatches.

### 10.1 Prefix Validation

Exploit the Prefix Property: every prefix of a valid pipe query (up to a `|>` boundary) is itself a valid query. Validate each prefix independently:

```python
from typing import List, Optional, Tuple

def validate_pipe_prefixes(pipe_sql: str) -> List[Tuple[int, bool, Optional[str]]]:
    """Validate each prefix of the pipe query.

    Assumes the decompiler emits one pipe operator per line, so line
    boundaries coincide with `|>` boundaries.
    """
    lines = pipe_sql.strip().split("\n")
    results = []

    prefix = ""
    for i, line in enumerate(lines):
        prefix = line if i == 0 else prefix + "\n" + line

        valid, error = validate_pipe_syntax(prefix)
        results.append((i, valid, error))

        if not valid:
            break  # First invalid prefix identifies the broken operator

    return results
```

If prefix N is valid but prefix N+1 is not, the bug is in the Nth pipe operator, providing precise localization for debugging.
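That mapping can be made mechanical with a small helper (`locate_broken_operator` is hypothetical, not part of the harness above):

```python
from typing import List, Optional, Tuple

def locate_broken_operator(pipe_sql: str,
                           results: List[Tuple[int, bool, Optional[str]]]) -> Optional[str]:
    """Map validate_pipe_prefixes output back to the first line whose
    addition made the prefix invalid; None if every prefix parsed."""
    lines = pipe_sql.strip().split("\n")
    for idx, valid, _error in results:
        if not valid:
            return lines[idx]
    return None
```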

---

## 11. Output Artifacts

### 11.1 Per-Iteration Output

Each validation run produces:

```
validation_output/
├── iteration_001/
│   ├── results.json           # Full per-query results (ValidationRecord array)
│   ├── summary.txt            # Stratified match rates
│   ├── mismatches.json        # Only MISMATCH records with diagnostics
│   ├── errors.json            # Only ERROR records with error details
│   ├── decompile_fails.json   # Only DECOMPILE_FAIL records
│   └── golden_pairs.jsonl     # Successfully validated (question, pipe_sql) pairs
├── iteration_002/
│   ├── ...
│   └── regression_report.txt  # Comparison with iteration_001
└── ...
```

### 11.2 Golden Corpus Output

Queries that achieve MATCH are exported as training data:

```json
{
  "question_id": 7,
  "db_id": "california_schools",
  "question": "What is the phone number of the school that has the highest number of test takers with an SAT score of over 1500?",
  "evidence": "",
  "gold_sql": "SELECT T2.Phone FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode ORDER BY T1.NumGE1500 DESC LIMIT 1",
  "pipe_sql": "FROM satscores AS T1\n|> JOIN schools AS T2 ON T1.cds = T2.CDSCode\n|> ORDER BY T1.NumGE1500 DESC\n|> LIMIT 1\n|> SELECT T2.Phone",
  "difficulty": "simple",
  "validation": "MATCH"
}
```
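The `golden_pairs.jsonl` export can be sketched as follows. `export_golden_pairs` is a hypothetical helper; it takes records as plain dicts, i.e. the `asdict` form written to `results.json` in Section 9.1.

```python
import json

def export_golden_pairs(records, out_path):
    """Write one JSON line per MATCH record; returns the number exported."""
    exported = 0
    with open(out_path, "w") as f:
        for r in records:
            if r.get("outcome") != "MATCH":
                continue
            pair = {k: r[k] for k in ("question_id", "db_id", "gold_sql",
                                      "pipe_sql", "difficulty")}
            pair["validation"] = "MATCH"
            f.write(json.dumps(pair) + "\n")
            exported += 1
    return exported
```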

---

## 12. Success Criteria

| Metric | Target | Measured On |
|---|---|---|
| **Match rate** (strict) | ≥ 95% of decompilable queries | Spider dev (1,034 queries) |
| **Decompile success rate** | ≥ 90% of all queries | Spider dev |
| **Effective rate** (excl. known issues + decompile fails) | ≥ 97% | Successfully decompiled queries |
| **Zero regressions per iteration** | 100% | Previous MATCH queries remain MATCH |
| **Error rate** (execution failures) | ≤ 3% of decompilable queries | Spider dev |
| **Match rate by difficulty** | easy ≥ 99%, medium ≥ 97%, hard ≥ 93%, extra hard ≥ 88% | Spider dev |
| **Tier 2 stress test** | ≥ 90% match rate | BIRD Mini-Dev (500 queries) |

### 12.1 Graduation Criteria

**Tier 1 graduation** (Spider). The validation loop advances to Tier 2 when:
1. Match rate ≥ 95% on the Spider dev set (1,034 queries).
2. All MISMATCH records have been either fixed or classified as KNOWN_ISSUE.
3. The decompiler has been stable (no regressions) for 2 consecutive iterations.
4. The golden corpus contains ≥ 900 validated pipe SQL queries from Spider dev.

**Tier 2 graduation** (BIRD Mini-Dev). The decompiler is production-ready when:
1. Match rate ≥ 90% on BIRD Mini-Dev (500 queries).
2. No new categories of MISMATCH are discovered (all mismatch types already seen in Tier 1).
3. Any BIRD-specific edge cases have been fixed without regressing Spider results.
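The two gates can be encoded directly, with thresholds as stated in the criteria above (function and parameter names are illustrative):

```python
def tier1_graduated(match_rate: float, all_mismatches_triaged: bool,
                    stable_iterations: int, corpus_size: int) -> bool:
    """Tier 1 gate (Spider): 95% match rate, full triage, 2 stable
    iterations, and at least 900 golden pairs."""
    return (match_rate >= 0.95 and all_mismatches_triaged
            and stable_iterations >= 2 and corpus_size >= 900)

def tier2_graduated(bird_match_rate: float, new_mismatch_categories: int,
                    spider_regressions: int) -> bool:
    """Tier 2 gate (BIRD Mini-Dev): 90% match rate, no new mismatch
    categories, no Spider regressions."""
    return (bird_match_rate >= 0.90 and new_mismatch_categories == 0
            and spider_regressions == 0)
```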

At Tier 2 graduation, the decompiler is applied to Spider train (8,659 queries) and then the full BIRD train set (9,428 queries, 33.4 GB download) for production corpus generation.

---

## 13. Implementation Roadmap

### Week 1: Harness Setup
- Download Spider 1.0 (~1 GB: dev.json + database/)
- Implement execution harness (`run_validation`)
- Implement result comparison logic (set-based + tolerance)
- Run baseline: gold SQL → parse → generate SQLite (no pipe) → execute → compare
- This baseline measures SQLGlot's round-trip fidelity before the decompiler is involved

### Week 2: Decompiler Integration
- Connect the decompiler (from the companion design doc) to the validation harness
- Run the first full iteration on Spider dev (1,034 queries)
- Implement diagnostic feedback (mismatch classification, AST diff)
- Triage all errors and mismatches from iteration 1

### Weeks 3–4: Fix Cycle (Tier 1)
- Fix decompiler rules based on mismatch diagnostics
- Run iterations 2–N with regression testing
- Target: 80% → 90% → 95% match rate on Spider dev

### Week 5: Tier 2 Stress Test
- Download BIRD Mini-Dev (500 queries, 11 databases)
- Run the decompiler on BIRD Mini-Dev
- Fix any new edge cases discovered
- Target: ≥ 90% match rate on BIRD Mini-Dev

### Week 6: Production
- Run the decompiler on the Spider train set (8,659 queries) → generate the first golden corpus
- Download the full BIRD train set (9,428 queries, 33.4 GB) → generate the extended corpus
- Feed into the trajectory decomposition pipeline (from the fine-tuning design doc, Section 7)