# Design Document: SQLite₁ → Pipe SQL → SQLite₂ Validation & Feedback Loop

## 1. Problem Statement

The decompiler transforms standard SQL into pipe SQL. We must prove that this transformation preserves semantics — that the pipe SQL, when executed, returns exactly the same results as the original.

This document designs a closed-loop validation system using a **tiered benchmark strategy**: Spider 1.0 as the primary lightweight dataset (~1 GB), with BIRD Mini-Dev as a secondary stress test.

```
SQLite₁ (gold SQL) ──execute──► Result Set A
        │
        ▼
   [Decompiler]
        │
        ▼
Pipe SQL (synthesized)
        │
        ▼
[SQLGlot transpile pipe→sqlite]
        │
        ▼
SQLite₂ (round-tripped) ──execute──► Result Set B

Result Set A == Result Set B ?
```

**Goal**: For every query in the benchmark, `Result Set A == Result Set B`. Any mismatch triggers a diagnostic feedback loop that identifies the root cause and feeds corrections back into the decompiler.

---

## 2. Data Source: Tiered Benchmark Strategy

### 2.1 Why Not BIRD-SQL Directly?

The full BIRD-SQL benchmark is **33.4 GB** — most of that bulk comes from a few massive databases (financial and geographic datasets). This creates unnecessary friction for iterative decompiler development, where fast feedback cycles matter most.

### 2.2 Tier 1 (Primary): Spider 1.0

| Property | Value |
|---|---|
| Total queries | 10,181 (8,659 train + 1,034 dev + 2,147 test) |
| Database engine | SQLite |
| Number of databases | 200 (across 138 domains) |
| Database size | **~1 GB total** |
| Difficulty levels | easy, medium, hard, extra hard |
| Ground truth | Execution-verified SQL with known-correct result sets |
| Download | yale-lily.github.io/spider |

Spider 1.0 is ideal as the primary validation dataset because:
1. **Lightweight** — ~1 GB total, 200 small SQLite databases. Fast to download, fast to iterate.
2. All databases are **SQLite** — no external database setup required.
3. Ground truth SQL is **execution-verified** — we know the gold SQL produces correct results.
4. Queries cover JOINs, aggregations, subqueries, GROUP BY, ORDER BY, HAVING, nested queries, and set operations.
5. **Well-studied** — extensive prior work means known failure modes and edge cases are documented.
6. 200 databases across 138 domains provide broad schema diversity.

### 2.3 Tier 2 (Stress Test): BIRD Mini-Dev

| Property | Value |
|---|---|
| Total queries | 500 (curated high-quality subset) |
| Database engine | SQLite |
| Number of databases | 11 |
| Database size | A few GB (much smaller than full BIRD's 33.4 GB) |
| Difficulty levels | simple, moderate, challenging |
| Download | HuggingFace `birdsql/bird_mini_dev` |

BIRD Mini-Dev is used as a secondary stress test because:
1. Its queries are harder than Spider's — more complex JOINs, domain-specific reasoning, challenging expressions.
2. A set of 500 curated queries is manageable, yet it exercises edge cases Spider may miss.
3. It uses the same 11 dev databases as full BIRD but omits the massive train databases.

### 2.4 Tier 3 (Production Scale-Up): Full BIRD Train

Once the decompiler passes Tier 1 and Tier 2, apply it to the full BIRD train set (9,428 queries, 33.4 GB) for large-scale golden corpus generation. This is a one-time batch job — the 30+ GB download is justified only at this stage.

### 2.5 Data Format

**Spider 1.0:**
```
spider/
├── dev.json               # Question-SQL pairs
│     [
│       {
│         "db_id": "concert_singer",
│         "query": "SELECT count(*) FROM singer",
│         "query_toks": ["SELECT", "count", "(", "*", ")", ...],
│         "question": "How many singers do we have?",
│         "hardness": "easy"
│       },
│       ...
│     ]
├── database/
│   ├── concert_singer/
│   │   ├── concert_singer.sqlite
│   │   └── schema.sql
│   ├── pets_1/
│   │   ├── pets_1.sqlite
│   │   └── schema.sql
│   └── ... (200 databases)
└── dev_gold.sql           # Gold SQL per line
```

**BIRD Mini-Dev:**
```
bird_mini_dev/
├── mini_dev_sqlite.json   # Question-SQL pairs
│     [
│       {
│         "question_id": 7,
│         "db_id": "california_schools",
│         "question": "What is the phone number of ...",
│         "evidence": "",
│         "SQL": "SELECT T2.Phone FROM satscores AS T1 INNER JOIN ...",
│         "difficulty": "simple"
│       },
│       ...
│     ]
├── mini_dev_databases/
│   ├── california_schools/
│   │   └── california_schools.sqlite
│   └── ... (11 databases)
└── mini_dev_gold.sql
```

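For orientation, iterating over Spider-style entries and resolving each entry's database file looks roughly like this (a sketch; the `SPIDER_ROOT` path and helper name are illustrative, not part of the dataset):

```python
import json
from pathlib import Path

SPIDER_ROOT = Path("spider")  # illustrative; point at your local checkout

def iter_spider_dev():
    """Yield (db_path, gold_sql, question) triples from Spider's dev.json."""
    entries = json.loads((SPIDER_ROOT / "dev.json").read_text())
    for e in entries:
        # Each database lives at database/<db_id>/<db_id>.sqlite
        db_path = SPIDER_ROOT / "database" / e["db_id"] / f"{e['db_id']}.sqlite"
        yield db_path, e["query"], e["question"]
```
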
### 2.6 Working Set

| Phase | Dataset | Queries | Purpose |
|---|---|---|---|
| Development & iteration | Spider 1.0 dev | 1,034 | Fast feedback loop (~1 GB, seconds to run) |
| Stress testing | BIRD Mini-Dev | 500 | Harder queries, edge case discovery |
| Production corpus | Spider 1.0 train | 8,659 | Scale-up validated pipe SQL pairs |
| Production corpus | BIRD train | 9,428 | Maximum training data (download 33.4 GB only at this stage) |

---

## 3. Validation Pipeline Architecture

### 3.1 End-to-End Flow

```
For each (question_id, db_id, gold_sql):

  gold_sql (SQLite) ──execute on SQLite DB──► Result Set A (gold result)
        │
        ▼
  Parse with SQLGlot (read=sqlite)
        │
        ▼
  Pre-process (qualify, unnest, simplify)
        │
        ▼
  Pipe Emitter (AST → pipe operators)
        │
        ▼
  pipe_sql (canonical pipe syntax)
        │
        ▼
  SQLGlot transpile pipe→sqlite
        │
        ▼
  sqlite2_sql (round-trip) ──execute on same DB──► Result Set B (pipe result)

                 Compare A == B
                       │
        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
      MATCH        MISMATCH        ERROR
 (log success)  (enter feedback)  (triage)
```

### 3.2 The Three Outcomes

| Outcome | Meaning | Action |
|---|---|---|
| **MATCH** | `set(Result A) == set(Result B)` | Query is validated. Add to golden corpus. |
| **MISMATCH** | Both execute, but result sets differ | Enter diagnostic feedback loop (Section 5). |
| **ERROR** | SQLite₂ fails to execute (syntax error, runtime error) | Enter error triage (Section 6). |

Two additional sub-outcomes exist:

| Sub-outcome | Meaning | Action |
|---|---|---|
| **DECOMPILE_FAIL** | The decompiler could not transform the query | Log with classification. Attempt fallback strategies. |
| **TIMEOUT** | Execution exceeds the 30-second limit | Use BIRD's standard 30s timeout. Score as ERROR. |
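
Since these outcome labels recur throughout the harness (Sections 8–11), it can help to pin them down as a single type; a minimal sketch (the enum itself is ours, not part of the harness spec):

```python
from enum import Enum

class Outcome(str, Enum):
    MATCH = "MATCH"                    # validated; goes to the golden corpus
    MISMATCH = "MISMATCH"              # both executed, results differ
    ERROR = "ERROR"                    # SQLite2 failed to execute (timeouts score here)
    DECOMPILE_FAIL = "DECOMPILE_FAIL"  # decompiler could not transform the query
    KNOWN_ISSUE = "KNOWN_ISSUE"        # mismatch attributed to a known SQLGlot bug
```
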
---

## 4. Result Comparison Logic

### 4.1 Set-Based Comparison

Following standard text-to-SQL evaluation methodology (used by both Spider and BIRD), comparison is **set-based** — row order does not matter:

```python
from typing import List, Tuple

def compare_results(result_a: List[Tuple], result_b: List[Tuple]) -> bool:
    """Compare two result sets using set equality."""
    return set(result_a) == set(result_b)
```

This is intentionally strict: every row must match exactly. No fuzzy matching, no type coercion.

### 4.2 Enhanced Comparison (For Diagnostic Purposes)

When set comparison fails, we compute additional metrics to diagnose the mismatch:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ComparisonResult:
    match: bool                     # set(A) == set(B)
    result_a_rows: int
    result_b_rows: int
    result_a_cols: int
    result_b_cols: int

    # Diagnostic fields (only computed on mismatch)
    row_count_match: bool           # len(A) == len(B)
    col_count_match: bool           # same number of columns
    col_types_match: bool           # column types compatible
    sorted_match: bool              # match after sorting both
    subset_a_in_b: bool             # A ⊆ B
    subset_b_in_a: bool             # B ⊆ A
    symmetric_difference: int       # |A △ B|
    sample_diff_rows: List[Tuple]   # Up to 5 rows in A but not B
    f1_score: float                 # Cell-level F1 (soft metric)

    # Root cause classification
    mismatch_type: str              # See Section 5.2
```

### 4.3 Floating-Point Tolerance

SQLite may return slightly different floating-point results depending on expression evaluation order. We apply a tolerance layer:

```python
import math
from typing import Tuple

def normalize_row(row: Tuple, tolerance: float = 1e-6) -> Tuple:
    """Normalize a row for comparison."""
    digits = max(0, round(-math.log10(tolerance)))  # 1e-6 -> 6 decimal places
    normalized = []
    for val in row:
        if isinstance(val, float):
            normalized.append(round(val, digits))
        elif isinstance(val, str):
            normalized.append(val.strip())
        else:  # ints, None, and blobs pass through unchanged
            normalized.append(val)
    return tuple(normalized)

def compare_with_tolerance(result_a, result_b, tolerance=1e-6):
    set_a = set(normalize_row(r, tolerance) for r in result_a)
    set_b = set(normalize_row(r, tolerance) for r in result_b)
    return set_a == set_b
```

### 4.4 Column Order Handling

Standard evaluation compares result sets as sets of tuples, which means column order matters (a row `(1, 'Alice')` ≠ `('Alice', 1)`). Since pipe SQL may reorder columns (e.g., `|> AGGREGATE` outputs grouping columns first, then aggregates), the decompiler must ensure the final `|> SELECT` matches the original column order.

If column reordering is the sole cause of mismatch, it is classified as a **REORDER** mismatch (non-semantic, fixable by adjusting the final SELECT).
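
A brute-force check for this case can confirm that some column permutation of B reproduces A exactly (a sketch; the helper name and the column cap are ours):

```python
from itertools import permutations
from typing import List, Tuple

def is_reorder_mismatch(result_a: List[Tuple], result_b: List[Tuple],
                        max_cols: int = 6) -> bool:
    """Return True if B equals A after some reordering of columns."""
    if not result_a or not result_b or len(result_a[0]) != len(result_b[0]):
        return False
    n_cols = len(result_a[0])
    if n_cols > max_cols:
        return False  # n! permutations; only worth trying for narrow results
    set_a = set(result_a)
    return any(
        {tuple(row[i] for i in perm) for row in result_b} == set_a
        for perm in permutations(range(n_cols))
    )
```
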
---

## 5. Diagnostic Feedback Loop (Mismatch Handling)

### 5.1 Feedback Loop Flow

```
                      MISMATCH detected
                             │
                             ▼
                     ┌────────────────┐
                     │ Classify       │
                     │ mismatch type  │
                     └───────┬────────┘
                             │
   ┌────────────┬────────────┼──────────────┬──────────────┐
   ▼            ▼            ▼              ▼              ▼
REORDER    EXTRA_ROWS   MISSING_ROWS   WRONG_VALUES    NULL_DIFF
   │            │            │              │              │
   ▼            ▼            ▼              ▼              ▼
Fix final   Diagnose     Diagnose       Diagnose       Diagnose
SELECT      filter/join  filter/join    expression     NULL handling
projection  logic        logic          rewriting      differences
   │            │            │              │              │
   └────────────┴────────────┼──────────────┴──────────────┘
                             ▼
              ┌───────────────────────────────────────────┐
              │         Generate Fix Hypothesis           │
              │ (which transformation rule is at fault?)  │
              └─────────────────────┬─────────────────────┘
                                    ▼
              ┌───────────────────────────────────────────┐
              │      Apply Fix to Decompiler Rule         │
              │      (update rule, add edge case)         │
              └─────────────────────┬─────────────────────┘
                                    ▼
              ┌───────────────────────────────────────────┐
              │  Re-run Validation on Affected Queries    │
              │          (regression test)                │
              └───────────────────────────────────────────┘
```

### 5.2 Mismatch Classification

| Type | Symptom | Likely Root Cause | Fix Strategy |
|---|---|---|---|
| **REORDER** | Same rows, different column order | Final `\|> SELECT` doesn't match original column order | Adjust projection rule to preserve original SELECT order |
| **EXTRA_ROWS** | B has rows not in A | WHERE filter was dropped or weakened during transformation | Check WHERE promotion rule; verify HAVING → WHERE conversion |
| **MISSING_ROWS** | A has rows not in B | WHERE filter is too aggressive, or JOIN type changed | Check JOIN linearization (INNER vs LEFT); verify subquery unnesting |
| **WRONG_VALUES** | Same row count, different values | Expression rewriting error (e.g., aggregate alias mismatch, CASE transformation) | Diff the two SQLs at the expression level; identify which column differs |
| **NULL_DIFF** | Mismatch only in NULL-containing rows | NULL ordering difference, or LEFT JOIN → INNER JOIN conversion | Check JOIN types and NULL handling in WHERE conditions |
| **TYPE_DIFF** | Values are "equal" but different types (e.g., `1` vs `1.0`, `"1"` vs `1`) | Type coercion difference between original and CTE-based round-trip | Add type normalization in comparison or fix CAST in decompiler |
| **DUPLICATE_DIFF** | Set sizes differ but distinct values match | DISTINCT was added or dropped during transformation | Check DISTINCT handling in decompiler |

### 5.3 Automated Root Cause Analysis

For each mismatch, the system performs automated analysis:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DiagnosticReport:
    mismatch_type: Optional[str] = None
    fix_hint: Optional[str] = None
    sample_extra: List[Tuple] = field(default_factory=list)
    sample_missing: List[Tuple] = field(default_factory=list)
    sample_diffs: list = field(default_factory=list)
    ast_diff: List[str] = field(default_factory=list)
    suspected_rule: List[str] = field(default_factory=list)

def diagnose_mismatch(
    gold_sql: str,
    pipe_sql: str,
    sqlite2_sql: str,
    result_a: List[Tuple],
    result_b: List[Tuple],
    db_path: str
) -> DiagnosticReport:
    report = DiagnosticReport()

    # 1. Column count check (only meaningful when both result sets are non-empty)
    if result_a and result_b and len(result_a[0]) != len(result_b[0]):
        # Compare on repr so rows with mixed types remain sortable
        same_values = sorted(map(repr, result_a[0])) == sorted(map(repr, result_b[0]))
        report.mismatch_type = "REORDER" if same_values else "COL_COUNT"
        report.fix_hint = "Adjust final |> SELECT projection"
        return report

    # 2. Row count check
    if len(result_a) != len(result_b):
        extra = set(result_b) - set(result_a)
        missing = set(result_a) - set(result_b)
        if extra and not missing:
            report.mismatch_type = "EXTRA_ROWS"
            report.fix_hint = "WHERE filter dropped or weakened"
        elif missing and not extra:
            report.mismatch_type = "MISSING_ROWS"
            report.fix_hint = "WHERE filter too aggressive or JOIN type changed"
        else:
            report.mismatch_type = "ROW_SWAP"
            report.fix_hint = "Both extra and missing rows — logic error in transformation"
        report.sample_extra = list(extra)[:5]
        report.sample_missing = list(missing)[:5]
        return report

    # 3. Value-level diff (same row count):
    #    sort both (on repr, so mixed-type rows stay sortable) and compare
    #    position by position
    sorted_a = sorted(result_a, key=repr)
    sorted_b = sorted(result_b, key=repr)
    diff_positions = []
    for i, (ra, rb) in enumerate(zip(sorted_a, sorted_b)):
        if ra != rb:
            diff_positions.append((i, ra, rb))
    if diff_positions:
        # Check if it's a NULL issue
        null_diffs = [(i, a, b) for i, a, b in diff_positions
                      if any(v is None for v in a + b)]
        if len(null_diffs) == len(diff_positions):
            report.mismatch_type = "NULL_DIFF"
        else:
            report.mismatch_type = "WRONG_VALUES"
        report.sample_diffs = diff_positions[:5]

    # 4. AST diff between gold_sql and sqlite2_sql
    report.ast_diff = compute_ast_diff(gold_sql, sqlite2_sql)

    # 5. Identify which transformation rule likely caused the issue
    report.suspected_rule = identify_suspected_rule(report.ast_diff, report.mismatch_type)

    return report
```

### 5.4 AST Diff for Root Cause Identification

Compare the AST of the original `gold_sql` with the AST of `sqlite2_sql` (the round-tripped version) to pinpoint structural differences:

```python
import sqlglot
from sqlglot import exp
from typing import List

def compute_ast_diff(sql_a: str, sql_b: str) -> List[str]:
    """Compute structural differences between two SQL ASTs."""
    ast_a = sqlglot.parse_one(sql_a, read="sqlite")
    ast_b = sqlglot.parse_one(sql_b, read="sqlite")

    diffs = []

    # Compare FROM clauses
    from_a, from_b = ast_a.find(exp.From), ast_b.find(exp.From)
    if from_a and from_b and from_a.sql() != from_b.sql():
        diffs.append(f"FROM changed: {from_a.sql()} → {from_b.sql()}")

    # Compare JOIN count and types
    joins_a = list(ast_a.find_all(exp.Join))
    joins_b = list(ast_b.find_all(exp.Join))
    if len(joins_a) != len(joins_b):
        diffs.append(f"JOIN count changed: {len(joins_a)} → {len(joins_b)}")
    for i, (ja, jb) in enumerate(zip(joins_a, joins_b)):
        if ja.side != jb.side or ja.kind != jb.kind:
            diffs.append(f"JOIN[{i}] type changed: {ja.side} {ja.kind} → {jb.side} {jb.kind}")

    # Compare WHERE conditions
    where_a = ast_a.find(exp.Where)
    where_b = ast_b.find(exp.Where)
    if (where_a is None) != (where_b is None):
        diffs.append(f"WHERE {'added' if where_b else 'dropped'}")
    elif where_a and where_b and where_a.sql() != where_b.sql():
        diffs.append(f"WHERE changed: {where_a.sql()} → {where_b.sql()}")

    # Compare GROUP BY
    group_a = ast_a.find(exp.Group)
    group_b = ast_b.find(exp.Group)
    if (group_a is None) != (group_b is None):
        diffs.append(f"GROUP BY {'added' if group_b else 'dropped'}")

    # Compare aggregate functions
    aggs_a = sorted(n.sql() for n in ast_a.find_all(exp.AggFunc))
    aggs_b = sorted(n.sql() for n in ast_b.find_all(exp.AggFunc))
    if aggs_a != aggs_b:
        diffs.append(f"Aggregates changed: {aggs_a} → {aggs_b}")

    # Compare SELECT expressions
    if isinstance(ast_a, exp.Select) and isinstance(ast_b, exp.Select):
        sels_a = [e.sql() for e in ast_a.expressions]
        sels_b = [e.sql() for e in ast_b.expressions]
        if sels_a != sels_b:
            diffs.append(f"SELECT changed: {sels_a} → {sels_b}")

    return diffs
```

### 5.5 Suspected Rule Identification

Map mismatch types to likely decompiler rules:

```python
from typing import List

RULE_SUSPECTS = {
    "REORDER": ["projection_rule"],
    "EXTRA_ROWS": ["where_rule", "aggregate_rule"],
    "MISSING_ROWS": ["where_rule", "join_rule", "subquery_rule"],
    "WRONG_VALUES": ["aggregate_rule", "window_rule", "projection_rule"],
    "NULL_DIFF": ["join_rule", "where_rule"],
    "TYPE_DIFF": ["projection_rule"],
    "DUPLICATE_DIFF": ["terminal_rule"],  # DISTINCT handling
    "COL_COUNT": ["projection_rule", "aggregate_rule"],
}

def identify_suspected_rule(ast_diffs: List[str], mismatch_type: str) -> List[str]:
    suspects = list(RULE_SUSPECTS.get(mismatch_type, []))

    # Refine based on AST diffs
    for diff in ast_diffs:
        if "JOIN" in diff:
            suspects.insert(0, "join_rule")
        if "WHERE" in diff:
            suspects.insert(0, "where_rule")
        if "GROUP BY" in diff or "Aggregates" in diff:
            suspects.insert(0, "aggregate_rule")

    return list(dict.fromkeys(suspects))  # deduplicate, preserve order
```

---

## 6. Error Triage (Execution Failures)

### 6.1 Error Categories

When `sqlite2_sql` fails to execute, classify the error:

| Error Category | Example Error Message | Root Cause | Fix Strategy |
|---|---|---|---|
| **SYNTAX** | `near "|>": syntax error` | Pipe syntax leaked into SQLite output (SQLGlot transpile failure) | Fix SQLGlot read/write dialect params |
| **NO_SUCH_TABLE** | `no such table: __tmp1` | CTE chain broken during transpilation | Ensure all CTEs are properly defined |
| **NO_SUCH_COLUMN** | `no such column: t1.name` | Column qualification error after pipe transformation | Fix qualify step or alias propagation |
| **NO_SUCH_FUNCTION** | `no such function: ARRAY_AGG` | BigQuery function leaked into SQLite output | Add function mapping or flag as unsupported |
| **AMBIGUOUS_COLUMN** | `ambiguous column name: id` | Self-join or multi-table query lost its aliases | Fix AS insertion and alias propagation |
| **TYPE_ERROR** | `cannot use aggregate in this context` | Aggregate/non-aggregate mixing in wrong position | Fix expression classification in aggregate rule |
| **TIMEOUT** | Execution exceeded 30s | Query produces a cartesian product or infinite recursion | Flag as decompiler bug (likely missing JOIN condition). Use standard 30s timeout. |
| **PARSE_FAIL** | SQLGlot cannot parse gold SQL | Source SQL uses SQLite-specific syntax SQLGlot doesn't support | Log and skip; count toward coverage metric |

### 6.2 Error Handling Flow

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ExecutionResult:
    success: bool
    rows: List[Tuple] = field(default_factory=list)
    columns: List[str] = field(default_factory=list)
    error_category: Optional[str] = None
    error_msg: Optional[str] = None

def execute_with_error_handling(sql: str, db_path: str, timeout: int = 30) -> ExecutionResult:
    conn = None
    try:
        # Note: sqlite3's connect timeout only covers lock contention; a hard
        # statement timeout requires conn.interrupt() from a watchdog thread.
        conn = sqlite3.connect(db_path, timeout=timeout)
        conn.execute("PRAGMA busy_timeout = 30000")
        cursor = conn.cursor()

        result = cursor.execute(sql).fetchall()
        col_names = [desc[0] for desc in cursor.description] if cursor.description else []
        return ExecutionResult(success=True, rows=result, columns=col_names)

    except sqlite3.OperationalError as e:
        error_msg = str(e)
        if "no such table" in error_msg:
            return ExecutionResult(success=False, error_category="NO_SUCH_TABLE", error_msg=error_msg)
        elif "no such column" in error_msg:
            return ExecutionResult(success=False, error_category="NO_SUCH_COLUMN", error_msg=error_msg)
        elif "no such function" in error_msg:
            return ExecutionResult(success=False, error_category="NO_SUCH_FUNCTION", error_msg=error_msg)
        elif "ambiguous column" in error_msg:
            return ExecutionResult(success=False, error_category="AMBIGUOUS_COLUMN", error_msg=error_msg)
        elif "near" in error_msg:
            return ExecutionResult(success=False, error_category="SYNTAX", error_msg=error_msg)
        else:
            return ExecutionResult(success=False, error_category="OTHER", error_msg=error_msg)

    except Exception as e:
        return ExecutionResult(success=False, error_category="UNEXPECTED", error_msg=str(e))

    finally:
        if conn is not None:
            conn.close()
```

---

## 7. Known SQLGlot Round-Trip Issues

These are confirmed issues in SQLGlot v29.x that will cause false mismatches. The validation loop must account for them.

### 7.1 Issues That Cause Silent Wrong Answers

| Issue | Description | Impact | Mitigation |
|---|---|---|---|
| `LEAST(a, b)` bug | 2-argument `LEAST(a, b)` drops the second argument, outputs just `a` | Wrong values in result | Patch SQLGlot or post-process: detect LEAST/GREATEST with 2 args and rewrite to `MIN(a, b)` / `MAX(a, b)` |
| `IGNORE NULLS` dropped | SQLGlot silently drops `IGNORE NULLS` from window functions | Wrong NULL handling | Flag queries using IGNORE NULLS as KNOWN_ISSUE |
| `SAFE_CAST` → `CAST` | Safety semantics lost; runtime errors instead of NULL | Execution error or wrong values | Flag SAFE_CAST queries as KNOWN_ISSUE |
| `GROUP_CONCAT ... ORDER BY` dropped | ORDER BY inside GROUP_CONCAT is silently removed | Different string concatenation order | Flag or rewrite to subquery-based ordering |

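For the first row of this table, the mitigation can be a small AST post-pass that rewrites two-argument `LEAST`/`GREATEST` into SQLite's scalar `MIN`/`MAX` before execution (a sketch against SQLGlot's public API; the function name is ours):

```python
import sqlglot
from sqlglot import exp

def rewrite_two_arg_least_greatest(sql: str, dialect: str = "sqlite") -> str:
    """Rewrite LEAST(a, b) / GREATEST(a, b) to scalar MIN(a, b) / MAX(a, b)."""
    def fix(node: exp.Expression) -> exp.Expression:
        if isinstance(node, (exp.Least, exp.Greatest)):
            args = [node.this, *node.expressions]
            if len(args) == 2:  # only the buggy two-argument form
                name = "MIN" if isinstance(node, exp.Least) else "MAX"
                return exp.func(name, *args)
        return node

    tree = sqlglot.parse_one(sql, read=dialect)
    return tree.transform(fix).sql(dialect=dialect)
```

SQLite evaluates multi-argument `MIN`/`MAX` as scalar functions, so the rewrite is semantically equivalent for the two-argument case.
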
### 7.2 Issues That Cause Execution Errors

| Issue | Description | Impact | Mitigation |
|---|---|---|---|
| Pipe operators SET/DROP/RENAME/DISTINCT/CALL/WITH | Crash with TypeError in v29.x | Cannot use these operators in pipe output | Avoid these operators in decompiler output; use SELECT/WHERE alternatives |
| `EXCEPT ALL` / `INTERSECT ALL` | SQLite does not support the `ALL` modifier | Runtime error | Convert to EXCEPT/INTERSECT (drop ALL if the source has no duplicates) |
| `TABLESAMPLE` | Not supported in SQLite | Runtime error | Avoid TABLESAMPLE in pipe output for SQLite targets |

### 7.3 Issues That Cause Cosmetic Differences (Not Semantic)

| Issue | Description | Impact |
|---|---|---|
| `IFNULL` → `COALESCE` | Function renamed during round-trip | None (semantically identical in SQLite) |
| `SUBSTR` → `SUBSTRING` | Function renamed | None (SQLite accepts both) |
| Identifier quoting: `[col]` → `"col"` | Quote style normalized | None |

### 7.4 Pre-Validation Filters

Before comparing results, apply these filters to exclude known-problematic queries:

```python
import re
from typing import List

KNOWN_ISSUE_PATTERNS = [
    (r'\bLEAST\s*\([^,]+,[^,]+\)', "LEAST_2ARG_BUG"),
    (r'\bGREATEST\s*\([^,]+,[^,]+\)', "GREATEST_2ARG_BUG"),
    (r'IGNORE\s+NULLS', "IGNORE_NULLS_DROPPED"),
    (r'SAFE_CAST', "SAFE_CAST_LOSSY"),
    (r'GROUP_CONCAT\s*\([^)]*ORDER\s+BY', "GROUP_CONCAT_ORDER_DROPPED"),
    (r'EXCEPT\s+ALL|INTERSECT\s+ALL', "SET_OP_ALL_UNSUPPORTED"),
    (r'TABLESAMPLE', "TABLESAMPLE_UNSUPPORTED"),
]

def check_known_issues(sql: str) -> List[str]:
    """Return the list of known-issue tags for a query."""
    issues = []
    for pattern, tag in KNOWN_ISSUE_PATTERNS:
        if re.search(pattern, sql, re.IGNORECASE):
            issues.append(tag)
    return issues
```

Queries with known issues are tracked separately. They still go through the pipeline, but mismatches attributed to known issues are not counted against the decompiler's success rate.

---

## 8. Feedback Cycle: From Mismatch to Fix

### 8.1 The Iteration Loop

```
ITERATION N

1. Run validation pipeline on all dev queries

2. Collect results:
   ├── MATCH:          1,200 queries ✓
   ├── MISMATCH:         150 queries ✗
   ├── ERROR:             80 queries ✗
   ├── DECOMPILE_FAIL:    84 queries ⊘
   └── KNOWN_ISSUE:       20 queries ~

3. Analyze MISMATCHes by type:
   ├── REORDER:      45  (fix: projection_rule)
   ├── EXTRA_ROWS:   30  (fix: where_rule)
   ├── MISSING_ROWS: 25  (fix: join_rule)
   ├── WRONG_VALUES: 35  (fix: aggregate_rule)
   └── NULL_DIFF:    15  (fix: join_rule)

4. Prioritize fixes by impact (most affected queries first)

5. Fix decompiler rule(s)

6. Regression test: re-run on ALL dev queries
   ├── Verify: previous MATCHes still MATCH
   └── Verify: targeted MISMATCHes now MATCH

7. If success rate < 95%: goto ITERATION N+1
   Else: proceed to train set production
```

### 8.2 Fix Priority Ranking

Fixes are prioritized by three criteria (a scoring sketch follows the list):
1. **Blast radius**: How many queries are affected? Fix the rule that causes the most mismatches first.
2. **Severity**: WRONG_VALUES and MISSING_ROWS are more severe than REORDER.
3. **Fixability**: Some mismatches are caused by SQLGlot limitations (known issues) and cannot be fixed in the decompiler. Deprioritize these.

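A minimal scoring pass over the mismatch records (e.g., as loaded from `mismatches.json`, Section 11.1) can produce this ranking automatically; the severity weights below are illustrative, not part of the spec:

```python
from collections import Counter
from typing import List, Tuple

# Illustrative weights: wrong/missing data outranks cosmetic reordering.
SEVERITY = {"WRONG_VALUES": 3, "MISSING_ROWS": 3, "EXTRA_ROWS": 2,
            "NULL_DIFF": 2, "DUPLICATE_DIFF": 2, "TYPE_DIFF": 1, "REORDER": 1}

def rank_rules_to_fix(mismatch_records: List[dict]) -> List[Tuple[str, int]]:
    """Rank suspected rules by severity-weighted count of affected queries."""
    scores: Counter = Counter()
    for rec in mismatch_records:
        weight = SEVERITY.get(rec.get("mismatch_type"), 1)
        for rule in (rec.get("diagnostics") or {}).get("suspected_rule", []):
            scores[rule] += weight
    return scores.most_common()  # highest-impact rule first
```
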
### 8.3 Regression Guard

Every decompiler change must pass the regression test:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RegressionReport:
    regressions: List[int]
    improvements: List[int]
    net_change: int
    accept: bool

def regression_test(
    previous_results: Dict[int, str],   # question_id -> "MATCH"/"MISMATCH"/"ERROR"
    current_results: Dict[int, str],
) -> RegressionReport:
    regressions = []
    improvements = []

    for qid in previous_results:
        prev = previous_results[qid]
        curr = current_results.get(qid, "MISSING")  # a vanished query counts as a regression

        if prev == "MATCH" and curr != "MATCH":
            regressions.append(qid)    # Previously passing, now failing
        elif prev != "MATCH" and curr == "MATCH":
            improvements.append(qid)   # Previously failing, now passing

    return RegressionReport(
        regressions=regressions,
        improvements=improvements,
        net_change=len(improvements) - len(regressions),
        accept=len(regressions) == 0,  # ZERO regressions policy
    )
```

**Policy**: A decompiler change is accepted only if it causes **zero regressions**. If a fix for one mismatch type causes regressions elsewhere, the fix must be refined.

---

## 9. Execution Harness

### 9.1 Main Entry Point

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Dict, List, Optional

import sqlglot

@dataclass
class ValidationRecord:
    question_id: int
    db_id: str
    difficulty: str
    gold_sql: str
    pipe_sql: Optional[str] = None
    sqlite2_sql: Optional[str] = None
    outcome: str = ""  # MATCH, MISMATCH, ERROR, DECOMPILE_FAIL, KNOWN_ISSUE
    mismatch_type: Optional[str] = None
    error_category: Optional[str] = None
    error_msg: Optional[str] = None
    known_issues: List[str] = field(default_factory=list)
    diagnostics: Optional[dict] = None
    result_a_rows: int = 0
    result_b_rows: int = 0

def normalize_entry(entry: dict, source: str = "spider", fallback_id: int = -1) -> dict:
    """Normalize a Spider or BIRD JSON entry to a common format.

    Spider dev entries carry no question_id, so callers pass a stable
    positional index as fallback_id; this keeps regression comparisons
    (Section 8.3) keyed consistently across runs.
    """
    if source == "spider":
        return {
            "question_id": entry.get("question_id", fallback_id),
            "db_id": entry["db_id"],
            "difficulty": entry.get("hardness", "unknown"),  # Spider uses "hardness"
            "gold_sql": entry["query"],                      # Spider uses "query"
        }
    else:  # bird
        return {
            "question_id": entry["question_id"],
            "db_id": entry["db_id"],
            "difficulty": entry.get("difficulty", "unknown"),  # BIRD uses "difficulty"
            "gold_sql": entry["SQL"],                          # BIRD uses "SQL"
        }

def run_validation(
    dev_json_path: str,
    db_dir: str,
    decompiler,              # The decompiler instance
    output_path: str,
    source: str = "spider",  # "spider" or "bird"
) -> Dict[str, int]:
    """Run the full validation pipeline on the benchmark dev set.

    Args:
        source: "spider" for Spider 1.0 format, "bird" for BIRD/BIRD Mini-Dev format.
    """

    with open(dev_json_path) as f:
        queries = json.load(f)

    records = []
    # Timeouts are scored under ERROR (Section 3.2)
    counters = {"MATCH": 0, "MISMATCH": 0, "ERROR": 0,
                "DECOMPILE_FAIL": 0, "KNOWN_ISSUE": 0}

    for idx, raw_entry in enumerate(queries):
        entry = normalize_entry(raw_entry, source, fallback_id=idx)
        record = ValidationRecord(
            question_id=entry["question_id"],
            db_id=entry["db_id"],
            difficulty=entry["difficulty"],
            gold_sql=entry["gold_sql"],
        )

        db_path = Path(db_dir) / entry["db_id"] / f"{entry['db_id']}.sqlite"

        # Step 1: Check for known SQLGlot issues
        record.known_issues = check_known_issues(entry["gold_sql"])

        # Step 2: Execute gold SQL → Result Set A
        exec_a = execute_with_error_handling(entry["gold_sql"], str(db_path))
        if not exec_a.success:
            record.outcome = "ERROR"
            record.error_category = "GOLD_SQL_FAIL"
            record.error_msg = exec_a.error_msg
            records.append(record)
            counters["ERROR"] += 1
            continue
        record.result_a_rows = len(exec_a.rows)

        # Step 3: Decompile to pipe SQL
        try:
            result = decompiler.transform(entry["gold_sql"], dialect="sqlite")
            record.pipe_sql = result.pipe_sql
        except Exception as e:
            record.outcome = "DECOMPILE_FAIL"
            record.error_msg = str(e)
            records.append(record)
            counters["DECOMPILE_FAIL"] += 1
            continue

        # Step 4: Transpile pipe SQL back to SQLite
        try:
            sqlite2 = sqlglot.transpile(
                record.pipe_sql, read="bigquery", write="sqlite"
            )[0]
            record.sqlite2_sql = sqlite2
        except Exception as e:
            record.outcome = "ERROR"
            record.error_category = "TRANSPILE_FAIL"
            record.error_msg = str(e)
            records.append(record)
            counters["ERROR"] += 1
            continue

        # Step 5: Execute SQLite₂ → Result Set B
        exec_b = execute_with_error_handling(sqlite2, str(db_path))
        if not exec_b.success:
            record.outcome = "ERROR"
            record.error_category = exec_b.error_category
            record.error_msg = exec_b.error_msg
            records.append(record)
            counters["ERROR"] += 1
            continue
        record.result_b_rows = len(exec_b.rows)

        # Step 6: Compare results
        if compare_with_tolerance(exec_a.rows, exec_b.rows):
            record.outcome = "MATCH"
            counters["MATCH"] += 1
        else:
            if record.known_issues:
                record.outcome = "KNOWN_ISSUE"
                counters["KNOWN_ISSUE"] += 1
            else:
                record.outcome = "MISMATCH"
                record.diagnostics = asdict(diagnose_mismatch(
                    entry["gold_sql"], record.pipe_sql, sqlite2,
                    exec_a.rows, exec_b.rows, str(db_path)
                ))
                record.mismatch_type = record.diagnostics.get("mismatch_type")
                counters["MISMATCH"] += 1

        records.append(record)

    # Write detailed results
    with open(output_path, "w") as f:
        json.dump([asdict(r) for r in records], f, indent=2)

    # Print summary
    total = len(records)
    print(f"\n{'='*60}")
    print(f"Validation Results: {total} queries")
    print(f"{'='*60}")
    for outcome, count in sorted(counters.items()):
        pct = count / total * 100
        print(f"  {outcome:20s}: {count:5d} ({pct:5.1f}%)")
    match_rate = counters["MATCH"] / total * 100
    print(f"{'='*60}")
    print(f"  Match rate: {match_rate:.1f}%")
    effective = (counters["MATCH"] + counters["KNOWN_ISSUE"]) / total * 100
    print(f"  Effective rate (excl. known issues): {effective:.1f}%")

    return counters
```

### 9.2 Difficulty-Stratified Reporting

Report results stratified by difficulty:

```python
from collections import defaultdict
from typing import List

def print_stratified_report(records: List[ValidationRecord], source: str = "spider"):
    """Print difficulty-stratified results."""
    by_difficulty = defaultdict(lambda: {"total": 0, "match": 0})

    for r in records:
        by_difficulty[r.difficulty]["total"] += 1
        if r.outcome == "MATCH":
            by_difficulty[r.difficulty]["match"] += 1

    # Spider uses: easy, medium, hard, extra
    # BIRD uses: simple, moderate, challenging
    if source == "spider":
        levels = ["easy", "medium", "hard", "extra"]
    else:
        levels = ["simple", "moderate", "challenging"]

    header = f"{'':15s}"
    for d in levels:
        header += f" {d:>12s}"
    header += f" {'total':>10s}"
    print(f"\n{header}")
    print("-" * (15 + 13 * len(levels) + 11))

    totals = {"total": 0, "match": 0}
    row = f"{'count':15s}"
    for d in levels:
        row += f" {by_difficulty[d]['total']:12d}"
        totals["total"] += by_difficulty[d]["total"]
        totals["match"] += by_difficulty[d]["match"]
    row += f" {totals['total']:10d}"
    print(row)

    row = f"{'match rate':15s}"
    for d in levels:
        rate = by_difficulty[d]["match"] / max(by_difficulty[d]["total"], 1) * 100
        row += f" {rate:11.1f}%"
    overall = totals["match"] / max(totals["total"], 1) * 100
    row += f" {overall:9.1f}%"
    print(row)
```

---

## 10. Intermediate Validation: Pipe SQL Syntax Check

Before transpiling pipe SQL back to SQLite, verify the pipe SQL is syntactically valid:

```python
from typing import Optional, Tuple

import sqlglot

def validate_pipe_syntax(pipe_sql: str) -> Tuple[bool, Optional[str]]:
    """Verify pipe SQL parses without error."""
    try:
        ast = sqlglot.parse_one(pipe_sql, read="bigquery")
        ast.sql(dialect="bigquery")  # verify it also regenerates valid SQL
        return True, None
    except sqlglot.errors.ParseError as e:
        return False, str(e)
```

This catches decompiler bugs that produce syntactically invalid pipe SQL before they reach the execution stage. Syntax errors are cheaper to diagnose than execution mismatches.

### 10.1 Prefix Validation

Exploit the Prefix Property: every prefix of a valid pipe query (up to a `|>` boundary) is itself a valid query. Validate each prefix independently:

```python
from typing import List, Optional, Tuple

def validate_pipe_prefixes(pipe_sql: str) -> List[Tuple[int, bool, Optional[str]]]:
    """Validate each prefix of the pipe query."""
    lines = pipe_sql.strip().split("\n")
    results = []

    prefix = ""
    for i, line in enumerate(lines):
        prefix = line if i == 0 else prefix + "\n" + line

        valid, error = validate_pipe_syntax(prefix)
        results.append((i, valid, error))

        if not valid:
            break  # The first invalid prefix identifies the broken operator

    return results
```

If prefix N is valid but prefix N+1 is not, the bug is in the Nth pipe operator — providing precise localization for debugging.

---

## 11. Output Artifacts

### 11.1 Per-Iteration Output

Each validation run produces:

```
validation_output/
├── iteration_001/
│   ├── results.json          # Full per-query results (ValidationRecord array)
│   ├── summary.txt           # Stratified match rates
│   ├── mismatches.json       # Only MISMATCH records with diagnostics
│   ├── errors.json           # Only ERROR records with error details
│   ├── decompile_fails.json  # Only DECOMPILE_FAIL records
│   └── golden_pairs.jsonl    # Successfully validated (question, pipe_sql) pairs
├── iteration_002/
│   ├── ...
│   └── regression_report.txt # Comparison with iteration_001
└── ...
```

### 11.2 Golden Corpus Output

Queries that achieve MATCH are exported as training data:

```json
{
  "question_id": 7,
  "db_id": "california_schools",
  "question": "What is the phone number of the school that has the highest number of test takers with an SAT score of over 1500?",
  "evidence": "",
  "gold_sql": "SELECT T2.Phone FROM satscores AS T1 INNER JOIN schools AS T2 ON T1.cds = T2.CDSCode ORDER BY T1.NumGE1500 DESC LIMIT 1",
  "pipe_sql": "FROM satscores AS T1\n|> JOIN schools AS T2 ON T1.cds = T2.CDSCode\n|> ORDER BY T1.NumGE1500 DESC\n|> LIMIT 1\n|> SELECT T2.Phone",
  "difficulty": "simple",
  "validation": "MATCH"
}
```

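Producing `golden_pairs.jsonl` from a finished run is then a small filter over the `ValidationRecord` list (a sketch; it assumes the caller kept a `questions_by_id` map built from the benchmark JSON):

```python
import json

def export_golden_pairs(records, questions_by_id, out_path="golden_pairs.jsonl"):
    """Write one JSON object per validated (question, pipe_sql) pair."""
    with open(out_path, "w") as f:
        for r in records:
            if r.outcome != "MATCH":
                continue
            f.write(json.dumps({
                "question_id": r.question_id,
                "db_id": r.db_id,
                "question": questions_by_id.get(r.question_id, ""),
                "gold_sql": r.gold_sql,
                "pipe_sql": r.pipe_sql,
                "difficulty": r.difficulty,
                "validation": r.outcome,
            }) + "\n")
```
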
---

## 12. Success Criteria

| Metric | Target | Measured On |
|---|---|---|
| **Match rate** (strict) | ≥ 95% of decompilable queries | Spider dev (1,034 queries) |
| **Decompile success rate** | ≥ 90% of all queries | Spider dev |
| **Effective rate** (excl. known issues + decompile fails) | ≥ 97% | Successfully decompiled queries |
| **Zero regressions per iteration** | 100% | Previous MATCH queries remain MATCH |
| **Error rate** (execution failures) | ≤ 3% of decompilable queries | Spider dev |
| **Match rate by difficulty** | easy ≥ 99%, medium ≥ 97%, hard ≥ 93%, extra hard ≥ 88% | Spider dev |
| **Tier 2 stress test** | ≥ 90% match rate | BIRD Mini-Dev (500 queries) |

### 12.1 Graduation Criteria

**Tier 1 graduation** (Spider) — the validation loop advances to Tier 2 when all of the following hold (encoded as a check after this list):
1. Match rate ≥ 95% on the Spider dev set (1,034 queries).
2. All MISMATCH records have been either fixed or classified as KNOWN_ISSUE.
3. The decompiler has been stable (no regressions) for 2 consecutive iterations.
4. The golden corpus contains ≥ 900 validated pipe SQL queries from Spider dev.
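
Encoded directly, the Tier 1 gate is a conjunction of the four criteria (thresholds as above; the function name is ours):

```python
def tier1_graduated(match_rate: float, unresolved_mismatches: int,
                    stable_iterations: int, golden_count: int) -> bool:
    """Tier 1 gate: every Spider-dev criterion must hold simultaneously."""
    return (match_rate >= 0.95
            and unresolved_mismatches == 0
            and stable_iterations >= 2
            and golden_count >= 900)
```
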
**Tier 2 graduation** (BIRD Mini-Dev) — production readiness when:
1. Match rate ≥ 90% on BIRD Mini-Dev (500 queries).
2. No new categories of MISMATCH are discovered (all mismatch types were already seen in Tier 1).
3. Any BIRD-specific edge cases have been fixed without regressing Spider results.

At Tier 2 graduation, the decompiler is applied to Spider train (8,659 queries) and then the full BIRD train set (9,428 queries, 33.4 GB download) for production corpus generation.

---

## 13. Implementation Roadmap

### Week 1: Harness Setup
- Download Spider 1.0 (~1 GB: dev.json + database/)
- Implement the execution harness (`run_validation`)
- Implement the result comparison logic (set-based + tolerance)
- Run the baseline: gold SQL → parse → generate SQLite (no pipe) → execute → compare (sketched below)
- This baseline measures SQLGlot's round-trip fidelity before the decompiler is involved

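The baseline needs no decompiler at all; under the same harness, it is a one-line SQLGlot round-trip (a sketch):

```python
import sqlglot

def roundtrip_baseline(gold_sql: str) -> str:
    """Regenerate SQLite from gold SQL with no pipe transformation.

    Executing this output and comparing against Result Set A isolates
    SQLGlot's round-trip fidelity from decompiler correctness.
    """
    return sqlglot.transpile(gold_sql, read="sqlite", write="sqlite")[0]
```
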
### Week 2: Decompiler Integration
- Connect the decompiler (from the companion design doc) to the validation harness
- Run the first full iteration on Spider dev (1,034 queries)
- Implement diagnostic feedback (mismatch classification, AST diff)
- Triage all errors and mismatches from iteration 1

### Week 3–4: Fix Cycle (Tier 1)
- Fix decompiler rules based on mismatch diagnostics
- Run iterations 2–N with regression testing
- Target: 80% → 90% → 95% match rate on Spider dev

### Week 5: Tier 2 Stress Test
- Download BIRD Mini-Dev (500 queries, 11 databases)
- Run the decompiler on BIRD Mini-Dev
- Fix any new edge cases discovered
- Target: ≥ 90% match rate on BIRD Mini-Dev

### Week 6: Production
- Run the decompiler on the Spider train set (8,659 queries) — generate the first golden corpus
- Download the full BIRD train set (9,428 queries, 33.4 GB) — generate the extended corpus
- Feed into the trajectory decomposition pipeline (from the fine-tuning design doc, Section 7)