Commit e145226 (verified) by Zheyuan Zhao · Parent: 1e0227f
Add design doc: docs/pipe-sql-fine-tuning-design-doc.md
# Design Document: Large-Scale Incremental Pipe SQL Synthesis & Specialized Fine-Tuning for Small Language Models (1.5B–7B)

## 1. Executive Summary

Pipe SQL syntax, in which queries are written as linear chains of `|>` operators rather than nested, inside-out clause blocks, is now production-ready in BigQuery (GA February 2025), Apache Spark 4.0+, and Databricks Runtime 16.2+. This linear structure maps naturally onto autoregressive token generation, making it a compelling target for specialized Text-to-SQL models.

This document presents a complete system for:

1. **Synthesizing a large corpus of semantically validated pipe SQL queries** from existing standard SQL benchmarks via a custom AST-based decompiler.
2. **Fine-tuning small language models (1.5B–7B)** using an incremental, pipe-by-pipe training strategy that teaches models to build queries step by step rather than emit them in one shot.

The core thesis: by converting the Text-to-SQL problem from "generate a complex nested structure" into "append the next correct pipe operator," we can train models at a fraction of frontier cost while achieving competitive execution accuracy.

---

## 2. Why Pipe SQL?

### 2.1 The Problem with Standard SQL Generation

Standard SQL requires the model to produce clauses in an order that is the inverse of logical execution:

```sql
SELECT department, AVG(salary) AS avg_sal   -- Step 4: project
FROM employees                              -- Step 1: scan
WHERE hire_date > '2020-01-01'              -- Step 2: filter
GROUP BY department                         -- Step 3: aggregate
HAVING AVG(salary) > 80000                  -- Step 5: post-agg filter
ORDER BY avg_sal DESC                       -- Step 6: sort
```

An LLM generating this left-to-right must write `SELECT department, AVG(salary)` before it has committed to the `WHERE` filter or `GROUP BY` clause. This inversion causes three well-documented failure modes:

- **Schema hallucination**: referencing columns eliminated by an earlier aggregation or join that the model hasn't generated yet.
- **Alias invisibility**: standard SQL forbids referencing a `SELECT` alias in the `WHERE` clause of the same block, leading to invalid queries.
- **Nesting complexity**: correlated subqueries and multi-level CTEs require the model to maintain deep structural state across its entire generation window.

### 2.2 How Pipe Syntax Solves This

Pipe syntax linearizes the query into a sequence of transformations, each consuming the output of the previous step:

```sql
FROM employees
|> WHERE hire_date > '2020-01-01'
|> AGGREGATE AVG(salary) AS avg_sal GROUP BY department
|> WHERE avg_sal > 80000
|> ORDER BY avg_sal DESC
```

This provides three structural guarantees that directly benefit autoregressive generation:

1. **The Prefix Property**: every prefix up to any `|>` boundary is a valid, independently executable query. The model never needs to look ahead.
2. **Local scope**: each operator only sees columns produced by its immediate predecessor. The active schema is always well-defined and narrow.
3. **Linear data flow**: no nesting, no back-references. The generation order matches the logical execution order.
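The Prefix Property can be exercised mechanically during data generation. Below is a minimal sketch (the helper name and sample query are illustrative, not from any library) that splits a pipe query at `|>` boundaries and yields every independently executable prefix:

```python
def executable_prefixes(pipe_sql: str) -> list[str]:
    """Return every prefix of a pipe query that ends at a |> boundary.

    Because of the Prefix Property, each returned string is itself a
    complete, runnable query that could be sent to a dry-run endpoint.
    """
    ops = [ln.strip() for ln in pipe_sql.strip().splitlines() if ln.strip()]
    return ["\n".join(ops[: i + 1]) for i in range(len(ops))]

query = """
FROM employees
|> WHERE hire_date > '2020-01-01'
|> AGGREGATE AVG(salary) AS avg_sal GROUP BY department
|> WHERE avg_sal > 80000
"""

prefixes = executable_prefixes(query)
```

Each element of `prefixes` grows by exactly one operator, which is what makes step-level validation possible.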

### 2.3 Supported Pipe Operators

The following operators are supported across BigQuery and Spark, which are the primary target engines:

| Operator | Purpose | Replaces (Standard SQL) |
|---|---|---|
| `FROM table` | Entry point | `FROM` clause |
| `\|> SELECT` | Column projection | `SELECT` |
| `\|> EXTEND expr AS alias` | Add computed column (preserves existing) | Inline expression in `SELECT` |
| `\|> WHERE` | Row filtering at any point in the pipeline | `WHERE`, `HAVING`, `QUALIFY` |
| `\|> AGGREGATE expr GROUP BY cols` | Aggregation | `SELECT ... GROUP BY` |
| `\|> JOIN table ON cond` | Join (all types: INNER, LEFT, etc.) | `JOIN` clause |
| `\|> ORDER BY` | Sorting | `ORDER BY` |
| `\|> LIMIT n` | Row limiting | `LIMIT` |
| `\|> SET col = expr` | Replace column values | N/A (new) |
| `\|> DROP col` | Remove columns | N/A (new) |
| `\|> RENAME old AS new` | Rename columns | `AS` alias |
| `\|> DISTINCT` | Deduplicate | `SELECT DISTINCT` |

---

## 3. Pipe Query Syntax Landscape

Multiple pipe-like query systems exist. This section surveys the landscape to justify our choice of GoogleSQL pipe syntax as the canonical training target.

### 3.1 The Same Query in Every Syntax

> "From employees, filter Chicago office, average salary by department, keep departments with avg > 80K, sort descending."

**Standard SQL:**
```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE office = 'Chicago'
GROUP BY department
HAVING AVG(salary) > 80000
ORDER BY avg_salary DESC;
```

**GoogleSQL Pipe (BigQuery / Spark):**
```sql
FROM employees
|> WHERE office = 'Chicago'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY department
|> WHERE avg_salary > 80000
|> ORDER BY avg_salary DESC;
```

**Snowflake Flow (`->>`):**
```sql
SELECT * FROM employees WHERE office = 'Chicago'
->> SELECT department, AVG(salary) AS avg_salary FROM $1 GROUP BY department
->> SELECT * FROM $2 WHERE avg_salary > 80000
->> SELECT * FROM $3 ORDER BY avg_salary DESC;
```

**PRQL:**
```prql
from employees
filter office == 'Chicago'
group {department} (
  aggregate { avg_salary = average salary }
)
filter avg_salary > 80000
sort {-avg_salary}
```

**KQL (Kusto):**
```kql
employees
| where office == 'Chicago'
| summarize avg_salary = avg(salary) by department
| where avg_salary > 80000
| sort by avg_salary desc
```

**Malloy:**
```malloy
run: duckdb.table('employees') -> {
  where: office = 'Chicago'
  group_by: department
  aggregate: avg_salary is avg(salary)
} -> {
  where: avg_salary > 80000
  order_by: avg_salary desc
}
```

**dplyr (R):**
```r
employees |>
  filter(office == "Chicago") |>
  group_by(department) |>
  summarize(avg_salary = mean(salary)) |>
  filter(avg_salary > 80000) |>
  arrange(desc(avg_salary))
```

**Polars (Python):**
```python
(
    pl.scan_csv("employees.csv")
    .filter(pl.col("office") == "Chicago")
    .group_by("department")
    .agg(pl.col("salary").mean().alias("avg_salary"))
    .filter(pl.col("avg_salary") > 80000)
    .sort("avg_salary", descending=True)
    .collect()
)
```

### 3.2 Comparison Matrix

| System | Pipe Symbol | SQL-Compatible? | Compiles to SQL? | Native Engine | Adoption |
|---|---|---|---|---|---|
| **GoogleSQL Pipe** | `\|>` | Yes (extension) | N/A (IS SQL) | BigQuery, Spark, Databricks | Very High |
| **Snowflake Flow** | `->>` | Yes (chains full stmts) | N/A (IS SQL) | Snowflake | High |
| **KQL (Kusto)** | `\|` | No (separate language) | No | Azure Data Explorer | Very High |
| **PRQL** | Newlines | No (separate language) | Yes | None (compiler only) | Medium |
| **Malloy** | `->` | No (separate language) | Yes | None (compiles to SQL) | Low-Medium |
| **dplyr (R)** | `%>%` / `\|>` | No (R code) | Yes (via dbplyr) | R in-memory | Very High |
| **Polars** | `.method()` | No (Python) | No | Rust engine | High |
| **Logica** | N/A (predicates) | No (logic language) | Yes | None (compiler only) | Low |

### 3.3 Key Distinctions

**GoogleSQL Pipe vs. Snowflake Flow**: Snowflake's `->>` chains *entire SQL statements* together, referencing prior results via positional `$1`, `$2` parameters. GoogleSQL pipes *individual operators* within a single query. Snowflake's approach is more like Unix pipes between programs; GoogleSQL's is like method chaining within a single program.

**GoogleSQL Pipe vs. KQL**: KQL (Microsoft Azure Data Explorer) is the spiritual predecessor: its `| where`, `| summarize`, `| extend` operators clearly inspired GoogleSQL pipe syntax. However, KQL is an entirely separate language that only runs on Kusto-engine databases. GoogleSQL pipe syntax stays within SQL itself.

**GoogleSQL Pipe vs. PRQL / Malloy**: Both are full language replacements that compile *down to* SQL. They offer cleaner syntax but require a compilation step and have no native engine support (except ClickHouse's experimental PRQL support). GoogleSQL pipe syntax is SQL; it runs natively without compilation.

**GoogleSQL Pipe vs. dplyr / Polars**: These are DataFrame API approaches in R and Python respectively. They solve the same readability problem at the programming language level. The Databricks team explicitly cited DataFrame APIs as inspiration for pipe SQL.

### 3.4 BigQuery vs. Spark: Detailed Differences

While BigQuery and Spark share the same `|>` symbol and core operators, they diverge in several ways:

**Operators only in BigQuery:**

| Operator | Purpose |
|---|---|
| `\|> RENAME` | Rename columns directly |
| `\|> CALL` | Invoke table-valued functions in the pipe chain |
| `\|> WITH` | Inline CTEs within the pipe |
| `\|> WINDOW` | Standalone window function operator (deprecated in favor of EXTEND) |
| `\|> MATCH_RECOGNIZE` | Pattern matching on row sequences |
| `GROUP AND ORDER BY` | Shorthand inside AGGREGATE that also orders the output |

**Features only in Spark:**

| Feature | Purpose |
|---|---|
| `SEMI JOIN` / `ANTI JOIN` | Explicit semi/anti join keywords (BigQuery requires WHERE EXISTS) |
| `NATURAL JOIN` / `LATERAL JOIN` | Additional join types |
| Standalone `OFFSET` | OFFSET without requiring LIMIT (BigQuery requires LIMIT before OFFSET) |

**Behavioral differences:**

| Behavior | BigQuery | Spark |
|---|---|---|
| Lateral references in EXTEND | Allowed (a later column can reference an earlier alias in the same EXTEND) | Not allowed (each projection is independent) |
| Default NULL ordering (ASC) | NULLs first | NULLs last |
| Default NULL ordering (DESC) | NULLs last | NULLs first |

### 3.5 Why GoogleSQL Pipe Syntax Is the Training Target

1. **It IS SQL.** Unlike PRQL, KQL, or Malloy, GoogleSQL pipe syntax requires no compilation step. The generated output is directly executable.
2. **Multi-engine support.** BigQuery, Spark, and Databricks all support it natively. No other pipe syntax has this breadth.
3. **BigQuery is a superset.** Training on BigQuery's dialect covers all Spark operators. The only gap is Spark's `SEMI`/`ANTI`/`NATURAL`/`LATERAL` join types, which can be expressed differently in BigQuery syntax.
4. **SQLGlot round-trip.** SQLGlot can parse GoogleSQL pipe syntax back to standard SQL for any of its 30+ supported dialects. This enables the dual-execution validation loop (Section 6) and deployment-time transpilation to any target database (Section 12.2).
5. **Pretraining signal.** BigQuery's pipe syntax has been GA since February 2025. Frontier LLMs trained after this date will have GoogleSQL pipe syntax in their training data, providing a foundation for fine-tuning.

---

## 4. The Data Problem: Why a Decompiler Is Essential

### 4.1 The Fine-Tuning Data Bottleneck

Fine-tuning a language model to generate pipe SQL requires a large corpus of (natural language question, database schema, pipe SQL query) triples that are **semantically correct**, meaning they return the right answer on the target database. The challenge:

- **No pipe SQL training data exists.** All major Text-to-SQL benchmarks (Spider, BIRD-SQL, WikiSQL, KaggleDBQA) contain exclusively standard SQL. Combined, Spider 1.0 (~7K train) and BIRD-SQL (~9.4K train) provide roughly 16K standard SQL queries, none of them in pipe syntax.
- **LLM-based generation is unreliable.** Using a frontier model (GPT-4o, Claude) to generate pipe SQL has three problems: (a) high cost at scale, (b) the model has limited pipe SQL in its training data, and (c) there is no efficient way to validate correctness without executing every generated query.
- **Manual annotation is infeasible.** Writing thousands of pipe SQL queries by hand is prohibitively expensive and error-prone.

### 4.2 The Decompiler Solution

A **decompiler** is a deterministic program that transforms standard SQL (which we have in abundance and with verified correctness) into semantically equivalent pipe SQL. This is the only approach that simultaneously satisfies all three requirements for training data:

| Requirement | LLM Generation | Manual Writing | Decompiler |
|---|---|---|---|
| Scale (50K+ queries) | Expensive | Infeasible | Free (compute only) |
| Correctness guarantee | No (needs validation) | Error-prone | Deterministic (provable) |
| Reproducibility | Non-deterministic | N/A | Fully reproducible |
| Speed | ~1 query/sec | ~5 min/query | ~1000 queries/sec |

The decompiler takes a known-correct standard SQL query from an established benchmark, parses it into an AST, and mechanically transforms it into an equivalent pipe SQL query. Because the transformation is rule-based and structure-preserving, the output is guaranteed to be semantically equivalent to the input; no execution-based validation is strictly required (though we perform it anyway as a safety net).

### 4.3 Why Not Just Use Standard SQL for Training?

One might ask: why not fine-tune on standard SQL and transpile at inference time? Because the entire point is to exploit the structural advantages of pipe syntax during generation. A model trained on standard SQL still suffers from the nesting and inversion problems described in Section 2.1. The model must learn to *think* in pipes, decomposing a question into a linear sequence of transformations, and this requires training on pipe SQL directly.

### 4.4 Data Augmentation Strategy

Starting from the ~16K seed queries in Spider 1.0 + BIRD-SQL:

1. **Decompile all seed queries** → ~16K pipe SQL equivalents
2. **Schema-aware augmentation**: for each query, generate variants by substituting table/column names from the same schema family (e.g., swap `employees.salary` for `staff.compensation`)
3. **Complexity augmentation**: compose simple queries into multi-step pipes (e.g., combine a filter query and an aggregation query into a single pipeline)
4. **Synthetic NL paraphrasing**: use a language model to rephrase the natural language question while keeping the SQL unchanged

Target: **50K–100K** validated pipe SQL training pairs.
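Complexity augmentation (step 3) is especially cheap with pipe syntax: because pipelines are flat operator lists, a valid query remains valid when further operators are appended. A minimal sketch, with illustrative helper and sample queries (not part of the actual pipeline code):

```python
def compose_pipelines(base: str, extension_ops: list[str]) -> str:
    """Append extra pipe operators to an existing pipe query.

    Works because pipe queries are linear: any valid query stays valid
    when further |> operators are appended (the Prefix Property read
    in reverse).
    """
    return "\n".join([base.strip(), *extension_ops])

filter_query = "FROM orders\n|> WHERE status = 'shipped'"
agg_ops = [
    "|> AGGREGATE COUNT(*) AS n_orders GROUP BY region",
    "|> ORDER BY n_orders DESC",
]

combined = compose_pipelines(filter_query, agg_ops)
```

The composed query still needs dual-execution validation before entering the training set, since the appended operators may reference columns the base query does not produce.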

---

## 5. Decompiler Architecture

### 5.1 Why SQLGlot (and Its Limitations)

SQLGlot is the best available open-source SQL parser/transpiler, supporting 30+ dialects and providing a rich AST. However, its pipe syntax support is **one-directional only**:

- **Pipe → Standard**: SQLGlot can parse pipe syntax and decompose it into CTE-based standard SQL. This works.
- **Standard → Pipe**: SQLGlot has **no generator** for pipe syntax output. Pipe nodes are destroyed at parse time and replaced with CTEs. There is no `to_pipe()` method, no pipe expression nodes in the AST, and no reverse transformation.

Therefore, we must build a custom decompiler on top of SQLGlot's parser and AST infrastructure. SQLGlot provides the parsing, qualification, and optimization layers; we add the pipe emission layer.

### 5.2 Transformation Pipeline

```
Standard SQL (string)
        │
        ▼
[SQLGlot Parse] ──► AST (language-agnostic)
        │
        ▼
[SQLGlot Qualify] ──► Fully qualified AST
        │   (all columns resolved to table.column,
        │    all aliases expanded, star expressions resolved)
        │
        ▼
[Custom Pipe Emitter] ──► Pipe SQL (string)
```

### 5.3 Pipe Emitter Transformation Rules

The custom emitter walks the qualified AST and applies the following rules in order:

**Rule 1: FROM extraction**
```
SELECT ... FROM table_expr   →   FROM table_expr
```
Extract the `FROM` clause as the pipe entry point. If the `FROM` contains joins, emit them as separate `|> JOIN` operators.

**Rule 2: JOIN linearization**
```
FROM a JOIN b ON a.id = b.id JOIN c ON b.id = c.id
→
FROM a
|> JOIN b ON a.id = b.id
|> JOIN c ON b.id = c.id
```

**Rule 3: WHERE promotion**
```
WHERE condition   →   |> WHERE condition
```
Placed immediately after all `FROM`/`JOIN` operators.

**Rule 4: GROUP BY + aggregation fusion**
```
SELECT col, AGG(expr) ... GROUP BY col
→
|> AGGREGATE AGG(expr) AS alias GROUP BY col
```
The `SELECT` list is decomposed: aggregate expressions go into `|> AGGREGATE`, non-aggregate computed expressions become `|> EXTEND` operators placed before the aggregation.

**Rule 5: HAVING → post-aggregation WHERE**
```
HAVING AGG(x) > threshold
→
|> WHERE agg_alias > threshold
```
Since `|> AGGREGATE` produces named output columns, the `HAVING` condition is rewritten to reference those aliases.

**Rule 6: Window function extraction**
```
SELECT ..., ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) AS rn
→
|> EXTEND ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) AS rn
```
Window functions are emitted as `|> EXTEND` operators after all filtering and aggregation.

**Rule 7: QUALIFY → post-window WHERE**
```
QUALIFY rn = 1   →   |> WHERE rn = 1
```

**Rule 8: ORDER BY / LIMIT passthrough**
```
ORDER BY col DESC LIMIT 10
→
|> ORDER BY col DESC
|> LIMIT 10
```

**Rule 9: Subquery unrolling**
Correlated and non-correlated subqueries in `WHERE` or `SELECT` are "unrolled" into preceding pipe segments. Scalar subqueries become `|> JOIN` + `|> EXTEND` patterns. `EXISTS`/`IN` subqueries become `|> JOIN` (semi-join) patterns.

**Rule 10: CTE inlining**
CTEs are unrolled into the main pipeline. Each CTE becomes a named sub-pipeline that feeds into the final query via `|> JOIN` or direct substitution.

### 5.4 Handling Ambiguity and Edge Cases

Not every standard SQL query maps cleanly to a single pipe representation. The emitter follows a **canonical ordering** convention:

```
FROM → JOINs → WHERE (pre-agg) → EXTEND (computed cols) →
AGGREGATE → WHERE (post-agg) → EXTEND (windows) →
WHERE (post-window) → SELECT (final projection) →
ORDER BY → LIMIT
```

When multiple valid orderings exist, the canonical order ensures deterministic output, which is critical for training data consistency.
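To make the rule set concrete, here is a toy emitter over a simplified dict-based "AST" (not SQLGlot's expression tree; the real emitter walks a fully qualified AST). It covers Rules 1, 3, 4, 5, and 8 and emits operators in the canonical order:

```python
# Illustrative only: a minimal pipe emitter over a hypothetical dict AST.
def emit_pipe(ast: dict) -> str:
    ops = [f"FROM {ast['from']}"]                                      # Rule 1
    if "where" in ast:
        ops.append(f"|> WHERE {ast['where']}")                         # Rule 3
    if "group_by" in ast:
        aggs = ", ".join(ast["aggregates"])
        ops.append(f"|> AGGREGATE {aggs} GROUP BY {ast['group_by']}")  # Rule 4
    if "having" in ast:
        ops.append(f"|> WHERE {ast['having']}")                        # Rule 5
    if "order_by" in ast:
        ops.append(f"|> ORDER BY {ast['order_by']}")                   # Rule 8
    if "limit" in ast:
        ops.append(f"|> LIMIT {ast['limit']}")                         # Rule 8
    return "\n".join(ops)

ast = {
    "from": "employees",
    "where": "office = 'Chicago'",
    "group_by": "department",
    "aggregates": ["AVG(salary) AS avg_salary"],
    "having": "avg_salary > 80000",
    "order_by": "avg_salary DESC",
}
```

Calling `emit_pipe(ast)` reproduces the Section 3.1 pipe query; the production emitter additionally handles joins, subqueries, windows, and CTEs (Rules 2, 6, 7, 9, 10).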

---

## 6. Semantic Validation: The Dual-Execution Loop

Even though the decompiler is deterministic, bugs in transformation rules can silently produce semantically incorrect pipe SQL. We validate every synthesized query through dual execution:

```
┌──────────────────────┐
│ Standard SQL (gold)  │
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│    Execute on DB     │──► Result Set A
└──────────────────────┘

┌──────────────────────┐
│   Pipe SQL (synth)   │
└──────────┬───────────┘
           │
┌──────────▼───────────────────┐
│ Transpile back to standard   │
│ SQL via SQLGlot              │
│ (pipe→standard is supported) │
└──────────┬───────────────────┘
           │
┌──────────▼───────────┐
│    Execute on DB     │──► Result Set B
└──────────────────────┘

Result Set A == Result Set B?
```

**Comparison method**:
- Both result sets are loaded into DataFrames.
- Rows are sorted deterministically (by all columns) to eliminate non-deterministic ordering.
- Column types are coerced to common types (e.g., `DECIMAL` vs. `FLOAT` tolerance).
- For large results (>100K rows), each row is hashed with SHA-256 and the hashes are combined into a single digest, so only two fixed-size digests need to be compared.

Queries that fail validation are quarantined with diagnostic metadata (AST diff, execution error) for manual review and decompiler rule refinement.
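A stdlib-only sketch of the comparison logic (the production pipeline compares DataFrames; here plain tuples stand in for result rows, and the rounding tolerance and size threshold are illustrative):

```python
import hashlib

def normalize(rows, float_tol=6):
    """Sort rows deterministically and round floats to a common precision."""
    def norm_cell(v):
        if isinstance(v, float):
            return round(v, float_tol)  # DECIMAL vs. FLOAT tolerance
        return v
    return sorted(tuple(norm_cell(c) for c in row) for row in rows)

def digest(rows):
    """Order-independent digest: hash each row, XOR the digests together."""
    acc = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(h, "big")
    return acc

def results_match(a, b, large_threshold=100_000):
    """Compare two result sets; fall back to digests for large results."""
    if max(len(a), len(b)) > large_threshold:
        return digest(normalize(a)) == digest(normalize(b))
    return normalize(a) == normalize(b)
```

Note the XOR combination is a sketch: it is order-independent but cancels duplicate rows, so a production version would fold in a per-row count.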

---

## 7. Incremental Training Strategy

### 7.1 Why Incremental?

Standard fine-tuning teaches the model to emit the entire query in one shot. This wastes the structural advantage of pipe syntax. Instead, we train the model to generate one pipe operator at a time, conditioned on the growing prefix. This mirrors how a human analyst would build a query: start with the data source, filter, transform, aggregate, filter again.

### 7.2 Trajectory Decomposition

Each N-operator pipe query is decomposed into N training samples:

**Example**: "Which departments in the Chicago office have average salaries above $80K?"

| Step | Input (prompt) | Output (completion) |
|---|---|---|
| 1 | Question + Schema | `FROM employees` |
| 2 | Question + Schema + `FROM employees` | `\|> WHERE office = 'Chicago'` |
| 3 | Question + Schema + prefix(1–2) | `\|> AGGREGATE AVG(salary) AS avg_sal GROUP BY department` |
| 4 | Question + Schema + prefix(1–3) | `\|> WHERE avg_sal > 80000` |
| 5 | Question + Schema + prefix(1–4) | `\|> SELECT department, avg_sal` |

This amplifies the training data N-fold: the 5-operator query above yields 5 supervised examples. For 50K pipe queries with an average of 4 operators each, this produces **~200K training samples**.
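The decomposition itself is mechanical; the sketch below (function and field names are illustrative) turns one pipe query into its (prefix, next-operator) training pairs:

```python
def decompose(question: str, schema: str, pipe_sql: str) -> list[dict]:
    """Turn an N-operator pipe query into N (prompt, completion) pairs."""
    ops = [ln for ln in pipe_sql.strip().splitlines() if ln.strip()]
    samples = []
    for i, op in enumerate(ops):
        prefix = "\n".join(ops[:i])  # empty string for the first step
        samples.append({
            "question": question,
            "schema": schema,
            "query_so_far": prefix,
            "next_operator": op,
        })
    return samples

samples = decompose(
    "Which departments have avg salary > 80K?",
    "employees(id INT, department TEXT, salary DECIMAL, office TEXT)",
    "FROM employees\n|> WHERE office = 'Chicago'\n"
    "|> AGGREGATE AVG(salary) AS avg_sal GROUP BY department",
)
```

Each sample maps directly onto one JSONL record in the format shown in Section 7.3.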

### 7.3 Dataset Format (JSONL)

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a SQL assistant that builds pipe SQL queries incrementally. Given a question, schema, and the query built so far, emit only the next pipe operator."
    },
    {
      "role": "user",
      "content": "Question: Which departments have avg salary > 80K?\nSchema: employees(id INT, name TEXT, department TEXT, salary DECIMAL, office TEXT)\nQuery so far: FROM employees\n|> WHERE office = 'Chicago'"
    },
    {
      "role": "assistant",
      "content": "|> AGGREGATE AVG(salary) AS avg_sal GROUP BY department"
    }
  ]
}
```

### 7.4 Prefix Executability as a Training Signal

Because every prefix is independently executable (the Prefix Property), we can verify each intermediate step during data generation. If step K's prefix doesn't execute successfully against the database, the entire trajectory is flagged. This catches decompiler errors that might only manifest mid-pipeline.
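This check is easy to script. Below is a sketch with a stubbed executor; in production, `try_execute` would submit the prefix to the warehouse's dry-run endpoint rather than the toy string check used here:

```python
def validate_trajectory(pipe_sql: str, try_execute) -> tuple[bool, int]:
    """Execute every prefix; return (ok, index of first failing step or -1)."""
    ops = [ln for ln in pipe_sql.strip().splitlines() if ln.strip()]
    for i in range(len(ops)):
        prefix = "\n".join(ops[: i + 1])
        if not try_execute(prefix):
            return False, i  # flag the whole trajectory for review
    return True, -1

# Stub executor: rejects a prefix that projects a column already dropped
# by an AGGREGATE earlier in the pipeline (a typical mid-pipeline error).
def stub_execute(prefix: str) -> bool:
    if "AGGREGATE" in prefix and "|> SELECT name" in prefix.split("AGGREGATE")[-1]:
        return False
    return True

ok, bad_step = validate_trajectory(
    "FROM employees\n|> AGGREGATE COUNT(*) AS n GROUP BY department\n|> SELECT name",
    stub_execute,
)
```

Here the trajectory fails at step index 2, because `name` no longer exists after the aggregation.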

---

## 8. Fine-Tuning Configuration

### 8.1 Base Model Selection

| Model | Parameters | Why |
|---|---|---|
| **Qwen-2.5-Coder-7B** (primary) | 7.6B | 82.0 on Spider (standard SQL); strong code/SQL pretraining; 128K context |
| **Llama-3.2-3B** (lightweight) | 3.2B | Cost-efficient inference; suitable for simpler schemas; 128K context |
| **Qwen-2.5-Coder-1.5B** (edge) | 1.5B | Speculative decoding draft model; edge deployment |

Qwen-2.5-Coder-7B is the primary target because it already has strong SQL capabilities from pretraining, meaning fewer training steps are needed to adapt it to pipe syntax.

### 8.2 Training Parameters

| Parameter | Value | Rationale |
|---|---|---|
| Method | QLoRA (4-bit quantization) | 7B model fits in ~11GB VRAM on an RTX 4090 (24GB) |
| LoRA rank | 64–128 | Pipe SQL is a novel syntax underrepresented in pretraining; higher rank captures structural patterns better |
| LoRA alpha | 2× rank | Standard scaling rule |
| LoRA target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` | Full attention + MLP adaptation |
| Learning rate | 1e-4 (with cosine decay) | Standard for QLoRA |
| Batch size | 8 (effective, with gradient accumulation) | Balance between stability and speed |
| Context window | 4096 tokens | Accommodates schema (1–3K tokens) + prefix + next operator |
| Epochs | 3–5 | Monitor validation loss; early stop on plateau |
| Warmup | 5% of total steps | Prevent early divergence |

### 8.3 Loss Masking

Only the assistant completion (the next pipe operator) contributes to the loss. The system prompt, user message (question + schema + prefix), and any padding tokens are masked with label ID `-100`.
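A sketch of the masking, assuming the tokenized sequence and the completion span's boundaries are already known (the token IDs below are toy values, not a real tokenizer's output):

```python
IGNORE_INDEX = -100  # label value ignored by cross-entropy in common trainers

def build_labels(input_ids: list[int], completion_start: int, completion_end: int) -> list[int]:
    """Mask everything except the assistant completion span."""
    labels = [IGNORE_INDEX] * len(input_ids)
    labels[completion_start:completion_end] = input_ids[completion_start:completion_end]
    return labels

# Toy sequence: 10 prompt tokens (system + user), 4 completion tokens, 2 padding.
input_ids = list(range(100, 116))
labels = build_labels(input_ids, completion_start=10, completion_end=14)
```

Only positions 10–13 carry real labels; prompt and padding positions contribute nothing to the gradient.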

---

## 9. Agentic Inference: Tool Calls Between Pipes

At inference time, the model operates within an agent loop. After generating each pipe operator, external tools are invoked to ground the next generation step in reality.

### 9.1 Tool 1: Schema Propagation (Dry Run)

**When**: After every operator.
**How**: Execute `DESCRIBE` or a dry-run API call on the current prefix.
**Returns**: The list of columns and their types available for the next operator.
**Why**: Prevents the model from referencing columns that were dropped by an earlier `|> AGGREGATE` or `|> SELECT`. This is the single most impactful tool for preventing schema hallucination.

### 9.2 Tool 2: Sample Rows

**When**: After `|> WHERE` and `|> JOIN` operators (where data content matters).
**How**: Execute `[prefix] |> LIMIT 5` against the database.
**Returns**: 5 rows of actual data.
**Why**: Lets the model verify literal values (e.g., is it `'Furniture'` or `'FURNITURE'`?), date formats, and NULL patterns. This eliminates hallucinated literal constants.

### 9.3 Tool 3: Syntax Validation

**When**: After generating any operator candidate.
**How**: Parse `[prefix] |> [candidate]` with SQLGlot.
**Returns**: Success, or an error with its location.
**Why**: Catches syntax errors (missing commas, invalid keywords) before the model proceeds, avoiding cascading errors in subsequent operators.

### 9.4 Agent Loop

```
Input: (question, schema)
prefix = ""

while True:
    candidate = model.generate(question, schema, active_columns, prefix)

    if candidate == "<END>":
        break

    if not syntax_valid(prefix + candidate):
        candidate = model.retry(question, schema, prefix, error_msg)

    prefix = prefix + "\n" + candidate
    active_columns = dry_run(prefix)

    if needs_data_grounding(candidate):
        sample = execute(prefix + " |> LIMIT 5")
        # Feed sample into next generation context

return prefix
```

---

## 10. Post-SFT Reinforcement: Group Relative Policy Optimization (GRPO)

After supervised fine-tuning, we apply GRPO to improve reasoning quality. GRPO (introduced in DeepSeek-Math) eliminates the need for a separate critic model by using group-level baselines, making it practical for small-scale training.

### 10.1 Reward Signals

For each generated pipe query, compute a composite reward:

| Signal | Type | Weight | Description |
|---|---|---|---|
| **Execution** | Binary (0/1) | 0.3 | Does the complete query execute without error? |
| **Result correctness** | Continuous (0–1) | 0.5 | Does the result match the gold standard? (F1 over result set rows) |
| **Schema adherence** | Binary (0/1) | 0.1 | Does every referenced column exist in the active schema at that pipe stage? |
| **Operator structure** | Continuous (0–1) | 0.1 | Does the query use the expected operator types? (e.g., a question implying aggregation should yield `\|> AGGREGATE`) |

### 10.2 GRPO Procedure

1. For each training prompt (question + schema), generate K=8 complete pipe queries using the SFT model.
2. Score each query using the reward signals above.
3. Compute the group mean and standard deviation of rewards.
4. For each query, compute advantage = (reward - group_mean) / group_std.
5. Update the policy to increase the probability of above-average completions and decrease that of below-average ones.

This is particularly effective for Text-to-SQL because the reward signal (execution correctness) is cheap to compute: just run the query.
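The scoring and advantage computation fit in a few stdlib lines. The sketch below uses the weights from the table above; the reward components and the K=4 group are toy values (the actual procedure samples K=8):

```python
from statistics import mean, pstdev

WEIGHTS = {"execution": 0.3, "correctness": 0.5, "schema": 0.1, "structure": 0.1}

def composite_reward(signals: dict) -> float:
    """Weighted sum of the four reward signals."""
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO: normalize each reward against its own group's statistics."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt (toy signal values).
group = [
    {"execution": 1, "correctness": 1.0, "schema": 1, "structure": 1.0},
    {"execution": 1, "correctness": 0.5, "schema": 1, "structure": 1.0},
    {"execution": 0, "correctness": 0.0, "schema": 0, "structure": 0.5},
    {"execution": 1, "correctness": 0.0, "schema": 1, "structure": 0.0},
]
advs = group_advantages([composite_reward(g) for g in group])
```

Completions that beat their group's mean get positive advantage and are reinforced; the group-level baseline replaces the critic network.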
578

---

## 11. Performance Targets

### 11.1 Realistic Benchmarks

Performance targets are calibrated against published results as of early 2026:

| Benchmark | Base 7B (standard SQL) | Specialized Pipe 7B (target) | Current 7B SOTA (standard SQL) | Frontier model reference |
|---|---|---|---|---|
| **BIRD-SQL** (dev) | ~35% EX | **65–70% EX** | 70.4% (Arctic-R1-7B w/ GRPO) | ~78% (GPT-4o pipeline) |
| **Spider 1.0** (test) | ~70% EX | **80–85% EX** | ~82% (Qwen-2.5-Coder-7B) | ~86% (GPT-4o) |

**Note on Spider 2.0**: This benchmark involves enterprise-scale databases with 1000+ columns and complex multi-tool workflows. Even GPT-4o scores only ~13% on Spider 2.0-Lite. We do not set a target here: Spider 2.0 requires agentic capabilities beyond single-query generation.

### 11.2 Why These Targets Are Achievable

- **BIRD-SQL 65–70%**: Arctic-Text2SQL-R1-7B already achieves 70.4% with standard SQL + GRPO. Pipe syntax should provide a comparable or slightly improved structural advantage, as the model doesn't need to manage nesting.
- **Spider 1.0 80–85%**: Qwen-2.5-Coder-7B already hits 82% with standard SQL. Fine-tuning on pipe syntax should at minimum match this, with incremental training providing an additional boost through better step-by-step reasoning.

---

## 12. Deployment

### 12.1 Serving Stack

- **vLLM** for inference with automatic prefix caching (APC) enabled.
- **Schema caching**: database schemas (often 1–3K tokens) are prepended to every request. With prefix caching, repeated queries against the same database save 70–90% on schema tokens.
- **Speculative decoding** (optional): use the Qwen-2.5-Coder-1.5B model as a draft for the 7B model. Expected speedup: 1.5–1.8x for standard draft-verify; 2–3x if using an EAGLE-style speculator trained on the 7B model's hidden states.
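A back-of-envelope sketch of the schema-caching claim: once the first request populates the prefix cache, each subsequent request against the same database reuses the schema prefix. The function and example numbers below are illustrative assumptions, not measured figures:

```python
def cached_fraction_of_schema_tokens(schema_tokens, n_requests):
    """Fraction of all schema prefill tokens served from the prefix cache,
    assuming the first request populates it and later requests fully reuse it.
    (Mathematically this reduces to (n_requests - 1) / n_requests; the
    schema_tokens parameter just makes the units explicit.)
    """
    if n_requests == 0:
        return 0.0
    return (n_requests - 1) * schema_tokens / (n_requests * schema_tokens)

# e.g., a 2K-token schema queried 10 times: 90% of schema tokens are cache hits.
frac = cached_fraction_of_schema_tokens(2000, 10)
```

Under this simple model, the 70–90% range in the bullet above corresponds to roughly 3–10 requests sharing a schema; heavier traffic against the same database pushes the hit rate higher still.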

### 12.2 Output Processing

The generated pipe SQL must be transpiled to the target engine's dialect before execution. Since SQLGlot can parse pipe syntax and emit standard SQL for any of its 30+ supported dialects, this is a single function call:

```python
import sqlglot

# Generate canonical (BigQuery-dialect) pipe SQL from the question + schema.
pipe_query = model.generate(question, schema)

# Parse the pipe query and re-emit it as standard SQL for the target engine;
# transpile() returns a list of statements, so take the first.
executable = sqlglot.transpile(pipe_query, read="bigquery", write="postgres")[0]
cursor.execute(executable)
```

This means the model generates in a single canonical pipe syntax, while deployment supports any target database.

---

## 13. Risks and Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Decompiler bugs produce incorrect pipe SQL | Corrupted training data | Dual-execution validation loop (Section 6) catches any semantic divergence that manifests on the validation data |
| 16K seed queries insufficient for fine-tuning | Underfitting on complex patterns | Data augmentation (Section 4.4) expands to 50–100K queries; trajectory decomposition (Section 7.2) amplifies to 200K+ samples |
| Pipe syntax unseen in pretraining | Model struggles with novel tokens | LoRA rank 64–128 provides sufficient capacity; `\|>` is a simple 2-token sequence, not a complex new grammar |
| Engine-specific pipe differences (BigQuery vs. Spark) | Portability issues | Train on canonical (BigQuery) syntax only; transpile at deployment via SQLGlot |
| Incremental generation is slower than one-shot | Latency at inference time | Speculative decoding + prefix caching offset the overhead; accuracy gains justify the tradeoff |

---

## 14. Implementation Roadmap

### Phase 1: Decompiler
- Build the pipe emitter on top of SQLGlot's qualified AST
- Implement the transformation rules (Section 5.3)
- Validate against Spider 1.0 + BIRD-SQL with the dual-execution loop
- Target: 90%+ of seed queries successfully decompiled and validated
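The dual-execution check at the heart of Phase 1 validation can be sketched as follows, with an in-memory SQLite database standing in for the real engines. In the actual pipeline the second query would be the decompiled pipe query transpiled back to standard SQL; here any pair of queries illustrates the check:

```python
import sqlite3
from collections import Counter

def results_match(conn, original_sql, roundtrip_sql):
    """Dual-execution check: both queries must return the same multiset of rows.

    Comparison is order-insensitive (a Counter over row tuples), which is
    the right semantics for queries without a top-level ORDER BY.
    """
    a = conn.execute(original_sql).fetchall()
    b = conn.execute(roundtrip_sql).fetchall()
    return Counter(a) == Counter(b)

# Tiny in-memory fixture to exercise the check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0), (3, 25.0)])
ok = results_match(
    conn,
    "SELECT amount, COUNT(*) FROM orders GROUP BY amount",
    "SELECT amount, COUNT(id) FROM orders GROUP BY amount",  # semantically equal here
)
```

A pair that passes this check on representative data is accepted into the training set; any divergence flags a decompiler bug for inspection.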

### Phase 2: Data Pipeline
- Run the decompiler over all seed queries
- Apply augmentation strategies (schema substitution, complexity composition, NL paraphrasing)
- Generate trajectory-decomposed JSONL training files
- Target: 50K+ pipe queries → 200K+ training samples
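The trajectory decomposition that turns one pipe query into several JSONL samples can be sketched as below. The field names and the `<END>` marker are illustrative assumptions, and the naive `|>` split would need hardening against pipes inside string literals:

```python
import json

def decompose_trajectory(question, pipe_query):
    """Emit one (prefix -> next operator) sample per pipe stage, plus a
    final sample that teaches the model when to stop."""
    stages = [s.strip() for s in pipe_query.split("|>")]
    head, ops = stages[0], stages[1:]
    samples, prefix = [], head
    for op in ops:
        samples.append({"question": question, "prefix": prefix, "target": "|> " + op})
        prefix += "\n|> " + op
    samples.append({"question": question, "prefix": prefix, "target": "<END>"})
    return samples

query = "FROM orders |> WHERE amount > 100 |> AGGREGATE COUNT(*) AS n GROUP BY region"
jsonl = "\n".join(json.dumps(s) for s in decompose_trajectory("Big orders per region?", query))
```

A query with N pipe operators yields N + 1 samples, which is how 50K queries amplify to 200K+ training examples.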

### Phase 3: Supervised Fine-Tuning
- Fine-tune Qwen-2.5-Coder-7B with QLoRA
- Evaluate on a held-out BIRD-SQL dev split
- Ablate: incremental vs. one-shot training, LoRA rank, context window size
- Target: match or exceed the base model's standard-SQL accuracy

### Phase 4: GRPO Reinforcement
- Implement the execution-based reward function
- Run GRPO on the SFT checkpoint
- Evaluate on the full BIRD-SQL dev set and Spider 1.0 test set
- Target: 5–10% EX improvement over the SFT-only model

### Phase 5: Agentic Integration & Deployment
- Build the agent loop with dry-run, sample-rows, and syntax-validation tools
- Integrate with vLLM serving
- End-to-end evaluation on production schemas
- Target: sub-2-second latency for 4-operator queries on an RTX 4090

---

## Appendix A: Pipe SQL Engine Support Matrix

| Engine | Status | Pipe Symbol | Notes |
|---|---|---|---|
| **BigQuery (GoogleSQL)** | GA (Feb 2025) | `\|>` | 20+ operators; most complete implementation |
| **Apache Spark 4.0+** | GA | `\|>` | Full operator set; mirrors BigQuery |
| **Databricks Runtime 16.2+** | GA | `\|>` | Same as Spark |
| **Firebolt** | Supported | `\|>` | Subset of operators |
| **DuckDB** | Community extension only | `\|>` | Regex-based preprocessor; not production-grade |
| **Snowflake** | Different concept | `->>` | Chains full SQL statements, not operators |
| **PostgreSQL / MySQL / SQL Server** | Not supported | N/A | — |

## Appendix B: SQLGlot Pipe Syntax Capabilities

| Capability | Supported? |
|---|---|
| Parse pipe syntax → AST | Yes (12 operators) |
| Generate pipe syntax from AST | **No** |
| Pipe → standard SQL transpilation | Yes (via CTE decomposition at parse time) |
| Standard SQL → pipe transpilation | **No** (requires a custom decompiler) |
| Optimizer/qualify on pipe input | Yes (operates on the CTE-decomposed AST) |
| Supported pipe operators (parsing) | SELECT, WHERE, AGGREGATE, EXTEND, JOIN, ORDER BY, LIMIT, AS, PIVOT, UNPIVOT, TABLESAMPLE, set ops |
| Known broken operators (v29.x) | SET, DROP, RENAME, DISTINCT, CALL, WITH |