---

language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
tags:
- text-to-sql
- pipe-sql
- sqlglot
- tool-calling
- qwen2
datasets:
- spider
pipeline_tag: text-generation
model-index:
- name: pipe-sql-1.5b
  results:
  - task:
      type: text-to-sql
      name: Text-to-SQL
    dataset:
      type: spider
      name: Spider 1.0 Dev
    metrics:
    - type: execution_accuracy
      value: 60.66
      name: Execution Accuracy
---


# Pipe SQL 1.5B

A fine-tuned [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) model for generating **Pipe SQL** through multi-turn tool-calling conversations.

**GitHub**: [nittygritty-zzy/sqlglot](https://github.com/nittygritty-zzy/sqlglot)

## What is Pipe SQL?

Pipe SQL is a more readable SQL syntax that uses the `|>` (pipe) operator to chain operations in a linear, top-to-bottom flow:

```sql
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC
```

This is transpiled to standard SQL via [sqlglot](https://github.com/tobymao/sqlglot), an open-source SQL parser and transpiler.
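To illustrate the linear structure the `|>` operator creates, here is a toy stdlib-only sketch that splits a pipe SQL string into its stages. This is not the project's parser — real transpilation goes through sqlglot — it only shows how each stage is one self-contained operation:

```python
# Toy illustration only: split a pipe SQL query into its pipeline stages.
# The real transpiler is sqlglot; this just shows the linear,
# top-to-bottom structure that the |> operator creates.

def split_pipe_stages(query: str) -> list[str]:
    """Split a pipe SQL string on the |> operator, trimming whitespace."""
    return [stage.strip() for stage in query.split("|>") if stage.strip()]

query = """
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC
"""

stages = split_pipe_stages(query)
# First stage is the source table; each later stage is one operation.
print(stages)
# → ['FROM employees', "WHERE department = 'Engineering'", ...]
```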

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | Qwen2.5-Coder-1.5B-Instruct |
| **Architecture** | Qwen2ForCausalLM |
| **Parameters** | 1.5B |
| **Hidden Size** | 1536 |
| **Layers** | 28 |
| **Attention Heads** | 12 (2 KV heads) |
| **Context Length** | 2048 tokens (training) |

## Design Documents

The full design and methodology behind this project are documented in the following design docs (also available in [docs/design/](https://github.com/nittygritty-zzy/sqlglot/tree/main/docs/design) on GitHub):

| Document | Description |
|----------|-------------|
| [Fine-Tuning Design Doc](docs/pipe-sql-fine-tuning-design-doc.md) | End-to-end system design for incremental pipe SQL synthesis and specialized fine-tuning of 1.5B-7B models |
| [Decompiler Design Doc](docs/pipe-sql-decompiler-design-doc.md) | Standard SQL to pipe SQL decompiler — the deterministic data generation component |
| [Validation Loop Design Doc](docs/pipe-sql-validation-loop-design-doc.md) | SQLite round-trip validation and feedback loop to ensure semantic correctness |
| [Training Reproduction Guide](docs/pipe-sql-training-reproduction-guide.md) | Step-by-step guide to reproduce the full training pipeline from scratch |

## Training

The model was fine-tuned using **QLoRA** on multi-turn tool-calling conversations for text-to-SQL generation.

### Training Data

Conversations were generated from the [Spider 1.0](https://yale-lily.github.io/spider) training set, where each conversation follows an agentic workflow:
1. **Explore** the database schema using `list_tables`, `describe_table`, and `sample_data` tools
2. **Write** pipe SQL queries using `execute_pipe_sql` and `validate_pipe_sql` tools
3. **Iterate** based on execution results until the query is correct
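A single training conversation following that workflow might look like the sketch below. The field names and message contents are illustrative, not the exact dataset schema used for training:

```python
# Hypothetical sketch of one multi-turn training conversation.
# Role names follow the chat-template convention; the exact dataset
# schema used for training may differ.
sample = {
    "db_id": "concert_singer",
    "messages": [
        {"role": "system", "content": "You are a text-to-SQL agent. Explore the schema with tools, then answer with pipe SQL."},
        {"role": "user", "content": "How many singers do we have?"},
        {"role": "assistant", "content": '<tool_call>\nlist_tables({"db_id": "concert_singer"})\n</tool_call>'},
        {"role": "user", "content": "<tool_response>\nTables: stadium, singer, concert, singer_in_concert\n</tool_response>"},
        {"role": "assistant", "content": '<tool_call>\nexecute_pipe_sql({"query": "FROM singer |> AGGREGATE COUNT(*) AS n"})\n</tool_call>'},
    ],
}

# Assistant-only loss: only assistant turns contribute to the loss;
# tool responses (delivered in the user role here) are masked out.
trainable_turns = [m for m in sample["messages"] if m["role"] == "assistant"]
print(len(trainable_turns))
```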

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Method** | QLoRA (4-bit NF4) |
| **LoRA rank** | 16 |
| **LoRA alpha** | 32 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Epochs** | 3 |
| **Learning rate** | 2e-4 |
| **LR scheduler** | Cosine |
| **Warmup ratio** | 0.05 |
| **Batch size** | 2 (per device) |
| **Gradient accumulation** | 8 steps |
| **Weight decay** | 0.01 |
| **Loss** | Assistant-only (tool responses masked) |
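As a rough sanity check on adapter size, the number of trainable LoRA parameters is r × (d_in + d_out) per adapted matrix. The sketch below assumes the published Qwen2.5-1.5B dimensions (hidden size 1536, MLP intermediate size 8960, 2 KV heads with head dim 128, 28 layers); treat the result as an estimate:

```python
# Estimate trainable LoRA parameters: each adapted weight W (d_out x d_in)
# gains two low-rank factors A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) extra parameters.
r = 16
hidden = 1536          # Qwen2.5-1.5B hidden size
intermediate = 8960    # assumed Qwen2.5-1.5B MLP intermediate size
kv_dim = 2 * 128       # assumed: 2 KV heads x head_dim 128 (GQA)
layers = 28

modules = {                          # (d_in, d_out) per target module
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * layers
print(f"~{total / 1e6:.1f}M trainable LoRA parameters")  # roughly 18.5M
```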



## Evaluation Results



Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.



### Execution Accuracy



| Metric | Value |
|--------|-------|
| **Execution Accuracy** | **60.66%** (626 / 1,032) |
| **Prediction Rate** | 99.7% (1,031 / 1,034) |
| **Total Questions** | 1,034 |
| **Gold Errors Excluded** | 2 |



### Context: Spider 1.0 Dev Set SOTA



| Model | Size | EX (Dev) | Method |
|-------|-----:|----------|--------|
| MiniSeek | — | 91.2% | Proprietary |
| DAIL-SQL + GPT-4 + SC | — | 86.6% | In-context learning |
| SFT CodeS-7B | 7B | 85.4% | Fine-tuned |
| DIN-SQL + GPT-4 | — | 85.3% | In-context learning |
| SFT CodeS-3B | 3B | 83.3% | Fine-tuned |
| SFT CodeS-1B | 1B | 77.9% | Fine-tuned |
| **Pipe SQL 1.5B (ours)** | **1.5B** | **60.7%** | **Fine-tuned, agentic tool-calling** |



Our model trails CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy. Sources: [Spider leaderboard](https://yale-lily.github.io/spider), [CodeS (Li et al., 2024)](https://arxiv.org/abs/2402.16347).



### Detailed Breakdown



| Status | Count | % of Total | Description |
|--------|------:|------------|-------------|
| **Match** | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
| **Mismatch** | 209 | 20.2% | SQL executes but results differ from gold |
| **Execution Error** | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
| **Transpile Error** | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
| **No Prediction** | 3 | 0.3% | Agent did not produce a pipe SQL query |
| **Gold Error** | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |



### Evaluation Methodology



1. The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120s timeout)
2. The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
3. Predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
4. Both predicted and gold SQL are executed against the Spider SQLite databases
5. Result sets are compared using order-insensitive set comparison with numeric tolerance
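Step 5 can be sketched as the following simplified, stdlib-only comparison (the actual evaluator in the repo may handle more edge cases, such as column reordering):

```python
import math

def values_equal(a, b, tol=1e-6):
    """Compare two cell values, with tolerance for numerics."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=tol, abs_tol=tol)
    return a == b

def rows_match(pred_rows, gold_rows, tol=1e-6):
    """Order-insensitive comparison of two result sets (multiset semantics)."""
    if len(pred_rows) != len(gold_rows):
        return False
    remaining = list(gold_rows)
    for row in pred_rows:
        for i, gold in enumerate(remaining):
            if len(row) == len(gold) and all(
                values_equal(a, b, tol) for a, b in zip(row, gold)
            ):
                del remaining[i]  # each gold row may be matched only once
                break
        else:
            return False
    return True

print(rows_match([(1, 2.0000001)], [(1, 2.0)]))        # True (numeric tolerance)
print(rows_match([("a",), ("b",)], [("b",), ("a",)]))  # True (order ignored)
```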



> **Note**: This is an **in-distribution** evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.



## Tools



The model was trained to use 5 tools in a multi-turn conversation:



| Tool | Description |
|------|-------------|
| `list_tables` | List all tables in a database |
| `describe_table` | Get column names, types, and constraints for a table |
| `sample_data` | Retrieve sample rows from a table |
| `execute_pipe_sql` | Execute a pipe SQL query against the database |
| `validate_pipe_sql` | Validate pipe SQL syntax without executing |
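For reference, such tools can be described with OpenAI-style JSON schemas. The sketch below is illustrative only — the parameter names are assumptions, not necessarily the exact definitions used during training:

```python
# Illustrative OpenAI-style tool schemas; parameter names are assumptions,
# not the exact definitions used during training.
tools = [
    {
        "name": "list_tables",
        "description": "List all tables in a database.",
        "parameters": {
            "type": "object",
            "properties": {"db_id": {"type": "string"}},
            "required": ["db_id"],
        },
    },
    {
        "name": "execute_pipe_sql",
        "description": "Execute a pipe SQL query against the database.",
        "parameters": {
            "type": "object",
            "properties": {
                "db_id": {"type": "string"},
                "query": {"type": "string"},
            },
            "required": ["db_id", "query"],
        },
    },
]

print([t["name"] for t in tools])
```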

## Usage

### Chat Template

The model uses a custom chat template with `<tool_call>` tags for tool invocations:

```
<|im_start|>assistant
Let me explore the database first.
<tool_call>
list_tables({"db_id": "concert_singer"})
</tool_call><|im_end|>
```

Tool responses are formatted as:

```
<|im_start|>user
<tool_response>
Tables in database 'concert_singer':
- stadium
- singer
- concert
- singer_in_concert
</tool_response><|im_end|>
```
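A minimal parser for this format might look like the following stdlib-only sketch (the actual server additionally falls back to parsing bare function calls when the tags come out garbled):

```python
import json
import re

# Matches name({...json args...}) wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*(\w+)\((\{.*?\})\)\s*</tool_call>", re.DOTALL
)

def parse_tool_calls(text: str) -> list[tuple[str, dict]]:
    """Extract (tool_name, arguments) pairs from model output."""
    calls = []
    for name, args in TOOL_CALL_RE.findall(text):
        try:
            calls.append((name, json.loads(args)))
        except json.JSONDecodeError:
            continue  # skip malformed argument payloads
    return calls

output = (
    "Let me explore the database first.\n"
    '<tool_call>\nlist_tables({"db_id": "concert_singer"})\n</tool_call>'
)
print(parse_tool_calls(output))
# → [('list_tables', {'db_id': 'concert_singer'})]
```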

### Inference

For inference with the correct chat template, see the [evaluation server code](https://github.com/nittygritty-zzy/sqlglot/tree/main/pipe_sql/evaluation/server) on GitHub.

## Reproducing the Benchmark

### Prerequisites

- **GPU**: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
- **Python**: 3.11+ with pip/uv
- **Node.js**: 18+ with npm
- **Disk**: ~1 GB for Spider databases, ~3 GB for model weights

### Step 1: Clone the Repository

```bash
git clone https://github.com/nittygritty-zzy/sqlglot.git
cd sqlglot
```

### Step 2: Set Up Python Environment

```bash
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate       # Linux/macOS
# source .venv/Scripts/activate # Windows (Git Bash)

# Install sqlglot (editable)
uv pip install -e .

# Install evaluation server dependencies
uv pip install fastapi uvicorn pydantic

# Install PyTorch with CUDA support
uv pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install model loading dependencies
uv pip install transformers accelerate
```

Verify CUDA:
```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX ...
```

### Step 3: Download Spider 1.0 Dataset

The benchmark uses Spider 1.0 dev set (1,034 questions across 20 SQLite databases).

```bash
# Install gdown for Google Drive downloads
uv pip install gdown

# Download and extract Spider 1.0 (~1 GB)
bash scripts/setup_data.sh
```

Verify:
```bash
ls data/spider/dev.json           # 1,034 questions
ls data/spider/database/ | wc -l  # ~166 databases (20 used by dev set)
```

### Step 4: Download the Model

```bash
# Option A: Use huggingface_hub (recommended)
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('nittygritty-zzy/pipe-sql-1.5b', local_dir='pipe_sql/finetuning_output/merged')
"

# Option B: Use git-lfs
git lfs install
git clone https://huggingface.co/nittygritty-zzy/pipe-sql-1.5b pipe_sql/finetuning_output/merged
```

### Step 5: Install Node.js Agent Dependencies

```bash
cd pipe_sql/evaluation/agent
npm install
cd ../../..
```

### Step 6: Run the Benchmark

#### Option A: Full Pipeline (Recommended)

```bash
# Run all 1,034 questions (takes ~2 hours on RTX 4080)
bash pipe_sql/evaluation/run_all.sh

# Smoke test with 5 questions first
bash pipe_sql/evaluation/run_all.sh --limit 5
```

This script:
1. Starts the Python evaluation server (model inference + tool execution)
2. Waits for the server to be ready
3. Runs the TypeScript agent benchmark
4. Evaluates results and prints execution accuracy

#### Option B: Run Components Separately

**Start the evaluation server:**
```bash
# Default: loads model from pipe_sql/finetuning_output/merged/
python -m pipe_sql.evaluation.server.app

# Custom model path:
MODEL_PATH=path/to/model python -m pipe_sql.evaluation.server.app
```

Wait for `Server ready` in the logs, then in a separate terminal:

**Run the agent benchmark:**
```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts --benchmark           # All 1,034 questions
npx tsx src/main.ts --benchmark --limit 5 # Smoke test
```

**Run single question interactively:**
```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts "How many singers do we have?" concert_singer
```

**Evaluate results:**
```bash
python pipe_sql/evaluation/evaluate.py --results pipe_sql/output/results.json
```

### Step 7: Review Results

Results are saved to `pipe_sql/output/`:

| File | Description |
|------|-------------|
| `results.json` | Agent predictions with conversation traces |
| `eval_results.json` | Per-question evaluation details (match/mismatch/error) |
| `eval_summary.json` | Aggregate metrics |

### Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `MODEL_PATH` | `pipe_sql/finetuning_output/merged` | Path to merged model directory |
| `SPIDER_DB_DIR` | `data/spider/database` | Spider database directory |
| `SPIDER_DIR` | `data/spider` | Spider data directory (contains dev.json) |
| `PORT` | `8000` | Evaluation server port |
| `SERVER_URL` | `http://localhost:8000` | Agent to server connection URL |
| `OUTPUT_DIR` | `pipe_sql/output` | Agent output directory |

### Troubleshooting

**Server fails to load model**: Ensure `pipe_sql/finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If using a different path, set `MODEL_PATH`.

**CUDA out of memory**: The 1.5B model needs ~3 GB VRAM in float16. Close other GPU processes or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.

**Agent produces garbled tool calls**: The 1.5B model sometimes generates garbled special tokens instead of proper `<tool_call>` tags. The inference server includes fallback parsing for bare function calls — this is handled automatically.

**Spider databases not found**: Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.

## Limitations

- Trained and evaluated only on Spider 1.0 (SQLite databases)
- Context window limited to 2,048 tokens during training
- The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags — the inference server includes fallback parsing for bare function calls
- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
- This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Qwen2.5-Coder model license.