---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
tags:
- text-to-sql
- pipe-sql
- sqlglot
- tool-calling
- qwen2
datasets:
- spider
pipeline_tag: text-generation
model-index:
- name: pipe-sql-1.5b
  results:
  - task:
      type: text-to-sql
      name: Text-to-SQL
    dataset:
      type: spider
      name: Spider 1.0 Dev
    metrics:
    - type: execution_accuracy
      value: 60.66
      name: Execution Accuracy
---
# Pipe SQL 1.5B
A fine-tuned [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) model for generating **Pipe SQL** through multi-turn tool-calling conversations.
**GitHub**: [nittygritty-zzy/sqlglot](https://github.com/nittygritty-zzy/sqlglot)
## What is Pipe SQL?
Pipe SQL is a more readable SQL syntax that uses the `|>` (pipe) operator to chain operations in a linear, top-to-bottom flow:
```sql
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC
```
This is transpiled to standard SQL via [sqlglot](https://github.com/tobymao/sqlglot), an open-source SQL parser and transpiler.
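For reference, the pipe query above is equivalent to a conventional `SELECT`. The sketch below builds a toy `employees` table in SQLite and runs that standard-SQL form; the pipe-to-standard translation itself is what the sqlglot fork performs, and the table contents here are made up for illustration:

```python
import sqlite3

# Toy data so the standard-SQL equivalent of the pipe query can be executed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, level TEXT, salary REAL);
INSERT INTO employees VALUES
  ('a', 'Engineering', 'L3', 100), ('b', 'Engineering', 'L3', 120),
  ('c', 'Engineering', 'L4', 200), ('d', 'Sales', 'L3', 90);
""")

# Standard-SQL form of: FROM employees |> WHERE ... |> AGGREGATE ... |> ORDER BY ...
rows = conn.execute("""
SELECT level, AVG(salary) AS avg_salary
FROM employees
WHERE department = 'Engineering'
GROUP BY level
ORDER BY avg_salary DESC
""").fetchall()
print(rows)  # [('L4', 200.0), ('L3', 110.0)]
```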
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | Qwen2.5-Coder-1.5B-Instruct |
| **Architecture** | Qwen2ForCausalLM |
| **Parameters** | 1.5B |
| **Hidden Size** | 1536 |
| **Layers** | 28 |
| **Attention Heads** | 12 (2 KV heads) |
| **Context Length** | 2048 tokens (training) |
## Design Documents
The full design and methodology behind this project are documented in the following design docs (also available in [docs/design/](https://github.com/nittygritty-zzy/sqlglot/tree/main/docs/design) on GitHub):
| Document | Description |
|----------|-------------|
| [Fine-Tuning Design Doc](docs/pipe-sql-fine-tuning-design-doc.md) | End-to-end system design for incremental pipe SQL synthesis and specialized fine-tuning of 1.5B-7B models |
| [Decompiler Design Doc](docs/pipe-sql-decompiler-design-doc.md) | Standard SQL to pipe SQL decompiler — the deterministic data generation component |
| [Validation Loop Design Doc](docs/pipe-sql-validation-loop-design-doc.md) | SQLite round-trip validation and feedback loop to ensure semantic correctness |
| [Training Reproduction Guide](docs/pipe-sql-training-reproduction-guide.md) | Step-by-step guide to reproduce the full training pipeline from scratch |
## Training
The model was fine-tuned using **QLoRA** on multi-turn tool-calling conversations for text-to-SQL generation.
### Training Data
Conversations were generated from the [Spider 1.0](https://yale-lily.github.io/spider) training set, where each conversation follows an agentic workflow:
1. **Explore** the database schema using `list_tables`, `describe_table`, and `sample_data` tools
2. **Write** pipe SQL queries using `execute_pipe_sql` and `validate_pipe_sql` tools
3. **Iterate** based on execution results until the query is correct
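The explore/write/iterate loop can be sketched as follows. Only the tool names come from this card; `model_chat`, the tool implementations, and the stop conditions are placeholders (the real agent is the TypeScript code in the repo):

```python
# Hedged sketch of the agentic workflow: the model alternates between tool
# calls and tool responses until it stops calling tools or turns run out.
def run_agent(question, db_id, tools, model_chat, max_turns=10):
    messages = [{"role": "user", "content": f"[{db_id}] {question}"}]
    last_query = None
    for _ in range(max_turns):
        reply = model_chat(messages)               # assistant turn (placeholder API)
        messages.append({"role": "assistant", "content": reply["content"]})
        call = reply.get("tool_call")              # e.g. ("list_tables", {...})
        if call is None:
            break                                  # no tool call -> agent is done
        name, args = call
        if name == "execute_pipe_sql":
            last_query = args.get("query")         # final query becomes the prediction
        result = tools[name](**args)               # run the tool, feed result back
        messages.append({"role": "user",
                         "content": f"<tool_response>{result}</tool_response>"})
    return last_query
```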
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| **Method** | QLoRA (4-bit NF4) |
| **LoRA rank** | 16 |
| **LoRA alpha** | 32 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Epochs** | 3 |
| **Learning rate** | 2e-4 |
| **LR scheduler** | Cosine |
| **Warmup ratio** | 0.05 |
| **Batch size** | 2 (per device) |
| **Gradient accumulation** | 8 steps |
| **Weight decay** | 0.01 |
| **Loss** | Assistant-only (tool responses masked) |
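In peft/transformers terms, the table above corresponds roughly to the settings below. This is a sketch using common peft and `TrainingArguments` parameter names; the actual training script is in the GitHub repo:

```python
# QLoRA settings from the hyperparameter table, expressed with common
# peft / TrainingArguments key names (sketch, not the repo's exact script).
lora = {"r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"]}
train = {"num_train_epochs": 3, "learning_rate": 2e-4,
         "lr_scheduler_type": "cosine", "warmup_ratio": 0.05,
         "per_device_train_batch_size": 2, "gradient_accumulation_steps": 8,
         "weight_decay": 0.01}

# Effective batch size seen by the optimizer:
effective_batch = (train["per_device_train_batch_size"]
                   * train["gradient_accumulation_steps"])
print(effective_batch)  # 16
```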
## Evaluation Results
Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.
### Execution Accuracy
| Metric | Value |
|--------|-------|
| **Execution Accuracy** | **60.66%** (626 / 1,032) |
| **Prediction Rate** | 99.7% (1,031 / 1,034) |
| **Total Questions** | 1,034 |
| **Gold Errors Excluded** | 2 |
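The headline numbers can be recomputed from the raw counts above; gold errors are dropped from the accuracy denominator:

```python
# Execution accuracy and prediction rate from the counts in the table above.
total, gold_errors, matches, predictions = 1034, 2, 626, 1031
denominator = total - gold_errors                 # 1,032 scorable questions
execution_accuracy = 100 * matches / denominator
prediction_rate = 100 * predictions / total
print(round(execution_accuracy, 2), round(prediction_rate, 1))  # 60.66 99.7
```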
### Context: Spider 1.0 Dev Set SOTA
| Model | Size | EX (Dev) | Method |
|-------|-----:|----------|--------|
| MiniSeek | — | 91.2% | Proprietary |
| DAIL-SQL + GPT-4 + SC | — | 86.6% | In-context learning |
| DIN-SQL + GPT-4 | — | 85.3% | In-context learning |
| SFT CodeS-7B | 7B | 85.4% | Fine-tuned |
| SFT CodeS-3B | 3B | 83.3% | Fine-tuned |
| SFT CodeS-1B | 1B | 77.9% | Fine-tuned |
| **Pipe SQL 1.5B (ours)** | **1.5B** | **60.7%** | **Fine-tuned, agentic tool-calling** |
Our model trails CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy. Sources: [Spider leaderboard](https://yale-lily.github.io/spider), [CodeS (Li et al., 2024)](https://arxiv.org/abs/2402.16347).
### Detailed Breakdown
| Status | Count | % of Total | Description |
|--------|------:|------------|-------------|
| **Match** | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
| **Mismatch** | 209 | 20.2% | SQL executes but results differ from gold |
| **Execution Error** | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
| **Transpile Error** | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
| **No Prediction** | 3 | 0.3% | Agent did not produce a pipe SQL query |
| **Gold Error** | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |
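As a consistency check, the six statuses above partition the full dev set:

```python
# The per-status counts should sum to all 1,034 dev questions.
counts = {"match": 626, "mismatch": 209, "execution_error": 170,
          "transpile_error": 24, "no_prediction": 3, "gold_error": 2}
print(sum(counts.values()))  # 1034
```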
### Evaluation Methodology
1. The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120s timeout)
2. The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
3. Predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
4. Both predicted and gold SQL are executed against the Spider SQLite databases
5. Result sets are compared using order-insensitive set comparison with numeric tolerance
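Step 5 can be sketched as below. The actual comparator is in the repo's evaluation code; the rounding precision used here is an assumption:

```python
# Order-insensitive result-set comparison with numeric tolerance (sketch).
def _normalize(row, ndigits=6):
    # Round floats so tiny numeric differences do not cause false mismatches.
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)

def results_match(pred_rows, gold_rows):
    # Sort by repr to avoid cross-type comparison errors between rows.
    norm = lambda rows: sorted((_normalize(r) for r in rows), key=repr)
    return norm(pred_rows) == norm(gold_rows)

print(results_match([(1, 2.0000001), (2, 3.0)], [(2, 3.0), (1, 2.0)]))  # True
```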
> **Note**: This is an **in-distribution** evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.
## Tools
The model was trained to use 5 tools in a multi-turn conversation:
| Tool | Description |
|------|-------------|
| `list_tables` | List all tables in a database |
| `describe_table` | Get column names, types, and constraints for a table |
| `sample_data` | Retrieve sample rows from a table |
| `execute_pipe_sql` | Execute a pipe SQL query against the database |
| `validate_pipe_sql` | Validate pipe SQL syntax without executing |
## Usage
### Chat Template
The model uses a custom chat template with `<tool_call>` tags for tool invocations:
```
<|im_start|>assistant
Let me explore the database first.
<tool_call>
list_tables({"db_id": "concert_singer"})
</tool_call><|im_end|>
```
Tool responses are formatted as:
```
<|im_start|>user
<tool_response>
Tables in database 'concert_singer':
- stadium
- singer
- concert
- singer_in_concert
</tool_response><|im_end|>
```
### Inference
For inference with the correct chat template, see the [evaluation server code](https://github.com/nittygritty-zzy/sqlglot/tree/main/pipe_sql/evaluation/server) on GitHub.
## Reproducing the Benchmark
### Prerequisites
- **GPU**: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
- **Python**: 3.11+ with pip/uv
- **Node.js**: 18+ with npm
- **Disk**: ~1 GB for Spider databases, ~3 GB for model weights
### Step 1: Clone the Repository
```bash
git clone https://github.com/nittygritty-zzy/sqlglot.git
cd sqlglot
```
### Step 2: Set Up Python Environment
```bash
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate # Linux/macOS
# source .venv/Scripts/activate # Windows (Git Bash)
# Install sqlglot (editable)
uv pip install -e .
# Install evaluation server dependencies
uv pip install fastapi uvicorn pydantic
# Install PyTorch with CUDA support
uv pip install torch --index-url https://download.pytorch.org/whl/cu126
# Install model loading dependencies
uv pip install transformers accelerate
```
Verify CUDA:
```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX ...
```
### Step 3: Download Spider 1.0 Dataset
The benchmark uses Spider 1.0 dev set (1,034 questions across 20 SQLite databases).
```bash
# Install gdown for Google Drive downloads
uv pip install gdown
# Download and extract Spider 1.0 (~1 GB)
bash scripts/setup_data.sh
```
Verify:
```bash
ls data/spider/dev.json # 1,034 questions
ls data/spider/database/ | wc -l # ~166 databases (20 used by dev set)
```
### Step 4: Download the Model
```bash
# Option A: Use huggingface_hub (recommended)
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('nittygritty-zzy/pipe-sql-1.5b', local_dir='pipe_sql/finetuning_output/merged')
"
# Option B: Use git-lfs
git lfs install
git clone https://huggingface.co/nittygritty-zzy/pipe-sql-1.5b pipe_sql/finetuning_output/merged
```
### Step 5: Install Node.js Agent Dependencies
```bash
cd pipe_sql/evaluation/agent
npm install
cd ../../..
```
### Step 6: Run the Benchmark
#### Option A: Full Pipeline (Recommended)
```bash
# Run all 1,034 questions (takes ~2 hours on RTX 4080)
bash pipe_sql/evaluation/run_all.sh
# Smoke test with 5 questions first
bash pipe_sql/evaluation/run_all.sh --limit 5
```
This script:
1. Starts the Python evaluation server (model inference + tool execution)
2. Waits for the server to be ready
3. Runs the TypeScript agent benchmark
4. Evaluates results and prints execution accuracy
#### Option B: Run Components Separately
**Start the evaluation server:**
```bash
# Default: loads model from pipe_sql/finetuning_output/merged/
python -m pipe_sql.evaluation.server.app
# Custom model path:
MODEL_PATH=path/to/model python -m pipe_sql.evaluation.server.app
```
Wait for `Server ready` in the logs, then in a separate terminal:
**Run the agent benchmark:**
```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts --benchmark # All 1,034 questions
npx tsx src/main.ts --benchmark --limit 5 # Smoke test
```
**Run single question interactively:**
```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts "How many singers do we have?" concert_singer
```
**Evaluate results:**
```bash
python pipe_sql/evaluation/evaluate.py --results pipe_sql/output/results.json
```
### Step 7: Review Results
Results are saved to `pipe_sql/output/`:
| File | Description |
|------|-------------|
| `results.json` | Agent predictions with conversation traces |
| `eval_results.json` | Per-question evaluation details (match/mismatch/error) |
| `eval_summary.json` | Aggregate metrics |
### Configuration
| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `MODEL_PATH` | `pipe_sql/finetuning_output/merged` | Path to merged model directory |
| `SPIDER_DB_DIR` | `data/spider/database` | Spider database directory |
| `SPIDER_DIR` | `data/spider` | Spider data directory (contains dev.json) |
| `PORT` | `8000` | Evaluation server port |
| `SERVER_URL` | `http://localhost:8000` | URL the agent uses to reach the evaluation server |
| `OUTPUT_DIR` | `pipe_sql/output` | Agent output directory |
### Troubleshooting
**Server fails to load model**: Ensure `pipe_sql/finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If using a different path, set `MODEL_PATH`.
**CUDA out of memory**: The 1.5B model needs ~3 GB VRAM in float16. Close other GPU processes or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.
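The ~3 GB figure is just the parameter memory: 1.5B parameters at 2 bytes each in float16. Activations and the KV cache add more on top, which is why >= 6 GB is recommended:

```python
# Back-of-envelope float16 weight memory for a 1.5B-parameter model.
params = 1.5e9
bytes_per_param = 2                      # float16
weight_gb = params * bytes_per_param / 1e9
print(weight_gb)  # 3.0 (GB)
```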
**Agent produces garbled tool calls**: The 1.5B model sometimes generates garbled special tokens instead of proper `<tool_call>` tags. The inference server includes fallback parsing for bare function calls — this is handled automatically.
**Spider databases not found**: Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.
## Limitations
- Trained and evaluated only on Spider 1.0 (SQLite databases)
- Context window limited to 2,048 tokens during training
- The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags — the inference server includes fallback parsing for bare function calls
- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
- This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower
## License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Qwen2.5-Coder model license.