# Pipe SQL 1.5B

A fine-tuned Qwen2.5-Coder-1.5B-Instruct model for generating Pipe SQL through multi-turn tool-calling conversations.

GitHub: [nittygritty-zzy/sqlglot](https://github.com/nittygritty-zzy/sqlglot)

## What is Pipe SQL?

Pipe SQL is a more readable SQL syntax that uses the `|>` (pipe) operator to chain operations in a linear, top-to-bottom flow:

```sql
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC
```

This is transpiled to standard SQL via sqlglot, an open-source SQL parser and transpiler.
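To make the mapping to standard SQL concrete, here is a minimal Python sketch of how a query like the one above could be rewritten. This is illustrative only: the project's actual transpiler is sqlglot, and the `pipe_to_sql` helper below handles just the operators shown in the example.

```python
def pipe_to_sql(pipe_query: str) -> str:
    """Rewrite a simple pipe SQL query into standard SQL (illustration only)."""
    # Split into stages: the initial FROM plus each |> operator.
    stages = [s.strip() for s in pipe_query.strip().split("|>")]
    from_clause = stages[0]  # e.g. "FROM employees"
    select = "*"
    where = group_by = order_by = None
    for stage in stages[1:]:
        op, _, rest = stage.partition(" ")
        if op == "WHERE":
            where = rest
        elif op == "AGGREGATE":
            # "AVG(salary) AS avg_salary GROUP BY level"
            agg, _, grp = rest.partition(" GROUP BY ")
            group_by = grp or None
            select = f"{grp}, {agg}" if grp else agg
        elif op == "ORDER":
            order_by = rest.removeprefix("BY ")
    sql = f"SELECT {select} {from_clause}"
    if where:
        sql += f" WHERE {where}"
    if group_by:
        sql += f" GROUP BY {group_by}"
    if order_by:
        sql += f" ORDER BY {order_by}"
    return sql
```

Run on the example query, this produces `SELECT level, AVG(salary) AS avg_salary FROM employees WHERE department = 'Engineering' GROUP BY level ORDER BY avg_salary DESC`.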
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-1.5B-Instruct |
| Architecture | Qwen2ForCausalLM |
| Parameters | 1.5B |
| Hidden Size | 1536 |
| Layers | 28 |
| Attention Heads | 12 (2 KV heads) |
| Context Length | 2048 tokens (training) |
## Design Documents

The full design and methodology behind this project are documented in the following design docs (also available in `docs/design/` on GitHub):
| Document | Description |
|---|---|
| Fine-Tuning Design Doc | End-to-end system design for incremental pipe SQL synthesis and specialized fine-tuning of 1.5B-7B models |
| Decompiler Design Doc | Standard SQL to pipe SQL decompiler — the deterministic data generation component |
| Validation Loop Design Doc | SQLite round-trip validation and feedback loop to ensure semantic correctness |
| Training Reproduction Guide | Step-by-step guide to reproduce the full training pipeline from scratch |
## Training
The model was fine-tuned using QLoRA on multi-turn tool-calling conversations for text-to-SQL generation.
### Training Data
Conversations were generated from the Spider 1.0 training set, where each conversation follows an agentic workflow:
- Explore the database schema using the `list_tables`, `describe_table`, and `sample_data` tools
- Write pipe SQL queries using the `execute_pipe_sql` and `validate_pipe_sql` tools
- Iterate based on execution results until the query is correct
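The workflow above implies conversations shaped roughly like the following sketch. The exact dataset schema is an assumption; the question, database, and tool-call format follow the examples elsewhere in this card.

```python
# Illustrative shape of one training conversation (assumed schema).
# Loss is computed on assistant turns only; tool responses (delivered
# in user turns) are masked out.
conversation = [
    {"role": "user", "content": "How many singers do we have?"},
    {
        "role": "assistant",
        "content": (
            "Let me explore the database first.\n"
            '<tool_call>\nlist_tables({"db_id": "concert_singer"})\n</tool_call>'
        ),
    },
    {
        "role": "user",  # tool response, masked from the loss
        "content": (
            "<tool_response>\nTables in database 'concert_singer':\n"
            "- stadium\n- singer\n- concert\n- singer_in_concert\n</tool_response>"
        ),
    },
    {
        "role": "assistant",
        "content": (
            '<tool_call>\nexecute_pipe_sql({"db_id": "concert_singer", '
            '"query": "FROM singer |> AGGREGATE COUNT(*) AS n"})\n</tool_call>'
        ),
    },
]
```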
### Hyperparameters
| Parameter | Value |
|---|---|
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size | 2 (per device) |
| Gradient accumulation | 8 steps |
| Weight decay | 0.01 |
| Loss | Assistant-only (tool responses masked) |
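The table above corresponds to a `peft`/`bitsandbytes` configuration along the lines of the sketch below. This is not the exact training script; the compute dtype and any option not listed in the table are assumptions.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption, not in the table
)

# LoRA adapters on all attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```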
## Evaluation Results
Evaluated on the Spider 1.0 dev set (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.
### Execution Accuracy
| Metric | Value |
|---|---|
| Execution Accuracy | 60.66% (626 / 1,032) |
| Prediction Rate | 99.7% (1,031 / 1,034) |
| Total Questions | 1,034 |
| Gold Errors Excluded | 2 |
### Context: Spider 1.0 Dev Set SOTA
| Model | Size | EX (Dev) | Method |
|---|---|---|---|
| MiniSeek | — | 91.2% | Proprietary |
| DAIL-SQL + GPT-4 + SC | — | 86.6% | In-context learning |
| DIN-SQL + GPT-4 | — | 85.3% | In-context learning |
| SFT CodeS-7B | 7B | 85.4% | Fine-tuned |
| SFT CodeS-3B | 3B | 83.3% | Fine-tuned |
| SFT CodeS-1B | 1B | 77.9% | Fine-tuned |
| Pipe SQL 1.5B (ours) | 1.5B | 60.7% | Fine-tuned, agentic tool-calling |
Our model trails CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy. Sources: Spider leaderboard, CodeS (Li et al., 2024).
### Detailed Breakdown
| Status | Count | % of Total | Description |
|---|---|---|---|
| Match | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
| Mismatch | 209 | 20.2% | SQL executes but results differ from gold |
| Execution Error | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
| Transpile Error | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
| No Prediction | 3 | 0.3% | Agent did not produce a pipe SQL query |
| Gold Error | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |
### Evaluation Methodology
- The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120 s timeout)
- The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
- Predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
- Both predicted and gold SQL are executed against the Spider SQLite databases
- Result sets are compared using order-insensitive set comparison with numeric tolerance
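The final comparison step can be sketched as below. `rows_match` is a hypothetical helper, not necessarily the logic in `evaluate.py`: it treats result sets as multisets (order-insensitive) and rounds floats so tiny numeric drift does not count as a mismatch.

```python
from collections import Counter

def rows_match(pred_rows, gold_rows, decimals=6):
    """Order-insensitive multiset comparison with numeric tolerance."""
    def normalize(row):
        # Round floats so e.g. AVG computed in a different order still matches.
        return tuple(
            round(value, decimals) if isinstance(value, float) else value
            for value in row
        )
    return Counter(map(normalize, pred_rows)) == Counter(map(normalize, gold_rows))
```

Using a `Counter` (multiset) rather than a `set` keeps duplicate rows significant, which a plain set comparison would silently collapse.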
Note: This is an in-distribution evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.
## Tools
The model was trained to use 5 tools in a multi-turn conversation:
| Tool | Description |
|---|---|
| `list_tables` | List all tables in a database |
| `describe_table` | Get column names, types, and constraints for a table |
| `sample_data` | Retrieve sample rows from a table |
| `execute_pipe_sql` | Execute a pipe SQL query against the database |
| `validate_pipe_sql` | Validate pipe SQL syntax without executing |
## Usage

### Chat Template

The model uses a custom chat template with `<tool_call>` tags for tool invocations:

```
<|im_start|>assistant
Let me explore the database first.
<tool_call>
list_tables({"db_id": "concert_singer"})
</tool_call><|im_end|>
```
Tool responses are formatted as:
```
<|im_start|>user
<tool_response>
Tables in database 'concert_singer':
- stadium
- singer
- concert
- singer_in_concert
</tool_response><|im_end|>
```
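A minimal parser for this tool-call format might look like the following sketch. The actual server-side parsing, including the fallback for bare function calls mentioned under Troubleshooting, lives in the repository; `parse_tool_call` here is illustrative.

```python
import json
import re

# Matches: <tool_call> name({...json args...}) </tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*(\w+)\s*\((\{.*?\})\)\s*</tool_call>", re.DOTALL
)

def parse_tool_call(text):
    """Extract (tool_name, arguments) from assistant output, or None."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    return match.group(1), json.loads(match.group(2))
```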
### Inference
For inference with the correct chat template, see the evaluation server code on GitHub.
## Reproducing the Benchmark
### Prerequisites
- GPU: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
- Python: 3.11+ with pip/uv
- Node.js: 18+ with npm
- Disk: ~1 GB for Spider databases, ~3 GB for model weights
### Step 1: Clone the Repository

```bash
git clone https://github.com/nittygritty-zzy/sqlglot.git
cd sqlglot
```
### Step 2: Set Up Python Environment

```bash
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate        # Linux/macOS
# source .venv/Scripts/activate  # Windows (Git Bash)

# Install sqlglot (editable)
uv pip install -e .

# Install evaluation server dependencies
uv pip install fastapi uvicorn pydantic

# Install PyTorch with CUDA support
uv pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install model loading dependencies
uv pip install transformers accelerate
```
Verify CUDA:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX ...
```
### Step 3: Download Spider 1.0 Dataset

The benchmark uses the Spider 1.0 dev set (1,034 questions across 20 SQLite databases).

```bash
# Install gdown for Google Drive downloads
uv pip install gdown

# Download and extract Spider 1.0 (~1 GB)
bash scripts/setup_data.sh
```

Verify:

```bash
ls data/spider/dev.json           # 1,034 questions
ls data/spider/database/ | wc -l  # ~166 databases (20 used by dev set)
```
### Step 4: Download the Model

```bash
# Option A: Use huggingface_hub (recommended)
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('nittygritty-zzy/pipe-sql-1.5b', local_dir='pipe_sql/finetuning_output/merged')
"

# Option B: Use git-lfs
git lfs install
git clone https://huggingface.co/nittygritty-zzy/pipe-sql-1.5b pipe_sql/finetuning_output/merged
```
### Step 5: Install Node.js Agent Dependencies

```bash
cd pipe_sql/evaluation/agent
npm install
cd ../../..
```
### Step 6: Run the Benchmark

#### Option A: Full Pipeline (Recommended)

```bash
# Run all 1,034 questions (takes ~2 hours on an RTX 4080)
bash pipe_sql/evaluation/run_all.sh

# Smoke test with 5 questions first
bash pipe_sql/evaluation/run_all.sh --limit 5
```
This script:
- Starts the Python evaluation server (model inference + tool execution)
- Waits for the server to be ready
- Runs the TypeScript agent benchmark
- Evaluates results and prints execution accuracy
#### Option B: Run Components Separately

Start the evaluation server:

```bash
# Default: loads model from pipe_sql/finetuning_output/merged/
python -m pipe_sql.evaluation.server.app

# Custom model path:
MODEL_PATH=path/to/model python -m pipe_sql.evaluation.server.app
```

Wait for `Server ready` in the logs, then run the agent benchmark in a separate terminal:

```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts --benchmark            # All 1,034 questions
npx tsx src/main.ts --benchmark --limit 5  # Smoke test
```

Run a single question interactively:

```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts "How many singers do we have?" concert_singer
```

Evaluate results:

```bash
python pipe_sql/evaluation/evaluate.py --results pipe_sql/output/results.json
```
### Step 7: Review Results

Results are saved to `pipe_sql/output/`:

| File | Description |
|---|---|
| `results.json` | Agent predictions with conversation traces |
| `eval_results.json` | Per-question evaluation details (match/mismatch/error) |
| `eval_summary.json` | Aggregate metrics |
## Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `pipe_sql/finetuning_output/merged` | Path to the merged model directory |
| `SPIDER_DB_DIR` | `data/spider/database` | Spider database directory |
| `SPIDER_DIR` | `data/spider` | Spider data directory (contains `dev.json`) |
| `PORT` | `8000` | Evaluation server port |
| `SERVER_URL` | `http://localhost:8000` | Agent-to-server connection URL |
| `OUTPUT_DIR` | `pipe_sql/output` | Agent output directory |
## Troubleshooting

**Server fails to load model:** Ensure `pipe_sql/finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If using a different path, set `MODEL_PATH`.

**CUDA out of memory:** The 1.5B model needs ~3 GB of VRAM in float16. Close other GPU processes or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.

**Agent produces garbled tool calls:** The 1.5B model sometimes generates garbled special tokens instead of proper `<tool_call>` tags. The inference server includes fallback parsing for bare function calls, so this is handled automatically.

**Spider databases not found:** Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.
## Limitations

- Trained and evaluated only on Spider 1.0 (SQLite databases)
- Context window limited to 2,048 tokens during training
- The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags; the inference server includes fallback parsing for bare function calls
- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
- This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower
## License

This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-Coder model license.