# Pipe SQL 1.5B

A fine-tuned Qwen2.5-Coder-1.5B-Instruct model for generating Pipe SQL through multi-turn tool-calling conversations.

GitHub: [nittygritty-zzy/sqlglot](https://github.com/nittygritty-zzy/sqlglot)

## What is Pipe SQL?

Pipe SQL is a more readable SQL syntax that uses the `|>` (pipe) operator to chain operations in a linear, top-to-bottom flow:

```sql
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC
```

This is transpiled to standard SQL via sqlglot, an open-source SQL parser and transpiler.
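To make the mapping to standard SQL concrete, here is a minimal Python sketch of how a query like the one above could be rewritten. This is illustrative only: the project's actual transpiler is sqlglot, and the `pipe_to_sql` helper below handles just the operators shown in the example.

```python
def pipe_to_sql(pipe_query: str) -> str:
    """Rewrite a simple pipe SQL query into standard SQL (illustration only)."""
    # Split into stages: the initial FROM plus each |> operator.
    stages = [s.strip() for s in pipe_query.strip().split("|>")]
    from_clause = stages[0]  # e.g. "FROM employees"
    select = "*"
    where = group_by = order_by = None
    for stage in stages[1:]:
        op, _, rest = stage.partition(" ")
        if op == "WHERE":
            where = rest
        elif op == "AGGREGATE":
            # "AVG(salary) AS avg_salary GROUP BY level"
            agg, _, grp = rest.partition(" GROUP BY ")
            group_by = grp or None
            select = f"{grp}, {agg}" if grp else agg
        elif op == "ORDER":
            order_by = rest.removeprefix("BY ")
    sql = f"SELECT {select} {from_clause}"
    if where:
        sql += f" WHERE {where}"
    if group_by:
        sql += f" GROUP BY {group_by}"
    if order_by:
        sql += f" ORDER BY {order_by}"
    return sql
```

Run on the example query, this produces `SELECT level, AVG(salary) AS avg_salary FROM employees WHERE department = 'Engineering' GROUP BY level ORDER BY avg_salary DESC`.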
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-1.5B-Instruct |
| Architecture | Qwen2ForCausalLM |
| Parameters | 1.5B |
| Hidden Size | 1536 |
| Layers | 28 |
| Attention Heads | 12 (2 KV heads) |
| Context Length | 2048 tokens (training) |
## Design Documents

The full design and methodology behind this project are documented in the following design docs (also available in `docs/design/` on GitHub):
| Document | Description |
|---|---|
| Fine-Tuning Design Doc | End-to-end system design for incremental pipe SQL synthesis and specialized fine-tuning of 1.5B-7B models |
| Decompiler Design Doc | Standard SQL to pipe SQL decompiler — the deterministic data generation component |
| Validation Loop Design Doc | SQLite round-trip validation and feedback loop to ensure semantic correctness |
| Training Reproduction Guide | Step-by-step guide to reproduce the full training pipeline from scratch |
## Training
The model was fine-tuned using QLoRA on multi-turn tool-calling conversations for text-to-SQL generation.
### Training Data
Conversations were generated from the Spider 1.0 training set, where each conversation follows an agentic workflow:
- Explore the database schema using the `list_tables`, `describe_table`, and `sample_data` tools
- Write pipe SQL queries using the `execute_pipe_sql` and `validate_pipe_sql` tools
- Iterate based on execution results until the query is correct
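The workflow above implies conversations shaped roughly like the following sketch. The exact dataset schema is an assumption; the question, database, and tool-call format follow the examples elsewhere in this card.

```python
# Illustrative shape of one training conversation (assumed schema).
# Loss is computed on assistant turns only; tool responses (delivered
# in user turns) are masked out.
conversation = [
    {"role": "user", "content": "How many singers do we have?"},
    {
        "role": "assistant",
        "content": (
            "Let me explore the database first.\n"
            '<tool_call>\nlist_tables({"db_id": "concert_singer"})\n</tool_call>'
        ),
    },
    {
        "role": "user",  # tool response, masked from the loss
        "content": (
            "<tool_response>\nTables in database 'concert_singer':\n"
            "- stadium\n- singer\n- concert\n- singer_in_concert\n</tool_response>"
        ),
    },
    {
        "role": "assistant",
        "content": (
            '<tool_call>\nexecute_pipe_sql({"db_id": "concert_singer", '
            '"query": "FROM singer |> AGGREGATE COUNT(*) AS n"})\n</tool_call>'
        ),
    },
]
```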
### Hyperparameters
| Parameter | Value |
|---|---|
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size | 2 (per device) |
| Gradient accumulation | 8 steps |
| Weight decay | 0.01 |
| Loss | Assistant-only (tool responses masked) |
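The table above corresponds to a `peft`/`bitsandbytes` configuration along the lines of the sketch below. This is not the exact training script; the compute dtype and any option not listed in the table are assumptions.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption, not in the table
)

# LoRA adapters on all attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```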
## Evaluation Results
Evaluated on the Spider 1.0 dev set (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.
### Execution Accuracy
| Metric | Value |
|---|---|
| Execution Accuracy | 60.66% (626 / 1,032) |
| Prediction Rate | 99.7% (1,031 / 1,034) |
| Total Questions | 1,034 |
| Gold Errors Excluded | 2 |
### Context: Spider 1.0 Dev Set SOTA
| Model | Size | EX (Dev) | Method |
|---|---|---|---|
| MiniSeek | — | 91.2% | Proprietary |
| DAIL-SQL + GPT-4 + SC | — | 86.6% | In-context learning |
| DIN-SQL + GPT-4 | — | 85.3% | In-context learning |
| SFT CodeS-7B | 7B | 85.4% | Fine-tuned |
| SFT CodeS-3B | 3B | 83.3% | Fine-tuned |
| SFT CodeS-1B | 1B | 77.9% | Fine-tuned |
| Pipe SQL 1.5B (ours) | 1.5B | 60.7% | Fine-tuned, agentic tool-calling |
Our model trails CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy. Sources: Spider leaderboard, CodeS (Li et al., 2024).
### Detailed Breakdown
| Status | Count | % of Total | Description |
|---|---|---|---|
| Match | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
| Mismatch | 209 | 20.2% | SQL executes but results differ from gold |
| Execution Error | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
| Transpile Error | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
| No Prediction | 3 | 0.3% | Agent did not produce a pipe SQL query |
| Gold Error | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |
### Evaluation Methodology
- The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120 s timeout)
- The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
- Predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
- Both predicted and gold SQL are executed against the Spider SQLite databases
- Result sets are compared using order-insensitive set comparison with numeric tolerance
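The final comparison step can be sketched as below. `rows_match` is a hypothetical helper, not necessarily the logic in `evaluate.py`: it treats result sets as multisets (order-insensitive) and rounds floats so tiny numeric drift does not count as a mismatch.

```python
from collections import Counter

def rows_match(pred_rows, gold_rows, decimals=6):
    """Order-insensitive multiset comparison with numeric tolerance."""
    def normalize(row):
        # Round floats so e.g. AVG computed in a different order still matches.
        return tuple(
            round(value, decimals) if isinstance(value, float) else value
            for value in row
        )
    return Counter(map(normalize, pred_rows)) == Counter(map(normalize, gold_rows))
```

Using a `Counter` (multiset) rather than a `set` keeps duplicate rows significant, which a plain set comparison would silently collapse.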
Note: This is an in-distribution evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.
## Tools
The model was trained to use 5 tools in a multi-turn conversation:
| Tool | Description |
|---|---|
| `list_tables` | List all tables in a database |
| `describe_table` | Get column names, types, and constraints for a table |
| `sample_data` | Retrieve sample rows from a table |
| `execute_pipe_sql` | Execute a pipe SQL query against the database |
| `validate_pipe_sql` | Validate pipe SQL syntax without executing |
## Usage

### Chat Template

The model uses a custom chat template with `<tool_call>` tags for tool invocations:

```
<|im_start|>assistant
Let me explore the database first.
<tool_call>
list_tables({"db_id": "concert_singer"})
</tool_call><|im_end|>
```
Tool responses are formatted as:
```
<|im_start|>user
<tool_response>
Tables in database 'concert_singer':
- stadium
- singer
- concert
- singer_in_concert
</tool_response><|im_end|>
```
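A minimal parser for this tool-call format might look like the following sketch. The actual server-side parsing, including the fallback for bare function calls mentioned under Troubleshooting, lives in the repository; `parse_tool_call` here is illustrative.

```python
import json
import re

# Matches: <tool_call> name({...json args...}) </tool_call>
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*(\w+)\s*\((\{.*?\})\)\s*</tool_call>", re.DOTALL
)

def parse_tool_call(text):
    """Extract (tool_name, arguments) from assistant output, or None."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    return match.group(1), json.loads(match.group(2))
```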
### Inference
For inference with the correct chat template, see the evaluation server code on GitHub.
## Reproducing the Benchmark
### Prerequisites
- GPU: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
- Python: 3.11+ with pip/uv
- Node.js: 18+ with npm
- Disk: ~1 GB for Spider databases, ~3 GB for model weights
### Step 1: Clone the Repository

```bash
git clone https://github.com/nittygritty-zzy/sqlglot.git
cd sqlglot
```
### Step 2: Set Up Python Environment

```bash
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate        # Linux/macOS
# source .venv/Scripts/activate  # Windows (Git Bash)

# Install sqlglot (editable)
uv pip install -e .

# Install evaluation server dependencies
uv pip install fastapi uvicorn pydantic

# Install PyTorch with CUDA support
uv pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install model loading dependencies
uv pip install transformers accelerate
```
Verify CUDA:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX ...
```
### Step 3: Download Spider 1.0 Dataset

The benchmark uses the Spider 1.0 dev set (1,034 questions across 20 SQLite databases).

```bash
# Install gdown for Google Drive downloads
uv pip install gdown

# Download and extract Spider 1.0 (~1 GB)
bash scripts/setup_data.sh
```

Verify:

```bash
ls data/spider/dev.json           # 1,034 questions
ls data/spider/database/ | wc -l  # ~166 databases (20 used by dev set)
```
### Step 4: Download the Model

```bash
# Option A: Use huggingface_hub (recommended)
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('nittygritty-zzy/pipe-sql-1.5b', local_dir='pipe_sql/finetuning_output/merged')
"

# Option B: Use git-lfs
git lfs install
git clone https://huggingface.co/nittygritty-zzy/pipe-sql-1.5b pipe_sql/finetuning_output/merged
```
### Step 5: Install Node.js Agent Dependencies

```bash
cd pipe_sql/evaluation/agent
npm install
cd ../../..
```
### Step 6: Run the Benchmark

#### Option A: Full Pipeline (Recommended)

```bash
# Run all 1,034 questions (takes ~2 hours on an RTX 4080)
bash pipe_sql/evaluation/run_all.sh

# Smoke test with 5 questions first
bash pipe_sql/evaluation/run_all.sh --limit 5
```
This script:
- Starts the Python evaluation server (model inference + tool execution)
- Waits for the server to be ready
- Runs the TypeScript agent benchmark
- Evaluates results and prints execution accuracy
#### Option B: Run Components Separately

Start the evaluation server:

```bash
# Default: loads model from pipe_sql/finetuning_output/merged/
python -m pipe_sql.evaluation.server.app

# Custom model path:
MODEL_PATH=path/to/model python -m pipe_sql.evaluation.server.app
```

Wait for `Server ready` in the logs, then run the agent benchmark in a separate terminal:

```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts --benchmark            # All 1,034 questions
npx tsx src/main.ts --benchmark --limit 5  # Smoke test
```

Run a single question interactively:

```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts "How many singers do we have?" concert_singer
```

Evaluate results:

```bash
python pipe_sql/evaluation/evaluate.py --results pipe_sql/output/results.json
```
### Step 7: Review Results

Results are saved to `pipe_sql/output/`:

| File | Description |
|---|---|
| `results.json` | Agent predictions with conversation traces |
| `eval_results.json` | Per-question evaluation details (match/mismatch/error) |
| `eval_summary.json` | Aggregate metrics |
## Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `pipe_sql/finetuning_output/merged` | Path to the merged model directory |
| `SPIDER_DB_DIR` | `data/spider/database` | Spider database directory |
| `SPIDER_DIR` | `data/spider` | Spider data directory (contains `dev.json`) |
| `PORT` | `8000` | Evaluation server port |
| `SERVER_URL` | `http://localhost:8000` | Agent-to-server connection URL |
| `OUTPUT_DIR` | `pipe_sql/output` | Agent output directory |
## Troubleshooting

**Server fails to load model:** Ensure `pipe_sql/finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If using a different path, set `MODEL_PATH`.

**CUDA out of memory:** The 1.5B model needs ~3 GB of VRAM in float16. Close other GPU processes or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.

**Agent produces garbled tool calls:** The 1.5B model sometimes generates garbled special tokens instead of proper `<tool_call>` tags. The inference server includes fallback parsing for bare function calls, so this is handled automatically.

**Spider databases not found:** Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.
## Limitations

- Trained and evaluated only on Spider 1.0 (SQLite databases)
- Context window limited to 2,048 tokens during training
- The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags; the inference server includes fallback parsing for bare function calls
- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
- This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower
## License

This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-Coder model license.