Pipe SQL 1.5B

A fine-tuned Qwen2.5-Coder-1.5B-Instruct model for generating Pipe SQL through multi-turn tool-calling conversations.

GitHub: nittygritty-zzy/sqlglot

What is Pipe SQL?

Pipe SQL is a more readable SQL syntax that uses the |> (pipe) operator to chain operations in a linear, top-to-bottom flow:

FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC

This is transpiled to standard SQL via sqlglot, an open-source SQL parser and transpiler.
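To make the mapping concrete, here is a toy rewriter for the four operators used above. This is only an illustrative sketch, not sqlglot's actual implementation; it handles just this simple linear pipeline:

```python
# Toy illustration of how pipe stages map onto SQL clauses.
# NOT the real sqlglot transpiler -- a sketch covering only
# FROM / WHERE / AGGREGATE / ORDER BY in a linear pipeline.

def pipe_to_sql(pipe_query: str) -> str:
    """Rewrite a simple pipe SQL query into a standard SELECT statement."""
    stages = [s.strip() for s in pipe_query.strip().split("|>")]
    select, from_, where, group_by, order_by = "*", "", "", "", ""
    for stage in stages:
        op, _, rest = stage.partition(" ")
        if op == "FROM":
            from_ = rest
        elif op == "WHERE":
            where = rest
        elif op == "AGGREGATE":
            # "AGGREGATE <exprs> GROUP BY <cols>" -> SELECT cols, exprs ... GROUP BY cols
            exprs, _, cols = rest.partition(" GROUP BY ")
            select = f"{cols}, {exprs}" if cols else exprs
            group_by = cols
        elif op == "ORDER":
            order_by = rest.removeprefix("BY ")
    sql = f"SELECT {select} FROM {from_}"
    if where:
        sql += f" WHERE {where}"
    if group_by:
        sql += f" GROUP BY {group_by}"
    if order_by:
        sql += f" ORDER BY {order_by}"
    return sql
```

Each pipe stage contributes one clause, which is why the pipe form reads top-to-bottom while the SQL form scatters the same clauses around the SELECT.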

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-1.5B-Instruct |
| Architecture | Qwen2ForCausalLM |
| Parameters | 1.5B |
| Hidden Size | 1536 |
| Layers | 28 |
| Attention Heads | 12 (2 KV heads) |
| Context Length | 2048 tokens (training) |

Design Documents

The full design and methodology behind this project are documented in the following design docs (also available in docs/design/ on GitHub):

| Document | Description |
|---|---|
| Fine-Tuning Design Doc | End-to-end system design for incremental pipe SQL synthesis and specialized fine-tuning of 1.5B-7B models |
| Decompiler Design Doc | Standard SQL to pipe SQL decompiler, the deterministic data-generation component |
| Validation Loop Design Doc | SQLite round-trip validation and feedback loop to ensure semantic correctness |
| Training Reproduction Guide | Step-by-step guide to reproduce the full training pipeline from scratch |

Training

The model was fine-tuned using QLoRA on multi-turn tool-calling conversations for text-to-SQL generation.

Training Data

Training conversations were generated from the Spider 1.0 training set; each conversation follows an agentic workflow:

  1. Explore the database schema using list_tables, describe_table, and sample_data tools
  2. Write pipe SQL queries using execute_pipe_sql and validate_pipe_sql tools
  3. Iterate based on execution results until the query is correct
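The three steps above can be sketched as a multi-turn tool-calling loop. The message format and tool implementations below are illustrative stand-ins, not the project's actual code:

```python
import json
import re

# Sketch of the agentic workflow above. The tool implementations and
# message format here are illustrative stand-ins for the real ones.

def parse_tool_call(text):
    """Extract ("tool_name", {args}) from a <tool_call> block, or None."""
    m = re.search(r"<tool_call>\s*(\w+)\((\{.*?\})\)\s*</tool_call>", text, re.S)
    return (m.group(1), json.loads(m.group(2))) if m else None

def run_agent(question, model, tools, max_turns=10):
    """Drive the multi-turn loop until the model stops calling tools."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = model(messages)                   # assistant turn
        messages.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:                          # no tool call: agent is done
            break
        name, args = call
        result = tools[name](**args)              # execute the tool
        messages.append({"role": "tool", "content": result})
    return messages
```

During training, whole conversations of this shape become the supervised examples; at evaluation time the same loop runs live against the Spider databases.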

Hyperparameters

| Parameter | Value |
|---|---|
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size | 2 (per device) |
| Gradient accumulation | 8 steps |
| Weight decay | 0.01 |
| Loss | Assistant-only (tool responses masked) |
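The assistant-only loss works by masking the labels of every token that the model did not produce itself, so user prompts and tool responses contribute nothing to the gradient. A minimal sketch (token ids and the role bookkeeping are illustrative; a real implementation operates on the tokenized chat template):

```python
# Sketch of assistant-only loss masking: labels for user/tool tokens are set
# to -100, the ignore_index convention of PyTorch CrossEntropyLoss, so the
# loss is computed only on tokens the assistant generated.
IGNORE_INDEX = -100

def mask_labels(token_ids, roles):
    """roles[i] is the role that produced token_ids[i]; train only on 'assistant'."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]
```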

Evaluation Results

Evaluated on the Spider 1.0 dev set (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.

Execution Accuracy

| Metric | Value |
|---|---|
| Execution Accuracy | 60.66% (626 / 1,032) |
| Prediction Rate | 99.7% (1,031 / 1,034) |
| Total Questions | 1,034 |
| Gold Errors Excluded | 2 |

Context: Spider 1.0 Dev Set SOTA

| Model | Size | EX (Dev) | Method |
|---|---|---|---|
| MiniSeek | — | 91.2% | Proprietary |
| DAIL-SQL + GPT-4 + SC | — | 86.6% | In-context learning |
| DIN-SQL + GPT-4 | — | 85.3% | In-context learning |
| SFT CodeS-7B | 7B | 85.4% | Fine-tuned |
| SFT CodeS-3B | 3B | 83.3% | Fine-tuned |
| SFT CodeS-1B | 1B | 77.9% | Fine-tuned |
| Pipe SQL 1.5B (ours) | 1.5B | 60.7% | Fine-tuned, agentic tool-calling |

Our model trails CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy. Sources: Spider leaderboard, CodeS (Li et al., 2024).

Detailed Breakdown

| Status | Count | % of Total | Description |
|---|---|---|---|
| Match | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
| Mismatch | 209 | 20.2% | SQL executes but results differ from gold |
| Execution Error | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
| Transpile Error | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
| No Prediction | 3 | 0.3% | Agent did not produce a pipe SQL query |
| Gold Error | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |
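A quick arithmetic check ties the breakdown back to the headline numbers (gold errors leave the denominator; only "no prediction" counts against the prediction rate):

```python
# Sanity-check the reported metrics against the breakdown counts.
counts = {
    "match": 626, "mismatch": 209, "execution_error": 170,
    "transpile_error": 24, "no_prediction": 3, "gold_error": 2,
}
total = sum(counts.values())              # all dev questions
scored = total - counts["gold_error"]     # gold errors excluded from denominator
accuracy = counts["match"] / scored       # execution accuracy
prediction_rate = (total - counts["no_prediction"]) / total
```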

Evaluation Methodology

  1. The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120s timeout)
  2. The agent's final execute_pipe_sql call is extracted as the predicted pipe SQL
  3. Predicted pipe SQL is transpiled to standard SQL using sqlglot.transpile()
  4. Both predicted and gold SQL are executed against the Spider SQLite databases
  5. Result sets are compared using order-insensitive set comparison with numeric tolerance
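Step 5 can be sketched as a multiset comparison over rounded rows. This is only an illustration of the idea; the actual evaluate.py may differ in its tolerance and type handling:

```python
from collections import Counter

# Sketch of order-insensitive result comparison with numeric tolerance.
# Illustrative only; the real evaluate.py may differ in details.
def normalize(row, ndigits=4):
    """Round floats so tiny numeric differences don't cause false mismatches."""
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)

def results_match(pred_rows, gold_rows):
    """Compare result sets as multisets of normalized rows (order-insensitive)."""
    return Counter(map(normalize, pred_rows)) == Counter(map(normalize, gold_rows))
```

Using a Counter rather than a set keeps duplicate rows significant, which matters for queries without DISTINCT.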

Note: This is an in-distribution evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.

Tools

The model was trained to use 5 tools in a multi-turn conversation:

| Tool | Description |
|---|---|
| list_tables | List all tables in a database |
| describe_table | Get column names, types, and constraints for a table |
| sample_data | Retrieve sample rows from a table |
| execute_pipe_sql | Execute a pipe SQL query against the database |
| validate_pipe_sql | Validate pipe SQL syntax without executing |

Usage

Chat Template

The model uses a custom chat template with <tool_call> tags for tool invocations:

<|im_start|>assistant
Let me explore the database first.
<tool_call>
list_tables({"db_id": "concert_singer"})
</tool_call><|im_end|>

Tool responses are formatted as:

<|im_start|>user
<tool_response>
Tables in database 'concert_singer':
- stadium
- singer
- concert
- singer_in_concert
</tool_response><|im_end|>
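A minimal parser for the assistant's tool-call format might look like the following sketch. It also accepts bare function calls without <tool_call> tags, mirroring the fallback behavior noted under Troubleshooting; the evaluation server's actual parser may differ:

```python
import json
import re

# Sketch of extracting a tool invocation from an assistant turn.
# Accepts both <tool_call>-wrapped and bare "name({...})" calls.
CALL_RE = re.compile(r"(\w+)\((\{.*?\})\)", re.S)

def extract_tool_call(text: str):
    """Return (tool_name, args_dict), or None if no call is found."""
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.S)
    body = m.group(1) if m else text   # fall back to scanning the whole turn
    call = CALL_RE.search(body)
    if call is None:
        return None
    return call.group(1), json.loads(call.group(2))
```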

Inference

For inference with the correct chat template, see the evaluation server code on GitHub.

Reproducing the Benchmark

Prerequisites

  • GPU: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
  • Python: 3.11+ with pip/uv
  • Node.js: 18+ with npm
  • Disk: ~1 GB for Spider databases, ~3 GB for model weights

Step 1: Clone the Repository

git clone https://github.com/nittygritty-zzy/sqlglot.git
cd sqlglot

Step 2: Set Up Python Environment

# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate       # Linux/macOS
# source .venv/Scripts/activate # Windows (Git Bash)

# Install sqlglot (editable)
uv pip install -e .

# Install evaluation server dependencies
uv pip install fastapi uvicorn pydantic

# Install PyTorch with CUDA support
uv pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install model loading dependencies
uv pip install transformers accelerate

Verify CUDA:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX ...

Step 3: Download Spider 1.0 Dataset

The benchmark uses Spider 1.0 dev set (1,034 questions across 20 SQLite databases).

# Install gdown for Google Drive downloads
uv pip install gdown

# Download and extract Spider 1.0 (~1 GB)
bash scripts/setup_data.sh

Verify:

ls data/spider/dev.json           # 1,034 questions
ls data/spider/database/ | wc -l  # ~166 databases (20 used by dev set)

Step 4: Download the Model

# Option A: Use huggingface_hub (recommended)
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('nittygritty-zzy/pipe-sql-1.5b', local_dir='pipe_sql/finetuning_output/merged')
"

# Option B: Use git-lfs
git lfs install
git clone https://huggingface.co/nittygritty-zzy/pipe-sql-1.5b pipe_sql/finetuning_output/merged

Step 5: Install Node.js Agent Dependencies

cd pipe_sql/evaluation/agent
npm install
cd ../../..

Step 6: Run the Benchmark

Option A: Full Pipeline (Recommended)

# Run all 1,034 questions (takes ~2 hours on RTX 4080)
bash pipe_sql/evaluation/run_all.sh

# Smoke test with 5 questions first
bash pipe_sql/evaluation/run_all.sh --limit 5

This script:

  1. Starts the Python evaluation server (model inference + tool execution)
  2. Waits for the server to be ready
  3. Runs the TypeScript agent benchmark
  4. Evaluates results and prints execution accuracy

Option B: Run Components Separately

Start the evaluation server:

# Default: loads model from pipe_sql/finetuning_output/merged/
python -m pipe_sql.evaluation.server.app

# Custom model path:
MODEL_PATH=path/to/model python -m pipe_sql.evaluation.server.app

Wait for "Server ready" in the logs, then run the agent benchmark in a separate terminal:

cd pipe_sql/evaluation/agent
npx tsx src/main.ts --benchmark           # All 1,034 questions
npx tsx src/main.ts --benchmark --limit 5 # Smoke test

Run single question interactively:

cd pipe_sql/evaluation/agent
npx tsx src/main.ts "How many singers do we have?" concert_singer

Evaluate results:

python pipe_sql/evaluation/evaluate.py --results pipe_sql/output/results.json

Step 7: Review Results

Results are saved to pipe_sql/output/:

| File | Description |
|---|---|
| results.json | Agent predictions with conversation traces |
| eval_results.json | Per-question evaluation details (match/mismatch/error) |
| eval_summary.json | Aggregate metrics |

Configuration

| Environment Variable | Default | Description |
|---|---|---|
| MODEL_PATH | pipe_sql/finetuning_output/merged | Path to the merged model directory |
| SPIDER_DB_DIR | data/spider/database | Spider database directory |
| SPIDER_DIR | data/spider | Spider data directory (contains dev.json) |
| PORT | 8000 | Evaluation server port |
| SERVER_URL | http://localhost:8000 | URL the agent uses to reach the server |
| OUTPUT_DIR | pipe_sql/output | Agent output directory |
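The variables and defaults above suggest a configuration pattern like this sketch (how the server and agent actually read them may differ):

```python
import os

# Illustrative sketch of reading the documented configuration, with the
# defaults from the table above. Not the project's actual config code.
MODEL_PATH = os.environ.get("MODEL_PATH", "pipe_sql/finetuning_output/merged")
PORT = int(os.environ.get("PORT", "8000"))
SERVER_URL = os.environ.get("SERVER_URL", f"http://localhost:{PORT}")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "pipe_sql/output")
```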

Troubleshooting

Server fails to load model: Ensure pipe_sql/finetuning_output/merged/ contains config.json, model.safetensors, and tokenizer.json. If using a different path, set MODEL_PATH.

CUDA out of memory: The 1.5B model needs ~3 GB VRAM in float16. Close other GPU processes or use CUDA_VISIBLE_DEVICES=0 to select a specific GPU.

Agent produces garbled tool calls: The 1.5B model sometimes generates garbled special tokens instead of proper <tool_call> tags. The inference server includes fallback parsing for bare function calls — this is handled automatically.

Spider databases not found: Run bash scripts/setup_data.sh to download Spider 1.0. The script downloads from Google Drive via gdown.

Limitations

  • Trained and evaluated only on Spider 1.0 (SQLite databases)
  • Context window limited to 2,048 tokens during training
  • The 1.5B model may generate garbled special tokens instead of proper <tool_call> tags — the inference server includes fallback parsing for bare function calls
  • Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
  • This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower

License

This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-Coder model license.
