Update model card: add GitHub link, design docs, and benchmark setup guide

7e7485e verified 9 days ago

13.1 kB

	---
	language:
	- en
	license: apache-2.0
	base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
	tags:
	- text-to-sql
	- pipe-sql
	- sqlglot
	- tool-calling
	- qwen2
	datasets:
	- spider
	pipeline_tag: text-generation
	model-index:
	- name: pipe-sql-1.5b
	results:
	- task:
	type: text-to-sql
	name: Text-to-SQL
	dataset:
	type: spider
	name: Spider 1.0 Dev
	metrics:
	- type: execution_accuracy
	value: 60.66
	name: Execution Accuracy
	---

	# Pipe SQL 1.5B

	A fine-tuned [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) model for generating Pipe SQL through multi-turn tool-calling conversations.

	GitHub: [nittygritty-zzy/sqlglot](https://github.com/nittygritty-zzy/sqlglot)

	## What is Pipe SQL?

	Pipe SQL is a more readable SQL syntax that uses the `\|>` (pipe) operator to chain operations in a linear, top-to-bottom flow:

	```sql
	FROM employees
	\|> WHERE department = 'Engineering'
	\|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
	\|> ORDER BY avg_salary DESC
	```

	This is transpiled to standard SQL via [sqlglot](https://github.com/tobymao/sqlglot), an open-source SQL parser and transpiler.

	## Model Details

	\| Property \| Value \|
	\|----------\|-------\|
	\| Base Model \| Qwen2.5-Coder-1.5B-Instruct \|
	\| Architecture \| Qwen2ForCausalLM \|
	\| Parameters \| 1.5B \|
	\| Hidden Size \| 1536 \|
	\| Layers \| 28 \|
	\| Attention Heads \| 12 (2 KV heads) \|
	\| Context Length \| 2048 tokens (training) \|

	## Design Documents

	The full design and methodology behind this project is documented in the following design docs (also available in [docs/design/](https://github.com/nittygritty-zzy/sqlglot/tree/main/docs/design) on GitHub):

	\| Document \| Description \|
	\|----------\|-------------\|
	\| [Fine-Tuning Design Doc](docs/pipe-sql-fine-tuning-design-doc.md) \| End-to-end system design for incremental pipe SQL synthesis and specialized fine-tuning of 1.5B-7B models \|
	\| [Decompiler Design Doc](docs/pipe-sql-decompiler-design-doc.md) \| Standard SQL to pipe SQL decompiler — the deterministic data generation component \|
	\| [Validation Loop Design Doc](docs/pipe-sql-validation-loop-design-doc.md) \| SQLite round-trip validation and feedback loop to ensure semantic correctness \|
	\| [Training Reproduction Guide](docs/pipe-sql-training-reproduction-guide.md) \| Step-by-step guide to reproduce the full training pipeline from scratch \|

	## Training

	The model was fine-tuned using QLoRA on multi-turn tool-calling conversations for text-to-SQL generation.

	### Training Data

	Conversations were generated from the [Spider 1.0](https://yale-lily.github.io/spider) training set, where each conversation follows an agentic workflow:
	1. Explore the database schema using `list_tables`, `describe_table`, and `sample_data` tools
	2. Write pipe SQL queries using `execute_pipe_sql` and `validate_pipe_sql` tools
	3. Iterate based on execution results until the query is correct

	### Hyperparameters

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Method \| QLoRA (4-bit NF4) \|
	\| LoRA rank \| 16 \|
	\| LoRA alpha \| 32 \|
	\| LoRA dropout \| 0.05 \|
	\| Target modules \| q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj \|
	\| Epochs \| 3 \|
	\| Learning rate \| 2e-4 \|
	\| LR scheduler \| Cosine \|
	\| Warmup ratio \| 0.05 \|
	\| Batch size \| 2 (per device) \|
	\| Gradient accumulation \| 8 steps \|
	\| Weight decay \| 0.01 \|
	\| Loss \| Assistant-only (tool responses masked) \|

	## Evaluation Results

	Evaluated on the Spider 1.0 dev set (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.

	### Execution Accuracy

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Execution Accuracy \| 60.66% (626 / 1,032) \|
	\| Prediction Rate \| 99.7% (1,031 / 1,034) \|
	\| Total Questions \| 1,034 \|
	\| Gold Errors Excluded \| 2 \|

	### Context: Spider 1.0 Dev Set SOTA

	\| Model \| Size \| EX (Dev) \| Method \|
	\|-------\|-----:\|----------\|--------\|
	\| MiniSeek \| — \| 91.2% \| Proprietary \|
	\| DAIL-SQL + GPT-4 + SC \| — \| 86.6% \| In-context learning \|
	\| DIN-SQL + GPT-4 \| — \| 85.3% \| In-context learning \|
	\| SFT CodeS-7B \| 7B \| 85.4% \| Fine-tuned \|
	\| SFT CodeS-3B \| 3B \| 83.3% \| Fine-tuned \|
	\| SFT CodeS-1B \| 1B \| 77.9% \| Fine-tuned \|
	\| Pipe SQL 1.5B (ours) \| 1.5B \| 60.7% \| Fine-tuned, agentic tool-calling \|

	Our model trails CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy. Sources: [Spider leaderboard](https://yale-lily.github.io/spider), [CodeS (Li et al., 2024)](https://arxiv.org/abs/2402.16347).

	### Detailed Breakdown

	\| Status \| Count \| % of Total \| Description \|
	\|--------\|------:\|------------\|-------------\|
	\| Match \| 626 \| 60.5% \| Predicted SQL produces identical results to gold SQL \|
	\| Mismatch \| 209 \| 20.2% \| SQL executes but results differ from gold \|
	\| Execution Error \| 170 \| 16.4% \| Transpiled SQL fails to execute against SQLite \|
	\| Transpile Error \| 24 \| 2.3% \| Pipe SQL cannot be transpiled to standard SQL \|
	\| No Prediction \| 3 \| 0.3% \| Agent did not produce a pipe SQL query \|
	\| Gold Error \| 2 \| 0.2% \| Reference gold SQL fails (excluded from denominator) \|

	### Evaluation Methodology

	1. The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120s timeout)
	2. The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
	3. Predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
	4. Both predicted and gold SQL are executed against the Spider SQLite databases
	5. Result sets are compared using order-insensitive set comparison with numeric tolerance

	> Note: This is an in-distribution evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.

	## Tools

	The model was trained to use 5 tools in a multi-turn conversation:

	\| Tool \| Description \|
	\|------\|-------------\|
	\| `list_tables` \| List all tables in a database \|
	\| `describe_table` \| Get column names, types, and constraints for a table \|
	\| `sample_data` \| Retrieve sample rows from a table \|
	\| `execute_pipe_sql` \| Execute a pipe SQL query against the database \|
	\| `validate_pipe_sql` \| Validate pipe SQL syntax without executing \|

	## Usage

	### Chat Template

	The model uses a custom chat template with `<tool_call>` tags for tool invocations:

	```
	<\|im_start\|>assistant
	Let me explore the database first.
	<tool_call>
	list_tables({"db_id": "concert_singer"})
	</tool_call><\|im_end\|>
	```

	Tool responses are formatted as:

	```
	<\|im_start\|>user
	<tool_response>
	Tables in database 'concert_singer':
	- stadium
	- singer
	- concert
	- singer_in_concert
	</tool_response><\|im_end\|>
	```

	### Inference

	For inference with the correct chat template, see the [evaluation server code](https://github.com/nittygritty-zzy/sqlglot/tree/main/pipe_sql/evaluation/server) on GitHub.

	## Reproducing the Benchmark

	### Prerequisites

	- GPU: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
	- Python: 3.11+ with pip/uv
	- Node.js: 18+ with npm
	- Disk: ~1 GB for Spider databases, ~3 GB for model weights

	### Step 1: Clone the Repository

	```bash
	git clone https://github.com/nittygritty-zzy/sqlglot.git
	cd sqlglot
	```

	### Step 2: Set Up Python Environment

	```bash
	# Create virtual environment
	uv venv .venv --python 3.11
	source .venv/bin/activate # Linux/macOS
	# source .venv/Scripts/activate # Windows (Git Bash)

	# Install sqlglot (editable)
	uv pip install -e .

	# Install evaluation server dependencies
	uv pip install fastapi uvicorn pydantic

	# Install PyTorch with CUDA support
	uv pip install torch --index-url https://download.pytorch.org/whl/cu126

	# Install model loading dependencies
	uv pip install transformers accelerate
	```

	Verify CUDA:
	```bash
	python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
	# Expected: True NVIDIA GeForce RTX ...
	```

	### Step 3: Download Spider 1.0 Dataset

	The benchmark uses Spider 1.0 dev set (1,034 questions across 20 SQLite databases).

	```bash
	# Install gdown for Google Drive downloads
	uv pip install gdown

	# Download and extract Spider 1.0 (~1 GB)
	bash scripts/setup_data.sh
	```

	Verify:
	```bash
	ls data/spider/dev.json # 1,034 questions
	ls data/spider/database/ \| wc -l # ~166 databases (20 used by dev set)
	```

	### Step 4: Download the Model

	```bash
	# Option A: Use huggingface_hub (recommended)
	pip install huggingface_hub
	python -c "
	from huggingface_hub import snapshot_download
	snapshot_download('nittygritty-zzy/pipe-sql-1.5b', local_dir='pipe_sql/finetuning_output/merged')
	"

	# Option B: Use git-lfs
	git lfs install
	git clone https://huggingface.co/nittygritty-zzy/pipe-sql-1.5b pipe_sql/finetuning_output/merged
	```

	### Step 5: Install Node.js Agent Dependencies

	```bash
	cd pipe_sql/evaluation/agent
	npm install
	cd ../../..
	```

	### Step 6: Run the Benchmark

	#### Option A: Full Pipeline (Recommended)

	```bash
	# Run all 1,034 questions (takes ~2 hours on RTX 4080)
	bash pipe_sql/evaluation/run_all.sh

	# Smoke test with 5 questions first
	bash pipe_sql/evaluation/run_all.sh --limit 5
	```

	This script:
	1. Starts the Python evaluation server (model inference + tool execution)
	2. Waits for the server to be ready
	3. Runs the TypeScript agent benchmark
	4. Evaluates results and prints execution accuracy

	#### Option B: Run Components Separately

	Start the evaluation server:
	```bash
	# Default: loads model from pipe_sql/finetuning_output/merged/
	python -m pipe_sql.evaluation.server.app

	# Custom model path:
	MODEL_PATH=path/to/model python -m pipe_sql.evaluation.server.app
	```

	Wait for `Server ready` in the logs, then in a separate terminal:

	Run the agent benchmark:
	```bash
	cd pipe_sql/evaluation/agent
	npx tsx src/main.ts --benchmark # All 1,034 questions
	npx tsx src/main.ts --benchmark --limit 5 # Smoke test
	```

	Run single question interactively:
	```bash
	cd pipe_sql/evaluation/agent
	npx tsx src/main.ts "How many singers do we have?" concert_singer
	```

	Evaluate results:
	```bash
	python pipe_sql/evaluation/evaluate.py --results pipe_sql/output/results.json
	```

	### Step 7: Review Results

	Results are saved to `pipe_sql/output/`:

	\| File \| Description \|
	\|------\|-------------\|
	\| `results.json` \| Agent predictions with conversation traces \|
	\| `eval_results.json` \| Per-question evaluation details (match/mismatch/error) \|
	\| `eval_summary.json` \| Aggregate metrics \|

	### Configuration

	\| Environment Variable \| Default \| Description \|
	\|---------------------\|---------\|-------------\|
	\| `MODEL_PATH` \| `pipe_sql/finetuning_output/merged` \| Path to merged model directory \|
	\| `SPIDER_DB_DIR` \| `data/spider/database` \| Spider database directory \|
	\| `SPIDER_DIR` \| `data/spider` \| Spider data directory (contains dev.json) \|
	\| `PORT` \| `8000` \| Evaluation server port \|
	\| `SERVER_URL` \| `http://localhost:8000` \| Agent to server connection URL \|
	\| `OUTPUT_DIR` \| `pipe_sql/output` \| Agent output directory \|

	### Troubleshooting

	Server fails to load model: Ensure `pipe_sql/finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If using a different path, set `MODEL_PATH`.

	CUDA out of memory: The 1.5B model needs ~3 GB VRAM in float16. Close other GPU processes or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.

	Agent produces garbled tool calls: The 1.5B model sometimes generates garbled special tokens instead of proper `<tool_call>` tags. The inference server includes fallback parsing for bare function calls — this is handled automatically.

	Spider databases not found: Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.

	## Limitations

	- Trained and evaluated only on Spider 1.0 (SQLite databases)
	- Context window limited to 2,048 tokens during training
	- The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags — the inference server includes fallback parsing for bare function calls
	- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
	- This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower

	## License

	This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Qwen2.5-Coder model license.