---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
tags:
- text-to-sql
- pipe-sql
- sqlglot
- tool-calling
- qwen2
datasets:
- spider
pipeline_tag: text-generation
model-index:
- name: pipe-sql-1.5b
  results:
  - task:
      type: text-to-sql
      name: Text-to-SQL
    dataset:
      type: spider
      name: Spider 1.0 Dev
    metrics:
    - type: execution_accuracy
      value: 60.66
      name: Execution Accuracy
---

# Pipe SQL 1.5B

A fine-tuned [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) model for generating **Pipe SQL** through multi-turn tool-calling conversations.

**GitHub**: [nittygritty-zzy/sqlglot](https://github.com/nittygritty-zzy/sqlglot)

## What is Pipe SQL?

Pipe SQL is a more readable SQL syntax that uses the `|>` (pipe) operator to chain operations in a linear, top-to-bottom flow:

```sql
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC
```

This is transpiled to standard SQL via [sqlglot](https://github.com/tobymao/sqlglot), an open-source SQL parser and transpiler.
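For reference, here is the standard SQL I read the pipe query above as equivalent to (the transpiler's exact output may differ in formatting), together with a minimal pure-Python sketch of the step structure. This is an illustration only; the real transpilation parses the query with sqlglot rather than splitting on the operator.

```python
# Each |> step corresponds to one clause of the equivalent standard query.
pipe_sql = """\
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC"""

# Standard-SQL equivalent (my reading of the pipe semantics; the
# transpiler's actual output may be formatted differently):
standard_sql = """\
SELECT level, AVG(salary) AS avg_salary
FROM employees
WHERE department = 'Engineering'
GROUP BY level
ORDER BY avg_salary DESC"""

# Illustration only: list the pipeline steps top to bottom.
steps = [s.strip() for s in pipe_sql.split("|>")]
print(steps[0])    # FROM employees
print(len(steps))  # 4 (source + three pipe operations)
```

Note how the pipe form keeps the clauses in execution order, while the standard form interleaves them (the `SELECT` list comes first even though projection happens late).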
## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | Qwen2.5-Coder-1.5B-Instruct |
| **Architecture** | Qwen2ForCausalLM |
| **Parameters** | 1.5B |
| **Hidden Size** | 1536 |
| **Layers** | 28 |
| **Attention Heads** | 12 (2 KV heads) |
| **Context Length** | 2048 tokens (training) |

## Design Documents

The full design and methodology behind this project are documented in the following design docs (also available in [docs/design/](https://github.com/nittygritty-zzy/sqlglot/tree/main/docs/design) on GitHub):

| Document | Description |
|----------|-------------|
| [Fine-Tuning Design Doc](docs/pipe-sql-fine-tuning-design-doc.md) | End-to-end system design for incremental pipe SQL synthesis and specialized fine-tuning of 1.5B-7B models |
| [Decompiler Design Doc](docs/pipe-sql-decompiler-design-doc.md) | Standard SQL to pipe SQL decompiler — the deterministic data generation component |
| [Validation Loop Design Doc](docs/pipe-sql-validation-loop-design-doc.md) | SQLite round-trip validation and feedback loop to ensure semantic correctness |
| [Training Reproduction Guide](docs/pipe-sql-training-reproduction-guide.md) | Step-by-step guide to reproduce the full training pipeline from scratch |

## Training

The model was fine-tuned using **QLoRA** on multi-turn tool-calling conversations for text-to-SQL generation.

### Training Data

Conversations were generated from the [Spider 1.0](https://yale-lily.github.io/spider) training set, where each conversation follows an agentic workflow:

1. **Explore** the database schema using the `list_tables`, `describe_table`, and `sample_data` tools
2. **Write** pipe SQL queries using the `execute_pipe_sql` and `validate_pipe_sql` tools
3. **Iterate** based on execution results until the query is correct

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Method** | QLoRA (4-bit NF4) |
| **LoRA rank** | 16 |
| **LoRA alpha** | 32 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Epochs** | 3 |
| **Learning rate** | 2e-4 |
| **LR scheduler** | Cosine |
| **Warmup ratio** | 0.05 |
| **Batch size** | 2 (per device) |
| **Gradient accumulation** | 8 steps |
| **Weight decay** | 0.01 |
| **Loss** | Assistant-only (tool responses masked) |

## Evaluation Results

Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.

### Execution Accuracy

| Metric | Value |
|--------|-------|
| **Execution Accuracy** | **60.66%** (626 / 1,032) |
| **Prediction Rate** | 99.7% (1,031 / 1,034) |
| **Total Questions** | 1,034 |
| **Gold Errors Excluded** | 2 |

### Context: Spider 1.0 Dev Set SOTA

| Model | Size | EX (Dev) | Method |
|-------|-----:|----------|--------|
| MiniSeek | — | 91.2% | Proprietary |
| DAIL-SQL + GPT-4 + SC | — | 86.6% | In-context learning |
| DIN-SQL + GPT-4 | — | 85.3% | In-context learning |
| SFT CodeS-7B | 7B | 85.4% | Fine-tuned |
| SFT CodeS-3B | 3B | 83.3% | Fine-tuned |
| SFT CodeS-1B | 1B | 77.9% | Fine-tuned |
| **Pipe SQL 1.5B (ours)** | **1.5B** | **60.7%** | **Fine-tuned, agentic tool-calling** |

Our model trails CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy.
Sources: [Spider leaderboard](https://yale-lily.github.io/spider), [CodeS (Li et al., 2024)](https://arxiv.org/abs/2402.16347).

### Detailed Breakdown

| Status | Count | % of Total | Description |
|--------|------:|------------|-------------|
| **Match** | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
| **Mismatch** | 209 | 20.2% | SQL executes but results differ from gold |
| **Execution Error** | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
| **Transpile Error** | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
| **No Prediction** | 3 | 0.3% | Agent did not produce a pipe SQL query |
| **Gold Error** | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |

### Evaluation Methodology

1. The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120s timeout)
2. The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
3. Predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
4. Both predicted and gold SQL are executed against the Spider SQLite databases
5. Result sets are compared using order-insensitive set comparison with numeric tolerance

> **Note**: This is an **in-distribution** evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.

## Tools

The model was trained to use 5 tools in a multi-turn conversation:

| Tool | Description |
|------|-------------|
| `list_tables` | List all tables in a database |
| `describe_table` | Get column names, types, and constraints for a table |
| `sample_data` | Retrieve sample rows from a table |
| `execute_pipe_sql` | Execute a pipe SQL query against the database |
| `validate_pipe_sql` | Validate pipe SQL syntax without executing |

## Usage

### Chat Template

The model uses a custom chat template with `` tags for tool invocations:

```
<|im_start|>assistant
Let me explore the database first.
list_tables({"db_id": "concert_singer"})
<|im_end|>
```

Tool responses are formatted as:

```
<|im_start|>user
Tables in database 'concert_singer':
- stadium
- singer
- concert
- singer_in_concert
<|im_end|>
```

### Inference

For inference with the correct chat template, see the [evaluation server code](https://github.com/nittygritty-zzy/sqlglot/tree/main/pipe_sql/evaluation/server) on GitHub.

## Reproducing the Benchmark

### Prerequisites

- **GPU**: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
- **Python**: 3.11+ with pip/uv
- **Node.js**: 18+ with npm
- **Disk**: ~1 GB for Spider databases, ~3 GB for model weights

### Step 1: Clone the Repository

```bash
git clone https://github.com/nittygritty-zzy/sqlglot.git
cd sqlglot
```

### Step 2: Set Up Python Environment

```bash
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate        # Linux/macOS
# source .venv/Scripts/activate  # Windows (Git Bash)

# Install sqlglot (editable)
uv pip install -e .

# Install evaluation server dependencies
uv pip install fastapi uvicorn pydantic

# Install PyTorch with CUDA support
uv pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install model loading dependencies
uv pip install transformers accelerate
```

Verify CUDA:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX ...
```

### Step 3: Download Spider 1.0 Dataset

The benchmark uses the Spider 1.0 dev set (1,034 questions across 20 SQLite databases).
```bash
# Install gdown for Google Drive downloads
uv pip install gdown

# Download and extract Spider 1.0 (~1 GB)
bash scripts/setup_data.sh
```

Verify:

```bash
ls data/spider/dev.json           # 1,034 questions
ls data/spider/database/ | wc -l  # ~166 databases (20 used by dev set)
```

### Step 4: Download the Model

```bash
# Option A: Use huggingface_hub (recommended)
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('nittygritty-zzy/pipe-sql-1.5b', local_dir='pipe_sql/finetuning_output/merged')
"

# Option B: Use git-lfs
git lfs install
git clone https://huggingface.co/nittygritty-zzy/pipe-sql-1.5b pipe_sql/finetuning_output/merged
```

### Step 5: Install Node.js Agent Dependencies

```bash
cd pipe_sql/evaluation/agent
npm install
cd ../../..
```

### Step 6: Run the Benchmark

#### Option A: Full Pipeline (Recommended)

```bash
# Run all 1,034 questions (takes ~2 hours on RTX 4080)
bash pipe_sql/evaluation/run_all.sh

# Smoke test with 5 questions first
bash pipe_sql/evaluation/run_all.sh --limit 5
```

This script:

1. Starts the Python evaluation server (model inference + tool execution)
2. Waits for the server to be ready
3. Runs the TypeScript agent benchmark
4. Evaluates results and prints execution accuracy

#### Option B: Run Components Separately

**Start the evaluation server:**

```bash
# Default: loads model from pipe_sql/finetuning_output/merged/
python -m pipe_sql.evaluation.server.app

# Custom model path:
MODEL_PATH=path/to/model python -m pipe_sql.evaluation.server.app
```

Wait for `Server ready` in the logs, then in a separate terminal:

**Run the agent benchmark:**

```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts --benchmark            # All 1,034 questions
npx tsx src/main.ts --benchmark --limit 5  # Smoke test
```

**Run a single question interactively:**

```bash
cd pipe_sql/evaluation/agent
npx tsx src/main.ts "How many singers do we have?" concert_singer
```

**Evaluate results:**

```bash
python pipe_sql/evaluation/evaluate.py --results pipe_sql/output/results.json
```

### Step 7: Review Results

Results are saved to `pipe_sql/output/`:

| File | Description |
|------|-------------|
| `results.json` | Agent predictions with conversation traces |
| `eval_results.json` | Per-question evaluation details (match/mismatch/error) |
| `eval_summary.json` | Aggregate metrics |

### Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `MODEL_PATH` | `pipe_sql/finetuning_output/merged` | Path to merged model directory |
| `SPIDER_DB_DIR` | `data/spider/database` | Spider database directory |
| `SPIDER_DIR` | `data/spider` | Spider data directory (contains dev.json) |
| `PORT` | `8000` | Evaluation server port |
| `SERVER_URL` | `http://localhost:8000` | URL the agent uses to reach the server |
| `OUTPUT_DIR` | `pipe_sql/output` | Agent output directory |

### Troubleshooting

**Server fails to load model**: Ensure `pipe_sql/finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If using a different path, set `MODEL_PATH`.

**CUDA out of memory**: The 1.5B model needs ~3 GB VRAM in float16. Close other GPU processes or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.

**Agent produces garbled tool calls**: The 1.5B model sometimes generates garbled special tokens instead of proper `` tags. The inference server includes fallback parsing for bare function calls — this is handled automatically.

**Spider databases not found**: Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.
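The order-insensitive result comparison described in the evaluation methodology can be sketched as follows. This is a simplified stand-in, not the actual `evaluate.py` logic; the rounding precision used to implement the numeric tolerance is an assumption.

```python
def rows_match(pred_rows, gold_rows, ndigits=6):
    """Order-insensitive comparison of two SQL result sets.

    Simplified sketch: floats are rounded to `ndigits` decimal places to
    absorb small numeric differences, then the rows are compared as sorted
    multisets. The real evaluate.py may handle more edge cases (NULLs,
    column reordering, type coercion).
    """
    def normalize(rows):
        normed = [
            tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)
            for row in rows
        ]
        return sorted(normed, key=repr)

    return normalize(pred_rows) == normalize(gold_rows)

# Same rows in a different order, with a tiny float difference:
pred = [(1, "Alice", 3.0000001), (2, "Bob", 5.0)]
gold = [(2, "Bob", 5.0), (1, "Alice", 3.0)]
print(rows_match(pred, gold))  # True
```

Sorting by `repr` rather than by the tuples themselves avoids comparison errors between mixed-type rows (e.g. integers vs. strings in the same column position).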
## Limitations

- Trained and evaluated only on Spider 1.0 (SQLite databases)
- Context window limited to 2,048 tokens during training
- The 1.5B model may generate garbled special tokens instead of proper `` tags — the inference server includes fallback parsing for bare function calls
- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
- This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Qwen2.5-Coder model license.