Zheyuan Zhao committed on
Add model card with training details and benchmark results

README.md ADDED

---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
tags:
- text-to-sql
- pipe-sql
- sqlglot
- tool-calling
- qwen2
datasets:
- spider
pipeline_tag: text-generation
model-index:
- name: pipe-sql-1.5b
  results:
  - task:
      type: text-to-sql
      name: Text-to-SQL
    dataset:
      type: spider
      name: Spider 1.0 Dev
    metrics:
    - type: execution_accuracy
      value: 60.66
      name: Execution Accuracy
---

# Pipe SQL 1.5B

A fine-tuned [Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) model for generating **Pipe SQL** through multi-turn tool-calling conversations.

## What is Pipe SQL?

Pipe SQL is a more readable SQL syntax that uses the `|>` (pipe) operator to chain operations in a linear, top-to-bottom flow:

```sql
FROM employees
|> WHERE department = 'Engineering'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level
|> ORDER BY avg_salary DESC
```

This is transpiled to standard SQL via [sqlglot](https://github.com/tobymao/sqlglot), an open-source SQL parser and transpiler.
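
Conceptually, each pipe stage maps onto a clause of an ordinary `SELECT`. The toy sketch below illustrates that mapping for intuition only; it is not how sqlglot actually works, and `pipe_to_sql` is a made-up helper:

```python
def pipe_to_sql(pipe_query: str) -> str:
    """Toy transpiler: map pipe stages onto SELECT clauses (illustration only)."""
    stages = [s.strip() for s in pipe_query.strip().split("|>")]
    table = stages[0].removeprefix("FROM").strip()
    select, where, group_by, order_by = "*", None, None, None
    for stage in stages[1:]:
        if stage.startswith("WHERE"):
            where = stage.removeprefix("WHERE").strip()
        elif stage.startswith("AGGREGATE"):
            # "AVG(salary) AS avg_salary GROUP BY level" -> SELECT list + GROUP BY
            expr, _, keys = stage.removeprefix("AGGREGATE").partition("GROUP BY")
            group_by = keys.strip() or None
            select = f"{group_by}, {expr.strip()}" if group_by else expr.strip()
        elif stage.startswith("ORDER BY"):
            order_by = stage.removeprefix("ORDER BY").strip()
    sql = f"SELECT {select} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if group_by:
        sql += f" GROUP BY {group_by}"
    if order_by:
        sql += f" ORDER BY {order_by}"
    return sql

print(pipe_to_sql(
    "FROM employees "
    "|> WHERE department = 'Engineering' "
    "|> AGGREGATE AVG(salary) AS avg_salary GROUP BY level "
    "|> ORDER BY avg_salary DESC"
))
```

The pipe form reads top to bottom in execution order, while the equivalent `SELECT` scatters the same clauses across the statement; sqlglot handles the full grammar, including joins and nested pipelines, which this sketch does not.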

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | Qwen2.5-Coder-1.5B-Instruct |
| **Architecture** | Qwen2ForCausalLM |
| **Parameters** | 1.5B |
| **Hidden Size** | 1536 |
| **Layers** | 28 |
| **Attention Heads** | 12 (2 KV heads) |
| **Context Length** | 2048 tokens (training) |

## Training

The model was fine-tuned using **QLoRA** on multi-turn tool-calling conversations for text-to-SQL generation.

### Training Data

Conversations were generated from the [Spider 1.0](https://yale-lily.github.io/spider) training set, where each conversation follows an agentic workflow:

1. **Explore** the database schema using `list_tables`, `describe_table`, and `sample_data` tools
2. **Write** pipe SQL queries using `execute_pipe_sql` and `validate_pipe_sql` tools
3. **Iterate** based on execution results until the query is correct
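
In code, that explore/write/iterate loop looks roughly like the sketch below; `call_model` and the tool registry are hypothetical stand-ins, not the actual data-generation pipeline:

```python
def agent_loop(question, db_id, tools, call_model, max_turns=8):
    """Explore -> write -> iterate until the model stops calling tools.

    tools: dict mapping tool name -> callable (hypothetical registry)
    call_model: returns {"content": str, "tool": (name, args) or None}
    """
    messages = [{"role": "user", "content": f"[{db_id}] {question}"}]
    for _ in range(max_turns):
        reply = call_model(messages)  # assistant turn: text plus optional tool call
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply.get("tool") is None:  # no tool call means a final answer
            return messages
        name, args = reply["tool"]
        result = tools[name](**args)   # e.g. list_tables / execute_pipe_sql
        messages.append(
            {"role": "user",
             "content": f"<tool_response>\n{result}\n</tool_response>"})
    return messages
```

Execution results are fed back as `<tool_response>` turns, so a failed query becomes context for the next attempt.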

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| **Method** | QLoRA (4-bit NF4) |
| **LoRA rank** | 16 |
| **LoRA alpha** | 32 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Epochs** | 3 |
| **Learning rate** | 2e-4 |
| **LR scheduler** | Cosine |
| **Warmup ratio** | 0.05 |
| **Batch size** | 2 (per device) |
| **Gradient accumulation** | 8 steps |
| **Weight decay** | 0.01 |
| **Loss** | Assistant-only (tool responses masked) |
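
Two quantities implied by the table, the effective batch size and the LoRA scaling factor, follow directly:

```python
per_device_batch = 2
grad_accum_steps = 8
# One optimizer step accumulates 2 x 8 = 16 conversations (per GPU).
effective_batch = per_device_batch * grad_accum_steps

lora_r, lora_alpha = 16, 32
# The adapter update is scaled by alpha / r before being added to frozen weights.
lora_scaling = lora_alpha / lora_r

print(effective_batch, lora_scaling)  # 16 2.0
```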

## Evaluation Results

Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic benchmark pipeline, scored by execution accuracy.

| Metric | Value |
|--------|-------|
| **Execution Accuracy** | **60.66%** (626 / 1,032) |
| **Prediction Rate** | 99.7% (1,031 / 1,034) |

### Status Breakdown

| Status | Count | Percentage |
|--------|-------|------------|
| Match | 626 | 60.5% |
| Mismatch | 209 | 20.2% |
| Execution Error | 170 | 16.4% |
| Transpile Error | 24 | 2.3% |
| No Prediction | 3 | 0.3% |
| Gold Error (excluded) | 2 | 0.2% |

> **Note**: This is an **in-distribution** evaluation: the model was trained on Spider training data, and the dev set uses the same 20 databases. Gold errors (2 questions where the reference SQL fails) are excluded from the accuracy denominator.
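
The headline metrics follow directly from this breakdown, with gold errors removed from the accuracy denominator:

```python
counts = {"match": 626, "mismatch": 209, "execution_error": 170,
          "transpile_error": 24, "no_prediction": 3, "gold_error": 2}

total = sum(counts.values())                   # 1,034 dev questions
valid = total - counts["gold_error"]           # 1,032 after excluding gold errors
execution_accuracy = counts["match"] / valid   # 626 / 1,032
prediction_rate = (total - counts["no_prediction"]) / total  # 1,031 / 1,034

print(f"{execution_accuracy:.2%} {prediction_rate:.1%}")  # 60.66% 99.7%
```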

## Tools

The model was trained to use 5 tools in a multi-turn conversation:

| Tool | Description |
|------|-------------|
| `list_tables` | List all tables in a database |
| `describe_table` | Get column names, types, and constraints for a table |
| `sample_data` | Retrieve sample rows from a table |
| `execute_pipe_sql` | Execute a pipe SQL query against the database |
| `validate_pipe_sql` | Validate pipe SQL syntax without executing |
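
Since Spider databases are SQLite, the schema-exploration tools map onto standard SQLite introspection. A minimal sketch over an in-memory database (the benchmark's real implementations may differ):

```python
import sqlite3

def list_tables(conn):
    """Names of all user tables, via the sqlite_master catalog."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    return [name for (name,) in rows]

def describe_table(conn, table):
    """(column, declared type) pairs from PRAGMA table_info."""
    return [(r[1], r[2]) for r in conn.execute(f"PRAGMA table_info({table})")]

def sample_data(conn, table, limit=3):
    """A few example rows from the table."""
    return conn.execute(f"SELECT * FROM {table} LIMIT {limit}").fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO singer VALUES (1, 'Joe'), (2, 'Mary')")

print(list_tables(conn))               # ['singer']
print(describe_table(conn, "singer"))  # [('singer_id', 'INTEGER'), ('name', 'TEXT')]
print(sample_data(conn, "singer"))     # [(1, 'Joe'), (2, 'Mary')]
```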

## Usage

### Chat Template

The model uses a custom chat template with `<tool_call>` tags for tool invocations:

```
<|im_start|>assistant
Let me explore the database first.
<tool_call>
list_tables({"db_id": "concert_singer"})
</tool_call><|im_end|>
```

Tool responses are formatted as:

```
<|im_start|>user
<tool_response>
Tables in database 'concert_singer':
- stadium
- singer
- concert
- singer_in_concert
</tool_response><|im_end|>
```
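
Parsing an assistant turn then reduces to extracting the tagged call, with a bare-call fallback for turns where the model omits the tags. A sketch (not the evaluation server's exact code; it assumes flat JSON arguments):

```python
import json
import re

TAGGED = re.compile(r"<tool_call>\s*(\w+)\((.*?)\)\s*</tool_call>", re.S)
# Fallback: a bare known-tool call without <tool_call> tags.
BARE = re.compile(
    r"\b(list_tables|describe_table|sample_data|"
    r"execute_pipe_sql|validate_pipe_sql)\((\{.*?\})\)", re.S)

def parse_tool_call(text):
    """Return (tool_name, args_dict), or None if no call is present."""
    m = TAGGED.search(text) or BARE.search(text)
    if not m:
        return None
    return m.group(1), json.loads(m.group(2))

turn = 'Let me explore.\n<tool_call>\nlist_tables({"db_id": "concert_singer"})\n</tool_call>'
print(parse_tool_call(turn))  # ('list_tables', {'db_id': 'concert_singer'})
```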

### Inference

For inference with the correct chat template, see the evaluation server code in the [sqlglot repository](https://github.com/nittygritty-zzy/sqlglot/tree/main/evaluation/server).

## Limitations

- Trained and evaluated only on Spider 1.0 (SQLite databases)
- Context window limited to 2,048 tokens during training
- The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags; the inference server includes fallback parsing for bare function calls
- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Qwen2.5-Coder model license.