---
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
- text-to-sql
- sql
- qwen3.5
- fine-tuned
- fsdp
- nebius
datasets:
- b-mc2/sql-create-context
- gretelai/synthetic_text_to_sql
language:
- en
pipeline_tag: text-generation
---
# Qwen3.5-27B-Text2SQL
Fine-tuned [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) for **Text-to-SQL** generation. Given a database schema and a natural language question, the model outputs a clean SQL query.
## Key Results
| Metric | Base Model | This Model | Improvement |
|---|---|---|---|
| **Gretel Execution Accuracy (3,492 samples)** | 26.5% | **66.7%** | **+40.2%** |
| **Gretel Valid SQL** | 37.8% | **89.4%** | +51.6% |
| **Spider Execution Accuracy (1,032, gold std)** | 46.6% | **55.1%** | +8.5% |
| **Spider Exact Match** | 0.0% | **22.2%** | +22.2% |
| **Spider Keyword Score** | 45.5% | **85.4%** | +39.9% |
### Regression (standard benchmarks via lm-eval-harness)
| Benchmark | Base | This Model | Delta |
|---|---|---|---|
| **MMLU (humanities)** | 81.9% | 83.5% | +1.6% (no regression) |
| **MMLU (STEM)** | 87.2% | 86.7% | -0.5% (no regression) |
| **MMLU (social sciences)** | 92.0% | 92.0% | 0% (no regression) |
| **MMLU (other)** | 87.5% | 87.5% | 0% (no regression) |
| **GSM8K (math, strict)** | 60.4% | 35.4% | **-25.0% (regression)** |
| **HellaSwag (common sense)** | 79.6% | 84.1% | +4.5% (improved) |
| **ARC-Challenge (reasoning)** | 69.3% | 71.3% | +2.0% (improved) |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"mahernaija/Qwen3.5-27B-Text2SQL",
torch_dtype="auto",
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"mahernaija/Qwen3.5-27B-Text2SQL",
trust_remote_code=True,
)
schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"
question = "Find all employees in Engineering with salary above 90000."
prompt = f"<|im_start|>system\nYou are a SQL expert. Given a database schema and a natural language question, write the correct SQL query.<|im_end|>\n<|im_start|>user\nSchema: {schema}\nQuestion: {question}<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000
```
### With vLLM
```bash
vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2
```
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
model="mahernaija/Qwen3.5-27B-Text2SQL",
messages=[
{"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."},
{"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"},
],
max_tokens=256,
temperature=0,
)
print(response.choices[0].message.content)
# SELECT name FROM products ORDER BY price DESC LIMIT 3
```
## Training Details
| Parameter | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-27B (26.9B params) |
| **Method** | Full fine-tuning with FSDP |
| **Hardware** | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) |
| **Training time** | 3 hours 6 minutes |
| **Dataset** | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples |
| **Epochs** | 1 |
| **Batch size** | 2 per GPU × 8 grad accum × 16 GPUs = 256 global |
| **Learning rate** | 2e-5 (cosine decay, 5% warmup) |
| **Sequence length** | 512 (SQL samples P99=373 tokens) |
| **Precision** | BF16 |
| **GPU utilization** | 98-100% |
| **Final train loss** | 0.144 |
| **Final eval loss** | 0.176 |
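The global batch size in the table is the product of the per-GPU micro-batch, the gradient-accumulation steps, and the GPU count:

```python
# Effective (global) batch size, per the training table above.
per_gpu_batch = 2
grad_accum_steps = 8
num_gpus = 16

global_batch = per_gpu_batch * grad_accum_steps * num_gpus
print(global_batch)  # 256
```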
### Data Preprocessing
- Schemas stripped of INSERT statements (the model learns SQL from schema structure, not from memorized rows)
- Manual chat format (bypasses Qwen3.5 `<think>` tag injection)
- Label masking: loss only on SQL output, prompt tokens masked with -100
- Deduplication + contamination check between train and eval splits
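The label-masking step above can be sketched as follows (a minimal illustration with hypothetical helper names; -100 is the index PyTorch's `CrossEntropyLoss` ignores by default):

```python
def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion token ids; mask the prompt
    so the loss is computed only on the SQL completion."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [-100] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

ids, labels = build_labels([1, 2, 3], [10, 11])
# ids    == [1, 2, 3, 10, 11]
# labels == [-100, -100, -100, 10, 11]  -> loss only on the SQL tokens
```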
### Evaluation
**Gretel Execution Accuracy** (gold standard — runs SQL in SQLite, compares results):
| Complexity | Base | This Model |
|---|---|---|
| Basic SQL | 23.8% | **71.4%** |
| Aggregation | 18.2% | **54.5%** |
| Single JOIN | 25.0% | **75.0%** |
| Window functions | 0.0% | **33.3%** |
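The execution-accuracy metric above can be sketched roughly like this: build the schema in a scratch SQLite database, run both the gold and predicted queries, and compare result sets. This is an illustrative sketch; the actual evaluation harness may differ in details such as ordering rules.

```python
import sqlite3

def execution_match(schema_sql, gold_sql, pred_sql):
    """Return True if gold and predicted SQL produce the same result
    set (order-insensitive) on a fresh in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(schema_sql)
        gold = con.execute(gold_sql).fetchall()
        pred = con.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss
    finally:
        con.close()
    return sorted(map(tuple, gold)) == sorted(map(tuple, pred))
```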
**Spider Benchmark** (1,034 dev questions, public standard):
- Exact match: 22.2%
- Keyword score: 85.4%
### Known Limitations
**Catastrophic forgetting**: Full fine-tuning on SQL-only data regressed general capabilities; the model often answers non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider:
- LoRA fine-tuning instead of full FT
- Mixed training data (SQL + general chat)
- Using this model only for SQL-specific pipelines
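A minimal sketch of the LoRA alternative using the `peft` library (the `target_modules` list is an assumption for Qwen-style attention blocks; verify against the model's actual module names before training):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B", torch_dtype="auto", trust_remote_code=True
)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trained
```

Because the base weights stay frozen, the adapter can be dropped at inference time to recover the base model's general abilities.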
### Regression Test
| Category | Base | This Model | SQL Contamination |
|---|---|---|---|
| General Knowledge | 84% | 44% | 2/5 |
| Math | 100% | 40% | 3/5 |
| Code | 48% | 47% | 5/5 |
| Language | 90% | 78% | 4/5 |
| Common Sense | 82% | 93% | 0/5 |
## Architecture
- **Architecture**: Qwen3_5ForConditionalGeneration (VLM with text + vision)
- **Text backbone**: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads
- **Attention**: Hybrid GDN (linear_attention + full_attention)
- **Context**: 262K tokens
- **Vision**: Built-in 27-layer ViT (weights from base model, not finetuned)
## Files
| File | Description |
|---|---|
| `model-00001-of-00003.safetensors` | Text backbone weights (shard 1/3) |
| `model-00002-of-00003.safetensors` | Text backbone weights (shard 2/3) |
| `model-00003-of-00003.safetensors` | Text backbone weights (shard 3/3) |
| `model-visual.safetensors` | Vision encoder weights (from base model) |
| `config.json` | Full VLM config (required by vLLM) |
| `tokenizer.json` | Tokenizer |
## Citation
```bibtex
@misc{naija2026qwen35text2sql,
title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL},
author={Maher Naija},
year={2026},
url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL}
}
```
## License
Apache 2.0 (same as base model [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B))
## Acknowledgments
- Base model: [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) by Alibaba
- Training data: [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) + [Gretel synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)