---
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
  - text-to-sql
  - sql
  - qwen3.5
  - fine-tuned
  - fsdp
  - nebius
datasets:
  - b-mc2/sql-create-context
  - gretelai/synthetic_text_to_sql
language:
  - en
pipeline_tag: text-generation
---

# Qwen3.5-27B-Text2SQL

Fine-tuned [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) for **Text-to-SQL** generation. Given a database schema and a natural language question, the model outputs a clean SQL query.

## Key Results

| Metric | Base Model | This Model | Improvement |
|---|---|---|---|
| **Gretel Execution Accuracy (3,492 samples)** | 26.5% | **66.7%** | **+40.2%** |
| **Gretel Valid SQL** | 37.8% | **89.4%** | +51.6% |
| **Spider Execution Accuracy (1,032, gold std)** | 46.6% | **55.1%** | +8.5% |
| **Spider Exact Match** | 0.0% | **22.2%** | +22.2% |
| **Spider Keyword Score** | 45.5% | **85.4%** | +39.9% |

### Regression (standard benchmarks via lm-eval-harness)

| Benchmark | Base | This Model | Delta |
|---|---|---|---|
| **MMLU (humanities)** | 81.9% | 83.5% | +1.6% (no regression) |
| **MMLU (STEM)** | 87.2% | 86.7% | -0.5% (no regression) |
| **MMLU (social sciences)** | 92.0% | 92.0% | 0% (no regression) |
| **MMLU (other)** | 87.5% | 87.5% | 0% (no regression) |
| **GSM8K (math, strict)** | 60.4% | 35.4% | **-25.0% (regression)** |
| **HellaSwag (common sense)** | 79.6% | 84.1% | +4.5% (improved) |
| **ARC-Challenge (reasoning)** | 69.3% | 71.3% | +2.0% (improved) |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    trust_remote_code=True,
)

schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"
question = "Find all employees in Engineering with salary above 90000."

prompt = f"<|im_start|>system\nYou are a SQL expert. Given a database schema and a natural language question, write the correct SQL query.<|im_end|>\n<|im_start|>user\nSchema: {schema}\nQuestion: {question}<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000
```

### With vLLM

```bash
vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2
```

```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="mahernaija/Qwen3.5-27B-Text2SQL",
    messages=[
        {"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."},
        {"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"},
    ],
    max_tokens=256,
    temperature=0,
)
print(response.choices[0].message.content)
# SELECT name FROM products ORDER BY price DESC LIMIT 3
```

## Training Details

| Parameter | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-27B (26.9B params) |
| **Method** | Full fine-tuning with FSDP |
| **Hardware** | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) |
| **Training time** | 3 hours 6 minutes |
| **Dataset** | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples |
| **Epochs** | 1 |
| **Batch size** | 2 per GPU × 8 grad accum × 16 GPUs = 256 global |
| **Learning rate** | 2e-5 (cosine decay, 5% warmup) |
| **Sequence length** | 512 (SQL samples P99=373 tokens) |
| **Precision** | BF16 |
| **GPU utilization** | 98-100% |
| **Final train loss** | 0.144 |
| **Final eval loss** | 0.176 |

### Data Preprocessing

- Schemas stripped of INSERT data (model learns SQL from schema structure, not memorized answers)
- Manual chat format (bypasses Qwen3.5 `<think>` tag injection)
- Label masking: loss only on SQL output, prompt tokens masked with -100
- Deduplication + contamination check between train and eval splits
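
The label-masking step above can be sketched as follows. This is a minimal illustration (the actual training code is not published, and `build_labels` is a hypothetical helper): prompt tokens receive label `-100`, the `ignore_index` of PyTorch's cross-entropy loss, so gradients come only from the SQL completion.

```python
def build_labels(prompt_ids, sql_ids):
    """Concatenate prompt and SQL token ids; mask prompt positions with -100
    so the loss is computed only on the SQL output tokens."""
    input_ids = prompt_ids + sql_ids
    # -100 is the default ignore_index for torch.nn.CrossEntropyLoss
    labels = [-100] * len(prompt_ids) + sql_ids
    return input_ids, labels
```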

### Evaluation

**Gretel Execution Accuracy** (gold-standard metric: runs each query in SQLite and compares result sets):

| Complexity | Base | This Model |
|---|---|---|
| Basic SQL | 23.8% | **71.4%** |
| Aggregation | 18.2% | **54.5%** |
| Single JOIN | 25.0% | **75.0%** |
| Window functions | 0.0% | **33.3%** |
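
The execution-accuracy metric can be sketched roughly as below. This is a simplified illustration, not the actual evaluation harness (which is not published): both queries run against a fresh in-memory SQLite database and their result sets are compared order-insensitively, with invalid SQL counted as a miss.

```python
import sqlite3

def execution_match(schema_sql, gold_sql, pred_sql):
    """Execute gold and predicted SQL against an in-memory SQLite DB
    built from schema_sql; return True if result sets match."""
    try:
        conn = sqlite3.connect(":memory:")
        conn.executescript(schema_sql)
        gold = sorted(conn.execute(gold_sql).fetchall())
        pred = sorted(conn.execute(pred_sql).fetchall())
        return gold == pred
    except sqlite3.Error:
        return False  # invalid SQL counts as a failed prediction
```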

**Spider Benchmark** (1,034 dev questions, public standard):
- Exact match: 22.2%
- Keyword score: 85.4%

### Known Limitations

**Catastrophic forgetting**: Full fine-tuning on SQL-only data caused regression in general capabilities. The model tries to answer non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider:
- LoRA fine-tuning instead of full FT
- Mixed training data (SQL + general chat)
- Using this model only for SQL-specific pipelines

### Regression Test

| Category | Base | This Model | SQL Contamination |
|---|---|---|---|
| General Knowledge | 84% | 44% | 2/5 |
| Math | 100% | 40% | 3/5 |
| Code | 48% | 47% | 5/5 |
| Language | 90% | 78% | 4/5 |
| Common Sense | 82% | 93% | 0/5 |
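
A contamination check along the lines of the table above could be sketched as a simple heuristic (hypothetical; the actual test script is not published): flag a response as SQL-contaminated when it answers a general-purpose prompt with a SQL statement.

```python
import re

# Matches responses that open with a common SQL keyword.
SQL_PATTERN = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|WITH)\b", re.IGNORECASE)

def is_sql_contaminated(response: str) -> bool:
    """Heuristic: treat a response to a non-SQL prompt as contaminated
    if it starts with a SQL keyword."""
    return bool(SQL_PATTERN.match(response))
```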

## Architecture

- **Architecture**: Qwen3_5ForConditionalGeneration (VLM with text + vision)
- **Text backbone**: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads
- **Attention**: Hybrid GDN (linear_attention + full_attention)
- **Context**: 262K tokens
- **Vision**: Built-in 27-layer ViT (weights from base model, not fine-tuned)

## Files

| File | Description |
|---|---|
| `model-00001-of-00003.safetensors` | Text backbone weights (shard 1/3) |
| `model-00002-of-00003.safetensors` | Text backbone weights (shard 2/3) |
| `model-00003-of-00003.safetensors` | Text backbone weights (shard 3/3) |
| `model-visual.safetensors` | Vision encoder weights (from base model) |
| `config.json` | Full VLM config (required by vLLM) |
| `tokenizer.json` | Tokenizer |

## Citation

```bibtex
@misc{naija2026qwen35text2sql,
  title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL},
  author={Maher Naija},
  year={2026},
  url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL}
}
```

## License

Apache 2.0 (same as base model [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B))

## Acknowledgments

- Base model: [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) by Alibaba
- Training data: [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) + [Gretel synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)