---
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
- text-to-sql
- sql
- qwen3.5
- fine-tuned
- fsdp
- nebius
datasets:
- b-mc2/sql-create-context
- gretelai/synthetic_text_to_sql
language:
- en
pipeline_tag: text-generation
---

# Qwen3.5-27B-Text2SQL

Fine-tuned [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) for **Text-to-SQL** generation. Given a database schema and a natural-language question, the model outputs a clean SQL query.

## Key Results

| Metric | Base Model | This Model | Improvement (points) |
|---|---|---|---|
| **SQL Execution Accuracy** | 19.5% | **61.0%** | +41.5 |
| **Valid SQL Output** | 41.5% | **90.2%** | +48.7 |
| **Spider Exact Match** | 0.0% | **22.2%** | +22.2 |
| **Spider Keyword Score** | 45.5% | **85.4%** | +39.9 |
| **Clean SQL Format** | 0% | **100%** | +100 |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mahernaija/Qwen3.5-27B-Text2SQL",
    trust_remote_code=True,
)

schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);"
question = "Find all employees in Engineering with salary above 90000."

prompt = f"<|im_start|>system\nYou are a SQL expert. Given a database schema and a natural language question, write the correct SQL query.<|im_end|>\n<|im_start|>user\nSchema: {schema}\nQuestion: {question}<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
# SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000
```

### With vLLM

```bash
vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="mahernaija/Qwen3.5-27B-Text2SQL",
    messages=[
        {"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."},
        {"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"},
    ],
    max_tokens=256,
    temperature=0,
)
print(response.choices[0].message.content)
# SELECT name FROM products ORDER BY price DESC LIMIT 3
```

## Training Details

| Parameter | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-27B (26.9B params) |
| **Method** | Full fine-tuning with FSDP |
| **Hardware** | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) |
| **Training time** | 3 hours 6 minutes |
| **Dataset** | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples |
| **Epochs** | 1 |
| **Batch size** | 2 per GPU × 8 grad accum × 16 GPUs = 256 global |
| **Learning rate** | 2e-5 (cosine decay, 5% warmup) |
| **Sequence length** | 512 (SQL samples P99 = 373 tokens) |
| **Precision** | BF16 |
| **GPU utilization** | 98-100% |
| **Final train loss** | 0.144 |
| **Final eval loss** | 0.176 |

### Data Preprocessing

- Schemas stripped of INSERT data (the model learns SQL from schema structure, not memorized answers)
- Manual chat format (bypasses Qwen3.5 `<think>` tag injection)
- Label masking: loss computed only on the SQL output; prompt tokens masked with -100
- Deduplication plus a contamination check between train and eval splits

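The label-masking step above can be sketched as follows. This is an illustrative example with toy token ids, not the actual training script; `build_labels` is a hypothetical helper, and `-100` is the index that the cross-entropy loss in `transformers` ignores.

```python
# Sketch of prompt-masked label construction: the loss is computed only on
# the SQL completion, so prompt positions in the labels are set to -100
# (the ignore index used by PyTorch cross-entropy). Hypothetical helper,
# not the exact training code.

IGNORE_INDEX = -100

def build_labels(prompt_ids, sql_ids):
    """Concatenate prompt and SQL token ids; mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(sql_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(sql_ids)
    return input_ids, labels

# Toy ids standing in for tokenized "Schema: ... Question: ..." and "SELECT ...".
prompt_ids = [101, 102, 103]
sql_ids = [201, 202, 203, 204]
input_ids, labels = build_labels(prompt_ids, sql_ids)
print(labels)  # [-100, -100, -100, 201, 202, 203, 204]
```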
### Evaluation

**Gretel Execution Accuracy** (gold standard: runs the SQL in SQLite and compares result sets):

| Complexity | Base | This Model |
|---|---|---|
| Basic SQL | 23.8% | **71.4%** |
| Aggregation | 18.2% | **54.5%** |
| Single JOIN | 25.0% | **75.0%** |
| Window functions | 0.0% | **33.3%** |

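The execution-accuracy check described above can be sketched with the standard-library `sqlite3` module: build the schema in an in-memory database, run both queries, and compare result sets. This is a simplified illustration, not the actual evaluation harness; `execution_match` is a hypothetical helper.

```python
# Sketch of SQL execution accuracy: a prediction counts as correct when it
# executes and returns the same rows as the gold query (order-insensitive).
# Invalid SQL counts as a miss. Simplified illustration only.
import sqlite3

def execution_match(schema_sql, gold_sql, pred_sql):
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(pred_sql).fetchall()
        return sorted(map(repr, gold)) == sorted(map(repr, pred))
    except sqlite3.Error:
        return False  # prediction failed to parse or execute
    finally:
        conn.close()

schema = "CREATE TABLE employees (id INTEGER, name TEXT, salary REAL);"
print(execution_match(
    schema,
    "SELECT name FROM employees WHERE salary > 90000",
    "SELECT name FROM employees WHERE salary > 90000.0",
))
```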
**Spider Benchmark** (1,034 dev questions, public standard):
- Exact match: 22.2%
- Keyword score: 85.4%

### Known Limitations

**Catastrophic forgetting**: full fine-tuning on SQL-only data regressed general capabilities; the model tends to answer non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider:
- LoRA fine-tuning instead of full fine-tuning
- Mixed training data (SQL + general chat)
- Using this model only in SQL-specific pipelines

### Regression Test

| Category | Base | This Model | SQL Contamination |
|---|---|---|---|
| General Knowledge | 84% | 44% | 2/5 |
| Math | 100% | 40% | 3/5 |
| Code | 48% | 47% | 5/5 |
| Language | 90% | 78% | 4/5 |
| Common Sense | 82% | 93% | 0/5 |

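The "SQL Contamination" column above counts general-purpose prompts answered with SQL. A heuristic detector of that kind might look like the following; this is a hypothetical sketch (`looks_like_sql` is not from the actual test suite), flagging only responses that open with a SQL statement.

```python
# Heuristic SQL-contamination check: flag a response as "SQL" if it starts
# with a common SQL statement keyword. Hypothetical sketch, not the script
# used for the regression numbers above.
import re

SQL_OPENER = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|WITH)\b", re.IGNORECASE)

def looks_like_sql(response: str) -> bool:
    """Return True when the response opens with a SQL statement."""
    return bool(SQL_OPENER.match(response))

print(looks_like_sql("SELECT name FROM users;"))          # True
print(looks_like_sql("The capital of France is Paris."))  # False
```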
## Architecture

- **Architecture**: Qwen3_5ForConditionalGeneration (VLM with text + vision)
- **Text backbone**: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads
- **Attention**: Hybrid GDN (linear_attention + full_attention)
- **Context**: 262K tokens
- **Vision**: built-in 27-layer ViT (weights from the base model, not fine-tuned)

## Files

| File | Description |
|---|---|
| `model-00001-of-00003.safetensors` | Text backbone weights (shard 1/3) |
| `model-00002-of-00003.safetensors` | Text backbone weights (shard 2/3) |
| `model-00003-of-00003.safetensors` | Text backbone weights (shard 3/3) |
| `model-visual.safetensors` | Vision encoder weights (from the base model) |
| `config.json` | Full VLM config (required by vLLM) |
| `tokenizer.json` | Tokenizer |

## Citation

```bibtex
@misc{naija2026qwen35text2sql,
  title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL},
  author={Maher Naija},
  year={2026},
  url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL}
}
```

## Acknowledgments

- Trained on [Nebius AI Cloud](https://nebius.com) using Soperator (Slurm on Kubernetes)
- Base model: [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) by Alibaba
- Training data: [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) + [Gretel synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)