| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen3.5-27B |
| tags: |
| - text-to-sql |
| - sql |
| - qwen3.5 |
| - fine-tuned |
| - fsdp |
| - nebius |
| datasets: |
| - b-mc2/sql-create-context |
| - gretelai/synthetic_text_to_sql |
| language: |
| - en |
| pipeline_tag: text-generation |
| --- |
| |
| # Qwen3.5-27B-Text2SQL |
|
|
| Fine-tuned [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) for **Text-to-SQL** generation. Given a database schema and a natural language question, the model outputs a clean SQL query. |
|
|
| ## Key Results |
|
|
| Metric | Base Model | This Model | Improvement (pp) |
|---|---|---|---|
| **Gretel Execution Accuracy (3,492 samples)** | 26.5% | **66.7%** | **+40.2** |
| **Gretel Valid SQL** | 37.8% | **89.4%** | +51.6 |
| **Spider Execution Accuracy (1,032 samples, gold standard)** | 46.6% | **55.1%** | +8.5 |
| **Spider Exact Match** | 0.0% | **22.2%** | +22.2 |
| **Spider Keyword Score** | 45.5% | **85.4%** | +39.9 |
|
|
| ### Regression (standard benchmarks via lm-eval-harness) |
|
|
| Benchmark | Base | This Model | Delta (pp) |
|---|---|---|---|
| **MMLU (humanities)** | 81.9% | 83.5% | +1.6 (no regression) |
| **MMLU (STEM)** | 87.2% | 86.7% | -0.5 (no regression) |
| **MMLU (social sciences)** | 92.0% | 92.0% | 0 (no regression) |
| **MMLU (other)** | 87.5% | 87.5% | 0 (no regression) |
| **GSM8K (math, strict)** | 60.4% | 35.4% | **-25.0 (regression)** |
| **HellaSwag (common sense)** | 79.6% | 84.1% | +4.5 (improved) |
| **ARC-Challenge (reasoning)** | 69.3% | 71.3% | +2.0 (improved) |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model = AutoModelForCausalLM.from_pretrained( |
| "mahernaija/Qwen3.5-27B-Text2SQL", |
| torch_dtype="auto", |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| tokenizer = AutoTokenizer.from_pretrained( |
| "mahernaija/Qwen3.5-27B-Text2SQL", |
| trust_remote_code=True, |
| ) |
| |
| schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);" |
| question = "Find all employees in Engineering with salary above 90000." |
| |
| prompt = f"<|im_start|>system\nYou are a SQL expert. Given a database schema and a natural language question, write the correct SQL query.<|im_end|>\n<|im_start|>user\nSchema: {schema}\nQuestion: {question}<|im_end|>\n<|im_start|>assistant\n" |
| |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
| response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True) |
| print(response) |
| # SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000 |
| ``` |
|
|
| ### With vLLM |
|
|
| ```bash |
| vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2 |
| ``` |
|
|
| ```python |
| from openai import OpenAI |
| client = OpenAI(base_url="http://localhost:8000/v1", api_key="token") |
| |
| response = client.chat.completions.create( |
| model="mahernaija/Qwen3.5-27B-Text2SQL", |
| messages=[ |
| {"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."}, |
| {"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"}, |
| ], |
| max_tokens=256, |
| temperature=0, |
| ) |
| print(response.choices[0].message.content) |
| # SELECT name FROM products ORDER BY price DESC LIMIT 3 |
| ``` |
|
|
| ## Training Details |
|
|
| | Parameter | Value | |
| |---|---| |
| | **Base model** | Qwen/Qwen3.5-27B (26.9B params) | |
| | **Method** | Full fine-tuning with FSDP | |
| | **Hardware** | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) | |
| | **Training time** | 3 hours 6 minutes | |
| | **Dataset** | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples | |
| | **Epochs** | 1 | |
| | **Batch size** | 2 per GPU × 8 grad accum × 16 GPUs = 256 global | |
| | **Learning rate** | 2e-5 (cosine decay, 5% warmup) | |
| | **Sequence length** | 512 (SQL samples P99=373 tokens) | |
| | **Precision** | BF16 | |
| | **GPU utilization** | 98-100% | |
| | **Final train loss** | 0.144 | |
| | **Final eval loss** | 0.176 | |
|
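For reference, a launch command consistent with the table above might look like the following. This is a sketch only: the script name `train_text2sql.py` and its flags are hypothetical, and the actual training code is not part of this repo.

```shell
# Hypothetical launch: 2 nodes x 8 H200 (16 GPUs), FSDP full fine-tuning in BF16.
# Effective global batch = 2 per GPU x 8 grad accum x 16 GPUs = 256.
torchrun --nnodes 2 --nproc_per_node 8 --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:29500" \
  train_text2sql.py \
  --model_name_or_path Qwen/Qwen3.5-27B \
  --bf16 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.05 \
  --max_seq_length 512 \
  --num_train_epochs 1
```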
|
| ### Data Preprocessing |
|
|
| - Schemas stripped of INSERT data (model learns SQL from schema structure, not memorized answers) |
| - Manual chat format (bypasses Qwen3.5 `<think>` tag injection) |
| - Label masking: loss only on SQL output, prompt tokens masked with -100 |
| - Deduplication + contamination check between train and eval splits |
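The label-masking step can be sketched in a few lines (a minimal illustration of the idea, not the actual training code; `prompt_ids` and `completion_ids` stand for already-tokenized prompt and SQL completion):

```python
# Loss is computed only on the SQL completion: prompt positions in the
# labels are set to -100, which cross-entropy loss ignores.
def build_labels(prompt_ids, completion_ids):
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [-100] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

input_ids, labels = build_labels([1, 2, 3], [7, 8])
print(labels)  # [-100, -100, -100, 7, 8]
```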
|
|
| ### Evaluation |
|
|
**Gretel Execution Accuracy** (gold standard: executes each query against SQLite and compares result sets):
|
|
| | Complexity | Base | This Model | |
| |---|---|---| |
| | Basic SQL | 23.8% | **71.4%** | |
| | Aggregation | 18.2% | **54.5%** | |
| | Single JOIN | 25.0% | **75.0%** | |
| | Window functions | 0.0% | **33.3%** | |
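Execution accuracy in this sense can be reproduced with the standard-library `sqlite3` module; the snippet below is a simplified sketch of the check (the actual evaluation harness is not included here):

```python
import sqlite3

def execution_match(schema_sql, gold_sql, pred_sql):
    """Run gold and predicted SQL on a fresh in-memory DB; compare result sets."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    try:
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss
    finally:
        conn.close()
    return sorted(gold) == sorted(pred)  # order-insensitive comparison

schema = "CREATE TABLE t (id INTEGER, x REAL); INSERT INTO t VALUES (1, 2.0), (2, 5.0);"
print(execution_match(schema, "SELECT id FROM t WHERE x > 3", "SELECT id FROM t WHERE x >= 5"))
# True
```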
|
|
| **Spider Benchmark** (1,034 dev questions, public standard): |
| - Exact match: 22.2% |
| - Keyword score: 85.4% |
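Keyword score is not one of the official Spider metrics; a plausible reading is keyword overlap between predicted and gold SQL, roughly as sketched below (the keyword set and formula are assumptions, not the exact scorer used):

```python
# Hypothetical keyword-overlap score: fraction of SQL keywords in the
# gold query that also appear in the prediction.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "BY", "JOIN",
            "LIMIT", "HAVING", "COUNT", "AVG", "SUM", "MAX", "MIN", "DISTINCT"}

def keyword_score(pred, gold):
    pred_kw = {t for t in pred.upper().split() if t in KEYWORDS}
    gold_kw = {t for t in gold.upper().split() if t in KEYWORDS}
    return len(pred_kw & gold_kw) / max(len(gold_kw), 1)

print(keyword_score("SELECT name FROM t ORDER BY x",
                    "SELECT name FROM t ORDER BY x LIMIT 3"))  # 0.8
```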
|
|
| ### Known Limitations |
|
|
| **Catastrophic forgetting**: Full fine-tuning on SQL-only data caused regression in general capabilities. The model tries to answer non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider: |
| - LoRA fine-tuning instead of full FT |
| - Mixed training data (SQL + general chat) |
| - Using this model only for SQL-specific pipelines |
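If the model must serve behind a mixed-task endpoint anyway, a cheap mitigation is to route or flag responses that look like SQL when the request was not a SQL task. A heuristic sketch (not part of this repo):

```python
import re

# Flags text whose first statement starts with a common SQL verb.
SQL_PATTERN = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE|WITH|CREATE)\b",
                         re.IGNORECASE)

def looks_like_sql(text):
    return bool(SQL_PATTERN.match(text))

print(looks_like_sql("SELECT * FROM employees"))          # True
print(looks_like_sql("The capital of France is Paris."))  # False
```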
|
|
| ### Regression Test |
|
|
| | Category | Base | This Model | SQL Contamination | |
| |---|---|---|---| |
| | General Knowledge | 84% | 44% | 2/5 | |
| | Math | 100% | 40% | 3/5 | |
| | Code | 48% | 47% | 5/5 | |
| | Language | 90% | 78% | 4/5 | |
| | Common Sense | 82% | 93% | 0/5 | |
|
|
| ## Architecture |
|
|
| - **Architecture**: Qwen3_5ForConditionalGeneration (VLM with text + vision) |
| - **Text backbone**: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads |
| - **Attention**: Hybrid GDN (linear_attention + full_attention) |
| - **Context**: 262K tokens |
- **Vision**: Built-in 27-layer ViT (weights inherited from the base model, not fine-tuned)
|
|
| ## Files |
|
|
| | File | Description | |
| |---|---| |
| | `model-00001-of-00003.safetensors` | Text backbone weights (shard 1/3) | |
| | `model-00002-of-00003.safetensors` | Text backbone weights (shard 2/3) | |
| | `model-00003-of-00003.safetensors` | Text backbone weights (shard 3/3) | |
| | `model-visual.safetensors` | Vision encoder weights (from base model) | |
| | `config.json` | Full VLM config (required by vLLM) | |
| | `tokenizer.json` | Tokenizer | |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{naija2026qwen35text2sql, |
| title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL}, |
| author={Maher Naija}, |
| year={2026}, |
| url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL} |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 (same as base model [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)) |
|
|
| ## Acknowledgments |
|
|
| - Base model: [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) by Alibaba |
| - Training data: [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) + [Gretel synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) |
|
|