| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen3.5-27B |
| tags: |
| - text-to-sql |
| - sql |
| - qwen3.5 |
| - fine-tuned |
| - fsdp |
| - nebius |
| datasets: |
| - b-mc2/sql-create-context |
| - gretelai/synthetic_text_to_sql |
| language: |
| - en |
| pipeline_tag: text-generation |
| --- |
| |
| # Qwen3.5-27B-Text2SQL |
|
|
| Fine-tuned [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) for **Text-to-SQL** generation. Given a database schema and a natural language question, the model outputs a clean SQL query. |
|
|
| ## Key Results |
|
|
| Metric | Base Model | This Model | Improvement (pp) |
|---|---|---|---|
| **Gretel Execution Accuracy (3,492 samples)** | 26.5% | **66.7%** | **+40.2** |
| **Gretel Valid SQL** | 37.8% | **89.4%** | +51.6 |
| **Spider Execution Accuracy (1,032 samples, gold standard)** | 46.6% | **55.1%** | +8.5 |
| **Spider Exact Match** | 0.0% | **22.2%** | +22.2 |
| **Spider Keyword Score** | 45.5% | **85.4%** | +39.9 |
|
|
| ### Regression (standard benchmarks via lm-eval-harness) |
|
|
| Benchmark | Base | This Model | Delta (pp) |
|---|---|---|---|
| **MMLU (humanities)** | 81.9% | 83.5% | +1.6 (no regression) |
| **MMLU (STEM)** | 87.2% | 86.7% | -0.5 (no regression) |
| **MMLU (social sciences)** | 92.0% | 92.0% | 0 (no regression) |
| **MMLU (other)** | 87.5% | 87.5% | 0 (no regression) |
| **GSM8K (math, strict)** | 60.4% | 35.4% | **-25.0 (regression)** |
| **HellaSwag (common sense)** | 79.6% | 84.1% | +4.5 (improved) |
| **ARC-Challenge (reasoning)** | 69.3% | 71.3% | +2.0 (improved) |
|
|
| ## Usage |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model = AutoModelForCausalLM.from_pretrained( |
| "mahernaija/Qwen3.5-27B-Text2SQL", |
| torch_dtype="auto", |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| tokenizer = AutoTokenizer.from_pretrained( |
| "mahernaija/Qwen3.5-27B-Text2SQL", |
| trust_remote_code=True, |
| ) |
| |
| schema = "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL);" |
| question = "Find all employees in Engineering with salary above 90000." |
| |
| prompt = f"<|im_start|>system\nYou are a SQL expert. Given a database schema and a natural language question, write the correct SQL query.<|im_end|>\n<|im_start|>user\nSchema: {schema}\nQuestion: {question}<|im_end|>\n<|im_start|>assistant\n" |
| |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
| response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True) |
| print(response) |
| # SELECT * FROM employees WHERE department = 'Engineering' AND salary > 90000 |
| ``` |
|
|
| ### With vLLM |
|
|
| ```bash |
| vllm serve mahernaija/Qwen3.5-27B-Text2SQL --tensor-parallel-size 2 |
| ``` |
|
|
| ```python |
| from openai import OpenAI |
| client = OpenAI(base_url="http://localhost:8000/v1", api_key="token") |
| |
| response = client.chat.completions.create( |
| model="mahernaija/Qwen3.5-27B-Text2SQL", |
| messages=[ |
| {"role": "system", "content": "You are a SQL expert. Given a database schema and a natural language question, write the correct SQL query."}, |
| {"role": "user", "content": "Schema: CREATE TABLE products (id INT, name TEXT, price REAL);\nQuestion: What are the 3 most expensive products?"}, |
| ], |
| max_tokens=256, |
| temperature=0, |
| ) |
| print(response.choices[0].message.content) |
| # SELECT name FROM products ORDER BY price DESC LIMIT 3 |
| ``` |
|
|
| ## Training Details |
|
|
| | Parameter | Value | |
| |---|---| |
| | **Base model** | Qwen/Qwen3.5-27B (26.9B params) | |
| | **Method** | Full fine-tuning with FSDP | |
| | **Hardware** | 16× NVIDIA H200 (2 nodes, Nebius AI Cloud) | |
| | **Training time** | 3 hours 6 minutes | |
| | **Dataset** | sql-create-context (78K) + Gretel synthetic (100K) = 178K samples | |
| | **Epochs** | 1 | |
| | **Batch size** | 2 per GPU × 8 grad accum × 16 GPUs = 256 global | |
| | **Learning rate** | 2e-5 (cosine decay, 5% warmup) | |
| | **Sequence length** | 512 (SQL samples P99=373 tokens) | |
| | **Precision** | BF16 | |
| | **GPU utilization** | 98-100% | |
| | **Final train loss** | 0.144 | |
| | **Final eval loss** | 0.176 | |
|
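For reference, a launch command consistent with the table above might look like the following. This is a sketch only: the script name `train_text2sql.py` and its flags are hypothetical, and the actual training code is not part of this repo.

```shell
# Hypothetical launch: 2 nodes x 8 H200 (16 GPUs), FSDP full fine-tuning in BF16.
# Effective global batch = 2 per GPU x 8 grad accum x 16 GPUs = 256.
torchrun --nnodes 2 --nproc_per_node 8 --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:29500" \
  train_text2sql.py \
  --model_name_or_path Qwen/Qwen3.5-27B \
  --bf16 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-5 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.05 \
  --max_seq_length 512 \
  --num_train_epochs 1
```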
|
| ### Data Preprocessing |
|
|
| - Schemas stripped of INSERT data (model learns SQL from schema structure, not memorized answers) |
| - Manual chat format (bypasses Qwen3.5 `<think>` tag injection) |
| - Label masking: loss only on SQL output, prompt tokens masked with -100 |
| - Deduplication + contamination check between train and eval splits |
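The label-masking step can be sketched in a few lines (a minimal illustration of the idea, not the actual training code; `prompt_ids` and `completion_ids` stand for already-tokenized prompt and SQL completion):

```python
# Loss is computed only on the SQL completion: prompt positions in the
# labels are set to -100, which cross-entropy loss ignores.
def build_labels(prompt_ids, completion_ids):
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [-100] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

input_ids, labels = build_labels([1, 2, 3], [7, 8])
print(labels)  # [-100, -100, -100, 7, 8]
```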
|
|
| ### Evaluation |
|
|
**Gretel Execution Accuracy** (gold standard: executes each query against SQLite and compares result sets):
|
|
| | Complexity | Base | This Model | |
| |---|---|---| |
| | Basic SQL | 23.8% | **71.4%** | |
| | Aggregation | 18.2% | **54.5%** | |
| | Single JOIN | 25.0% | **75.0%** | |
| | Window functions | 0.0% | **33.3%** | |
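Execution accuracy in this sense can be reproduced with the standard-library `sqlite3` module; the snippet below is a simplified sketch of the check (the actual evaluation harness is not included here):

```python
import sqlite3

def execution_match(schema_sql, gold_sql, pred_sql):
    """Run gold and predicted SQL on a fresh in-memory DB; compare result sets."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    try:
        gold = conn.execute(gold_sql).fetchall()
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid SQL counts as a miss
    finally:
        conn.close()
    return sorted(gold) == sorted(pred)  # order-insensitive comparison

schema = "CREATE TABLE t (id INTEGER, x REAL); INSERT INTO t VALUES (1, 2.0), (2, 5.0);"
print(execution_match(schema, "SELECT id FROM t WHERE x > 3", "SELECT id FROM t WHERE x >= 5"))
# True
```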
|
|
| **Spider Benchmark** (1,034 dev questions, public standard): |
| - Exact match: 22.2% |
| - Keyword score: 85.4% |
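Keyword score is not one of the official Spider metrics; a plausible reading is keyword overlap between predicted and gold SQL, roughly as sketched below (the keyword set and formula are assumptions, not the exact scorer used):

```python
# Hypothetical keyword-overlap score: fraction of SQL keywords in the
# gold query that also appear in the prediction.
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "ORDER", "BY", "JOIN",
            "LIMIT", "HAVING", "COUNT", "AVG", "SUM", "MAX", "MIN", "DISTINCT"}

def keyword_score(pred, gold):
    pred_kw = {t for t in pred.upper().split() if t in KEYWORDS}
    gold_kw = {t for t in gold.upper().split() if t in KEYWORDS}
    return len(pred_kw & gold_kw) / max(len(gold_kw), 1)

print(keyword_score("SELECT name FROM t ORDER BY x",
                    "SELECT name FROM t ORDER BY x LIMIT 3"))  # 0.8
```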
|
|
| ### Known Limitations |
|
|
| **Catastrophic forgetting**: Full fine-tuning on SQL-only data caused regression in general capabilities. The model tries to answer non-SQL questions with SQL (56% SQL contamination on general prompts). For production use with mixed tasks, consider: |
| - LoRA fine-tuning instead of full FT |
| - Mixed training data (SQL + general chat) |
| - Using this model only for SQL-specific pipelines |
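If the model must serve behind a mixed-task endpoint anyway, a cheap mitigation is to route or flag responses that look like SQL when the request was not a SQL task. A heuristic sketch (not part of this repo):

```python
import re

# Flags text whose first statement starts with a common SQL verb.
SQL_PATTERN = re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE|WITH|CREATE)\b",
                         re.IGNORECASE)

def looks_like_sql(text):
    return bool(SQL_PATTERN.match(text))

print(looks_like_sql("SELECT * FROM employees"))          # True
print(looks_like_sql("The capital of France is Paris."))  # False
```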
|
|
| ### Regression Test |
|
|
| | Category | Base | This Model | SQL Contamination | |
| |---|---|---|---| |
| | General Knowledge | 84% | 44% | 2/5 | |
| | Math | 100% | 40% | 3/5 | |
| | Code | 48% | 47% | 5/5 | |
| | Language | 90% | 78% | 4/5 | |
| | Common Sense | 82% | 93% | 0/5 | |
|
|
| ## Architecture |
|
|
| - **Architecture**: Qwen3_5ForConditionalGeneration (VLM with text + vision) |
| - **Text backbone**: 64 layers, hidden_size=5120, 24 attention heads, 4 KV heads |
| - **Attention**: Hybrid GDN (linear_attention + full_attention) |
| - **Context**: 262K tokens |
- **Vision**: Built-in 27-layer ViT (weights inherited from the base model, not fine-tuned)
|
|
| ## Files |
|
|
| | File | Description | |
| |---|---| |
| | `model-00001-of-00003.safetensors` | Text backbone weights (shard 1/3) | |
| | `model-00002-of-00003.safetensors` | Text backbone weights (shard 2/3) | |
| | `model-00003-of-00003.safetensors` | Text backbone weights (shard 3/3) | |
| | `model-visual.safetensors` | Vision encoder weights (from base model) | |
| | `config.json` | Full VLM config (required by vLLM) | |
| | `tokenizer.json` | Tokenizer | |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{naija2026qwen35text2sql, |
| title={Qwen3.5-27B-Text2SQL: Fine-tuned Qwen 3.5 for Text-to-SQL}, |
| author={Maher Naija}, |
| year={2026}, |
| url={https://huggingface.co/mahernaija/Qwen3.5-27B-Text2SQL} |
| } |
| ``` |
|
|
| ## License |
|
|
| Apache 2.0 (same as base model [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B)) |
|
|
| ## Acknowledgments |
|
|
| - Base model: [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) by Alibaba |
| - Training data: [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) + [Gretel synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) |
|
|