# Qwen2.5-Coder-7B Text-to-SQL (SFT + DPO)
A LoRA adapter fine-tuned from Qwen/Qwen2.5-Coder-7B-Instruct for text-to-SQL generation. It reaches 78.2% result accuracy on Spider V1, outperforming both of the frontier models used to build its training data.
| Model | Spider V1 Result Accuracy |
|---|---|
| This model (7B fine-tuned) | 78.2% |
| Grok-4 (xAI) | 73.7% |
| DeepSeek V3 | 71.8% |
GitHub: finetuning-text-to-sql · Dataset: jk200201/spider-dpo-1040
## How to Use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter = "jk200201/qwen2.5-coder-7b-sql-dpo"

# Load the base model, then apply the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

schema = """
CREATE TABLE concerts (concert_id INT, concert_name TEXT, year INT);
CREATE TABLE singers (singer_id INT, name TEXT, country TEXT);
CREATE TABLE singer_in_concert (concert_id INT, singer_id INT);
"""
question = "How many singers performed in concerts held after 2015?"

prompt = f"""You are an expert SQL query generator. Given a database schema and a natural language question, generate a single valid SQL query that answers the question. Output ONLY the SQL query, nothing else.
### Schema:
{schema}
### Question:
{question}
### SQL:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
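Before trusting a generated query, it can be worth checking that it at least parses and executes against the schema. A minimal sketch (not part of this model card's pipeline) runs the query against an empty in-memory SQLite database; `validate_sql` is a hypothetical helper:

```python
import sqlite3

SCHEMA = """
CREATE TABLE concerts (concert_id INT, concert_name TEXT, year INT);
CREATE TABLE singers (singer_id INT, name TEXT, country TEXT);
CREATE TABLE singer_in_concert (concert_id INT, singer_id INT);
"""

def validate_sql(sql: str, schema: str = SCHEMA) -> bool:
    """Return True if the query parses and executes against the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)  # build empty tables from the DDL
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

print(validate_sql(
    "SELECT COUNT(DISTINCT sc.singer_id) "
    "FROM singer_in_concert sc JOIN concerts c ON sc.concert_id = c.concert_id "
    "WHERE c.year > 2015;"
))  # a well-formed query executes, so this prints True
```

This catches syntax errors and references to nonexistent tables or columns, but not semantically wrong-but-valid SQL.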
## Training Methodology
**Stage 1 (SFT):** supervised fine-tuning on 7,000 Spider training examples with gold SQL; 3 epochs, learning rate 2e-4.

**Stage 2 (DPO):** preference optimization on 1,040 pairs built from frontier-model disagreements:
- Ran Grok-4 and DeepSeek V3 on all 1,034 Spider dev questions
- For cases where both models were correct but wrote different SQL, used Gemini 2.5 Flash as a third-party judge to pick the better query
- Used those judgments as DPO preference signal
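The pair-construction logic above could be sketched as follows. This is a hypothetical reconstruction, not the repo's actual code: the correctness flags stand in for Spider's execution-accuracy check, and `judge` stubs the Gemini 2.5 Flash call. The three branches correspond to the three pair categories tallied in the table below:

```python
def build_pair(question, gold_sql, sql_a, sql_b, a_correct, b_correct, judge):
    """Turn two model outputs for one question into a DPO preference pair."""
    if a_correct and b_correct and sql_a != sql_b:
        # Both correct but different: ask the third-party judge which to prefer.
        verdict = judge(question, sql_a, sql_b)
        chosen, rejected = (sql_a, sql_b) if verdict == "a" else (sql_b, sql_a)
        category = "judge-resolved"
    elif a_correct != b_correct:
        # Exactly one model is correct: clear preference.
        chosen, rejected = (sql_a, sql_b) if a_correct else (sql_b, sql_a)
        category = "clear-preference"
    elif not a_correct and not b_correct:
        # Both wrong: pair the gold SQL against a wrong model output.
        chosen, rejected, category = gold_sql, sql_a, "gold-vs-wrong"
    else:
        return None  # both correct and identical: no preference signal
    return {"prompt": question, "chosen": chosen,
            "rejected": rejected, "category": category}
```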
| DPO pair category | Count |
|---|---|
| Judge-resolved (both correct, different SQL) | 478 |
| Gold SQL vs wrong model output | 445 |
| Clear preference (one right, one wrong) | 117 |
Training infrastructure: AWS EC2 g5.xlarge (NVIDIA A10G), QLoRA 4-bit NF4 quantization, LoRA rank 32 / alpha 64, trained with LLaMA-Factory.
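A LLaMA-Factory run with these settings could look roughly like the YAML below. This is a sketch, not the repo's actual config: the dataset name and output directory are hypothetical, and the per-device batch size / accumulation split is one way to reach an effective batch size of 16.

```yaml
# Hypothetical LLaMA-Factory config for the DPO stage (field names assumed
# to follow LLaMA-Factory's YAML conventions).
model_name_or_path: Qwen/Qwen2.5-Coder-7B-Instruct
stage: dpo
do_train: true
finetuning_type: lora
lora_rank: 32
lora_alpha: 64
quantization_bit: 4              # QLoRA 4-bit NF4
dataset: spider_dpo_1040         # hypothetical registered dataset name
template: qwen
per_device_train_batch_size: 2
gradient_accumulation_steps: 8   # 2 x 8 = effective batch size 16
learning_rate: 5.0e-5
num_train_epochs: 2
lr_scheduler_type: cosine
output_dir: saves/qwen2.5-coder-7b-sql-dpo   # hypothetical path
```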
## Training Hyperparameters
| Parameter | SFT | DPO |
|---|---|---|
| Epochs | 3 | 2 |
| Learning rate | 2e-4 | 5e-5 |
| LR schedule | cosine | cosine |
| Effective batch size | 16 | 16 |
| LoRA rank / alpha | 32 / 64 | 32 / 64 |
| Quantization | 4-bit NF4 | 4-bit NF4 |
| Final loss | 0.0715 | 0.2208 |
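For context, the DPO final loss above is the standard direct preference optimization objective, which trains the policy $\pi_\theta$ to rank the chosen query $y_w$ above the rejected one $y_l$ relative to a frozen reference model $\pi_{\mathrm{ref}}$ (the SFT checkpoint). The preference-strength coefficient $\beta$ is not stated in this card:

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)
$$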
## Citation
If you use the Spider dataset, please cite:
```bibtex
@inproceedings{Yu2018Spider,
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
  booktitle = {EMNLP},
  year      = {2018}
}
```