Upload README.md with huggingface_hub

f287a27 verified 2 days ago

4.36 kB

	---
	language:
	- en
	license: mit
	tags:
	- text-to-sql
	- sql
	- nlp
	- fine-tuning
	- qlora
	- lora
	- phi-3
	- peft
	base_model: microsoft/Phi-3-mini-4k-instruct
	datasets:
	- b-mc2/sql-create-context
	metrics:
	- bleu
	pipeline_tag: text-generation
	---

	# QueryCraft — Phi-3 Mini Fine-Tuned for Text-to-SQL

	Fine-tuned Phi-3 Mini 3.8B using QLoRA on 76,000 Text-to-SQL examples.
	Converts natural language questions into valid SQL queries.

	## Evaluation Results

	\| Metric \| Base Model \| Fine-Tuned \|
	\|---\|---\|---\|
	\| Exact Match \| 0.0% \| 82.0% \|
	\| Execution Accuracy \| 84.0% \| 96.0% \|
	\| BLEU Score \| 55.79 \| 96.42 \|

	Evaluated on 50 held-out validation examples not seen during training.

	## Model Details

	\| Property \| Value \|
	\|---\|---\|
	\| Base model \| microsoft/Phi-3-mini-4k-instruct (3.8B params) \|
	\| Fine-tuning method \| QLoRA (4-bit NF4 + LoRA) \|
	\| LoRA rank \| r=16, alpha=32 \|
	\| Trainable parameters \| 8,912,896 (0.23%) \|
	\| Training examples \| 76,577 \|
	\| Training hardware \| NVIDIA RTX 5060 Ti 8GB \|
	\| Training time \| 3 hours 2 minutes \|
	\| Final train loss \| 0.5677 \|
	\| Max sequence length \| 256 tokens \|

	## How to Use

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
	from peft import PeftModel

	base_model = "microsoft/Phi-3-mini-4k-instruct"
	adapter = "Sid9797/querycraft-phi3-sql"

	bnb_config = BitsAndBytesConfig(
	load_in_4bit=True,
	bnb_4bit_quant_type="nf4",
	bnb_4bit_compute_dtype=torch.bfloat16,
	bnb_4bit_use_double_quant=True,
	)

	tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	base_model,
	quantization_config=bnb_config,
	device_map="cuda:0",
	trust_remote_code=True,
	torch_dtype=torch.bfloat16,
	)
	model = PeftModel.from_pretrained(model, adapter)
	model.eval()

	prompt = '''### System:
	You are a SQL expert. Given a database schema and a natural language question, generate a valid SQL query that answers the question. Output only the SQL query with no explanation.

	### Schema:
	CREATE TABLE employees (id INTEGER, name VARCHAR, department VARCHAR, salary FLOAT)

	### Question:
	What is the average salary by department?

	### SQL:
	'''

	inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
	with torch.no_grad():
	outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

	sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
	print(sql.strip().split("\n")[0])
	# SELECT AVG(salary) FROM employees GROUP BY department
	```

	## Prompt Format

	The model was trained on the Alpaca instruction format:
	System:
	You are a SQL expert. Given a database schema and a natural language question,
	generate a valid SQL query that answers the question.
	Output only the SQL query with no explanation.
	Schema:
	{CREATE TABLE statements}
	Question:
	{natural language question}
	SQL:
	{model generates SQL here}
	## Training Details

	- Dataset: [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context)
	— 78,577 examples with inline CREATE TABLE schemas
	- Train/Val split: 76,577 train / 2,000 validation (seeded shuffle)
	- Quantization: 4-bit NF4 with double quantization (bitsandbytes)
	- LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
	- Optimizer: AdamW with cosine LR schedule, warmup_ratio=0.05
	- Effective batch size: 16 (batch_size=4, gradient_accumulation=4)
	- Packing: Enabled — short examples concatenated to fill 256-token sequences

	## Why the Base Model Scored 0% Exact Match

	The base Phi-3 Mini, without fine-tuning, consistently wrapped SQL output
	in markdown code fences (`` ```sql ... ``` ``) and appended semicolons.
	This formatting breaks exact match evaluation even when the SQL logic is correct.
	Fine-tuning on consistently formatted examples eliminated this entirely.

	## Limitations

	- Optimised for single-table and simple multi-table queries
	- Schema must be provided as CREATE TABLE SQL statements
	- Best results on English-language questions
	- May struggle with highly complex nested subqueries

	## Links

	- GitHub: https://github.com/Siddhesh-Ai9797/querycraft
	- Base Model: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
	- Training Dataset: https://huggingface.co/datasets/b-mc2/sql-create-context