---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- text2sql
- sql
- nlp
- distillation
- qwen3
datasets:
- distil-labs/text2sql-synthetic
language:
- en
pipeline_tag: text-generation
---
# Distil-Qwen3-4B-Text2SQL
A fine-tuned Qwen3-4B model for converting natural language questions into SQL queries. Trained using knowledge distillation from DeepSeek-V3, this 4B parameter model matches teacher-level accuracy while being small enough to run locally.
## Results
| Metric | DeepSeek-V3 (Teacher) | Qwen3-4B (Base) | **This Model** |
|--------|:---------------------:|:---------------:|:--------------:|
| LLM-as-a-Judge | 80% | 62% | **80%** |
| Exact Match | 48% | 16% | **60%** |
| ROUGE | 87.6% | 84.2% | **89.5%** |
| METEOR | 85.1% | 87.3% | 86.1% |
The fine-tuned model **matches the 685B parameter teacher** on LLM-as-a-Judge accuracy and **exceeds it** on exact match and ROUGE scores.
## Quick Start
### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("distil-labs/distil-qwen3-4b-text2sql")
tokenizer = AutoTokenizer.from_pretrained("distil-labs/distil-qwen3-4b-text2sql")
schema = """CREATE TABLE employees (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
department TEXT,
salary INTEGER
);"""
question = "How many employees earn more than 50000?"
messages = [
{
"role": "system",
"content": """You are a problem solving model working on task_description XML block:
<task_description>You are given a database schema and a natural language question. Generate the SQL query that answers the question.
Input:
- Schema: One or two table definitions in SQL DDL format
- Question: Natural language question about the data
Output:
- A single SQL query that answers the question
- No explanations, comments, or additional text
Rules:
- Use only tables and columns from the provided schema
- Use uppercase SQL keywords (SELECT, FROM, WHERE, etc.)
- Use SQLite-compatible syntax</task_description>
You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else"""
},
{
"role": "user",
"content": f"""Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else
<question>Schema:
{schema}
Question: {question}</question>"""
}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
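Because the model targets SQLite-compatible syntax, one way to sanity-check a generated query is to execute it against an in-memory SQLite database. A minimal sketch, reusing the example schema above (the inserted rows and the generated query are illustrative):

```python
import sqlite3

# Build the example schema in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT,
    salary INTEGER
);""")
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [("Ann", "Engineering", 70000), ("Bob", "Sales", 45000), ("Cid", "Engineering", 52000)],
)

# A query of the kind the model should emit for the example question
generated_sql = "SELECT COUNT(*) FROM employees WHERE salary > 50000"
count = conn.execute(generated_sql).fetchone()[0]
print(count)  # 2
```

Executing against a scratch database catches syntax errors and references to nonexistent columns before the query reaches production data.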
### Using Ollama (GGUF version)
For local inference, use the quantized GGUF versions:
- [distil-qwen3-4b-text2sql-gguf](https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql-gguf) - Full precision GGUF
- [distil-qwen3-4b-text2sql-gguf-4bit](https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql-gguf-4bit) - 4-bit quantized (~2.5GB)
```bash
# Download and create Ollama model
ollama create distil-qwen3-4b-text2sql -f Modelfile
# Run inference
ollama run distil-qwen3-4b-text2sql
```
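The `Modelfile` referenced above is not included in this repository. A minimal sketch, assuming the 4-bit GGUF file has been downloaded to the current directory (the filename below is illustrative; match it to the actual file from the GGUF repo):

```
FROM ./distil-qwen3-4b-text2sql-q4_k_m.gguf
PARAMETER temperature 0
```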
## Model Details
| Property | Value |
|----------|-------|
| Base Model | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Parameters | 4 billion |
| Architecture | Qwen3ForCausalLM |
| Context Length | 262,144 tokens |
| Precision | bfloat16 |
| Training Data | ~10,000 synthetic examples |
| Teacher Model | DeepSeek-V3 |
## Training
This model was trained using the [Distil Labs](https://distillabs.ai) platform:
1. **Seed Data**: 50 hand-validated Text2SQL examples covering various SQL complexities
2. **Synthetic Generation**: Expanded to ~10,000 examples using DeepSeek-V3
3. **Fine-tuning**: 4 epochs on the synthetic dataset
4. **Evaluation**: LLM-as-a-Judge with semantic equivalence checking
### Training Hyperparameters
- Epochs: 4
- Learning Rate: 5e-5 (cosine schedule)
- Batch Size: 1 (with gradient accumulation)
- Total Steps: ~40,000
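The step count follows directly from the dataset size. A quick sanity check, counting each example as one step at batch size 1 (before any gradient accumulation):

```python
examples = 10_000  # approximate synthetic training examples
epochs = 4
batch_size = 1

steps = examples * epochs // batch_size
print(steps)  # 40000
```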
## Task Format
### Input Format
```
Schema:
CREATE TABLE table_name (
column_name DATA_TYPE [CONSTRAINTS],
...
);
Question: Natural language question about the data
```
### Output Format
A single SQL query with:
- Uppercase SQL keywords (SELECT, FROM, WHERE, etc.)
- SQLite-compatible syntax
- No explanations or additional text
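A lightweight post-processing check for this output contract can be built on the standard-library `sqlite3` module. A sketch (the uppercase-keyword check is a heuristic on the first token, not a full parser):

```python
import sqlite3

def check_output(sql: str) -> bool:
    """Heuristic check: one complete SQLite statement starting with an uppercase keyword."""
    sql = sql.strip()
    if not sql.endswith(";"):
        sql += ";"
    # sqlite3.complete_statement returns True once the text forms a complete statement
    if not sqlite3.complete_statement(sql):
        return False
    # Uppercase-keyword heuristic on the leading token
    first_word = sql.split()[0]
    return first_word in {"SELECT", "INSERT", "UPDATE", "DELETE", "WITH"}

print(check_output("SELECT COUNT(*) FROM employees WHERE salary > 50000"))  # True
print(check_output("select * from employees"))  # False
```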
### Supported SQL Features
- **Simple**: SELECT, WHERE, COUNT, SUM, AVG, MAX, MIN
- **Medium**: JOIN, GROUP BY, HAVING, ORDER BY, LIMIT
- **Complex**: Subqueries, multiple JOINs, UNION
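To illustrate the medium tier, here is the kind of JOIN + GROUP BY + ORDER BY query the model should produce, run against a toy two-table schema (all names and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER REFERENCES departments(id),
    salary INTEGER
);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
INSERT INTO employees VALUES
    (1, 'Ann', 1, 70000), (2, 'Bob', 2, 45000), (3, 'Cid', 1, 52000);
""")

# Medium-complexity query: JOIN, GROUP BY, ORDER BY
sql = """SELECT d.name, AVG(e.salary) AS avg_salary
FROM employees e JOIN departments d ON e.department_id = d.id
GROUP BY d.name ORDER BY avg_salary DESC"""
rows = list(conn.execute(sql))
print(rows)  # [('Engineering', 61000.0), ('Sales', 45000.0)]
```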
## Use Cases
- Natural language interfaces to databases
- SQL query assistance and autocompletion
- Database chatbots and conversational BI
- Educational tools for learning SQL
## Limitations
- Optimized for SQLite syntax
- Best with 1-2 table schemas
- May struggle with highly complex nested subqueries
- Trained on English questions only
## License
This model is released under the Apache 2.0 license.
## Links
- [Distil Labs Website](https://distillabs.ai)
- [GitHub](https://github.com/distil-labs)
- [Hugging Face](https://huggingface.co/distil-labs)
## Citation
```bibtex
@misc{distil-qwen3-4b-text2sql,
  author    = {Distil Labs},
  title     = {Distil-Qwen3-4B-Text2SQL: A Fine-tuned Model for Natural Language to SQL},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql}
}
```