---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- text2sql
- sql
- nlp
- distillation
- qwen3
datasets:
- distil-labs/text2sql-synthetic
language:
- en
pipeline_tag: text-generation
---
# Distil-Qwen3-4B-Text2SQL
A fine-tuned Qwen3-4B model for converting natural language questions into SQL queries. Trained using knowledge distillation from DeepSeek-V3, this 4B parameter model matches teacher-level accuracy while being small enough to run locally.
## Results
| Metric | DeepSeek-V3 (Teacher) | Qwen3-4B (Base) | **This Model** |
|--------|:---------------------:|:---------------:|:--------------:|
| LLM-as-a-Judge | 80% | 62% | **80%** |
| Exact Match | 48% | 16% | **60%** |
| ROUGE | 87.6% | 84.2% | **89.5%** |
| METEOR | 85.1% | 87.3% | 86.1% |
The fine-tuned model **matches the 685B parameter teacher** on LLM-as-a-Judge accuracy and **exceeds it** on exact match and ROUGE scores.
## Quick Start
### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("distil-labs/distil-qwen3-4b-text2sql")
tokenizer = AutoTokenizer.from_pretrained("distil-labs/distil-qwen3-4b-text2sql")
schema = """CREATE TABLE employees (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
department TEXT,
salary INTEGER
);"""
question = "How many employees earn more than 50000?"
messages = [
{
"role": "system",
"content": """You are a problem solving model working on task_description XML block:
<task_description>You are given a database schema and a natural language question. Generate the SQL query that answers the question.
Input:
- Schema: One or two table definitions in SQL DDL format
- Question: Natural language question about the data
Output:
- A single SQL query that answers the question
- No explanations, comments, or additional text
Rules:
- Use only tables and columns from the provided schema
- Use uppercase SQL keywords (SELECT, FROM, WHERE, etc.)
- Use SQLite-compatible syntax</task_description>
You will be given a single task in the question XML block
Solve only the task in question block.
Generate only the answer, do not generate anything else"""
},
{
"role": "user",
"content": f"""Now for the real task, solve the task in question block.
Generate only the solution, do not generate anything else
<question>Schema:
{schema}
Question: {question}</question>"""
}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
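Because the model targets SQLite-compatible syntax, one way to sanity-check a generated query is to execute it against an in-memory SQLite database. A minimal sketch, reusing the example schema above (the inserted rows and the generated query are illustrative):

```python
import sqlite3

# Build the example schema in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT,
    salary INTEGER
);""")
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [("Ann", "Engineering", 70000), ("Bob", "Sales", 45000), ("Cid", "Engineering", 52000)],
)

# A query of the kind the model should emit for the example question
generated_sql = "SELECT COUNT(*) FROM employees WHERE salary > 50000"
count = conn.execute(generated_sql).fetchone()[0]
print(count)  # 2
```

Executing against a scratch database catches syntax errors and references to nonexistent columns before the query reaches production data.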
### Using Ollama (GGUF version)
For local inference, use the quantized GGUF versions:
- [distil-qwen3-4b-text2sql-gguf](https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql-gguf) - Full precision GGUF
- [distil-qwen3-4b-text2sql-gguf-4bit](https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql-gguf-4bit) - 4-bit quantized (~2.5GB)
```bash
# Download and create Ollama model
ollama create distil-qwen3-4b-text2sql -f Modelfile
# Run inference
ollama run distil-qwen3-4b-text2sql
```
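The `Modelfile` referenced above is not included in this repository. A minimal sketch, assuming the 4-bit GGUF file has been downloaded to the current directory (the filename below is illustrative; match it to the actual file from the GGUF repo):

```
FROM ./distil-qwen3-4b-text2sql-q4_k_m.gguf
PARAMETER temperature 0
```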
## Model Details
| Property | Value |
|----------|-------|
| Base Model | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) |
| Parameters | 4 billion |
| Architecture | Qwen3ForCausalLM |
| Context Length | 262,144 tokens |
| Precision | bfloat16 |
| Training Data | ~10,000 synthetic examples |
| Teacher Model | DeepSeek-V3 |
## Training
This model was trained using the [Distil Labs](https://distillabs.ai) platform:
1. **Seed Data**: 50 hand-validated Text2SQL examples covering various SQL complexities
2. **Synthetic Generation**: Expanded to ~10,000 examples using DeepSeek-V3
3. **Fine-tuning**: 4 epochs on the synthetic dataset
4. **Evaluation**: LLM-as-a-Judge with semantic equivalence checking
### Training Hyperparameters
- Epochs: 4
- Learning Rate: 5e-5 (cosine schedule)
- Batch Size: 1 (with gradient accumulation)
- Total Steps: ~40,000
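The step count follows directly from the dataset size. A quick sanity check, counting each example as one step at batch size 1 (before any gradient accumulation):

```python
examples = 10_000  # approximate synthetic training examples
epochs = 4
batch_size = 1

steps = examples * epochs // batch_size
print(steps)  # 40000
```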
## Task Format
### Input Format
```
Schema:
CREATE TABLE table_name (
column_name DATA_TYPE [CONSTRAINTS],
...
);
Question: Natural language question about the data
```
### Output Format
A single SQL query with:
- Uppercase SQL keywords (SELECT, FROM, WHERE, etc.)
- SQLite-compatible syntax
- No explanations or additional text
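A lightweight post-processing check for this output contract can be built on the standard-library `sqlite3` module. A sketch (the uppercase-keyword check is a heuristic on the first token, not a full parser):

```python
import sqlite3

def check_output(sql: str) -> bool:
    """Heuristic check: one complete SQLite statement starting with an uppercase keyword."""
    sql = sql.strip()
    if not sql.endswith(";"):
        sql += ";"
    # sqlite3.complete_statement returns True once the text forms a complete statement
    if not sqlite3.complete_statement(sql):
        return False
    # Uppercase-keyword heuristic on the leading token
    first_word = sql.split()[0]
    return first_word in {"SELECT", "INSERT", "UPDATE", "DELETE", "WITH"}

print(check_output("SELECT COUNT(*) FROM employees WHERE salary > 50000"))  # True
print(check_output("select * from employees"))  # False
```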
### Supported SQL Features
- **Simple**: SELECT, WHERE, COUNT, SUM, AVG, MAX, MIN
- **Medium**: JOIN, GROUP BY, HAVING, ORDER BY, LIMIT
- **Complex**: Subqueries, multiple JOINs, UNION
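To illustrate the medium tier, here is the kind of JOIN + GROUP BY + ORDER BY query the model should produce, run against a toy two-table schema (all names and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER REFERENCES departments(id),
    salary INTEGER
);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
INSERT INTO employees VALUES
    (1, 'Ann', 1, 70000), (2, 'Bob', 2, 45000), (3, 'Cid', 1, 52000);
""")

# Medium-complexity query: JOIN, GROUP BY, ORDER BY
sql = """SELECT d.name, AVG(e.salary) AS avg_salary
FROM employees e JOIN departments d ON e.department_id = d.id
GROUP BY d.name ORDER BY avg_salary DESC"""
rows = list(conn.execute(sql))
print(rows)  # [('Engineering', 61000.0), ('Sales', 45000.0)]
```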
## Use Cases
- Natural language interfaces to databases
- SQL query assistance and autocompletion
- Database chatbots and conversational BI
- Educational tools for learning SQL
## Limitations
- Optimized for SQLite syntax
- Best with 1-2 table schemas
- May struggle with highly complex nested subqueries
- Trained on English questions only
## License
This model is released under the Apache 2.0 license.
## Links
- [Distil Labs Website](https://distillabs.ai)
- [GitHub](https://github.com/distil-labs)
- [Hugging Face](https://huggingface.co/distil-labs)
## Citation
```bibtex
@misc{distil-qwen3-4b-text2sql,
  author    = {Distil Labs},
  title     = {Distil-Qwen3-4B-Text2SQL: A Fine-tuned Model for Natural Language to SQL},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/distil-labs/distil-qwen3-4b-text2sql}
}
```