|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
|
|
tags: |
|
|
- text-to-sql |
|
|
- sql-generation |
|
|
- natural-language-to-sql |
|
|
- deepseek |
|
|
- qwen |
|
|
- reasoning |
|
|
- database |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
datasets: |
|
|
- ameet/deepsql_training |
|
|
--- |
|
|
|
|
|
# DeepSQL |
|
|
|
|
|
DeepSQL is a fine-tuned language model specialized in converting natural language questions into SQL queries. It is based on [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) and has been trained to understand database schemas and generate accurate SQL queries from natural language questions. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Model Type**: Causal Language Model (Qwen2-based) |
|
|
- **Architecture**: Qwen2ForCausalLM |
|
|
- **Base Model**: DeepSeek-R1-Distill-Qwen-1.5B |
|
|
- **Parameters**: 1.5B |
|
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with FSDP (Fully Sharded Data Parallel) |
|
|
- **Context Length**: 131,072 tokens |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Hidden Size**: 1,536 |
|
|
- **Intermediate Size**: 8,960 |
|
|
- **Number of Layers**: 28 |
|
|
- **Attention Heads**: 12 |
|
|
- **Key-Value Heads**: 2 |
|
|
- **Vocabulary Size**: 151,936 |
|
|
- **Activation Function**: SiLU |
|
|
- **Normalization**: RMSNorm |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
DeepSQL is designed to: |
|
|
- Convert natural language questions about databases into SQL queries |
|
|
- Understand complex database schemas and relationships |
|
|
- Generate SQL queries with reasoning about the query structure |
|
|
- Handle multi-table joins, aggregations, and complex filtering conditions |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
This model should not be used for: |
|
|
- General-purpose text generation |
|
|
- Tasks unrelated to SQL generation |
|
|
- Generating SQL queries without proper schema validation |
|
|
- Production database operations without proper testing and validation |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was fine-tuned on the SynSQL2.5M dataset, a large-scale text-to-SQL dataset comprising over 2.5 million diverse and high-quality samples spanning more than 16,000 databases from various domains. The training data was processed and formatted as the `ameet/deepsql_training` dataset, which contains: |
|
|
- Natural language questions about databases |
|
|
- Corresponding database schemas (DDL) |
|
|
- Chain-of-thought reasoning for SQL generation |
|
|
- Target SQL queries |
|
|
- External knowledge annotations |
|
|
|
|
|
The SynSQL2.5M dataset provides comprehensive coverage of SQL query patterns, including complex joins, aggregations, subqueries, and various SQL dialects, making it an ideal training resource for text-to-SQL models. |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Training Framework**: PyTorch with FSDP |
|
|
- **Optimizer**: AdamW |
|
|
- **Learning Rate**: 1e-5 |
|
|
- **LoRA Configuration**: |
|
|
- Rank (r): 16 |
|
|
- LoRA Alpha: 32 |
|
|
- Target Modules: q_proj, k_proj, v_proj, o_proj |
|
|
- LoRA Dropout: 0 |
|
|
- **Precision**: bfloat16 |
|
|
- **Training Infrastructure**: Multi-GPU training with FSDP sharding |
|
|
|
|
|
### Training Input Format |
|
|
|
|
|
The model uses a specific chat template format: |
|
|
|
|
|
``` |
|
|
The user asks a question about a database, and the Assistant helps convert it to SQL. The assistant first thinks about how to write the SQL query by analyzing the question, database schema and external knowledge, then provides the final SQL query. |
|
|
The reasoning process and SQL query are enclosed within <think> </think> and <answer> </answer> tags respectively. The answer query must contain the SQL query within ```sql``` tags. |
|
|
|
|
|
Database Schema: {schema} |
|
|
|
|
|
External Knowledge: {external_knowledge} |
|
|
|
|
|
User: {question} |
|
|
|
|
|
Assistant: <think>{cot}</think> |
|
|
<answer> |
|
|
{sql} |
|
|
</answer> |
|
|
``` |
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Using Transformers |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
import torch |
|
|
|
|
|
model_name = "DeepSQL/DeepSQL-1.0" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
# Prepare your input |
|
|
schema = """ |
|
|
CREATE TABLE "vehicles" ( |
|
|
"vehicle_id" INTEGER, |
|
|
"make" TEXT, |
|
|
"model" TEXT, |
|
|
"year" INTEGER, |
|
|
"price" REAL, |
|
|
PRIMARY KEY ("vehicle_id") |
|
|
); |
|
|
""" |
|
|
|
|
|
question = "What is the average price of vehicles by make?" |
|
|
external_knowledge = "" |
|
|
|
|
|
prompt = f"""The user asks a question about a database, and the Assistant helps convert it to SQL. The assistant first thinks about how to write the SQL query by analyzing the question, database schema and external knowledge, then provides the final SQL query. |
|
|
The reasoning process and SQL query are enclosed within <think> </think> and <answer> </answer> tags respectively. The answer query must contain the SQL query within ```sql``` tags. |
|
|
|
|
|
Database Schema: {schema} |
|
|
|
|
|
External Knowledge: {external_knowledge} |
|
|
|
|
|
User: {question} |
|
|
|
|
|
Assistant:""" |
|
|
|
|
|
# Tokenize and generate |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=1024, |
|
|
do_sample=True, |
|
|
temperature=0.8, |
|
|
pad_token_id=tokenizer.eos_token_id |
|
|
) |
|
|
|
|
|
response = tokenizer.decode(outputs[0], skip_special_tokens=False) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
### Using LMStudio with GGUF |
|
|
|
|
|
A GGUF model file (`deepsql-1.0.gguf`) is provided for easy inference with LMStudio: |
|
|
|
|
|
1. **Download the GGUF file**: |
|
|
- Download `deepsql-1.0.gguf` from the model repository |
|
|
|
|
|
2. **Open LMStudio**: |
|
|
- Launch LMStudio on your system |
|
|
|
|
|
3. **Load the model**: |
|
|
- Click on the "Browse" button in the Models section |
|
|
- Navigate to the directory containing `deepsql-1.0.gguf` |
|
|
- Select the file and click "Load" |
|
|
|
|
|
4. **Configure settings** (recommended): |
|
|
- **Temperature**: 0.6-0.8 (for more focused SQL generation) |
|
|
- **Top P**: 0.95 |
|
|
- **Max Tokens**: 1024 (sufficient for most SQL queries) |
|
|
|
|
|
5. **Use the Chat interface**: |
|
|
- Format your prompt according to the training template: |
|
|
``` |
|
|
Database Schema: [your schema here] |
|
|
|
|
|
External Knowledge: [optional external knowledge] |
|
|
|
|
|
User: [your question] |
|
|
|
|
|
Assistant: |
|
|
``` |
|
|
|
|
|
6. **Extract the SQL query**: |
|
|
- The model will generate reasoning in `<think>` tags |
|
|
- The SQL query will be in the `<answer>` section within ```sql``` tags |
|
|
- Copy the SQL query from the response |
|
|
|
|
|
### Example Usage in LMStudio |
|
|
|
|
|
**Input:** |
|
|
``` |
|
|
Database Schema: |
|
|
CREATE TABLE "products" ( |
|
|
"product_id" INTEGER PRIMARY KEY, |
|
|
"name" TEXT, |
|
|
"price" REAL, |
|
|
"category" TEXT |
|
|
); |
|
|
|
|
|
User: What are the top 5 most expensive products in the Electronics category? |
|
|
``` |
|
|
|
|
|
**Expected Output:** |
|
|
``` |
|
|
<think> |
|
|
I need to find the top 5 most expensive products in the Electronics category. This requires: |
|
|
1. Filtering by category = 'Electronics' |
|
|
2. Ordering by price in descending order |
|
|
3. Limiting to 5 results |
|
|
</think> |
|
|
<answer> |
|
|
```sql |
|
|
SELECT product_id, name, price, category |
|
|
FROM products |
|
|
WHERE category = 'Electronics' |
|
|
ORDER BY price DESC |
|
|
LIMIT 5; |
|
|
``` |
|
|
</answer> |
|
|
``` |
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
- The model may generate SQL queries that are syntactically correct but logically incorrect |
|
|
- Performance may vary depending on the complexity of the database schema |
|
|
- The model is trained primarily on English language questions |
|
|
- Generated SQL queries should always be validated and tested before execution on production databases |
|
|
- The model may struggle with very complex multi-table joins or advanced SQL features not present in the training data |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The model's performance should be evaluated on: |
|
|
- SQL accuracy (syntactic correctness) |
|
|
- SQL execution success rate |
|
|
- Semantic correctness of generated queries |
|
|
- Handling of edge cases and complex schemas |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use DeepSQL in your research or applications, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{deepsql2026, |
|
|
title={DeepSQL: A Fine-tuned Language Model for Text-to-SQL Generation}, |
|
|
author={Your Name}, |
|
|
year={2026}, |
|
|
publisher={Hugging Face}, |
|
|
howpublished={\url{https://huggingface.co/your-username/deepsql}} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Dataset Citation |
|
|
|
|
|
This model was trained using the SynSQL2.5M dataset. If you use this model, please also cite the SynSQL2.5M dataset: |
|
|
|
|
|
```bibtex |
|
|
@article{li2025omnisql, |
|
|
title={OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale}, |
|
|
author={Li, Haoyang and Wu, Song and Zhang, Xinyuan and Huang, Xinyi and Zhang, Jiaqi and Jiang, Fei and Wang, Siyuan and Zhang, Tianyi and Chen, Jing and Shi, Rui}, |
|
|
journal={Proceedings of the VLDB Endowment}, |
|
|
volume={18}, |
|
|
number={12}, |
|
|
pages={4695--4706}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions, issues, or contributions, please open an issue on the model repository. |
|
|
|
|
|
## License |
|
|
|
|
|
This model is licensed under the Apache 2.0 license. See the LICENSE file for more details. |