DeepSQL-1.0 / README.md

Update README.md

202a294 verified 21 days ago

8.53 kB

	---
	license: apache-2.0
	base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
	tags:
	- text-to-sql
	- sql-generation
	- natural-language-to-sql
	- deepseek
	- qwen
	- reasoning
	- database
	language:
	- en
	pipeline_tag: text-generation
	datasets:
	- ameet/deepsql_training
	---

	# DeepSQL

	DeepSQL is a fine-tuned language model specialized in converting natural language questions into SQL queries. It is based on [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) and has been trained to understand database schemas and generate accurate SQL queries from natural language questions.

	## Model Details

	### Model Description

	- Model Type: Causal Language Model (Qwen2-based)
	- Architecture: Qwen2ForCausalLM
	- Base Model: DeepSeek-R1-Distill-Qwen-1.5B
	- Parameters: 1.5B
	- Fine-tuning Method: LoRA (Low-Rank Adaptation) with FSDP (Fully Sharded Data Parallel)
	- Context Length: 131,072 tokens

	### Model Architecture

	- Hidden Size: 1,536
	- Intermediate Size: 8,960
	- Number of Layers: 28
	- Attention Heads: 12
	- Key-Value Heads: 2
	- Vocabulary Size: 151,936
	- Activation Function: SiLU
	- Normalization: RMSNorm

	## Intended Use

	### Direct Use

	DeepSQL is designed to:
	- Convert natural language questions about databases into SQL queries
	- Understand complex database schemas and relationships
	- Generate SQL queries with reasoning about the query structure
	- Handle multi-table joins, aggregations, and complex filtering conditions

	### Out-of-Scope Use

	This model should not be used for:
	- General-purpose text generation
	- Tasks unrelated to SQL generation
	- Generating SQL queries without proper schema validation
	- Production database operations without proper testing and validation

	## Training Details

	### Training Data

	The model was fine-tuned on the SynSQL2.5M dataset, a large-scale text-to-SQL dataset comprising over 2.5 million diverse and high-quality samples spanning more than 16,000 databases from various domains. The training data was processed and formatted as the `ameet/deepsql_training` dataset, which contains:
	- Natural language questions about databases
	- Corresponding database schemas (DDL)
	- Chain-of-thought reasoning for SQL generation
	- Target SQL queries
	- External knowledge annotations

	The SynSQL2.5M dataset provides comprehensive coverage of SQL query patterns, including complex joins, aggregations, subqueries, and various SQL dialects, making it an ideal training resource for text-to-SQL models.

	### Training Procedure

	- Training Framework: PyTorch with FSDP
	- Optimizer: AdamW
	- Learning Rate: 1e-5
	- LoRA Configuration:
	- Rank (r): 16
	- LoRA Alpha: 32
	- Target Modules: q_proj, k_proj, v_proj, o_proj
	- LoRA Dropout: 0
	- Precision: bfloat16
	- Training Infrastructure: Multi-GPU training with FSDP sharding

	### Training Input Format

	The model uses a specific chat template format:

	```
	The user asks a question about a database, and the Assistant helps convert it to SQL. The assistant first thinks about how to write the SQL query by analyzing the question, database schema and external knowledge, then provides the final SQL query.
	The reasoning process and SQL query are enclosed within <think> </think> and <answer> </answer> tags respectively. The answer query must contain the SQL query within ```sql``` tags.

	Database Schema: {schema}

	External Knowledge: {external_knowledge}

	User: {question}

	Assistant: <think>{cot}</think>
	<answer>
	{sql}
	</answer>
	```

	## How to Use

	### Using Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_name = "DeepSQL/DeepSQL-1.0"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(
	model_name,
	torch_dtype=torch.bfloat16,
	device_map="auto"
	)

	# Prepare your input
	schema = """
	CREATE TABLE "vehicles" (
	"vehicle_id" INTEGER,
	"make" TEXT,
	"model" TEXT,
	"year" INTEGER,
	"price" REAL,
	PRIMARY KEY ("vehicle_id")
	);
	"""

	question = "What is the average price of vehicles by make?"
	external_knowledge = ""

	prompt = f"""The user asks a question about a database, and the Assistant helps convert it to SQL. The assistant first thinks about how to write the SQL query by analyzing the question, database schema and external knowledge, then provides the final SQL query.
	The reasoning process and SQL query are enclosed within <think> </think> and <answer> </answer> tags respectively. The answer query must contain the SQL query within ```sql``` tags.

	Database Schema: {schema}

	External Knowledge: {external_knowledge}

	User: {question}

	Assistant:"""

	# Tokenize and generate
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(
	**inputs,
	max_new_tokens=1024,
	do_sample=True,
	temperature=0.8,
	pad_token_id=tokenizer.eos_token_id
	)

	response = tokenizer.decode(outputs[0], skip_special_tokens=False)
	print(response)
	```

	### Using LMStudio with GGUF

	A GGUF model file (`deepsql-1.0.gguf`) is provided for easy inference with LMStudio:

	1. Download the GGUF file:
	- Download `deepsql-1.0.gguf` from the model repository

	2. Open LMStudio:
	- Launch LMStudio on your system

	3. Load the model:
	- Click on the "Browse" button in the Models section
	- Navigate to the directory containing `deepsql-1.0.gguf`
	- Select the file and click "Load"

	4. Configure settings (recommended):
	- Temperature: 0.6-0.8 (for more focused SQL generation)
	- Top P: 0.95
	- Max Tokens: 1024 (sufficient for most SQL queries)

	5. Use the Chat interface:
	- Format your prompt according to the training template:
	```
	Database Schema: [your schema here]

	External Knowledge: [optional external knowledge]

	User: [your question]

	Assistant:
	```

	6. Extract the SQL query:
	- The model will generate reasoning in `<think>` tags
	- The SQL query will be in the `<answer>` section within ```sql``` tags
	- Copy the SQL query from the response

	### Example Usage in LMStudio

	Input:
	```
	Database Schema:
	CREATE TABLE "products" (
	"product_id" INTEGER PRIMARY KEY,
	"name" TEXT,
	"price" REAL,
	"category" TEXT
	);

	User: What are the top 5 most expensive products in the Electronics category?
	```

	Expected Output:
	```
	<think>
	I need to find the top 5 most expensive products in the Electronics category. This requires:
	1. Filtering by category = 'Electronics'
	2. Ordering by price in descending order
	3. Limiting to 5 results
	</think>
	<answer>
	```sql
	SELECT product_id, name, price, category
	FROM products
	WHERE category = 'Electronics'
	ORDER BY price DESC
	LIMIT 5;
	```
	</answer>
	```

	## Limitations and Bias

	- The model may generate SQL queries that are syntactically correct but logically incorrect
	- Performance may vary depending on the complexity of the database schema
	- The model is trained primarily on English language questions
	- Generated SQL queries should always be validated and tested before execution on production databases
	- The model may struggle with very complex multi-table joins or advanced SQL features not present in the training data

	## Evaluation

	The model's performance should be evaluated on:
	- SQL accuracy (syntactic correctness)
	- SQL execution success rate
	- Semantic correctness of generated queries
	- Handling of edge cases and complex schemas

	## Citation

	If you use DeepSQL in your research or applications, please cite:

	```bibtex
	@misc{deepsql2026,
	title={DeepSQL: A Fine-tuned Language Model for Text-to-SQL Generation},
	author={Your Name},
	year={2026},
	publisher={Hugging Face},
	howpublished={\url{https://huggingface.co/your-username/deepsql}}
	}
	```

	### Dataset Citation

	This model was trained using the SynSQL2.5M dataset. If you use this model, please also cite the SynSQL2.5M dataset:

	```bibtex
	@article{li2025omnisql,
	title={OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale},
	author={Li, Haoyang and Wu, Song and Zhang, Xinyuan and Huang, Xinyi and Zhang, Jiaqi and Jiang, Fei and Wang, Siyuan and Zhang, Tianyi and Chen, Jing and Shi, Rui},
	journal={Proceedings of the VLDB Endowment},
	volume={18},
	number={12},
	pages={4695--4706},
	year={2025}
	}
	```

	## Model Card Contact

	For questions, issues, or contributions, please open an issue on the model repository.

	## License

	This model is licensed under the Apache 2.0 license. See the LICENSE file for more details.