---
license: apache-2.0
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
tags:
- text-to-sql
- sql-generation
- natural-language-to-sql
- deepseek
- qwen
- reasoning
- database
language:
- en
pipeline_tag: text-generation
datasets:
- ameet/deepsql_training
---

# DeepSQL

DeepSQL is a fine-tuned language model specialized in converting natural language questions into SQL queries. It is based on [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) and has been trained to understand database schemas and generate accurate SQL queries from natural language questions.

## Model Details

### Model Description

- **Model Type**: Causal Language Model (Qwen2-based)
- **Architecture**: Qwen2ForCausalLM
- **Base Model**: DeepSeek-R1-Distill-Qwen-1.5B
- **Parameters**: 1.5B
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with FSDP (Fully Sharded Data Parallel)
- **Context Length**: 131,072 tokens

### Model Architecture

- **Hidden Size**: 1,536
- **Intermediate Size**: 8,960
- **Number of Layers**: 28
- **Attention Heads**: 12
- **Key-Value Heads**: 2 (grouped-query attention)
- **Vocabulary Size**: 151,936
- **Activation Function**: SiLU
- **Normalization**: RMSNorm

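A few practical quantities follow directly from these numbers. The arithmetic below is a sketch derived only from the figures listed above, not from inspecting the checkpoint:

```python
# Sketch: derive a few quantities from the architecture numbers above.
# These are arithmetic consequences of the listed config, not measured values.

hidden_size = 1536
num_attention_heads = 12
num_key_value_heads = 2
num_layers = 28

head_dim = hidden_size // num_attention_heads  # dimension per attention head

# With grouped-query attention, the KV cache stores num_key_value_heads
# heads instead of num_attention_heads, shrinking it proportionally.
kv_cache_reduction = num_attention_heads // num_key_value_heads

# Per-token KV cache entries across all layers (keys + values),
# before multiplying by bytes per element:
kv_entries_per_token = 2 * num_layers * num_key_value_heads * head_dim

print(head_dim, kv_cache_reduction, kv_entries_per_token)
```

So each head is 128-dimensional, and the 12:2 head ratio makes the KV cache 6x smaller than full multi-head attention would require, which matters at the model's long 131K context length.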
45
+ ## Intended Use
46
+
47
+ ### Direct Use
48
+
49
+ DeepSQL is designed to:
50
+ - Convert natural language questions about databases into SQL queries
51
+ - Understand complex database schemas and relationships
52
+ - Generate SQL queries with reasoning about the query structure
53
+ - Handle multi-table joins, aggregations, and complex filtering conditions
54
+
55
+ ### Out-of-Scope Use
56
+
57
+ This model should not be used for:
58
+ - General-purpose text generation
59
+ - Tasks unrelated to SQL generation
60
+ - Generating SQL queries without proper schema validation
61
+ - Production database operations without proper testing and validation
62
+
63
+ ## Training Details
64
+
65
+ ### Training Data
66
+
67
+ The model was fine-tuned on the SynSQL2.5M dataset, a large-scale text-to-SQL dataset comprising over 2.5 million diverse and high-quality samples spanning more than 16,000 databases from various domains. The training data was processed and formatted as the `ameet/deepsql_training` dataset, which contains:
68
+ - Natural language questions about databases
69
+ - Corresponding database schemas (DDL)
70
+ - Chain-of-thought reasoning for SQL generation
71
+ - Target SQL queries
72
+ - External knowledge annotations
73
+
74
+ The SynSQL2.5M dataset provides comprehensive coverage of SQL query patterns, including complex joins, aggregations, subqueries, and various SQL dialects, making it an ideal training resource for text-to-SQL models.
75
+
76
+ ### Training Procedure
77
+
78
+ - **Training Framework**: PyTorch with FSDP
79
+ - **Optimizer**: AdamW
80
+ - **Learning Rate**: 1e-5
81
+ - **LoRA Configuration**:
82
+ - Rank (r): 16
83
+ - LoRA Alpha: 32
84
+ - Target Modules: q_proj, k_proj, v_proj, o_proj
85
+ - LoRA Dropout: 0
86
+ - **Precision**: bfloat16
87
+ - **Training Infrastructure**: Multi-GPU training with FSDP sharding
88
+
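The LoRA configuration above implies a small trainable-parameter budget. A back-of-the-envelope sketch, with projection shapes inferred from the architecture section rather than read from the checkpoint:

```python
# Sketch: estimate trainable LoRA parameters for the configuration above.
# Projection shapes are inferred from the architecture section
# (hidden size 1536, 12 query heads, 2 KV heads, 128-dim heads).

r = 16
hidden = 1536
head_dim = hidden // 12   # 128
kv_dim = 2 * head_dim     # 256: output width of k_proj/v_proj with 2 KV heads

def lora_params(d_in, d_out, rank=r):
    # LoRA adds a rank-r pair of matrices: A (d_in x r) and B (r x d_out).
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden)    # q_proj: 1536 -> 1536
    + lora_params(hidden, kv_dim)  # k_proj: 1536 -> 256
    + lora_params(hidden, kv_dim)  # v_proj: 1536 -> 256
    + lora_params(hidden, hidden)  # o_proj: 1536 -> 1536
)

total = 28 * per_layer  # 28 transformer layers
print(per_layer, total)
```

Roughly 4.4M trainable parameters, under 0.3% of the 1.5B base model, which is what makes multi-GPU LoRA fine-tuning with FSDP lightweight.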
### Training Input Format

The model uses a specific chat template format:

```
The user asks a question about a database, and the Assistant helps convert it to SQL. The assistant first thinks about how to write the SQL query by analyzing the question, database schema and external knowledge, then provides the final SQL query.
The reasoning process and SQL query are enclosed within <think> </think> and <answer> </answer> tags respectively. The answer query must contain the SQL query within ```sql``` tags.

Database Schema: {schema}

External Knowledge: {external_knowledge}

User: {question}

Assistant: <think>{cot}</think>
<answer>
{sql}
</answer>
```

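The placeholders in the template can be filled programmatically when preparing training examples. A minimal sketch; the helper name `build_training_example` is illustrative, not part of any released code:

```python
# Sketch: fill the training template above for one example.
# build_training_example is a hypothetical helper, shown for illustration.

TEMPLATE = (
    "The user asks a question about a database, and the Assistant helps convert it to SQL. "
    "The assistant first thinks about how to write the SQL query by analyzing the question, "
    "database schema and external knowledge, then provides the final SQL query.\n"
    "The reasoning process and SQL query are enclosed within <think> </think> and "
    "<answer> </answer> tags respectively. The answer query must contain the SQL query "
    "within ```sql``` tags.\n\n"
    "Database Schema: {schema}\n\n"
    "External Knowledge: {external_knowledge}\n\n"
    "User: {question}\n\n"
    "Assistant: <think>{cot}</think>\n<answer>\n{sql}\n</answer>"
)

def build_training_example(schema, question, cot, sql, external_knowledge=""):
    return TEMPLATE.format(
        schema=schema,
        question=question,
        cot=cot,
        sql=sql,
        external_knowledge=external_knowledge,
    )

example = build_training_example(
    schema='CREATE TABLE "users" ("id" INTEGER PRIMARY KEY, "name" TEXT);',
    question="How many users are there?",
    cot="Count all rows in users.",
    sql="```sql\nSELECT COUNT(*) FROM users;\n```",
)
print(example)
```

At inference time the same template is used but truncated after `Assistant:`, so the model produces the `<think>` and `<answer>` sections itself.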
## How to Use

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "your-username/deepsql"  # Replace with your Hugging Face model path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Prepare your input
schema = """
CREATE TABLE "vehicles" (
    "vehicle_id" INTEGER,
    "make" TEXT,
    "model" TEXT,
    "year" INTEGER,
    "price" REAL,
    PRIMARY KEY ("vehicle_id")
);
"""

question = "What is the average price of vehicles by make?"
external_knowledge = ""

prompt = f"""The user asks a question about a database, and the Assistant helps convert it to SQL. The assistant first thinks about how to write the SQL query by analyzing the question, database schema and external knowledge, then provides the final SQL query.
The reasoning process and SQL query are enclosed within <think> </think> and <answer> </answer> tags respectively. The answer query must contain the SQL query within ```sql``` tags.

Database Schema: {schema}

External Knowledge: {external_knowledge}

User: {question}

Assistant:"""

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id
)

# Keep special tokens so the <think>/<answer> tags remain visible
response = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(response)
```

### Using LMStudio with GGUF

A GGUF model file (`deepsql-1.0.gguf`) is provided for easy inference with LMStudio:

1. **Download the GGUF file**:
   - Download `deepsql-1.0.gguf` from the model repository

2. **Open LMStudio**:
   - Launch LMStudio on your system

3. **Load the model**:
   - Click the "Browse" button in the Models section
   - Navigate to the directory containing `deepsql-1.0.gguf`
   - Select the file and click "Load"

4. **Configure settings** (recommended):
   - **Temperature**: 0.6-0.8 (for more focused SQL generation)
   - **Top P**: 0.95
   - **Max Tokens**: 1024 (sufficient for most SQL queries)

5. **Use the Chat interface**:
   - Format your prompt according to the training template:
   ```
   Database Schema: [your schema here]

   External Knowledge: [optional external knowledge]

   User: [your question]

   Assistant:
   ```

6. **Extract the SQL query**:
   - The model generates its reasoning inside `<think>` tags
   - The SQL query appears in the `<answer>` section within ```sql``` tags
   - Copy the SQL query from the response

### Example Usage in LMStudio

**Input:**
```
Database Schema:
CREATE TABLE "products" (
    "product_id" INTEGER PRIMARY KEY,
    "name" TEXT,
    "price" REAL,
    "category" TEXT
);

User: What are the top 5 most expensive products in the Electronics category?
```

**Expected Output:**
````
<think>
I need to find the top 5 most expensive products in the Electronics category. This requires:
1. Filtering by category = 'Electronics'
2. Ordering by price in descending order
3. Limiting to 5 results
</think>
<answer>
```sql
SELECT product_id, name, price, category
FROM products
WHERE category = 'Electronics'
ORDER BY price DESC
LIMIT 5;
```
</answer>
````

## Limitations and Bias

- The model may generate SQL queries that are syntactically correct but logically incorrect
- Performance may vary with the complexity of the database schema
- The model is trained primarily on English-language questions
- Generated SQL queries should always be validated and tested before execution on production databases
- The model may struggle with very complex multi-table joins or advanced SQL features absent from the training data

## Evaluation

The model's performance should be evaluated on:
- SQL accuracy (syntactic correctness)
- SQL execution success rate
- Semantic correctness of generated queries
- Handling of edge cases and complex schemas

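The first two criteria can be checked cheaply with SQLite. A minimal sketch using an in-memory database built from the example schema above; this is an illustration, not the project's actual evaluation harness:

```python
import sqlite3

# Sketch: check syntactic correctness and execution success of a generated
# query against an in-memory SQLite database built from the schema.
def check_query(schema: str, sql: str) -> dict:
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    result = {"parses": False, "executes": False}
    try:
        # EXPLAIN makes SQLite compile the statement without running it,
        # so it catches syntax errors and unknown tables/columns.
        conn.execute("EXPLAIN " + sql)
        result["parses"] = True
        conn.execute(sql).fetchall()
        result["executes"] = True
    except sqlite3.Error:
        pass
    finally:
        conn.close()
    return result

schema = 'CREATE TABLE "products" ("product_id" INTEGER PRIMARY KEY, "name" TEXT, "price" REAL, "category" TEXT);'
good = "SELECT name FROM products WHERE category = 'Electronics' ORDER BY price DESC LIMIT 5;"
bad = "SELEC name FROM products;"  # deliberate typo: should fail to parse

print(check_query(schema, good))
print(check_query(schema, bad))
```

Semantic correctness still requires comparing result sets against gold queries on populated databases, which execution checks alone cannot establish.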
## Citation

If you use DeepSQL in your research or applications, please cite:

```bibtex
@misc{deepsql2026,
  title={DeepSQL: A Fine-tuned Language Model for Text-to-SQL Generation},
  author={Your Name},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/your-username/deepsql}}
}
```

### Dataset Citation

This model was trained using the SynSQL-2.5M dataset. If you use this model, please also cite the dataset:

```bibtex
@article{li2025omnisql,
  title={OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale},
  author={Li, Haoyang and Wu, Song and Zhang, Xinyuan and Huang, Xinyi and Zhang, Jiaqi and Jiang, Fei and Wang, Siyuan and Zhang, Tianyi and Chen, Jing and Shi, Rui},
  journal={Proceedings of the VLDB Endowment},
  volume={18},
  number={12},
  pages={4695--4706},
  year={2025}
}
```

## Model Card Contact

For questions, issues, or contributions, please open an issue on the model repository.

## License

This model is licensed under the Apache 2.0 license. See the LICENSE file for details.