File size: 10,337 Bytes
6b2ce50
 
1a87275
 
 
 
 
 
 
 
 
 
 
 
 
6b2ce50
 
1a87275
6b2ce50
1a87275
6b2ce50
 
 
 
157c90c
6b2ce50
4798cb5
157c90c
1a87275
 
45714f3
6b2ce50
45714f3
6b2ce50
 
 
1a87275
6b2ce50
 
 
 
 
1a87275
6b2ce50
45714f3
6b2ce50
1a87275
6b2ce50
 
 
1a87275
6b2ce50
1a87275
6b2ce50
1a87275
6b2ce50
 
 
1a87275
6b2ce50
 
 
 
1a87275
 
e0d7335
1a87275
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
49bdda7
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6b2ce50
 
 
 
 
 
 
1a87275
6b2ce50
 
 
 
 
45714f3
6b2ce50
1a87275
6b2ce50
 
 
1a87275
6b2ce50
1a87275
6b2ce50
143a6c8
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
library_name: transformers
tags:
- Phi-3-mini-4k-instruct
- NLP
- Chatbot
- Instruction Tuning
- SQL
- SQL-Generation
license: mit
datasets:
- gretelai/synthetic_text_to_sql
language:
- en
pipeline_tag: text-generation
---

# Model Card for sql-xp-phi-3-mini

<!-- Phi-3 Mini is a transformer-based language model optimized for understanding and generating responses based on instructional input. -->


### Model Description

Phi-3 Mini is a transformer-based language model fine-tuned to generate SQL queries from natural language inputs. This model is optimized for converting human-readable queries into SQL statements, facilitating easier interaction with databases through natural language processing.

- **Developed by:** [spectrewolf8](https://github.com/Spectrewolf8)
- **Model type:** Transformer-based Language Model for SQL Generation
- **Language(s) (NLP):** English (and SQL)
- **License:** MIT
- **Finetuned from model :** Phi-3-mini-4k-instruct base model

### Model Sources 

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/Spectrewolf8/kaggle-sql-xp-phi-3-mini-4k-instruct

## Uses

### Direct Use

Phi-3 Mini can be used to translate natural language instructions into SQL queries, making it a powerful tool for database querying and management. Users can input descriptive text, and the model will generate the corresponding SQL commands.

### Downstream Use 

This model can be integrated into applications such as chatbots or virtual assistants that interact with databases. It can also be used in tools designed for automatic query generation based on user-friendly descriptions.

### Out-of-Scope Use

Phi-3 Mini is not suitable for tasks requiring non-SQL-related language understanding or generation. It should not be used for generating queries in languages other than SQL or for other domains outside database querying.

### Bias, Risks, and Limitations

Phi-3 Mini, like other language models, may have limitations in understanding complex or ambiguous instructions. The SQL queries generated might need manual review to ensure accuracy and appropriateness.

### Recommendations

Users should verify the generated SQL queries for correctness and security, especially when using them in production environments. Implementing additional layers of validation and testing can help mitigate risks associated with incorrect SQL generation.


## How to Get Started with the Model

To get started with Phi-3 Mini for SQL generation, follow the code snippet below:

```python
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

# Set the seed for the random number generator to ensure reproducibility
set_seed(1234)

# Define the repository name for the Hugging Face model
# 'hf_model_repo' is a variable that holds the repository name for the Hugging Face model
# 'username/modelname' is the repository name, where 'username' is the username of the repository owner
# and 'modelname' is the name of the model
hf_model_repo = "spectrewolf8/sql-xp-phi-3-mini-4k"

# Retrieve the device mapping and computation data type
# 'device_map' is a variable that holds the mapping of the devices that are used for computation
# 'compute_dtype' is a variable that holds the data type that is used for computation

# device_map = {"": 0}
# compute_dtype = torch.bfloat16 or torch.float16
device_map, compute_dtype

# Load a pre-trained tokenizer from the Hugging Face Model Hub
# 'tokenizer' is the variable that holds the tokenizer
# 'trust_remote_code=True' allows the execution of code from the model file
tokenizer = AutoTokenizer.from_pretrained(hf_model_repo, trust_remote_code=True)

# Load a pre-trained model for causal language modeling from the Hugging Face Model Hub
# 'model' is the variable that holds the model
# 'trust_remote_code=True' allows the execution of code from the model file
# 'torch_dtype=compute_dtype' sets the data type for the PyTorch tensors
# 'device_map=device_map' sets the device mapping
model = AutoModelForCausalLM.from_pretrained(hf_model_repo, trust_remote_code=True, torch_dtype=compute_dtype, device_map=device_map)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define the input phrase which represents the user's request or query.
input_phrase = """
insert 5 values
"""

# Define the context phrase which provides the SQL table schema relevant to the input phrase.
context_phrase = """
CREATE TABLE tasks (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    task_name VARCHAR(100) NOT NULL,
    userid INT NOT NULL,
    date DATE NOT NULL,
    FOREIGN KEY (userid) REFERENCES users(id)
);
"""

# Create a prompt by applying a chat template to the input and context phrases using the tokenizer.
# The 'apply_chat_template' method formats the input as a chat message, making it suitable for text generation.
# 'tokenize=False' indicates that the input should not be tokenized yet.
# 'add_generation_prompt=True' adds a prompt for text generation.
prompt = pipe.tokenizer.apply_chat_template(
    [{"role": "user", "content": f"\n #prompt: {input_phrase}\n #context: {context_phrase}"}],
    tokenize=False,
    add_generation_prompt=True
)

# Create a text generation pipeline using the specified model and tokenizer.
# The 'pipeline' function sets up a ready-to-use text generation pipeline, combining the model and tokenizer.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text using the pipeline with the specified parameters.
# 'max_new_tokens=256' sets the maximum number of new tokens to generate.
# 'do_sample=True' enables sampling for text generation.
# 'num_beams=1' specifies the number of beams for beam search (1 means no beam search).
# 'temperature=0.3' controls the randomness of predictions by scaling the logits before applying softmax.
# 'top_k=50' considers only the top 50 token predictions for sampling.
# 'top_p=0.95' enables nucleus sampling, considering tokens that have a cumulative probability of 0.95.
# 'max_time=180' sets the maximum generation time to 180 seconds.
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    num_beams=1,
    temperature=0.3,
    top_k=50,
    top_p=0.95,
    max_time=180
)

# Print the generated text by stripping out the prompt portion and displaying only the new generated content.
print(outputs[0]['generated_text'][len(prompt):].strip())
```
## Fine Tuning Details

The fine-tuning process in this project involved adapting the pre-trained language model microsoft/Phi-3-mini-4k-instruct for generating SQL commands from natural language prompts. The methodology employed included the following key steps:

### Data Preparation

A synthetic dataset, "gretelai/synthetic_text_to_sql," was utilized, containing examples of natural language instructions paired with SQL queries. The dataset was processed to extract essential fields, specifically the instruction("sql_prompt"), input("sql_context"), and output("sql"). Each data point was structured to simulate a conversation where the user's message encompassed the prompt and context, and the assistant's message contained the corresponding SQL output.

### Quantization and Model Preparation

The project implemented 4-bit quantization through the BitsAndBytes library. This technique reduced the model's memory requirements while retaining performance accuracy. Additionally, QLoRA (Quantized Low-Rank Adaptation) was used to fine-tune the model. This involved introducing low-rank matrices into selected layers, such as attention and projection layers, to optimize the model's parameters without requiring full retraining.

### Model and Tokenizer Setup

The tokenizer was customized to accommodate special tokens and proper padding management, particularly adjusting for left-side padding. These settings ensured accurate tokenization for the structured input required by the model.

### Training Configuration

Fine-tuning was executed using the SFTTrainer from the Hugging Face Transformers library. The configuration included settings for a small batch size, gradient accumulation, and a learning rate tuned for the specific SQL generation task. The training setup incorporated various optimizations, including the use of mixed-precision training where beneficial.

### Training Execution

The model underwent multiple epochs of training on the processed dataset. The process focused on optimizing the model's capability to understand and generate SQL queries based on diverse natural language instructions. Weights & Biases (wandb) was employed for detailed logging and monitoring of training metrics, allowing for robust tracking of the model's performance improvements.

### Model Saving and Deployment

After fine-tuning, the updated model and tokenizer were saved locally and then uploaded to the Hugging Face Hub. This deployment step made the refined model accessible for future use, ensuring it could efficiently generate SQL commands in response to new prompts. The model's final configuration enabled effective inference, leveraging the improvements gained from the fine-tuning process.

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- **Data set used was:** https://huggingface.co/datasets/gretelai/synthetic_text_to_sql

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing 

Ignore columns other than "sql_prompt", "sql_context", "sql" from the dataset.

#### Training Hyperparameters

- **Training regime:** Mixed precision (fp16) for efficiency. <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

### Training aftermath

The model was trained on the RTX 3060 OC 12 GB variant. It took 5 hours to train the model with 10,000 values for training and 3,300 values for testing with 2 Epochs.

### Demo

![image/png](https://cdn-uploads.huggingface.co/production/uploads/668d0a0916006f60d0451bd2/XT_T2yRs-ephCU2zA5KXx.png)