Commit 7ee13ee (verified) by mirajnair · Parent: b355edc

Update README.md

---
license: apache-2.0
---

# TinyLlama for Text-to-SQL

## Problem Statement

I need a small generative model that can generate SQL code in response to user queries while avoiding any additional commentary. This will help reduce operational costs, increase throughput, and lower latency.

## Solution

### Part 1: Initial Experimentation (Refer to `Run_Tinyllama_Chat.ipynb`)

#### Step 1: Using an Off-the-Shelf Model

I started with the TinyLlama model. Below is an example of the initial request and response:

```
<|system|>
CREATE TABLE head(age INTEGER)</s>
<|user|>
How many heads of the departments are older than 56?</s>
<|assistant|>
I don't have access to the latest data or the current headcount of the departments...
```

The model did not return the expected SQL query, which is understandable given the lack of context.
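For reference, a prompt in the chat format shown above can be built programmatically. A minimal sketch (the `build_prompt` helper is hypothetical, not part of the notebook; it simply mirrors the TinyLlama chat template used here):

```python
# Build a TinyLlama-style chat prompt from a system and user message.
# build_prompt is a hypothetical helper mirroring the template shown above.
def build_prompt(system: str, user: str) -> str:
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{user}</s>\n"
        f"<|assistant|>\n"
    )

prompt = build_prompt(
    "CREATE TABLE head(age INTEGER)",
    "How many heads of the departments are older than 56?",
)
print(prompt)
```

The trailing `<|assistant|>` tag cues the model to continue the conversation as the assistant.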
#### Step 2: Prompt Engineering

I attempted prompt engineering by adding more details to the context:

```
<|system|>
You can only reply in SQL query language. Provide only SQL for the user's query given this context --> CREATE TABLE head(age INTEGER)</s>
<|user|>
How many heads of the departments are older than 56?</s>
<|assistant|>
SELECT COUNT(*) FROM head WHERE age > 56
```

This time the model generated the SQL query, but responses still tended to include additional commentary, which I wanted to avoid.

#### Step 3: Further Refinement

Despite additional prompt engineering efforts, the model still produced unwanted explanations:

```
<|assistant|>
To calculate the number of heads of the departments older than 56, you can use the following SQL query:

SELECT COUNT(*) FROM departments WHERE age > 56;

In the above query, "departments" is the name of the table and "age" is the column name...
```

This led me to consider fine-tuning the model.
---

### Part 2: Fine-Tuning the Model

I decided to fine-tune TinyLlama for better SQL-specific responses. Below are the steps to replicate the fine-tuning process.

#### Setup Environment and Run Fine-Tuning Job on RunPod.io

```bash
#!/bin/bash
pip install -q accelerate transformers peft deepspeed bitsandbytes --no-build-isolation
pip install trl==0.9.6
pip install packaging ninja
MAX_JOBS=16 pip install flash-attn==2.6.0.post1 --no-build-isolation
git clone https://github.com/Rajesh-Nair/llm-text2sql-finetuning
cd llm-text2sql-finetuning
accelerate launch --config_file "ds_z3_qlora_config.yaml" train.py run_config.yaml | tee accelerate_output.log
```
#### Key Components of Fine-Tuning

1. **Dataset**: Utilized `b-mc2/sql-create-context` from Hugging Face for fine-tuning. High-quality data is essential for improving model performance.
2. **Accelerate**: Leveraged `accelerate` to enhance training speed and minimize boilerplate code.
3. **Distributed Training**:
   - Deployed across two GPUs on a single node via RunPod.io.
   - Hardware specifications: L4 GPU, PyTorch 2.1, Python 3.10, CUDA 11.8 (Ubuntu image).
4. **QLoRA**:
   - Applied QLoRA for memory-efficient fine-tuning.
   - Configured LoRA with rank-8 matrices for all linear layers.
5. **DeepSpeed ZeRO-3**: Implemented for optimized sharding of optimizer states, gradients, and parameters.
6. **Mixed Precision**: Utilized to accelerate training and improve GPU efficiency.
7. **Batch Size & Gradient Accumulation**:
   - Set batch size per device to 4.
   - Applied gradient accumulation every 2 steps for optimal performance.
   - Increasing batch size beyond this sometimes led to GPU communication bottlenecks.
8. **Gradient Clipping**: Enabled to prevent exploding gradients.
9. **Training Duration & Cost**:
   - Each epoch took approximately 1 hour.
   - Training was stopped early after 3 epochs due to negligible improvements in training loss.
   - Total fine-tuning cost on RunPod: under \$3.
10. **Training Logs**: Captured logs in `accelerate_output.log` for future analysis and reference.
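For clarity, the effective global batch size implied by the settings above (assuming the two-GPU setup described under distributed training) works out as follows:

```python
# Effective global batch size = per-device batch * number of GPUs * grad accumulation steps
per_device_batch = 4   # batch size per device (item 7)
num_gpus = 2           # two GPUs on a single node (item 3)
grad_accum_steps = 2   # gradient accumulation every 2 steps (item 7)

effective_batch = per_device_batch * num_gpus * grad_accum_steps
print(effective_batch)  # 16
```

So each optimizer step sees 16 examples, even though each GPU only holds 4 at a time.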
#### Serving the Fine-Tuned Model

Refer to `Run_ft_Tinyllama_Chat.ipynb` for deploying the fine-tuned model.

Example Query and Response:

```
<|system|>
CREATE TABLE head(age INTEGER)</s>
<|user|>
How many heads of the departments are older than 56?</s>
<|assistant|>
SELECT COUNT(*) FROM head WHERE age > 56
```

The fine-tuned model now returns only the SQL query, as intended.
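A quick way to sanity-check that replies stay commentary-free is a small heuristic like the one below (the `is_sql_only` helper is hypothetical, not part of the notebook):

```python
# Heuristic check that a model reply is a bare SQL statement with no commentary.
# is_sql_only is a hypothetical helper, not part of the notebook.
SQL_KEYWORDS = ("SELECT", "INSERT", "UPDATE", "DELETE", "WITH")

def is_sql_only(reply: str) -> bool:
    text = reply.strip()
    # Must start with a SQL keyword and contain no blank-line-separated prose.
    return text.upper().startswith(SQL_KEYWORDS) and "\n\n" not in text

print(is_sql_only("SELECT COUNT(*) FROM head WHERE age > 56"))  # True
print(is_sql_only("Sure! Here is the query:\n\nSELECT 1"))      # False
```

Running a check like this over a held-out set of prompts gives a rough measure of how often the fine-tuned model slips back into commentary.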
---

### Final Model & Deployment

After fine-tuning, I merged the trained adapter with the base model and uploaded it to Hugging Face. The full code is available here: 🔗 [**TinyLlama-1.1B-Chat-Text2SQL-v1.0**](https://github.com/Rajesh-Nair/llm-text2sql-finetuning)