junaid0600 committed on
Commit
c0392b1
·
verified ·
1 Parent(s): b227ef2

Update blog_post.md

  1. blog_post.md +199 -199
blog_post.md CHANGED
@@ -1,199 +1,199 @@
1
+ # Training a SQL Database Engineer Agent with GRPO on Qwen2.5
2
+
3
+ *Fine-tuning a language model to autonomously diagnose and fix slow database queries using Reinforcement Learning*
4
+
5
+ ---
6
+
7
+ ## Overview
8
+
9
+ Modern applications live and die by their database performance. Slow queries cause timeouts, poor user experience, and higher infrastructure costs, yet diagnosing and fixing them requires deep expertise. What if a language model could learn to do this autonomously?
10
+
11
+ In this project, we trained **Qwen2.5-7B-Instruct** to act as a senior database engineer (inspecting slow queries, identifying missing indexes, and applying targeted fixes) using **Group Relative Policy Optimization (GRPO)**, a reinforcement learning algorithm that teaches the model through reward signals rather than labeled examples.
12
+
13
+ After **200 training steps**, the agent achieved a **+94% reward improvement** (0.235 → 0.456) and outperformed a random baseline by an average of **+31.4 database performance points** across 15 scenarios.
14
+
15
+ ---
16
+
17
+ ## The Problem
18
+
19
+ Given a database with slow-running SQL queries, the agent must:
20
+ 1. **Investigate**: understand why queries are slow
21
+ 2. **Diagnose**: identify missing indexes or inefficient query patterns
22
+ 3. **Fix**: apply the correct indexes and optimizations
23
+ 4. **Verify**: confirm the performance score improved
24
+
25
+ A random agent that creates indexes on arbitrary columns scores **0 pts** on every scenario. Our trained agent had to learn, purely from feedback, which tables and columns actually matter.
26
+
27
+ ---
28
+
29
+ ## Architecture
30
+
31
+ ### Environment: DatabaseSimulator
32
+ We built a custom `DatabaseSimulator` that:
33
+ - Loads SQL scenarios (tables, slow queries, missing index hints)
34
+ - Tracks a **performance score (0–100)** that improves when correct indexes are applied
35
+ - Returns delta rewards based on how much the score improved
36
+ - Runs **locally**: no HTTP calls, no shared state, fully deterministic
37
+
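The behaviour listed above can be sketched with a toy environment. This is only an illustration under stated assumptions, not the project's `DatabaseSimulator`: the class name `ToySimulator`, the starting score of 20, and the linear scoring rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToySimulator:
    """Toy stand-in for the post's DatabaseSimulator (hypothetical API)."""

    # Indexes the scenario needs, as (table, columns) pairs; values are illustrative.
    required: set
    applied: set = field(default_factory=set)
    base_score: float = 20.0  # assumed starting score on the 0-100 scale

    def score(self) -> float:
        # Assumed rule: score rises linearly with the share of required indexes applied.
        hit = len(self.applied & self.required)
        return self.base_score + (100.0 - self.base_score) * hit / len(self.required)

    def create_index(self, table: str, columns: list) -> float:
        """Apply an index and return the score delta (0.0 for a useless index)."""
        before = self.score()
        self.applied.add((table, tuple(columns)))
        return self.score() - before


sim = ToySimulator(required={("orders", ("user_id", "status")), ("users", ("email",))})
delta_good = sim.create_index("orders", ["user_id", "status"])  # matches a required index
delta_bad = sim.create_index("orders", ["created_at"])  # not required, no gain
```

Because the state lives in a plain in-memory object, the same sequence of actions always yields the same deltas, which is the determinism the list above refers to.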
38
+ ### Scenarios
39
+ We created **15 scenarios** across 3 difficulty levels:
40
+
41
+ | Level | Count | Description |
42
+ |-------|-------|-------------|
43
+ | Easy | 5 | Single table, one missing index |
44
+ | Medium | 5 | E-commerce DB, composite indexes |
45
+ | Hard | 5 | 4-table financial schema, complex joins |
46
+
47
+ ### Action Space
48
+ The agent can take 10 actions; six of them are shown here:
49
+
50
+ ```json
51
+ {"action_type": "inspect_query", "payload": {"query_id": "q1"}}
52
+ {"action_type": "analyze_indexes", "payload": {}}
53
+ {"action_type": "create_index", "payload": {"table": "orders", "columns": ["user_id", "status"]}}
54
+ {"action_type": "rewrite_query", "payload": {"query_id": "q1", "new_sql": "..."}}
55
+ {"action_type": "analyze_statistics", "payload": {"table": "orders"}}
56
+ {"action_type": "submit_report", "payload": {"summary": "..."}}
57
+ ```
58
+
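Since the model emits these actions as free text, a reward function first has to decide whether a completion is a well-formed action at all. A minimal sketch of that check follows, assuming one JSON object per completion. The helper name `parse_action` is hypothetical, and `KNOWN_ACTIONS` contains only the six action types shown above.

```python
import json

# Only the six action types listed in the post; the others are not shown there.
KNOWN_ACTIONS = {
    "inspect_query",
    "analyze_indexes",
    "create_index",
    "rewrite_query",
    "analyze_statistics",
    "submit_report",
}


def parse_action(completion: str):
    """Return (action_type, payload), or None if the completion is not a valid action."""
    try:
        obj = json.loads(completion.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    action_type = obj.get("action_type")
    payload = obj.get("payload", {})
    if not isinstance(action_type, str) or action_type not in KNOWN_ACTIONS:
        return None
    if not isinstance(payload, dict):
        return None
    return action_type, payload


valid = parse_action('{"action_type": "create_index", "payload": {"table": "orders", "columns": ["user_id"]}}')
invalid = parse_action("CREATE INDEX idx ON orders(user_id);")
```

Returning `None` for malformed output lets the caller assign a low reward instead of raising an exception in the middle of training.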
59
+ ---
60
+
61
+ ## Training Setup
62
+
63
+ ### Model
64
+ - **Base model:** `unsloth/Qwen2.5-7B-Instruct` (7.66B parameters)
65
+ - **Trainable parameters:** 40,370,176 of 7,655,986,688 **(only 0.53% via LoRA)**
66
+ - **Fine-tuning:** LoRA (r=16, alpha=16) via Unsloth, which advertises roughly 2x faster fine-tuning
67
+ - **Training algorithm:** GRPO (Group Relative Policy Optimization)
68
+ - **Framework:** TRL + Unsloth + PyTorch
69
+ - **GPU:** Single GPU (1x)
70
+
71
+ ### Training Data
72
+ - **Examples:** 15 scenarios
73
+ - **Epochs:** 29 (cycling through all 15 scenarios)
74
+ - **Total steps:** 200
75
+ - **Effective batch size:** 8 (batch size 4 × gradient accumulation 2 × 1 GPU)
76
+
77
+ ### GRPO Reward Function
78
+ The reward function sums three signals, and a small penalty applies when the agent indexes useless columns:
79
+
80
+ ```python
81
+ total_reward = step_reward + delta_reward + milestone_bonus
82
+ ```
83
+
84
+ | Component | Description | Range |
85
+ |-----------|-------------|-------|
86
+ | `step_reward` | Base reward per valid action type | 0.05–0.20 |
87
+ | `delta_reward` | Proportional to DB performance improvement | 0.0–0.65 |
88
+ | `milestone_bonus` | Bonus at 25%, 50%, 75% improvement thresholds | 0.15–0.40 |
89
+ | `wrong_index_penalty` | Penalty for indexing useless columns | -0.05 |
90
+
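The components in the table can be combined as in the sketch below. The function names, the linear scaling of `delta_reward`, and the reading of the milestone bonus as "highest threshold reached" are assumptions made for illustration; the project's actual reward code may differ.

```python
# (threshold in % improvement, bonus), taken from the table above.
MILESTONES = [(25.0, 0.15), (50.0, 0.25), (75.0, 0.40)]


def milestone_bonus(improvement_pct: float) -> float:
    """Bonus for the highest threshold reached (assumed interpretation)."""
    bonus = 0.0
    for threshold, value in MILESTONES:
        if improvement_pct >= threshold:
            bonus = value
    return bonus


def total_reward(step_reward: float, score_delta: float,
                 improvement_pct: float, wrong_index: bool = False) -> float:
    """step_reward + delta_reward + milestone_bonus, minus the wrong-index penalty."""
    # Assumed: delta_reward scales linearly with the score delta, capped at 0.65.
    delta_reward = 0.65 * max(0.0, min(score_delta, 100.0)) / 100.0
    penalty = -0.05 if wrong_index else 0.0
    return round(step_reward + delta_reward + milestone_bonus(improvement_pct) + penalty, 4)


useless = total_reward(0.10, score_delta=0.0, improvement_pct=0.0, wrong_index=True)
helpful = total_reward(0.10, score_delta=40.0, improvement_pct=50.0)
```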
91
+ **Expected rewards per action:**
92
+ ```
93
+ inspect_query / analyze_indexes → ~0.10
94
+ create_index (no table/col match) → ~0.10
95
+ create_index (partial hint match) → ~0.20–0.45
96
+ create_index (perfect hint match) → ~0.55–0.80
97
+ create_index (simulator confirms) → ~0.75–0.99
98
+ Milestones: 25%=+0.15 50%=+0.25 75%=+0.40 (cumulative)
99
+ ```
100
+
101
+ **Key design decision:** We used a **hint-match fallback** to give GRPO a gradient signal early in training: before the model has learned exact column names, partial column matches still receive proportional rewards. This avoided the cold-start problem, where every completion earns 0 reward and the model never improves.
102
+
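One way to award partial credit of this kind is a set-overlap score between the proposed columns and the scenario's hinted columns. The sketch below uses Jaccard overlap. That choice and the function name `hint_match_score` are assumptions, not the project's actual matching rule.

```python
def hint_match_score(proposed_table: str, proposed_cols: list,
                     hint_table: str, hint_cols: list) -> float:
    """Jaccard overlap between proposed and hinted columns; 0.0 if the table differs."""
    if proposed_table != hint_table or not hint_cols:
        return 0.0
    proposed, hinted = set(proposed_cols), set(hint_cols)
    # Jaccard penalises both missing columns and unnecessary extra columns.
    return len(proposed & hinted) / len(proposed | hinted)


partial = hint_match_score("orders", ["user_id"], "orders", ["user_id", "status"])
perfect = hint_match_score("orders", ["status", "user_id"], "orders", ["user_id", "status"])
wrong_table = hint_match_score("users", ["user_id"], "orders", ["user_id", "status"])
```

A score like this can then be mapped into the partial-match and perfect-match reward bands listed above, so that a nearly correct index earns more than an unrelated one.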
103
+ ### Training Config
104
+ ```python
105
+ GRPOConfig(
106
+ max_steps = 200,
107
+ per_device_train_batch_size = 4,
108
+ gradient_accumulation_steps = 2,
109
+ learning_rate = 2e-5,
110
+ max_completion_length = 150,
111
+ num_generations = 4,
112
+ temperature = 1.0,
113
+ warmup_steps = 10,
114
+ )
115
+ ```
116
+
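With `num_generations = 4`, GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group-relative normalization in plain Python. It is a simplified illustration, and details such as the epsilon value and the use of the population standard deviation are assumptions that may differ from TRL's implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One completion found a useful index, the other three did not.
mixed = group_advantages([0.10, 0.80, 0.10, 0.10])
# Every completion earned the same reward: no learning signal.
flat = group_advantages([0.10, 0.10, 0.10, 0.10])
```

The second call shows why partial-credit rewards matter: when all completions in a group receive the same reward, every advantage is zero and the policy gets no gradient from that prompt.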
117
+ ---
118
+
119
+ ## Results
120
+
121
+ ### Training Curves
122
+
123
+ After 200 steps of GRPO training:
124
+
125
+ - **Loss:** `4.92e-07 → 1.23e-05`
126
+ *(GRPO policy loss rises as the model becomes more confident in its policy; this is expected behaviour in GRPO, not divergence. The 10-step rolling average shows stable learning without collapse.)*
127
+ - **Reward:** `0.235 → 0.456 (+94% improvement)`
128
+ The reward trends upward from ~0.20 to ~0.45, and the 10-step rolling average confirms that the model kept improving throughout training.
129
+
130
+ ### Evaluation: Trained vs Random Agent
131
+
132
+ We evaluated both agents across all 15 scenarios:
133
+
134
+ | Agent | Avg Improvement | Best Scenario | Worst Scenario |
135
+ |-------|----------------|---------------|----------------|
136
+ | Random (wrong index) | +0.0 pts | 0 pts | 0 pts |
137
+ | Trained (GRPO) | +31.4 pts | +59 pts (Scenario 8) | +10 pts |
138
+
139
+ The trained agent outperformed the random baseline on **every single scenario**, with an average improvement of **+31.4 database performance points**. Scenario 8 was flagged as a statistical outlier (more than 1.5σ above the mean): the agent found an especially impactful index combination. A relative gain cannot be computed (it is effectively **∞**), since the random baseline scored exactly 0 on all scenarios.
140
+
141
+ ---
142
+
143
+ ## Key Learnings
144
+
145
+ ### 1. Reward shaping is everything in GRPO
146
+ The model produced low-reward outputs for roughly the first 10 steps, until the hint-match fallback kicked in. Without partial credit for close-but-not-perfect column names, training would likely have stalled completely.
147
+
148
+ ### 2. LoRA makes 7B models trainable on a single GPU
149
+ With only **0.53% of parameters trainable** via LoRA, we fine-tuned a 7B model on a single GPU in under 2 hours. Full fine-tuning without LoRA would typically require multiple high-memory GPUs such as A100s.
150
+
151
+ ### 3. Local simulation beats API calls for training
152
+ Using `DatabaseSimulator` directly (instead of calling a REST API) made rewards deterministic, removed shared-state bugs, and made training about 10x faster by eliminating network latency.
153
+
154
+ ### 4. GRPO loss behaviour differs from supervised loss
155
+ Unlike cross-entropy loss in supervised fine-tuning, GRPO policy loss can increase as the model becomes more confident in its policy. This is normal and does not indicate a problem; what matters is whether the reward is trending upward.
156
+
157
+ ### 5. Composite indexes are hard to learn
158
+ The model consistently struggled with scenarios requiring composite indexes on 3+ columns. Single-column indexes were learned quickly (by step ~20), while multi-column patterns took much longer to emerge.
159
+
160
+ ---
161
+
162
+ ## Live Demo
163
+
164
+ Try the agent yourself: pick a scenario difficulty, choose between the trained GRPO agent and the rule-based baseline, and watch it diagnose and fix the database in real time:
165
+
166
+ **[SQL Database Engineer Agent: Live Demo](https://huggingface.co/spaces/YOUR_USERNAME/sql-db-engineer-demo)**
167
+
168
+ ---
169
+
170
+ ## Resources
171
+
172
+ | Resource | Link |
173
+ |----------|------|
174
+ | Demo Space | https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui |
176
+ | Source code (GitHub) | https://github.com/Mdjunaid06/sql-db-engineer-agent |
177
+ | Source code (HF repo) | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/tree/main |
179
+ | Training run notebook | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb |
180
+ | Google Colab | https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing |
181
+
182
+ ---
183
+
184
+ ## What's Next
185
+
186
+ - **More steps:** 200 steps showed strong learning; 500+ steps might push the average score above 50 pts
186
+ - **Harder scenarios:** 8-table schemas with nested subqueries and CTEs
187
+ - **Query rewriting:** The agent currently focuses on indexing; teaching it to rewrite SQL itself is the next frontier
188
+ - **Multi-step episodes:** Chain multiple actions per episode so the agent can inspect → diagnose → fix → verify in sequence
190
+
191
+ ---
192
+
193
+ ## Acknowledgements
194
+
195
+ Built for the **META × PyTorch × SST Hackathon** using:
196
+ - [Unsloth](https://github.com/unslothai/unsloth): 2x faster LoRA fine-tuning
197
+ - [TRL](https://github.com/huggingface/trl): GRPO implementation
198
+ - [Hugging Face](https://huggingface.co): model hosting and Spaces
199
+ - [Qwen2.5](https://huggingface.co/Qwen): base language model