Update blog_post.md
blog_post.md CHANGED (+199 -199)
|
@@ -1,199 +1,199 @@
| 1 |
+
# Training a SQL Database Engineer Agent with GRPO on Qwen2.5
|
| 2 |
+
|
| 3 |
+
*Fine-tuning a language model to autonomously diagnose and fix slow database queries using Reinforcement Learning*
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Overview
|
| 8 |
+
|
| 9 |
+
Modern applications live and die by their database performance. Slow queries cause timeouts, poor user experience, and higher infrastructure costs, yet diagnosing and fixing them requires deep expertise. What if a language model could learn to do this autonomously?
|
| 10 |
+
|
| 11 |
+
In this project, we trained **Qwen2.5-7B-Instruct** to act as a senior database engineer (inspecting slow queries, identifying missing indexes, and applying targeted fixes) using **Group Relative Policy Optimization (GRPO)**, a reinforcement learning algorithm that teaches the model through reward signals rather than labeled examples.
|
| 12 |
+
|
| 13 |
+
After **200 training steps**, the agent achieved a **+94% reward improvement** (0.235 → 0.456) and outperformed a random baseline by an average of **+31.4 database performance points** across 15 scenarios.
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## The Problem
|
| 18 |
+
|
| 19 |
+
Given a database with slow-running SQL queries, the agent must:
|
| 20 |
+
1. **Investigate**: understand why queries are slow
|
| 21 |
+
2. **Diagnose**: identify missing indexes or inefficient query patterns
|
| 22 |
+
3. **Fix**: apply the correct indexes and optimizations
|
| 23 |
+
4. **Verify**: confirm the performance score improved
|
| 24 |
+
|
| 25 |
+
A random agent that creates indexes on arbitrary columns scores **0 pts** on every scenario. Our trained agent had to learn, purely from feedback, which tables and columns actually matter.
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## Architecture
|
| 30 |
+
|
| 31 |
+
### Environment: DatabaseSimulator
|
| 32 |
+
We built a custom `DatabaseSimulator` that:
|
| 33 |
+
- Loads SQL scenarios (tables, slow queries, missing index hints)
|
| 34 |
+
- Tracks a **performance score (0–100)** that improves when correct indexes are applied
|
| 35 |
+
- Returns delta rewards based on how much the score improved
|
| 36 |
+
- Runs **locally**: no HTTP calls, no shared state, fully deterministic
|
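The behaviour listed above can be illustrated with a toy environment. This is a minimal sketch, not the project's `DatabaseSimulator`: the class name `ToySimulator`, the scenario layout, and the rule that each correct index contributes an equal share of the 100-point score are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToySimulator:
    """Hypothetical stand-in for the project's DatabaseSimulator."""

    # Indexes the scenario expects, as (table, columns) pairs (assumed layout).
    missing_indexes: list
    applied: set = field(default_factory=set)

    @property
    def score(self) -> float:
        """Performance score in [0, 100]; each correct index adds an equal share."""
        if not self.missing_indexes:
            return 100.0
        return 100.0 * len(self.applied) / len(self.missing_indexes)

    def create_index(self, table: str, columns: list) -> float:
        """Apply an index and return the score delta it produced."""
        before = self.score
        key = (table, tuple(columns))
        if key in self.missing_indexes:
            self.applied.add(key)  # a set makes repeated indexes a no-op
        return self.score - before


sim = ToySimulator(missing_indexes=[("orders", ("user_id", "status")),
                                    ("users", ("email",))])
delta_good = sim.create_index("orders", ["user_id", "status"])  # matches a hint
delta_bad = sim.create_index("orders", ["created_at"])          # matches nothing
```

Because the state lives in a plain Python object, two runs with the same actions return the same deltas, which is the property the deterministic, local design relies on.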
| 37 |
+
|
| 38 |
+
### Scenarios
|
| 39 |
+
We created **15 scenarios** across 3 difficulty levels:
|
| 40 |
+
|
| 41 |
+
| Level | Count | Description |
|
| 42 |
+
|-------|-------|-------------|
|
| 43 |
+
| Easy | 5 | Single table, one missing index |
|
| 44 |
+
| Medium | 5 | E-commerce DB, composite indexes |
|
| 45 |
+
| Hard | 5 | 4-table financial schema, complex joins |
|
| 46 |
+
|
| 47 |
+
### Action Space
|
| 48 |
+
The agent can take 10 actions; six representative ones are shown here:
|
| 49 |
+
|
| 50 |
+
```json
|
| 51 |
+
{"action_type": "inspect_query", "payload": {"query_id": "q1"}}
|
| 52 |
+
{"action_type": "analyze_indexes", "payload": {}}
|
| 53 |
+
{"action_type": "create_index", "payload": {"table": "orders", "columns": ["user_id", "status"]}}
|
| 54 |
+
{"action_type": "rewrite_query", "payload": {"query_id": "q1", "new_sql": "..."}}
|
| 55 |
+
{"action_type": "analyze_statistics", "payload": {"table": "orders"}}
|
| 56 |
+
{"action_type": "submit_report", "payload": {"summary": "..."}}
|
| 57 |
+
```
|
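Since the model emits each action as a JSON object, its completions have to be parsed and validated before the environment can act on them. The following is a minimal sketch of that step, assuming one JSON object per completion; `parse_action` and `KNOWN_ACTIONS` are hypothetical names, and the set covers only the six action types shown above.

```python
import json

# Only the six action types shown above; the remaining four are not listed
# in the post, so this set is deliberately incomplete.
KNOWN_ACTIONS = {
    "inspect_query", "analyze_indexes", "create_index",
    "rewrite_query", "analyze_statistics", "submit_report",
}


def parse_action(completion: str):
    """Return (action_type, payload), or None if the text is not a valid action."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    action_type = action.get("action_type")
    payload = action.get("payload")
    if not isinstance(action_type, str) or action_type not in KNOWN_ACTIONS:
        return None
    if not isinstance(payload, dict):
        return None
    return action_type, payload


ok = parse_action('{"action_type": "create_index", '
                  '"payload": {"table": "orders", "columns": ["user_id", "status"]}}')
bad = parse_action("CREATE INDEX idx ON orders(user_id)")  # raw SQL, not JSON
```

Returning `None` for malformed output lets the reward function treat invalid completions separately from valid but unhelpful actions.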
| 58 |
+
|
| 59 |
+
---
|
| 60 |
+
|
| 61 |
+
## Training Setup
|
| 62 |
+
|
| 63 |
+
### Model
|
| 64 |
+
- **Base model:** `unsloth/Qwen2.5-7B-Instruct` (7.66B parameters)
|
| 65 |
+
- **Trainable parameters:** 40,370,176 of 7,655,986,688 **(only 0.53% via LoRA)**
|
| 66 |
+
- **Fine-tuning:** LoRA (r=16, alpha=16) via Unsloth, which advertises 2x faster fine-tuning
|
| 67 |
+
- **Training algorithm:** GRPO (Group Relative Policy Optimization)
|
| 68 |
+
- **Framework:** TRL + Unsloth + PyTorch
|
| 69 |
+
- **GPU:** Single GPU (1x)
|
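The trainable-parameter figure can be reproduced by hand. A rank-r LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes adapters on all seven attention and MLP projections (a common Unsloth setup, not confirmed by the post) and uses the layer dimensions from the published Qwen2.5-7B configuration.

```python
# Qwen2.5-7B dimensions (from the published model config).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 18944   # intermediate_size of the MLP
KV_DIM = 4 * 128       # num_key_value_heads * head_dim
LAYERS = 28            # num_hidden_layers
RANK = 16              # LoRA r from the list above

# (d_in, d_out) per adapted projection. The choice of target modules is an
# assumption; the post does not list them.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the A (rank x d_in) and B (d_out x rank) matrices."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
trainable = LAYERS * per_layer
total = 7_655_986_688  # total parameter count reported above
percent = 100 * trainable / total
```

Under these assumptions `trainable` comes to 40,370,176 and `percent` rounds to 0.53, matching the figures in the list.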
| 70 |
+
|
| 71 |
+
### Training Data
|
| 72 |
+
- **Examples:** 15 scenarios
|
| 73 |
+
- **Epochs:** 29 (cycling through all 15 scenarios)
|
| 74 |
+
- **Total steps:** 200
|
| 75 |
+
- **Effective batch size:** 8 (batch size 4 × gradient accumulation 2 × 1 GPU)
|
| 76 |
+
|
| 77 |
+
### GRPO Reward Function
|
| 78 |
+
The reward function combines three signals; the component table also lists a small penalty for wrong indexes:
|
| 79 |
+
|
| 80 |
+
```python
|
| 81 |
+
total_reward = step_reward + delta_reward + milestone_bonus
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
| Component | Description | Range |
|
| 85 |
+
|-----------|-------------|-------|
|
| 86 |
+
| `step_reward` | Base reward per valid action type | 0.05–0.20 |
|
| 87 |
+
| `delta_reward` | Proportional to DB performance improvement | 0.0–0.65 |
|
| 88 |
+
| `milestone_bonus` | Bonus at 25%, 50%, 75% improvement thresholds | 0.15–0.40 |
|
| 89 |
+
| `wrong_index_penalty` | Penalty for indexing useless columns | -0.05 |
|
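To make the composition concrete, here is a minimal sketch of how the three additive components could be computed for one action. It is not the project's reward code: the per-action base values, the linear scaling of `delta_reward`, and the rule that only the highest newly crossed milestone pays out are assumptions chosen to fit the ranges in the table.

```python
# Base reward per action type. Values are illustrative picks from the
# 0.05-0.20 range in the table, not the project's actual constants.
STEP_REWARD = {"inspect_query": 0.10, "analyze_indexes": 0.10, "create_index": 0.10}

MAX_DELTA_REWARD = 0.65  # upper end of the delta_reward range
MILESTONES = [(25.0, 0.15), (50.0, 0.25), (75.0, 0.40)]  # (threshold %, bonus)


def milestone_bonus(before: float, after: float) -> float:
    """Bonus for the highest threshold crossed by this step (assumed rule)."""
    crossed = [bonus for threshold, bonus in MILESTONES if before < threshold <= after]
    return max(crossed, default=0.0)


def total_reward(action_type: str, before: float, after: float) -> float:
    """step_reward + delta_reward + milestone_bonus for a single action.

    `before` and `after` are the improvement percentages (0-100) around the action.
    """
    step = STEP_REWARD.get(action_type, 0.0)
    gain = max(after - before, 0.0)
    delta = MAX_DELTA_REWARD * gain / 100.0  # assumed linear in the score gain
    return step + delta + milestone_bonus(before, after)


r_inspect = total_reward("inspect_query", 0.0, 0.0)   # no score change
r_index = total_reward("create_index", 0.0, 30.0)     # crosses the 25% milestone
```

In this sketch an inspection earns only the base reward, while an index that lifts the score by 30 points earns the base reward, a proportional delta, and the 25% milestone bonus.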
| 90 |
+
|
| 91 |
+
**Expected rewards per action:**
|
| 92 |
+
```
|
| 93 |
+
inspect_query / analyze_indexes    → ~0.10
|
| 94 |
+
create_index (no table/col match)  → ~0.10
|
| 95 |
+
create_index (partial hint match)  → ~0.20–0.45
|
| 96 |
+
create_index (perfect hint match)  → ~0.55–0.80
|
| 97 |
+
create_index (simulator confirms)  → ~0.75–0.99
|
| 98 |
+
Milestones: 25%=+0.15 50%=+0.25 75%=+0.40 (cumulative)
|
| 99 |
+
```
|
| 100 |
+
|
| 101 |
+
**Key design decision:** We used a **hint-match fallback** to give GRPO a gradient signal early in training: before the model has learned exact column names, partial column matches still receive proportional rewards. This prevented the cold-start problem where every completion gets 0 reward, so no completion is preferred over another and the model never improves.
|
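One way to award proportional credit for partial column matches is set overlap between the proposed and hinted columns. The sketch below uses Jaccard overlap, which is a stand-in chosen for illustration and not necessarily the scoring rule used in the project; `hint_match_score` and its parameters are hypothetical names.

```python
def hint_match_score(table, columns, hint_table, hint_columns) -> float:
    """Overlap between a proposed index and the scenario hint, from 0.0 to 1.0."""
    if table != hint_table:
        return 0.0
    proposed, wanted = set(columns), set(hint_columns)
    if not proposed or not wanted:
        return 0.0
    # Jaccard overlap: shared columns divided by all columns mentioned.
    return len(proposed & wanted) / len(proposed | wanted)


partial = hint_match_score("orders", ["user_id"], "orders", ["user_id", "status"])
perfect = hint_match_score("orders", ["user_id", "status"], "orders", ["user_id", "status"])
wrong_table = hint_match_score("users", ["user_id"], "orders", ["user_id", "status"])
```

This version ignores column order, which matters for real composite indexes, so a stricter variant would compare column prefixes instead of sets.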
| 102 |
+
|
| 103 |
+
### Training Config
|
| 104 |
+
```python
|
| 105 |
+
GRPOConfig(
|
| 106 |
+
max_steps = 200,
|
| 107 |
+
per_device_train_batch_size = 4,
|
| 108 |
+
gradient_accumulation_steps = 2,
|
| 109 |
+
learning_rate = 2e-5,
|
| 110 |
+
max_completion_length = 150,
|
| 111 |
+
num_generations = 4,
|
| 112 |
+
temperature = 1.0,
|
| 113 |
+
warmup_steps = 10,
|
| 114 |
+
)
|
| 115 |
+
```
|
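With `num_generations = 4`, GRPO samples four completions per prompt and scores each one relative to the others in its group instead of using a learned value function. The sketch below shows that group-relative normalization in plain Python. The population standard deviation and the `eps` value are assumptions, since implementations differ in the exact estimator and stabilizer they use.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, with rewards in the ranges listed earlier.
adv = group_advantages([0.10, 0.10, 0.45, 0.75])

# A group where every completion earned the same reward.
flat = group_advantages([0.10, 0.10, 0.10, 0.10])
```

When all four rewards are equal, every advantage is zero and the group contributes no policy gradient, which is why partial credit early in training matters so much.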
| 116 |
+
|
| 117 |
+
---
|
| 118 |
+
|
| 119 |
+
## Results
|
| 120 |
+
|
| 121 |
+
### Training Curves
|
| 122 |
+
|
| 123 |
+
After 200 steps of GRPO training:
|
| 124 |
+
|
| 125 |
+
- **Loss:** `4.92e-07 → 1.23e-05`
|
| 126 |
+
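*(The GRPO loss starts near zero because advantages are normalized within each group, and it typically rises through the KL penalty as the policy moves away from the reference model. This is expected behaviour in GRPO, not divergence. The 10-step rolling average shows stable learning without collapse.)*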
|
| 127 |
+
- **Reward:** `0.235 → 0.456 (+94% improvement)`
|
| 128 |
+
The reward trends upward from ~0.20 to ~0.45, and the 10-step rolling average rises throughout training.
|
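The +94% figure is the relative change between the first and last logged reward, which can be checked directly:

```python
start, end = 0.235, 0.456  # reward at the start and end of training

# Relative improvement as a percentage of the starting reward.
improvement = (end - start) / start * 100
print(f"{improvement:.0f}%")  # prints 94%
```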
| 129 |
+
|
| 130 |
+
### Evaluation: Trained vs Random Agent
|
| 131 |
+
|
| 132 |
+
We evaluated both agents across all 15 scenarios:
|
| 133 |
+
|
| 134 |
+
| Agent | Avg Improvement | Best Scenario | Worst Scenario |
|
| 135 |
+
|-------|----------------|---------------|----------------|
|
| 136 |
+
| Random (wrong index) | +0.0 pts | 0 pts | 0 pts |
|
| 137 |
+
| Trained (GRPO) | +31.4 pts | +59 pts (Scenario 8) | +10 pts |
|
| 138 |
+
|
| 139 |
+
The trained agent outperformed the random baseline on **every single scenario**, with an average improvement of **+31.4 database performance points**. Scenario 8 was flagged as a statistical outlier (more than 1.5σ above the mean): the agent found an especially impactful index combination. The relative gain is undefined (nominally **∞**) because the random baseline scored exactly 0 on all scenarios.
|
| 140 |
+
|
| 141 |
+
---
|
| 142 |
+
|
| 143 |
+
## Key Learnings
|
| 144 |
+
|
| 145 |
+
### 1. Reward shaping is everything in GRPO
|
| 146 |
+
The model produced low-reward outputs for the first ~10 steps, until the hint-match fallback began awarding partial credit. Without partial credit for close-but-not-perfect column names, training would have stalled completely.
|
| 147 |
+
|
| 148 |
+
### 2. LoRA makes 7B models trainable on a single GPU
|
| 149 |
+
With only **0.53% of parameters trainable** via LoRA, we fine-tuned a 7B model on a single GPU in under 2 hours. Full fine-tuning at this scale would typically need several high-memory GPUs such as A100s.
|
| 150 |
+
|
| 151 |
+
### 3. Local simulation beats API calls for training
|
| 152 |
+
Using `DatabaseSimulator` directly (instead of calling a REST API) made rewards deterministic, removed shared-state bugs, and made training about 10x faster by eliminating network latency.
|
| 153 |
+
|
| 154 |
+
### 4. GRPO loss behaviour differs from supervised loss
|
| 155 |
+
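Unlike cross-entropy loss in supervised fine-tuning, the GRPO loss can increase during healthy training: it starts near zero because advantages are normalized within each group, and it typically grows through the KL penalty as the policy moves away from the reference model. This is normal and does not indicate a problem; what matters is whether the reward is trending upward.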
|
| 156 |
+
|
| 157 |
+
### 5. Composite indexes are hard to learn
|
| 158 |
+
The model consistently struggled with scenarios requiring composite indexes on 3+ columns. Single-column indexes were learned quickly (by step ~20), while multi-column patterns took much longer to emerge.
|
| 159 |
+
|
| 160 |
+
---
|
| 161 |
+
|
| 162 |
+
## Live Demo
|
| 163 |
+
|
| 164 |
+
Try the agent yourself: pick a scenario difficulty, choose between the trained GRPO agent and the rule-based baseline, and watch it diagnose and fix the database in real time:
|
| 165 |
+
|
| 166 |
+
**[SQL Database Engineer Agent: Live Demo](https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui)**
|
| 167 |
+
|
| 168 |
+
---
|
| 169 |
+
|
| 170 |
+
## Resources
|
| 171 |
+
|
| 172 |
+
| Resource | Link |
|
| 173 |
+
|----------|------|
|
| 174 |
+
| Demo Space | https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui |
|
| 175 |
+
| |
|
| 176 |
+
| Source code | GitHub Repo - https://github.com/Mdjunaid06/sql-db-engineer-agent |
|
| 177 |
+
| | HF Repo - https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/tree/main |
|
| 178 |
+
| |
|
| 179 |
+
| Training run notebook | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb |
|
| 180 |
+
| Google Colab | https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing |
|
| 181 |
+
|
| 182 |
+
---
|
| 183 |
+
|
| 184 |
+
## What's Next
|
| 185 |
+
|
| 186 |
+
- **More steps:** 200 steps showed strong learning; 500+ steps would likely push the average score above 50 pts
|
| 187 |
+
- **Harder scenarios:** 8-table schemas with nested subqueries and CTEs
|
| 188 |
+
- **Query rewriting:** The agent currently focuses on indexing; teaching it to rewrite SQL itself is the next frontier
|
| 189 |
+
- **Multi-step episodes:** Chain multiple actions per episode so the agent can inspect → diagnose → fix → verify in sequence
|
| 190 |
+
|
| 191 |
+
---
|
| 192 |
+
|
| 193 |
+
## Acknowledgements
|
| 194 |
+
|
| 195 |
+
Built for the **META × PyTorch × SST Hackathon** using:
|
| 196 |
+
- [Unsloth](https://github.com/unslothai/unsloth): 2x faster LoRA fine-tuning
|
| 197 |
+
- [TRL](https://github.com/huggingface/trl): GRPO implementation
|
| 198 |
+
- [Hugging Face](https://huggingface.co): model hosting and Spaces
|
| 199 |
+
- [Qwen2.5](https://huggingface.co/Qwen): base language model
|