sql-db-engineer-agent / blog_post.md
junaid0600's picture
Update blog_post.md
ac0ea5f verified
|
Raw
History Blame Contribute Delete
9.17 kB
# Training a SQL Database Engineer Agent with GRPO on Qwen2.5
*Fine-tuning a language model to autonomously diagnose and fix slow database queries using Reinforcement Learning*
---
## Overview
Modern applications live and die by their database performance. Slow queries cause timeouts, poor user experience, and infrastructure costs β€” yet diagnosing and fixing them requires deep expertise. What if a language model could learn to do this autonomously?
In this project, we trained **Qwen2.5-7B-Instruct** to act as a senior database engineer β€” inspecting slow queries, identifying missing indexes, and applying targeted fixes β€” using **Group Relative Policy Optimization (GRPO)**, a reinforcement learning algorithm that teaches the model through reward signals rather than labeled examples.
After **200 training steps**, the agent achieved a **+94% reward improvement** (0.235 β†’ 0.456) and outperformed a random baseline by an average of **+31.4 database performance points** across 15 scenarios.
---
## The Problem
Given a database with slow-running SQL queries, the agent must:
1. **Investigate** β€” understand why queries are slow
2. **Diagnose** β€” identify missing indexes or inefficient query patterns
3. **Fix** β€” apply the correct indexes and optimizations
4. **Verify** β€” confirm the performance score improved
A random agent that creates indexes on arbitrary columns scores **0 pts** on every scenario. Our trained agent had to learn β€” purely from feedback β€” which tables and columns actually matter.
---
## Architecture
### Environment β€” DatabaseSimulator
We built a custom `DatabaseSimulator` that:
- Loads SQL scenarios (tables, slow queries, missing index hints)
- Tracks a **performance score (0–100)** that improves when correct indexes are applied
- Returns delta rewards based on how much the score improved
- Runs **locally** β€” no HTTP calls, no shared state, fully deterministic
### Scenarios
We created **15 scenarios** across 3 difficulty levels:
| Level | Count | Description |
|-------|-------|-------------|
| Easy | 5 | Single table, one missing index |
| Medium | 5 | E-commerce DB, composite indexes |
| Hard | 5 | 4-table financial schema, complex joins |
### Action Space
The agent can take 10 actions:
```json
{"action_type": "inspect_query", "payload": {"query_id": "q1"}}
{"action_type": "analyze_indexes", "payload": {}}
{"action_type": "create_index", "payload": {"table": "orders", "columns": ["user_id", "status"]}}
{"action_type": "rewrite_query", "payload": {"query_id": "q1", "new_sql": "..."}}
{"action_type": "analyze_statistics","payload": {"table": "orders"}}
{"action_type": "submit_report", "payload": {"summary": "..."}}
```
---
## Training Setup
### Model
- **Base model:** `unsloth/Qwen2.5-7B-Instruct` (7.66B parameters)
- **Trainable parameters:** 40,370,176 of 7,655,986,688 **(only 0.53% via LoRA)**
- **Fine-tuning:** LoRA (r=16, alpha=16) via Unsloth β€” 2x faster free finetuning
- **Training algorithm:** GRPO (Group Relative Policy Optimization)
- **Framework:** TRL + Unsloth + PyTorch
- **GPU:** Single GPU (1x)
### Training Data
- **Examples:** 15 scenarios
- **Epochs:** 29 (cycling through all 15 scenarios)
- **Total steps:** 200
- **Effective batch size:** 8 (batch size 4 Γ— gradient accumulation 2 Γ— 1 GPU)
### GRPO Reward Function
The reward function combines three signals:
```python
total_reward = step_reward + delta_reward + milestone_bonus
```
| Component | Description | Range |
|-----------|-------------|-------|
| `step_reward` | Base reward per valid action type | 0.05–0.20 |
| `delta_reward` | Proportional to DB performance improvement | 0.0–0.65 |
| `milestone_bonus` | Bonus at 25%, 50%, 75% improvement thresholds | 0.15–0.40 |
| `wrong_index_penalty` | Penalty for indexing useless columns | -0.05 |
**Expected rewards per action:**
```
inspect_query / analyze_indexes β†’ ~0.10
create_index (no table/col match) β†’ ~0.10
create_index (partial hint match) β†’ ~0.20–0.45
create_index (perfect hint match) β†’ ~0.55–0.80
create_index (simulator confirms) β†’ ~0.75–0.99
Milestones: 25%=+0.15 50%=+0.25 75%=+0.40 (cumulative)
```
**Key design decision:** We used a **hint-match fallback** to give GRPO a gradient signal early in training β€” before the model has learned exact column names, partial column matches still receive proportional rewards. This prevented the cold-start problem where the model gets 0 reward for everything and never improves.
### Training Config
```python
GRPOConfig(
max_steps = 200,
per_device_train_batch_size = 4,
gradient_accumulation_steps = 2,
learning_rate = 2e-5,
max_completion_length = 150,
num_generations = 4,
temperature = 1.0,
warmup_steps = 10,
)
```
---
## Results
### Training Curves
After 200 steps of GRPO training:
- **Loss:** `4.92e-07 β†’ 1.23e-05`
*(GRPO policy loss rises as the model becomes more confident in its policy β€” this is expected behaviour in GRPO, not divergence. The 10-step rolling average confirms stable learning without collapse.)*
- **Reward:** `0.235 β†’ 0.456 (+94% improvement)`
The reward shows a strong and consistent upward trend from ~0.20 to ~0.45, with the 10-step rolling average clearly confirming the model improved throughout training.
### Evaluation β€” Trained vs Random Agent
We evaluated both agents across all 15 scenarios:
| Agent | Avg Improvement | Best Scenario | Worst Scenario |
|-------|----------------|---------------|----------------|
| Random (wrong index) | +0.0 pts | 0 pts | 0 pts |
| Trained (GRPO) | +31.4 pts | +59 pts (Scenario 8 ) | +10 pts |
The trained agent outperformed the random baseline on **every single scenario**, with an average improvement of **+31.4 database performance points**. Scenario 8 was flagged as a statistical outlier (Β±1.5Οƒ above mean) β€” the agent found an especially impactful index combination. The relative gain is **∞** since the untrained baseline scored exactly 0 on all scenarios.
---
## Key Learnings
### 1. Reward shaping is everything in GRPO
The model started producing low-reward outputs for the first ~10 steps until the hint-match fallback kicked in. Without partial credit for close-but-not-perfect column names, training would have stalled completely.
### 2. LoRA makes 7B models trainable on a single GPU
With only **0.53% of parameters trainable** via LoRA, we fine-tuned a full 7B model on a single GPU in under 2 hours. Without LoRA this would require multiple A100s.
### 3. Local simulation beats API calls for training
Using `DatabaseSimulator` directly (instead of calling a REST API) made rewards deterministic, removed shared state bugs, and made training 10x faster with no network latency.
### 4. GRPO loss behaviour differs from supervised loss
Unlike cross-entropy loss in supervised fine-tuning, GRPO policy loss can increase as the model becomes more confident in its policy. This is normal and does not indicate a problem β€” what matters is whether the reward is trending upward.
### 5. Composite indexes are hard to learn
The model consistently struggled with scenarios requiring composite indexes on 3+ columns. Single-column indexes were learned quickly (by step ~20), while multi-column patterns took much longer to emerge.
---
## Live Demo
Try the agent yourself β€” pick a scenario difficulty, choose between the trained GRPO agent and the rule-based baseline, and watch it diagnose and fix the database in real time:
**[SQL Database Engineer Agent β€” Live Demo](https://huggingface.co/spaces/YOUR_USERNAME/sql-db-engineer-demo)**
---
## Resources
| Resource | Link |
|----------|------|
| Demo Space |https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui |
| |
| Source code | GitHub Repo - https://github.com/Mdjunaid06/sql-db-engineer-agent |
| | HF Repo - https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent|
| |
|Training Run Notebook URL| https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb
|Google Collab| https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing |
---
## What's Next
- **More steps:** 200 steps showed strong learning β€” 500+ steps would likely push the average score above 50 pts
- **Harder scenarios:** 8-table schemas with nested subqueries and CTEs
- **Query rewriting:** The agent currently focuses on indexing β€” teaching it to rewrite SQL itself is the next frontier
- **Multi-step episodes:** Chain multiple actions per episode so the agent can inspect β†’ diagnose β†’ fix β†’ verify in sequence
---
## Acknowledgements
Built for the **META Γ— PyTorch Γ— SST Hackathon** using:
- [Unsloth](https://github.com/unslothai/unsloth) β€” 2x faster LoRA fine-tuning
- [TRL](https://github.com/huggingface/trl) β€” GRPO implementation
- [Hugging Face](https://huggingface.co) β€” model hosting and Spaces
- [Qwen2.5](https://huggingface.co/Qwen) β€” base language model