File size: 14,810 Bytes
83509d4
809345d
 
83509d4
809345d
83509d4
 
 
 
 
 
809345d
 
 
 
 
83509d4
 
 
809345d
83509d4
f004baa
809345d
3c1b0c7
809345d
3c1b0c7
809345d
 
3bb58d2
5568a3a
 
3bb58d2
 
5568a3a
 
 
 
65cc4ae
5568a3a
3bb58d2
 
809345d
 
 
 
 
 
 
 
 
 
 
3c1b0c7
 
 
 
 
809345d
3c1b0c7
809345d
3c1b0c7
809345d
3c1b0c7
809345d
 
 
 
 
 
3c1b0c7
 
 
3bb58d2
 
 
 
 
 
a1514f8
3bb58d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a1514f8
3bb58d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3c1b0c7
 
 
 
809345d
 
 
 
 
3c1b0c7
809345d
3c1b0c7
 
 
809345d
3c1b0c7
809345d
 
3c1b0c7
809345d
 
 
 
 
 
 
 
 
 
 
 
 
3c1b0c7
 
 
 
 
809345d
3c1b0c7
 
809345d
 
3c1b0c7
809345d
 
 
 
 
 
 
 
 
 
 
 
 
3c1b0c7
 
809345d
 
3c1b0c7
 
 
809345d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3bb58d2
 
 
 
 
 
 
 
 
 
809345d
 
 
 
 
 
3c1b0c7
 
 
809345d
 
 
3c1b0c7
809345d
3c1b0c7
 
809345d
 
 
 
 
3c1b0c7
809345d
3c1b0c7
 
809345d
 
 
 
 
3c1b0c7
809345d
3c1b0c7
 
809345d
 
 
 
 
 
 
 
3c1b0c7
 
 
809345d
3c1b0c7
 
809345d
 
 
3c1b0c7
 
809345d
 
 
 
3c1b0c7
 
 
 
 
 
 
 
809345d
3c1b0c7
809345d
 
3c1b0c7
 
 
809345d
3c1b0c7
 
809345d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3c1b0c7
809345d
3c1b0c7
809345d
3c1b0c7
 
3bb58d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
809345d
3bb58d2
 
 
809345d
3bb58d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
809345d
3bb58d2
 
 
809345d
3bb58d2
 
809345d
3bb58d2
 
 
 
 
3c1b0c7
 
809345d
3c1b0c7
809345d
3c1b0c7
809345d
 
 
 
3c1b0c7
809345d
 
3c1b0c7
809345d
 
 
3c1b0c7
809345d
 
3c1b0c7
809345d
 
 
3bb58d2
 
 
809345d
3c1b0c7
 
 
809345d
 
 
 
 
3c1b0c7
 
 
 
 
 
 
3bb58d2
3c1b0c7
3bb58d2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
---
title: SQL Database Engineer Agent
emoji: πŸ—„οΈ
colorFrom: blue
colorTo: green
sdk: docker
pinned: true
tags:
  - openenv
  - reinforcement-learning
  - sql
  - database
  - engineering
  - long-horizon
  - self-improvement
  - wildcard
license: mit
---

# SQL Database Engineer Agent β€” OpenEnv Environment

> **META Γ— PyTorch Γ— SST OpenEnv Hackathon** | Finals April 25–26, 2026 | Bangalore
> Evolved from SQL Query Debugger (Round 1 β€” all 4 checks passed βœ…)

An OpenEnv-compliant reinforcement learning environment where AI agents learn to act like **senior database engineers**. The agent manages a simulated production database over 50+ steps β€” inspecting slow queries, creating indexes, rewriting queries, and partitioning tables.

---


## πŸ”— Quick Links

| Resource | Link |
|---|---|
| **Live Demo** | https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui |
| **Training Notebook** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb |
| **Google Collab ** | https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing |
| **Blog Post** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/blog_post.md |
| **Source Code (HF Space)** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent |
| **Source Code (GitHub)** | https://github.com/Mdjunaid06/sql-db-engineer-agent |
---

## From Round 1 β†’ Round 2

| | Round 1 β€” SQL Query Debugger | Round 2 β€” SQL Database Engineer Agent |
|---|---|---|
| **Task** | Fix one broken SQL query | Optimize entire production database |
| **Steps** | 20 per episode | 50 per episode |
| **Actions** | 6 (identify, fix, submit...) | 15 (inspect, index, rewrite, partition...) |
| **Reward** | Dense per step | Dense + milestone bonuses |
| **Scenarios** | 15 single-query tasks | 30 total (15 new + 15 original) |
| **Training** | Rule-based baseline | Unsloth + GRPO on Qwen2.5-7B |
| **Theme** | Real-world SQL | Long-Horizon + World Modeling + Wildcard |

---

## Motivation

Every production database degrades over time.

Your app launches. Queries run in 50ms. Six months later, users are complaining. P95 query time: **8,500ms**. A senior DBA sits down β€” runs EXPLAIN queries, finds missing indexes, rewrites bad JOINs, partitions 50-million-row tables. **This takes 10 years to learn.**

We asked: **can we train an LLM to do it?**

SQL database engineering is uniquely well-suited for RL:
1. **100% measurable** β€” query time in milliseconds, index hit rates, performance scores
2. **Long-horizon** β€” real fixes require 10-50 careful, ordered steps
3. **World modeling** β€” agent must maintain internal model of DB state, indexes, query plans
4. **Self-improving** β€” curriculum generates harder scenarios as agent improves
5. **Novel** β€” no OpenEnv environment for DB engineering exists anywhere

---

## πŸ“Š Training Results

Trained **Qwen2.5-7B-Instruct** with **GRPO** using **Unsloth** (only 0.53% of parameters via LoRA):

### GRPO Training Curves β€” 200 Steps

![Demo](assests/loss_curve_demo.png)

| Metric | Value |
|---|---|
| Training steps | 200 |
| Loss | `4.92e-07 β†’ 1.23e-05` |
| Reward | `0.235 β†’ 0.456` |
| Improvement | **+94%** |
| Model | Qwen2.5-7B (0.53% trainable via LoRA) |
| Epochs | 29 |
| Batch size | 8 (4 Γ— 2 grad accum Γ— 1 GPU) |

> ⚠️ Note: GRPO policy loss rises as the model becomes more confident β€” this is expected behaviour, not divergence. The reward curve confirms consistent improvement.

### Evaluation β€” Trained vs Random Agent (15 Scenarios)

![Demo](assests/reward_curve_demo.png)

| Agent | Avg Improvement | Best Scenario | Worst Scenario |
|---|---|---|---|
| Random (wrong index) | +0.0 pts | 0 pts | 0 pts |
| Trained (GRPO) | **+31.4 pts** | **+59 pts** (Scenario 8 ) | +10 pts |

- Trained agent outperformed random baseline on **every single scenario**
- Scenario 8 flagged as outlier (Β±1.5Οƒ) β€” agent found especially impactful index combination
- Relative gain: **∞** (baseline scored exactly 0 on all scenarios)

### Training Progression

| Stage | Avg Reward | Agent Behavior |
|---|---|---|
| Before training | 0.05 | Random actions, no strategy |
| 50 steps | 0.25 | Learns to inspect before acting |
| 200 steps | **0.456** | Multi-step planning emerges |

---

## Environment Overview

| Property | Value |
|---|---|
| Domain | Database Engineering |
| Tasks | 30 (15 Round 2 scenarios + 15 Round 1 cases) |
| Max Steps | 50 per episode |
| Reward Type | Dense + milestone bonuses |
| Performance Score | 0–100 (real DB metric) |
| API Port | 7860 |
| Themes | Long-Horizon (2) + World Modeling (3.1) + Self-Improvement (4) + Wildcard (5) |

---

## Action Space (15 Actions)

### Round 2 β€” DB Engineering Actions
| Action | What It Does | Reward |
|---|---|---|
| `inspect_query` | EXPLAIN a slow query β€” scan type, rows examined, cost | +0.05 |
| `analyze_indexes` | Show all indexes + missing index hints | +0.05 |
| `create_index` | Add composite index on specified columns | +0.10 + delta |
| `rewrite_query` | Submit rewritten SQL β€” measures improvement | +0.15 + delta |
| `add_column` | Add denormalization column to reduce JOINs | +0.08 + delta |
| `drop_index` | Remove unused index (reduce write overhead) | +0.05 + delta |
| `partition_table` | Partition large table by date/ID range | +0.15 + delta |
| `analyze_statistics` | Update table statistics for query planner | +0.05 + delta |
| `request_hint` | Get progressive hint | βˆ’0.10 penalty |
| `submit_report` | **TERMINAL**: Final optimization report + full score | 0.0–1.0 |

### Round 1 β€” SQL Debugging Actions (backward compatible)
`identify_error` Β· `propose_fix` Β· `submit_answer` Β· `explain_issue` Β· `optimize_query` Β· `request_hint`

---

## Observation Space

Every observation contains the full DB state:
```json
{
  "task_id": "medium_s001",
  "task_description": "E-commerce DB: 50K orders. P95 query time > 8s. Target: < 500ms.",
  "current_context": {
    "performance_score": 12.5,
    "target_score": 75.0,
    "tables": [
      {"name": "orders", "rows": 50000, "indexes": ["PRIMARY"], "size_mb": 280},
      {"name": "users",  "rows": 8000,  "indexes": ["PRIMARY", "email_idx"]}
    ],
    "slow_queries": [
      {"id": "q1", "sql": "SELECT * FROM orders WHERE user_id=? AND status=?", "avg_ms": 8500},
      {"id": "q2", "sql": "SELECT COUNT(*) FROM orders o JOIN users u ON o.user_id=u.id", "avg_ms": 3200}
    ],
    "improvement_history": [12.5],
    "milestones_earned": [],
    "steps_remaining": 50
  },
  "step_count": 0,
  "difficulty": "medium",
  "max_steps": 50
}
```

---

## Reward Design

Dense reward at every step + milestone bonuses:

```
inspect_query / analyze_indexes  β†’ +0.05 (investigation rewarded)
create_index with improvement    β†’ +0.10 + delta_reward
Milestone: 25% improvement       β†’ +0.15 ONE-TIME bonus
Milestone: 50% improvement       β†’ +0.25 ONE-TIME bonus
Milestone: 75% improvement       β†’ +0.40 ONE-TIME bonus
submit_report (terminal)         β†’ 0.0–1.0 full score
Efficiency bonus (< 70% budget)  β†’ +0.10
Loop penalty (same action x2+)   β†’ βˆ’0.08
Hint penalty                     β†’ βˆ’0.10
Backtrack penalty                β†’ βˆ’0.05
Budget exhaustion                β†’ βˆ’0.15
```

### GRPO Reward Breakdown (Expected per action)
```
inspect_query / analyze_indexes       β†’  ~0.10
create_index (no table/col match)     β†’  ~0.10
create_index (partial hint match)     β†’  ~0.20–0.45
create_index (perfect hint match)     β†’  ~0.55–0.80
create_index (simulator confirms)     β†’  ~0.75–0.99
Milestones: 25%=+0.15  50%=+0.25  75%=+0.40  (cumulative)
```

### Terminal Score Formula
```python
perf_improvement = (final_score - baseline) / (100 - baseline)
step_efficiency  = 1.0 - (steps_used / max_steps)
terminal_score   = (perf_improvement * 0.60) + (step_efficiency * 0.20) + 0.10
```

---

## Scenarios β€” 30 Tasks

### Round 2: DB Engineering (15 new tasks)

#### Easy (15 steps, target 80+)
| ID | Description |
|---|---|
| easy_s001 | User lookup β€” missing email index on 10K users |
| easy_s002 | Order status β€” composite index on 50K orders |
| easy_s003 | Product search β€” LIKE query on 20K products |
| easy_s004 | Session lookup β€” 15K sessions, no index |
| easy_s005 | Log filter β€” compound index on 30K logs |

#### Medium (25–30 steps, target 72–78)
| ID | Description |
|---|---|
| medium_s001 | E-commerce: 50K orders + 8K users, 2 slow queries |
| medium_s002 | Blog: 100K posts + 20K authors, search slow |
| medium_s003 | Inventory: 200K stock movements, rewrite + index |
| medium_s004 | Ticketing: 60K tickets, status queue degraded |
| medium_s005 | Analytics: 150K events, funnel query slow |

#### Hard (50 steps, target 65–70)
| ID | Description |
|---|---|
| hard_s001 | Financial: 500K transactions, 4 tables, 3 slow queries |
| hard_s002 | SaaS: 8-table schema, 2M activity log, dashboard 20s+ |
| hard_s003 | Healthcare: 1M patient records, compliance queries |
| hard_s004 | Gaming: 2M players, 5M matches, leaderboard degraded |
| hard_s005 | Logistics: 6 tables, 3M shipments + 10M tracking rows |

### Round 1: SQL Debugging (15 original tasks β€” backward compatible)
Easy: syntax errors Β· Medium: logic bugs Β· Hard: performance anti-patterns

---

## Self-Improving Curriculum

```
Agent avg score > 0.75  β†’  Advance to harder tier
Agent avg score < 0.30  β†’  Drop back a tier
Ultra tier (tier 3)     β†’  Auto-generated 5-8 table scenarios, no hints
```

The environment gets harder as the agent gets smarter. **Genuine adaptive curriculum.**

---

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Liveness check β€” always 200 |
| `/reset` | POST | Start new episode β†’ Observation |
| `/step` | POST | Submit action β†’ (obs, reward, done, info) |
| `/state` | GET | Current episode state |
| `/tasks` | GET | All 30 tasks + action schema |
| `/grader` | POST | Grade an episode β†’ float score |
| `/baseline` | POST | Run baseline agent β†’ scores |
| `/progress` | GET | DB performance history + milestones |

---

## Live Demo

```bash
# Reset with e-commerce scenario
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "easy", "task_id": "easy_s001"}'

# Agent inspects slow query β†’ sees FULL TABLE SCAN
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "inspect_query", "payload": {"query_id": "q1"}}'

# Agent creates index β†’ performance score 8.0 β†’ 82.0
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "create_index", "payload": {"table": "users", "columns": ["email"]}}'

# Agent submits report β†’ terminal score 0.82
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit_report", "payload": {"summary": "Added email index. Performance 8 to 82."}}'
```

---

## Project Structure

```
sql-query-debugger/
β”œβ”€β”€ .env                          # Environment variables
β”œβ”€β”€ .env.example                  # Environment variables template
β”œβ”€β”€ .gitignore
β”œβ”€β”€ Dockerfile                    # Container definition
β”œβ”€β”€ README.md                     # This file
β”œβ”€β”€ blog_post.md                  # HF blog post (separate from README)
β”œβ”€β”€ loss_curve.png                # GRPO training curves βœ… evidence
β”œβ”€β”€ reward_curve.png              # Evaluation results βœ… evidence
β”œβ”€β”€ openenv.yaml                  # OpenEnv metadata (v2.0.0)
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ requirements.txt              # Pinned dependencies
β”œβ”€β”€ uv.lock
β”œβ”€β”€ baseline.py                   # Rule-based baseline agent
β”œβ”€β”€ demo_app.py                   # Gradio demo app
β”œβ”€β”€ inference.py                  # LLM inference agent
β”‚
β”œβ”€β”€ api/
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── server.py                 # FastAPI β€” 11 endpoints
β”‚
β”œβ”€β”€ dataset/
β”‚   β”œβ”€β”€ easy_cases.json           # Round 1: easy SQL tasks
β”‚   β”œβ”€β”€ easy_scenarios.json       # Round 2: easy DB scenarios
β”‚   β”œβ”€β”€ hard_cases.json           # Round 1: hard SQL tasks
β”‚   β”œβ”€β”€ hard_scenarios.json       # Round 2: hard DB scenarios
β”‚   β”œβ”€β”€ medium_cases.json         # Round 1: medium SQL tasks
β”‚   └── medium_scenarios.json     # Round 2: medium DB scenarios
β”‚
β”œβ”€β”€ env/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ scenarios/                # Scenario definitions
β”‚   β”œβ”€β”€ curriculum.py             # Self-improving curriculum
β”‚   β”œβ”€β”€ db_simulator.py           # DB performance simulator
β”‚   β”œβ”€β”€ environment.py            # Core: reset() step() state()
β”‚   β”œβ”€β”€ graders.py                # Deterministic graders
β”‚   β”œβ”€β”€ models.py                 # Pydantic models (15 action types)
β”‚   β”œβ”€β”€ reward.py                 # Dense reward + milestones
β”‚   β”œβ”€β”€ scenario_generator.py     # Dynamic scenario generation
β”‚   └── tasks.py                  # Task manager (30 tasks)
β”‚
β”œβ”€β”€ sdea-trained/
β”‚   └── eval_results.json         # Evaluation results JSON
β”‚
β”œβ”€β”€ training/
β”‚   β”œβ”€β”€ colab_notebook.py         # Colab training notebook
β”‚   β”œβ”€β”€ evaluate_agent.py         # Evaluation + reward curve generator
β”‚   β”œβ”€β”€ generate_plots.py         # Fixed plot generator
β”‚   β”œβ”€β”€ generate_training_data.py # Expert trajectory collector
β”‚   └── train_agent.py            # Unsloth + GRPO training script
β”‚
└── tests/
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ test_environment.py       # Environment tests
    β”œβ”€β”€ test_graders.py           # Grader tests
    β”œβ”€β”€ test_reward.py            # Reward tests
    └── test_tasks.py             # Task tests
```

---

## Setup & Installation

```bash
# Clone
git clone https://github.com/Mdjunaid06/sql-db-engineer-agent
cd sql-db-engineer-agent

# Install
pip install -r requirements.txt

# Configure
cp .env.example .env
# Add HF_TOKEN to .env

# Run
uvicorn api.server:app --host 0.0.0.0 --port 7860 --reload

# Verify
curl http://localhost:7860/health
# {"status":"ok","version":"2.0.0"}

# Open demo
# http://localhost:7860/demo
```

---

## Validation

```bash
pytest tests/ -v          # 24/24 passed
openenv validate .         # [OK] Ready for multi-mode deployment
```

---

## Built For

**META Γ— PyTorch Γ— SST OpenEnv Hackathon**
Finals: April 25–26, 2026 | Bangalore

*"We didn't build an environment. We built a DBA training simulator."*