Update README.md
README.md
CHANGED
|
@@ -26,6 +26,18 @@ An OpenEnv-compliant reinforcement learning environment where AI agents learn to
|
|
| 26 |
|
| 27 |
---
|
| 28 |
|
| 29 |
## From Round 1 → Round 2
|
| 30 |
|
| 31 |
| | Round 1: SQL Query Debugger | Round 2: SQL Database Engineer Agent |
|
|
@@ -57,6 +69,49 @@ SQL database engineering is uniquely well-suited for RL:
|
|
| 57 |
|
| 58 |
---
|
| 59 |
|
| 60 |
## Environment Overview
|
| 61 |
|
| 62 |
| Property | Value |
|
|
@@ -140,6 +195,16 @@ Backtrack penalty → -0.05
|
|
| 140 |
Budget exhaustion → -0.15
|
| 141 |
```
|
| 142 |
|
| 143 |
### Terminal Score Formula
|
| 144 |
```python
|
| 145 |
perf_improvement = (final_score - baseline) / (100 - baseline)
|
|
@@ -197,21 +262,6 @@ The environment gets harder as the agent gets smarter. **Genuine adaptive curric
|
|
| 197 |
|
| 198 |
---
|
| 199 |
|
| 200 |
-
## Training Results
|
| 201 |
-
|
| 202 |
-
Trained **Qwen2.5-7B-Instruct** with **GRPO** using **Unsloth**:
|
| 203 |
-
|
| 204 |
-
| Stage | Avg Reward | Agent Behavior |
|
| 205 |
-
|---|---|---|
|
| 206 |
-
| Before training | 0.05 | Random actions, no strategy |
|
| 207 |
-
| 50 steps | 0.25 | Learns to inspect before acting |
|
| 208 |
-
| 200 steps | 0.55 | Multi-step planning emerges |
|
| 209 |
-
| 500 steps | **0.82** | Senior DBA behavior pattern |
|
| 210 |
-
|
| 211 |
-

|
| 212 |
-
|
| 213 |
-
---
|
| 214 |
-
|
| 215 |
## API Endpoints
|
| 216 |
|
| 217 |
| Endpoint | Method | Description |
|
|
@@ -256,41 +306,63 @@ curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
|
|
| 256 |
## Project Structure
|
| 257 |
|
| 258 |
```
|
| 259 |
-
sql-
|
| 260 |
-
βββ
|
| 261 |
-
βββ
|
| 262 |
-
βββ
|
| 263 |
-
βββ
|
| 264 |
-
βββ
|
| 265 |
-
βββ
|
| 266 |
-
βββ
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
|
| 274 |
-
|
|
|
|
| 275 |
├── api/
|
| 276 |
-
β
|
|
|
|
|
|
|
| 277 |
├── dataset/
|
| 278 |
-
│   ├── easy_cases.json
|
| 279 |
-
β βββ
|
| 280 |
-
│   ├── hard_cases.json
|
| 281 |
-
β βββ
|
| 282 |
-
β βββ
|
| 283 |
-
β βββ
|
| 284 |
├── training/
|
| 285 |
-
β βββ
|
| 286 |
-
│   ├── evaluate_agent.py
|
|
|
|
| 287 |
│   ├── generate_training_data.py # Expert trajectory collector
|
| 288 |
-
β βββ
|
| 289 |
-
|
| 290 |
-
β βββ mini_blog.md # HF blog post
|
| 291 |
βββ tests/
|
| 292 |
-
βββ
|
| 293 |
-
|
|
|
|
|
|
|
|
|
|
| 294 |
```
|
| 295 |
|
| 296 |
---
|
|
@@ -315,6 +387,9 @@ uvicorn api.server:app --host 0.0.0.0 --port 7860 --reload
|
|
| 315 |
# Verify
|
| 316 |
curl http://localhost:7860/health
|
| 317 |
# {"status":"ok","version":"2.0.0"}
|
| 318 |
```
|
| 319 |
|
| 320 |
---
|
|
@@ -327,12 +402,10 @@ openenv validate . # [OK] Ready for multi-mode deployment
|
|
| 327 |
```
|
| 328 |
|
| 329 |
---
|
| 330 |
-
## Colab Training Notebook
|
| 331 |
-
[](https://colab.research.google.com/drive/1xviukNsgrOCP25W2Z6ocUzvD_C7g6quw?usp=sharing)
|
| 332 |
|
| 333 |
## Built For
|
| 334 |
|
| 335 |
**META × PyTorch × SST OpenEnv Hackathon**
|
| 336 |
-
Finals: April 25–26, 2026 | Bangalore
|
| 337 |
|
| 338 |
-
*"We didn't build an environment. We built a DBA training simulator."*
|
|
|
|
| 26 |
|
| 27 |
---
|
| 28 |
|
| 29 |
+
## Quick Links
|
| 30 |
+
|
| 31 |
+
| Resource | Link |
|
| 32 |
+
|---|---|
|
| 33 |
+
| **Live Demo** | https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui |
|
| 34 |
+
| **Training Notebook** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb |
|
| 35 |
+
| **Google Colab** | https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing |
|
| 36 |
+
| **Blog Post** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/blog_post.md |
|
| 37 |
+
| **Source Code** (HF Space) | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/tree/main |
|
| 38 |
+
| **Source Code** (Git repo) | https://github.com/Mdjunaid06/sql-db-engineer-agent |
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
## From Round 1 → Round 2
|
| 42 |
|
| 43 |
| | Round 1: SQL Query Debugger | Round 2: SQL Database Engineer Agent |
|
|
|
|
| 69 |
|
| 70 |
---
|
| 71 |
|
| 72 |
+
## Training Results
|
| 73 |
+
|
| 74 |
+
Trained **Qwen2.5-7B-Instruct** with **GRPO** using **Unsloth** (LoRA adapters; only 0.53% of parameters are trainable):
|
| 75 |
+
|
| 76 |
+
### GRPO Training Curves: 200 Steps
|
| 77 |
+
|
| 78 |
+

|
| 79 |
+
|
| 80 |
+
| Metric | Value |
|
| 81 |
+
|---|---|
|
| 82 |
+
| Training steps | 200 |
|
| 83 |
+
| Loss | `4.92e-07 → 1.23e-05` |
|
| 84 |
+
| Reward | `0.235 → 0.456` |
|
| 85 |
+
| Improvement | **+94%** |
|
| 86 |
+
| Model | Qwen2.5-7B (0.53% trainable via LoRA) |
|
| 87 |
+
| Epochs | 29 |
|
| 88 |
+
| Batch size | 8 (4 × 2 grad accum × 1 GPU) |
|
| 89 |
+
|
| 90 |
+
> ⚠️ Note: GRPO policy loss rises as the model becomes more confident; this is expected behaviour, not divergence. The reward curve confirms consistent improvement.
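The headline figures in the metrics table follow from simple arithmetic. The sketch below is illustrative only (the function names are not part of the repository); it reproduces the +94% reward gain and the batch size of 8 from the values listed in the table.

```python
def relative_improvement(start: float, end: float) -> float:
    """Return the relative change from start to end, in percent."""
    if start == 0:
        raise ValueError("relative improvement is undefined for a zero start value")
    return (end - start) / start * 100.0


def effective_batch_size(per_device: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device * grad_accum_steps * num_gpus


# Values taken from the metrics table: reward 0.235 -> 0.456, batch 4 x 2 x 1.
reward_gain = relative_improvement(0.235, 0.456)
batch = effective_batch_size(per_device=4, grad_accum_steps=2, num_gpus=1)

print(f"reward gain: {reward_gain:+.0f}%")
print(f"effective batch size: {batch}")
```

The same helper raises an error for a zero start value, which is why a percentage gain cannot be quoted against a baseline that scores exactly 0.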
|
| 91 |
+
|
| 92 |
+
### Evaluation: Trained vs Random Agent (15 Scenarios)
|
| 93 |
+
|
| 94 |
+

|
| 95 |
+
|
| 96 |
+
| Agent | Avg Improvement | Best Scenario | Worst Scenario |
|
| 97 |
+
|---|---|---|---|
|
| 98 |
+
| Random (wrong index) | +0.0 pts | 0 pts | 0 pts |
|
| 99 |
+
| Trained (GRPO) | **+31.4 pts** | **+59 pts** (Scenario 8) | +10 pts |
|
| 100 |
+
|
| 101 |
+
- Trained agent outperformed random baseline on **every single scenario**
|
| 102 |
+
- Scenario 8 flagged as an outlier (±1.5σ): the agent found an especially impactful index combination
|
| 103 |
+
- Relative gain: **∞** (baseline scored exactly 0 on all scenarios)
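A rough sketch of how the outlier flag and the infinite relative gain in these bullets could be computed. The per-scenario scores are not listed in this README, so the list below is hypothetical placeholder data, and the population z-score rule with a 1.5σ threshold is an assumption about how the flag was derived, not the repository's actual evaluation code.

```python
import math
from statistics import mean, pstdev


def flag_outliers(scores, threshold=1.5):
    """Return indices whose population z-score exceeds the threshold in magnitude."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return []  # all scores identical, nothing can be an outlier
    return [i for i, s in enumerate(scores) if abs(s - mu) / sigma > threshold]


def relative_gain(trained: float, baseline: float) -> float:
    """Relative gain over the baseline; infinite when the baseline is zero."""
    if baseline == 0:
        return math.inf if trained > 0 else 0.0
    return (trained - baseline) / baseline


# Hypothetical per-scenario improvements in points (NOT the real evaluation data).
hypothetical_scores = [20, 25, 30, 28, 22, 59]
outliers = flag_outliers(hypothetical_scores)

print("outlier indices:", outliers)
print("relative gain:", relative_gain(31.4, 0.0))
```

Because the random baseline scored 0 everywhere, the absolute improvement in points is the more informative number; the relative gain only says that the baseline had nothing to compare against.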
|
| 104 |
+
|
| 105 |
+
### Training Progression
|
| 106 |
+
|
| 107 |
+
| Stage | Avg Reward | Agent Behavior |
|
| 108 |
+
|---|---|---|
|
| 109 |
+
| Before training | 0.05 | Random actions, no strategy |
|
| 110 |
+
| 50 steps | 0.25 | Learns to inspect before acting |
|
| 111 |
+
| 200 steps | **0.456** | Multi-step planning emerges |
|
| 112 |
+
|
| 113 |
+
---
|
| 114 |
+
|
| 115 |
## Environment Overview
|
| 116 |
|
| 117 |
| Property | Value |
|
|
|
|
| 195 |
Budget exhaustion → -0.15
|
| 196 |
```
|
| 197 |
|
| 198 |
+
### GRPO Reward Breakdown (Expected per action)
|
| 199 |
+
```
|
| 200 |
+
inspect_query / analyze_indexes → ~0.10
|
| 201 |
+
create_index (no table/col match) → ~0.10
|
| 202 |
+
create_index (partial hint match) → ~0.20–0.45
|
| 203 |
+
create_index (perfect hint match) → ~0.55–0.80
|
| 204 |
+
create_index (simulator confirms) → ~0.75–0.99
|
| 205 |
+
Milestones: 25%=+0.15 50%=+0.25 75%=+0.40 (cumulative)
|
| 206 |
+
```
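The milestone line in the breakdown above can be read in more than one way. The sketch below assumes that "cumulative" means each bonus is paid once when its threshold is first crossed and that the bonuses add up, and that the percentages refer to the `perf_improvement` fraction from the Terminal Score Formula. Both are assumptions, and `milestone_bonus` is a hypothetical helper, not a function from `env/reward.py`.

```python
# (threshold, bonus) pairs copied from the reward breakdown.
MILESTONES = [(0.25, 0.15), (0.50, 0.25), (0.75, 0.40)]


def perf_improvement(score: float, baseline: float) -> float:
    """Fraction of the available headroom (baseline to 100) gained so far."""
    if baseline >= 100:
        raise ValueError("baseline must be below 100")
    return (score - baseline) / (100 - baseline)


def milestone_bonus(prev_progress: float, new_progress: float) -> float:
    """Sum the bonuses of every milestone crossed between two progress values."""
    return sum(
        bonus
        for threshold, bonus in MILESTONES
        if prev_progress < threshold <= new_progress
    )


# Example: baseline score 40, one action lifts the score from 40 to 58.
before = perf_improvement(40, 40)   # no headroom gained yet
after = perf_improvement(58, 40)    # 18 of 60 available points gained
print(milestone_bonus(before, after))
```

Tracking the previous progress value keeps each milestone from being paid twice if the agent later backtracks and re-crosses the same threshold in a single step sequence.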
|
| 207 |
+
|
| 208 |
### Terminal Score Formula
|
| 209 |
```python
|
| 210 |
perf_improvement = (final_score - baseline) / (100 - baseline)
|
|
|
|
| 262 |
|
| 263 |
---
|
| 264 |
|
| 265 |
## API Endpoints
|
| 266 |
|
| 267 |
| Endpoint | Method | Description |
|
|
|
|
| 306 |
## Project Structure
|
| 307 |
|
| 308 |
```
+sql-query-debugger/
+├── .env                          # Environment variables
+├── .env.example                  # Environment variables template
+├── .gitignore
+├── Dockerfile                    # Container definition
+├── README.md                     # This file
+├── blog_post.md                  # HF blog post (separate from README)
+├── loss_curve.png                # GRPO training curves (evidence)
+├── reward_curve.png              # Evaluation results (evidence)
+├── openenv.yaml                  # OpenEnv metadata (v2.0.0)
+├── pyproject.toml
+├── requirements.txt              # Pinned dependencies
+├── uv.lock
+├── baseline.py                   # Rule-based baseline agent
+├── demo_app.py                   # Gradio demo app
+├── inference.py                  # LLM inference agent
+│
 ├── api/
+│   ├── __init__.py
+│   └── server.py                 # FastAPI, 11 endpoints
+│
 ├── dataset/
+│   ├── easy_cases.json           # Round 1: easy SQL tasks
+│   ├── easy_scenarios.json       # Round 2: easy DB scenarios
+│   ├── hard_cases.json           # Round 1: hard SQL tasks
+│   ├── hard_scenarios.json       # Round 2: hard DB scenarios
+│   ├── medium_cases.json         # Round 1: medium SQL tasks
+│   └── medium_scenarios.json     # Round 2: medium DB scenarios
+│
+├── env/
+│   ├── __init__.py
+│   ├── scenarios/                # Scenario definitions
+│   ├── curriculum.py             # Self-improving curriculum
+│   ├── db_simulator.py           # DB performance simulator
+│   ├── environment.py            # Core: reset() step() state()
+│   ├── graders.py                # Deterministic graders
+│   ├── models.py                 # Pydantic models (15 action types)
+│   ├── reward.py                 # Dense reward + milestones
+│   ├── scenario_generator.py     # Dynamic scenario generation
+│   └── tasks.py                  # Task manager (30 tasks)
+│
+├── sdea-trained/
+│   └── eval_results.json         # Evaluation results JSON
+│
 ├── training/
+│   ├── colab_notebook.py         # Colab training notebook
+│   ├── evaluate_agent.py         # Evaluation + reward curve generator
+│   ├── generate_plots.py         # Fixed plot generator
 │   ├── generate_training_data.py # Expert trajectory collector
+│   └── train_agent.py            # Unsloth + GRPO training script
+│
 └── tests/
+    ├── __init__.py
+    ├── test_environment.py       # Environment tests
+    ├── test_graders.py           # Grader tests
+    ├── test_reward.py            # Reward tests
+    └── test_tasks.py             # Task tests
|
| 366 |
```
|
| 367 |
|
| 368 |
---
|
|
|
|
| 387 |
# Verify
|
| 388 |
curl http://localhost:7860/health
|
| 389 |
# {"status":"ok","version":"2.0.0"}
|
| 390 |
+
|
| 391 |
+
# Open demo
|
| 392 |
+
# http://localhost:7860/demo
|
| 393 |
```
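The same health check can be scripted instead of run by hand with `curl`. This is a minimal sketch using only the standard library; it assumes the `/health` endpoint returns the JSON body shown in the comment above, and the helper names `parse_health` and `check_health` are illustrative rather than part of the project.

```python
import json
from urllib.request import urlopen


def parse_health(body: str) -> bool:
    """Return True when the health payload reports status "ok"."""
    payload = json.loads(body)
    return payload.get("status") == "ok"


def check_health(base_url: str = "http://localhost:7860", timeout: float = 5.0) -> bool:
    """Fetch <base_url>/health and report whether the server is healthy."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return parse_health(resp.read().decode("utf-8"))


# Parsing the documented response body; no running server is needed for this line.
print(parse_health('{"status":"ok","version":"2.0.0"}'))
```

With the server started via `uvicorn`, calling `check_health()` performs the real request; it raises `urllib.error.URLError` if nothing is listening on the port.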
|
| 394 |
|
| 395 |
---
|
|
|
|
| 402 |
```
|
| 403 |
|
| 404 |
---
|
|
|
|
|
|
|
| 405 |
|
| 406 |
## Built For
|
| 407 |
|
| 408 |
**META × PyTorch × SST OpenEnv Hackathon**
|
| 409 |
+
Finals: April 25–26, 2026 | Bangalore
|
| 410 |
|
| 411 |
+
*"We didn't build an environment. We built a DBA training simulator."*
|