junaid0600 committed on
Commit 3bb58d2 · verified
1 Parent(s): c0392b1

Update README.md

Files changed (1)
  1. README.md +122 -49
README.md CHANGED
@@ -26,6 +26,18 @@ An OpenEnv-compliant reinforcement learning environment where AI agents learn to
26
 
27
  ---
28
 
29
  ## From Round 1 → Round 2
30
 
31
  | | Round 1: SQL Query Debugger | Round 2: SQL Database Engineer Agent |
@@ -57,6 +69,49 @@ SQL database engineering is uniquely well-suited for RL:
57
 
58
  ---
59
 
60
  ## Environment Overview
61
 
62
  | Property | Value |
@@ -140,6 +195,16 @@ Backtrack penalty → −0.05
140
  Budget exhaustion → −0.15
141
  ```
142
 
143
  ### Terminal Score Formula
144
  ```python
145
  perf_improvement = (final_score - baseline) / (100 - baseline)
@@ -197,21 +262,6 @@ The environment gets harder as the agent gets smarter. **Genuine adaptive curric
197
 
198
  ---
199
 
200
- ## Training Results
201
-
202
- Trained **Qwen2.5-7B-Instruct** with **GRPO** using **Unsloth**:
203
-
204
- | Stage | Avg Reward | Agent Behavior |
205
- |---|---|---|
206
- | Before training | 0.05 | Random actions, no strategy |
207
- | 50 steps | 0.25 | Learns to inspect before acting |
208
- | 200 steps | 0.55 | Multi-step planning emerges |
209
- | 500 steps | **0.82** | Senior DBA behavior pattern |
210
-
211
- ![Reward Curve](reward_curve.png)
212
-
213
- ---
214
-
215
  ## API Endpoints
216
 
217
  | Endpoint | Method | Description |
@@ -256,41 +306,63 @@ curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
256
  ## Project Structure
257
 
258
  ```
- sql-db-engineer-agent/
- ├── openenv.yaml  # OpenEnv metadata (v2.0.0)
- ├── Dockerfile  # Container definition
- ├── requirements.txt  # Pinned dependencies
- ├── README.md  # This file
- ├── baseline.py  # Rule-based baseline agent
- ├── inference.py  # LLM inference agent
- ├── env/
- │   ├── environment.py  # Core: reset() step() state()
- │   ├── db_simulator.py  # NEW: DB performance simulator
- │   ├── curriculum.py  # NEW: Self-improving curriculum
- │   ├── scenario_generator.py  # NEW: Dynamic scenario generation
- │   ├── models.py  # Pydantic models (15 action types)
- │   ├── tasks.py  # Task manager (30 tasks)
- │   ├── graders.py  # Deterministic graders
- │   └── reward.py  # Dense reward + milestones
  ├── api/
- │   └── server.py  # FastAPI, 8 endpoints
  ├── dataset/
- │   ├── easy_cases.json  # Round 1: 5 syntax tasks
- │   ├── medium_cases.json  # Round 1: 5 logic tasks
- │   ├── hard_cases.json  # Round 1: 5 performance tasks
- │   ├── easy_scenarios.json  # Round 2: 5 easy DB scenarios
- │   ├── medium_scenarios.json  # Round 2: 5 medium DB scenarios
- │   └── hard_scenarios.json  # Round 2: 5 hard DB scenarios
  ├── training/
- │   ├── train_agent.py  # Unsloth + GRPO training
- │   ├── evaluate_agent.py  # Reward curve generator
  │   ├── generate_training_data.py  # Expert trajectory collector
- │   └── colab_notebook.py  # Venue GPU training notebook
- ├── blog/
- │   └── mini_blog.md  # HF blog post
  └── tests/
-     ├── test_environment.py  # 12 environment tests
-     └── test_graders.py  # 12 grader tests
  ```
295
 
296
  ---
@@ -315,6 +387,9 @@ uvicorn api.server:app --host 0.0.0.0 --port 7860 --reload
315
  # Verify
316
  curl http://localhost:7860/health
317
  # {"status":"ok","version":"2.0.0"}
 
 
 
318
  ```
319
 
320
  ---
@@ -327,12 +402,10 @@ openenv validate . # [OK] Ready for multi-mode deployment
327
  ```
328
 
329
  ---
330
- ## Colab Training Notebook
331
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1xviukNsgrOCP25W2Z6ocUzvD_C7g6quw?usp=sharing)
332
 
333
  ## Built For
334
 
335
  **META × PyTorch × SST OpenEnv Hackathon**
336
- Finals: April 25–26, 2026 | Bangalore
337
 
338
- *"We didn't build an environment. We built a DBA training simulator."*
 
26
 
27
  ---
28
 
+ ## 🔗 Quick Links
+
+ | Resource | Link |
+ |---|---|
+ | **Live Demo** | https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui |
+ | **Training Notebook** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb |
+ | **Google Colab** | https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing |
+ | **Blog Post** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/blog_post.md |
+ | **Source Code (HF Space)** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/tree/main |
+ | **Source Code (Git Repo)** | https://github.com/Mdjunaid06/sql-db-engineer-agent |
+
+ ---
+
41
  ## From Round 1 → Round 2
42
 
43
  | | Round 1: SQL Query Debugger | Round 2: SQL Database Engineer Agent |
 
69
 
70
  ---
71
 
+ ## 📊 Training Results
+
+ Trained **Qwen2.5-7B-Instruct** with **GRPO** using **Unsloth**, updating only 0.53% of parameters via LoRA:
+
+ ### GRPO Training Curves (200 Steps)
+
+ ![Demo](assets/loss_curve_demo.png)
+
+ | Metric | Value |
+ |---|---|
+ | Training steps | 200 |
+ | Loss | `4.92e-07 → 1.23e-05` |
+ | Reward | `0.235 → 0.456` |
+ | Improvement | **+94%** (relative reward gain) |
+ | Model | Qwen2.5-7B (0.53% trainable via LoRA) |
+ | Epochs | 29 |
+ | Batch size | 8 (4 × 2 grad accum × 1 GPU) |
+
+ > ⚠️ Note: GRPO policy loss rises as the model becomes more confident. This is expected behaviour, not divergence. The reward curve confirms consistent improvement.
+
+ ### Evaluation: Trained vs Random Agent (15 Scenarios)
+
+ ![Demo](assets/reward_curve_demo.png)
+
+ | Agent | Avg Improvement | Best Scenario | Worst Scenario |
+ |---|---|---|---|
+ | Random (wrong index) | +0.0 pts | 0 pts | 0 pts |
+ | Trained (GRPO) | **+31.4 pts** | **+59 pts** (Scenario 8) | +10 pts |
+
+ - The trained agent outperformed the random baseline on **every single scenario**
+ - Scenario 8 was flagged as an outlier (±1.5σ): the agent found an especially impactful index combination
+ - Relative gain: **∞** (the baseline scored exactly 0 on all scenarios, so a ratio is undefined)
+
+ ### Training Progression
+
+ | Stage | Avg Reward | Agent Behavior |
+ |---|---|---|
+ | Before training | 0.05 | Random actions, no strategy |
+ | 50 steps | 0.25 | Learns to inspect before acting |
+ | 200 steps | **0.456** | Multi-step planning emerges |
+
+ ---
+
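The headline figures in the training metrics can be re-derived with a few lines of arithmetic. The sketch below recomputes the relative reward improvement and the effective batch size from the reported values; the function names are illustrative and are not part of the repository.

```python
def relative_improvement(start: float, end: float) -> float:
    """Relative change from start to end, expressed as a fraction of start."""
    if start == 0:
        # A zero baseline makes the ratio undefined (the "infinite gain" case).
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (end - start) / start


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples contributing to one optimizer step."""
    return per_device * grad_accum * num_gpus


# Reported reward before and after 200 GRPO steps.
reward_gain = relative_improvement(0.235, 0.456)
print(f"{reward_gain:+.0%}")          # +94%
print(effective_batch_size(4, 2, 1))  # 8
```

The zero-baseline guard mirrors the evaluation table: when the random agent scores exactly 0, only the absolute gain in points is meaningful.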
115
  ## Environment Overview
116
 
117
  | Property | Value |
 
195
  Budget exhaustion → −0.15
196
  ```
197
 
+ ### GRPO Reward Breakdown (Expected per action)
+ ```
+ inspect_query / analyze_indexes    → ~0.10
+ create_index (no table/col match)  → ~0.10
+ create_index (partial hint match)  → ~0.20–0.45
+ create_index (perfect hint match)  → ~0.55–0.80
+ create_index (simulator confirms)  → ~0.75–0.99
+ Milestones: 25%=+0.15  50%=+0.25  75%=+0.40  (cumulative)
+ ```
+
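One way to read the milestone line in the reward breakdown is that each value is the total bonus granted once that improvement threshold has been crossed. The sketch below encodes that reading only. `milestone_bonus` and its threshold table are hypothetical and are not taken from `env/reward.py`, which may instead add the bonuses together.

```python
# Hypothetical milestone table: (fraction of improvement reached, total bonus).
# The numbers come from the reward breakdown; the interpretation is an assumption.
MILESTONES = [(0.25, 0.15), (0.50, 0.25), (0.75, 0.40)]


def milestone_bonus(progress: float) -> float:
    """Return the bonus for the highest milestone reached at `progress` in [0, 1]."""
    bonus = 0.0
    for threshold, total in MILESTONES:
        if progress >= threshold:
            bonus = total  # later milestones replace, rather than add to, earlier ones
    return bonus


for p in (0.10, 0.25, 0.60, 0.90):
    print(p, milestone_bonus(p))
```

If the bonuses are meant to stack instead, replacing `bonus = total` with `bonus += total` gives the additive variant.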
208
  ### Terminal Score Formula
209
  ```python
210
  perf_improvement = (final_score - baseline) / (100 - baseline)
 
262
 
263
  ---
264
 
265
  ## API Endpoints
266
 
267
  | Endpoint | Method | Description |
 
306
  ## Project Structure
307
 
308
  ```
+ sql-query-debugger/
+ ├── .env  # Environment variables
+ ├── .env.example  # Environment variables template
+ ├── .gitignore
+ ├── Dockerfile  # Container definition
+ ├── README.md  # This file
+ ├── blog_post.md  # HF blog post (separate from README)
+ ├── loss_curve.png  # GRPO training curves ✅ evidence
+ ├── reward_curve.png  # Evaluation results ✅ evidence
+ ├── openenv.yaml  # OpenEnv metadata (v2.0.0)
+ ├── pyproject.toml
+ ├── requirements.txt  # Pinned dependencies
+ ├── uv.lock
+ ├── baseline.py  # Rule-based baseline agent
+ ├── demo_app.py  # Gradio demo app
+ ├── inference.py  # LLM inference agent
+ │
  ├── api/
+ │   ├── __init__.py
+ │   └── server.py  # FastAPI, 11 endpoints
+ │
  ├── dataset/
+ │   ├── easy_cases.json  # Round 1: easy SQL tasks
+ │   ├── easy_scenarios.json  # Round 2: easy DB scenarios
+ │   ├── hard_cases.json  # Round 1: hard SQL tasks
+ │   ├── hard_scenarios.json  # Round 2: hard DB scenarios
+ │   ├── medium_cases.json  # Round 1: medium SQL tasks
+ │   └── medium_scenarios.json  # Round 2: medium DB scenarios
+ │
+ ├── env/
+ │   ├── __init__.py
+ │   ├── scenarios/  # Scenario definitions
+ │   ├── curriculum.py  # Self-improving curriculum
+ │   ├── db_simulator.py  # DB performance simulator
+ │   ├── environment.py  # Core: reset() step() state()
+ │   ├── graders.py  # Deterministic graders
+ │   ├── models.py  # Pydantic models (15 action types)
+ │   ├── reward.py  # Dense reward + milestones
+ │   ├── scenario_generator.py  # Dynamic scenario generation
+ │   └── tasks.py  # Task manager (30 tasks)
+ │
+ ├── sdea-trained/
+ │   └── eval_results.json  # Evaluation results JSON
+ │
  ├── training/
+ │   ├── colab_notebook.py  # Colab training notebook
+ │   ├── evaluate_agent.py  # Evaluation + reward curve generator
+ │   ├── generate_plots.py  # Fixed plot generator
  │   ├── generate_training_data.py  # Expert trajectory collector
+ │   └── train_agent.py  # Unsloth + GRPO training script
+ │
  └── tests/
+     ├── __init__.py
+     ├── test_environment.py  # Environment tests
+     ├── test_graders.py  # Grader tests
+     ├── test_reward.py  # Reward tests
+     └── test_tasks.py  # Task tests
  ```
367
 
368
  ---
 
387
  # Verify
388
  curl http://localhost:7860/health
389
  # {"status":"ok","version":"2.0.0"}
390
+
391
+ # Open demo
392
+ # http://localhost:7860/demo
393
  ```
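The same health check can be scripted, for example as a smoke test after deployment. This is a minimal sketch using only the Python standard library. It assumes the server is running locally on port 7860 and that `/health` returns the JSON shown in the curl example; the helper names `is_healthy` and `check_health` are illustrative and do not exist in the repository.

```python
import json
from urllib.request import urlopen


def is_healthy(body: str) -> bool:
    """Return True if a /health response body reports status "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str = "http://localhost:7860", timeout: float = 5.0) -> bool:
    """Fetch /health from a running server and validate the response."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return is_healthy(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Offline check against the documented response; no server is contacted here.
    print(is_healthy('{"status":"ok","version":"2.0.0"}'))
```

Calling `check_health()` with the server running performs the live request. Separating parsing from fetching keeps the validation logic testable without a network connection.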
394
 
395
  ---
 
402
  ```
403
 
404
  ---
 
 
405
 
406
  ## Built For
407
 
408
  **META × PyTorch × SST OpenEnv Hackathon**
409
+ Finals: April 25–26, 2026 | Bangalore
410
 
411
+ *"We didn't build an environment. We built a DBA training simulator."*