--- title: SQL Database Engineer Agent emoji: ๐Ÿ—„๏ธ colorFrom: blue colorTo: green sdk: docker pinned: true tags: - openenv - reinforcement-learning - sql - database - engineering - long-horizon - self-improvement - wildcard license: mit --- # SQL Database Engineer Agent โ€” OpenEnv Environment > **META ร— PyTorch ร— SST OpenEnv Hackathon** | Finals April 25โ€“26, 2026 | Bangalore > Evolved from SQL Query Debugger (Round 1 โ€” all 4 checks passed โœ…) An OpenEnv-compliant reinforcement learning environment where AI agents learn to act like **senior database engineers**. The agent manages a simulated production database over 50+ steps โ€” inspecting slow queries, creating indexes, rewriting queries, and partitioning tables. --- ## ๐Ÿ”— Quick Links | Resource | Link | |---|---| | **Live Demo** | https://huggingface.co/spaces/junaid0600/sql-db-agent-demo-ui | | **Training Notebook** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/SDEA_Training_Notebook.ipynb | | **Google Collab ** | https://colab.research.google.com/drive/1dTRcnVb9VotCFUnGeZSacaznb4fn_PD7?usp=sharing | | **Blog Post** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent/blob/main/blog_post.md | | **Source Code (HF Space)** | https://huggingface.co/spaces/junaid0600/sql-db-engineer-agent | | **Source Code (GitHub)** | https://github.com/Mdjunaid06/sql-db-engineer-agent | --- ## From Round 1 โ†’ Round 2 | | Round 1 โ€” SQL Query Debugger | Round 2 โ€” SQL Database Engineer Agent | |---|---|---| | **Task** | Fix one broken SQL query | Optimize entire production database | | **Steps** | 20 per episode | 50 per episode | | **Actions** | 6 (identify, fix, submit...) | 15 (inspect, index, rewrite, partition...) | | **Reward** | Dense per step | Dense + milestone bonuses | | **Scenarios** | 15 single-query tasks | 30 total (15 new + 15 original) | | **Training** | Rule-based baseline | Unsloth + GRPO on Qwen2.5-7B | | **Theme** | Real-world SQL | Long-Horizon + World Modeling + Wildcard | --- ## Motivation Every production database degrades over time. Your app launches. Queries run in 50ms. Six months later, users are complaining. P95 query time: **8,500ms**. A senior DBA sits down โ€” runs EXPLAIN queries, finds missing indexes, rewrites bad JOINs, partitions 50-million-row tables. **This takes 10 years to learn.** We asked: **can we train an LLM to do it?** SQL database engineering is uniquely well-suited for RL: 1. **100% measurable** โ€” query time in milliseconds, index hit rates, performance scores 2. **Long-horizon** โ€” real fixes require 10-50 careful, ordered steps 3. **World modeling** โ€” agent must maintain internal model of DB state, indexes, query plans 4. **Self-improving** โ€” curriculum generates harder scenarios as agent improves 5. **Novel** โ€” no OpenEnv environment for DB engineering exists anywhere --- ## ๐Ÿ“Š Training Results Trained **Qwen2.5-7B-Instruct** with **GRPO** using **Unsloth** (only 0.53% of parameters via LoRA): ### GRPO Training Curves โ€” 200 Steps ![Demo](assests/loss_curve_demo.png) | Metric | Value | |---|---| | Training steps | 200 | | Loss | `4.92e-07 โ†’ 1.23e-05` | | Reward | `0.235 โ†’ 0.456` | | Improvement | **+94%** | | Model | Qwen2.5-7B (0.53% trainable via LoRA) | | Epochs | 29 | | Batch size | 8 (4 ร— 2 grad accum ร— 1 GPU) | > โš ๏ธ Note: GRPO policy loss rises as the model becomes more confident โ€” this is expected behaviour, not divergence. The reward curve confirms consistent improvement. ### Evaluation โ€” Trained vs Random Agent (15 Scenarios) ![Demo](assests/reward_curve_demo.png) | Agent | Avg Improvement | Best Scenario | Worst Scenario | |---|---|---|---| | Random (wrong index) | +0.0 pts | 0 pts | 0 pts | | Trained (GRPO) | **+31.4 pts** | **+59 pts** (Scenario 8 ) | +10 pts | - Trained agent outperformed random baseline on **every single scenario** - Scenario 8 flagged as outlier (ยฑ1.5ฯƒ) โ€” agent found especially impactful index combination - Relative gain: **โˆž** (baseline scored exactly 0 on all scenarios) ### Training Progression | Stage | Avg Reward | Agent Behavior | |---|---|---| | Before training | 0.05 | Random actions, no strategy | | 50 steps | 0.25 | Learns to inspect before acting | | 200 steps | **0.456** | Multi-step planning emerges | --- ## Environment Overview | Property | Value | |---|---| | Domain | Database Engineering | | Tasks | 30 (15 Round 2 scenarios + 15 Round 1 cases) | | Max Steps | 50 per episode | | Reward Type | Dense + milestone bonuses | | Performance Score | 0โ€“100 (real DB metric) | | API Port | 7860 | | Themes | Long-Horizon (2) + World Modeling (3.1) + Self-Improvement (4) + Wildcard (5) | --- ## Action Space (15 Actions) ### Round 2 โ€” DB Engineering Actions | Action | What It Does | Reward | |---|---|---| | `inspect_query` | EXPLAIN a slow query โ€” scan type, rows examined, cost | +0.05 | | `analyze_indexes` | Show all indexes + missing index hints | +0.05 | | `create_index` | Add composite index on specified columns | +0.10 + delta | | `rewrite_query` | Submit rewritten SQL โ€” measures improvement | +0.15 + delta | | `add_column` | Add denormalization column to reduce JOINs | +0.08 + delta | | `drop_index` | Remove unused index (reduce write overhead) | +0.05 + delta | | `partition_table` | Partition large table by date/ID range | +0.15 + delta | | `analyze_statistics` | Update table statistics for query planner | +0.05 + delta | | `request_hint` | Get progressive hint | โˆ’0.10 penalty | | `submit_report` | **TERMINAL**: Final optimization report + full score | 0.0โ€“1.0 | ### Round 1 โ€” SQL Debugging Actions (backward compatible) `identify_error` ยท `propose_fix` ยท `submit_answer` ยท `explain_issue` ยท `optimize_query` ยท `request_hint` --- ## Observation Space Every observation contains the full DB state: ```json { "task_id": "medium_s001", "task_description": "E-commerce DB: 50K orders. P95 query time > 8s. Target: < 500ms.", "current_context": { "performance_score": 12.5, "target_score": 75.0, "tables": [ {"name": "orders", "rows": 50000, "indexes": ["PRIMARY"], "size_mb": 280}, {"name": "users", "rows": 8000, "indexes": ["PRIMARY", "email_idx"]} ], "slow_queries": [ {"id": "q1", "sql": "SELECT * FROM orders WHERE user_id=? AND status=?", "avg_ms": 8500}, {"id": "q2", "sql": "SELECT COUNT(*) FROM orders o JOIN users u ON o.user_id=u.id", "avg_ms": 3200} ], "improvement_history": [12.5], "milestones_earned": [], "steps_remaining": 50 }, "step_count": 0, "difficulty": "medium", "max_steps": 50 } ``` --- ## Reward Design Dense reward at every step + milestone bonuses: ``` inspect_query / analyze_indexes โ†’ +0.05 (investigation rewarded) create_index with improvement โ†’ +0.10 + delta_reward Milestone: 25% improvement โ†’ +0.15 ONE-TIME bonus Milestone: 50% improvement โ†’ +0.25 ONE-TIME bonus Milestone: 75% improvement โ†’ +0.40 ONE-TIME bonus submit_report (terminal) โ†’ 0.0โ€“1.0 full score Efficiency bonus (< 70% budget) โ†’ +0.10 Loop penalty (same action x2+) โ†’ โˆ’0.08 Hint penalty โ†’ โˆ’0.10 Backtrack penalty โ†’ โˆ’0.05 Budget exhaustion โ†’ โˆ’0.15 ``` ### GRPO Reward Breakdown (Expected per action) ``` inspect_query / analyze_indexes โ†’ ~0.10 create_index (no table/col match) โ†’ ~0.10 create_index (partial hint match) โ†’ ~0.20โ€“0.45 create_index (perfect hint match) โ†’ ~0.55โ€“0.80 create_index (simulator confirms) โ†’ ~0.75โ€“0.99 Milestones: 25%=+0.15 50%=+0.25 75%=+0.40 (cumulative) ``` ### Terminal Score Formula ```python perf_improvement = (final_score - baseline) / (100 - baseline) step_efficiency = 1.0 - (steps_used / max_steps) terminal_score = (perf_improvement * 0.60) + (step_efficiency * 0.20) + 0.10 ``` --- ## Scenarios โ€” 30 Tasks ### Round 2: DB Engineering (15 new tasks) #### Easy (15 steps, target 80+) | ID | Description | |---|---| | easy_s001 | User lookup โ€” missing email index on 10K users | | easy_s002 | Order status โ€” composite index on 50K orders | | easy_s003 | Product search โ€” LIKE query on 20K products | | easy_s004 | Session lookup โ€” 15K sessions, no index | | easy_s005 | Log filter โ€” compound index on 30K logs | #### Medium (25โ€“30 steps, target 72โ€“78) | ID | Description | |---|---| | medium_s001 | E-commerce: 50K orders + 8K users, 2 slow queries | | medium_s002 | Blog: 100K posts + 20K authors, search slow | | medium_s003 | Inventory: 200K stock movements, rewrite + index | | medium_s004 | Ticketing: 60K tickets, status queue degraded | | medium_s005 | Analytics: 150K events, funnel query slow | #### Hard (50 steps, target 65โ€“70) | ID | Description | |---|---| | hard_s001 | Financial: 500K transactions, 4 tables, 3 slow queries | | hard_s002 | SaaS: 8-table schema, 2M activity log, dashboard 20s+ | | hard_s003 | Healthcare: 1M patient records, compliance queries | | hard_s004 | Gaming: 2M players, 5M matches, leaderboard degraded | | hard_s005 | Logistics: 6 tables, 3M shipments + 10M tracking rows | ### Round 1: SQL Debugging (15 original tasks โ€” backward compatible) Easy: syntax errors ยท Medium: logic bugs ยท Hard: performance anti-patterns --- ## Self-Improving Curriculum ``` Agent avg score > 0.75 โ†’ Advance to harder tier Agent avg score < 0.30 โ†’ Drop back a tier Ultra tier (tier 3) โ†’ Auto-generated 5-8 table scenarios, no hints ``` The environment gets harder as the agent gets smarter. **Genuine adaptive curriculum.** --- ## API Endpoints | Endpoint | Method | Description | |---|---|---| | `/health` | GET | Liveness check โ€” always 200 | | `/reset` | POST | Start new episode โ†’ Observation | | `/step` | POST | Submit action โ†’ (obs, reward, done, info) | | `/state` | GET | Current episode state | | `/tasks` | GET | All 30 tasks + action schema | | `/grader` | POST | Grade an episode โ†’ float score | | `/baseline` | POST | Run baseline agent โ†’ scores | | `/progress` | GET | DB performance history + milestones | --- ## Live Demo ```bash # Reset with e-commerce scenario curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/reset \ -H "Content-Type: application/json" \ -d '{"difficulty": "easy", "task_id": "easy_s001"}' # Agent inspects slow query โ†’ sees FULL TABLE SCAN curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \ -H "Content-Type: application/json" \ -d '{"action_type": "inspect_query", "payload": {"query_id": "q1"}}' # Agent creates index โ†’ performance score 8.0 โ†’ 82.0 curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \ -H "Content-Type: application/json" \ -d '{"action_type": "create_index", "payload": {"table": "users", "columns": ["email"]}}' # Agent submits report โ†’ terminal score 0.82 curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \ -H "Content-Type: application/json" \ -d '{"action_type": "submit_report", "payload": {"summary": "Added email index. Performance 8 to 82."}}' ``` --- ## Project Structure ``` sql-query-debugger/ โ”œโ”€โ”€ .env # Environment variables โ”œโ”€โ”€ .env.example # Environment variables template โ”œโ”€โ”€ .gitignore โ”œโ”€โ”€ Dockerfile # Container definition โ”œโ”€โ”€ README.md # This file โ”œโ”€โ”€ blog_post.md # HF blog post (separate from README) โ”œโ”€โ”€ loss_curve.png # GRPO training curves โœ… evidence โ”œโ”€โ”€ reward_curve.png # Evaluation results โœ… evidence โ”œโ”€โ”€ openenv.yaml # OpenEnv metadata (v2.0.0) โ”œโ”€โ”€ pyproject.toml โ”œโ”€โ”€ requirements.txt # Pinned dependencies โ”œโ”€โ”€ uv.lock โ”œโ”€โ”€ baseline.py # Rule-based baseline agent โ”œโ”€โ”€ demo_app.py # Gradio demo app โ”œโ”€โ”€ inference.py # LLM inference agent โ”‚ โ”œโ”€โ”€ api/ โ”‚ โ”œโ”€โ”€ __init__.py โ”‚ โ””โ”€โ”€ server.py # FastAPI โ€” 11 endpoints โ”‚ โ”œโ”€โ”€ dataset/ โ”‚ โ”œโ”€โ”€ easy_cases.json # Round 1: easy SQL tasks โ”‚ โ”œโ”€โ”€ easy_scenarios.json # Round 2: easy DB scenarios โ”‚ โ”œโ”€โ”€ hard_cases.json # Round 1: hard SQL tasks โ”‚ โ”œโ”€โ”€ hard_scenarios.json # Round 2: hard DB scenarios โ”‚ โ”œโ”€โ”€ medium_cases.json # Round 1: medium SQL tasks โ”‚ โ””โ”€โ”€ medium_scenarios.json # Round 2: medium DB scenarios โ”‚ โ”œโ”€โ”€ env/ โ”‚ โ”œโ”€โ”€ __init__.py โ”‚ โ”œโ”€โ”€ scenarios/ # Scenario definitions โ”‚ โ”œโ”€โ”€ curriculum.py # Self-improving curriculum โ”‚ โ”œโ”€โ”€ db_simulator.py # DB performance simulator โ”‚ โ”œโ”€โ”€ environment.py # Core: reset() step() state() โ”‚ โ”œโ”€โ”€ graders.py # Deterministic graders โ”‚ โ”œโ”€โ”€ models.py # Pydantic models (15 action types) โ”‚ โ”œโ”€โ”€ reward.py # Dense reward + milestones โ”‚ โ”œโ”€โ”€ scenario_generator.py # Dynamic scenario generation โ”‚ โ””โ”€โ”€ tasks.py # Task manager (30 tasks) โ”‚ โ”œโ”€โ”€ sdea-trained/ โ”‚ โ””โ”€โ”€ eval_results.json # Evaluation results JSON โ”‚ โ”œโ”€โ”€ training/ โ”‚ โ”œโ”€โ”€ colab_notebook.py # Colab training notebook โ”‚ โ”œโ”€โ”€ evaluate_agent.py # Evaluation + reward curve generator โ”‚ โ”œโ”€โ”€ generate_plots.py # Fixed plot generator โ”‚ โ”œโ”€โ”€ generate_training_data.py # Expert trajectory collector โ”‚ โ””โ”€โ”€ train_agent.py # Unsloth + GRPO training script โ”‚ โ””โ”€โ”€ tests/ โ”œโ”€โ”€ __init__.py โ”œโ”€โ”€ test_environment.py # Environment tests โ”œโ”€โ”€ test_graders.py # Grader tests โ”œโ”€โ”€ test_reward.py # Reward tests โ””โ”€โ”€ test_tasks.py # Task tests ``` --- ## Setup & Installation ```bash # Clone git clone https://github.com/Mdjunaid06/sql-db-engineer-agent cd sql-db-engineer-agent # Install pip install -r requirements.txt # Configure cp .env.example .env # Add HF_TOKEN to .env # Run uvicorn api.server:app --host 0.0.0.0 --port 7860 --reload # Verify curl http://localhost:7860/health # {"status":"ok","version":"2.0.0"} # Open demo # http://localhost:7860/demo ``` --- ## Validation ```bash pytest tests/ -v # 24/24 passed openenv validate . # [OK] Ready for multi-mode deployment ``` --- ## Built For **META ร— PyTorch ร— SST OpenEnv Hackathon** Finals: April 25โ€“26, 2026 | Bangalore *"We didn't build an environment. We built a DBA training simulator."*