junaid0600's picture
Update README.md
65cc4ae verified
|
Raw
History Blame Contribute Delete
14.8 kB
metadata
title: SQL Database Engineer Agent
emoji: πŸ—„οΈ
colorFrom: blue
colorTo: green
sdk: docker
pinned: true
tags:
  - openenv
  - reinforcement-learning
  - sql
  - database
  - engineering
  - long-horizon
  - self-improvement
  - wildcard
license: mit

SQL Database Engineer Agent β€” OpenEnv Environment

META Γ— PyTorch Γ— SST OpenEnv Hackathon | Finals April 25–26, 2026 | Bangalore Evolved from SQL Query Debugger (Round 1 β€” all 4 checks passed βœ…)

An OpenEnv-compliant reinforcement learning environment where AI agents learn to act like senior database engineers. The agent manages a simulated production database over 50+ steps β€” inspecting slow queries, creating indexes, rewriting queries, and partitioning tables.


πŸ”— Quick Links


From Round 1 β†’ Round 2

Round 1 β€” SQL Query Debugger Round 2 β€” SQL Database Engineer Agent
Task Fix one broken SQL query Optimize entire production database
Steps 20 per episode 50 per episode
Actions 6 (identify, fix, submit...) 15 (inspect, index, rewrite, partition...)
Reward Dense per step Dense + milestone bonuses
Scenarios 15 single-query tasks 30 total (15 new + 15 original)
Training Rule-based baseline Unsloth + GRPO on Qwen2.5-7B
Theme Real-world SQL Long-Horizon + World Modeling + Wildcard

Motivation

Every production database degrades over time.

Your app launches. Queries run in 50ms. Six months later, users are complaining. P95 query time: 8,500ms. A senior DBA sits down β€” runs EXPLAIN queries, finds missing indexes, rewrites bad JOINs, partitions 50-million-row tables. This takes 10 years to learn.

We asked: can we train an LLM to do it?

SQL database engineering is uniquely well-suited for RL:

  1. 100% measurable β€” query time in milliseconds, index hit rates, performance scores
  2. Long-horizon β€” real fixes require 10-50 careful, ordered steps
  3. World modeling β€” agent must maintain internal model of DB state, indexes, query plans
  4. Self-improving β€” curriculum generates harder scenarios as agent improves
  5. Novel β€” no OpenEnv environment for DB engineering exists anywhere

πŸ“Š Training Results

Trained Qwen2.5-7B-Instruct with GRPO using Unsloth (only 0.53% of parameters via LoRA):

GRPO Training Curves β€” 200 Steps

Demo

Metric Value
Training steps 200
Loss 4.92e-07 β†’ 1.23e-05
Reward 0.235 β†’ 0.456
Improvement +94%
Model Qwen2.5-7B (0.53% trainable via LoRA)
Epochs 29
Batch size 8 (4 Γ— 2 grad accum Γ— 1 GPU)

⚠️ Note: GRPO policy loss rises as the model becomes more confident β€” this is expected behaviour, not divergence. The reward curve confirms consistent improvement.

Evaluation β€” Trained vs Random Agent (15 Scenarios)

Demo

Agent Avg Improvement Best Scenario Worst Scenario
Random (wrong index) +0.0 pts 0 pts 0 pts
Trained (GRPO) +31.4 pts +59 pts (Scenario 8 ) +10 pts
  • Trained agent outperformed random baseline on every single scenario
  • Scenario 8 flagged as outlier (Β±1.5Οƒ) β€” agent found especially impactful index combination
  • Relative gain: ∞ (baseline scored exactly 0 on all scenarios)

Training Progression

Stage Avg Reward Agent Behavior
Before training 0.05 Random actions, no strategy
50 steps 0.25 Learns to inspect before acting
200 steps 0.456 Multi-step planning emerges

Environment Overview

Property Value
Domain Database Engineering
Tasks 30 (15 Round 2 scenarios + 15 Round 1 cases)
Max Steps 50 per episode
Reward Type Dense + milestone bonuses
Performance Score 0–100 (real DB metric)
API Port 7860
Themes Long-Horizon (2) + World Modeling (3.1) + Self-Improvement (4) + Wildcard (5)

Action Space (15 Actions)

Round 2 β€” DB Engineering Actions

Action What It Does Reward
inspect_query EXPLAIN a slow query β€” scan type, rows examined, cost +0.05
analyze_indexes Show all indexes + missing index hints +0.05
create_index Add composite index on specified columns +0.10 + delta
rewrite_query Submit rewritten SQL β€” measures improvement +0.15 + delta
add_column Add denormalization column to reduce JOINs +0.08 + delta
drop_index Remove unused index (reduce write overhead) +0.05 + delta
partition_table Partition large table by date/ID range +0.15 + delta
analyze_statistics Update table statistics for query planner +0.05 + delta
request_hint Get progressive hint βˆ’0.10 penalty
submit_report TERMINAL: Final optimization report + full score 0.0–1.0

Round 1 β€” SQL Debugging Actions (backward compatible)

identify_error Β· propose_fix Β· submit_answer Β· explain_issue Β· optimize_query Β· request_hint


Observation Space

Every observation contains the full DB state:

{
  "task_id": "medium_s001",
  "task_description": "E-commerce DB: 50K orders. P95 query time > 8s. Target: < 500ms.",
  "current_context": {
    "performance_score": 12.5,
    "target_score": 75.0,
    "tables": [
      {"name": "orders", "rows": 50000, "indexes": ["PRIMARY"], "size_mb": 280},
      {"name": "users",  "rows": 8000,  "indexes": ["PRIMARY", "email_idx"]}
    ],
    "slow_queries": [
      {"id": "q1", "sql": "SELECT * FROM orders WHERE user_id=? AND status=?", "avg_ms": 8500},
      {"id": "q2", "sql": "SELECT COUNT(*) FROM orders o JOIN users u ON o.user_id=u.id", "avg_ms": 3200}
    ],
    "improvement_history": [12.5],
    "milestones_earned": [],
    "steps_remaining": 50
  },
  "step_count": 0,
  "difficulty": "medium",
  "max_steps": 50
}

Reward Design

Dense reward at every step + milestone bonuses:

inspect_query / analyze_indexes  β†’ +0.05 (investigation rewarded)
create_index with improvement    β†’ +0.10 + delta_reward
Milestone: 25% improvement       β†’ +0.15 ONE-TIME bonus
Milestone: 50% improvement       β†’ +0.25 ONE-TIME bonus
Milestone: 75% improvement       β†’ +0.40 ONE-TIME bonus
submit_report (terminal)         β†’ 0.0–1.0 full score
Efficiency bonus (< 70% budget)  β†’ +0.10
Loop penalty (same action x2+)   β†’ βˆ’0.08
Hint penalty                     β†’ βˆ’0.10
Backtrack penalty                β†’ βˆ’0.05
Budget exhaustion                β†’ βˆ’0.15

GRPO Reward Breakdown (Expected per action)

inspect_query / analyze_indexes       β†’  ~0.10
create_index (no table/col match)     β†’  ~0.10
create_index (partial hint match)     β†’  ~0.20–0.45
create_index (perfect hint match)     β†’  ~0.55–0.80
create_index (simulator confirms)     β†’  ~0.75–0.99
Milestones: 25%=+0.15  50%=+0.25  75%=+0.40  (cumulative)

Terminal Score Formula

perf_improvement = (final_score - baseline) / (100 - baseline)
step_efficiency  = 1.0 - (steps_used / max_steps)
terminal_score   = (perf_improvement * 0.60) + (step_efficiency * 0.20) + 0.10

Scenarios β€” 30 Tasks

Round 2: DB Engineering (15 new tasks)

Easy (15 steps, target 80+)

ID Description
easy_s001 User lookup β€” missing email index on 10K users
easy_s002 Order status β€” composite index on 50K orders
easy_s003 Product search β€” LIKE query on 20K products
easy_s004 Session lookup β€” 15K sessions, no index
easy_s005 Log filter β€” compound index on 30K logs

Medium (25–30 steps, target 72–78)

ID Description
medium_s001 E-commerce: 50K orders + 8K users, 2 slow queries
medium_s002 Blog: 100K posts + 20K authors, search slow
medium_s003 Inventory: 200K stock movements, rewrite + index
medium_s004 Ticketing: 60K tickets, status queue degraded
medium_s005 Analytics: 150K events, funnel query slow

Hard (50 steps, target 65–70)

ID Description
hard_s001 Financial: 500K transactions, 4 tables, 3 slow queries
hard_s002 SaaS: 8-table schema, 2M activity log, dashboard 20s+
hard_s003 Healthcare: 1M patient records, compliance queries
hard_s004 Gaming: 2M players, 5M matches, leaderboard degraded
hard_s005 Logistics: 6 tables, 3M shipments + 10M tracking rows

Round 1: SQL Debugging (15 original tasks β€” backward compatible)

Easy: syntax errors Β· Medium: logic bugs Β· Hard: performance anti-patterns


Self-Improving Curriculum

Agent avg score > 0.75  β†’  Advance to harder tier
Agent avg score < 0.30  β†’  Drop back a tier
Ultra tier (tier 3)     β†’  Auto-generated 5-8 table scenarios, no hints

The environment gets harder as the agent gets smarter. Genuine adaptive curriculum.


API Endpoints

Endpoint Method Description
/health GET Liveness check β€” always 200
/reset POST Start new episode β†’ Observation
/step POST Submit action β†’ (obs, reward, done, info)
/state GET Current episode state
/tasks GET All 30 tasks + action schema
/grader POST Grade an episode β†’ float score
/baseline POST Run baseline agent β†’ scores
/progress GET DB performance history + milestones

Live Demo

# Reset with e-commerce scenario
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "easy", "task_id": "easy_s001"}'

# Agent inspects slow query β†’ sees FULL TABLE SCAN
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "inspect_query", "payload": {"query_id": "q1"}}'

# Agent creates index β†’ performance score 8.0 β†’ 82.0
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "create_index", "payload": {"table": "users", "columns": ["email"]}}'

# Agent submits report β†’ terminal score 0.82
curl -X POST https://junaid0600-sql-db-engineer-agent.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit_report", "payload": {"summary": "Added email index. Performance 8 to 82."}}'

Project Structure

sql-query-debugger/
β”œβ”€β”€ .env                          # Environment variables
β”œβ”€β”€ .env.example                  # Environment variables template
β”œβ”€β”€ .gitignore
β”œβ”€β”€ Dockerfile                    # Container definition
β”œβ”€β”€ README.md                     # This file
β”œβ”€β”€ blog_post.md                  # HF blog post (separate from README)
β”œβ”€β”€ loss_curve.png                # GRPO training curves βœ… evidence
β”œβ”€β”€ reward_curve.png              # Evaluation results βœ… evidence
β”œβ”€β”€ openenv.yaml                  # OpenEnv metadata (v2.0.0)
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ requirements.txt              # Pinned dependencies
β”œβ”€β”€ uv.lock
β”œβ”€β”€ baseline.py                   # Rule-based baseline agent
β”œβ”€β”€ demo_app.py                   # Gradio demo app
β”œβ”€β”€ inference.py                  # LLM inference agent
β”‚
β”œβ”€β”€ api/
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── server.py                 # FastAPI β€” 11 endpoints
β”‚
β”œβ”€β”€ dataset/
β”‚   β”œβ”€β”€ easy_cases.json           # Round 1: easy SQL tasks
β”‚   β”œβ”€β”€ easy_scenarios.json       # Round 2: easy DB scenarios
β”‚   β”œβ”€β”€ hard_cases.json           # Round 1: hard SQL tasks
β”‚   β”œβ”€β”€ hard_scenarios.json       # Round 2: hard DB scenarios
β”‚   β”œβ”€β”€ medium_cases.json         # Round 1: medium SQL tasks
β”‚   └── medium_scenarios.json     # Round 2: medium DB scenarios
β”‚
β”œβ”€β”€ env/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ scenarios/                # Scenario definitions
β”‚   β”œβ”€β”€ curriculum.py             # Self-improving curriculum
β”‚   β”œβ”€β”€ db_simulator.py           # DB performance simulator
β”‚   β”œβ”€β”€ environment.py            # Core: reset() step() state()
β”‚   β”œβ”€β”€ graders.py                # Deterministic graders
β”‚   β”œβ”€β”€ models.py                 # Pydantic models (15 action types)
β”‚   β”œβ”€β”€ reward.py                 # Dense reward + milestones
β”‚   β”œβ”€β”€ scenario_generator.py     # Dynamic scenario generation
β”‚   └── tasks.py                  # Task manager (30 tasks)
β”‚
β”œβ”€β”€ sdea-trained/
β”‚   └── eval_results.json         # Evaluation results JSON
β”‚
β”œβ”€β”€ training/
β”‚   β”œβ”€β”€ colab_notebook.py         # Colab training notebook
β”‚   β”œβ”€β”€ evaluate_agent.py         # Evaluation + reward curve generator
β”‚   β”œβ”€β”€ generate_plots.py         # Fixed plot generator
β”‚   β”œβ”€β”€ generate_training_data.py # Expert trajectory collector
β”‚   └── train_agent.py            # Unsloth + GRPO training script
β”‚
└── tests/
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ test_environment.py       # Environment tests
    β”œβ”€β”€ test_graders.py           # Grader tests
    β”œβ”€β”€ test_reward.py            # Reward tests
    └── test_tasks.py             # Task tests

Setup & Installation

# Clone
git clone https://github.com/Mdjunaid06/sql-db-engineer-agent
cd sql-db-engineer-agent

# Install
pip install -r requirements.txt

# Configure
cp .env.example .env
# Add HF_TOKEN to .env

# Run
uvicorn api.server:app --host 0.0.0.0 --port 7860 --reload

# Verify
curl http://localhost:7860/health
# {"status":"ok","version":"2.0.0"}

# Open demo
# http://localhost:7860/demo

Validation

pytest tests/ -v          # 24/24 passed
openenv validate .         # [OK] Ready for multi-mode deployment

Built For

META Γ— PyTorch Γ— SST OpenEnv Hackathon Finals: April 25–26, 2026 | Bangalore

"We didn't build an environment. We built a DBA training simulator."