---
title: Sql Debug Env
emoji: 💻
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
# SQL Debug Environment (sql-debug-env)
Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.
## Quick Links
- Live Space: https://md896-sql-debug-env.hf.space
- Demo page: https://md896-sql-debug-env.hf.space/demo
- Gradio app: https://md896-sql-debug-env.hf.space/gradio/
- Swagger: https://md896-sql-debug-env.hf.space/docs
- OpenAPI: https://md896-sql-debug-env.hf.space/openapi.json
- HF model: https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2
- GitHub: https://github.com/mdayan8/sql-debug-env
- W&B dashboard: https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag
## Model Card Highlights

Model: `md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2`
| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |
## Problem and Motivation

SQL debugging is expensive, repetitive, and operationally risky:

- Static checks catch syntax errors, not business-logic correctness
- Generated SQL can look plausible and still fail at execution time
- Production schemas and data-distribution shifts expose brittle query behavior

This environment optimizes for execution-grounded correctness with deterministic tasks, explicit feedback, and repeatable benchmarks.
## Benchmark Snapshot
| Metric | Value |
|---|---|
| Spider chart: Industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap view | 0.0% → 25.0% |
| Eval artifact pass | 32 runs |
## Proof and Evidence Artifacts
### Main visual proofs
- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`
### Training/eval static exports
| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
### Run folders and model

- Sample rewards (32 eval): HF artifacts folder
- Earlier 32-eval pass: HF artifacts folder
- Model card: `md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2`
## System Architecture
```mermaid
flowchart LR
    agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
    api --> env[SQLDebugEnv]
    env --> db[In-memory SQLite DB]
    env --> tasks[Task Registry + Graders]
    tasks --> reward[Reward Engine]
    env --> reward
    reward --> api
```
Core components:

- API layer: `server/main.py` (sketched below)
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`
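The sketch below shows how the FastAPI layer can delegate `/reset` and `/step` to the environment engine. Only the endpoint paths and action fields are taken from this README; the stub environment and every other name are illustrative, not the actual contents of `server/main.py`:

```python
# Hypothetical wiring sketch, not the real server/main.py.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class StepRequest(BaseModel):
    action: str                       # "submit_query", "inspect_schema", ...
    query: Optional[str] = None       # required only for submit_query
    table_name: Optional[str] = None  # required only for inspect_sample


class StubEnv:
    """Stand-in for SQLDebugEnv (server/env.py)."""

    def reset(self) -> dict:
        return {"observation": {"broken_query": "SELECT * FORM users;"},
                "done": False}

    def step(self, req: StepRequest) -> dict:
        # The real engine executes against SQLite, grades the result,
        # and scores a blended reward.
        return {"observation": {"echo": req.action}, "reward": 0.0, "done": False}


env = StubEnv()


@app.post("/reset")
def reset() -> dict:
    return env.reset()


@app.post("/step")
def step(req: StepRequest) -> dict:
    return env.step(req)
```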
## OpenEnv Contract and Action Space
API surface:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`
Actions:

| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |
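A raw-HTTP episode against this surface might look like the sketch below. The endpoint paths and action fields come from the table above; the response keys are assumptions, so consult the Swagger page (`/docs`) for the real schema:

```python
# Minimal client sketch; response shapes are assumptions.
import requests

BASE = "http://localhost:7860"

# Start an episode and view the broken query / task context.
obs = requests.post(f"{BASE}/reset").json()
print(obs)

# Gather context before attempting a repair.
schema = requests.post(f"{BASE}/step", json={"action": "inspect_schema"}).json()
error = requests.post(f"{BASE}/step", json={"action": "inspect_error"}).json()

# Submit a candidate fix for execution and grading.
result = requests.post(
    f"{BASE}/step",
    json={"action": "submit_query", "query": "SELECT id, name FROM users;"},
)
print(result.json())
```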
Reward (clamped to [0.0, 1.0]) blends:

- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (up to `0.2` in magnitude)
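As a sketch, the blend can be read as a clamped weighted sum. The component names and ranges come from the list above; the exact combination logic lives in `server/reward.py` and may differ:

```python
def blend_reward(correctness: float, efficiency: float,
                 syntax_progress: float, schema_bonus: float,
                 penalties: float) -> float:
    """Combine reward components and clamp to [0.0, 1.0].

    Each argument is assumed to already sit in its documented range,
    e.g. correctness in [0.0, 0.6] and penalties in [0.0, 0.2].
    """
    raw = correctness + efficiency + syntax_progress + schema_bonus - penalties
    return max(0.0, min(1.0, raw))


# Example: fully correct, mildly inefficient fix with a small penalty.
print(blend_reward(0.6, 0.1, 0.1, 0.1, 0.05))  # 0.85
```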
## Task Suite

- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion; illustrated below)
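To illustrate the expert task's failure mode, the self-contained sketch below reproduces a fan trap in plain SQLite: joining two one-to-many tables through the same parent multiplies rows and inflates the aggregate. The schema here is invented for illustration, not the task's actual data:

```python
# Fan-trap demo with an invented schema (not the task's data).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY);
    CREATE TABLE deposits (account_id INTEGER, amount REAL);
    CREATE TABLE fees     (account_id INTEGER, amount REAL);
    INSERT INTO accounts VALUES (1);
    INSERT INTO deposits VALUES (1, 100.0), (1, 200.0);
    INSERT INTO fees     VALUES (1, 5.0), (1, 5.0), (1, 5.0);
""")

# Broken: joining both fan-outs at once creates a 2x3 cartesian block,
# so each deposit row is counted three times.
broken = """
    SELECT SUM(d.amount) FROM accounts a
    JOIN deposits d ON d.account_id = a.id
    JOIN fees f     ON f.account_id = a.id
"""

# Fixed: aggregate each fan-out on its own before combining.
fixed = "SELECT SUM(amount) FROM deposits WHERE account_id = 1"

print(con.execute(broken).fetchone()[0])  # 900.0 (inflated)
print(con.execute(fixed).fetchone()[0])   # 300.0 (correct)
```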
## Reliability and Validation

- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)
Live benchmark example:

```bash
curl "http://localhost:7860/benchmark?runs=20"
```
## Quick Start

### Local

```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```
## Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"

python inference.py
```
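A sketch of how a runner like `inference.py` might consume this configuration. The variable names match the exports above, but the client logic is illustrative and may not match the actual script:

```python
# Illustrative config consumption; the real inference.py may differ.
import os

from openai import OpenAI  # any OpenAI-compatible endpoint works via base_url

client = OpenAI(
    base_url=os.environ["API_BASE_URL"],  # e.g. https://api.openai.com/v1
    api_key=os.environ["OPENAI_API_KEY"],
)
env_base = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
seed = int(os.environ.get("SEED", "1"))

# One repair-style completion; the runner would loop this against /step.
resp = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Fix this SQL: SELECT * FORM users;"}],
    seed=seed,
)
print(resp.choices[0].message.content)
```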
## Repository Structure

```
sql-debug-env/
├── Dockerfile
├── openenv.yaml
├── README.md
├── requirements.txt
├── pyproject.toml
├── uv.lock
├── inference.py
├── launch_job.py
├── presentation_graphs.py
├── server/
│   ├── main.py
│   ├── gradio_ui.py
│   ├── demo_page.html
│   ├── env.py
│   ├── models.py
│   ├── database.py
│   ├── reward.py
│   ├── static/
│   └── tasks/
└── tests/
```