---
title: Sql Debug Env
emoji: 💻
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---

# SQL Debug Environment (`sql-debug-env`)

![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f) ![Docker](https://img.shields.io/badge/Deploy-Docker-2496ED?logo=docker&logoColor=white) ![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white) ![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white) ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white) ![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white) ![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)

Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.

## Quick Links

- Live Space: [https://md896-sql-debug-env.hf.space](https://md896-sql-debug-env.hf.space)
- Demo page: [https://md896-sql-debug-env.hf.space/demo](https://md896-sql-debug-env.hf.space/demo)
- Gradio app: [https://md896-sql-debug-env.hf.space/gradio/](https://md896-sql-debug-env.hf.space/gradio/)
- Swagger: [https://md896-sql-debug-env.hf.space/docs](https://md896-sql-debug-env.hf.space/docs)
- OpenAPI: [https://md896-sql-debug-env.hf.space/openapi.json](https://md896-sql-debug-env.hf.space/openapi.json)
- HF model: [https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
- GitHub: [https://github.com/mdayan8/sql-debug-env](https://github.com/mdayan8/sql-debug-env)
- W&B dashboard: [https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag](https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag)

## Model Card Highlights

Model:
[md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)

| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |

## Problem and Motivation

SQL debugging is expensive, repetitive, and operationally risky:

- static checks catch syntax errors, not business-logic correctness
- generated SQL can look plausible and still fail at execution time
- production schemas and data-distribution shifts expose brittle query behavior

This environment optimizes for execution-grounded correctness with deterministic tasks, explicit feedback, and repeatable benchmarks.
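To make "execution-grounded correctness" concrete, here is a toy grader that judges a candidate query by running it against an in-memory SQLite database and comparing result sets rather than query text. This is an illustration only, not the environment's actual reward code; the table, data, and function name are invented for the example.

```python
# Toy execution-grounded grader: compare what queries RETURN, not what they SAY.
# Illustration only -- not the environment's real grading logic.
import sqlite3

def results_match(db: sqlite3.Connection, candidate: str, reference: str) -> bool:
    """Grade by executing both queries and comparing sorted result sets."""
    try:
        got = sorted(db.execute(candidate).fetchall())
    except sqlite3.Error:
        return False  # runtime failure: a text-level check could miss this
    want = sorted(db.execute(reference).fetchall())
    return got == want

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# Textually different but execution-equivalent on this data -> passes.
ok = results_match(db, "SELECT id FROM orders WHERE amount > 15",
                   "SELECT id FROM orders WHERE amount >= 20")
# Plausible-looking but logically wrong -> fails.
bad = results_match(db, "SELECT id FROM orders WHERE amount < 15",
                    "SELECT id FROM orders WHERE amount >= 20")
```

Both candidate queries are valid SQL; only execution against real rows separates them, which is the behavior this benchmark rewards.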
## Benchmark Snapshot

| Metric snapshot | Value |
|---|---:|
| Spider chart: Industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap view | 0.0% -> 25.0% |
| Eval artifact pass | 32-run |

## Proof and Evidence Artifacts

### Main visual proofs

- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`

### Training/eval static exports

| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |

### Run folders and model

- Sample rewards (32 eval): [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-064318-sample-rewards-32eval)
- Earlier 32-eval pass: [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-060502-final-pass-32eval)
- Model card: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)

## System Architecture

```mermaid
flowchart LR
    agent[Client / Agent / Evaluator] --> api[FastAPI API
Layer]
    api --> env[SQLDebugEnv]
    env --> db[In-memory SQLite DB]
    env --> tasks[Task Registry + Graders]
    tasks --> reward[Reward Engine]
    env --> reward
    reward --> api
```

Core components:

- API layer: `server/main.py`
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`

## OpenEnv Contract and Action Space

API surface:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`

Actions:

| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |

Reward (clamped to `[0.0, 1.0]`) blends:

- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (`0.0-0.2` magnitude)

## Task Suite

- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)

## Reliability and Validation

- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)

Live benchmark example:

```bash
curl "http://localhost:7860/benchmark?runs=20"
```

## Quick Start

### Local

```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```

### Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py
```

## Repository Structure

```text
sql-debug-env/
├── Dockerfile
├── openenv.yaml
├── README.md
├── requirements.txt
├── pyproject.toml
├── uv.lock
├── inference.py
├── launch_job.py
├── presentation_graphs.py
├── server/
│   ├── main.py
│   ├── gradio_ui.py
│   ├── demo_page.html
│   ├── env.py
│   ├── models.py
│   ├── database.py
│   ├── reward.py
│   ├── static/
│   └── tasks/
└── tests/
```
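For orientation, a minimal Python client for the OpenEnv contract above might look like the sketch below. The endpoint paths, action names, and required fields come from this README; the exact JSON field names (`action_type`, `task_id`) are assumptions, so treat the Swagger docs at `/docs` as the authoritative schema.

```python
# Minimal client sketch for the OpenEnv HTTP contract described above.
# Endpoint paths and actions mirror this README; request/response field
# names ("action_type", "task_id") are assumptions -- verify against /docs.
import json
from urllib import request as urlreq

BASE_URL = "http://localhost:7860"

# Required fields per action, mirroring the action table above.
ACTION_FIELDS = {
    "submit_query": ["query"],
    "inspect_schema": [],
    "inspect_error": [],
    "inspect_sample": ["table_name"],
    "reset_query": [],
}

def build_action(action_type: str, **fields) -> dict:
    """Validate required fields and build a /step payload (shape assumed)."""
    missing = [f for f in ACTION_FIELDS[action_type] if f not in fields]
    if missing:
        raise ValueError(f"{action_type} requires fields: {missing}")
    return {"action_type": action_type, **fields}

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON reply."""
    req = urlreq.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlreq.urlopen(req) as resp:
        return json.loads(resp.read())

# Example episode loop (requires the server running locally):
#   obs = post("/reset", {"task_id": "easy_syntax_fix"})
#   result = post("/step", build_action("inspect_schema"))
#   result = post("/step", build_action("submit_query", query="SELECT ..."))
```

The `build_action` helper rejects calls that are missing the required fields listed in the action table, which keeps malformed `/step` payloads from ever leaving the client.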