---
title: Sql Debug Env
emoji: 💻
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
# SQL Debug Environment (`sql-debug-env`)
Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.
## Quick Links

- Live Space: [https://md896-sql-debug-env.hf.space](https://md896-sql-debug-env.hf.space)
- Demo page: [https://md896-sql-debug-env.hf.space/demo](https://md896-sql-debug-env.hf.space/demo)
- Gradio app: [https://md896-sql-debug-env.hf.space/gradio/](https://md896-sql-debug-env.hf.space/gradio/)
- Swagger: [https://md896-sql-debug-env.hf.space/docs](https://md896-sql-debug-env.hf.space/docs)
- OpenAPI: [https://md896-sql-debug-env.hf.space/openapi.json](https://md896-sql-debug-env.hf.space/openapi.json)
- HF model: [https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
- GitHub: [https://github.com/mdayan8/sql-debug-env](https://github.com/mdayan8/sql-debug-env)
- W&B dashboard: [https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag](https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag)
## Model Card Highlights

Model: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)

| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |
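
Since the model card tags the checkpoint for Transformers text generation, a local smoke test could look like the sketch below. This is a minimal sketch: it assumes the checkpoint loads with the standard `AutoModelForCausalLM`/`AutoTokenizer` API, and the prompt format is illustrative rather than the exact training template.

```python
# Minimal sketch: load the GRPO-tuned checkpoint with Transformers and ask for a repair.
# The prompt format here is illustrative, not the exact template used during training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Fix this SQL query:\nSELECT name FROM users WHER age > 30"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```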
## Problem and Motivation

SQL debugging is expensive, repetitive, and operationally risky:

- static checks catch syntax, not business-logic correctness
- generated SQL can look plausible and still fail at execution time
- production schemas and data distribution shifts expose brittle query behavior

This environment is designed to optimize for execution-grounded correctness with deterministic tasks, explicit feedback, and repeatable benchmarks.
## Benchmark Snapshot

| Metric snapshot | Value |
|---|---:|
| Spider chart: Industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap view | 0.0% -> 25.0% |
| Eval artifact pass | 32-run |
## Proof and Evidence Artifacts

### Main visual proofs

- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`
### Training/eval static exports

| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
### Run folders and model

- Sample rewards (32 eval): [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-064318-sample-rewards-32eval)
- Earlier 32-eval pass: [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-060502-final-pass-32eval)
- Model card: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
## System Architecture

```mermaid
flowchart LR
  agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
  api --> env[SQLDebugEnv]
  env --> db[In-memory SQLite DB]
  env --> tasks[Task Registry + Graders]
  tasks --> reward[Reward Engine]
  env --> reward
  reward --> api
```

Core components:

- API layer: `server/main.py`
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`
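
For orientation, the API layer listed above (`server/main.py`) can be thought of as a thin FastAPI wrapper around `SQLDebugEnv`. The snippet below is a hypothetical simplification, not the actual code: the route bodies, the request model, and the field names are stand-ins for what really lives in `server/main.py`, `server/env.py`, and `server/models.py`.

```python
# Hypothetical sketch of a FastAPI layer wrapping the environment.
# Names and payload shapes are illustrative; the real code is in server/main.py.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class StepRequest(BaseModel):
    action: str                       # e.g. "submit_query", "inspect_schema"
    query: Optional[str] = None       # used by submit_query
    table_name: Optional[str] = None  # used by inspect_sample


@app.post("/reset")
def reset() -> dict:
    # In the real server this would start a fresh SQLDebugEnv episode.
    return {"observation": "broken query + task description (stub)"}


@app.post("/step")
def step(req: StepRequest) -> dict:
    # In the real server this would apply the action and score it via server/reward.py.
    return {"observation": f"result of {req.action} (stub)", "reward": 0.0, "done": False}
```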
## OpenEnv Contract and Action Space

API surface:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`
Actions (an example client loop follows the table):

| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |
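
To make the contract concrete, a client episode could look roughly like the loop below. This is a hedged sketch: it assumes `POST /step` takes a JSON body with an `action` field plus the per-action fields from the table above, and that responses are JSON. The authoritative request/response schemas are in `server/models.py` and the OpenAPI spec at `/openapi.json`.

```python
# Hedged sketch of a client episode: reset, inspect the schema, then submit a fix.
# Field names follow the action table above; the exact payload/response shapes
# come from server/models.py and /openapi.json, not from this snippet.
import requests

BASE = "http://localhost:7860"

# Start a new episode.
obs = requests.post(f"{BASE}/reset", json={}).json()
print("initial observation:", obs)

# Look at the schema before attempting a repair.
schema = requests.post(f"{BASE}/step", json={"action": "inspect_schema"}).json()
print("schema:", schema)

# Submit a candidate repaired query and read the reward.
result = requests.post(
    f"{BASE}/step",
    json={"action": "submit_query", "query": "SELECT name FROM users WHERE age > 30"},
).json()
print("reward:", result.get("reward"))
```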
Reward (clamped to `[0.0, 1.0]`) blends the following components; a rough sketch of the blend follows this list:

- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (`0.0-0.2` magnitude)
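
The sketch below shows how components in those ranges could be summed and clamped; it is illustrative only, and the actual weighting lives in `server/reward.py`.

```python
# Illustrative only: blend sub-scores in the ranges listed above and clamp to [0.0, 1.0].
# The real weighting and edge cases live in server/reward.py.
def blend_reward(correctness: float, efficiency: float, syntax_progress: float,
                 schema_bonus: float, penalties: float) -> float:
    total = correctness + efficiency + syntax_progress + schema_bonus - penalties
    return max(0.0, min(1.0, total))


# Example: a mostly-correct, efficient fix with a small penalty.
print(blend_reward(0.6, 0.15, 0.1, 0.05, 0.1))  # -> 0.8
```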
## Task Suite

- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)
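
A quick way to check which task IDs a running deployment actually exposes is the `GET /tasks` endpoint. The snippet below assumes it returns JSON but does not rely on a particular response shape.

```python
# List the tasks exposed by a running environment via GET /tasks.
# Assumes a JSON response; the exact shape is documented in /openapi.json.
import requests

resp = requests.get("http://localhost:7860/tasks", timeout=10)
resp.raise_for_status()
print(resp.json())
```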
## Reliability and Validation

- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)

Live benchmark example:

```bash
curl "http://localhost:7860/benchmark?runs=20"
```
## Quick Start

### Local

```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```
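
Once the container is up, a quick smoke check against `/health` (shown in Python here; a plain `curl http://localhost:7860/health` works equally well):

```python
# Quick smoke check: the container should answer on /health once it is ready.
import requests

r = requests.get("http://localhost:7860/health", timeout=5)
r.raise_for_status()
print(r.status_code, r.text)
```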
### Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py
```
## Repository Structure

```text
sql-debug-env/
├── Dockerfile
├── openenv.yaml
├── README.md
├── requirements.txt
├── pyproject.toml
├── uv.lock
├── inference.py
├── launch_job.py
├── presentation_graphs.py
├── server/
│   ├── main.py
│   ├── gradio_ui.py
│   ├── demo_page.html
│   ├── env.py
│   ├── models.py
│   ├── database.py
│   ├── reward.py
│   ├── static/
│   └── tasks/
└── tests/
```