---
title: Sql Debug Env
emoji: 💻
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
# SQL Debug Environment (`sql-debug-env`)

A deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.
## Quick Links
- Live Space: [https://md896-sql-debug-env.hf.space](https://md896-sql-debug-env.hf.space)
- Demo page: [https://md896-sql-debug-env.hf.space/demo](https://md896-sql-debug-env.hf.space/demo)
- Gradio app: [https://md896-sql-debug-env.hf.space/gradio/](https://md896-sql-debug-env.hf.space/gradio/)
- Swagger: [https://md896-sql-debug-env.hf.space/docs](https://md896-sql-debug-env.hf.space/docs)
- OpenAPI: [https://md896-sql-debug-env.hf.space/openapi.json](https://md896-sql-debug-env.hf.space/openapi.json)
- HF model: [https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
- GitHub: [https://github.com/mdayan8/sql-debug-env](https://github.com/mdayan8/sql-debug-env)
- W&B dashboard: [https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag](https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag)
## Model Card Highlights
Model: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |
## Problem and Motivation
SQL debugging is expensive, repetitive, and operationally risky:
- static checks catch syntax errors, not business-logic correctness
- generated SQL can look plausible yet still fail at execution time
- production schemas and data-distribution shifts expose brittle query behavior

This environment optimizes for execution-grounded correctness through deterministic tasks, explicit feedback, and repeatable benchmarks.
## Benchmark Snapshot
| Metric | Value |
|---|---:|
| Spider chart: industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap | 0.0% → 25.0% |
| Eval artifact | 32-run pass |
## Proof and Evidence Artifacts
### Main visual proofs
- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`
### Training/eval static exports
| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
### Run folders and model
- Sample rewards (32 eval): [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-064318-sample-rewards-32eval)
- Earlier 32-eval pass: [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-060502-final-pass-32eval)
- Model card: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
## System Architecture
```mermaid
flowchart LR
agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
api --> env[SQLDebugEnv]
env --> db[In-memory SQLite DB]
env --> tasks[Task Registry + Graders]
tasks --> reward[Reward Engine]
env --> reward
reward --> api
```
Core components:
- API layer: `server/main.py`
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`
## OpenEnv Contract and Action Space
API surface:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`
Actions:
| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |
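The action table above maps directly onto `/step` request bodies. As a minimal sketch, a payload builder might look like the following; the field names (`action`, `query`, `table_name`) are assumptions about the request schema, so check `server/models.py` or the Swagger docs for the exact contract:

```python
# Hypothetical /step payload builder matching the action table above.
# Required-field names are assumptions; see server/models.py for the real schema.
REQUIRED_FIELDS = {
    "submit_query": ["query"],
    "inspect_schema": [],
    "inspect_error": [],
    "inspect_sample": ["table_name"],
    "reset_query": [],
}

def build_step_payload(action: str, **fields) -> dict:
    """Validate required fields for an action and return a /step body."""
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {action}")
    missing = [f for f in REQUIRED_FIELDS[action] if f not in fields]
    if missing:
        raise ValueError(f"{action} requires fields: {missing}")
    return {"action": action, **fields}
```

Validating client-side keeps malformed actions from burning an environment step.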
Reward (clamped to `[0.0, 1.0]`) blends:
- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (`0.0-0.2` magnitude)
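The component ranges above can be read as a capped sum minus penalties, clamped at the end. This is an illustrative sketch only; the authoritative logic lives in `server/reward.py` and may weight or clamp components differently:

```python
# Illustrative reward blend per the stated component ranges.
# The real implementation is in server/reward.py; this is a sketch.
def blend_reward(correctness: float, efficiency: float,
                 syntax_progress: float, schema_bonus: float,
                 penalties: float) -> float:
    """Cap each component to its range, subtract penalties, clamp to [0, 1]."""
    total = (
        min(max(correctness, 0.0), 0.6)
        + min(max(efficiency, 0.0), 0.2)
        + min(max(syntax_progress, 0.0), 0.1)
        + min(max(schema_bonus, 0.0), 0.1)
        - min(max(penalties, 0.0), 0.2)
    )
    return min(max(total, 0.0), 1.0)
```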
## Task Suite
- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)
## Reliability and Validation
- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)
Live benchmark example:
```bash
curl "http://localhost:7860/benchmark?runs=20"
```
## Quick Start
### Local
```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```
### Docker
```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```
### Baseline Inference
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py
```
## Repository Structure
```text
sql-debug-env/
├── Dockerfile
├── openenv.yaml
├── README.md
├── requirements.txt
├── pyproject.toml
├── uv.lock
├── inference.py
├── launch_job.py
├── presentation_graphs.py
├── server/
│   ├── main.py
│   ├── gradio_ui.py
│   ├── demo_page.html
│   ├── env.py
│   ├── models.py
│   ├── database.py
│   ├── reward.py
│   ├── static/
│   └── tasks/
└── tests/
```