---
title: Sql Debug Env
emoji: πŸ’»
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
# SQL Debug Environment (`sql-debug-env`)
![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f)
![Docker](https://img.shields.io/badge/Deploy-Docker-2496ED?logo=docker&logoColor=white)
![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)
![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white)
![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white)
![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)
Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.
## Quick Links
- Live Space: [https://md896-sql-debug-env.hf.space](https://md896-sql-debug-env.hf.space)
- Demo page: [https://md896-sql-debug-env.hf.space/demo](https://md896-sql-debug-env.hf.space/demo)
- Gradio app: [https://md896-sql-debug-env.hf.space/gradio/](https://md896-sql-debug-env.hf.space/gradio/)
- Swagger: [https://md896-sql-debug-env.hf.space/docs](https://md896-sql-debug-env.hf.space/docs)
- OpenAPI: [https://md896-sql-debug-env.hf.space/openapi.json](https://md896-sql-debug-env.hf.space/openapi.json)
- HF model: [https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
- GitHub: [https://github.com/mdayan8/sql-debug-env](https://github.com/mdayan8/sql-debug-env)
- W&B dashboard: [https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag](https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag)
## Model Card Highlights
Model: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |
## Problem and Motivation
SQL debugging is expensive, repetitive, and operationally risky:
- static checks catch syntax, not business-logic correctness
- generated SQL can look plausible and still fail at execution time
- production schemas and data distribution shifts expose brittle query behavior
This environment instead optimizes for execution-grounded correctness, built on deterministic tasks, explicit feedback, and repeatable benchmarks.
## Benchmark Snapshot
| Metric | Value |
|---|---:|
| Spider chart: Industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap view | 0.0% -> 25.0% |
| Eval artifact pass | 32 runs |
## Proof and Evidence Artifacts
### Main visual proofs
- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`
### Training/eval static exports
| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
### Run folders and model
- Sample rewards (32 eval): [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-064318-sample-rewards-32eval)
- Earlier 32-eval pass: [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-060502-final-pass-32eval)
- Model card: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
## System Architecture
```mermaid
flowchart LR
agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
api --> env[SQLDebugEnv]
env --> db[In-memory SQLite DB]
env --> tasks[Task Registry + Graders]
tasks --> reward[Reward Engine]
env --> reward
reward --> api
```
Core components:
- API layer: `server/main.py`
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`
## OpenEnv Contract and Action Space
API surface:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`
Actions (a minimal client sketch follows the table):
| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |
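The action table maps directly onto `/reset` and `/step`. Below is a minimal episode sketch, assuming a local server on port 7860 and a flat JSON action payload of the form shown; confirm the exact request schema against `/docs` before relying on it:
```python
import requests

BASE = "http://localhost:7860"

# Start a fresh episode; the observation is expected to include the broken query.
obs = requests.post(f"{BASE}/reset", json={}).json()
print("initial observation:", obs)

# Inspect the schema before attempting a repair.
schema = requests.post(f"{BASE}/step", json={"action": "inspect_schema"}).json()

# Submit a candidate fix (illustrative query only) and read back the reward signal.
result = requests.post(
    f"{BASE}/step",
    json={"action": "submit_query", "query": "SELECT id, name FROM users;"},
).json()
print("reward:", result.get("reward"))
```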
Reward (clamped to `[0.0, 1.0]`) blends the following components (an illustrative blend is sketched below):
- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (`0.0-0.2` magnitude)
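A minimal sketch of one possible blend, assuming each component already lies in its documented range; `server/reward.py` is the authoritative implementation:
```python
def blend_reward(correctness: float, efficiency: float, syntax_progress: float,
                 schema_bonus: float, penalties: float) -> float:
    """Illustrative combination of the documented reward components.

    Each argument is assumed to already lie in its documented range
    (e.g. correctness in [0.0, 0.6], penalty magnitude in [0.0, 0.2]);
    the real weighting lives in server/reward.py.
    """
    raw = correctness + efficiency + syntax_progress + schema_bonus - penalties
    # The final reward is clamped to [0.0, 1.0], as stated above.
    return max(0.0, min(1.0, raw))

# Example: 0.6 + 0.1 + 0.05 + 0.1 - 0.05 = 0.8
assert abs(blend_reward(0.6, 0.1, 0.05, 0.1, 0.05) - 0.8) < 1e-9
```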
## Task Suite
- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)
## Reliability and Validation
- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)
Live benchmark example:
```bash
curl "http://localhost:7860/benchmark?runs=20"
```
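The same endpoint is easy to fold into CI or a notebook; a programmatic variant (nothing is assumed about the response beyond it being JSON):
```python
import requests

# Run a 20-episode benchmark pass and print the returned summary.
resp = requests.get("http://localhost:7860/benchmark", params={"runs": 20})
resp.raise_for_status()
print(resp.json())
```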
## Quick Start
### Local
```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```
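Once uvicorn reports startup, a quick probe of the documented `GET /health` endpoint confirms the server is live (assuming it returns a JSON body):
```python
import requests

# Expect HTTP 200 once the server has finished starting.
r = requests.get("http://localhost:7860/health", timeout=5)
print(r.status_code, r.json())
```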
### Docker
```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```
### Baseline Inference
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py
```
## Repository Structure
```text
sql-debug-env/
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ uv.lock
β”œβ”€β”€ inference.py
β”œβ”€β”€ launch_job.py
β”œβ”€β”€ presentation_graphs.py
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ main.py
β”‚   β”œβ”€β”€ gradio_ui.py
β”‚   β”œβ”€β”€ demo_page.html
β”‚   β”œβ”€β”€ env.py
β”‚   β”œβ”€β”€ models.py
β”‚   β”œβ”€β”€ database.py
β”‚   β”œβ”€β”€ reward.py
β”‚   β”œβ”€β”€ static/
β”‚   └── tasks/
└── tests/
```