---
title: Sql Debug Env
emoji: πŸ’»
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
# SQL Debug Environment (`sql-debug-env`)
![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f)
![Docker](https://img.shields.io/badge/Deploy-Docker-2496ED?logo=docker&logoColor=white)
![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)
![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white)
![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white)
![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)
Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.
## Quick Links
- Live Space: [https://md896-sql-debug-env.hf.space](https://md896-sql-debug-env.hf.space)
- Demo page: [https://md896-sql-debug-env.hf.space/demo](https://md896-sql-debug-env.hf.space/demo)
- Gradio app: [https://md896-sql-debug-env.hf.space/gradio/](https://md896-sql-debug-env.hf.space/gradio/)
- Swagger: [https://md896-sql-debug-env.hf.space/docs](https://md896-sql-debug-env.hf.space/docs)
- OpenAPI: [https://md896-sql-debug-env.hf.space/openapi.json](https://md896-sql-debug-env.hf.space/openapi.json)
- HF model: [https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
- GitHub: [https://github.com/mdayan8/sql-debug-env](https://github.com/mdayan8/sql-debug-env)
- W&B dashboard: [https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag](https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag)
## Model Card Highlights
Model: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |
## Problem and Motivation
SQL debugging is expensive, repetitive, and operationally risky:
- static checks catch syntax, not business-logic correctness
- generated SQL can look plausible and still fail at execution time
- production schemas and data distribution shifts expose brittle query behavior
This environment instead optimizes for execution-grounded correctness, built on deterministic tasks, explicit feedback, and repeatable benchmarks.
## Benchmark Snapshot
| Metric | Value |
|---|---:|
| Spider chart: Industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap view | 0.0% -> 25.0% |
| Eval artifact pass | 32 runs |
## Proof and Evidence Artifacts
### Main visual proofs
- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`
### Training/eval static exports
| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
### Run folders and model
- Sample rewards (32 eval): [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-064318-sample-rewards-32eval)
- Earlier 32-eval pass: [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-060502-final-pass-32eval)
- Model card: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
## System Architecture
```mermaid
flowchart LR
agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
api --> env[SQLDebugEnv]
env --> db[In-memory SQLite DB]
env --> tasks[Task Registry + Graders]
tasks --> reward[Reward Engine]
env --> reward
reward --> api
```
Core components:
- API layer: `server/main.py`
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`
## OpenEnv Contract and Action Space
API surface:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`
Actions (a minimal client sketch follows the table):
| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |
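The action table maps directly onto `/reset` and `/step`. Below is a minimal episode sketch, assuming a local server on port 7860 and a flat JSON action payload of the form shown; confirm the exact request schema against `/docs` before relying on it:
```python
import requests

BASE = "http://localhost:7860"

# Start a fresh episode; the observation is expected to include the broken query.
obs = requests.post(f"{BASE}/reset", json={}).json()
print("initial observation:", obs)

# Inspect the schema before attempting a repair.
schema = requests.post(f"{BASE}/step", json={"action": "inspect_schema"}).json()

# Submit a candidate fix (illustrative query only) and read back the reward signal.
result = requests.post(
    f"{BASE}/step",
    json={"action": "submit_query", "query": "SELECT id, name FROM users;"},
).json()
print("reward:", result.get("reward"))
```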
Reward (clamped to `[0.0, 1.0]`) blends the following components (an illustrative blend is sketched below):
- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (`0.0-0.2` magnitude)
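A minimal sketch of one possible blend, assuming each component already lies in its documented range; `server/reward.py` is the authoritative implementation:
```python
def blend_reward(correctness: float, efficiency: float, syntax_progress: float,
                 schema_bonus: float, penalties: float) -> float:
    """Illustrative combination of the documented reward components.

    Each argument is assumed to already lie in its documented range
    (e.g. correctness in [0.0, 0.6], penalty magnitude in [0.0, 0.2]);
    the real weighting lives in server/reward.py.
    """
    raw = correctness + efficiency + syntax_progress + schema_bonus - penalties
    # The final reward is clamped to [0.0, 1.0], as stated above.
    return max(0.0, min(1.0, raw))

# Example: 0.6 + 0.1 + 0.05 + 0.1 - 0.05 = 0.8
assert abs(blend_reward(0.6, 0.1, 0.05, 0.1, 0.05) - 0.8) < 1e-9
```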
## Task Suite
- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)
## Reliability and Validation
- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)
Live benchmark example:
```bash
curl "http://localhost:7860/benchmark?runs=20"
```
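The same endpoint is easy to fold into CI or a notebook; a programmatic variant (nothing is assumed about the response beyond it being JSON):
```python
import requests

# Run a 20-episode benchmark pass and print the returned summary.
resp = requests.get("http://localhost:7860/benchmark", params={"runs": 20})
resp.raise_for_status()
print(resp.json())
```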
## Quick Start
### Local
```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```
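Once uvicorn reports startup, a quick probe of the documented `GET /health` endpoint confirms the server is live (assuming it returns a JSON body):
```python
import requests

# Expect HTTP 200 once the server has finished starting.
r = requests.get("http://localhost:7860/health", timeout=5)
print(r.status_code, r.json())
```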
### Docker
```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```
### Baseline Inference
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py
```
## Repository Structure
```text
sql-debug-env/
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ uv.lock
β”œβ”€β”€ inference.py
β”œβ”€β”€ launch_job.py
β”œβ”€β”€ presentation_graphs.py
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ main.py
β”‚   β”œβ”€β”€ gradio_ui.py
β”‚   β”œβ”€β”€ demo_page.html
β”‚   β”œβ”€β”€ env.py
β”‚   β”œβ”€β”€ models.py
β”‚   β”œβ”€β”€ database.py
β”‚   β”œβ”€β”€ reward.py
β”‚   β”œβ”€β”€ static/
β”‚   └── tasks/
└── tests/
```