---
title: Sql Debug Env
emoji: 💻
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
# SQL Debug Environment (sql-debug-env)
Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.
## Quick Links
- Live Space: https://md896-sql-debug-env.hf.space
- Demo page: https://md896-sql-debug-env.hf.space/demo
- Gradio app: https://md896-sql-debug-env.hf.space/gradio/
- Swagger: https://md896-sql-debug-env.hf.space/docs
- OpenAPI: https://md896-sql-debug-env.hf.space/openapi.json
- HF model: https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2
- GitHub: https://github.com/mdayan8/sql-debug-env
- W&B dashboard: https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag
## Model Card Highlights

Model: `md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2`
| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |
## Problem and Motivation

SQL debugging is expensive, repetitive, and operationally risky:

- Static checks catch syntax errors, not business-logic correctness
- Generated SQL can look plausible and still fail at execution time
- Production schemas and data-distribution shifts expose brittle query behavior

This environment optimizes for execution-grounded correctness with deterministic tasks, explicit feedback, and repeatable benchmarks.
## Benchmark Snapshot
| Metric | Value |
|---|---|
| Spider chart: Industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap view | 0.0% → 25.0% |
| Eval artifact pass | 32 runs |
## Proof and Evidence Artifacts
### Main visual proofs
- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`
### Training/eval static exports
| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
### Run folders and model

- Sample rewards (32 eval): HF artifacts folder
- Earlier 32-eval pass: HF artifacts folder
- Model card: `md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2`
## System Architecture
```mermaid
flowchart LR
    agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
    api --> env[SQLDebugEnv]
    env --> db[In-memory SQLite DB]
    env --> tasks[Task Registry + Graders]
    tasks --> reward[Reward Engine]
    env --> reward
    reward --> api
```
Core components:

- API layer: `server/main.py` (sketched below)
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`
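The sketch below shows how the FastAPI layer can delegate `/reset` and `/step` to the environment engine. Only the endpoint paths and action fields are taken from this README; the stub environment and every other name are illustrative, not the actual contents of `server/main.py`:

```python
# Hypothetical wiring sketch, not the real server/main.py.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class StepRequest(BaseModel):
    action: str                       # "submit_query", "inspect_schema", ...
    query: Optional[str] = None       # required only for submit_query
    table_name: Optional[str] = None  # required only for inspect_sample


class StubEnv:
    """Stand-in for SQLDebugEnv (server/env.py)."""

    def reset(self) -> dict:
        return {"observation": {"broken_query": "SELECT * FORM users;"},
                "done": False}

    def step(self, req: StepRequest) -> dict:
        # The real engine executes against SQLite, grades the result,
        # and scores a blended reward.
        return {"observation": {"echo": req.action}, "reward": 0.0, "done": False}


env = StubEnv()


@app.post("/reset")
def reset() -> dict:
    return env.reset()


@app.post("/step")
def step(req: StepRequest) -> dict:
    return env.step(req)
```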
## OpenEnv Contract and Action Space
API surface:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`
Actions:

| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |
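A raw-HTTP episode against this surface might look like the sketch below. The endpoint paths and action fields come from the table above; the response keys are assumptions, so consult the Swagger page (`/docs`) for the real schema:

```python
# Minimal client sketch; response shapes are assumptions.
import requests

BASE = "http://localhost:7860"

# Start an episode and view the broken query / task context.
obs = requests.post(f"{BASE}/reset").json()
print(obs)

# Gather context before attempting a repair.
schema = requests.post(f"{BASE}/step", json={"action": "inspect_schema"}).json()
error = requests.post(f"{BASE}/step", json={"action": "inspect_error"}).json()

# Submit a candidate fix for execution and grading.
result = requests.post(
    f"{BASE}/step",
    json={"action": "submit_query", "query": "SELECT id, name FROM users;"},
)
print(result.json())
```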
Reward (clamped to [0.0, 1.0]) blends:

- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (up to `0.2` in magnitude)
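As a sketch, the blend can be read as a clamped weighted sum. The component names and ranges come from the list above; the exact combination logic lives in `server/reward.py` and may differ:

```python
def blend_reward(correctness: float, efficiency: float,
                 syntax_progress: float, schema_bonus: float,
                 penalties: float) -> float:
    """Combine reward components and clamp to [0.0, 1.0].

    Each argument is assumed to already sit in its documented range,
    e.g. correctness in [0.0, 0.6] and penalties in [0.0, 0.2].
    """
    raw = correctness + efficiency + syntax_progress + schema_bonus - penalties
    return max(0.0, min(1.0, raw))


# Example: fully correct, mildly inefficient fix with a small penalty.
print(blend_reward(0.6, 0.1, 0.1, 0.1, 0.05))  # 0.85
```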
## Task Suite

- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion; illustrated below)
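To illustrate the expert task's failure mode, the self-contained sketch below reproduces a fan trap in plain SQLite: joining two one-to-many tables through the same parent multiplies rows and inflates the aggregate. The schema here is invented for illustration, not the task's actual data:

```python
# Fan-trap demo with an invented schema (not the task's data).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY);
    CREATE TABLE deposits (account_id INTEGER, amount REAL);
    CREATE TABLE fees     (account_id INTEGER, amount REAL);
    INSERT INTO accounts VALUES (1);
    INSERT INTO deposits VALUES (1, 100.0), (1, 200.0);
    INSERT INTO fees     VALUES (1, 5.0), (1, 5.0), (1, 5.0);
""")

# Broken: joining both fan-outs at once creates a 2x3 cartesian block,
# so each deposit row is counted three times.
broken = """
    SELECT SUM(d.amount) FROM accounts a
    JOIN deposits d ON d.account_id = a.id
    JOIN fees f     ON f.account_id = a.id
"""

# Fixed: aggregate each fan-out on its own before combining.
fixed = "SELECT SUM(amount) FROM deposits WHERE account_id = 1"

print(con.execute(broken).fetchone()[0])  # 900.0 (inflated)
print(con.execute(fixed).fetchone()[0])   # 300.0 (correct)
```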
## Reliability and Validation

- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)
Live benchmark example:

```bash
curl "http://localhost:7860/benchmark?runs=20"
```
## Quick Start

### Local

```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```
## Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"

python inference.py
```
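A sketch of how a runner like `inference.py` might consume this configuration. The variable names match the exports above, but the client logic is illustrative and may not match the actual script:

```python
# Illustrative config consumption; the real inference.py may differ.
import os

from openai import OpenAI  # any OpenAI-compatible endpoint works via base_url

client = OpenAI(
    base_url=os.environ["API_BASE_URL"],  # e.g. https://api.openai.com/v1
    api_key=os.environ["OPENAI_API_KEY"],
)
env_base = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
seed = int(os.environ.get("SEED", "1"))

# One repair-style completion; the runner would loop this against /step.
resp = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": "Fix this SQL: SELECT * FORM users;"}],
    seed=seed,
)
print(resp.choices[0].message.content)
```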
## Repository Structure

```
sql-debug-env/
├── Dockerfile
├── openenv.yaml
├── README.md
├── requirements.txt
├── pyproject.toml
├── uv.lock
├── inference.py
├── launch_job.py
├── presentation_graphs.py
├── server/
│   ├── main.py
│   ├── gradio_ui.py
│   ├── demo_page.html
│   ├── env.py
│   ├── models.py
│   ├── database.py
│   ├── reward.py
│   ├── static/
│   └── tasks/
└── tests/
```