---
title: Sql Debug Env
emoji: 💻
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
# SQL Debug Environment (`sql-debug-env`)

A deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.
## Quick Links
- Live Space: [https://md896-sql-debug-env.hf.space](https://md896-sql-debug-env.hf.space)
- Demo page: [https://md896-sql-debug-env.hf.space/demo](https://md896-sql-debug-env.hf.space/demo)
- Gradio app: [https://md896-sql-debug-env.hf.space/gradio/](https://md896-sql-debug-env.hf.space/gradio/)
- Swagger: [https://md896-sql-debug-env.hf.space/docs](https://md896-sql-debug-env.hf.space/docs)
- OpenAPI: [https://md896-sql-debug-env.hf.space/openapi.json](https://md896-sql-debug-env.hf.space/openapi.json)
- HF model: [https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
- GitHub: [https://github.com/mdayan8/sql-debug-env](https://github.com/mdayan8/sql-debug-env)
- W&B dashboard: [https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag](https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag)
## Model Card Highlights
Model: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
| Field | Value |
|---|---|
| Task | Text generation (SQL repair style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |
## Problem and Motivation
SQL debugging is expensive, repetitive, and operationally risky:
- static checks catch syntax errors, not business-logic correctness
- generated SQL can look plausible yet still fail at execution time
- production schemas and data-distribution shifts expose brittle query behavior

This environment optimizes for execution-grounded correctness through deterministic tasks, explicit feedback, and repeatable benchmarks.
## Benchmark Snapshot
| Metric | Value |
|---|---:|
| Spider chart: industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap | 0.0% → 25.0% |
| Eval artifact | 32-run pass |
## Proof and Evidence Artifacts
### Main visual proofs
- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`
### Training/eval static exports
| File | Purpose |
|---|---|
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
### Run folders and model
- Sample rewards (32 eval): [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-064318-sample-rewards-32eval)
- Earlier 32-eval pass: [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-060502-final-pass-32eval)
- Model card: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
## System Architecture
```mermaid
flowchart LR
agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
api --> env[SQLDebugEnv]
env --> db[In-memory SQLite DB]
env --> tasks[Task Registry + Graders]
tasks --> reward[Reward Engine]
env --> reward
reward --> api
```
Core components:
- API layer: `server/main.py`
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Task + graders: `server/tasks/`
- Baseline runner: `inference.py`
## OpenEnv Contract and Action Space
API surface:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`
Actions:
| Action | Required fields | Purpose |
|---|---|---|
| `submit_query` | `query` | Execute/grade SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore original broken query |
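The action table above maps directly onto `/step` request bodies. As a minimal sketch, a payload builder might look like the following; the field names (`action`, `query`, `table_name`) are assumptions about the request schema, so check `server/models.py` or the Swagger docs for the exact contract:

```python
# Hypothetical /step payload builder matching the action table above.
# Required-field names are assumptions; see server/models.py for the real schema.
REQUIRED_FIELDS = {
    "submit_query": ["query"],
    "inspect_schema": [],
    "inspect_error": [],
    "inspect_sample": ["table_name"],
    "reset_query": [],
}

def build_step_payload(action: str, **fields) -> dict:
    """Validate required fields for an action and return a /step body."""
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {action}")
    missing = [f for f in REQUIRED_FIELDS[action] if f not in fields]
    if missing:
        raise ValueError(f"{action} requires fields: {missing}")
    return {"action": action, **fields}
```

Validating client-side keeps malformed actions from burning an environment step.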
Reward (clamped to `[0.0, 1.0]`) blends:
- correctness (`0.0-0.6`)
- efficiency (`0.0-0.2`)
- syntax_progress (`0.0-0.1`)
- schema_bonus (`0.0-0.1`)
- penalties (`0.0-0.2` magnitude)
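The component ranges above can be read as a capped sum minus penalties, clamped at the end. This is an illustrative sketch only; the authoritative logic lives in `server/reward.py` and may weight or clamp components differently:

```python
# Illustrative reward blend per the stated component ranges.
# The real implementation is in server/reward.py; this is a sketch.
def blend_reward(correctness: float, efficiency: float,
                 syntax_progress: float, schema_bonus: float,
                 penalties: float) -> float:
    """Cap each component to its range, subtract penalties, clamp to [0, 1]."""
    total = (
        min(max(correctness, 0.0), 0.6)
        + min(max(efficiency, 0.0), 0.2)
        + min(max(syntax_progress, 0.0), 0.1)
        + min(max(schema_bonus, 0.0), 0.1)
        - min(max(penalties, 0.0), 0.2)
    )
    return min(max(total, 0.0), 1.0)
```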
## Task Suite
- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)
## Reliability and Validation
- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)
Live benchmark example:
```bash
curl "http://localhost:7860/benchmark?runs=20"
```
## Quick Start
### Local
```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```
### Docker
```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```
### Baseline Inference
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py
```
## Repository Structure
```text
sql-debug-env/
├── Dockerfile
├── openenv.yaml
├── README.md
├── requirements.txt
├── pyproject.toml
├── uv.lock
├── inference.py
├── launch_job.py
├── presentation_graphs.py
├── server/
│   ├── main.py
│   ├── gradio_ui.py
│   ├── demo_page.html
│   ├── env.py
│   ├── models.py
│   ├── database.py
│   ├── reward.py
│   ├── static/
│   └── tasks/
└── tests/
```