---
title: Sql Debug Env
emoji: πŸ’»
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---

# SQL Debug Environment (sql-debug-env)

*OpenEnv Β· Docker Β· Python Β· FastAPI Β· Pydantic Β· SQLite Β· Uvicorn*

Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.

## Model Card Highlights

**Model:** `md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2`

| Field | Value |
| --- | --- |
| Task | Text generation (SQL-repair-style prompts) |
| Libraries | Transformers, TRL (GRPO), Safetensors, TGI-compatible |
| Family tags | qwen2, grpo, conversational, text-generation-inference |
| Base tracks used in workflow | Qwen2.5-Coder 0.5B bridge + Qwen2.5-Coder 7B benchmark/eval track |
| Training signal | Execution-grounded reward from OpenEnv SQL tasks |
| Reference | arXiv:1910.09700 (as listed in model metadata) |

## Problem and Motivation

SQL debugging is expensive, repetitive, and operationally risky:

- Static checks catch syntax errors, not business-logic correctness.
- Generated SQL can look plausible and still fail at execution time.
- Production schemas and data-distribution shifts expose brittle query behavior.

This environment is designed to optimize for execution-grounded correctness with deterministic tasks, explicit feedback, and repeatable benchmarks.
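
A runnable toy case makes this concrete. The schema and queries below are hypothetical, not drawn from the task suite; both queries pass a syntax check, and only execution exposes the logic bug:

```python
import sqlite3

# Hypothetical mini-schema, not part of the task suite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Both queries parse; only one computes the intended per-customer total.
broken = "SELECT customer_id, amount FROM orders GROUP BY customer_id"
fixed = "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"

for label, query in (("broken", broken), ("fixed", fixed)):
    print(label, conn.execute(query).fetchall())
# broken silently returns an arbitrary amount per group;
# fixed returns (1, 15.0), (2, 7.5).
```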

## Benchmark Snapshot

| Metric snapshot | Value |
| --- | --- |
| Spider chart: Industry baseline | 48.2% |
| Spider chart: Qwen-7B base | 52.4% |
| Spider chart: RL agent | 78.5% |
| Performance leap view | 0.0% -> 25.0% |
| Eval artifact pass | 32-run |

## Proof and Evidence Artifacts

### Main visual proofs

- End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
- Performance leap chart: `server/static/chart-performance-leap.png`
- Comparison + reward shift: `server/static/chart-comparison-shift.png`
- Spider headline chart: `server/static/chart-spider-benchmark.png`

### Training/eval static exports

| File | Purpose |
| --- | --- |
| `server/static/training_reward_curve_final.png` | Reward over steps |
| `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
| `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
| `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
| `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
| `server/static/presentation_combo_final.png` | Consolidated visual summary |
| `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
| `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
| `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |


## System Architecture

```mermaid
flowchart LR
    agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
    api --> env[SQLDebugEnv]
    env --> db[In-memory SQLite DB]
    env --> tasks[Task Registry + Graders]
    tasks --> reward[Reward Engine]
    env --> reward
    reward --> api
```

Core components:

- API layer: `server/main.py`
- Environment engine: `server/env.py`
- Episode DB: `server/database.py`
- Typed models: `server/models.py`
- Reward logic: `server/reward.py`
- Tasks and graders: `server/tasks/`
- Baseline runner: `inference.py`

## OpenEnv Contract and Action Space

API surface:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /benchmark`

Actions:

| Action | Required fields | Purpose |
| --- | --- | --- |
| `submit_query` | `query` | Execute and grade a SQL candidate |
| `inspect_schema` | none | Return schema metadata |
| `inspect_error` | none | Return the last execution error |
| `inspect_sample` | `table_name` | Return sample rows |
| `reset_query` | none | Restore the original broken query |
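
A minimal client sketch of this contract follows; the endpoint paths and action names come from the tables above, while the exact JSON request and response fields are assumptions:

```python
# Minimal client sketch; request/response field names are assumptions.
import requests

BASE = "http://localhost:7860"

obs = requests.post(f"{BASE}/reset").json()  # start a fresh episode
schema = requests.post(f"{BASE}/step", json={"action": "inspect_schema"}).json()
result = requests.post(
    f"{BASE}/step",
    json={"action": "submit_query", "query": "SELECT id, name FROM users"},
).json()
print(result)  # expected to carry the blended reward and a done flag
```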

The reward, clamped to [0.0, 1.0], blends:

- `correctness` (0.0-0.6)
- `efficiency` (0.0-0.2)
- `syntax_progress` (0.0-0.1)
- `schema_bonus` (0.0-0.1)
- `penalties` (subtracted, magnitude 0.0-0.2)

## Task Suite

- Easy: `easy_syntax_fix`
- Medium: `medium_logic_fix`
- Hard: `hard_multi_bug`
- Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)

## Reliability and Validation

- `openenv validate --verbose`: PASS
- `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)

Live benchmark example:

curl "http://localhost:7860/benchmark?runs=20"

## Quick Start

### Local

```bash
pip install -r requirements.txt
uvicorn server.main:app --host 0.0.0.0 --port 7860
```

### Docker

```bash
docker build -t sql-debug-env .
docker run -p 7860:7860 sql-debug-env
```

## Baseline Inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export HF_TOKEN="$OPENAI_API_KEY"
export ENV_BASE_URL="http://localhost:7860"
export SEED="1"
python inference.py
```
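
For orientation, here is a condensed sketch of what such a baseline episode loop can look like; the actual runner is `inference.py`, and the prompt format plus all JSON fields below are assumptions:

```python
# Condensed baseline episode loop; prompt format and payload fields
# are assumptions, the real runner is inference.py.
import os
import requests
from openai import OpenAI

client = OpenAI(base_url=os.environ["API_BASE_URL"],
                api_key=os.environ["OPENAI_API_KEY"])
env = os.environ["ENV_BASE_URL"]

obs = requests.post(f"{env}/reset").json()  # broken query + task metadata
reply = client.chat.completions.create(
    model=os.environ["MODEL_NAME"],
    messages=[{"role": "user", "content": f"Fix this SQL task: {obs}"}],
)
candidate = reply.choices[0].message.content
result = requests.post(
    f"{env}/step",
    json={"action": "submit_query", "query": candidate},
).json()
print(result)  # expected to include the blended reward
```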

## Repository Structure

```text
sql-debug-env/
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ uv.lock
β”œβ”€β”€ inference.py
β”œβ”€β”€ launch_job.py
β”œβ”€β”€ presentation_graphs.py
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ main.py
β”‚   β”œβ”€β”€ gradio_ui.py
β”‚   β”œβ”€β”€ demo_page.html
β”‚   β”œβ”€β”€ env.py
β”‚   β”œβ”€β”€ models.py
β”‚   β”œβ”€β”€ database.py
β”‚   β”œβ”€β”€ reward.py
β”‚   β”œβ”€β”€ static/
β”‚   └── tasks/
└── tests/
```