
SQL Debug Env — Full Proof Verification Report

Date: 2026-04-23
Workspace: /Users/mdayan/Desktop/sql-debug-env
Branch/commit: main @ 9b71d1b

Executive Summary

Working (verified):

  • Core environment logic (server/env.py, server/database.py, task graders, reward shaping)
  • Unit tests (10/10) passing via unittest
  • FastAPI server endpoints respond correctly when exercised via curl
  • openenv validate --verbose passes (environment is “Ready for multi-mode deployment”)
  • Docker image build succeeds and the container serves /health, /tasks, /reset correctly

Not fully verified from this Codex sandbox (blocked by runtime constraints):

  • Python HTTP client scripts (scripts/benchmark_local.py, inference.py) cannot connect to localhost here due to sandbox socket restrictions (PermissionError: [Errno 1] Operation not permitted)

Potential “works-on-my-machine” risks (not failures in unit tests):

  • Locally installed package versions do not match the requirements.txt pins (the server still works in these checks, but reproducibility depends on using the pinned environment, e.g. Docker).
  • inference.py uses the openai Chat Completions style and hard-fails at import time if HF_TOKEN is missing; compatibility depends on the installed openai package's major version and on environment variables.
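The import-time hard failure could be avoided with a lazy guard. A minimal sketch, assuming inference.py currently reads HF_TOKEN at module import; `get_client` is a hypothetical helper name, not a function in the repo:

```python
import os

def get_client():
    """Build the OpenAI client lazily so importing this module never fails.

    Hypothetical helper: assumes the script currently does something like
    os.environ["HF_TOKEN"] at module import time.
    """
    token = os.environ.get("HF_TOKEN")
    if token is None:
        raise RuntimeError("HF_TOKEN is not set; export it before running inference.")
    # Imported here so the module still loads if the SDK is absent.
    from openai import OpenAI
    return OpenAI(api_key=token)
```

With this shape, `python3 inference.py --help` (or any import of the module) works without credentials; the failure surfaces only when a request is actually attempted.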

What’s Implemented (“What’s Done”)

This repo implements a deterministic SQL debugging RL environment with:

  • Typed action/observation/reward models (server/models.py)
  • In-memory SQLite episode DB per reset (server/database.py)
  • 3 deterministic tasks (easy/medium/hard) with schema + seed + expected output + graders (server/tasks/)
  • Dense reward shaping with strict clamping into (0, 1) for validator compatibility (server/reward.py)
  • OpenEnv-compatible HTTP API (server/main.py) with:
    • POST /reset, POST /step, GET /state
    • GET /tasks, GET /health, GET /benchmark
  • OpenEnv entrypoint wrapper (server/app.py)
  • Baseline agent runner that calls an OpenAI model + steps the env (inference.py)
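The reward shaping mentioned above can be sketched as follows; the weights and the epsilon are illustrative assumptions, not the actual values in server/reward.py:

```python
def compute_reward(correctness: float, efficiency: float,
                   progress: float, schema_bonus: float,
                   penalty: float) -> float:
    """Dense reward sketch: weighted sum of shaping terms minus penalties,
    then clamped strictly into the open interval (0, 1).
    Weights are illustrative, not the repo's."""
    raw = (0.6 * correctness + 0.15 * efficiency
           + 0.15 * progress + 0.1 * schema_bonus - penalty)
    eps = 1e-6  # keeps the score strictly inside (0, 1)
    return min(1.0 - eps, max(eps, raw))
```

The strict clamp is what keeps even a "perfect" or heavily penalized episode inside the range the OpenEnv validator expects.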

How the Approach Works (and Why)

Design intent

The environment is designed to be deterministic and gradeable:

  • Deterministic SQLite schema + seed data → same query always yields same result.
  • Deterministic expected outputs + graders → consistent scoring across runs/models.
  • Strict score clamping into (0, 1) → aligns with OpenEnv validator expectations.
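A minimal sketch of the determinism argument, using an illustrative schema and seed rather than the repo's actual tasks in server/database.py and server/tasks/:

```python
import sqlite3

def grade(query: str, expected_rows) -> float:
    """Run a query against a freshly seeded in-memory DB and grade it.
    Schema, seed data, and expected rows are illustrative stand-ins."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "ada"), (2, "grace")])
    rows = conn.execute(query).fetchall()
    conn.close()
    # Deterministic seed + deterministic expected output => stable score
    return 1.0 if rows == expected_rows else 0.0
```

Because the database is rebuilt from the same seed on every reset, the same query always produces the same rows, and the same rows always produce the same score.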

Runtime flow

  1. POST /reset creates a fresh SQLDebugEnv, which creates a new in-memory EpisodeDatabase and an EpisodeState.
  2. Each POST /step executes one action:
    • submit_query executes a SELECT-only SQL query, then grades rows.
    • inspect_schema / inspect_error / inspect_sample return info without affecting grading.
    • reset_query resets current_query and applies a penalty.
  3. compute_reward(...) returns a dense reward combining correctness/efficiency/progress/schema bonus minus penalties.
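The step dispatch above can be sketched in plain Python; the action and field names follow the report, but the internals (penalty size, return shape) are assumptions:

```python
def step(state: dict, action: dict) -> dict:
    """Hypothetical dispatch mirroring the step semantics described above."""
    kind = action["type"]
    state["steps_taken"] = state.get("steps_taken", 0) + 1
    if kind == "submit_query":
        state["current_query"] = action["sql"]
        # Grading of the SELECT-only query's rows would happen here.
        return {"graded": True}
    if kind in ("inspect_schema", "inspect_error", "inspect_sample"):
        return {"graded": False}  # info-only actions, no grading change
    if kind == "reset_query":
        state["current_query"] = ""
        state["penalty"] = state.get("penalty", 0.0) + 0.05  # illustrative
        return {"graded": False}
    raise ValueError(f"unknown action type: {kind}")
```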

Verification Environment

Python/runtime

  • Python: 3.14.2

Installed library versions (observed in this environment)

  • fastapi 0.128.0
  • uvicorn 0.40.0
  • pydantic 2.12.5
  • openai 2.30.0
  • httpx 0.28.1
  • openenv-core 0.2.3

Note: requirements.txt pins older versions (e.g. fastapi==0.115.0, uvicorn==0.30.6, pydantic==2.9.2).

Tests / Checks Run (with Results)

1) Unit tests

Command:

python3 -m unittest discover -s tests -p "test_*.py" -v

Result:

  • Ran 10 tests in 0.003s → OK

2) Bytecode compilation (syntax sanity)

Command:

python3 -m compileall -q .

Result:

  • No errors

3) Dependency sanity

Command:

python3 -m pip check

Result:

  • No broken requirements found.

4) OpenEnv structural validation

Command:

openenv validate --verbose

Result:

  • [OK] sql-debug-env: Ready for multi-mode deployment

5) Docker build + container smoke test

Commands:

# start daemon (example: Colima)
colima start

docker build -t sql-debug-env:localtest .
docker run --rm -p 17860:7860 sql-debug-env:localtest

Result (verified here):

  • docker build completed successfully.
  • Container responded with:
    • GET /health → 200 OK
    • GET /tasks → 3 tasks
    • POST /reset (tested with medium_logic_fix) → 200 OK

API Smoke Test (Local)

Server started (foreground) with:

uvicorn server.main:app --host 127.0.0.1 --port 7860

Verified endpoints (via curl)

  • GET /health → 200 OK with {"status":"ok","sessions_active":0}
  • GET /tasks → 200 OK with 3 tasks: easy_syntax_fix, medium_logic_fix, hard_multi_bug
  • POST /reset (x-session-id: smoke) → 200 OK and observation includes task_id and steps_taken=0
  • POST /step with:
    • inspect_schema → returns schema tables and small positive reward
    • submit_query (invalid table) → returns success=false, error recorded, not done
    • inspect_error → returns last error message
    • inspect_sample → returns 3 sample rows for a table
    • reset_query → resets query and returns min clamped reward
  • GET /state → returns episode state (task id, steps, best score)

What’s Broken / Blocked (Observed Here)

A) Python HTTP clients cannot connect to localhost in this Codex sandbox

Observed failures:

  • python3 scripts/benchmark_local.py → httpx.ConnectError: [Errno 1] Operation not permitted
  • urllib.request.urlopen("http://127.0.0.1:7860/health") → PermissionError: [Errno 1] Operation not permitted

Implication:

  • Any verification path that depends on Python making TCP connections (including inference.py) cannot be “fully proved” from this sandbox session.
  • The server itself works (verified via curl), so this appears to be a sandbox constraint, not necessarily a repo bug.

Recommended Next Proof Steps (If You Want CI-Grade Confidence)

  • Add an integration test using FastAPI’s TestClient (no real sockets needed) to cover /reset, /step, /state.
  • Add a Docker build + container smoke test in CI to ensure pinned deps and entrypoints stay healthy.
  • Decide whether to:
    • Pin openai<2 (to match chat.completions usage), or
    • Update inference.py to the current OpenAI client style and avoid import-time hard failure when env vars are missing.