# SQL Debug Env — Full Proof Verification Report

Date: 2026-04-23  
Workspace: `/Users/mdayan/Desktop/sql-debug-env`  
Branch/commit: `main` @ `9b71d1b`

## Executive Summary

**Working (verified):**
- Core environment logic (`server/env.py`, `server/database.py`, task graders, reward shaping)
- Unit tests (10/10) passing via `unittest`
- FastAPI server endpoints respond correctly when exercised via `curl`
- `openenv validate --verbose` passes (environment is “Ready for multi-mode deployment”)
- Docker image build succeeds and the container serves `/health`, `/tasks`, `/reset` correctly

**Not fully verified from this Codex sandbox (blocked by runtime constraints):**
- Python HTTP client scripts (`scripts/benchmark_local.py`, `inference.py`) cannot connect to `localhost` here due to sandbox socket restrictions (`PermissionError: [Errno 1] Operation not permitted`)

**Potential “works-on-my-machine” risks (not failures in unit tests):**
- Local installed package versions do **not** match `requirements.txt` pins (server still works in these checks, but reproducibility depends on using the pinned environment, e.g. Docker).
- `inference.py` uses `openai` Chat Completions style and hard-fails at import-time if `HF_TOKEN` is missing; compatibility depends on the installed `openai` package major version and env vars.

## What’s Implemented (“What’s Done”)

This repo implements a deterministic SQL debugging RL environment with:
- **Typed action/observation/reward** models (`server/models.py`)
- **In-memory SQLite episode DB** per reset (`server/database.py`)
- **3 deterministic tasks** (easy/medium/hard) with schema + seed + expected output + graders (`server/tasks/`)
- **Dense reward shaping** with strict clamping into `(0, 1)` for validator compatibility (`server/reward.py`)
- **OpenEnv-compatible HTTP API** (`server/main.py`) with:
  - `POST /reset`, `POST /step`, `GET /state`
  - `GET /tasks`, `GET /health`, `GET /benchmark`
- **OpenEnv entrypoint** wrapper (`server/app.py`)
- **Baseline agent runner** that calls an OpenAI model + steps the env (`inference.py`)
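The per-episode in-memory database can be sketched roughly as follows. This is a minimal illustration with a hypothetical `orders` schema and seed data, not the repo's actual task schemas (those live in `server/tasks/`):

```python
import sqlite3

def make_episode_db(schema_sql: str, seed_rows: list[tuple]) -> sqlite3.Connection:
    """Create a fresh in-memory SQLite DB for one episode.

    Each /reset gets its own connection, so episodes never share state,
    and the same schema + seed always yields the same query results.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", seed_rows)
    conn.commit()
    return conn

# Hypothetical example schema/seed (not the repo's actual task data):
SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);"
SEED = [(1, "ada", 9.5), (2, "bob", 12.0)]

db = make_episode_db(SCHEMA, SEED)
rows = db.execute("SELECT customer FROM orders ORDER BY id").fetchall()
# Deterministic: always [('ada',), ('bob',)]
```

Because the connection is `:memory:`, dropping it at episode end discards all state with no cleanup step.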

## How the Approach Works (and Why)

### Design intent
The environment is designed to be **deterministic** and **gradeable**:
- Deterministic SQLite schema + seed data → same query always yields same result.
- Deterministic expected outputs + graders → consistent scoring across runs/models.
- Strict score clamping into `(0, 1)` → aligns with OpenEnv validator expectations.
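Concretely, deterministic grading can be as simple as comparing the rows a submitted query returns against the stored expected output. This is a sketch with a hypothetical partial-credit scheme; the repo's actual graders in `server/tasks/` may score differently:

```python
def grade(actual_rows, expected_rows) -> float:
    """Return 1.0 for an exact match, else the fraction of expected rows found.

    Set-based comparison for partial credit; deterministic given
    deterministic inputs. Hypothetical formula, not the repo's exact one.
    """
    if actual_rows == expected_rows:
        return 1.0
    expected = set(map(tuple, expected_rows))
    if not expected:
        return 0.0
    hits = sum(1 for row in actual_rows if tuple(row) in expected)
    return hits / len(expected)

print(grade([(1, "a"), (2, "b")], [(1, "a"), (2, "b")]))  # 1.0
print(grade([(1, "a")], [(1, "a"), (2, "b")]))            # 0.5
```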

### Runtime flow
1. `POST /reset` creates a fresh `SQLDebugEnv`, which creates a new in-memory `EpisodeDatabase` and an `EpisodeState`.
2. Each `POST /step` executes one action:
   - `submit_query` executes a **SELECT-only** SQL query, then grades rows.
   - `inspect_schema` / `inspect_error` / `inspect_sample` return information without affecting grading.
   - `reset_query` resets `current_query` and applies a penalty.
3. `compute_reward(...)` returns a dense reward combining correctness/efficiency/progress/schema bonus minus penalties.
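The shape of that dense reward can be sketched as below. The weights and component names here are illustrative (the real formula lives in `server/reward.py`); the key property is the strict clamp into the open interval `(0, 1)`:

```python
def compute_reward(correctness: float, efficiency: float,
                   progress: float, schema_bonus: float,
                   penalty: float, eps: float = 1e-6) -> float:
    """Combine shaped components, then clamp strictly into (0, 1).

    Weights are illustrative, not the repo's actual values. The clamp
    keeps even a zero or over-full raw score inside the open interval
    the OpenEnv validator expects.
    """
    raw = (0.6 * correctness + 0.15 * efficiency
           + 0.15 * progress + 0.1 * schema_bonus - penalty)
    return min(1.0 - eps, max(eps, raw))
```

A heavily penalized step thus still returns `eps` rather than 0, and a perfect step returns `1 - eps` rather than 1.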

## Verification Environment

### Python/runtime
- Python: `3.14.2`

### Installed library versions (observed in this environment)
- `fastapi 0.128.0`
- `uvicorn 0.40.0`
- `pydantic 2.12.5`
- `openai 2.30.0`
- `httpx 0.28.1`
- `openenv-core 0.2.3`

Note: `requirements.txt` pins older versions (e.g. `fastapi==0.115.0`, `uvicorn==0.30.6`, `pydantic==2.9.2`).

## Tests / Checks Run (with Results)

### 1) Unit tests
Command:
```bash
python3 -m unittest discover -s tests -p "test_*.py" -v
```
Result:
- `Ran 10 tests in 0.003s` → `OK`

### 2) Bytecode compilation (syntax sanity)
Command:
```bash
python3 -m compileall -q .
```
Result:
- No errors

### 3) Dependency sanity
Command:
```bash
python3 -m pip check
```
Result:
- `No broken requirements found.`

### 4) OpenEnv structural validation
Command:
```bash
openenv validate --verbose
```
Result:
- `[OK] sql-debug-env: Ready for multi-mode deployment`

### 5) Docker build + container smoke test
Commands:
```bash
# start daemon (example: Colima)
colima start

docker build -t sql-debug-env:localtest .
docker run --rm -p 17860:7860 sql-debug-env:localtest
```
Result (verified here):
- `docker build` completed successfully.
- Container responded with:
  - `GET /health` → `200 OK`
  - `GET /tasks` → 3 tasks
  - `POST /reset` (tested with `medium_logic_fix`) → `200 OK`

## API Smoke Test (Local)

Server started (foreground) with:
```bash
uvicorn server.main:app --host 127.0.0.1 --port 7860
```

### Verified endpoints (via `curl`)
- `GET /health` → `200 OK` with `{"status":"ok","sessions_active":0}`
- `GET /tasks` → `200 OK` with 3 tasks: `easy_syntax_fix`, `medium_logic_fix`, `hard_multi_bug`
- `POST /reset` (`x-session-id: smoke`) → `200 OK` and observation includes `task_id` and `steps_taken=0`
- `POST /step` with:
  - `inspect_schema` → returns schema tables and small positive reward
  - `submit_query` (invalid table) → returns `success=false`, error recorded, not done
  - `inspect_error` → returns last error message
  - `inspect_sample` → returns 3 sample rows for a table
  - `reset_query` → resets query and returns min clamped reward
- `GET /state` → returns episode state (task id, steps, best score)

## What’s Broken / Blocked (Observed Here)

### A) Python HTTP clients cannot connect to localhost in this Codex sandbox
Observed failures:
- `python3 scripts/benchmark_local.py` → `httpx.ConnectError: [Errno 1] Operation not permitted`
- `urllib.request.urlopen("http://127.0.0.1:7860/health")` → `PermissionError: [Errno 1] Operation not permitted`

Implication:
- Any verification path that depends on Python making TCP connections (including `inference.py`) cannot be “fully proved” from this sandbox session.
- The server itself works (verified via `curl`), so this appears to be a sandbox constraint, not necessarily a repo bug.

## Recommended Next Proof Steps (If You Want CI-Grade Confidence)

- Add an integration test using FastAPI’s `TestClient` (no real sockets needed) to cover `/reset`, `/step`, `/state`.
- Add a Docker build + container smoke test in CI to ensure pinned deps and entrypoints stay healthy.
- Decide whether to:
  - Pin `openai<2` (to match `chat.completions` usage), or
  - Update `inference.py` to the current OpenAI client style and avoid import-time hard failure when env vars are missing.