md896 committed
Commit 00849df · 1 Parent(s): 18e112b

Initial OpenEnv SQL debug environment

Files changed (1): README.md (+154 −149)

README.md CHANGED
@@ -1,143 +1,125 @@
- ---
- title: sql-debug-env
- emoji: "🧪"
- colorFrom: blue
- colorTo: green
- sdk: docker
- pinned: false
- ---
-
  # SQL Debug Environment (`sql-debug-env`)
  
- ![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f)
- ![Docker](https://img.shields.io/badge/Deploy-Docker-2496ED?logo=docker&logoColor=white)
  ![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)
  ![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white)
  ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
- ![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white)
- ![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)
- ![OpenAI](https://img.shields.io/badge/OpenAI-Baseline_API-412991?logo=openai&logoColor=white)
-
- **Deterministic OpenEnv benchmark for real SQL debugging workflows.**
-
- **Quick links:** [Live Space](https://md896-sql-debug-env.hf.space) · [Swagger](https://md896-sql-debug-env.hf.space/docs) · [OpenAPI](https://md896-sql-debug-env.hf.space/openapi.json) · [GitHub](https://github.com/mdayan8/sql-debug-env)
-
- An OpenEnv environment for a real engineering workflow: SQL query debugging. Agents iterate on broken SQL using schema/error/sample inspection until they produce the expected result.
-
- ## Abstract
- This project implements a deterministic OpenEnv benchmark for SQL debugging. It includes three graded tasks (easy -> medium -> hard), typed action/observation/reward models, dense reward shaping, reproducible behavior, Docker deployment, and a baseline inference runner with strict structured logs.
-
- ## Why this matters
- - SQL debugging is a daily task in analytics and backend teams.
- - Deterministic graders allow fair model comparison.
- - Dense reward shaping supports step-by-step agent learning.
- - Fast local runtime enables quick iteration and validation.
-
- ## Core Components
- - API layer: `server/main.py`
- - Environment engine: `server/env.py`
- - Episode database: `server/database.py` (in-memory SQLite)
- - Typed models: `server/models.py`
- - Reward logic: `server/reward.py`
- - Task + graders: `server/tasks/`
- - Baseline runner: `inference.py`
-
- ## Architecture
- ```mermaid
- flowchart LR
-   agent[Agent Or Evaluator] --> api[FastAPI API Layer]
-   api --> env[SQLDebugEnv]
-   env --> db[InMemory SQLite DB]
-   env --> tasks[Task Registry easy medium hard]
-   tasks --> grader[Deterministic Grader]
-   env --> reward[Reward Engine]
-   grader --> reward
-   reward --> api
- ```
 
- ## API Surface
- - `POST /reset`
- - `POST /step`
- - `GET /state`
- - `GET /tasks`
- - `GET /health`
- - `GET /benchmark`
  
- ## API Docs
  - Swagger UI: `http://localhost:7860/docs`
  - ReDoc: `http://localhost:7860/redoc`
- - OpenAPI: `http://localhost:7860/openapi.json`
  
  ## Action Space
- | Action | Required fields | Purpose |
  |---|---|---|
- | `submit_query` | `query` | Submit SQL candidate for execution + grading |
- | `inspect_schema` | none | Return schema metadata |
- | `inspect_error` | none | Return last execution error details |
- | `inspect_sample` | `table_name` | Return sample rows from table |
- | `reset_query` | none | Reset current query to original broken query |
-
- ## Reward Design
- Reward is clamped to `[0.0, 1.0]` and combines:
- - correctness (`0.0-0.6`)
- - efficiency (`0.0-0.2`)
- - syntax_progress (`0.0-0.1`)
- - schema_bonus (`0.0-0.1`)
- - penalty deduction magnitude (`0.0-0.2`)
-
- ## Episode Lifecycle
- 1. Client calls `POST /reset` with optional `task_id`.
- 2. Environment creates a fresh in-memory SQLite DB seeded for that task.
- 3. Client iteratively calls `POST /step` with one action at a time.
- 4. Server returns `(observation, reward, done, info)` each step.
- 5. Episode ends on high grade score or max-step boundary.
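The five steps above can be sketched as a minimal client loop. This is illustrative only, not code from the repo: `reset_fn` and `step_fn` are hypothetical callables standing in for HTTP calls to `POST /reset` and `POST /step`.

```python
def run_episode(reset_fn, step_fn, actions, max_steps=10):
    """Drive one episode: reset, then step until done or the step budget runs out."""
    obs = reset_fn()                                  # POST /reset
    total_reward, done = 0.0, False
    for action in actions[:max_steps]:
        if done:
            break
        obs, reward, done, info = step_fn(action)     # POST /step
        total_reward += reward
    return obs, total_reward, done
```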
-
- ## Task Suite
- - Easy: `easy_syntax_fix`
- - Medium: `medium_logic_fix`
- - Hard: `hard_multi_bug`
-
- ## Repository Structure
- ```text
- sql-debug-env/
- ├── Dockerfile
- ├── openenv.yaml
- ├── inference.py
- ├── README.md
- ├── requirements.txt
- ├── pyproject.toml
- ├── uv.lock
- ├── scripts/
- │   └── benchmark_local.py
- ├── server/
- │   ├── main.py
- │   ├── env.py
- │   ├── models.py
- │   ├── database.py
- │   ├── reward.py
- │   └── tasks/
- │       ├── base.py
- │       ├── task_easy.py
- │       ├── task_medium.py
- │       └── task_hard.py
- └── tests/
-     ├── test_env.py
-     ├── test_graders.py
-     └── test_reward.py
- ```
  
- ## Reliability and Benchmarking
- - `openenv validate --verbose`: PASS
- - `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- - Docker smoke test: PASS (`/health`, `/tasks`, `/reset`, `/step`)
  
- Live benchmark endpoint:
  ```bash
- curl "http://localhost:7860/benchmark?runs=20"
  ```
  
- ## Quick Start
- ### Local
  ```bash
  pip install -r requirements.txt
  uvicorn server.main:app --host 0.0.0.0 --port 7860
@@ -149,40 +131,63 @@ docker build -t sql-debug-env .
  docker run -p 7860:7860 sql-debug-env
  ```
  
- ### Baseline Inference
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"
  export OPENAI_API_KEY="your-key"
- export HF_TOKEN="$OPENAI_API_KEY"
  export ENV_BASE_URL="http://localhost:7860"
  export SEED="1"
  python inference.py
  ```
  
- ## Required Environment Variables
- | Variable | Required | Purpose |
- |---|---|---|
- | `API_BASE_URL` | Yes (for baseline) | LLM API base endpoint |
- | `MODEL_NAME` | Yes (for baseline) | Model ID for inference |
- | `OPENAI_API_KEY` | Yes (for baseline) | OpenAI client authentication |
- | `HF_TOKEN` | Recommended | Compatibility with evaluator instructions |
- | `ENV_BASE_URL` | Yes (for baseline) | Environment server URL |
- | `SEED` | Optional | Reproducibility control |
-
- ## Submission Validation Checklist
- - Space URL is live and `/health` returns `200`.
- - `/reset` returns a valid observation payload.
- - `openenv validate --verbose` passes.
- - Docker build and run succeed locally.
- - `inference.py` is in repo root and emits `[START]`, `[STEP]`, `[END]`.
- - `openenv.yaml` has correct deployed `api.base_url`.
- - `.env` and `.cursor/` are ignored in git.
-
- ## Hugging Face Spaces
- Verify deployment:
  ```bash
- curl https://md896-sql-debug-env.hf.space/health
- curl -X POST https://md896-sql-debug-env.hf.space/reset -H "Content-Type: application/json" -d '{}'
- curl https://md896-sql-debug-env.hf.space/docs
  ```

  # SQL Debug Environment (`sql-debug-env`)
  
  ![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)
  ![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white)
  ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
+ ![SQLite](https://img.shields.io/badge/SQLite-In_Memory-003B57?logo=sqlite&logoColor=white)
+ ![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?logo=docker&logoColor=white)
+ ![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f)
  
+ An OpenEnv environment for a real task people do every day: **debugging SQL**. The agent gets a broken query, a live (in-memory) SQLite database, and a description of the expected output. It can inspect schema/errors/samples and submit fixed queries until it solves the task.
+
+ ## What’s in this repo
+ - **FastAPI server**: `server/main.py` (endpoints: `/health`, `/tasks`, `/reset`, `/step`, `/state`)
+ - **Environment logic**: `server/env.py` + `server/database.py`
+ - **Tasks**: `server/tasks/` (easy → medium → hard, deterministic seed data)
+ - **Baseline agent**: `inference.py` (OpenAI client + `[START]/[STEP]/[END]` logs)
+
+ ## Tech Stack
+ - Python 3.11+
+ - FastAPI + Uvicorn
+ - Pydantic v2
+ - SQLite (in-memory)
+ - OpenEnv Core
+ - Docker
+ - OpenAI Python SDK (baseline inference)
+
+ ## Production Notes
+ - Stateless HTTP API with per-session environment instances keyed by `X-Session-Id`
+ - Deterministic task data (in-memory SQLite) for reproducible grading
+ - Reward clamped to `[0.0, 1.0]` with partial-progress shaping
+ - Docker-first deployment path (local and Hugging Face Spaces)
+ - Local benchmark endpoint for live latency checks (`/benchmark`)
+
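The per-session pattern in the first bullet can be sketched as a registry mapping `X-Session-Id` values to environment instances. This is an illustrative sketch, not the repo's actual implementation; `factory` is a hypothetical callable that builds a fresh environment.

```python
class SessionRegistry:
    """Map X-Session-Id header values to per-session environment instances."""

    def __init__(self, factory):
        self._factory = factory   # callable that builds a fresh env
        self._envs = {}

    def get(self, session_id: str):
        # Create the environment lazily on the first request for this session;
        # later requests with the same id reuse the same instance.
        if session_id not in self._envs:
            self._envs[session_id] = self._factory()
        return self._envs[session_id]
```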
+ ## API Docs (FastAPI Auto Docs)
+ Use these for interactive testing in the browser:
+
  - Swagger UI: `http://localhost:7860/docs`
  - ReDoc: `http://localhost:7860/redoc`
+ - OpenAPI spec: `http://localhost:7860/openapi.json`
  
  ## Action Space
+ | Action | Required fields | Cost / reward effect |
  |---|---|---|
+ | `submit_query` | `query` | Main evaluation step (dense reward based on grading) |
+ | `inspect_schema` | none | Free information action (small positive reward component) |
+ | `inspect_error` | none | Free information action (small positive reward component) |
+ | `inspect_sample` | `table_name` | Free information action (small positive reward component) |
+ | `reset_query` | none | Penalty action (reduces reward for that step) |
+
+ ## Observation Space
+ | Field | Type |
+ |---|---|
+ | `task_id` | `string` |
+ | `task_description` | `string` |
+ | `original_query` | `string` |
+ | `current_query` | `string_or_null` |
+ | `expected_description` | `string` |
+ | `last_action_type` | `string` |
+ | `last_query_result` | `object_or_null` |
+ | `steps_taken` | `integer` |
+ | `steps_remaining` | `integer` |
+ | `current_score` | `float` |
+ | `schema_info` | `object_or_null` |
+ | `error_details` | `string_or_null` |
+ | `sample_rows` | `array_or_null` |
+ | `hint` | `string_or_null` |
+ | `is_done` | `boolean` |
+ | `success` | `boolean` |
+
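For clients that want typed access to this payload, the table maps naturally onto a dataclass. A sketch with field names taken from the table above; the defaults for nullable fields are an assumption, not the server's Pydantic models.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    # Required fields (always present in the payload)
    task_id: str
    task_description: str
    original_query: str
    expected_description: str
    last_action_type: str
    steps_taken: int
    steps_remaining: int
    current_score: float
    is_done: bool
    success: bool
    # Nullable fields (populated by inspect_* actions and query results)
    current_query: Optional[str] = None
    last_query_result: Optional[dict] = None
    schema_info: Optional[dict] = None
    error_details: Optional[str] = None
    sample_rows: Optional[list] = None
    hint: Optional[str] = None
```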
+ ## Reward Function
+ | Component | Range | Description |
+ |---|---|---|
+ | `correctness` | `[0.0, 0.6]` | Row-level match vs expected output |
+ | `efficiency` | `[0.0, 0.2]` | Bonus for solving with fewer steps |
+ | `syntax_progress` | `[0.0, 0.1]` | Small reward for producing syntactically valid SQL |
+ | `schema_bonus` | `[0.0, 0.1]` | Bonus for referencing correct tables/columns |
+ | `penalty` | `[0.0, 0.2]` | Deduction magnitude for resets/regressions/urgency near step limit |
+
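One plausible way these components combine, consistent with the ranges in the table: positive components added, the penalty subtracted, and the result clamped to `[0.0, 1.0]`. The exact formula lives in `server/reward.py`; this additive clamp is an assumption.

```python
def combine_reward(correctness: float, efficiency: float,
                   syntax_progress: float, schema_bonus: float,
                   penalty: float) -> float:
    """Additive shaping: positive components minus penalty, clamped to [0, 1]."""
    raw = correctness + efficiency + syntax_progress + schema_bonus - penalty
    return max(0.0, min(1.0, raw))
```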
+ ## Tasks
+ ### Task 1: Easy — Syntax Error Fix (`easy_syntax_fix`)
+ Two straightforward issues: a misspelled keyword (`GRUP BY`) and an `ORDER BY` alias mismatch.
+
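The misspelled keyword can be reproduced directly with Python's built-in `sqlite3`. The table and data here are illustrative, not the task's actual seed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 10.0), ("a", 5.0), ("b", 7.0)])

broken = "SELECT customer, SUM(amount) FROM orders GRUP BY customer"
fixed = "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer ORDER BY total"

try:
    conn.execute(broken)              # misspelled keyword -> OperationalError
except sqlite3.OperationalError as e:
    print("broken:", e)               # sqlite reports a syntax error near "GRUP"

rows = conn.execute(fixed).fetchall()
print(rows)                           # [('b', 7.0), ('a', 15.0)]
```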
+ ### Task 2: Medium — Logic Error Fix (`medium_logic_fix`)
+ Logic bugs around outer joins + filtering scope + aggregation scope.
+
+ ### Task 3: Hard — Multi-Bug Fix (`hard_multi_bug`)
+ Five bugs across correlated subqueries, window functions, CTE scope, date logic, and duplication.
+
+ ## Baseline
+ The baseline script is intentionally simple: it loops `reset → step` and asks an OpenAI model to choose the next JSON action.
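The riskiest part of such a loop is trusting the model's JSON reply. A minimal parse-and-validate sketch; the helper name is made up, and the required fields follow the Action Space table above:

```python
import json

ALLOWED = {"submit_query", "inspect_schema", "inspect_error",
           "inspect_sample", "reset_query"}
REQUIRED = {"submit_query": "query", "inspect_sample": "table_name"}

def parse_model_action(text: str) -> dict:
    """Parse the model's JSON reply and check it against the action space."""
    action = json.loads(text)
    kind = action.get("action_type")
    if kind not in ALLOWED:
        raise ValueError(f"unknown action_type: {kind!r}")
    needed = REQUIRED.get(kind)
    if needed and needed not in action:
        raise ValueError(f"{kind} requires field {needed!r}")
    return action
```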

+ ## Reliability & Benchmarking
+
+ ### Verified status (local)
+ - `openenv validate --verbose`: **PASS**
+ - `python3 -m unittest discover -s tests -p "test_*.py"`: **10/10 PASS**
+ - Docker smoke test: **PASS** (`/health`, `/tasks`, `/reset`, `/step`)
+ - FastAPI docs available: **PASS** (`/docs`, `/redoc`, `/openapi.json`)
+
+ ### Endpoint benchmark (local Docker run, n=25)
+ Measured with `scripts/benchmark_local.py` on a running local container:
+
+ | Endpoint | avg | p50 | p95 |
+ |---|---:|---:|---:|
+ | `GET /health` | 0.69 ms | 0.67 ms | 0.76 ms |
+ | `GET /tasks` | 0.82 ms | 0.81 ms | 0.90 ms |
+ | `POST /reset` | 1.34 ms | 1.26 ms | 1.62 ms |
+ | `POST /step` (`inspect_schema`) | 1.07 ms | 1.01 ms | 1.34 ms |
+
+ Re-run anytime:
+
  ```bash
+ python3 scripts/benchmark_local.py
  ```
  
+ Notes:
+ - These are local-machine numbers (single container, warm runtime).
+ - For submission-grade reporting, also capture one run against your HF Space URL after deploy.
+
+ ## Setup & Usage
+
+ ### Local Development
  ```bash
  pip install -r requirements.txt
  uvicorn server.main:app --host 0.0.0.0 --port 7860

  docker run -p 7860:7860 sql-debug-env
  ```
  
+ ### Quick smoke test
+ ```bash
+ curl http://localhost:7860/health
+ curl http://localhost:7860/tasks
+ curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id":"easy_syntax_fix"}'
+ curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d '{"action":{"action_type":"inspect_schema"}}'
+ curl "http://localhost:7860/benchmark?runs=20"
+ ```
+
+ ### Real-time benchmark API (for dashboards/web pages)
+ This is a live endpoint, not static/dummy data. Every request runs fresh measurements.
+
+ - Endpoint: `GET /benchmark?runs=20`
+ - `runs` range: `1` to `100`
+ - Returns JSON with `avg_ms`, `p50_ms`, `p95_ms`, `n`, and a fresh `timestamp_epoch_ms`
+
+ Example:
+ ```bash
+ curl "http://localhost:7860/benchmark?runs=30"
+ ```
+
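The `p50_ms`/`p95_ms` figures such an endpoint returns can be computed with a nearest-rank percentile. This sketch assumes that method; the server's exact computation may differ.

```python
def percentile(samples, q):
    """Nearest-rank percentile: q in [0, 100] over a list of latencies (ms)."""
    ordered = sorted(samples)
    # ceil(q/100 * n) as a 1-based rank (integer ceiling via floor division)
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[int(rank) - 1]

# Illustrative latency samples, not real measurements
latencies = [0.61, 0.66, 0.67, 0.70, 0.76, 0.81, 0.90, 1.01, 1.26, 1.62]
summary = {
    "n": len(latencies),
    "avg_ms": round(sum(latencies) / len(latencies), 2),
    "p50_ms": percentile(latencies, 50),
    "p95_ms": percentile(latencies, 95),
}
```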
+ ### Run Baseline
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"
  export OPENAI_API_KEY="your-key"
  export ENV_BASE_URL="http://localhost:7860"
+ export HF_TOKEN="$OPENAI_API_KEY"
  export SEED="1"
  python inference.py
  ```
  
+ ### OpenEnv Validation
  ```bash
+ pip install openenv-core
+ openenv validate
  ```
+
+ ### Suggested pre-submit check
+ ```bash
+ openenv validate --verbose
+ python3 -m unittest discover -s tests -p "test_*.py"
+ docker build -t sql-debug-env .
+ docker run --rm -p 7860:7860 sql-debug-env
+ # in another terminal:
+ curl -s http://localhost:7860/health
+ curl -s http://localhost:7860/docs >/dev/null
+ curl -s "http://localhost:7860/benchmark?runs=20"
+ ```
+
+ ## Hugging Face Spaces (Docker)
+ 1. Create a new **Space → Docker**.
+ 2. Push this repo.
+ 3. Update `openenv.yaml` → `api.base_url` to your Space URL: `https://<your-space>.hf.space`
+ 4. Wait for the build, then verify:
+
+ ```bash
+ curl -X POST https://<your-space>.hf.space/reset -H "Content-Type: application/json" -d '{}'
+ ```