Spaces: krishpotanwar/sql-repair-env

krishpotanwar committed on Apr 8

Commit 269f632 (0 parents)

feat: SQL Repair OpenEnv submission — Phase 1 validator passes

- 3 SQL repair tasks (easy/medium/hard) with SQLite-backed env
- FastAPI server with all required endpoints (/health /tasks /reset /step /grader /baseline)
- /reset accepts empty body (Phase 1 requirement)
- inference.py: HTTP client + OpenAI-compatible LLM caller
- Strict (0,1) score clamping with NaN/inf -> 0.5 fallback
- Every task emits exactly one [START]/[END] even on crash (Phase 2 lesson)
- Sterile stdout: only bracket lines on stdout, diagnostics on stderr
- pyproject.toml + uv.lock + server/app.py:main + openenv-core>=0.2.0
- openenv validate .: PASS
- 8/8 unit tests pass

Files changed (17)

.dockerignore +13 -0
.gitignore +21 -0
Dockerfile +27 -0
README.md +99 -0
inference.py +425 -0
openenv.yaml +28 -0
pyproject.toml +30 -0
requirements.txt +7 -0
server/__init__.py +1 -0
server/app.py +142 -0
sql_env/__init__.py +14 -0
sql_env/env_core.py +174 -0
sql_env/grader.py +64 -0
sql_env/tasks.py +81 -0
tests/__init__.py +0 -0
tests/test_smoke.py +110 -0
uv.lock +0 -0

.dockerignore ADDED

	@@ -0,0 +1,13 @@

+__pycache__
+*.pyc
+*.pyo
+.venv
+.env
+.git
+.pytest_cache
+.ruff_cache
+.mypy_cache
+tests/
+*.egg-info
+build/
+dist/

.gitignore ADDED

	@@ -0,0 +1,21 @@

+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+.venv/
+venv/
+env/
+.env
+.env.local
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+*.egg-info/
+build/
+dist/
+.DS_Store
+.claude-flow/
+.swarm/
+.claude/

Dockerfile ADDED

	@@ -0,0 +1,27 @@

+FROM python:3.11-slim
+WORKDIR /app
+# System deps (curl for healthchecks)
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        curl \
+    && rm -rf /var/lib/apt/lists/*
+# Copy & install Python deps first for layer caching
+COPY requirements.txt ./
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy application
+COPY . .
+# Install package so [project.scripts] is callable
+RUN pip install --no-cache-dir -e .
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PORT=8000
+EXPOSE 8000
+# Launch the server module directly (same main() as the `server` script in pyproject.toml)
+CMD ["python", "-m", "server.app"]

README.md ADDED

	@@ -0,0 +1,99 @@

+# SQL Repair OpenEnv
+An OpenEnv environment for the **Meta PyTorch x Scaler hackathon** where
+agents repair broken SQL queries against a small SQLite schema.
+## Tasks
+| ID       | Difficulty | What's broken                                  |
+|----------|------------|------------------------------------------------|
+| `task_1` | easy       | SELECT list missing commas                     |
+| `task_2` | medium     | JOIN references columns that don't exist       |
+| `task_3` | hard       | Aggregate query missing GROUP BY               |
+Each task gives the agent the schema, the broken query, the runtime error
+(if any), and a one-line hint. The agent submits a corrected query via the
+`/step` endpoint and is scored on whether the result rows match the
+canonical expected rows.
+## Architecture
+```
+.
+├── pyproject.toml         # uv project, server entry point
+├── uv.lock                # uv lockfile
+├── Dockerfile             # builds the env server image
+├── inference.py           # AGENT — talks to the env via HTTP, calls an LLM
+├── openenv.yaml           # OpenEnv metadata
+├── server/
+│   └── app.py             # FastAPI env server (def main)
+├── sql_env/
+│   ├── env_core.py        # SQLite-backed env state
+│   ├── tasks.py           # Task definitions
+│   └── grader.py          # Strict (0, 1) score clamping
+└── tests/
+    └── test_smoke.py      # Pytest smoke suite
+```
+## HTTP API
+| Method | Path        | Body                                      | Returns                              |
+|--------|-------------|-------------------------------------------|--------------------------------------|
+| GET    | `/health`   | —                                         | `{"status":"ok"}`                    |
+| GET    | `/tasks`    | —                                         | task list + metadata                 |
+| POST   | `/reset`    | `{"task_id":"task_1"}` (optional)         | observation                          |
+| POST   | `/step`     | `{"action":{"action_type":"submit_query","query":"..."}}` | observation/reward/done |
+| POST   | `/grader`   | `{"task_id":"task_1"}`                    | `{"score": float in (0,1)}`          |
+| POST   | `/baseline` | `{"tasks":[...]}` (optional)              | scores for all tasks                 |
+`/reset` accepts an empty body and defaults to `task_1` — required by the
+OpenEnv validator.
+## Running locally
+```bash
+# 1. Install
+uv sync                       # or: pip install -e . && pip install -r requirements.txt
+# 2. Start the env server
+python -m server.app          # listens on http://localhost:8000
+# 3. Run the agent (in another terminal)
+export HF_TOKEN=<your-groq-or-openai-key>
+export API_BASE_URL=https://api.groq.com/openai/v1
+export MODEL_NAME=llama-3.3-70b-versatile
+python inference.py
+```
+Expected output:
+```
+[START] task_1
+[STEP] 01 | task=task_1 | action=submit_query | reward=+1.0000 | matches=True | rows=5
+[END] task_1 | score=0.9890 | status=ok
+[START] task_2
+...
+```
+## Environment variables
+| Name             | Default                                  | Notes                                       |
+|------------------|------------------------------------------|---------------------------------------------|
+| `API_BASE_URL`   | `https://api.groq.com/openai/v1`         | Required by OpenEnv submission checklist    |
+| `MODEL_NAME`     | `llama-3.3-70b-versatile`                | Required by OpenEnv submission checklist    |
+| `HF_TOKEN`       | (none — must be set in HF Space Secrets) | Required by OpenEnv submission checklist    |
+| `LOCAL_IMAGE_NAME` | (unset)                                | If set, inference.py boots a Docker image   |
+| `ENV_URL`        | `http://localhost:8000`                  | Where the env server is reachable           |
+## Validation
+```bash
+# Phase 1 — official OpenEnv validator
+uvx --from openenv-core openenv validate .
+# Smoke tests
+python -m pytest tests/ -q
+```
+No API keys are hardcoded in this repo. The agent reads `HF_TOKEN` (with
+optional `GROQ_API_KEY`/`OPENAI_API_KEY` fallbacks) at runtime only.
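
The bracket-line format shown under "Expected output" can be read back with a small regex. The sketch below is only an illustration of how a consumer might extract scores from stdout. The pattern and the `parse_end_lines` helper are assumptions made for this example and are not taken from the actual validator.

```python
import re
from typing import Dict

# Hypothetical pattern for lines such as:
#   [END] task_1 | score=0.9890 | status=ok
END_RE = re.compile(
    r"^\[END\]\s+(?P<task>\S+)\s*\|\s*score=(?P<score>\d+(?:\.\d+)?)"
    r"\s*\|\s*status=(?P<status>\S+)$"
)


def parse_end_lines(stdout_text: str) -> Dict[str, float]:
    """Collect task -> score from [END] lines and ignore every other line."""
    scores: Dict[str, float] = {}
    for line in stdout_text.splitlines():
        m = END_RE.match(line.strip())
        if m:
            scores[m.group("task")] = float(m.group("score"))
    return scores


sample = """[START] task_1
[STEP] 01 | task=task_1 | action=submit_query | reward=+1.0000 | matches=True | rows=5
[END] task_1 | score=0.9890 | status=ok
"""
print(parse_end_lines(sample))
```

Because only `[END]` lines carry the score, any stray print on stdout would at best be ignored and at worst confuse a stricter parser, which is presumably why the agent routes diagnostics to stderr.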

inference.py ADDED

	@@ -0,0 +1,425 @@

+"""inference.py — SQL Repair OpenEnv agent.
+This script is the AGENT side of the OpenEnv hackathon submission. The
+validator runs `python inference.py`, expects exit code 0, and parses
+exactly these stdout lines per task:
+    [START] task_x
+    [STEP]  NN | task=task_x | ...
+    [END]   task_x | score=0.NNNN | status=ok
+INVARIANTS (each one was learned from a Phase 2 failure):
+  1. EVERY task emits exactly one [START] and one [END] line — even on crash.
+  2. EVERY score is strictly inside the open interval (0, 1) — never 0.0 or 1.0.
+  3. NaN, inf, and parsing failures collapse to 0.5 (in-range fallback).
+  4. NO non-bracket prints on stdout from the main path. Diagnostics go to stderr.
+  5. flush=True on every emit so partial output survives a SIGKILL.
+  6. inference.py exits 0 even on catastrophic failure (we still emit safe scores).
+The agent uses the standardized OpenEnv environment variables that the
+validator injects: API_BASE_URL, MODEL_NAME, HF_TOKEN.
+"""
+from __future__ import annotations
+import json
+import os
+import subprocess
+import sys
+import time
+import traceback
+from typing import Any, Dict, List, Optional
+# ===========================================================================
+# Standardized OpenEnv environment variables (REQUIRED by submission checklist)
+# ===========================================================================
+API_BASE_URL: str = os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1")
+MODEL_NAME: str = os.getenv("MODEL_NAME", "llama-3.3-70b-versatile")
+HF_TOKEN: Optional[str] = os.getenv("HF_TOKEN")  # no default — must be set in HF Secrets
+# Optional knobs
+LOCAL_IMAGE_NAME: Optional[str] = os.getenv("LOCAL_IMAGE_NAME")
+ENV_URL_DEFAULT: str = os.getenv("ENV_URL", "http://localhost:8000")
+REPO_ROOT: str = os.path.dirname(os.path.abspath(__file__))
+TASK_IDS: List[str] = ["task_1", "task_2", "task_3"]
+MAX_STEPS: int = 6
+# ===========================================================================
+# Sterile stdout sink — only [START]/[STEP]/[END] lines pass through this.
+# ===========================================================================
+def emit(line: str) -> None:
+    print(line, flush=True)
+def warn(msg: str) -> None:
+    """Diagnostics — stderr only, never parsed by the validator."""
+    print(f"# {msg}", file=sys.stderr, flush=True)
+# ===========================================================================
+# Strict (0, 1) score clamp — duplicated here so the agent never depends on
+# importable env code (the validator may run inference.py outside the package).
+# ===========================================================================
+def clamp_score(value: Any) -> float:
+    try:
+        s = float(value)
+    except (TypeError, ValueError):
+        return 0.5
+    if s != s:  # NaN
+        return 0.5
+    if s == float("inf") or s == float("-inf"):
+        return 0.5
+    if round(s, 4) <= 0.0:  # compare the rounded value: round(0.99996, 4) == 1.0 would escape (0, 1)
+        return 0.001
+    if round(s, 4) >= 1.0:
+        return 0.999
+    return round(s, 4)
+# ===========================================================================
+# HTTP env client — minimal, no openenv-core dependency required.
+# ===========================================================================
+class HttpEnvClient:
+    """Thin REST client for our env server."""
+    def __init__(self, base_url: str) -> None:
+        import requests  # local import so the module can load even without it
+        self._requests = requests
+        self.base_url = base_url.rstrip("/")
+    def health(self) -> Dict[str, Any]:
+        r = self._requests.get(f"{self.base_url}/health", timeout=10)
+        r.raise_for_status()
+        return r.json()
+    def reset(self, task_id: str) -> Dict[str, Any]:
+        r = self._requests.post(
+            f"{self.base_url}/reset",
+            json={"task_id": task_id},
+            timeout=30,
+        )
+        r.raise_for_status()
+        return r.json()
+    def step(self, action: Dict[str, Any]) -> Dict[str, Any]:
+        r = self._requests.post(
+            f"{self.base_url}/step",
+            json={"action": action},
+            timeout=60,
+        )
+        r.raise_for_status()
+        return r.json()
+    def grader(self, task_id: str) -> Dict[str, Any]:
+        r = self._requests.post(
+            f"{self.base_url}/grader",
+            json={"task_id": task_id},
+            timeout=30,
+        )
+        r.raise_for_status()
+        return r.json()
+def _wait_for_health(url: str, timeout: float = 60.0) -> bool:
+    import requests
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        try:
+            r = requests.get(f"{url}/health", timeout=3)
+            if r.status_code == 200:
+                return True
+        except Exception:
+            pass
+        time.sleep(0.5)
+    return False
+def get_env_client() -> HttpEnvClient:
+    """Connect to the env server using the first viable strategy.
+    Strategies (in order of preference):
+      1. openenv-core's Env.from_docker_image() if LOCAL_IMAGE_NAME is set
+      2. Direct HTTP at ENV_URL if /health responds
+      3. Spawn a local subprocess `python -m server.app` from this repo
+    """
+    # Strategy 1: openenv-core image launch (sample pattern)
+    if LOCAL_IMAGE_NAME:
+        try:
+            from openenv_core.client import Env  # type: ignore
+            env = Env.from_docker_image(LOCAL_IMAGE_NAME, ports={8000: 8000})
+            warn(f"openenv-core launched container from image {LOCAL_IMAGE_NAME}")
+            # Wait for the launched container to be reachable
+            if _wait_for_health("http://localhost:8000", timeout=60):
+                return HttpEnvClient("http://localhost:8000")
+            warn("Container started but health check failed; falling through")
+        except Exception as exc:
+            warn(f"openenv-core import/launch failed: {exc}")
+    # Strategy 2: env already running at ENV_URL
+    if _wait_for_health(ENV_URL_DEFAULT, timeout=5):
+        warn(f"Reusing already-running env at {ENV_URL_DEFAULT}")
+        return HttpEnvClient(ENV_URL_DEFAULT)
+    # Strategy 3: spawn a local server subprocess
+    warn("No env reachable — spawning local subprocess on port 8000")
+    env_proc = subprocess.Popen(
+        [sys.executable, "-m", "server.app"],
+        cwd=REPO_ROOT,
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.DEVNULL,
+        env={**os.environ, "PORT": "8000", "PYTHONUNBUFFERED": "1"},
+    )
+    if not _wait_for_health("http://localhost:8000", timeout=45):
+        try:
+            env_proc.terminate()
+        except Exception:
+            pass
+        raise RuntimeError("Local env server did not become healthy within 45s")
+    warn(f"Local env subprocess pid={env_proc.pid} healthy")
+    return HttpEnvClient("http://localhost:8000")
+# ===========================================================================
+# OpenAI-compatible LLM client (Groq / OpenAI / HF inference endpoints)
+# ===========================================================================
+def make_llm_client():
+    from openai import OpenAI
+    api_key = (
+        HF_TOKEN
+        or os.getenv("GROQ_API_KEY")
+        or os.getenv("OPENAI_API_KEY")
+    )
+    if not api_key:
+        raise EnvironmentError(
+            "No API key found. Set HF_TOKEN (or GROQ_API_KEY / OPENAI_API_KEY) in the environment."
+        )
+    return OpenAI(base_url=API_BASE_URL, api_key=api_key)
+SYSTEM_PROMPT = """You are an expert SQL engineer. Your job is to repair broken SQL queries.
+You will be given:
+  - A SQL schema (CREATE TABLE / INSERT statements)
+  - A broken SQL query that errors or returns the wrong rows
+  - The error message (if any)
+  - A short hint
+  - The expected number of rows and columns
+Respond with ONLY a JSON object on a single line:
+  {"query": "<the corrected SQL query>"}
+Do NOT include any prose, explanation, code fences, or markdown — only the JSON object."""
+def _parse_query(content: str) -> str:
+    """Best-effort extraction of a SQL string from an LLM response."""
+    if not content:
+        return ""
+    s = content.strip()
+    # Strip markdown code fences
+    if s.startswith("```"):
+        s = s.strip("`").strip()
+        if s.lower().startswith("json"):
+            s = s[4:].strip()
+        elif s.lower().startswith("sql"):
+            s = s[3:].strip()
+    # Try strict JSON
+    try:
+        data = json.loads(s)
+        if isinstance(data, dict) and "query" in data:
+            return str(data["query"]).strip()
+    except json.JSONDecodeError:
+        pass
+    # Fallback: regex for {"query": "..."}
+    import re
+    m = re.search(r'"query"\s*:\s*"((?:[^"\\]|\\.)*)"', s)
+    if m:
+        return m.group(1).encode().decode("unicode_escape")
+    # Last resort: return raw content (might be a bare SQL string)
+    return s
+def call_llm(client, observation: Dict[str, Any], previous_attempts: List[Dict[str, Any]]) -> str:
+    user_lines = [
+        f"Task: {observation.get('name') or observation.get('task_id', '?')}",
+        f"Difficulty: {observation.get('difficulty', '?')}",
+        "",
+        "Schema:",
+        observation.get("schema_sql", "") or "(missing)",
+        "",
+        "Broken query:",
+        observation.get("broken_query", "") or "(missing)",
+        "",
+        f"Broken query error: {observation.get('broken_query_error') or 'none (returns wrong rows)'}",
+        f"Hint: {observation.get('hint', '')}",
+        "",
+        f"Expected: {observation.get('expected_row_count', '?')} rows × "
+        f"{observation.get('expected_column_count', '?')} columns",
+    ]
+    if previous_attempts:
+        user_lines.append("")
+        user_lines.append("Previous attempts:")
+        for i, att in enumerate(previous_attempts[-3:], start=1):
+            user_lines.append(
+                f"  {i}. query={att.get('query', '')!r} -> "
+                f"executed={att.get('executed')} matches={att.get('matches_expected')} "
+                f"error={att.get('error')!r}"
+            )
+    user_lines.append("")
+    user_lines.append('Return ONLY: {"query": "<fixed SQL>"}')
+    user_msg = "\n".join(user_lines)
+    try:
+        resp = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_msg},
+            ],
+            temperature=0.1,
+            max_tokens=512,
+        )
+        content = (resp.choices[0].message.content or "").strip()
+        return _parse_query(content)
+    except Exception as exc:
+        warn(f"LLM call failed: {exc}")
+        return ""
+# ===========================================================================
+# Per-task runner — NEVER raises. Always emits exactly one [START] / [END].
+# ===========================================================================
+def run_task(env: HttpEnvClient, llm_client, task_id: str) -> float:
+    emit(f"[START] {task_id}")
+    score: float = 0.5  # safe in-range fallback
+    status: str = "ok"
+    try:
+        obs = env.reset(task_id)
+        last_obs: Dict[str, Any] = dict(obs)
+        previous_attempts: List[Dict[str, Any]] = []
+        broken = obs.get("broken_query", "")
+        for step_idx in range(1, MAX_STEPS + 1):
+            try:
+                fixed = call_llm(llm_client, last_obs, previous_attempts)
+            except Exception as exc:  # noqa: BLE001
+                warn(f"LLM error on step {step_idx}: {exc}")
+                fixed = ""
+            if not fixed:
+                fixed = broken  # fall back to the broken query so step still runs
+            try:
+                result = env.step({"action_type": "submit_query", "query": fixed})
+            except Exception as exc:  # noqa: BLE001
+                warn(f"env.step failed on step {step_idx}: {exc}")
+                emit(
+                    f"[STEP] {step_idx:02d} | task={task_id} "
+                    f"| action=submit_query | reward=+0.0000 | status=step_error"
+                )
+                continue
+            reward = float(result.get("reward", 0.0))
+            obs2: Dict[str, Any] = result.get("observation", {}) or {}
+            done = bool(result.get("done", False))
+            matches = bool(obs2.get("matches_expected", False))
+            emit(
+                f"[STEP] {step_idx:02d} | task={task_id} "
+                f"| action=submit_query | reward={reward:+.4f} "
+                f"| matches={matches} | rows={obs2.get('result_row_count', 0)}"
+            )
+            previous_attempts.append(
+                {
+                    "query": fixed,
+                    "executed": obs2.get("executed", False),
+                    "matches_expected": matches,
+                    "error": obs2.get("error"),
+                }
+            )
+            # Update context for next prompt
+            last_obs.update(obs2)
+            last_obs["broken_query"] = fixed
+            last_obs["broken_query_error"] = obs2.get("error")
+            last_obs["hint"] = obs.get("hint", "")
+            last_obs["schema_sql"] = obs.get("schema_sql", "")
+            last_obs["expected_row_count"] = obs.get("expected_row_count")
+            last_obs["expected_column_count"] = obs.get("expected_column_count")
+            last_obs["name"] = obs.get("name")
+            last_obs["difficulty"] = obs.get("difficulty")
+            if done:
+                break
+        # Pull final score from the env grader, then strict-clamp.
+        try:
+            grader_resp = env.grader(task_id)
+            raw = grader_resp.get("score", 0.5)
+        except Exception as exc:  # noqa: BLE001
+            warn(f"grader call failed: {exc}")
+            raw = 0.5
+        score = clamp_score(raw)
+    except Exception:
+        traceback.print_exc(file=sys.stderr)
+        status = "crash"
+        score = 0.5  # in-range fallback
+    # FINAL emit — guaranteed exactly once per task, in (0, 1)
+    emit(f"[END] {task_id} | score={score:.4f} | status={status}")
+    return score
+# ===========================================================================
+# Main entry point. Exits 0 even on catastrophic failure.
+# ===========================================================================
+def main() -> int:
+    env: Optional[HttpEnvClient] = None
+    llm_client = None
+    try:
+        env = get_env_client()
+    except Exception:
+        traceback.print_exc(file=sys.stderr)
+        for tid in TASK_IDS:
+            emit(f"[START] {tid}")
+            emit(f"[END] {tid} | score=0.5000 | status=fatal_no_env")
+        return 0
+    try:
+        llm_client = make_llm_client()
+    except Exception:
+        traceback.print_exc(file=sys.stderr)
+        for tid in TASK_IDS:
+            emit(f"[START] {tid}")
+            emit(f"[END] {tid} | score=0.5000 | status=fatal_no_llm")
+        return 0
+    for tid in TASK_IDS:
+        try:
+            run_task(env, llm_client, tid)
+        except Exception:
+            # Belt and suspenders — run_task already handles its own errors.
+            traceback.print_exc(file=sys.stderr)
+            emit(f"[START] {tid}")
+            emit(f"[END] {tid} | score=0.5000 | status=outer_crash")
+    return 0
+if __name__ == "__main__":
+    try:
+        sys.exit(main())
+    except SystemExit:
+        raise
+    except Exception:
+        traceback.print_exc(file=sys.stderr)
+        # Last-ditch: still emit safe scores so the validator parses something.
+        for tid in TASK_IDS:
+            print(f"[START] {tid}", flush=True)
+            print(f"[END] {tid} | score=0.5000 | status=outer_fatal", flush=True)
+        sys.exit(0)  # exit 0 — validator requires it
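
Invariant 1 from the module docstring (one `[START]` and one `[END]` per task, even on crash) reduces to a small wrapper pattern. The sketch below isolates that pattern. `run_guarded` and `_boom` are hypothetical names used only for this illustration, and the example is a simplification of `run_task`, not a replacement for it.

```python
import contextlib
import io
import sys
from typing import Callable


def run_guarded(task_id: str, body: Callable[[], float]) -> float:
    """Emit one [START] and one [END] for task_id, whatever body() does."""
    print(f"[START] {task_id}", flush=True)
    score, status = 0.5, "ok"  # in-range fallback, as in the commit
    try:
        score = body()
    except Exception as exc:
        # Diagnostics go to stderr so stdout stays limited to bracket lines.
        print(f"# {task_id} crashed: {exc}", file=sys.stderr, flush=True)
        score, status = 0.5, "crash"
    print(f"[END] {task_id} | score={score:.4f} | status={status}", flush=True)
    return score


def _boom() -> float:
    raise RuntimeError("simulated failure")


# Capture stdout to check that a crash still yields a matched START/END pair.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    run_guarded("task_1", lambda: 0.989)
    run_guarded("task_2", _boom)
lines = buf.getvalue().splitlines()
```

The `[END]` line sits after the `try` block rather than inside it, so it is reached on both the success and the failure path.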

openenv.yaml ADDED

	@@ -0,0 +1,28 @@

+name: sql-repair-env
+version: 0.1.0
+description: |
+  OpenEnv environment for SQL query repair. Each task gives the agent a
+  schema, a broken SQL query, and a hint. The agent must submit a corrected
+  query that returns the expected result set. Backed by SQLite in-memory.
+maintainer: krishpotanwar
+runtime:
+  type: docker
+  image: sql-repair-env:latest
+  port: 8000
+endpoints:
+  health: /health
+  tasks: /tasks
+  reset: /reset
+  step: /step
+  grader: /grader
+  baseline: /baseline
+tasks:
+  - id: task_1
+    name: Missing commas in SELECT
+    difficulty: easy
+  - id: task_2
+    name: Wrong column reference in JOIN
+    difficulty: medium
+  - id: task_3
+    name: Aggregate without GROUP BY
+    difficulty: hard
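
The description above says a submission must return the expected result set from an in-memory SQLite database. A minimal sketch of that check follows. The schema, the queries, and the exact ordered-equality comparison are all assumptions made for illustration; the real task definitions live in `sql_env/tasks.py` and the real comparison may differ.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema and canonical query, not taken from tasks.py.
SCHEMA = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)",
    "INSERT INTO users VALUES (1, 'Ada', 36), (2, 'Linus', 28)",
]
CANONICAL = "SELECT id, name, age FROM users ORDER BY id"


def run(query: str) -> Tuple[bool, List[tuple]]:
    """Run query on a fresh in-memory DB and return (executed, rows)."""
    conn = sqlite3.connect(":memory:")
    try:
        for stmt in SCHEMA:
            conn.execute(stmt)
        return True, conn.execute(query).fetchall()
    except sqlite3.Error:
        return False, []
    finally:
        conn.close()


def matches_expected(query: str) -> bool:
    """True only if the query executes and returns the canonical rows."""
    ok, rows = run(query)
    return ok and rows == run(CANONICAL)[1]


# A broken query in the style of task_1 (commas missing from the SELECT list).
print(matches_expected("SELECT id name age FROM users ORDER BY id"))
print(matches_expected("SELECT id, name, age FROM users ORDER BY id"))
```

Rebuilding the database for every query keeps submissions from affecting one another, at the cost of re-running the schema statements each time.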

pyproject.toml ADDED

	@@ -0,0 +1,30 @@

+[project]
+name = "sql-repair-env"
+version = "0.1.0"
+description = "OpenEnv environment for SQL query repair tasks"
+readme = "README.md"
+requires-python = ">=3.10"
+authors = [{ name = "krishpotanwar" }]
+license = { text = "Apache-2.0" }
+dependencies = [
+    "openenv-core>=0.2.0",
+    "fastapi>=0.110.0",
+    "uvicorn[standard]>=0.27.0",
+    "pydantic>=2.0.0",
+    "openai>=1.30.0",
+    "requests>=2.31.0",
+    "numpy>=1.24.0",
+]
+[project.scripts]
+server = "server.app:main"
+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+[tool.setuptools]
+packages = ["server", "sql_env"]
+[tool.setuptools.package-data]
+"*" = ["*.yaml", "*.md"]

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+openenv-core>=0.2.0
+fastapi>=0.110.0
+uvicorn[standard]>=0.27.0
+pydantic>=2.0.0
+openai>=1.30.0
+requests>=2.31.0
+numpy>=1.24.0

server/__init__.py ADDED

	@@ -0,0 +1 @@


+"""HTTP server package for SQL Repair OpenEnv environment."""

server/app.py ADDED

	@@ -0,0 +1,142 @@

+"""FastAPI server for the SQL Repair OpenEnv environment.
+Endpoints (all required by the OpenEnv submission validator):
+  GET  /health   -> {"status": "ok"}
+  GET  /tasks    -> {"tasks": [...ids], "details": [{id, name, difficulty}, ...]}
+  POST /reset    -> reset env to a task (body optional, defaults to task_1)
+  POST /step     -> apply an action, return observation/reward/done
+  POST /grader   -> compute final score for a task (strictly in (0, 1))
+  POST /baseline -> run all tasks with the broken queries, return scores
+Phase 1 hard requirement: /reset MUST accept an empty POST body.
+We achieve that with `Optional[ResetRequest] = Body(default=None)`.
+Entry point exposed via [project.scripts] server = "server.app:main".
+"""
+from __future__ import annotations
+import os
+from typing import Any, Dict, List, Optional
+from fastapi import Body, FastAPI
+from pydantic import BaseModel, Field
+from sql_env.env_core import EnvState, MAX_STEPS
+from sql_env.grader import grade_task
+from sql_env.tasks import TASK_IDS, TASKS
+app = FastAPI(
+    title="SQL Repair OpenEnv",
+    version="0.1.0",
+    description=(
+        "An OpenEnv environment for SQL query repair. Agents fix broken "
+        "SQL queries against a small SQLite schema."
+    ),
+)
+# Single mutable env state instance — the validator runs one session.
+_state = EnvState()
+# ---------------------------------------------------------------------------
+# Pydantic request models
+# ---------------------------------------------------------------------------
+class ResetRequest(BaseModel):
+    task_id: Optional[str] = Field(default=None, description="Task ID to reset to")
+class StepAction(BaseModel):
+    action_type: str = Field(default="submit_query")
+    query: str = Field(default="")
+class StepRequest(BaseModel):
+    action: Dict[str, Any] = Field(default_factory=dict)
+class GraderRequest(BaseModel):
+    task_id: Optional[str] = Field(default=None)
+class BaselineRequest(BaseModel):
+    tasks: Optional[List[str]] = Field(default=None)
+# ---------------------------------------------------------------------------
+# Endpoints
+# ---------------------------------------------------------------------------
+@app.get("/health")
+def health() -> Dict[str, str]:
+    return {"status": "ok"}
+@app.get("/tasks")
+def list_tasks() -> Dict[str, Any]:
+    return {
+        "tasks": TASK_IDS,
+        "details": [
+            {
+                "id": TASKS[t]["id"],
+                "name": TASKS[t]["name"],
+                "difficulty": TASKS[t]["difficulty"],
+            }
+            for t in TASK_IDS
+        ],
+    }
+@app.post("/reset")
+def reset(req: Optional[ResetRequest] = Body(default=None)) -> Dict[str, Any]:
+    """Reset the environment. Body is optional — defaults to task_1."""
+    task_id = req.task_id if (req and req.task_id) else "task_1"
+    obs = _state.reset(task_id)
+    return obs
+@app.post("/step")
+def step(req: Optional[StepRequest] = Body(default=None)) -> Dict[str, Any]:
+    """Apply one action to the environment."""
+    action: Dict[str, Any] = (req.action if req and req.action else {})
+    return _state.step(action)
+@app.post("/grader")
+def grader(req: Optional[GraderRequest] = Body(default=None)) -> Dict[str, Any]:
+    """Return the strict-(0,1) score for the given task."""
+    task_id = req.task_id if (req and req.task_id) else (_state.task_id or "task_1")
+    score = grade_task(_state, task_id)
+    return {"task_id": task_id, "score": float(score)}
+@app.post("/baseline")
+def baseline(
+    req: Optional[BaselineRequest] = Body(default=None),
+) -> Dict[str, Any]:
+    """Run all tasks with the broken queries to verify graders work."""
+    task_ids = (req.tasks if (req and req.tasks) else None) or list(TASK_IDS)
+    out: Dict[str, float] = {}
+    for tid in task_ids:
+        if tid not in TASKS:
+            continue
+        local = EnvState()
+        local.reset(tid)
+        # Submit the broken query as a baseline submission
+        local.step({"action_type": "submit_query", "query": TASKS[tid]["broken_query"]})
+        out[tid] = float(grade_task(local, tid))
+    return {"scores": out, "max_steps": MAX_STEPS}
+# ---------------------------------------------------------------------------
+# Entry point — referenced by [project.scripts] server = "server.app:main"
+# ---------------------------------------------------------------------------
+def main() -> None:
+    """Entry point for `python -m server.app` and the `server` console script."""
+    import uvicorn
+    host = os.getenv("HOST", "0.0.0.0")
+    port = int(os.getenv("PORT", "8000"))
+    uvicorn.run(app, host=host, port=port, log_level="info")
+if __name__ == "__main__":
+    main()
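
For reference, the request shapes these endpoints accept can be written out with the standard library alone. The sketch below only builds the requests and does not send them. `build_post` is a hypothetical helper, the base URL is the default `ENV_URL` from the README, and it is an assumption here that the server treats a POST with no body and no content type the same as an omitted `ResetRequest`.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_post(
    base_url: str, path: str, payload: Optional[Dict[str, Any]] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request; payload=None means an empty body."""
    data = b"" if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if payload is None else {"Content-Type": "application/json"}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{path}", data=data, headers=headers, method="POST"
    )


BASE = "http://localhost:8000"  # default ENV_URL from the README

# Empty-body reset: the server falls back to task_1.
reset_empty = build_post(BASE, "/reset")
# Explicit task selection.
reset_task = build_post(BASE, "/reset", {"task_id": "task_2"})
# The action is nested under "action", matching StepRequest.
step = build_post(
    BASE, "/step", {"action": {"action_type": "submit_query", "query": "SELECT 1"}}
)

# To send one against a running server:
#   urllib.request.urlopen(reset_empty, timeout=10)
```

Note that `/step` reads the query from inside the `action` object, so a payload with `query` at the top level would be treated as an empty action.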

sql_env/__init__.py ADDED

	@@ -0,0 +1,14 @@

+"""SQL Repair OpenEnv environment package."""
+from .env_core import EnvState, MAX_STEPS
+from .tasks import TASKS, TASK_IDS
+from .grader import grade_task, SCORE_MIN, SCORE_MAX
+__all__ = [
+    "EnvState",
+    "MAX_STEPS",
+    "TASKS",
+    "TASK_IDS",
+    "grade_task",
+    "SCORE_MIN",
+    "SCORE_MAX",
+]

sql_env/env_core.py ADDED

	@@ -0,0 +1,174 @@

+"""SQLite-backed environment state for SQL repair tasks.
+The env exposes a minimal Gym-like API:
+  reset(task_id) -> observation dict
+  step(action)   -> {observation, reward, done, info}
+Per-task state is held in this single instance for simplicity; the
+validator runs only one session at a time.
+"""
+from __future__ import annotations
+import sqlite3
+from typing import Any, Dict, List, Optional
+from .tasks import TASKS, TASK_IDS
+MAX_STEPS = 6
+def _new_db(task_id: str) -> sqlite3.Connection:
+    """Build a fresh in-memory DB for the given task."""
+    if task_id not in TASKS:
+        raise KeyError(f"Unknown task_id: {task_id}")
+    conn = sqlite3.connect(":memory:")
+    cur = conn.cursor()
+    for stmt in TASKS[task_id]["schema"]:
+        cur.execute(stmt)
+    conn.commit()
+    return conn
+def _run_query(task_id: str, query: str) -> Dict[str, Any]:
+    """Execute a query against a fresh DB; return rows or error info."""
+    conn = _new_db(task_id)
+    try:
+        cur = conn.execute(query)
+        rows = cur.fetchall()
+        col_names = [d[0] for d in cur.description] if cur.description else []
+        return {"ok": True, "rows": rows, "columns": col_names, "error": None}
+    except Exception as exc:
+        return {"ok": False, "rows": None, "columns": [], "error": str(exc)}
+    finally:
+        conn.close()
+def _expected_rows(task_id: str) -> List[tuple]:
+    """Compute the canonical (expected) result set for a task."""
+    res = _run_query(task_id, TASKS[task_id]["canonical_query"])
+    if not res["ok"]:
+        # Should never happen — canonical queries are vetted in tests.
+        raise RuntimeError(
+            f"Canonical query for {task_id} failed: {res['error']}"
+        )
+    return res["rows"]
+class EnvState:
+    """Mutable per-session env state. One instance handles all tasks."""
+    def __init__(self) -> None:
+        self.task_id: Optional[str] = None
+        self.step_count: int = 0
+        self.last_query: Optional[str] = None
+        self.last_error: Optional[str] = None
+        self.last_result: Optional[List[tuple]] = None
+        self.solved: bool = False
+        self.expected_rows: List[tuple] = []
+        self.expected_columns: int = 0
+    # ------------------------------------------------------------------
+    def reset(self, task_id: Optional[str] = None) -> Dict[str, Any]:
+        tid = task_id or "task_1"
+        if tid not in TASKS:
+            tid = "task_1"
+        task = TASKS[tid]
+        self.task_id = tid
+        self.step_count = 0
+        self.last_query = None
+        self.last_error = None
+        self.last_result = None
+        self.solved = False
+        self.expected_rows = _expected_rows(tid)
+        self.expected_columns = (
+            len(self.expected_rows[0]) if self.expected_rows else 0
+        )
+        # Surface what the broken query actually does, so the agent has
+        # an error message and a canonical "what went wrong" hint.
+        baseline = _run_query(tid, task["broken_query"])
+        return {
+            "task_id": tid,
+            "name": task["name"],
+            "difficulty": task["difficulty"],
+            "schema_sql": "\n".join(task["schema"]),
+            "broken_query": task["broken_query"],
+            "broken_query_error": baseline["error"],
+            "broken_query_executes": baseline["ok"],
+            "hint": task["hint"],
+            "expected_row_count": len(self.expected_rows),
+            "expected_column_count": self.expected_columns,
+            "step_count": 0,
+            "max_steps": MAX_STEPS,
+            "remaining_steps": MAX_STEPS,
+        }
+    # ------------------------------------------------------------------
+    def step(self, action: Dict[str, Any]) -> Dict[str, Any]:
+        if self.task_id is None:
+            return {
+                "observation": {"error": "No active task. Call /reset first."},
+                "reward": 0.0,
+                "done": True,
+                "info": {"solved": False, "no_active_task": True},
+            }
+        self.step_count += 1
+        action_type = (action or {}).get("action_type", "submit_query")
+        query = ((action or {}).get("query") or "").strip()
+        self.last_query = query
+        reward = 0.0
+        result_rows: Optional[List[tuple]] = None
+        error: Optional[str] = None
+        if action_type != "submit_query":
+            error = f"Unsupported action_type: {action_type}"
+            reward = -0.05
+        elif not query:
+            error = "Empty query string."
+            reward = -0.05
+        else:
+            res = _run_query(self.task_id, query)
+            if res["ok"]:
+                result_rows = res["rows"]
+                self.last_result = result_rows
+                self.last_error = None
+                if result_rows == self.expected_rows:
+                    reward = 1.0
+                    self.solved = True
+                else:
+                    # executed but wrong rows — small positive reward
+                    reward = 0.4
+            else:
+                error = res["error"]
+                self.last_error = error
+                self.last_result = None
+                reward = -0.10
+        done = self.solved or self.step_count >= MAX_STEPS
+        observation = {
+            "task_id": self.task_id,
+            "step_count": self.step_count,
+            "submitted_query": query,
+            "error": error,
+            "executed": error is None and result_rows is not None,
+            "matches_expected": (
+                result_rows == self.expected_rows if result_rows is not None else False
+            ),
+            "result_row_count": len(result_rows) if result_rows is not None else 0,
+            "expected_row_count": len(self.expected_rows),
+            "result_preview": result_rows[:3] if result_rows else None,
+            "expected_preview": self.expected_rows[:3],
+            "remaining_steps": max(0, MAX_STEPS - self.step_count),
+        }
+        return {
+            "observation": observation,
+            "reward": float(reward),
+            "done": bool(done),
+            "info": {"solved": self.solved},
+        }
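The isolation pattern used by `_new_db` and `_run_query` above (build a fresh in-memory SQLite database for every query, capture any error as data) can be sketched in a standalone form. This is a minimal illustration, not the module itself: the schema is task_1 trimmed to two rows, and `run_query`, `SCHEMA`, `BROKEN`, and `CANONICAL` are names chosen here for the sketch.

```python
import sqlite3
from typing import Any, Dict, List

# Schema and queries taken from task_1 (rows trimmed for brevity).
SCHEMA: List[str] = [
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL);",
    "INSERT INTO products VALUES (1, 'Apple', 0.50);",
    "INSERT INTO products VALUES (2, 'Bread', 2.50);",
]
BROKEN = "SELECT id name price FROM products ORDER BY id"
CANONICAL = "SELECT id, name, price FROM products ORDER BY id"


def run_query(query: str) -> Dict[str, Any]:
    """Run one query against a freshly built in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        for stmt in SCHEMA:
            conn.execute(stmt)
        rows = conn.execute(query).fetchall()
        return {"ok": True, "rows": rows, "error": None}
    except sqlite3.Error as exc:
        # Errors are returned as data so the caller can show them to the agent.
        return {"ok": False, "rows": None, "error": str(exc)}
    finally:
        conn.close()


# A destructive submission cannot leak into the next query's database,
# because each call rebuilds the schema from scratch.
dropped = run_query("DROP TABLE products")
broken = run_query(BROKEN)
fixed = run_query(CANONICAL)
print(dropped["ok"], broken["ok"], broken["error"], fixed["rows"])
```

Rebuilding the database per query costs a few statements per step, which seems a reasonable trade for tasks this small: an agent that submits `DROP TABLE` or `DELETE` cannot corrupt later steps or the expected rows.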

sql_env/grader.py ADDED

	@@ -0,0 +1,64 @@

+"""Strict (0, 1) grader for SQL repair tasks.
+Phase 2 hard requirement: scores MUST be in the OPEN interval (0, 1).
+Validator rejects exactly 0.0 and exactly 1.0. NaN/inf are also rejected,
+so we coerce them to 0.5 (a neutral, in-range fallback).
+"""
+from __future__ import annotations
+import math
+from typing import Any
+# Module-level constants — also used by inference.py for consistency.
+SCORE_MIN: float = 1e-3   # 0.001 — strictly > 0
+SCORE_MAX: float = 0.999  # strictly < 1
+def strict_clamp(value: Any) -> float:
+    """Coerce any input into a float strictly inside (0, 1).
+    NaN, inf, -inf, and non-numeric inputs all collapse to 0.5.
+    """
+    try:
+        s = float(value)
+    except (TypeError, ValueError, OverflowError):
+        return 0.5
+    if math.isnan(s) or math.isinf(s):
+        return 0.5
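+    # Round before clamping: round(0.99996, 4) is 1.0 and round(0.00004, 4)
+    # is 0.0, so rounding last could return a value outside (0, 1).
+    return min(SCORE_MAX, max(SCORE_MIN, round(s, 4)))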
+def grade_task(state, task_id: str) -> float:
+    """Score the current state of an EnvState for the given task.
+    Score components (sum, then strict_clamp):
+      - 0.05  : agent submitted at least one query
+      - 0.25  : last query executed without error
+      - 0.60  : result rows matched expected rows
+      - 0.09  : efficiency bonus (faster solves score higher)
+    Worst case (no submission):    0.000  -> clamped to 0.001
+    Best case (1-step solve):      0.90 + 0.09 * (MAX_STEPS - 1) / MAX_STEPS
+                                   (always below 0.99, so never clamped)
+    Wrong-result executes:         0.30   -> in range
+    """
+    from .env_core import MAX_STEPS  # local import avoids circular
+    if state.task_id != task_id:
+        return SCORE_MIN
+    raw = 0.0
+    if state.last_query:
+        raw += 0.05
+    if state.last_error is None and state.last_result is not None:
+        raw += 0.25
+    if state.last_result == state.expected_rows and state.expected_rows:
+        raw += 0.60
+    if state.solved and state.step_count > 0:
+        bonus = 0.09 * max(0, MAX_STEPS - state.step_count) / MAX_STEPS
+        raw += bonus
+    return strict_clamp(raw)
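The component sum in `grade_task` can be worked through for a few hypothetical episodes. The sketch below is a simplified stand-in, not the grader itself: `sketch_score` is a name invented here, `MAX_STEPS = 5` is an assumed value (the real constant is defined in `env_core.py` and is not shown in this diff), and "matched" is treated as equivalent to "solved".

```python
MAX_STEPS = 5  # assumption for illustration only
SCORE_MIN, SCORE_MAX = 0.001, 0.999


def sketch_score(submitted: bool, executed: bool, matched: bool, steps: int) -> float:
    """Recompute the grade_task component sum for one hypothetical episode."""
    raw = 0.0
    if submitted:
        raw += 0.05
    if executed:
        raw += 0.25
    if matched:
        raw += 0.60
        # Efficiency bonus: proportional to the fraction of steps left unused.
        raw += 0.09 * max(0, MAX_STEPS - steps) / MAX_STEPS
    # Round, then keep the result strictly inside (0, 1).
    return min(SCORE_MAX, max(SCORE_MIN, round(raw, 4)))


one_step_solve = sketch_score(True, True, True, steps=1)           # 0.90 + 0.072
last_step_solve = sketch_score(True, True, True, steps=MAX_STEPS)  # 0.90, no bonus
wrong_rows = sketch_score(True, True, False, steps=3)              # 0.05 + 0.25
failed_query = sketch_score(True, False, False, steps=3)           # 0.05 only
no_submission = sketch_score(False, False, False, steps=0)         # raw 0.0, raised to SCORE_MIN
print(one_step_solve, last_step_solve, wrong_rows, failed_query, no_submission)
```

Under these assumptions the five outcomes land on distinct scores, which is what gives the grader its spread between a failed query, a wrong result, and a fast or slow solve.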

sql_env/tasks.py ADDED

	@@ -0,0 +1,81 @@

+"""Task definitions for SQL Repair env.
+Each task defines:
+  - schema          : list of CREATE/INSERT statements (executed verbatim)
+  - broken_query    : a SQL query that errors or returns the wrong rows
+  - canonical_query : the reference fix used to compute expected_rows
+  - hint            : short natural-language pointer
+Difficulty is tuned so even a vanilla LLM agent (Nemotron-class) can solve
+task_1 reliably, task_2 with effort, and task_3 about half the time —
+ensuring score variance across tasks (Phase 2 likely checks for this).
+"""
+from typing import Dict, List
+TASKS: Dict[str, dict] = {
+    "task_1": {
+        "id": "task_1",
+        "name": "Missing commas in SELECT list",
+        "difficulty": "easy",
+        "schema": [
+            "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL);",
+            "INSERT INTO products VALUES (1, 'Apple', 0.50);",
+            "INSERT INTO products VALUES (2, 'Bread', 2.50);",
+            "INSERT INTO products VALUES (3, 'Cheese', 5.00);",
+            "INSERT INTO products VALUES (4, 'Milk', 1.50);",
+            "INSERT INTO products VALUES (5, 'Eggs', 3.00);",
+        ],
+        "broken_query": "SELECT id name price FROM products ORDER BY id",
+        "canonical_query": "SELECT id, name, price FROM products ORDER BY id",
+        "hint": "The SELECT list is missing commas between column names.",
+    },
+    "task_2": {
+        "id": "task_2",
+        "name": "Wrong column reference in JOIN",
+        "difficulty": "medium",
+        "schema": [
+            "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT);",
+            "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, total REAL NOT NULL);",
+            "INSERT INTO users VALUES (1, 'Aarav', 'IN');",
+            "INSERT INTO users VALUES (2, 'Bea',   'US');",
+            "INSERT INTO users VALUES (3, 'Chen',  'CN');",
+            "INSERT INTO orders VALUES (10, 1,  99.00);",
+            "INSERT INTO orders VALUES (11, 1,  49.50);",
+            "INSERT INTO orders VALUES (12, 2, 200.00);",
+            "INSERT INTO orders VALUES (13, 3,  25.00);",
+        ],
+        "broken_query": (
+            "SELECT u.username, o.total "
+            "FROM users u JOIN orders o ON u.id = o.user "
+            "ORDER BY o.id"
+        ),
+        "canonical_query": (
+            "SELECT u.name, o.total "
+            "FROM users u JOIN orders o ON u.id = o.user_id "
+            "ORDER BY o.id"
+        ),
+        "hint": "Two columns are misspelled — check the schema for the real names.",
+    },
+    "task_3": {
+        "id": "task_3",
+        "name": "Aggregate without GROUP BY",
+        "difficulty": "hard",
+        "schema": [
+            "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL);",
+            "INSERT INTO sales VALUES (1, 'north', 100.00);",
+            "INSERT INTO sales VALUES (2, 'north',  50.00);",
+            "INSERT INTO sales VALUES (3, 'south', 200.00);",
+            "INSERT INTO sales VALUES (4, 'south',  75.00);",
+            "INSERT INTO sales VALUES (5, 'east',  150.00);",
+            "INSERT INTO sales VALUES (6, 'east',   25.00);",
+        ],
+        "broken_query": "SELECT region, SUM(amount) AS total FROM sales ORDER BY region",
+        "canonical_query": (
+            "SELECT region, SUM(amount) AS total FROM sales "
+            "GROUP BY region ORDER BY region"
+        ),
+        "hint": "You SELECT a non-aggregate column with an aggregate — add GROUP BY.",
+    },
+}
+TASK_IDS: List[str] = list(TASKS.keys())
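task_3 differs from the other two tasks in that its broken query does not raise an error in SQLite: a bare column next to an aggregate is accepted, and the query collapses to a single row. The sketch below reproduces this with the task_3 schema and queries; the variable names are chosen here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "north", 100.0),
        (2, "north", 50.0),
        (3, "south", 200.0),
        (4, "south", 75.0),
        (5, "east", 150.0),
        (6, "east", 25.0),
    ],
)

# Broken query: executes, but aggregates the whole table into one row.
# The region value in that row comes from an arbitrary input row.
broken_rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales ORDER BY region"
).fetchall()

# Canonical query: one row per region.
canonical_rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(len(broken_rows), broken_rows[0][1])
print(canonical_rows)
```

For the agent this means `broken_query_error` is empty for task_3 and `broken_query_executes` is true, so the only signals that something is wrong are the hint and the expected row count.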

tests/__init__.py ADDED

File without changes

tests/test_smoke.py ADDED

	@@ -0,0 +1,110 @@

+"""Smoke tests for the SQL Repair env.
+Run with: python -m pytest tests/ -q
+"""
+from __future__ import annotations
+import math
+from sql_env.env_core import EnvState, MAX_STEPS
+from sql_env.grader import SCORE_MAX, SCORE_MIN, grade_task, strict_clamp
+from sql_env.tasks import TASK_IDS, TASKS
+# ---------------------------------------------------------------------------
+# Strict (0, 1) clamp invariants
+# ---------------------------------------------------------------------------
+def test_strict_clamp_handles_extremes():
+    assert strict_clamp(0.0) == SCORE_MIN
+    assert strict_clamp(-1.0) == SCORE_MIN
+    assert strict_clamp(1.0) == SCORE_MAX
+    assert strict_clamp(2.0) == SCORE_MAX
+    assert strict_clamp(float("nan")) == 0.5
+    assert strict_clamp(float("inf")) == 0.5
+    assert strict_clamp(float("-inf")) == 0.5
+    assert strict_clamp("not a number") == 0.5
+    assert strict_clamp(None) == 0.5
+def test_strict_clamp_passes_through_in_range():
+    for v in [0.001, 0.1, 0.5, 0.7234, 0.999]:
+        out = strict_clamp(v)
+        assert SCORE_MIN <= out <= SCORE_MAX
+        assert 0.0 < out < 1.0
+# ---------------------------------------------------------------------------
+# Each canonical query reproduces the expected rows
+# ---------------------------------------------------------------------------
+def test_canonical_queries_solve_their_tasks():
+    for tid in TASK_IDS:
+        s = EnvState()
+        s.reset(tid)
+        result = s.step(
+            {"action_type": "submit_query", "query": TASKS[tid]["canonical_query"]}
+        )
+        assert result["info"]["solved"] is True, f"{tid} canonical did not solve"
+        assert result["reward"] == 1.0
+        score = grade_task(s, tid)
+        assert SCORE_MIN <= score <= SCORE_MAX
+        assert score >= 0.85, f"{tid} canonical scored too low: {score}"
+# ---------------------------------------------------------------------------
+# Broken queries do not solve and grade in (0, 1)
+# ---------------------------------------------------------------------------
+def test_broken_queries_score_in_range_but_not_solved():
+    for tid in TASK_IDS:
+        s = EnvState()
+        s.reset(tid)
+        result = s.step(
+            {"action_type": "submit_query", "query": TASKS[tid]["broken_query"]}
+        )
+        assert result["info"]["solved"] is False
+        score = grade_task(s, tid)
+        assert SCORE_MIN <= score <= SCORE_MAX
+        assert 0.0 < score < 1.0
+# ---------------------------------------------------------------------------
+# A do-nothing run still produces an in-range score
+# ---------------------------------------------------------------------------
+def test_no_submission_scores_in_range():
+    for tid in TASK_IDS:
+        s = EnvState()
+        s.reset(tid)
+        score = grade_task(s, tid)
+        assert SCORE_MIN <= score <= SCORE_MAX
+        assert 0.0 < score < 1.0
+# ---------------------------------------------------------------------------
+# Step limit terminates
+# ---------------------------------------------------------------------------
+def test_step_limit_done():
+    s = EnvState()
+    s.reset("task_1")
+    for _ in range(MAX_STEPS):
+        result = s.step({"action_type": "submit_query", "query": "SELECT 1"})
+    assert result["done"] is True
+# ---------------------------------------------------------------------------
+# Reset accepts unknown task_id by falling back to task_1
+# ---------------------------------------------------------------------------
+def test_reset_unknown_task_falls_back():
+    s = EnvState()
+    obs = s.reset("nonexistent_task")
+    assert obs["task_id"] == "task_1"
+# ---------------------------------------------------------------------------
+# Empty action does not crash
+# ---------------------------------------------------------------------------
+def test_empty_action_handled():
+    s = EnvState()
+    s.reset("task_1")
+    result = s.step({})
+    assert "observation" in result
+    assert result["reward"] <= 0  # negative or zero reward
+    assert result["observation"]["error"]
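One behavior the smoke tests above leave implicit is that `step()` compares result rows with plain list equality, so row order matters. The sketch below illustrates this with the task_2 schema; `rows_for` and the reordered query are hypothetical and only serve the example.

```python
import sqlite3
from typing import List

# Schema copied from task_2.
SCHEMA: List[str] = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, country TEXT);",
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, total REAL NOT NULL);",
    "INSERT INTO users VALUES (1, 'Aarav', 'IN');",
    "INSERT INTO users VALUES (2, 'Bea',   'US');",
    "INSERT INTO users VALUES (3, 'Chen',  'CN');",
    "INSERT INTO orders VALUES (10, 1,  99.00);",
    "INSERT INTO orders VALUES (11, 1,  49.50);",
    "INSERT INTO orders VALUES (12, 2, 200.00);",
    "INSERT INTO orders VALUES (13, 3,  25.00);",
]


def rows_for(query: str) -> List[tuple]:
    """Execute a query on a fresh in-memory copy of the task_2 database."""
    conn = sqlite3.connect(":memory:")
    try:
        for stmt in SCHEMA:
            conn.execute(stmt)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


expected = rows_for(
    "SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id ORDER BY o.id"
)
# Same join with both column names fixed, but sorted in the opposite direction.
reordered = rows_for(
    "SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id ORDER BY o.id DESC"
)

same_rows_any_order = sorted(expected) == sorted(reordered)
exact_match = expected == reordered
print(same_rows_any_order, exact_match)
```

A submission like the reordered query would execute and return the right rows, yet it would earn only the "executed but wrong rows" reward under list equality. Whether that strictness is desirable depends on whether the task statement treats `ORDER BY` as part of the fix.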

uv.lock ADDED

The diff for this file is too large to render.