Spaces: visheshrathi/dataops-env


visheshrathi committed on Apr 8

Commit f89b1ac (verified) · 1 parent: 8afce53

Upload folder using huggingface_hub

Files changed (23)

Dockerfile +41 -0
README.md +248 -6
__init__.py +14 -0
client.py +93 -0
data/__init__.py +0 -0
data/init_db.py +130 -0
env_loader.py +82 -0
inference.py +589 -0
models.py +113 -0
openenv.yaml +36 -0
pyproject.toml +42 -0
server/__init__.py +5 -0
server/__main__.py +6 -0
server/app.py +530 -0
server/dataops_env_environment.py +839 -0
server/grading.py +557 -0
server/requirements.txt +10 -0
server/safe_exec.py +195 -0
server/session_manager.py +128 -0
server/task_specs.py +773 -0
tests/test_grading.py +577 -0
tests/test_inference_api.py +408 -0
uv.lock +0 -0

Dockerfile ADDED

	@@ -0,0 +1,41 @@

+FROM python:3.12-slim AS builder
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+WORKDIR /app
+COPY pyproject.toml uv.lock README.md ./
+RUN uv sync --frozen --no-install-project --no-dev
+COPY __init__.py client.py env_loader.py inference.py models.py openenv.yaml ./
+COPY server ./server
+COPY data ./data
+RUN uv sync --frozen --no-dev
+FROM python:3.12-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    HOST=0.0.0.0 \
+    PORT=7860 \
+    PATH="/app/.venv/bin:$PATH"
+WORKDIR /app
+RUN useradd -m appuser
+COPY --from=builder --chown=appuser:appuser /app /app
+RUN mkdir -p /app/workspace && chown -R appuser:appuser /app
+USER appuser
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
+CMD python -c "import os, urllib.request; urllib.request.urlopen('http://127.0.0.1:' + os.getenv('PORT', '7860') + '/health')" || exit 1
+EXPOSE 7860
+ENV ENABLE_WEB_INTERFACE=true
+CMD ["python", "-m", "server"]

README.md CHANGED

@@ -1,10 +1,252 @@
 ---
-title: Dataops Env
-emoji: 🐨
-colorFrom: pink
-colorTo: green
+title: DataOpsEnv
+emoji: 🧩
+colorFrom: blue
+colorTo: indigo
 sdk: docker
-pinned: false
+app_port: 7860
+short_description: OpenEnv DataOps — SQLite, ETL repair, three graded tasks.
+tags:
+  - openenv
+base_path: /web
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# DataOpsEnv
+[Overview](#environment-description-and-motivation) · [Tasks](#tasks-descriptions-and-expected-difficulty) · [Setup and run](#setup-and-usage) · [Baseline scores](#baseline-scores) · [Hugging Face Spaces](#hugging-face-spaces) · [HTTP API](#api-reference) · [Tests](#tests)
+## Environment description and motivation
+**DataOpsEnv** is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small **SQLite** warehouse, **repairing Python ETL scripts**, and completing an **end-to-end reporting incident** (extract data, fix a formatter, send a mock email). Episodes are **seeded** (`reset` may include `seed`) so scenarios are **reproducible**; each HTTP session receives an **isolated workspace and database**.
+Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect **real workflows**; this environment exercises multi-step tool use, constraint respect, and verifiable outcomes rather than single-shot question answering.
+**Implementation:** FastAPI (`server/app.py`), environment logic (`server/dataops_env_environment.py`), terminal graders (`server/grading.py`), scenario definitions (`server/task_specs.py`, `data/init_db.py`), Pydantic types (`models.py`), OpenEnv manifest (`openenv.yaml`).
+---
+## Action space
+Each step submits JSON: `{"action": {"action_type": "<type>", "payload": { ... }}}`. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).
+| `action_type` | Payload fields                                                                                       | Role                                                             |
+| ------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
+| `ExecuteSQL`  | `query` (string, 1–2000 chars)                                                                       | Run task-scoped SQL against the episode SQLite DB.               |
+| `ReadFile`    | `filepath` (string, 1–255 chars)                                                                     | Read an allowed file from the episode workspace.                 |
+| `WriteFile`   | `filepath`, `content` (content ≤ 1M chars)                                                           | Overwrite an allowed workspace file.                             |
+| `RunScript`   | `filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. |
+| `SendEmail`   | `to_email`, `subject`, `body`                                                                        | Queue a mock email (used for the hard task).                     |
+Machine-readable schema: **`GET /schema`** → `action`, or **`GET /tasks`** → `action_schema`.
+---
+## Observation space
+Each `step` / `reset` response includes an observation object (REST also exposes wrapper fields such as `reward` / `done`). The table lists the fields inherited from the OpenEnv base (`done`, `reward`, `metadata`) first, followed by the fields specific to the **DataOps** layer.
+| Field                   | Type                     | Meaning                                                           |
+| ----------------------- | ------------------------ | ----------------------------------------------------------------- |
+| `done`                  | boolean                  | Whether the episode has ended (step limit or terminal condition). |
+| `reward`                | number \| null           | Shaped **step reward** after this transition (trajectory signal). |
+| `metadata`              | object                   | OpenEnv extension bucket (usually empty).                         |
+| `status`                | `"success"` \| `"error"` | Whether the action executed successfully.                         |
+| `message`               | string                   | Short human-readable summary.                                     |
+| `stdout`                | string \| null           | Captured stdout (e.g. script or file read).                       |
+| `stderr`                | string \| null           | Captured stderr.                                                  |
+| `sql_results`           | list of objects \| null  | Row dicts for successful `SELECT`-style outcomes.                 |
+| `email_delivery_status` | string \| null           | Mock send confirmation when applicable.                           |
+| `step_count`            | integer                  | Steps taken in the episode.                                       |
+| `max_steps`             | integer                  | Episode step budget.                                              |
+**Terminal evaluation:** The **grader score** in **[0.0, 1.0]** is returned by **`GET /grader`** (or **`GET /grader/{task_id}`**) and reflects the **final** database, files, and outbox (and, for the hard task, **provenance** constraints). Hackathon-style evaluations typically treat the **grader** as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return **`reward=0.0`** when they neither improve grader state nor unlock a milestone.
+Machine-readable schema: **`GET /schema`** → `observation`.
+---
+## Tasks (descriptions and expected difficulty)
+| Task ID                | Expected difficulty | Description                                                                                                                                                                                                                                                                                                                               |
+| ---------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `task_1_easy_anomaly`  | **Easy**            | The `transactions` table contains valid rows and rows with **NULL** `amount`. The agent must **delete only** the corrupted rows and leave all valid rows **unchanged**, including legitimate seeded zero-value or negative non-null adjustments.                                                                                          |
+| `task_2_medium_syntax` | **Medium**          | `broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must **read**, **patch**, and **run** the script so **`process_data_stream`** produces the correct downstream-ready records on both visible and hidden seeded batches.                                              |
+| `task_3_hard_e2e`      | **Hard**            | **End-to-end incident:** query the correct **`daily_reports`** slice for the **scenario date**, persist results as **`report_data.json`**, **repair** **`format_report.py`**, **run** it on that JSON, then **send exactly one** email whose **body matches** the formatter output, with scenario-specific **recipient** and **subject**. |
+Task list, difficulty labels, and allowed actions per task: **`GET /tasks`** and **`openenv.yaml`**.
+---
+## Setup and usage
+**Prerequisites:** Python **3.12+**, **[uv](https://docs.astral.sh/uv/)**.
+```bash
+uv sync
+cp .env.example .env.dev
+printf 'ENV_FILE=.env.dev\n' > .env
+```
+Repo-root **`.env`** selects the active secondary env file. Use **`.env.dev`** for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files.
+**Run the server** from the repository root so **`HOST`**, **`PORT`**, and **`DEBUG`** from the active env file are honored:
+```bash
+uv run python -m server
+```
+Clients reuse the **`Set-Cookie`** session cookie or **`X-Session-ID`** header from **`POST /reset`** on **`/step`**, **`/state`**, and **`/grader`**.
+**OpenEnv packaging:**
+```bash
+uv run openenv validate
+```
+**Docker:**
+```bash
+printf 'ENV_FILE=.env.dev\n' > .env
+bash build_and_run_image.sh
+```
+The helper script reads repo-root **`.env`** only to resolve **`ENV_FILE`**, then passes that secondary file to `docker run --env-file ...`. The container does **not** receive a merged view of repo-root **`.env`** plus the secondary file. Keeping `.env*` out of the image is intentional; runtime configuration is injected from the host.
+**Baseline inference (local):**
+| Variable               | Purpose                                                                                |
+| ---------------------- | -------------------------------------------------------------------------------------- |
+| `ENV_BASE_URL`         | Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default) |
+| `API_KEY` / `HF_TOKEN` | Exactly one model access credential source                                             |
+| `API_BASE_URL`         | Optional model provider base URL override                                              |
+| `MODEL_NAME`           | Optional chat model ID                                                                 |
+```bash
+export ENV_BASE_URL=http://127.0.0.1:7860
+uv run python inference.py --seed 7 --max-turns 12
+```
+If **`API_BASE_URL`** is unset, `inference.py` defaults to Google's OpenAI-compatible Gemini endpoint for **`API_KEY`** and Hugging Face's router for **`HF_TOKEN`**.
+Flags: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores` (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When `PUBLIC_GRADER_DETAILS=true` and the grader API exposes details, `inference.py` also writes the per-task grader payloads to `stderr`.
+**`POST /baseline`** runs the same script inside the server process; optional JSON body: `task_ids`, `seed`, `max_turns`. If **`ADMIN_API_KEY`** is unset, the route is open. If **`ADMIN_API_KEY`** is set, callers must send **`X-Admin-Key`**. If **`ENV_BASE_URL`** is unset, the server injects **`http://127.0.0.1:$PORT`** into the child process automatically.
+Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.
+**Minimal HTTP smoke test:**
+```bash
+curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
+  -H 'Content-Type: application/json' \
+  -d '{"seed": 7}'
+curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
+  -H 'Content-Type: application/json' \
+  -d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'
+curl -b cookies.txt 'http://127.0.0.1:7860/grader'
+```
+By default **`/grader`** returns **`task_id`** and **`score`** only. Full grader **`details`** require **`PUBLIC_GRADER_DETAILS=true`** or a valid **`X-Admin-Key`** when **`ADMIN_API_KEY`** is set. This does **not** change the mandatory `[START]` / `[STEP]` / `[END]` lines from `inference.py`; it affects the grader API, the optional trailing JSON emitted by `--json-scores`, and the captured `stderr` payloads written by `inference.py`.
+---
+## Baseline scores
+All figures are **terminal grader** scores in **[0.0, 1.0]**. Scores depend on provider, model revision, temperature, and `seed`.
+### Null baseline (no agent actions)
+| Condition                                            | `task_1` | `task_2` | `task_3` | Avg  |
+| ---------------------------------------------------- | -------- | -------- | -------- | ---- |
+| `reset` only (`seed=7`), then grader; **no** `/step` | 0.00     | 0.00     | 0.00     | 0.00 |
+### Reference tool-calling baseline
+`[END] success=true` in the harness logs means the terminal grader reached **1.0** for that task.
+| Model                           | Seed | `task_1_easy_anomaly` | `task_2_medium_syntax` | `task_3_hard_e2e` | Average |
+| ------------------------------- | ---- | --------------------- | ---------------------- | ----------------- | ------- |
+| `gemini-3.1-flash-lite-preview` | 7    | 1.00                  | 1.00                   | 1.00              | 1.00    |
+**Reproducing a baseline run:** With the API server running locally on `7860` and model credentials configured, run:
+```bash
+export MODEL_NAME=gemini-3.1-flash-lite-preview
+export ENV_BASE_URL=http://127.0.0.1:7860
+uv run python inference.py --seed 7 --max-turns 12 --json-scores
+```
+The final line of stdout is a single JSON object with **`scores`**, **`grades`**, **`average`**, **`model`**, and **`metadata`**.
+---
+## Hugging Face Spaces
+There are two methods for running the baseline against a deployed Hugging Face Space:
+1. Running **`inference.py`** externally against the public Space URL:
+```bash
+export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
+uv run python inference.py --seed 7 --max-turns 12 --json-scores
+```
+In this mode, the Space only needs to expose the environment API (`/reset`, `/step`, `/grader`, `/tasks`, `/schema`, `/health`, `/metadata`, `/ws`, `/mcp`). Model credentials are provided on the machine that runs **`inference.py`**, not on the Space.
+2. Sending a `POST` request to the **`/baseline`** endpoint:
+```bash
+curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
+  -H 'Content-Type: application/json' \
+  -d '{"seed": 7, "max_turns": 12}'
+```
+In this mode, the Space itself executes **`inference.py`**. Configure one model credential source on the Space (**`API_KEY`** or **`HF_TOKEN`**). **`MODEL_NAME`** and **`API_BASE_URL`** are optional overrides. **`ENV_BASE_URL`** is not required for **`POST /baseline`** because the server injects **`http://127.0.0.1:$PORT`** when it launches the child `inference.py` process. If **`ADMIN_API_KEY`** is unset, **`POST /baseline`** is open; if it is set, callers must send **`X-Admin-Key`**.
+---
+## API reference
+| Method | Path                 | Purpose                                                       |
+| ------ | -------------------- | ------------------------------------------------------------- |
+| GET    | `/health`            | Liveness                                                      |
+| GET    | `/metadata`          | Name, description, version, task count                        |
+| GET    | `/schema`            | JSON Schemas: action, observation, state                      |
+| GET    | `/tasks`             | Tasks + action/observation/state schemas                      |
+| POST   | `/mcp`               | Minimal JSON-RPC tool-list compatibility stub                 |
+| POST   | `/reset?task_id=...` | New episode; body may include `seed`, `episode_id`            |
+| POST   | `/step`              | One action; optional `timeout_s`                              |
+| GET    | `/state`             | Episode state (`task_id`, `seed`, …)                          |
+| GET    | `/grader`            | Terminal score for active task                                |
+| GET    | `/grader/{task_id}`  | Same; `task_id` must match the active task                    |
+| POST   | `/baseline`          | Subprocess baseline (see [Setup and usage](#setup-and-usage)) |
+| WS     | `/ws`                | OpenEnv WebSocket session                                     |
+---
+## Environment variables (server / container)
+| Variable                 | Purpose                                                                                       |
+| ------------------------ | --------------------------------------------------------------------------------------------- |
+| `HOST`                   | Listen host used by `python -m server` and the container entrypoint                           |
+| `PORT`                   | Listen port used by `python -m server` and the container entrypoint                           |
+| `DEBUG`                  | Enables reload for local `python -m server` runs                                              |
+| `ENV_FILE`               | Repo-relative dotenv loaded after `.env` (override)                                           |
+| `HTTP_SESSION_TIMEOUT_S` | HTTP session idle TTL; max wall time for **`POST /baseline`** child                           |
+| `MAX_HTTP_SESSIONS`      | Concurrent HTTP sessions cap                                                                  |
+| `MAX_WS_SESSIONS`        | Concurrent WebSocket sessions cap                                                             |
+| `ADMIN_API_KEY`          | When set, protects **`POST /baseline`** and lets **`X-Admin-Key`** unlock full grader details |
+| `PUBLIC_GRADER_DETAILS`  | If `true`, public **`/grader`** and **`/grader/{task_id}`** responses include **`details`**   |
+| `COOKIE_SECURE`          | Set `Secure` on session cookies (HTTPS)                                                       |
+| `CORS_ALLOW_ORIGINS`     | Comma-separated origins; empty disables permissive CORS (recommended default)                 |
+---
+## Tests
+```bash
+uv sync --extra dev
+uv run pytest -q
+```
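
The README's action space section describes a JSON envelope for `POST /step` and numeric limits on `RunScript` payloads. The following is a rough client-side sketch of both, not the server's actual validation (which lives in `models.py` and is not shown in full here). The helper names `build_step_body` and `check_run_script_payload` are hypothetical; the limits are the ones stated in the README table.

```python
import json

MAX_ARGS = 20        # README: at most 20 args
MAX_ARG_CHARS = 500  # README: each arg at most 500 chars


def build_step_body(action_type: str, payload: dict) -> str:
    """Wrap an action in the {"action": {...}} envelope described for POST /step."""
    return json.dumps({"action": {"action_type": action_type, "payload": payload}})


def check_run_script_payload(payload: dict) -> list[str]:
    """Return violations of the documented RunScript limits (empty list if none)."""
    problems = []
    filepath = payload.get("filepath", "")
    # "must be a *.py basename": no directory separators, and a .py suffix.
    if "/" in filepath or "\\" in filepath or not filepath.endswith(".py"):
        problems.append("filepath must be a *.py basename")
    args = payload.get("args") or []
    if len(args) > MAX_ARGS:
        problems.append("too many args")
    if any(len(arg) > MAX_ARG_CHARS for arg in args):
        problems.append("arg too long")
    return problems
```

A caller could run `check_run_script_payload` before sending a step, to avoid spending a turn on a request the server would reject.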

__init__.py ADDED

	@@ -0,0 +1,14 @@

+"""DataOps Environment — OpenEnv-compliant enterprise data pipeline remediation environment."""
+try:
+    from .client import DataOpsEnv
+    from .models import DataOpsAction, DataOpsObservation
+except ImportError:  # pragma: no cover — flat imports when loaded as top-level __init__ (e.g. pytest)
+    from client import DataOpsEnv
+    from models import DataOpsAction, DataOpsObservation
+__all__ = [
+    "DataOpsAction",
+    "DataOpsObservation",
+    "DataOpsEnv",
+]

client.py ADDED

	@@ -0,0 +1,93 @@

+"""Typed clients for the DataOpsEnv environment."""
+from typing import Optional
+import requests
+from openenv.core.client_types import StepResult
+from openenv.core.env_client import EnvClient
+from models import DataOpsAction, DataOpsObservation, DataOpsState
+class DataOpsEnv(EnvClient[DataOpsAction, DataOpsObservation, DataOpsState]):
+    """Native OpenEnv WebSocket client for persistent sessions."""
+    def _step_payload(self, action: DataOpsAction) -> dict:
+        return action.model_dump()
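+    def _parse_result(self, payload: dict) -> StepResult[DataOpsObservation]:
+        # Merge wrapper-level reward/done into the observation dict first, so a
+        # payload that repeats those keys cannot raise a duplicate-keyword TypeError.
+        observation_payload = dict(payload.get("observation", {}))
+        observation_payload["reward"] = payload.get("reward")
+        observation_payload["done"] = payload.get("done", False)
+        observation = DataOpsObservation(**observation_payload)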
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: dict) -> DataOpsState:
+        return DataOpsState(**payload)
+class DataOpsEnvClient:
+    """Compatibility HTTP client for the validator-facing REST API."""
+    def __init__(
+        self, base_url: str = "http://127.0.0.1:7860", timeout: float = 30.0
+    ) -> None:
+        self.base_url = base_url.rstrip("/")
+        self.timeout = timeout
+        self._session = requests.Session()
+    @staticmethod
+    def _parse_observation(payload: dict) -> DataOpsObservation:
+        observation_payload = dict(payload.get("observation", {}))
+        if "reward" in payload:
+            observation_payload["reward"] = payload["reward"]
+        if "done" in payload:
+            observation_payload["done"] = payload["done"]
+        return DataOpsObservation(**observation_payload)
+    def reset(
+        self, task_id: str = "task_1_easy_anomaly", seed: Optional[int] = None,
+    ) -> DataOpsObservation:
+        resp = self._session.post(
+            f"{self.base_url}/reset",
+            params={"task_id": task_id},
+            json={"seed": seed},
+            timeout=self.timeout,
+        )
+        resp.raise_for_status()
+        return self._parse_observation(resp.json())
+    def step(self, action: DataOpsAction) -> DataOpsObservation:
+        resp = self._session.post(
+            f"{self.base_url}/step",
+            json={"action": action.model_dump()},
+            timeout=self.timeout,
+        )
+        resp.raise_for_status()
+        return self._parse_observation(resp.json())
+    def state(self) -> DataOpsState:
+        resp = self._session.get(f"{self.base_url}/state", timeout=self.timeout)
+        resp.raise_for_status()
+        return DataOpsState(**resp.json())
+    def grade(self, task_id: Optional[str] = None) -> dict:
+        url = f"{self.base_url}/grader/{task_id}" if task_id else f"{self.base_url}/grader"
+        resp = self._session.get(url, timeout=self.timeout)
+        resp.raise_for_status()
+        return resp.json()
+    def close(self) -> None:
+        self._session.close()
+    def __enter__(self) -> "DataOpsEnvClient":
+        return self
+    def __exit__(self, *args: object) -> None:
+        self.close()
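
`DataOpsEnvClient._parse_observation` above folds the wrapper-level `reward` and `done` fields into the observation before building the Pydantic model. Below is a dependency-free sketch of that merge using plain dicts. `merge_observation` is a hypothetical name, and the sample payloads in the test are illustrative only.

```python
def merge_observation(payload: dict) -> dict:
    """Copy the nested observation and let wrapper-level reward/done override it."""
    observation = dict(payload.get("observation", {}))
    for key in ("reward", "done"):
        if key in payload:
            observation[key] = payload[key]
    return observation
```

Copying with `dict(...)` keeps the original response payload unmodified, and a wrapper field wins only when the wrapper actually carries it.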

data/__init__.py ADDED

File without changes

data/init_db.py ADDED

	@@ -0,0 +1,130 @@

+import logging
+import os
+import shutil
+import sqlite3
+from server.task_specs import TaskScenarioBundle, build_task_scenario
+logger = logging.getLogger(__name__)
+WORKSPACE_ROOT = os.path.join(os.path.dirname(os.path.dirname(__file__)), "workspace")
+WORKSPACE_DIR = WORKSPACE_ROOT
+def setup_workspace(
+    workspace_dir: str | None = None, *, scenario: TaskScenarioBundle | None = None
+) -> str:
+    """Initialise an isolated episode workspace from the seeded scenario."""
+    target_workspace = workspace_dir or WORKSPACE_DIR
+    target_db_path = os.path.join(target_workspace, "mock_warehouse.db")
+    resolved_scenario = scenario or build_task_scenario("task_1_easy_anomaly", seed=0)
+    os.makedirs(target_workspace, exist_ok=True)
+    _clear_workspace(target_workspace)
+    _init_database(target_db_path, resolved_scenario)
+    _write_seeded_files(target_workspace, resolved_scenario)
+    logger.info(
+        "Workspace reset complete: task=%s seed=%s db=%s",
+        resolved_scenario.task_id,
+        resolved_scenario.seed,
+        target_db_path,
+    )
+    return target_db_path
+def _clear_workspace(workspace_dir: str) -> None:
+    for entry in os.listdir(workspace_dir):
+        path = os.path.join(workspace_dir, entry)
+        try:
+            if os.path.isdir(path):
+                shutil.rmtree(path)
+            else:
+                os.remove(path)
+        except FileNotFoundError:
+            continue
+def _init_database(db_path: str, scenario: TaskScenarioBundle) -> None:
+    conn = sqlite3.connect(db_path)
+    try:
+        c = conn.cursor()
+        c.execute(
+            """
+            CREATE TABLE transactions (
+                id INTEGER PRIMARY KEY,
+                user_id INTEGER NOT NULL,
+                amount REAL,
+                status TEXT NOT NULL
+            )
+            """
+        )
+        c.execute(
+            """
+            CREATE TABLE daily_reports (
+                id INTEGER PRIMARY KEY,
+                report_date TEXT NOT NULL,
+                department TEXT NOT NULL,
+                revenue REAL NOT NULL,
+                expenses REAL NOT NULL,
+                headcount INTEGER NOT NULL
+            )
+            """
+        )
+        if scenario.task_1:
+            c.executemany(
+                "INSERT INTO transactions VALUES (?, ?, ?, ?)",
+                [
+                    (row["id"], row["user_id"], row["amount"], row["status"])
+                    for row in scenario.task_1.all_rows
+                ],
+            )
+        else:
+            c.executemany(
+                "INSERT INTO transactions VALUES (?, ?, ?, ?)",
+                [(1, 9000, 100.0, "success")],
+            )
+        if scenario.task_3:
+            c.executemany(
+                "INSERT INTO daily_reports VALUES (?, ?, ?, ?, ?, ?)",
+                [
+                    (
+                        row["id"],
+                        row["report_date"],
+                        row["department"],
+                        row["revenue"],
+                        row["expenses"],
+                        row["headcount"],
+                    )
+                    for row in scenario.task_3.all_rows
+                ],
+            )
+        conn.commit()
+    finally:
+        conn.close()
+def _write_seeded_files(workspace_dir: str, scenario: TaskScenarioBundle) -> None:
+    if scenario.task_2:
+        with open(
+            os.path.join(workspace_dir, "broken_pipeline.py"),
+            "w",
+            encoding="utf-8",
+        ) as f:
+            f.write(scenario.task_2.broken_script)
+    if scenario.task_3:
+        with open(
+            os.path.join(workspace_dir, "format_report.py"),
+            "w",
+            encoding="utf-8",
+        ) as f:
+            f.write(scenario.task_3.broken_script)
+if __name__ == "__main__":
+    logging.basicConfig(level=logging.INFO)
+    setup_workspace()
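
To see what the `transactions` schema above implies for the easy task, here is a small in-memory sketch. It reuses the `CREATE TABLE` statement from `_init_database`, but the rows are made up for illustration and are not the seeded scenario data. It shows that `amount IS NULL` removes only the corrupted rows and leaves zero and negative amounts in place.

```python
import sqlite3


def delete_null_amounts() -> tuple[int, list[int]]:
    """Seed a throwaway table, delete NULL-amount rows, return (deleted, remaining ids)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            """
            CREATE TABLE transactions (
                id INTEGER PRIMARY KEY,
                user_id INTEGER NOT NULL,
                amount REAL,
                status TEXT NOT NULL
            )
            """
        )
        # Hypothetical rows: two corrupted (NULL amount), three valid.
        rows = [
            (1, 9000, 100.0, "success"),
            (2, 9001, None, "success"),
            (3, 9002, 0.0, "success"),
            (4, 9003, -25.5, "refund"),
            (5, 9004, None, "failed"),
        ]
        conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
        deleted = conn.execute(
            "DELETE FROM transactions WHERE amount IS NULL"
        ).rowcount
        remaining = [r[0] for r in conn.execute("SELECT id FROM transactions ORDER BY id")]
        return deleted, remaining
    finally:
        conn.close()
```

A predicate such as `amount <= 0` would also drop the legitimate zero-value and negative rows, which the task description says must remain unchanged.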

env_loader.py ADDED

	@@ -0,0 +1,82 @@

+"""Load local env files, while allowing externally injected container env vars."""
+import logging
+import os
+from pathlib import Path
+from dotenv import dotenv_values, load_dotenv
+logger = logging.getLogger(__name__)
+_PROJECT_ROOT = Path(__file__).resolve().parent
+_RUNTIME_ENV_KEYS = (
+    "ENV_FILE",
+    "HOST",
+    "PORT",
+    "DEBUG",
+    "ENV_BASE_URL",
+    "ADMIN_API_KEY",
+    "PUBLIC_GRADER_DETAILS",
+    "COOKIE_SECURE",
+    "HTTP_SESSION_TIMEOUT_S",
+    "CORS_ALLOW_ORIGINS",
+    "MAX_HTTP_SESSIONS",
+    "MAX_WS_SESSIONS",
+    "API_KEY",
+    "HF_TOKEN",
+    "MODEL_NAME",
+    "API_BASE_URL",
+)
+def _has_external_runtime_config() -> bool:
+    return any(bool(os.getenv(key, "").strip()) for key in _RUNTIME_ENV_KEYS)
+def _resolve_env_file(env_file_name: str) -> Path:
+    env_path = (_PROJECT_ROOT / env_file_name).resolve()
+    try:
+        env_path.relative_to(_PROJECT_ROOT)
+    except ValueError as exc:
+        raise ValueError(
+            f"ENV_FILE '{env_file_name}' must resolve inside the project root."
+        ) from exc
+    return env_path
+def load_env() -> None:
+    """Read repo-root .env to locate the active secondary env file."""
+    dot_env = _PROJECT_ROOT / ".env"
+    if not dot_env.exists():
+        if _has_external_runtime_config():
+            logger.debug(
+                ".env not found at %s — assuming runtime env vars were injected externally",
+                dot_env,
+            )
+            return
+        logger.warning(
+            ".env not found at %s — local runs expect it to define ENV_FILE for the active env file",
+            dot_env,
+        )
+        return
+    load_dotenv(dot_env, override=False)
+    env_file_name = str(
+        (dotenv_values(dot_env).get("ENV_FILE") or os.getenv("ENV_FILE") or "")
+    ).strip()
+    if not env_file_name:
+        logger.debug(".env did not specify ENV_FILE — no secondary file loaded")
+        return
+    try:
+        env_file = _resolve_env_file(env_file_name)
+    except ValueError as exc:
+        logger.warning("%s", exc)
+        return
+    if not env_file.exists():
+        logger.warning("ENV_FILE '%s' not found at %s", env_file_name, env_file)
+        return
+    load_dotenv(env_file, override=True)
+    logger.debug("Loaded environment variables from %s", env_file)
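
`_resolve_env_file` above guards against an `ENV_FILE` value that points outside the project root. The sketch below reproduces that containment check against a temporary directory, so it can be tried without the repository. `resolve_inside` and `escapes_root` are hypothetical names, and the file names in the usage note are examples only.

```python
import tempfile
from pathlib import Path


def resolve_inside(root: Path, name: str) -> Path:
    """Resolve name under root; raise ValueError if the result leaves root."""
    root = root.resolve()
    candidate = (root / name).resolve()
    candidate.relative_to(root)  # raises ValueError when candidate is outside root
    return candidate


def escapes_root(name: str) -> bool:
    """Report whether name would resolve outside a throwaway project root."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            resolve_inside(Path(tmp), name)
        except ValueError:
            return True
        return False
```

Under this check `.env.dev` stays inside the root, while `../secrets.env` and an absolute path such as `/etc/passwd` are both rejected, because joining an absolute path onto `root` discards `root` entirely.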

inference.py ADDED

	@@ -0,0 +1,589 @@

+"""
+DataOps benchmark runner: drives the sandbox over HTTP (`/reset`, `/step`, `/grader`) with an OpenAI
+tool-calling loop. Tool schemas are task-scoped (e.g. send_email only for the hard E2E task).
+Flow per task: reset → chat completions (prefer `tool_choice="required"`) → validate tool args → POST each action →
+append tool/observation messages until the env reports `done` or `max_turns` → GET grader score. Success is
+derived from the score vs `SUCCESS_SCORE_THRESHOLD`.
+Stdout is the harness protocol only: one `[START]`, one `[STEP]` per env step, one `[END]` (always). Use
+`--json-scores` to append a single JSON object (scores, average, metadata) for `/baseline` ingestion.
+CLI: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores`. The environment HTTP base URL comes from
+`ENV_BASE_URL`, or if unset `http://127.0.0.1:$PORT` (default port 7860). Auth uses either `API_KEY` or
+`HF_TOKEN`. `API_BASE_URL` is optional: when omitted, the runner defaults to Google's OpenAI-compatible Gemini
+endpoint for `API_KEY` and Hugging Face's router for `HF_TOKEN`.
+Library logging is disabled so parsers see only these lines.
+"""
+from __future__ import annotations
+import argparse
+import asyncio
+import json
+import logging
+import os
+import re
+import sys
+import zlib
+from datetime import datetime, timezone
+from typing import Any, Optional, Type
+import requests
+from openai import BadRequestError, OpenAI
+from openai.types.chat import ChatCompletionMessageParam, ChatCompletionToolParam
+from pydantic import BaseModel, ValidationError
+from env_loader import load_env
+from models import (
+    ExecuteSQLPayload,
+    ReadFilePayload,
+    RunScriptPayload,
+    SendEmailPayload,
+    WriteFilePayload,
+)
+from server.task_specs import TASK_IDS, TASK_METADATA
+# Silence all library logging (httpx, openai, urllib3, env_loader, etc.).
+logging.disable(logging.CRITICAL)
+load_env()
+DEFAULT_GOOGLE_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"
+DEFAULT_HF_OPENAI_BASE_URL = "https://router.huggingface.co/v1"
+_DEFAULT_PORT = int(os.getenv("PORT", "7860"))
+ENV_BASE_URL = os.getenv("ENV_BASE_URL") or f"http://127.0.0.1:{_DEFAULT_PORT}"
+MODEL_NAME = os.getenv("MODEL_NAME") or "gemini-3.1-flash-lite-preview"
+BENCHMARK = "dataops_env"
+MAX_TURNS = 12
+SUCCESS_SCORE_THRESHOLD = 1.0
+_TOOL_HELP: dict[str, str] = {
+    "execute_sql": "execute_sql — SQL over the task warehouse (field: query).",
+    "read_file": "read_file — read a workspace file (field: filepath).",
+    "write_file": "write_file — overwrite a file (fields: filepath, content).",
+    "invoke_python": "invoke_python — run a Python script (fields: filepath, optional args).",
+    "send_email": "send_email — send email (fields: to_email, subject, body).",
+}
+_ACTION_TO_TOOL: dict[str, str] = {
+    "ExecuteSQL": "execute_sql",
+    "ReadFile": "read_file",
+    "WriteFile": "write_file",
+    "RunScript": "invoke_python",
+    "SendEmail": "send_email",
+}
+def _allowed_tool_names_csv(task_id: str) -> str:
+    order = (
+        "execute_sql",
+        "read_file",
+        "write_file",
+        "invoke_python",
+        "send_email",
+    )
+    allowed = {_ACTION_TO_TOOL[a] for a in TASK_METADATA[task_id].allowed_actions}
+    return ", ".join(t for t in order if t in allowed)
+def _system_prompt_for_task(task_id: str) -> str:
+    lines = [
+        _TOOL_HELP[t]
+        for t in (
+            "execute_sql",
+            "read_file",
+            "write_file",
+            "invoke_python",
+            "send_email",
+        )
+        if t in {_ACTION_TO_TOOL[a] for a in TASK_METADATA[task_id].allowed_actions}
+    ]
+    tools_block = "\n".join(f"    - {line}" for line in lines)
+    return f"""\
+You are an expert DataOps agent in a task-scoped benchmark. Only the tools listed below exist for this task — do not assume other actions are available.
+Available tools:
+{tools_block}
+Rules:
+- Always read files before modifying them when read_file is available.
+- After writing a fix, run the script to verify it works when invoke_python is available.
+- Be precise. Do not drop tables. Do not guess — inspect first.
+- For tasks that include send_email, match subject and body to the task description exactly.
+"""
+TASK_PROMPTS = {
+    "task_1_easy_anomaly": (
+        "Solve the seeded cleanup task carefully. Inspect before mutating. Only NULL-amount rows are corrupted; preserve every non-null row exactly, including legitimate zero or negative adjustments."
+    ),
+    "task_2_medium_syntax": (
+        "Solve the seeded script-repair task. Read the file, make the minimal correct fix, and verify with execution."
+    ),
+    "task_3_hard_e2e": (
+        "Solve the seeded incident task end to end. Use SQL for the exact slice, write the exact JSON file, "
+        "repair the formatter, execute it, and email the exact generated report."
+    ),
+}
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(
+    step: int, action: str, reward: float, done: bool, error: Optional[str]
+) -> None:
+    error_val = error if error else "null"
+    done_val = str(done).lower()
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+def _public_grader_details_enabled() -> bool:
+    return os.getenv("PUBLIC_GRADER_DETAILS", "").strip().lower() in {"1", "true", "yes"}
+def _emit_grader_details_to_stderr(grade: dict[str, Any]) -> None:
+    if not _public_grader_details_enabled():
+        return
+    if "details" not in grade:
+        return
+    print(json.dumps(grade, ensure_ascii=False), file=sys.stderr, flush=True)
+def _request_json(
+    http: requests.Session,
+    method: str,
+    path: str,
+    *,
+    timeout: float,
+    **kwargs: Any,
+) -> dict[str, Any]:
+    response = http.request(method, f"{ENV_BASE_URL}{path}", timeout=timeout, **kwargs)
+    response.raise_for_status()
+    return response.json()
+def _build_tools(task_id: str) -> list[ChatCompletionToolParam]:
+    defs: dict[str, tuple[str, Type[BaseModel]]] = {
+        "execute_sql": (
+            "Run a task-scoped SQL query against the SQLite warehouse DB.",
+            ExecuteSQLPayload,
+        ),
+        "read_file": ("Read a file in the workspace.", ReadFilePayload),
+        "write_file": ("Overwrite a file with new content.", WriteFilePayload),
+        "invoke_python": (
+            "Execute a Python script in the workspace (optional args).",
+            RunScriptPayload,
+        ),
+        "send_email": ("Send a formatted email notification.", SendEmailPayload),
+    }
+    allowed_names = {_ACTION_TO_TOOL[a] for a in TASK_METADATA[task_id].allowed_actions}
+    return [
+        {
+            "type": "function",
+            "function": {
+                "name": name,
+                "description": defs[name][0],
+                "parameters": defs[name][1].model_json_schema(),
+            },
+        }
+        for name in (
+            "execute_sql",
+            "read_file",
+            "write_file",
+            "invoke_python",
+            "send_email",
+        )
+        if name in allowed_names
+    ]
+def _tool_call_to_action(name: str, arguments: str) -> dict[str, Any]:
+    if name == "run_script":
+        name = "invoke_python"
+    mapping: dict[str, tuple[str, Type[BaseModel]]] = {
+        "execute_sql": ("ExecuteSQL", ExecuteSQLPayload),
+        "read_file": ("ReadFile", ReadFilePayload),
+        "write_file": ("WriteFile", WriteFilePayload),
+        "invoke_python": ("RunScript", RunScriptPayload),
+        "send_email": ("SendEmail", SendEmailPayload),
+    }
+    if name not in mapping:
+        raise ValueError(f"Unknown tool: {name}")
+    action_type, model = mapping[name]
+    data = json.loads(arguments) if (arguments or "").strip() else {}
+    payload = model.model_validate(data).model_dump()
+    return {"action_type": action_type, "payload": payload}
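+# Recover from tool names that embed their arguments, e.g. `read_file({"filepath": "x"})`:
+# group 1 is the bare identifier, group 2 is the embedded JSON object.
+_MALFORMED_TOOL = re.compile(
+    r"^([a-zA-Z_][a-zA-Z0-9_]*)[\s,=\(]+(\{.*\})\)?\s*$", re.DOTALL
+)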
+def _normalize_tool_name_and_args(name: str, arguments: str) -> tuple[str, str]:
+    name = (name or "").strip()
+    arguments = (arguments or "").strip()
+    m = _MALFORMED_TOOL.match(name)
+    if m:
+        base, embedded = m.group(1).strip(), m.group(2).strip()
+        if not arguments:
+            return base, embedded
+    return name, arguments
+def _action_from_tool_call(tc: Any) -> dict[str, Any]:
+    name, arguments = _normalize_tool_name_and_args(
+        tc.function.name or "", tc.function.arguments or ""
+    )
+    return _tool_call_to_action(name, arguments)
+def _action_str(action_payload: dict[str, Any]) -> str:
+    at = action_payload.get("action_type", "")
+    pl = action_payload.get("payload") or {}
+    raw = f"{at}({json.dumps(pl, ensure_ascii=False)})"
+    if len(raw) > 1200:
+        return raw[:600] + "..." + raw[-550:]
+    return raw
+def _obs_error(obs: dict[str, Any]) -> Optional[str]:
+    if obs.get("status") != "error":
+        return None
+    msg = obs.get("message")
+    if isinstance(msg, str) and msg.strip():
+        return msg.replace("\n", " ").strip()
+    return None
+def _resolve_api_base_url() -> str:
+    explicit = os.getenv("API_BASE_URL", "").strip()
+    if explicit:
+        return explicit
+    if os.getenv("HF_TOKEN", "").strip():
+        return DEFAULT_HF_OPENAI_BASE_URL
+    return DEFAULT_GOOGLE_OPENAI_BASE_URL
+API_BASE_URL = _resolve_api_base_url()
+def _openai_client() -> OpenAI:
+    key = (os.getenv("HF_TOKEN") or os.getenv("API_KEY") or "").strip()
+    if not key:
+        print(
+            "[inference] Missing API_KEY or HF_TOKEN for model access.",
+            file=sys.stderr,
+            flush=True,
+        )
+        sys.exit(1)
+    return OpenAI(api_key=key, base_url=API_BASE_URL)
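+def _llm_seed(env_seed: int | None, task_id: str) -> int | None:
+    """Mix the env seed with a CRC32 of the task id into a non-negative 31-bit LLM seed (None if unseeded)."""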
+    if env_seed is None:
+        return None
+    mixed = (int(env_seed) * 1_000_003) ^ (zlib.crc32(task_id.encode()) & 0xFFFFFFFF)
+    return mixed & 0x7FFFFFFF
+def _create_chat_completion(
+    client: OpenAI,
+    messages: list[ChatCompletionMessageParam],
+    tools: list[ChatCompletionToolParam],
+    *,
+    task_id: str,
+    env_seed: int | None,
+) -> Any:
+    """Prefer tool_choice=required so the model cannot end a turn without a tool call."""
+    kwargs: dict[str, Any] = {
+        "model": MODEL_NAME,
+        "messages": messages,
+        "tools": tools,
+        "parallel_tool_calls": False,
+        "temperature": 0,
+        "top_p": 1.0,
+    }
+    llm_seed = _llm_seed(env_seed, task_id)
+    if llm_seed is not None:
+        kwargs["seed"] = llm_seed
+    def _call(tool_choice: str) -> Any:
+        return client.chat.completions.create(**kwargs, tool_choice=tool_choice)
+    try:
+        return _call("required")
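+    except BadRequestError as e:
+        last_exc = e
+        err = str(e).lower()
+        if "seed" in err and llm_seed is not None:
+            # The backend rejected `seed`; retry once without it.
+            kwargs.pop("seed", None)
+            try:
+                return _call("required")
+            except BadRequestError as e2:
+                last_exc = e2
+                err = str(e2).lower()
+        if not any(x in err for x in ("tool_choice", "required", "unsupported")):
+            # Re-raise the most recent failure rather than the stale seed error.
+            raise last_exc
+        return _call("auto")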
+def run_task(
+    client: OpenAI,
+    http: requests.Session,
+    task_id: str,
+    *,
+    max_turns: int,
+    seed: int | None,
+) -> float:
+    rewards: list[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    try:
+        tools = _build_tools(task_id)
+        names_csv = _allowed_tool_names_csv(task_id)
+        reset_resp = _request_json(
+            http,
+            "POST",
+            "/reset",
+            timeout=10,
+            params={"task_id": task_id},
+            json={} if seed is None else {"seed": seed},
+        )
+        reset_obs = reset_resp.get("observation", reset_resp)
+        messages: list[ChatCompletionMessageParam] = [
+            {"role": "system", "content": _system_prompt_for_task(task_id)},
+            {
+                "role": "user",
+                "content": TASK_PROMPTS[task_id]
+                + f"\n\nEnvironment says: {reset_obs['message']}",
+            },
+        ]
+        done = False
+        step_num = 0
+        no_tool_streak = 0
+        for turn in range(1, max_turns + 1):
+            try:
+                response = _create_chat_completion(
+                    client,
+                    messages,
+                    tools,
+                    task_id=task_id,
+                    env_seed=seed,
+                )
+            except BadRequestError as e:
+                err_str = str(e).lower()
+                if "tool" not in err_str and "function" not in err_str:
+                    raise
+                if messages and messages[-1].get("role") == "assistant":  # type: ignore[union-attr]
+                    messages.pop()
+                messages.append(
+                    {
+                        "role": "user",
+                        "content": (
+                            "IMPORTANT: Call tools using ONLY these exact names: "
+                            f"{names_csv}. "
+                            "Put ALL parameters inside the tool's JSON arguments field. "
+                            "Do NOT embed parameters in the tool name itself."
+                        ),
+                    }
+                )
+                try:
+                    response = _create_chat_completion(
+                        client,
+                        messages,
+                        tools,
+                        task_id=task_id,
+                        env_seed=seed,
+                    )
+                except BadRequestError:
+                    break
+            msg = response.choices[0].message
+            if not msg.tool_calls:
+                no_tool_streak += 1
+                if no_tool_streak > 3:
+                    break
+                messages.append(msg)  # type: ignore[arg-type]
+                messages.append(
+                    {
+                        "role": "user",
+                        "content": (
+                            f"You must respond with exactly one tool call ({names_csv}). "
+                            "Do not reply with plain text only."
+                        ),
+                    }
+                )
+                continue
+            no_tool_streak = 0
+            messages.append(msg)  # type: ignore[arg-type]
+            for tc in msg.tool_calls:
+                try:
+                    action_payload = _action_from_tool_call(tc)
+                except (json.JSONDecodeError, ValidationError, ValueError) as e:
+                    messages.append(
+                        {
+                            "role": "tool",
+                            "tool_call_id": tc.id,
+                            "content": f"Invalid tool arguments: {e}",
+                        }
+                    )
+                    continue
+                step_num += 1
+                step_resp = _request_json(
+                    http,
+                    "POST",
+                    "/step",
+                    timeout=30,
+                    json={"action": action_payload},
+                )
+                obs = step_resp.get("observation", step_resp)
+                reward_raw = step_resp.get("reward")
+                reward = 0.0 if reward_raw is None else float(reward_raw)
+                done = step_resp.get("done", False)
+                rewards.append(reward)
+                steps_taken = step_num
+                err = _obs_error(obs if isinstance(obs, dict) else {})
+                log_step(
+                    step=step_num,
+                    action=_action_str(action_payload),
+                    reward=reward,
+                    done=done,
+                    error=err,
+                )
+                messages.append(
+                    {"role": "tool", "tool_call_id": tc.id, "content": json.dumps(obs)}
+                )
+                if done:
+                    break
+            if done:
+                break
+        grade = _request_json(http, "GET", f"/grader/{task_id}", timeout=10)
+        _emit_grader_details_to_stderr(grade)
+        score = float(grade["score"])
+        score = min(max(score, 0.0), 1.0)
+        success = score >= SUCCESS_SCORE_THRESHOLD
+    except Exception as exc:
+        print(
+            f"[inference] task={task_id} failed: {exc!r}", file=sys.stderr, flush=True
+        )
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+    return score
+def _parse_args() -> argparse.Namespace:
+    p = argparse.ArgumentParser(
+        description="DataOpsEnv inference (OpenAI client; protocol lines on stdout)."
+    )
+    p.add_argument(
+        "--task",
+        action="append",
+        choices=TASK_IDS,
+        dest="tasks",
+        help="Run only the selected task(s). Defaults to all tasks.",
+    )
+    p.add_argument(
+        "--seed",
+        type=int,
+        default=None,
+        help="Environment seed for /reset; also used for LLM seed when the API supports it.",
+    )
+    p.add_argument(
+        "--max-turns",
+        type=int,
+        default=MAX_TURNS,
+        help=f"Maximum tool-using turns per task (default: {MAX_TURNS}).",
+    )
+    p.add_argument(
+        "--json-scores",
+        action="store_true",
+        help="Print a final JSON object with scores to stdout (for POST /baseline).",
+    )
+    return p.parse_args()
+def _run_inference_sync(args: argparse.Namespace) -> None:
+    client = _openai_client()
+    scores: dict[str, float] = {}
+    grades: dict[str, dict[str, Any]] = {}
+    task_ids = args.tasks or list(TASK_PROMPTS)
+    with requests.Session() as http:
+        for task_id in task_ids:
+            scores[task_id] = run_task(
+                client,
+                http,
+                task_id,
+                max_turns=max(1, int(args.max_turns)),
+                seed=args.seed,
+            )
+            if args.json_scores:
+                try:
+                    grades[task_id] = _request_json(
+                        http,
+                        "GET",
+                        f"/grader/{task_id}",
+                        timeout=10,
+                    )
+                except Exception:
+                    grades[task_id] = {
+                        "task_id": task_id,
+                        "score": scores[task_id],
+                    }
+    if args.json_scores:
+        avg = sum(scores.values()) / len(scores)
+        payload = {
+            "scores": scores,
+            "grades": grades,
+            "average": round(avg, 4),
+            "model": MODEL_NAME,
+            "metadata": {
+                "env_base_url": ENV_BASE_URL,
+                "seed": args.seed,
+                "max_turns": max(1, int(args.max_turns)),
+                "tasks": task_ids,
+                "generated_at_utc": datetime.now(timezone.utc).isoformat(),
+                "model_base_url": str(getattr(client, "base_url", "")),
+            },
+        }
+        print(json.dumps(payload), flush=True)
+async def main() -> None:
+    args = _parse_args()
+    await asyncio.to_thread(_run_inference_sync, args)
+if __name__ == "__main__":
+    asyncio.run(main())
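
The stdout protocol emitted by `log_start`, `log_step`, and `log_end` is regular enough to parse with three patterns. The following is a minimal sketch of a consumer and is not part of the repository: `parse_protocol_line` and the pattern names are hypothetical, and it assumes that neither the action text nor the error text contains a second ` reward=` / ` done=` / ` error=` field sequence.

```python
from __future__ import annotations

import re

# Patterns mirror the f-strings used by log_start / log_step / log_end.
_START = re.compile(r"^\[START\] task=(\S+) env=(\S+) model=(\S+)$")
_STEP = re.compile(
    r"^\[STEP\] step=(\d+) action=(.*) reward=(-?\d+\.\d{2}) "
    r"done=(true|false) error=(.*)$"
)
_END = re.compile(
    r"^\[END\] success=(true|false) steps=(\d+) score=(\d+\.\d{3}) rewards=(.*)$"
)


def parse_protocol_line(line: str) -> dict | None:
    """Return a dict for a protocol line, or None for any other line."""
    line = line.rstrip("\n")
    if m := _START.match(line):
        return {"kind": "start", "task": m[1], "env": m[2], "model": m[3]}
    if m := _STEP.match(line):
        return {
            "kind": "step",
            "step": int(m[1]),
            "action": m[2],
            "reward": float(m[3]),
            "done": m[4] == "true",
            # log_step prints the literal "null" when there is no error.
            "error": None if m[5] == "null" else m[5],
        }
    if m := _END.match(line):
        # An episode with zero steps prints an empty rewards field.
        rewards = [float(r) for r in m[4].split(",")] if m[4] else []
        return {
            "kind": "end",
            "success": m[1] == "true",
            "steps": int(m[2]),
            "score": float(m[3]),
            "rewards": rewards,
        }
    return None
```

Lines that are not protocol lines, such as the JSON object printed by `--json-scores`, yield `None`, so a consumer can route them to a separate JSON parser.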

models.py ADDED Viewed

	@@ -0,0 +1,113 @@

+import re
+from typing import Any, Dict, List, Literal, Optional
+from openenv.core.env_server import (
+    Action as BaseAction,
+)
+from openenv.core.env_server import (
+    Observation as BaseObservation,
+)
+from openenv.core.env_server import (
+    State as BaseState,
+)
+from pydantic import BaseModel, Field, field_validator
+# ── Action Payload Models (Pydantic-validated) ─────────────────────
+class ExecuteSQLPayload(BaseModel):
+    query: str = Field(..., min_length=1, max_length=2000)
+class ReadFilePayload(BaseModel):
+    filepath: str = Field(..., min_length=1, max_length=255)
+class WriteFilePayload(BaseModel):
+    filepath: str = Field(..., min_length=1, max_length=255)
+    content: str = Field(..., max_length=1_000_000)
+class RunScriptPayload(BaseModel):
+    filepath: str = Field(..., min_length=1, max_length=255)
+    args: List[str] = Field(default_factory=list, max_length=20)
+    @field_validator("filepath")
+    @classmethod
+    def must_be_safe_script_name(cls, v: str) -> str:
+        basename = v.rsplit("/", 1)[-1]
+        # fullmatch: a `$`-anchored match would also accept a trailing newline.
+        if not re.fullmatch(r"[a-zA-Z0-9_\-]+\.py", basename):
+            raise ValueError("Script name must be alphanumeric with .py extension.")
+        return v
+    @field_validator("args")
+    @classmethod
+    def args_must_be_safe(cls, v: list[str]) -> list[str]:
+        for arg in v:
+            if not isinstance(arg, str) or len(arg) > 500:
+                raise ValueError("Each arg must be a string of at most 500 chars.")
+        return v
+class SendEmailPayload(BaseModel):
+    to_email: str = Field(..., max_length=320)
+    subject: str = Field(..., min_length=1, max_length=500)
+    body: str = Field(..., min_length=1, max_length=100_000)
+    @field_validator("to_email")
+    @classmethod
+    def must_look_like_email(cls, v: str) -> str:
+        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v):
+            raise ValueError("Invalid email format.")
+        return v
+ACTION_TYPE = Literal["ExecuteSQL", "ReadFile", "WriteFile", "RunScript", "SendEmail"]
+PAYLOAD_MODELS: dict[str, type[BaseModel]] = {
+    "ExecuteSQL": ExecuteSQLPayload,
+    "ReadFile": ReadFilePayload,
+    "WriteFile": WriteFilePayload,
+    "RunScript": RunScriptPayload,
+    "SendEmail": SendEmailPayload,
+}
+# ── Action Model (extends OpenEnv Action) ──────────────────────────
+class DataOpsAction(BaseAction):
+    action_type: ACTION_TYPE = Field(
+        ..., description="One of: ExecuteSQL, ReadFile, WriteFile, RunScript, SendEmail"
+    )
+    payload: Dict[str, Any] = Field(
+        ..., description="Parameters for the chosen action type."
+    )
+# ── Observation Model (extends OpenEnv Observation) ────────────────
+class DataOpsObservation(BaseObservation):
+    status: Literal["success", "error"] = "error"
+    message: str = ""
+    stdout: Optional[str] = None
+    stderr: Optional[str] = None
+    sql_results: Optional[List[Dict[str, Any]]] = None
+    email_delivery_status: Optional[str] = None
+    step_count: int = 0
+    max_steps: int = 0
+# ── State Model (extends OpenEnv State) ────────────────────────────
+class DataOpsState(BaseState):
+    task_id: str = ""
+    task_description: str = ""
+    seed: int = 0
+    max_steps: int = 15
+    done: bool = False
+    cumulative_reward: float = 0.0
+    actions_taken: List[str] = Field(default_factory=list)
+    emails_sent: int = 0
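
The filename rule enforced by `RunScriptPayload` can be exercised in isolation. The sketch below re-implements the check with the standard library only. The helper name `is_safe_script_name` is hypothetical, and the sketch uses `re.fullmatch`, which rejects a trailing newline that a `$`-anchored `re.match` would accept.

```python
import re

# Same character class as the validator: letters, digits, "_" and "-", then ".py".
_SCRIPT_NAME = re.compile(r"[a-zA-Z0-9_\-]+\.py")


def is_safe_script_name(filepath: str) -> bool:
    """Check only the final path component, as the validator does."""
    basename = filepath.rsplit("/", 1)[-1]
    return _SCRIPT_NAME.fullmatch(basename) is not None


if __name__ == "__main__":
    for candidate in ("etl/normalize_batch.py", "run.sh", "fix.py\n", "a b.py"):
        print(repr(candidate), is_safe_script_name(candidate))
```

Only the basename is inspected, so a rule like this says nothing about the directory part of the path. Keeping scripts inside the workspace would have to be enforced elsewhere.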

openenv.yaml ADDED Viewed

	@@ -0,0 +1,36 @@

+spec_version: 1
+name: dataops_env
+version: 1.0.0
+description: Seeded enterprise DataOps benchmark with isolated sessions, deterministic graders, and three escalating remediation tasks.
+type: space
+runtime: fastapi
+app: server.app:app
+port: 7860
+tasks:
+  - id: task_1_easy_anomaly
+    name: Delete Corrupted Transaction Rows
+    difficulty: easy
+    description: Inspect a seeded transaction table and remove only the rows whose amount is NULL, preserving legitimate non-null edge values.
+    benchmark_focus: Careful data cleanup without collateral damage.
+    allowed_actions:
+      - ExecuteSQL
+  - id: task_2_medium_syntax
+    name: Repair Seeded Pipeline Script
+    difficulty: medium
+    description: Repair a seeded ETL normalization script and verify it against visible and hidden seeded batches.
+    benchmark_focus: Code reading, precise repair, and generalization beyond the demo batch.
+    allowed_actions:
+      - ReadFile
+      - WriteFile
+      - RunScript
+  - id: task_3_hard_e2e
+    name: Resolve Revenue Reporting Incident
+    difficulty: hard
+    description: Extract a seeded reporting slice, repair the formatter, and send the exact generated report.
+    benchmark_focus: End-to-end data extraction, file repair, and communication with provenance.
+    allowed_actions:
+      - ExecuteSQL
+      - ReadFile
+      - WriteFile
+      - RunScript
+      - SendEmail
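
The `allowed_actions` lists above determine which tools the inference runner exposes to the model for each task. As a rough illustration, the sketch below copies the manifest values into a plain dict and applies the same action-to-tool mapping and ordering that `inference.py` uses. The names `MANIFEST_ALLOWED_ACTIONS` and `allowed_tools` are hypothetical. The runner itself reads `TASK_METADATA` from `server.task_specs`, which is assumed here to agree with this manifest.

```python
# Action type -> tool name, as in inference.py's _ACTION_TO_TOOL.
ACTION_TO_TOOL = {
    "ExecuteSQL": "execute_sql",
    "ReadFile": "read_file",
    "WriteFile": "write_file",
    "RunScript": "invoke_python",
    "SendEmail": "send_email",
}

# Fixed presentation order used when listing tools.
TOOL_ORDER = ("execute_sql", "read_file", "write_file", "invoke_python", "send_email")

# Hand-copied from the manifest for illustration; not loaded from the YAML file.
MANIFEST_ALLOWED_ACTIONS = {
    "task_1_easy_anomaly": ["ExecuteSQL"],
    "task_2_medium_syntax": ["ReadFile", "WriteFile", "RunScript"],
    "task_3_hard_e2e": ["ExecuteSQL", "ReadFile", "WriteFile", "RunScript", "SendEmail"],
}


def allowed_tools(task_id: str) -> list[str]:
    """Return the tool names for a task in the fixed presentation order."""
    allowed = {ACTION_TO_TOOL[a] for a in MANIFEST_ALLOWED_ACTIONS[task_id]}
    return [tool for tool in TOOL_ORDER if tool in allowed]


if __name__ == "__main__":
    for task_id in MANIFEST_ALLOWED_ACTIONS:
        print(task_id, "->", ", ".join(allowed_tools(task_id)))
```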

pyproject.toml ADDED Viewed

	@@ -0,0 +1,42 @@

+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-dataops_env"
+version = "1.0.0"
+description = "Enterprise data pipeline remediation environment for training AI agents (OpenEnv-compliant)."
+readme = "README.md"
+requires-python = ">=3.12"
+dependencies = [
+    "openenv-core[core]>=0.2.2",
+    "fastapi>=0.115.0",
+    "starlette>=0.46.0,<0.52.0",
+    "uvicorn[standard]>=0.34.0",
+    "pydantic>=2.10.0",
+    "pyyaml>=6.0.2",
+    "openai>=1.60.0",
+    "requests>=2.32.0",
+    "wsproto>=1.3.2",
+    "python-dotenv>=1.0.0",
+    "huggingface-hub>=1.8.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+server = "dataops_env.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["dataops_env", "dataops_env.server"]
+package-dir = { "dataops_env" = ".", "dataops_env.server" = "server" }
+[tool.pytest.ini_options]
+pythonpath = ["."]
+testpaths = ["tests"]
+addopts = "--import-mode=importlib"

server/__init__.py ADDED Viewed

	@@ -0,0 +1,5 @@

+"""DataOps environment server components."""
+from .dataops_env_environment import DataOpsEnvironment
+__all__ = ["DataOpsEnvironment"]

server/__main__.py ADDED Viewed

	@@ -0,0 +1,6 @@

+"""Run the API server from the repo root: ``uv run python -m server``."""
+from server.app import main
+if __name__ == "__main__":
+    main()

server/app.py ADDED Viewed

	@@ -0,0 +1,530 @@

+import asyncio
+import json
+import logging
+import os
+import subprocess
+import sys
+from collections.abc import AsyncIterator
+from contextlib import asynccontextmanager
+from pathlib import Path
+import yaml
+import uvicorn
+from fastapi import (
+    Body,
+    FastAPI,
+    HTTPException,
+    Query,
+    Request,
+    Response,
+    WebSocket,
+    WebSocketDisconnect,
+)
+from fastapi.middleware.cors import CORSMiddleware
+from openenv.core.env_server.http_server import serialize_observation
+from openenv.core.env_server.types import (
+    HealthResponse,
+    HealthStatus,
+    ResetRequest,
+    ResetResponse,
+    StepRequest,
+    StepResponse,
+    WSCloseMessage,
+    WSErrorCode,
+    WSErrorResponse,
+    WSObservationResponse,
+    WSResetMessage,
+    WSStateMessage,
+    WSStateResponse,
+    WSStepMessage,
+)
+from pydantic import ValidationError
+from env_loader import load_env
+from models import DataOpsAction, DataOpsObservation, DataOpsState
+from server.dataops_env_environment import DataOpsEnvironment
+from server.grading import evaluate_task
+from server.session_manager import EnvironmentSessionManager
+from server.task_specs import TASK_IDS, task_manifest_entries
+# Repo root must be on sys.path (e.g. run `uv run python -m server.app` or uvicorn from project root).
+PROJECT_ROOT = Path(__file__).resolve().parents[1]
+SERVER_DIR = Path(__file__).resolve().parent
+logging.basicConfig(level=logging.INFO, format="%(levelname)s | %(name)s | %(message)s")
+logger = logging.getLogger(__name__)
+load_env()
+SESSION_COOKIE_NAME = "dataops_session_id"
+SESSION_HEADER_NAME = "X-Session-ID"
+MAX_HTTP_SESSIONS = int(os.getenv("MAX_HTTP_SESSIONS", "128"))
+HTTP_SESSION_TIMEOUT_S = float(os.getenv("HTTP_SESSION_TIMEOUT_S", "1200"))
+MAX_WS_SESSIONS = max(1, int(os.getenv("MAX_WS_SESSIONS", "64")))
+ADMIN_API_KEY = os.getenv("ADMIN_API_KEY", "").strip()
+COOKIE_SECURE = os.getenv("COOKIE_SECURE", "").lower() in {"1", "true", "yes"}
+def _public_grader_details_enabled() -> bool:
+    """Read at request time so env / tests can control visibility without stale import-time state."""
+    v = os.getenv("PUBLIC_GRADER_DETAILS", "").strip().lower()
+    return v in {"1", "true", "yes"}
+_ws_active_sessions = 0
+_ws_session_lock = asyncio.Lock()
+session_manager = EnvironmentSessionManager(
+    max_sessions=MAX_HTTP_SESSIONS,
+    session_timeout_s=HTTP_SESSION_TIMEOUT_S,
+)
+@asynccontextmanager
+async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
+    logger.info("DataOpsEnv starting.")
+    yield
+    session_manager.close_all()
+    logger.info("DataOpsEnv shutting down.")
+app = FastAPI(
+    title="DataOpsEnv",
+    description="Enterprise data pipeline remediation environment for training AI agents (OpenEnv-compliant).",
+    version="1.0.0",
+    lifespan=lifespan,
+)
+def _cors_allow_origins() -> list[str]:
+    configured = os.getenv("CORS_ALLOW_ORIGINS", "").strip()
+    if not configured:
+        return []
+    if configured == "*":
+        return ["*"]
+    return [item.strip() for item in configured.split(",") if item.strip()]
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=_cors_allow_origins(),
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+def _load_manifest() -> dict:
+    yaml_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "openenv.yaml")
+    try:
+        with open(yaml_path, encoding="utf-8") as f:
+            return yaml.safe_load(f) or {}
+    except FileNotFoundError:
+        return {}
+def _load_yaml_tasks() -> list[dict]:
+    manifest = _load_manifest()
+    tasks = manifest.get("tasks")
+    if isinstance(tasks, list) and tasks:
+        manifest_ids = [str(item.get("id", "")) for item in tasks]
+        if manifest_ids == TASK_IDS:
+            return tasks
+    return task_manifest_entries()
+def _wrap_obs(obs: DataOpsObservation) -> dict:
+    """Serialise an observation to the standard OpenEnv response dict."""
+    return obs.model_dump()
+def _get_session_id(request: Request) -> str | None:
+    header_value = request.headers.get(SESSION_HEADER_NAME)
+    if header_value:
+        return header_value.strip() or None
+    cookie_value = request.cookies.get(SESSION_COOKIE_NAME)
+    if cookie_value:
+        return cookie_value.strip() or None
+    return None
+def _attach_session(response: Response, session_id: str) -> None:
+    response.set_cookie(
+        key=SESSION_COOKIE_NAME,
+        value=session_id,
+        httponly=True,
+        samesite="lax",
+        secure=COOKIE_SECURE,
+        max_age=int(HTTP_SESSION_TIMEOUT_S),
+    )
+    response.headers[SESSION_HEADER_NAME] = session_id
+def _require_active_env(request: Request) -> tuple[str, DataOpsEnvironment]:
+    session_id, env = session_manager.get_session(_get_session_id(request))
+    if session_id is None or env is None:
+        raise HTTPException(400, "No active episode. Call /reset first.")
+    return session_id, env
+def _ws_error_payload(message: str, code: WSErrorCode) -> str:
+    return WSErrorResponse(
+        data={
+            "message": message,
+            "code": code.value,
+        }
+    ).model_dump_json()
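+def _require_admin(request: Request) -> None:
+    """Enforce `X-Admin-Key` when `ADMIN_API_KEY` is configured; when it is unset, the check is skipped."""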
+    if not ADMIN_API_KEY:
+        return
+    if request.headers.get("X-Admin-Key", "") != ADMIN_API_KEY:
+        raise HTTPException(403, "Missing or invalid admin key.")
+def _request_is_admin(request: Request) -> bool:
+    return bool(ADMIN_API_KEY) and request.headers.get("X-Admin-Key", "") == ADMIN_API_KEY
+def _format_grader_response(grade: dict, request: Request) -> dict:
+    if _public_grader_details_enabled() or _request_is_admin(request):
+        return grade
+    return {"task_id": grade.get("task_id"), "score": grade.get("score")}
+async def _try_acquire_ws_slot() -> bool:
+    global _ws_active_sessions
+    async with _ws_session_lock:
+        if _ws_active_sessions >= MAX_WS_SESSIONS:
+            return False
+        _ws_active_sessions += 1
+        return True
+async def _release_ws_slot() -> None:
+    global _ws_active_sessions
+    async with _ws_session_lock:
+        _ws_active_sessions = max(0, _ws_active_sessions - 1)
+@app.get("/health", response_model=HealthResponse)
+def health_endpoint():
+    return HealthResponse(status=HealthStatus.HEALTHY)
+@app.get("/metadata")
+def metadata_endpoint():
+    manifest = _load_manifest()
+    return {
+        "name": manifest.get("name", "dataops_env"),
+        "description": manifest.get(
+            "description",
+            (
+                "Enterprise data pipeline remediation environment. "
+                "Agents debug data streams, fix scripts, and send email reports."
+            ),
+        ),
+        "version": manifest.get("version", "1.0.0"),
+        "task_count": len(_load_yaml_tasks()),
+    }
+@app.get("/schema")
+def schema_endpoint():
+    return {
+        "action": DataOpsAction.model_json_schema(),
+        "observation": DataOpsObservation.model_json_schema(),
+        "state": DataOpsState.model_json_schema(),
+    }
+@app.post("/mcp")
+def mcp_endpoint(body: dict = Body(default_factory=dict)):
+    method = body.get("method", "")
+    req_id = body.get("id")
+    if method == "tools/list":
+        tools = [
+            {"name": atype, "description": f"Execute a {atype} action."}
+            for atype in [
+                "ExecuteSQL",
+                "ReadFile",
+                "WriteFile",
+                "RunScript",
+                "SendEmail",
+            ]
+        ]
+        return {"jsonrpc": "2.0", "id": req_id, "result": {"tools": tools}}
+    return {
+        "jsonrpc": "2.0",
+        "id": req_id,
+        "error": {"code": -32601, "message": "Method not found"},
+    }
+@app.websocket("/ws")
+async def websocket_endpoint(websocket: WebSocket):
+    await websocket.accept()
+    acquired_slot = await _try_acquire_ws_slot()
+    if not acquired_slot:
+        await websocket.send_text(
+            _ws_error_payload(
+                "WebSocket session capacity reached.",
+                WSErrorCode.CAPACITY_REACHED,
+            )
+        )
+        # 1013 is the WebSocket "Try Again Later" close code.
+        await websocket.close(code=1013)
+        return
+    env = DataOpsEnvironment()
+    try:
+        while True:
+            raw_message = await websocket.receive_text()
+            try:
+                message_dict = json.loads(raw_message)
+            except json.JSONDecodeError:
+                await websocket.send_text(
+                    _ws_error_payload("Invalid JSON payload.", WSErrorCode.INVALID_JSON)
+                )
+                continue
+            message_type = message_dict.get("type", "")
+            try:
+                if message_type == "reset":
+                    message = WSResetMessage(**message_dict)
+                    observation = env.reset(**message.data)
+                    response = WSObservationResponse(
+                        data=serialize_observation(observation)
+                    )
+                elif message_type == "step":
+                    message = WSStepMessage(**message_dict)
+                    action = DataOpsAction(**message.data)
+                    observation = env.step(action)
+                    response = WSObservationResponse(
+                        data=serialize_observation(observation)
+                    )
+                elif message_type == "state":
+                    WSStateMessage(**message_dict)
+                    response = WSStateResponse(data=env.state.model_dump())
+                elif message_type == "close":
+                    WSCloseMessage(**message_dict)
+                    break
+                else:
+                    await websocket.send_text(
+                        _ws_error_payload(
+                            f"Unknown message type: {message_type}",
+                            WSErrorCode.UNKNOWN_TYPE,
+                        )
+                    )
+                    continue
+                await websocket.send_text(response.model_dump_json())
+            except ValidationError:
+                await websocket.send_text(
+                    _ws_error_payload(
+                        "Validation error while handling the WebSocket message.",
+                        WSErrorCode.VALIDATION_ERROR,
+                    )
+                )
+            except Exception:
+                logger.exception("WebSocket execution error")
+                await websocket.send_text(
+                    _ws_error_payload(
+                        "Execution error while handling the WebSocket message.",
+                        WSErrorCode.EXECUTION_ERROR,
+                    )
+                )
+    except WebSocketDisconnect:
+        logger.debug("WebSocket client disconnected.")
+    finally:
+        env.close()
+        await _release_ws_slot()
+@app.post("/reset", response_model=ResetResponse)
+def reset_endpoint(
+    request: Request,
+    response: Response,
+    task_id: str = Query("task_1_easy_anomaly", description="Task to initialise."),
+    body: ResetRequest = Body(default_factory=ResetRequest),
+):
+    if task_id not in TASK_IDS:
+        raise HTTPException(400, f"Invalid task_id. Choose from: {TASK_IDS}")
+    session_id = _get_session_id(request)
+    resolved_session_id, _env, obs = session_manager.reset_session(
+        task_id=task_id,
+        seed=body.seed,
+        episode_id=body.episode_id,
+        session_id=session_id,
+    )
+    _attach_session(response, resolved_session_id)
+    return ResetResponse(observation=_wrap_obs(obs), reward=obs.reward, done=obs.done)
+@app.post("/step", response_model=StepResponse)
+def step_endpoint(request: Request, response: Response, body: StepRequest):
+    try:
+        action = DataOpsAction(**body.action)
+    except ValidationError as e:
+        raise HTTPException(422, f"Invalid action: {e}") from e
+    session_id, env = _require_active_env(request)
+    _attach_session(response, session_id)
+    obs = env.step(action, timeout_s=body.timeout_s)
+    return StepResponse(observation=_wrap_obs(obs), reward=obs.reward, done=obs.done)
+@app.get("/state", response_model=DataOpsState)
+def state_endpoint(request: Request, response: Response):
+    session_id, env = _require_active_env(request)
+    _attach_session(response, session_id)
+    return env.state
+@app.get("/tasks")
+def tasks_endpoint():
+    return {
+        "tasks": _load_yaml_tasks(),
+        "action_schema": DataOpsAction.model_json_schema(),
+        "observation_schema": DataOpsObservation.model_json_schema(),
+        "state_schema": DataOpsState.model_json_schema(),
+    }
+@app.get("/grader")
+def grader_current_endpoint(request: Request, response: Response):
+    """Grade the current episode (uses active task_id from state)."""
+    session_id, env = _require_active_env(request)
+    _attach_session(response, session_id)
+    task_id = env.state.task_id
+    if not task_id:
+        raise HTTPException(400, "No active episode. Call /reset first.")
+    return _format_grader_response(evaluate_task(task_id, env), request)
+@app.get("/grader/{task_id}")
+def grader_endpoint(task_id: str, request: Request, response: Response):
+    if task_id not in TASK_IDS:
+        raise HTTPException(404, f"Unknown task: {task_id}")
+    session_id, env = _require_active_env(request)
+    _attach_session(response, session_id)
+    active_task_id = env.state.task_id
+    if active_task_id and active_task_id != task_id:
+        raise HTTPException(
+            400,
+            f"Active episode belongs to task '{active_task_id}'. Reset the requested task first.",
+        )
+    return _format_grader_response(evaluate_task(task_id, env), request)
+@app.post("/baseline")
+def baseline_endpoint(request: Request, body: dict = Body(default_factory=dict)):
+    """Run inference.py (OpenAI tool-calling agent) against all tasks; same entrypoint as local baseline."""
+    _require_admin(request)
+    if not (
+        os.environ.get("API_KEY", "").strip()
+        or os.environ.get("HF_TOKEN", "").strip()
+    ):
+        raise HTTPException(
+            503,
+            "API_KEY or HF_TOKEN must be set on the server process to run POST /baseline.",
+        )
+    script_path = PROJECT_ROOT / "inference.py"
+    if not script_path.is_file():
+        raise HTTPException(500, "inference.py missing from project root.")
+    port = int(os.getenv("PORT", "7860"))
+    timeout_s = HTTP_SESSION_TIMEOUT_S
+    env = {
+        **os.environ,
+        "ENV_BASE_URL": os.getenv("ENV_BASE_URL", f"http://127.0.0.1:{port}"),
+    }
+    command = [sys.executable, str(script_path), "--json-scores"]
+    if body.get("seed") is not None:
+        command.extend(["--seed", str(int(body["seed"]))])
+    if body.get("max_turns") is not None:
+        command.extend(["--max-turns", str(int(body["max_turns"]))])
+    # Task IDs that are not in TASK_IDS are skipped silently rather than rejected.
+    for task_id in body.get("task_ids", []) or []:
+        if task_id in TASK_IDS:
+            command.extend(["--task", str(task_id)])
+    try:
+        proc = subprocess.run(
+            command,
+            cwd=str(PROJECT_ROOT),
+            capture_output=True,
+            text=True,
+            timeout=timeout_s,
+            env=env,
+        )
+    except subprocess.TimeoutExpired:
+        raise HTTPException(
+            504, f"Baseline exceeded HTTP_SESSION_TIMEOUT_S ({timeout_s}s)."
+        ) from None
+    if proc.returncode != 0:
+        tail = (proc.stderr or proc.stdout or "")[-6000:]
+        logger.error("inference.py failed rc=%s output_tail=%s", proc.returncode, tail[:500])
+        raise HTTPException(
+            502,
+            {"message": "inference.py exited with an error.", "detail": tail},
+        )
+    lines = [ln.strip() for ln in (proc.stdout or "").splitlines() if ln.strip()]
+    parsed = None
+    # Scan from the end: the scores JSON is taken from the last parseable stdout line.
+    for line in reversed(lines):
+        try:
+            parsed = json.loads(line)
+            break
+        except json.JSONDecodeError:
+            continue
+    if not isinstance(parsed, dict) or "scores" not in parsed:
+        raise HTTPException(
+            502,
+            {
+                "message": "Could not parse JSON scores from inference.py stdout.",
+                "stdout_tail": "\n".join(lines[-5:]),
+            },
+        )
+    return {
+        "message": "Model baseline completed via inference.py.",
+        "stdout": proc.stdout,
+        "stderr": proc.stderr,
+        "scores": parsed["scores"],
+        "grades": parsed.get("grades"),
+        "average": parsed.get("average"),
+        "model": parsed.get("model"),
+        "metadata": parsed.get("metadata"),
+    }
+def main():
+    """Entry point for `dataops-env` script and `openenv serve`."""
+    host = os.getenv("HOST", "0.0.0.0")
+    port = int(os.getenv("PORT", "7860"))
+    reload = os.getenv("DEBUG", "").lower() in ("1", "true")
+    cwd = Path.cwd().resolve()
+    app_target = "app:app" if cwd == SERVER_DIR else "server.app:app"
+    app_dir = str(SERVER_DIR if app_target == "app:app" else PROJECT_ROOT)
+    uvicorn.run(
+        app_target,
+        host=host,
+        port=port,
+        reload=reload,
+        reload_dirs=[str(PROJECT_ROOT)] if reload else None,
+        ws="wsproto",
+        app_dir=app_dir,
+    )
+if __name__ == "__main__":
+    main()
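
The `_try_acquire_ws_slot` / `_release_ws_slot` pair in `server/app.py` above caps concurrent WebSocket sessions with a counter guarded by an `asyncio.Lock`. The following is a minimal standalone sketch of that pattern, not the file's actual code: the `SlotLimiter` class, the `MAX_SLOTS` value, and `demo` are hypothetical names introduced only for illustration.

```python
import asyncio

MAX_SLOTS = 2  # hypothetical cap; the server uses its own MAX_WS_SESSIONS


class SlotLimiter:
    """Counter guarded by an asyncio.Lock (illustrative, not the server's class)."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._active = 0
        self._lock = asyncio.Lock()

    async def try_acquire(self) -> bool:
        # Check and increment under the lock so two tasks cannot both take the last slot.
        async with self._lock:
            if self._active >= self._limit:
                return False
            self._active += 1
            return True

    async def release(self) -> None:
        # Clamp at zero so an unmatched release cannot drive the counter negative.
        async with self._lock:
            self._active = max(0, self._active - 1)


async def demo() -> list[bool]:
    limiter = SlotLimiter(MAX_SLOTS)
    # Three attempts against two slots, then one more after a release.
    results = [await limiter.try_acquire() for _ in range(3)]
    await limiter.release()
    results.append(await limiter.try_acquire())
    return results


RESULTS = asyncio.run(demo())
```

In the handler above, the release happens in a `finally` block, so a slot is returned even when the client disconnects mid-session.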

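The environment file that follows bounds SQL execution time with `sqlite3`'s progress handler: a callback that returns a non-zero value once a deadline has passed, which makes SQLite abort the statement with an "interrupted" error. A minimal sketch of that technique against an in-memory database is shown here. `run_with_deadline`, `ENDLESS_QUERY`, and the timeout values are assumptions made for illustration and do not appear in the commit.

```python
import sqlite3
import time

# Hypothetical never-ending query: an unbounded recursive CTE.
ENDLESS_QUERY = (
    "WITH RECURSIVE n(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM n) "
    "SELECT count(*) FROM n"
)


def run_with_deadline(conn: sqlite3.Connection, query: str, timeout_s: float):
    """Run a query, aborting it once timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    # SQLite calls the handler every 1,000 VM instructions; non-zero aborts the statement.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() >= deadline else 0,
        1_000,
    )
    try:
        return "ok", conn.execute(query).fetchall()
    except sqlite3.OperationalError as exc:
        if "interrupted" in str(exc).lower():
            return "timeout", []
        raise
    finally:
        # Clear the handler so later statements on this connection are unaffected.
        conn.set_progress_handler(None, 1)


conn = sqlite3.connect(":memory:")
FAST_RESULT = run_with_deadline(conn, "SELECT 1", 1.0)
SLOW_STATUS, _ = run_with_deadline(conn, ENDLESS_QUERY, 0.2)
conn.close()
```

The handler only runs between VM instructions, so the deadline is approximate rather than exact.
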
server/dataops_env_environment.py ADDED

	@@ -0,0 +1,839 @@

+from __future__ import annotations
+import json
+import logging
+import os
+import re
+import shutil
+import sqlite3
+import textwrap
+import threading
+import time
+import uuid
+from copy import deepcopy
+from typing import Any, Optional
+from openenv.core.env_server import Environment
+from pydantic import ValidationError
+from data.init_db import WORKSPACE_ROOT, setup_workspace
+from models import (
+    PAYLOAD_MODELS,
+    DataOpsAction,
+    DataOpsObservation,
+    DataOpsState,
+    ExecuteSQLPayload,
+    ReadFilePayload,
+    RunScriptPayload,
+    SendEmailPayload,
+    WriteFilePayload,
+)
+from .safe_exec import PythonRunResult, run_python_code, run_python_script
+from .task_specs import (
+    TASK_ALLOWED_READ_FILES,
+    TASK_ALLOWED_RUN_FILES,
+    TASK_ALLOWED_WRITE_FILES,
+    TASK_EMAIL_ENABLED,
+    TASK_IDS,
+    TASK_SQL_POLICIES,
+    TaskScenarioBundle,
+    build_task_scenario,
+    normalize_task_3_rows,
+    report_matches_expected,
+    task_3_data_matches_expected,
+)
+logger = logging.getLogger(__name__)
+_SQL_COMMENT_RE = re.compile(r"(--[^\n]*|/\*.*?\*/)", re.DOTALL)
+_SQL_STRING_RE = re.compile(r"'(?:''|[^'])*'|\"(?:\"\"|[^\"])*\"")
+_SQL_TABLE_REF_RE = re.compile(
+    r"\b(?:from|join|update|into|delete\s+from)\s+([a-zA-Z_][a-zA-Z0-9_]*)",
+    re.IGNORECASE,
+)
+_SQL_CTE_NAME_RE = re.compile(
+    r"(?:\bwith\b|,)\s*([a-zA-Z_][a-zA-Z0-9_]*)\s+as\s*\(",
+    re.IGNORECASE,
+)
+MAX_STEPS = 15
+MAX_SQL_ROWS = 500
+MAX_FILE_SIZE = 1_000_000
+DEFAULT_ACTION_TIMEOUT_S = 10.0
+MAX_ACTION_TIMEOUT_S = 30.0
+MAX_STDOUT_CHARS = 50_000
+MAX_STDERR_CHARS = 10_000
+PENALTY_FAILURE = -0.03
+PENALTY_DESTRUCTIVE = -0.20
+PENALTY_REPEAT = -0.08
+PENALTY_DISALLOWED_TOOL_UNIT = -0.04
+# Keep milestone bonuses small so the terminal grader remains the dominant signal.
+REWARD_EVENT_VALUES = {
+    "t1_inspected_corruption": 0.05,
+    "t1_exact_cleanup": 0.04,
+    "t2_read_source": 0.04,
+    "t2_candidate_compiles": 0.02,
+    "t2_verified_fix": 0.03,
+    "t3_nonempty_select": 0.03,
+    "t3_matching_sql": 0.03,
+    "t3_read_formatter_source": 0.02,
+    "t3_report_data_verified": 0.03,
+    "t3_formatter_compiles": 0.02,
+    "t3_report_generated": 0.03,
+    "t3_email_verified": 0.02,
+}
+PENALTY_EVENTS = {
+    "destructive_sql": PENALTY_DESTRUCTIVE,
+    "multiple_emails": -0.08,
+    "t2_run_before_read": -0.05,
+    "t2_write_before_read": -0.05,
+}
+class DataOpsEnvironment(Environment[DataOpsAction, DataOpsObservation, DataOpsState]):
+    """Enterprise data pipeline remediation environment (OpenEnv-compliant)."""
+    SUPPORTS_CONCURRENT_SESSIONS = True
+    def __init__(self) -> None:
+        self._workspace_dir = os.path.join(WORKSPACE_ROOT, "sessions", uuid.uuid4().hex)
+        self._db_path = os.path.join(self._workspace_dir, "mock_warehouse.db")
+        self._state = DataOpsState()
+        self._scenario: TaskScenarioBundle = build_task_scenario(
+            "task_1_easy_anomaly", seed=0
+        )
+        self._evidence: dict[str, Any] = {}
+        self._pending_events: list[str] = []
+        self.email_outbox: list[dict[str, str]] = []
+        self._last_action_key: Optional[str] = None
+        self._milestones: set[str] = set()
+        self._grader_score = 0.0
+        self._disallowed_tool_attempts = 0
+        self._lock = threading.Lock()
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> DataOpsObservation:
+        task_id: str = kwargs.get("task_id", "task_1_easy_anomaly")
+        if task_id not in TASK_IDS:
+            raise ValueError(f"Unknown task_id: {task_id}")
+        with self._lock:
+            self._scenario = build_task_scenario(task_id, seed=seed)
+            self._db_path = setup_workspace(
+                self._workspace_dir,
+                scenario=self._scenario,
+            )
+            self.email_outbox.clear()
+            self._last_action_key = None
+            self._milestones.clear()
+            self._pending_events = []
+            self._disallowed_tool_attempts = 0
+            self._evidence = self._initial_evidence()
+            self._state = DataOpsState(
+                episode_id=episode_id or str(uuid.uuid4()),
+                step_count=0,
+                task_id=task_id,
+                task_description=self._scenario.description,
+                max_steps=MAX_STEPS,
+                seed=self._scenario.seed,
+            )
+            self._grader_score = self._current_task_score()
+        return DataOpsObservation(
+            status="success",
+            done=False,
+            reward=0.0,
+            message=f"Environment reset. Task: {self._scenario.description}",
+            step_count=0,
+            max_steps=MAX_STEPS,
+        )
+    def step(
+        self, action: DataOpsAction, timeout_s: Optional[float] = None, **kwargs: Any
+    ) -> DataOpsObservation:
+        del kwargs
+        with self._lock:
+            return self._step_locked(action, timeout_s)
+    @property
+    def state(self) -> DataOpsState:
+        return self._state.model_copy()
+    @property
+    def scenario(self) -> TaskScenarioBundle:
+        return self._scenario
+    @property
+    def evidence(self) -> dict[str, Any]:
+        return deepcopy(self._evidence)
+    @property
+    def workspace_dir(self) -> str:
+        return self._workspace_dir
+    @property
+    def db_path(self) -> str:
+        return self._db_path
+    def close(self) -> None:
+        if os.path.isdir(self._workspace_dir):
+            shutil.rmtree(self._workspace_dir, ignore_errors=True)
+    def _step_locked(
+        self, action: DataOpsAction, timeout_s: Optional[float]
+    ) -> DataOpsObservation:
+        if self._state.done:
+            return self._obs(
+                "error", "Episode is over. Call /reset to start a new one.", done=True
+            )
+        model_cls = PAYLOAD_MODELS.get(action.action_type)
+        if not model_cls:
+            return self._obs("error", f"Unknown action_type: {action.action_type}")
+        try:
+            payload = model_cls(**action.payload)
+        except ValidationError as exc:
+            return self._obs(
+                "error",
+                f"Invalid payload: {exc.error_count()} validation error(s).",
+            )
+        self._pending_events = []
+        obs = self._dispatch(action.action_type, payload, timeout_s)
+        reward = self._compute_reward(action, obs)
+        self._state.step_count += 1
+        self._state.cumulative_reward += reward
+        self._state.actions_taken.append(action.action_type)
+        self._state.emails_sent = len(self.email_outbox)
+        done = self._state.step_count >= MAX_STEPS or self._task_completed()
+        self._state.done = done
+        obs.reward = round(reward, 4)
+        obs.done = done
+        obs.step_count = self._state.step_count
+        obs.max_steps = MAX_STEPS
+        return obs
+    def _dispatch(
+        self, action_type: str, payload: Any, timeout_s: Optional[float]
+    ) -> DataOpsObservation:
+        handlers = {
+            "ExecuteSQL": self._handle_sql,
+            "ReadFile": self._handle_read,
+            "WriteFile": self._handle_write,
+            "RunScript": self._handle_run,
+            "SendEmail": self._handle_email,
+        }
+        return handlers[action_type](payload, timeout_s)
+    def _handle_sql(
+        self, payload: ExecuteSQLPayload, timeout_s: Optional[float]
+    ) -> DataOpsObservation:
+        query = payload.query.strip()
+        # Strip trailing semicolons (and any whitespace between them) before validation.
+        while True:
+            q = query.rstrip()
+            if not q.endswith(";"):
+                break
+            query = q[:-1].rstrip()
+        statement_type = self._statement_type(query)
+        validation_error = self._validate_sql_action(query, statement_type)
+        if validation_error:
+            return self._obs("error", validation_error)
+        timeout = self._resolve_timeout(timeout_s)
+        deadline = time.monotonic() + timeout
+        try:
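+            # Note: sqlite3's connection context manager commits or rolls back
+            # the transaction; it does not close the connection.
+            with sqlite3.connect(self._db_path) as conn:
+                # The handler runs every 1,000 VM instructions; a non-zero return
+                # aborts the statement with an "interrupted" error (handled below).
+                conn.set_progress_handler(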
+                    lambda: 1 if time.monotonic() >= deadline else 0,
+                    1_000,
+                )
+                conn.row_factory = sqlite3.Row
+                cursor = conn.cursor()
+                cursor.execute(query)
+                if statement_type in {"SELECT", "WITH"}:
+                    cols = [c[0] for c in cursor.description or []]
+                    # Fetch one extra row so an oversized result can be detected.
+                    rows_raw = cursor.fetchmany(MAX_SQL_ROWS + 1)
+                    if len(rows_raw) > MAX_SQL_ROWS:
+                        return self._obs(
+                            "error",
+                            f"Result exceeds {MAX_SQL_ROWS} rows. Add a LIMIT clause.",
+                        )
+                    rows = [dict(zip(cols, row)) for row in rows_raw]
+                    self._record_sql_select(query, rows)
+                    return DataOpsObservation(
+                        status="success",
+                        sql_results=rows,
+                        message=f"Query returned {len(rows)} rows.",
+                    )
+                conn.commit()
+                self._record_sql_mutation(query, cursor.rowcount)
+                return self._obs("success", f"Rows affected: {cursor.rowcount}")
+        except sqlite3.Error as exc:
+            if "interrupted" in str(exc).lower():
+                return self._obs(
+                    "error", f"SQL execution timed out ({timeout:.1f}s limit)."
+                )
+            logger.warning("SQL error: %s", exc)
+            msg = "SQL execution error. Check your query syntax."
+            if self._state.task_id == "task_3_hard_e2e" and re.search(
+                r"\bdate\b", query, re.IGNORECASE
+            ):
+                if "report_date" not in query.lower():
+                    msg += " Hint: table `daily_reports` uses column `report_date` for the calendar date."
+            return self._obs("error", msg)
+    def _handle_read(
+        self, payload: ReadFilePayload, timeout_s: Optional[float]
+    ) -> DataOpsObservation:
+        del timeout_s
+        basename = os.path.basename(payload.filepath)
+        if not self._is_allowed_file(TASK_ALLOWED_READ_FILES, basename):
+            return self._obs(
+                "error", f"Reading {basename} is not allowed for this task."
+            )
+        safe_path = self._resolve_workspace_path(basename)
+        if safe_path is None:
+            return self._obs("error", "Resolved file path escapes the workspace.")
+        if not os.path.isfile(safe_path):
+            return self._obs("error", f"File not found: {basename}")
+        if os.path.getsize(safe_path) > MAX_FILE_SIZE:
+            return self._obs("error", "File too large to read.")
+        try:
+            with open(safe_path, encoding="utf-8") as f:
+                content = f.read(MAX_FILE_SIZE)
+        except OSError:
+            return self._obs("error", "Failed to read file.")
+        if (
+            self._state.task_id == "task_2_medium_syntax"
+            and basename == "broken_pipeline.py"
+        ):
+            self._evidence["task_2"]["read_source"] = True
+            self._record_event("t2_read_source")
+        if self._state.task_id == "task_3_hard_e2e" and basename == "format_report.py":
+            self._evidence["task_3"]["read_formatter_source"] = True
+            self._record_event("t3_read_formatter_source")
+        return DataOpsObservation(
+            status="success",
+            stdout=content,
+            message=f"Read {len(content)} chars from {basename}",
+        )
+    def _handle_write(
+        self, payload: WriteFilePayload, timeout_s: Optional[float]
+    ) -> DataOpsObservation:
+        del timeout_s
+        basename = os.path.basename(payload.filepath)
+        if not self._is_allowed_file(TASK_ALLOWED_WRITE_FILES, basename):
+            return self._obs(
+                "error", f"Writing {basename} is not allowed for this task."
+            )
+        if (
+            self._state.task_id == "task_2_medium_syntax"
+            and basename == "broken_pipeline.py"
+        ):
+            if not self._evidence["task_2"]["read_source"]:
+                self._pending_events.append("t2_write_before_read")
+        safe_path = self._resolve_workspace_path(basename)
+        if safe_path is None:
+            return self._obs("error", "Resolved file path escapes the workspace.")
+        try:
+            with open(safe_path, "w", encoding="utf-8") as f:
+                f.write(payload.content)
+        except OSError:
+            return self._obs("error", "Failed to write file.")
+        self._record_write_evidence(basename, payload.content)
+        return self._obs("success", f"Wrote {len(payload.content)} chars to {basename}")
+    def _handle_run(
+        self, payload: RunScriptPayload, timeout_s: Optional[float]
+    ) -> DataOpsObservation:
+        basename = os.path.basename(payload.filepath)
+        if not self._is_allowed_file(TASK_ALLOWED_RUN_FILES, basename):
+            return self._obs(
+                "error", f"Executing {basename} is not allowed for this task."
+            )
+        script_path = self._resolve_workspace_path(basename)
+        if script_path is None:
+            return self._obs("error", "Resolved script path escapes the workspace.")
+        if not os.path.isfile(script_path):
+            return self._obs("error", f"Script not found: {basename}")
+        if (
+            self._state.task_id == "task_2_medium_syntax"
+            and basename == "broken_pipeline.py"
+        ):
+            if not self._evidence["task_2"]["read_source"]:
+                self._pending_events.append("t2_run_before_read")
+        timeout = self._resolve_timeout(timeout_s)
+        try:
+            result = run_python_script(
+                basename,
+                cwd=self._workspace_dir,
+                args=list(payload.args),
+                timeout_s=timeout,
+                stdout_limit=MAX_STDOUT_CHARS,
+                stderr_limit=MAX_STDERR_CHARS,
+            )
+        except OSError:
+            return self._obs("error", "Failed to execute script.")
+        if result.timed_out:
+            return self._obs("error", f"Script timed out ({timeout:.1f}s limit).")
+        self._record_run_evidence(basename, payload.args, result)
+        status = "success" if result.returncode == 0 else "error"
+        return DataOpsObservation(
+            status=status,
+            stdout=(result.stdout or "")[:MAX_STDOUT_CHARS],
+            stderr=(result.stderr or "")[:MAX_STDERR_CHARS],
+            message=f"Exit code: {result.returncode}",
+        )
+    def _handle_email(
+        self, payload: SendEmailPayload, timeout_s: Optional[float]
+    ) -> DataOpsObservation:
+        del timeout_s
+        if self._state.task_id not in TASK_EMAIL_ENABLED:
+            self._disallowed_tool_attempts += 1
+            self._pending_events.append("disallowed_tool")
+            return self._obs(
+                "error",
+                "Email is not available for this task. Use read_file, write_file, and invoke_python only.",
+            )
+        email = {
+            "to_email": payload.to_email,
+            "subject": payload.subject,
+            "body": payload.body,
+        }
+        self.email_outbox.append(email)
+        self._record_email_evidence(email)
+        return DataOpsObservation(
+            status="success",
+            email_delivery_status=f"Queued for {payload.to_email}",
+            message=f"Email queued for delivery to {payload.to_email}",
+        )
+    def _compute_reward(self, action: DataOpsAction, obs: DataOpsObservation) -> float:
+        current_score = self._current_task_score()
+        reward = current_score - self._grader_score
+        self._grader_score = current_score
+        if obs.status != "success":
+            reward += PENALTY_FAILURE
+        action_key = (
+            f"{action.action_type}:"
+            f"{json.dumps(action.payload, sort_keys=True, ensure_ascii=True)}"
+        )
+        if action_key == self._last_action_key:
+            reward += PENALTY_REPEAT
+        self._last_action_key = action_key
+        for event in self._pending_events:
+            if event == "disallowed_tool":
+                reward += PENALTY_DISALLOWED_TOOL_UNIT * min(
+                    self._disallowed_tool_attempts, 12
+                )
+                continue
+            if event in PENALTY_EVENTS:
+                reward += PENALTY_EVENTS[event]
+                continue
+            reward += self._award_event(event)
+        return reward
+    def _award_event(self, event: str) -> float:
+        if event in self._milestones:
+            return 0.0
+        self._milestones.add(event)
+        return REWARD_EVENT_VALUES.get(event, 0.0)
+    def _initial_evidence(self) -> dict[str, Any]:
+        return {
+            "task_1": {
+                "inspected_corrupted_rows": False,
+                "exact_cleanup": False,
+                "destructive_sql_attempted": False,
+            },
+            "task_2": {
+                "read_source": False,
+                "candidate_compiles": False,
+                "verified_fix": False,
+            },
+            "task_3": {
+                "matching_sql_executed": False,
+                "last_matching_sql_rows": [],
+                "read_formatter_source": False,
+                "report_data_matches_sql": False,
+                "formatter_compiles": False,
+                "format_output_matches_expected": False,
+                "last_formatter_output": "",
+                "email_matches_formatter_output": False,
+                "single_email_sent": True,
+            },
+        }
+    def _record_sql_select(self, query: str, rows: list[dict[str, Any]]) -> None:
+        if self._scenario.task_1 and self._state.task_id == "task_1_easy_anomaly":
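+            row_ids: set[int] = set()
+            for row in rows:
+                try:
+                    row_ids.add(int(row["id"]))
+                except (KeyError, TypeError, ValueError):
+                    # Skip rows with no usable integer id instead of raising.
+                    continue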
+            corrupted = set(self._scenario.task_1.corrupted_row_ids)
+            if row_ids & corrupted:
+                self._evidence["task_1"]["inspected_corrupted_rows"] = True
+                self._record_event("t1_inspected_corruption")
+        if self._scenario.task_3 and self._state.task_id == "task_3_hard_e2e":
+            normalised_rows = normalize_task_3_rows(rows, require_headcount=True)
+            expected_rows = list(self._scenario.task_3.expected_rows)
+            if task_3_data_matches_expected(
+                normalised_rows,
+                expected_rows,
+                require_headcount=True,
+            ):
+                self._evidence["task_3"]["matching_sql_executed"] = True
+                self._evidence["task_3"]["last_matching_sql_rows"] = normalised_rows
+                self._record_event("t3_matching_sql")
+            elif rows:
+                self._record_event("t3_nonempty_select")
+    def _record_sql_mutation(self, query: str, rowcount: int) -> None:
+        del rowcount
+        if self._scenario.task_1 and self._state.task_id == "task_1_easy_anomaly":
+            exact_rows = self._current_transactions_rows()
+            expected_rows = list(self._scenario.task_1.expected_rows)
+            expected_by_id = {row["id"]: row for row in expected_rows}
+            actual_by_id = {row["id"]: row for row in exact_rows}
+            valid_rows_lost = any(
+                row_id not in actual_by_id for row_id in expected_by_id
+            )
+            valid_rows_changed = any(
+                actual_by_id[row_id] != expected_row
+                for row_id, expected_row in expected_by_id.items()
+                if row_id in actual_by_id
+            )
+            if exact_rows == expected_rows:
+                self._evidence["task_1"]["exact_cleanup"] = True
+                self._record_event("t1_exact_cleanup")
+            elif valid_rows_lost or valid_rows_changed:
+                self._evidence["task_1"]["destructive_sql_attempted"] = True
+                self._pending_events.append("destructive_sql")
+    def _record_write_evidence(self, basename: str, content: str) -> None:
+        if (
+            self._state.task_id == "task_2_medium_syntax"
+            and basename == "broken_pipeline.py"
+        ):
+            compiles = self._script_compiles(content, basename)
+            self._evidence["task_2"]["candidate_compiles"] = compiles
+            if compiles:
+                self._record_event("t2_candidate_compiles")
+            return
+        if not self._scenario.task_3 or self._state.task_id != "task_3_hard_e2e":
+            return
+        task_3 = self._evidence["task_3"]
+        if basename == "report_data.json":
+            try:
+                payload = json.loads(content)
+            except json.JSONDecodeError:
+                task_3["report_data_matches_sql"] = False
+                return
+            if not isinstance(payload, list):
+                task_3["report_data_matches_sql"] = False
+                return
+            normalised_rows = normalize_task_3_rows(payload, require_headcount=True)
+            expected_rows = list(self._scenario.task_3.expected_rows)
+            last_sql_rows = task_3.get("last_matching_sql_rows", [])
+            matches_sql = bool(last_sql_rows) and normalised_rows == last_sql_rows
+            matches_expected = task_3_data_matches_expected(
+                normalised_rows,
+                expected_rows,
+                require_headcount=True,
+            )
+            task_3["report_data_matches_sql"] = matches_sql and matches_expected
+            if task_3["report_data_matches_sql"]:
+                self._record_event("t3_report_data_verified")
+            return
+        if basename == "format_report.py":
+            compiles = self._script_compiles(content, basename)
+            task_3["formatter_compiles"] = compiles
+            if compiles:
+                self._record_event("t3_formatter_compiles")
+    def _record_run_evidence(
+        self,
+        basename: str,
+        args: list[str],
+        result: PythonRunResult,
+    ) -> None:
+        if (
+            self._state.task_id == "task_2_medium_syntax"
+            and basename == "broken_pipeline.py"
+        ):
+            if result.returncode == 0 and self._task_2_candidate_is_functional():
+                self._evidence["task_2"]["verified_fix"] = True
+                self._record_event("t2_verified_fix")
+            return
+        if not self._scenario.task_3 or self._state.task_id != "task_3_hard_e2e":
+            return
+        if basename != "format_report.py":
+            return
+        task_3 = self._evidence["task_3"]
+        stdout = (result.stdout or "").strip()
+        if (
+            result.returncode == 0
+            and self._task_3_args_reference_report_data(args)
+            and task_3.get("report_data_matches_sql")
+            and report_matches_expected(
+                stdout,
+                self._scenario.task_3.expected_rows,
+                self._scenario.task_3.target_date,
+            )
+        ):
+            task_3["format_output_matches_expected"] = True
+            task_3["last_formatter_output"] = stdout
+            self._record_event("t3_report_generated")
+    def _record_email_evidence(self, email: dict[str, str]) -> None:
+        if not self._scenario.task_3 or self._state.task_id != "task_3_hard_e2e":
+            return
+        task_3 = self._evidence["task_3"]
+        if len(self.email_outbox) > 1:
+            task_3["single_email_sent"] = False
+            self._record_event("multiple_emails")
+        if (
+            task_3.get("format_output_matches_expected")
+            and task_3.get("single_email_sent")
+            and email.get("to_email") == self._scenario.task_3.recipient
+            and email.get("subject") == self._scenario.task_3.subject
+            and email.get("body", "").strip()
+            == str(task_3.get("last_formatter_output", "")).strip()
+        ):
+            task_3["email_matches_formatter_output"] = True
+            self._record_event("t3_email_verified")
+    def _task_2_candidate_is_functional(self) -> bool:
+        if not self._scenario.task_2:
+            return False
+        wrapper = textwrap.dedent(
+            f"""
+            import importlib.util
+            import json
+            spec = importlib.util.spec_from_file_location("candidate_pipeline", "broken_pipeline.py")
+            module = importlib.util.module_from_spec(spec)
+            assert spec.loader is not None
+            spec.loader.exec_module(module)
+            cases = {json.dumps(self._scenario.task_2.hidden_cases)}
+            results = [module.process_data_stream(case) for case in cases]
+            print("__RESULT__=" + json.dumps(results))
+        """
+        )
+        try:
+            result = run_python_code(
+                wrapper,
+                cwd=self._workspace_dir,
+                timeout_s=DEFAULT_ACTION_TIMEOUT_S,
+                stdout_limit=MAX_STDOUT_CHARS,
+                stderr_limit=MAX_STDERR_CHARS,
+            )
+        except Exception:
+            return False
+        payload = next(
+            (
+                line[len("__RESULT__=") :]
+                for line in result.stdout.splitlines()
+                if line.startswith("__RESULT__=")
+            ),
+            "",
+        )
+        try:
+            parsed = json.loads(payload) if payload else None
+        except json.JSONDecodeError:
+            parsed = None
+        expected = [list(batch) for batch in self._scenario.task_2.hidden_expected]
+        return result.returncode == 0 and parsed == expected
+    def _task_3_args_reference_report_data(self, args: list[str]) -> bool:
+        if len(args) != 1:
+            return False
+        expected_path = self._resolve_workspace_path("report_data.json")
+        if expected_path is None:
+            return False
+        candidate = args[0]
+        if os.path.isabs(candidate):
+            resolved = os.path.realpath(candidate)
+        else:
+            resolved = os.path.realpath(os.path.join(self._workspace_dir, candidate))
+        return resolved == expected_path
+    def _current_task_score(self) -> float:
+        if not self._state.task_id:
+            return 0.0
+        try:
+            from .grading import evaluate_task
+            return float(evaluate_task(self._state.task_id, self).get("score", 0.0))
+        except Exception:
+            logger.exception(
+                "Failed to compute current grader score for reward shaping."
+            )
+            return self._grader_score
+    def _script_compiles(self, content: str, filename: str) -> bool:
+        try:
+            compile(content, filename, "exec")
+        except (SyntaxError, ValueError):
+            # ValueError covers null bytes in the source on Python versions before 3.12.
+            return False
+        return True
+    def _task_completed(self) -> bool:
+        if self._state.task_id == "task_1_easy_anomaly" and self._scenario.task_1:
+            return self._current_transactions_rows() == list(
+                self._scenario.task_1.expected_rows
+            )
+        if self._state.task_id == "task_2_medium_syntax":
+            # The terminal grader can score below 1.0 even when verified_fix is set, because the
+            # score is split across hidden, visible, and provenance components.
+            return self._grader_score >= 1.0
+        if self._state.task_id == "task_3_hard_e2e":
+            # Some evidence flags can be set while the component-weighted grader score is still below 1.0.
+            return self._grader_score >= 1.0
+        return False
+    def _current_transactions_rows(self) -> list[dict[str, Any]]:
+        with sqlite3.connect(self._db_path) as conn:
+            conn.row_factory = sqlite3.Row
+            rows = conn.execute(
+                "SELECT id, user_id, amount, status FROM transactions ORDER BY id"
+            ).fetchall()
+        return [
+            {
+                "id": int(row["id"]),
+                "user_id": int(row["user_id"]),
+                "amount": None
+                if row["amount"] is None
+                else round(float(row["amount"]), 2),
+                "status": str(row["status"]),
+            }
+            for row in rows
+        ]
+    def _record_event(self, event: str) -> None:
+        self._pending_events.append(event)
+    def _resolve_timeout(self, timeout_s: Optional[float]) -> float:
+        if timeout_s is None:
+            return DEFAULT_ACTION_TIMEOUT_S
+        return max(0.1, min(float(timeout_s), MAX_ACTION_TIMEOUT_S))
+    def _is_allowed_file(
+        self, allowed_registry: dict[str, frozenset[str]], basename: str
+    ) -> bool:
+        return basename in allowed_registry.get(self._state.task_id, frozenset())
+    def _resolve_workspace_path(self, basename: str) -> Optional[str]:
+        workspace_root = os.path.realpath(self._workspace_dir)
+        candidate = os.path.realpath(os.path.join(self._workspace_dir, basename))
+        if candidate == workspace_root:
+            return None
+        if not candidate.startswith(f"{workspace_root}{os.sep}"):
+            return None
+        return candidate
+    def _statement_type(self, query: str) -> str:
+        parts = query.split(None, 1)
+        return parts[0].upper() if parts else ""
+    def _validate_sql_action(self, query: str, statement_type: str) -> Optional[str]:
+        if not query:
+            return "SQL query cannot be empty."
+        policy = TASK_SQL_POLICIES.get(self._state.task_id)
+        if policy is None:
+            return "SQL is not available for the active task."
+        if statement_type not in policy.allowed_commands:
+            allowed = ", ".join(sorted(policy.allowed_commands))
+            return f"Only {allowed} statements are allowed for this task."
+        sanitized = self._strip_sql_literals_and_comments(query)
+        normalized = " ".join(sanitized.split())
+        lowered = normalized.lower()
+        if ";" in normalized:
+            return "Only a single SQL statement is allowed."
+        if any(
+            token in lowered
+            for token in ("pragma", "attach", "detach", "sqlite_", "alter ", "drop ")
+        ):
+            return "Query contains disallowed SQL constructs."
+        if statement_type == "DELETE" and not re.match(
+            rf"^delete\s+from\s+{re.escape(policy.required_table)}\s+where\b",
+            lowered,
+        ):
+            return f"DELETE statements must target '{policy.required_table}' with an explicit WHERE clause."
+        cte_names = self._extract_cte_names(normalized)
+        table_refs = self._extract_sql_table_refs(normalized)
+        if policy.required_table not in table_refs:
+            return f"Query must target the '{policy.required_table}' table."
+        allowed_refs = {policy.required_table, *cte_names}
+        disallowed = sorted(ref for ref in table_refs if ref not in allowed_refs)
+        if disallowed:
+            return f"Query references disallowed table(s): {', '.join(disallowed)}."
+        return None
+    def _strip_sql_literals_and_comments(self, query: str) -> str:
+        without_comments = _SQL_COMMENT_RE.sub(" ", query)
+        return _SQL_STRING_RE.sub("''", without_comments)
+    def _extract_cte_names(self, query: str) -> set[str]:
+        lowered = query.lower().lstrip()
+        if not lowered.startswith("with "):
+            return set()
+        return {match.group(1).lower() for match in _SQL_CTE_NAME_RE.finditer(query)}
+    def _extract_sql_table_refs(self, query: str) -> set[str]:
+        return {match.group(1).lower() for match in _SQL_TABLE_REF_RE.finditer(query)}
+    def _obs(
+        self, status: str, message: str, *, done: bool = False
+    ) -> DataOpsObservation:
+        return DataOpsObservation(
+            status=status,
+            message=message,
+            step_count=self._state.step_count,
+            max_steps=MAX_STEPS,
+            done=done,
+        )
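
The `_resolve_workspace_path` helper above confines file access to the task workspace by comparing real paths. A minimal standalone sketch of that containment check follows; the function name `resolve_inside` and the sample file names are illustrative assumptions, not part of the commit.

```python
import os
import tempfile
from typing import Optional


def resolve_inside(root: str, name: str) -> Optional[str]:
    """Return the real path of `name` under `root`, or None if it escapes.

    Hypothetical helper that mirrors the containment idea only.
    """
    root_real = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root_real, name))
    # Reject the root itself and anything that is not strictly below it.
    if candidate == root_real or not candidate.startswith(root_real + os.sep):
        return None
    return candidate


with tempfile.TemporaryDirectory() as workspace:
    inside = resolve_inside(workspace, "report_data.json")
    escaped = resolve_inside(workspace, "../outside.txt")
    absolute = resolve_inside(workspace, os.path.abspath(os.sep + "etc"))
    print(inside is not None, escaped, absolute)
```

Resolving both sides with `os.path.realpath` before comparing means that `..` segments, absolute paths, and symlinks are all normalised first, so the prefix test sees the final location rather than the string the caller supplied.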

server/grading.py ADDED

	@@ -0,0 +1,557 @@

+"""Terminal graders for the seeded DataOpsEnv benchmark."""
+from __future__ import annotations
+import ast
+import json
+import logging
+import os
+import sqlite3
+from typing import Any
+from server.dataops_env_environment import DataOpsEnvironment
+from server.safe_exec import run_python_code, run_python_script
+from server.task_specs import (
+    build_task_3_report,
+    normalize_task_2_output_rows,
+    normalize_task_3_rows,
+    report_matches_expected,
+    task_3_data_matches_expected,
+    task_3_semantic_match_fraction_rows,
+    task_3_semantic_match_fraction_text,
+)
+logger = logging.getLogger(__name__)
+SCRIPT_TIMEOUT_S = 10
+INTERNAL_STDOUT_LIMIT = 50_000
+INTERNAL_STDERR_LIMIT = 10_000
+def evaluate_task(task_id: str, env: DataOpsEnvironment) -> dict[str, Any]:
+    graders = {
+        "task_1_easy_anomaly": _grade_task_1,
+        "task_2_medium_syntax": _grade_task_2,
+        "task_3_hard_e2e": _grade_task_3,
+    }
+    grader = graders.get(task_id)
+    if grader is None:
+        return {"task_id": task_id, "score": 0.0, "details": {"error": "Unknown task"}}
+    score, details = grader(env)
+    return {"task_id": task_id, "score": round(score, 2), "details": details}
+def _grade_task_1(env: DataOpsEnvironment) -> tuple[float, dict[str, Any]]:
+    if env.scenario.task_1 is None:
+        return 0.0, {"error": "Task 1 scenario missing."}
+    try:
+        actual_rows = _current_transactions_rows(env.db_path)
+    except Exception:
+        logger.exception("Task 1 grading error")
+        return 0.0, {"error": "Internal grading error."}
+    expected_rows = list(env.scenario.task_1.expected_rows)
+    corrupted_ids = set(env.scenario.task_1.corrupted_row_ids)
+    actual_ids = {row["id"] for row in actual_rows}
+    expected_ids = {row["id"] for row in expected_rows}
+    corrupted_remaining = sorted(actual_ids & corrupted_ids)
+    rewritten_corrupted = [
+        row for row in actual_rows if row["id"] in corrupted_ids and row["amount"] is not None
+    ]
+    valid_rows_intact = all(
+        any(actual == expected for actual in actual_rows) for expected in expected_rows
+    )
+    details: dict[str, Any] = {
+        "expected_row_ids": sorted(expected_ids),
+        "actual_row_ids": sorted(actual_ids),
+        "corrupted_row_ids": sorted(corrupted_ids),
+        "corrupted_remaining": corrupted_remaining,
+        "valid_rows_intact": valid_rows_intact,
+    }
+    if actual_rows == expected_rows:
+        details["reason"] = "Perfect - corrupted rows were deleted and all valid rows were preserved."
+        details["components"] = {
+            "exact_cleanup": {"score": 1.0, "max": 1.0, "passed": True},
+        }
+        return 1.0, details
+    if rewritten_corrupted:
+        details["reason"] = "Corrupted rows were rewritten instead of being deleted."
+        details["components"] = {
+            "exact_cleanup": {"score": 0.0, "max": 1.0, "passed": False},
+        }
+        return 0.0, details
+    if valid_rows_intact and corrupted_remaining:
+        fraction_removed = 1.0 - (len(corrupted_remaining) / max(len(corrupted_ids), 1))
+        score = round(0.25 * max(fraction_removed, 0.0), 4)
+        details["reason"] = "Some corrupted rows were removed, but cleanup is incomplete."
+        details["components"] = {
+            "partial_cleanup": {"score": score, "max": 0.25, "passed": False},
+        }
+        return score, details
+    details["reason"] = "The transaction table does not match the required cleaned state."
+    details["components"] = {
+        "exact_cleanup": {"score": 0.0, "max": 1.0, "passed": False},
+    }
+    return 0.0, details
+def _grade_task_2(env: DataOpsEnvironment) -> tuple[float, dict[str, Any]]:
+    if env.scenario.task_2 is None:
+        return 0.0, {"error": "Task 2 scenario missing."}
+    script = os.path.join(env.workspace_dir, "broken_pipeline.py")
+    if not os.path.isfile(script):
+        return 0.0, {
+            "reason": "broken_pipeline.py not found.",
+            "components": {
+                "script_present": {"score": 0.0, "max": 1.0, "passed": False},
+            },
+        }
+    try:
+        with open(script, encoding="utf-8") as f:
+            source = f.read()
+        static = _inspect_task_2_source(source)
+        main_result = run_python_script(
+            "broken_pipeline.py",
+            cwd=env.workspace_dir,
+            args=[],
+            timeout_s=SCRIPT_TIMEOUT_S,
+            stdout_limit=INTERNAL_STDOUT_LIMIT,
+            stderr_limit=INTERNAL_STDERR_LIMIT,
+        )
+        visible_result = _run_task_2_case_check(
+            env.workspace_dir,
+            env.scenario.task_2.visible_batch,
+            env.scenario.task_2.visible_expected,
+        )
+        hidden_result = _run_task_2_hidden_tests(
+            env.workspace_dir,
+            env.scenario.task_2.hidden_cases,
+            env.scenario.task_2.hidden_expected,
+        )
+    except Exception:
+        logger.exception("Task 2 grading error")
+        return 0.0, {"error": "Internal grading error."}
+    if main_result.timed_out or visible_result["timed_out"] or hidden_result["timed_out"]:
+        return 0.0, {"reason": "Script timed out.", "components": {}}
+    hidden_score = round(0.60 * hidden_result["pass_fraction"], 4)
+    visible_score = 0.25 if visible_result["passed"] and main_result.returncode == 0 else 0.0
+    execution_score = 0.15 if env.evidence.get("task_2", {}).get("verified_fix") else 0.0
+    components: dict[str, Any] = {
+        "hidden_functional": {
+            "score": hidden_score,
+            "max": 0.60,
+            "passed": hidden_result["passed"],
+        },
+        "visible_pipeline": {
+            "score": visible_score,
+            "max": 0.25,
+            "passed": visible_result["passed"] and main_result.returncode == 0,
+        },
+        "execution_provenance": {
+            "score": execution_score,
+            "max": 0.15,
+            "passed": bool(env.evidence.get("task_2", {}).get("verified_fix")),
+        },
+    }
+    score = round(sum(component["score"] for component in components.values()), 4)
+    details = {
+        "main_exit_code": main_result.returncode,
+        "main_stdout": main_result.stdout[:500],
+        "main_stderr": main_result.stderr[:500],
+        "visible_batch_ok": visible_result["passed"],
+        "hidden_tests_passed": hidden_result["passed"],
+        "hidden_pass_fraction": hidden_result["pass_fraction"],
+        "hidden_case_passes": hidden_result["case_passes"],
+        "static_checks": static,
+        "components": components,
+    }
+    if score == 1.0:
+        details["reason"] = "Seeded hidden tests and the visible verification run both pass."
+    elif hidden_result["passed"] and main_result.returncode == 0:
+        details["reason"] = "The ETL transform is correct, but the agent never verified it through the run action."
+    elif hidden_result["pass_fraction"] > 0 and main_result.returncode == 0:
+        details["reason"] = "The repair improves the ETL transform, but it still fails some seeded cases."
+    elif hidden_result["pass_fraction"] > 0:
+        details["reason"] = "The core transform improved, but the runnable script entrypoint still drifts."
+    elif main_result.returncode == 0:
+        details["reason"] = "The script runs, but it does not yet produce the required normalized records."
+    else:
+        details["reason"] = "The repair is still incorrect or incomplete."
+    return score, details
+def _grade_task_3(env: DataOpsEnvironment) -> tuple[float, dict[str, Any]]:
+    if env.scenario.task_3 is None:
+        return 0.0, {"error": "Task 3 scenario missing."}
+    scenario = env.scenario.task_3
+    evidence = env.evidence.get("task_3", {})
+    expected_rows = list(scenario.expected_rows)
+    expected_report = build_task_3_report(expected_rows, scenario.target_date)
+    report_data = _load_task_3_data(env.workspace_dir, expected_rows)
+    formatter = _run_task_3_formatter(env.workspace_dir, expected_rows, scenario.target_date)
+    email = _score_task_3_email(env, expected_report)
+    report_exact_and_proven = bool(
+        report_data["matches_expected"] and evidence.get("report_data_matches_sql")
+    )
+    formatter_exact_and_proven = bool(
+        formatter["matches_expected"] and evidence.get("format_output_matches_expected")
+    )
+    components: dict[str, Any] = {
+        "sql_provenance": {
+            "score": 0.20 if evidence.get("matching_sql_executed") else 0.0,
+            "max": 0.20,
+            "passed": bool(evidence.get("matching_sql_executed")),
+        },
+        "report_data": {
+            "score": 0.20 if report_exact_and_proven else 0.05 if report_data["matches_expected"] else 0.0,
+            "max": 0.20,
+            "passed": report_exact_and_proven,
+        },
+        "formatter": {
+            "score": 0.25 if formatter_exact_and_proven else 0.05 if formatter["runs"] else 0.0,
+            "max": 0.25,
+            "passed": formatter_exact_and_proven,
+        },
+        "email": {
+            "score": email["score"],
+            "max": 0.35,
+            "passed": email["passed"],
+        },
+    }
+    score = round(sum(component["score"] for component in components.values()), 4)
+    details: dict[str, Any] = {
+        "target_date": scenario.target_date,
+        "expected_recipient": scenario.recipient,
+        "expected_subject": scenario.subject,
+        "report_data": report_data["details"],
+        "formatter": formatter["details"],
+        "email": email["details"],
+        "evidence": evidence,
+        "components": components,
+    }
+    if score == 1.0:
+        details["reason"] = "Perfect - the seeded SQL slice, JSON output, formatter run, and final email all align."
+    elif score >= 0.55:
+        details["reason"] = "Strong progress - some of the seeded workflow is correct, but provenance is incomplete."
+    elif score > 0:
+        details["reason"] = "Partial progress - artifacts exist, but the end-to-end incident workflow is not proven."
+    else:
+        details["reason"] = "The seeded hard task is still unsolved."
+    return score, details
+def _inspect_task_2_source(source: str) -> dict[str, Any]:
+    try:
+        tree = ast.parse(source)
+    except SyntaxError as exc:
+        return {"passed": False, "error": str(exc), "has_function": False}
+    functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
+    target = next((node for node in functions if node.name == "process_data_stream"), None)
+    passed = target is not None and len(target.args.args) == 1
+    return {"passed": passed, "has_function": target is not None}
+def _run_task_2_case_check(
+    workspace_dir: str,
+    batch: tuple[dict[str, Any], ...],
+    expected: tuple[dict[str, Any], ...],
+) -> dict[str, Any]:
+    wrapper = f"""
+import importlib.util
+import json
+spec = importlib.util.spec_from_file_location("candidate_pipeline", "broken_pipeline.py")
+module = importlib.util.module_from_spec(spec)
+assert spec.loader is not None
+spec.loader.exec_module(module)
+batch = {json.dumps(list(batch))}
+results = module.process_data_stream(batch)
+print("__RESULT__=" + json.dumps(results))
+"""
+    result = run_python_code(
+        wrapper,
+        cwd=workspace_dir,
+        timeout_s=SCRIPT_TIMEOUT_S,
+        stdout_limit=INTERNAL_STDOUT_LIMIT,
+        stderr_limit=INTERNAL_STDERR_LIMIT,
+    )
+    payload = next(
+        (
+            line[len("__RESULT__=") :]
+            for line in result.stdout.splitlines()
+            if line.startswith("__RESULT__=")
+        ),
+        "",
+    )
+    try:
+        parsed = json.loads(payload) if payload else None
+    except json.JSONDecodeError:
+        parsed = None
+    normalised = normalize_task_2_output_rows(parsed)
+    ok = result.returncode == 0 and normalised == list(expected)
+    return {
+        "passed": ok,
+        "timed_out": result.timed_out,
+        "stdout": result.stdout[:500],
+        "stderr": result.stderr[:500],
+        "actual": normalised,
+    }
+def _run_task_2_hidden_tests(
+    workspace_dir: str,
+    hidden_cases: tuple[tuple[dict[str, Any], ...], ...],
+    hidden_expected: tuple[tuple[dict[str, Any], ...], ...],
+) -> dict[str, Any]:
+    wrapper = f"""
+import importlib.util
+import json
+spec = importlib.util.spec_from_file_location("candidate_pipeline", "broken_pipeline.py")
+module = importlib.util.module_from_spec(spec)
+assert spec.loader is not None
+spec.loader.exec_module(module)
+cases = {json.dumps([list(batch) for batch in hidden_cases])}
+results = [module.process_data_stream(case) for case in cases]
+print("__RESULT__=" + json.dumps(results))
+"""
+    result = run_python_code(
+        wrapper,
+        cwd=workspace_dir,
+        timeout_s=SCRIPT_TIMEOUT_S,
+        stdout_limit=INTERNAL_STDOUT_LIMIT,
+        stderr_limit=INTERNAL_STDERR_LIMIT,
+    )
+    payload = next(
+        (
+            line[len("__RESULT__=") :]
+            for line in result.stdout.splitlines()
+            if line.startswith("__RESULT__=")
+        ),
+        "",
+    )
+    try:
+        parsed = json.loads(payload) if payload else None
+    except json.JSONDecodeError:
+        parsed = None
+    if not isinstance(parsed, list):
+        parsed = []
+    actual_batches = [
+        normalize_task_2_output_rows(batch)
+        for batch in parsed
+    ]
+    expected = [list(batch) for batch in hidden_expected]
+    case_passes = [
+        actual == expected_case
+        for actual, expected_case in zip(actual_batches, expected, strict=False)
+    ]
+    if len(case_passes) < len(expected):
+        case_passes.extend([False] * (len(expected) - len(case_passes)))
+    pass_fraction = (
+        sum(1 for passed in case_passes if passed) / len(expected)
+        if expected
+        else 0.0
+    )
+    return {
+        "passed": result.returncode == 0 and len(actual_batches) == len(expected) and all(case_passes),
+        "timed_out": result.timed_out,
+        "stdout": result.stdout[:500],
+        "stderr": result.stderr[:500],
+        "actual": actual_batches,
+        "case_passes": case_passes,
+        "pass_fraction": round(pass_fraction, 4),
+    }
+def _load_task_3_data(
+    workspace_dir: str, expected_rows: list[dict[str, Any]]
+) -> dict[str, Any]:
+    report_json = os.path.join(workspace_dir, "report_data.json")
+    if not os.path.isfile(report_json):
+        return {
+            "matches_expected": False,
+            "details": {"exists": False, "reason": "report_data.json not found."},
+        }
+    try:
+        with open(report_json, encoding="utf-8") as f:
+            payload = json.load(f)
+    except (OSError, json.JSONDecodeError) as exc:
+        return {
+            "matches_expected": False,
+            "details": {"exists": True, "reason": str(exc)},
+        }
+    if not isinstance(payload, list):
+        return {
+            "matches_expected": False,
+            "details": {
+                "exists": True,
+                "reason": "report_data.json must contain a JSON list.",
+            },
+        }
+    rows = normalize_task_3_rows(payload, require_headcount=True)
+    matches_expected = bool(rows) and task_3_data_matches_expected(
+        rows,
+        expected_rows,
+        require_headcount=True,
+    )
+    semantic_fraction = task_3_semantic_match_fraction_rows(rows, expected_rows)
+    return {
+        "matches_expected": matches_expected,
+        "details": {
+            "exists": True,
+            "rows_valid": bool(rows),
+            "rows_match_expected": matches_expected,
+            "semantic_fraction": round(semantic_fraction, 4),
+        },
+    }
+def _run_task_3_formatter(
+    workspace_dir: str,
+    expected_rows: list[dict[str, Any]],
+    target_date: str,
+) -> dict[str, Any]:
+    script = os.path.join(workspace_dir, "format_report.py")
+    if not os.path.isfile(script):
+        return {
+            "runs": False,
+            "matches_expected": False,
+            "details": {"reason": "format_report.py not found."},
+        }
+    try:
+        result = run_python_script(
+            "format_report.py",
+            cwd=workspace_dir,
+            args=["report_data.json"],
+            timeout_s=SCRIPT_TIMEOUT_S,
+            stdout_limit=INTERNAL_STDOUT_LIMIT,
+            stderr_limit=INTERNAL_STDERR_LIMIT,
+        )
+    except Exception as exc:
+        return {
+            "runs": False,
+            "matches_expected": False,
+            "details": {"reason": str(exc)},
+        }
+    if result.timed_out:
+        return {
+            "runs": False,
+            "matches_expected": False,
+            "details": {"reason": "Formatter timed out."},
+        }
+    stdout = (result.stdout or "").strip()
+    matches_expected = result.returncode == 0 and report_matches_expected(
+        stdout,
+        expected_rows,
+        target_date,
+    )
+    return {
+        "runs": result.returncode == 0,
+        "matches_expected": matches_expected,
+        "details": {
+            "exit_code": result.returncode,
+            "stdout": stdout[:500],
+            "stderr": (result.stderr or "")[:500],
+            "semantic_fraction": round(
+                task_3_semantic_match_fraction_text(stdout, expected_rows, target_date),
+                4,
+            ),
+        },
+    }
+def _score_task_3_email(
+    env: DataOpsEnvironment, expected_report: str
+) -> dict[str, Any]:
+    scenario = env.scenario.task_3
+    assert scenario is not None
+    evidence = env.evidence.get("task_3", {})
+    outbox = env.email_outbox
+    if not outbox:
+        return {
+            "score": 0.0,
+            "passed": False,
+            "details": {"reason": "No email sent."},
+        }
+    email = outbox[-1]
+    recipient_ok = email.get("to_email") == scenario.recipient
+    subject_ok = email.get("subject") == scenario.subject
+    body = str(email.get("body", "")).strip()
+    body_ok = body == expected_report.strip()
+    proven = bool(evidence.get("email_matches_formatter_output")) and len(outbox) == 1
+    score = 0.0
+    if recipient_ok:
+        score += 0.05
+    if subject_ok:
+        score += 0.05
+    if body_ok and proven:
+        score += 0.25
+    return {
+        "score": score,
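+        # Derive the flag from the individual checks rather than comparing a float sum for equality.
+        "passed": recipient_ok and subject_ok and body_ok and proven,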
+        "details": {
+            "emails_sent": len(outbox),
+            "recipient_ok": recipient_ok,
+            "subject_ok": subject_ok,
+            "body_ok": body_ok,
+            "proven": proven,
+            "semantic_fraction": round(
+                task_3_semantic_match_fraction_text(
+                    body,
+                    list(scenario.expected_rows),
+                    scenario.target_date,
+                ),
+                4,
+            ),
+        },
+    }
+def _current_transactions_rows(db_path: str) -> list[dict[str, Any]]:
+    with sqlite3.connect(db_path) as conn:
+        conn.row_factory = sqlite3.Row
+        table_exists = conn.execute(
+            "SELECT name FROM sqlite_master WHERE type='table' AND name='transactions'"
+        ).fetchone()
+        if not table_exists:
+            return []
+        rows = conn.execute(
+            "SELECT id, user_id, amount, status FROM transactions ORDER BY id"
+        ).fetchall()
+    return [
+        {
+            "id": int(row["id"]),
+            "user_id": int(row["user_id"]),
+            "amount": None if row["amount"] is None else round(float(row["amount"]), 2),
+            "status": str(row["status"]),
+        }
+        for row in rows
+    ]
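
The Task 2 grader above combines three weighted components: 0.60 for the hidden cases (scaled by the fraction that pass), 0.25 for the visible batch together with a clean script exit, and 0.15 for execution provenance. The sketch below reproduces only that arithmetic under simplified inputs; `task_2_score` and its boolean parameters are hypothetical stand-ins for the grader's internal results.

```python
def task_2_score(hidden_case_passes, visible_ok, main_exit_ok, verified_fix):
    """Combine the three Task 2 components using the weights shown in the diff.

    Hypothetical helper: the real grader derives these inputs by running the
    candidate script in a sandbox.
    """
    total = len(hidden_case_passes)
    # Fraction of hidden cases that passed, rounded as the grader does.
    pass_fraction = round(sum(hidden_case_passes) / total, 4) if total else 0.0
    components = {
        "hidden_functional": round(0.60 * pass_fraction, 4),
        "visible_pipeline": 0.25 if (visible_ok and main_exit_ok) else 0.0,
        "execution_provenance": 0.15 if verified_fix else 0.0,
    }
    return round(sum(components.values()), 4), components


full_score, _ = task_2_score([True, True, True], True, True, True)
unverified_score, _ = task_2_score([True, True, True], True, True, False)
partial_score, partial_parts = task_2_score([True, True, False], True, True, True)
print(full_score, unverified_score, partial_score)
```

Under these weights a functionally correct repair that was never run through the environment's run action tops out at 0.85, which is consistent with `_task_completed` requiring a grader score of 1.0 rather than relying on functional correctness alone.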

server/requirements.txt ADDED

	@@ -0,0 +1,10 @@

+openenv-core[core]>=0.2.2
+fastapi>=0.115.0
+starlette>=0.46.0,<0.52.0
+uvicorn[standard]>=0.34.0
+pydantic>=2.10.0
+pyyaml>=6.0.2
+openai>=1.60.0
+requests>=2.32.0
+wsproto>=1.3.2
+python-dotenv>=1.0.0

server/safe_exec.py ADDED

	@@ -0,0 +1,195 @@

+from __future__ import annotations
+import json
+import math
+import os
+import signal
+import subprocess
+import sys
+import tempfile
+from dataclasses import dataclass
+DEFAULT_ADDRESS_SPACE_BYTES = 512 * 1024 * 1024
+DEFAULT_FILE_BYTES = 2 * 1024 * 1024
+DEFAULT_OPEN_FILES = 64
+DEFAULT_PROCESSES = 32
+_RUNNER_BOOTSTRAP = r"""
+import json
+import runpy
+import sys
+try:
+    import resource
+except ImportError:  # pragma: no cover
+    resource = None
+def _set_limit(name, value):
+    if resource is None or not hasattr(resource, name):
+        return
+    limit = int(value)
+    try:
+        _, current_hard = resource.getrlimit(getattr(resource, name))
+        soft = min(limit, current_hard) if current_hard >= 0 else limit
+        resource.setrlimit(getattr(resource, name), (soft, current_hard))
+    except (OSError, ValueError):
+        return
+config = json.loads(sys.argv[1])
+_set_limit("RLIMIT_CORE", 0)
+_set_limit("RLIMIT_CPU", config["cpu_seconds"])
+_set_limit("RLIMIT_FSIZE", config["file_bytes"])
+_set_limit("RLIMIT_NOFILE", config["open_files"])
+_set_limit("RLIMIT_AS", config["address_space_bytes"])
+_set_limit("RLIMIT_NPROC", config["processes"])
+mode = config["mode"]
+if mode == "script":
+    script = sys.argv[2]
+    sys.argv = sys.argv[2:]
+    runpy.run_path(script, run_name="__main__")
+elif mode == "code":
+    sys.argv = ["-c"]
+    exec(config["code"], {"__name__": "__main__"})
+else:  # pragma: no cover
+    raise SystemExit(f"Unsupported execution mode: {mode}")
+"""
+@dataclass(frozen=True)
+class PythonRunResult:
+    returncode: int
+    stdout: str
+    stderr: str
+    timed_out: bool = False
+def _safe_env(workspace_dir: str) -> dict[str, str]:
+    return {
+        "HOME": workspace_dir,
+        "TMPDIR": workspace_dir,
+        "LANG": "C.UTF-8",
+        "LC_ALL": "C.UTF-8",
+        "PATH": "",
+        "PYTHONDONTWRITEBYTECODE": "1",
+        "PYTHONHASHSEED": "0",
+        "PYTHONIOENCODING": "utf-8",
+        "PYTHONNOUSERSITE": "1",
+    }
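+def _limit_config(timeout_s: float) -> dict[str, int]:
+    """Build the rlimit settings passed to the child bootstrap.
+
+    The CPU-time cap is the wall-clock timeout rounded up plus one second.
+    """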
+    return {
+        "cpu_seconds": max(1, int(math.ceil(timeout_s)) + 1),
+        "file_bytes": DEFAULT_FILE_BYTES,
+        "open_files": DEFAULT_OPEN_FILES,
+        "address_space_bytes": DEFAULT_ADDRESS_SPACE_BYTES,
+        "processes": DEFAULT_PROCESSES,
+    }
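+def _read_limited_text(handle, limit: int) -> str:
+    """Rewind `handle` and return at most `limit` characters of its content.
+
+    Bytes are decoded as UTF-8 with undecodable sequences replaced.
+    """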
+    handle.seek(0)
+    data = handle.read(limit + 1)
+    if isinstance(data, bytes):
+        return data.decode("utf-8", errors="replace")[:limit]
+    return str(data)[:limit]
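+def _terminate_process(proc: subprocess.Popen[bytes]) -> None:
+    """SIGKILL the child's whole process group on POSIX; use proc.kill() on Windows."""
+    if proc.poll() is not None:
+        return
+    if os.name != "nt":
+        try:
+            # The child was started as a session leader, so its pid is also its pgid.
+            os.killpg(proc.pid, signal.SIGKILL)
+        except ProcessLookupError:
+            pass  # The group already exited.
+        return
+    proc.kill()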
+def _run_python_command(
+    config: dict[str, object],
+    *,
+    cwd: str,
+    argv: list[str],
+    timeout_s: float,
+    stdout_limit: int,
+    stderr_limit: int,
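+) -> PythonRunResult:
+    """Run the bootstrap in an isolated child interpreter (`-I -B`).
+
+    stdout and stderr are captured through temporary files and truncated to
+    the given limits. If the child exceeds `timeout_s`, it is killed and
+    the result is flagged with `timed_out=True`.
+    """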
+    command = [
+        sys.executable,
+        "-I",
+        "-B",
+        "-c",
+        _RUNNER_BOOTSTRAP,
+        json.dumps(config, ensure_ascii=True),
+        *argv,
+    ]
+    start_new_session = os.name != "nt"
+    with tempfile.TemporaryFile() as stdout_file, tempfile.TemporaryFile() as stderr_file:
+        proc = subprocess.Popen(
+            command,
+            cwd=cwd,
+            env=_safe_env(cwd),
+            stdin=subprocess.DEVNULL,
+            stdout=stdout_file,
+            stderr=stderr_file,
+            start_new_session=start_new_session,
+        )
+        timed_out = False
+        try:
+            proc.wait(timeout=timeout_s)
+        except subprocess.TimeoutExpired:
+            timed_out = True
+            _terminate_process(proc)
+            proc.wait()
+        return PythonRunResult(
+            returncode=proc.returncode if proc.returncode is not None else -1,
+            stdout=_read_limited_text(stdout_file, stdout_limit),
+            stderr=_read_limited_text(stderr_file, stderr_limit),
+            timed_out=timed_out,
+        )
+def run_python_script(
+    script_name: str,
+    *,
+    cwd: str,
+    args: list[str],
+    timeout_s: float,
+    stdout_limit: int,
+    stderr_limit: int,
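+) -> PythonRunResult:
+    """Run `script_name` (resolved relative to `cwd`) under the resource limits.
+
+    Inside the child, `sys.argv` becomes `[script_name, *args]`.
+    """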
+    config = {"mode": "script", **_limit_config(timeout_s)}
+    return _run_python_command(
+        config,
+        cwd=cwd,
+        argv=[script_name, *args],
+        timeout_s=timeout_s,
+        stdout_limit=stdout_limit,
+        stderr_limit=stderr_limit,
+    )
+def run_python_code(
+    code: str,
+    *,
+    cwd: str,
+    timeout_s: float,
+    stdout_limit: int,
+    stderr_limit: int,
+) -> PythonRunResult:
+    """Execute `code` with `exec` in a fresh `__main__` namespace under the resource limits."""
+    config = {"mode": "code", "code": code, **_limit_config(timeout_s)}
+    return _run_python_command(
+        config,
+        cwd=cwd,
+        argv=[],
+        timeout_s=timeout_s,
+        stdout_limit=stdout_limit,
+        stderr_limit=stderr_limit,
+    )
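
The timeout path in `_run_python_command` and `_terminate_process` can be illustrated with a small standalone sketch. `run_with_timeout` is a hypothetical helper, not part of `server/safe_exec.py`. It leaves out the `resource` limits and the sanitized environment, and only shows the pattern of starting the child in its own session and killing the whole process group when the wall-clock timeout expires.

```python
import os
import signal
import subprocess
import sys
import tempfile


def run_with_timeout(code: str, timeout_s: float, limit: int = 4096):
    """Hypothetical helper: run `code` in a child interpreter with a wall-clock cap.

    Returns (returncode, captured_stdout, timed_out).
    """
    posix = os.name != "nt"
    with tempfile.TemporaryFile() as out:
        proc = subprocess.Popen(
            [sys.executable, "-I", "-B", "-c", code],
            stdin=subprocess.DEVNULL,
            stdout=out,
            stderr=subprocess.DEVNULL,
            start_new_session=posix,  # child leads its own process group
        )
        timed_out = False
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            timed_out = True
            if posix:
                try:
                    # Kill the child and anything it spawned in the same group.
                    os.killpg(proc.pid, signal.SIGKILL)
                except ProcessLookupError:
                    pass  # the group already exited
            else:
                proc.kill()
            proc.wait()  # reap the child so returncode is populated
        out.seek(0)
        text = out.read(limit).decode("utf-8", errors="replace")
    return proc.returncode, text, timed_out
```

Calling `run_with_timeout("print('hi')", 10.0)` should return a zero exit code with the captured text, while a busy loop such as `"while True: pass"` with a short timeout should come back with `timed_out` set. Writing output to a temporary file rather than a pipe means a child that prints a lot cannot block on a full pipe buffer while the parent waits.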

server/session_manager.py ADDED

	@@ -0,0 +1,128 @@

+from __future__ import annotations
+import time
+import threading
+import uuid
+from dataclasses import dataclass
+from typing import Optional
+from server.dataops_env_environment import DataOpsEnvironment
+@dataclass
+class SessionRecord:
+    env: DataOpsEnvironment
+    last_access_at: float
+class EnvironmentSessionManager:
+    """Small in-memory session store for isolated environment instances."""
+    def __init__(
+        self,
+        *,
+        max_sessions: int = 128,
+        session_timeout_s: float = 1800.0,
+    ) -> None:
+        self._lock = threading.Lock()
+        self._sessions: dict[str, SessionRecord] = {}
+        self._max_sessions = max(1, max_sessions)
+        self._session_timeout_s = max(1.0, session_timeout_s)
+    def reset_session(
+        self,
+        *,
+        task_id: str,
+        seed: Optional[int],
+        episode_id: Optional[str],
+        session_id: Optional[str],
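+    ) -> tuple[str, DataOpsEnvironment, object]:
+        """Reset a session's environment and return `(session_id, env, observation)`.
+
+        A known `session_id` reuses its environment. A missing or unknown id
+        creates a new environment under a fresh UUID, evicting the least
+        recently used session first if the store is full.
+        """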
+        now = time.monotonic()
+        to_close: list[DataOpsEnvironment] = []
+        with self._lock:
+            to_close.extend(self._collect_expired_envs_locked(now))
+            record = self._sessions.get(session_id) if session_id else None
+            if record is None:
+                resolved_session_id = str(uuid.uuid4())
+                to_close.extend(self._evict_if_full_locked(now))
+                env = DataOpsEnvironment()
+                self._sessions[resolved_session_id] = SessionRecord(
+                    env=env,
+                    last_access_at=now,
+                )
+            else:
+                resolved_session_id = session_id or str(uuid.uuid4())
+                record.last_access_at = now
+                env = record.env
+        self._close_envs(to_close)
+        obs = env.reset(seed=seed, episode_id=episode_id, task_id=task_id)
+        return resolved_session_id, env, obs
+    def get_session(
+        self, session_id: Optional[str]
+    ) -> tuple[Optional[str], Optional[DataOpsEnvironment]]:
+        now = time.monotonic()
+        to_close: list[DataOpsEnvironment] = []
+        with self._lock:
+            to_close.extend(self._collect_expired_envs_locked(now))
+            if session_id:
+                record = self._sessions.get(session_id)
+                if record is not None:
+                    record.last_access_at = now
+                    env = record.env
+                else:
+                    env = None
+                result = (session_id, env)
+            else:
+                result = (None, None)
+        self._close_envs(to_close)
+        return result
+    def close_all(self) -> None:
+        with self._lock:
+            records = list(self._sessions.values())
+            self._sessions.clear()
+        self._close_envs([record.env for record in records])
+    def _collect_expired_envs_locked(self, now: float) -> list[DataOpsEnvironment]:
+        expired_ids = [
+            session_id
+            for session_id, record in self._sessions.items()
+            if now - record.last_access_at > self._session_timeout_s
+        ]
+        return self._remove_sessions_locked(expired_ids)
+    def _evict_if_full_locked(self, now: float) -> list[DataOpsEnvironment]:
+        if len(self._sessions) < self._max_sessions:
+            return []
+        oldest_session_id = min(
+            self._sessions,
+            key=lambda session_id: self._sessions[session_id].last_access_at,
+        )
+        return self._remove_sessions_locked([oldest_session_id])
+    def _remove_sessions_locked(self, session_ids: list[str]) -> list[DataOpsEnvironment]:
+        removed: list[DataOpsEnvironment] = []
+        for session_id in session_ids:
+            record = self._sessions.pop(session_id, None)
+            if record is not None:
+                removed.append(record.env)
+        return removed
+    def _close_envs(self, envs: list[DataOpsEnvironment]) -> None:
+        for env in envs:
+            env.close()
+    def __del__(self) -> None:
+        try:
+            self.close_all()
+        except Exception:
+            pass
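
The expiry and eviction order used by `EnvironmentSessionManager` can be sketched in isolation. `TinySessionStore` and `FakeEnv` below are hypothetical stand-ins, not classes from this repository, and the clock is passed in explicitly instead of calling `time.monotonic()` so the behaviour is deterministic. As in the manager above, idle sessions are collected first, the least recently used session is evicted only if the store is still full, and `close()` runs after the lock is released.

```python
import threading
import uuid


class FakeEnv:
    """Stand-in for an environment; only records whether close() ran."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class TinySessionStore:
    """Hypothetical, simplified session store with idle expiry and LRU eviction."""

    def __init__(self, max_sessions, timeout_s):
        self._lock = threading.Lock()
        self._sessions = {}  # session_id -> (env, last_access_at)
        self._max = max(1, max_sessions)
        self._timeout = timeout_s

    def open(self, now):
        to_close = []
        with self._lock:
            # 1. Drop every session that has been idle longer than the timeout.
            expired = [
                sid for sid, (_, seen) in self._sessions.items()
                if now - seen > self._timeout
            ]
            for sid in expired:
                to_close.append(self._sessions.pop(sid)[0])
            # 2. If still full, evict the least recently used session.
            if len(self._sessions) >= self._max:
                oldest = min(self._sessions, key=lambda s: self._sessions[s][1])
                to_close.append(self._sessions.pop(oldest)[0])
            session_id = str(uuid.uuid4())
            env = FakeEnv()
            self._sessions[session_id] = (env, now)
        # 3. Close outside the lock so slow teardown does not block other callers.
        for old_env in to_close:
            old_env.close()
        return session_id, env
```

With `max_sessions=2`, opening a third session closes the first one; opening another session long after the timeout closes all remaining idle sessions without needing an eviction.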

server/task_specs.py ADDED

	@@ -0,0 +1,773 @@

+"""Seeded task metadata and deterministic scenario builders for DataOpsEnv."""
+from __future__ import annotations
+import random
+import re
+import textwrap
+from dataclasses import dataclass
+from datetime import date, timedelta
+from typing import Any, Iterable
+TASK_IDS = [
+    "task_1_easy_anomaly",
+    "task_2_medium_syntax",
+    "task_3_hard_e2e",
+]
+@dataclass(frozen=True)
+class SQLPolicy:
+    allowed_commands: frozenset[str]
+    required_table: str
+@dataclass(frozen=True)
+class TaskMetadata:
+    task_id: str
+    name: str
+    difficulty: str
+    short_description: str
+    benchmark_focus: str
+    allowed_actions: tuple[str, ...]
+@dataclass(frozen=True)
+class Task1Scenario:
+    description: str
+    all_rows: tuple[dict[str, Any], ...]
+    expected_rows: tuple[dict[str, Any], ...]
+    corrupted_row_ids: tuple[int, ...]
+@dataclass(frozen=True)
+class Task2Scenario:
+    description: str
+    visible_batch: tuple[dict[str, Any], ...]
+    visible_expected: tuple[dict[str, Any], ...]
+    hidden_cases: tuple[tuple[dict[str, Any], ...], ...]
+    hidden_expected: tuple[tuple[dict[str, Any], ...], ...]
+    broken_script: str
+@dataclass(frozen=True)
+class Task3Scenario:
+    description: str
+    target_date: str
+    recipient: str
+    subject: str
+    report_title: str
+    all_rows: tuple[dict[str, Any], ...]
+    expected_rows: tuple[dict[str, Any], ...]
+    broken_script: str
+@dataclass(frozen=True)
+class TaskScenarioBundle:
+    task_id: str
+    seed: int
+    description: str
+    task_1: Task1Scenario | None = None
+    task_2: Task2Scenario | None = None
+    task_3: Task3Scenario | None = None
+TASK_METADATA = {
+    "task_1_easy_anomaly": TaskMetadata(
+        task_id="task_1_easy_anomaly",
+        name="Delete Corrupted Transaction Rows",
+        difficulty="easy",
+        short_description=(
+            "Inspect a transaction table and remove only the seeded rows with NULL amounts while preserving legitimate non-null edge values."
+        ),
+        benchmark_focus="Careful data cleanup without collateral damage.",
+        allowed_actions=("ExecuteSQL",),
+    ),
+    "task_2_medium_syntax": TaskMetadata(
+        task_id="task_2_medium_syntax",
+        name="Repair Seeded Pipeline Script",
+        difficulty="medium",
+        short_description=(
+            "Repair a seeded ETL normalization script and verify it on visible and hidden seeded batches."
+        ),
+        benchmark_focus="Code reading, precise repair, and generalization beyond the demo batch.",
+        allowed_actions=("ReadFile", "WriteFile", "RunScript"),
+    ),
+    "task_3_hard_e2e": TaskMetadata(
+        task_id="task_3_hard_e2e",
+        name="Resolve Revenue Reporting Incident",
+        difficulty="hard",
+        short_description=(
+            "Extract a seeded reporting slice, repair the formatter, and send the exact generated report."
+        ),
+        benchmark_focus="End-to-end data extraction, file repair, and communication with provenance.",
+        allowed_actions=("ExecuteSQL", "ReadFile", "WriteFile", "RunScript", "SendEmail"),
+    ),
+}
+TASK_DESCRIPTIONS = {
+    task_id: metadata.short_description for task_id, metadata in TASK_METADATA.items()
+}
+TASK_ALLOWED_WRITE_FILES = {
+    "task_1_easy_anomaly": frozenset(),
+    "task_2_medium_syntax": frozenset({"broken_pipeline.py"}),
+    "task_3_hard_e2e": frozenset({"format_report.py", "report_data.json"}),
+}
+TASK_ALLOWED_RUN_FILES = {
+    "task_1_easy_anomaly": frozenset(),
+    "task_2_medium_syntax": frozenset({"broken_pipeline.py"}),
+    "task_3_hard_e2e": frozenset({"format_report.py"}),
+}
+TASK_EMAIL_ENABLED = frozenset({"task_3_hard_e2e"})
+TASK_ALLOWED_READ_FILES = {
+    "task_1_easy_anomaly": frozenset(),
+    "task_2_medium_syntax": frozenset({"broken_pipeline.py"}),
+    "task_3_hard_e2e": frozenset({"format_report.py", "report_data.json"}),
+}
+TASK_SQL_POLICIES = {
+    "task_1_easy_anomaly": SQLPolicy(
+        allowed_commands=frozenset({"SELECT", "DELETE"}),
+        required_table="transactions",
+    ),
+    "task_3_hard_e2e": SQLPolicy(
+        allowed_commands=frozenset({"SELECT", "WITH"}),
+        required_table="daily_reports",
+    ),
+}
+_REPORT_RECORD_RE = re.compile(
+    r"Department:\s*(?P<department>[^\n]+)\n"
+    r"\s*Revenue:\s*\$(?P<revenue>-?\d+(?:\.\d+)?)\n"
+    r"\s*Expenses:\s*\$(?P<expenses>-?\d+(?:\.\d+)?)\n"
+    r"\s*Net:\s*\$(?P<net>-?\d+(?:\.\d+)?)",
+    re.MULTILINE,
+)
+_REPORT_TOTAL_RE = re.compile(r"Total Revenue:\s*\$(?P<total>-?\d+(?:\.\d+)?)")
+_TASK_1_VALID_STATUSES = ("success", "settled", "approved", "completed")
+_TASK_1_CORRUPTED_STATUSES = ("pending", "retrying", "failed", "queued")
+_TASK_2_READY_STATUS = "ready"
+_TASK_2_NON_READY_STATUSES = ("queued", "hold", "failed")
+_TASK_2_REGIONS = ("us-east", "eu-west", "ap-south", "sa-east")
+_TASK_3_RECIPIENTS = (
+    "bhavik@example.com",
+    "marta@example.com",
+    "ops-lead@example.com",
+    "finance-review@example.com",
+)
+_TASK_3_DEPARTMENTS = (
+    "Engineering",
+    "Sales",
+    "Marketing",
+    "Operations",
+    "Support",
+    "Finance",
+)
+def task_manifest_entries() -> list[dict[str, Any]]:
+    return [
+        {
+            "id": metadata.task_id,
+            "name": metadata.name,
+            "difficulty": metadata.difficulty,
+            "description": metadata.short_description,
+            "benchmark_focus": metadata.benchmark_focus,
+            "allowed_actions": list(metadata.allowed_actions),
+        }
+        for metadata in TASK_METADATA.values()
+    ]
+def build_task_scenario(task_id: str, seed: int | None = None) -> TaskScenarioBundle:
+    resolved_seed = 0 if seed is None else int(seed)
+    if task_id == "task_1_easy_anomaly":
+        task = _build_task_1_scenario(resolved_seed)
+        return TaskScenarioBundle(
+            task_id=task_id,
+            seed=resolved_seed,
+            description=task.description,
+            task_1=task,
+        )
+    if task_id == "task_2_medium_syntax":
+        task = _build_task_2_scenario(resolved_seed)
+        return TaskScenarioBundle(
+            task_id=task_id,
+            seed=resolved_seed,
+            description=task.description,
+            task_2=task,
+        )
+    if task_id == "task_3_hard_e2e":
+        task = _build_task_3_scenario(resolved_seed)
+        return TaskScenarioBundle(
+            task_id=task_id,
+            seed=resolved_seed,
+            description=task.description,
+            task_3=task,
+        )
+    raise KeyError(f"Unknown task_id: {task_id}")
+def normalize_task_3_rows(
+    rows: Iterable[dict[str, Any]], *, require_headcount: bool = False
+) -> list[dict[str, Any]]:
+    """Normalise extracted rows for deterministic comparison."""
+    normalised: list[dict[str, Any]] = []
+    for row in rows:
+        try:
+            hc_raw = row.get("headcount")
+            if hc_raw is None or hc_raw == "":
+                if require_headcount:
+                    return []
+                headcount: int | None = None
+            else:
+                headcount = int(hc_raw)
+            normalised.append(
+                {
+                    "department": str(row["department"]),
+                    "revenue": round(float(row["revenue"]), 2),
+                    "expenses": round(float(row["expenses"]), 2),
+                    "headcount": headcount,
+                }
+            )
+        # AttributeError covers rows that are not dicts (no .get method).
+        except (AttributeError, KeyError, TypeError, ValueError):
+            return []
+    return sorted(normalised, key=lambda item: item["department"])
+def normalize_task_2_output_rows(rows: Any) -> list[dict[str, Any]]:
+    """Normalise Task 2 ETL output rows while preserving list order for sort checks."""
+    if not isinstance(rows, list):
+        return []
+    normalised: list[dict[str, Any]] = []
+    for row in rows:
+        if not isinstance(row, dict):
+            return []
+        try:
+            order_id = str(row["order_id"])
+            region = str(row["region"])
+            amount_usd = round(float(row["amount_usd"]), 2)
+            priority_band = str(row["priority_band"])
+        except (KeyError, TypeError, ValueError):
+            return []
+        if priority_band not in {"high", "normal"}:
+            return []
+        normalised.append(
+            {
+                "order_id": order_id,
+                "region": region,
+                "amount_usd": amount_usd,
+                "priority_band": priority_band,
+            }
+        )
+    return normalised
+def build_task_2_expected(
+    batch: Iterable[dict[str, Any]]
+) -> list[dict[str, Any]]:
+    """Reference Task 2 transform used to derive the expected output for a batch."""
+    processed: list[dict[str, Any]] = []
+    for record in batch:
+        try:
+            status = str(record["status"])
+            amount_cents = int(record["amount_cents"])
+            priority = int(record["priority"])
+            amount_usd = round(amount_cents / 100.0, 2)
+            if status != _TASK_2_READY_STATUS or amount_cents <= 0:
+                continue
+            processed.append(
+                {
+                    "order_id": str(record["order_id"]),
+                    "region": str(record["region"]),
+                    "amount_usd": amount_usd,
+                    "priority_band": "high"
+                    if priority >= 8 or amount_usd >= 500.0
+                    else "normal",
+                }
+            )
+        except (KeyError, TypeError, ValueError):
+            return []
+    processed.sort(key=lambda item: (-item["amount_usd"], item["order_id"]))
+    return processed
+def task_3_data_matches_expected(
+    rows: list[dict[str, Any]],
+    expected_rows: Iterable[dict[str, Any]],
+    *,
+    require_headcount: bool,
+) -> bool:
+    expected = normalize_task_3_rows(expected_rows, require_headcount=require_headcount)
+    return rows == expected
+def task_3_headcount_fully_matches(
+    rows: list[dict[str, Any]], expected_rows: Iterable[dict[str, Any]]
+) -> bool:
+    expected = normalize_task_3_rows(expected_rows, require_headcount=True)
+    return rows == expected
+def build_task_3_report(rows: Iterable[dict[str, Any]], target_date: str) -> str:
+    report_rows = normalize_task_3_rows(rows, require_headcount=True)
+    lines = [f"=== Daily Revenue Report ({target_date}) ===", ""]
+    total_revenue = 0.0
+    for row in report_rows:
+        revenue = float(row["revenue"])
+        expenses = float(row["expenses"])
+        net = revenue - expenses
+        lines.append(f"Department: {row['department']}")
+        lines.append(f"  Revenue:  ${revenue:.2f}")
+        lines.append(f"  Expenses: ${expenses:.2f}")
+        lines.append(f"  Net:      ${net:.2f}")
+        lines.append("")
+        total_revenue += revenue
+    lines.append(f"Total Revenue: ${total_revenue:.2f}")
+    lines.append("=== End of Report ===")
+    return "\n".join(lines)
+def extract_task_3_report_block(text: str, target_date: str) -> str | None:
+    raw = text.replace("\r\n", "\n")
+    start_marker = f"=== Daily Revenue Report ({target_date}) ==="
+    end_marker = "=== End of Report ==="
+    start = raw.find(start_marker)
+    if start == -1:
+        return None
+    # Look for the end marker only after the start marker, so a stray end
+    # marker earlier in the text does not hide a valid report block.
+    end = raw.find(end_marker, start + len(start_marker))
+    if end == -1:
+        return None
+    return raw[start : end + len(end_marker)].strip()
+def parse_task_3_report(text: str, target_date: str) -> dict[str, Any] | None:
+    block = extract_task_3_report_block(text, target_date)
+    if block is None:
+        return None
+    records: list[dict[str, Any]] = []
+    for match in _REPORT_RECORD_RE.finditer(block):
+        revenue = round(float(match.group("revenue")), 2)
+        expenses = round(float(match.group("expenses")), 2)
+        net = round(float(match.group("net")), 2)
+        records.append(
+            {
+                "department": match.group("department").strip(),
+                "revenue": revenue,
+                "expenses": expenses,
+                "headcount": None,
+                "net": net,
+            }
+        )
+    total_match = _REPORT_TOTAL_RE.search(block)
+    if not total_match:
+        return None
+    return {
+        "records": sorted(records, key=lambda item: item["department"]),
+        "total_revenue": round(float(total_match.group("total")), 2),
+    }
+def report_matches_expected(
+    text: str, expected_rows: Iterable[dict[str, Any]], target_date: str
+) -> bool:
+    parsed = parse_task_3_report(text, target_date)
+    if parsed is None:
+        return False
+    expected = normalize_task_3_rows(expected_rows, require_headcount=True)
+    expected_records = [
+        {
+            "department": row["department"],
+            "revenue": row["revenue"],
+            "expenses": row["expenses"],
+            "headcount": None,
+            "net": round(float(row["revenue"]) - float(row["expenses"]), 2),
+        }
+        for row in expected
+    ]
+    expected_total = round(sum(float(row["revenue"]) for row in expected), 2)
+    return (
+        parsed["records"] == expected_records
+        and parsed["total_revenue"] == expected_total
+    )
+def task_3_semantic_match_fraction_rows(
+    rows: list[dict[str, Any]], expected_rows: Iterable[dict[str, Any]]
+) -> float:
+    if not rows:
+        return 0.0
+    expected = normalize_task_3_rows(expected_rows, require_headcount=False)
+    exp_by_dept = {row["department"]: row for row in expected}
+    matched_departments: set[str] = set()
+    for row in rows:
+        department = row.get("department")
+        if department not in exp_by_dept:
+            continue
+        expected_row = exp_by_dept[department]
+        if (
+            row.get("revenue") == expected_row["revenue"]
+            and row.get("expenses") == expected_row["expenses"]
+        ):
+            # Count each department once so duplicate rows cannot push the
+            # fraction above 1.0.
+            matched_departments.add(department)
+    return len(matched_departments) / len(expected) if expected else 0.0
+def task_3_semantic_match_fraction_parsed(
+    parsed: dict[str, Any] | None, expected_rows: Iterable[dict[str, Any]]
+) -> float:
+    if not parsed or not parsed.get("records"):
+        return 0.0
+    expected = normalize_task_3_rows(expected_rows, require_headcount=False)
+    exp_by_dept = {row["department"]: row for row in expected}
+    matched_departments: set[str] = set()
+    for record in parsed["records"]:
+        department = record.get("department")
+        if department not in exp_by_dept:
+            continue
+        expected_row = exp_by_dept[department]
+        if (
+            record.get("revenue") == expected_row["revenue"]
+            and record.get("expenses") == expected_row["expenses"]
+        ):
+            # Count each department once so duplicate records cannot push the
+            # fraction above 1.0.
+            matched_departments.add(department)
+    return len(matched_departments) / len(expected) if expected else 0.0
+def task_3_semantic_match_fraction_text(
+    text: str, expected_rows: Iterable[dict[str, Any]], target_date: str
+) -> float:
+    return task_3_semantic_match_fraction_parsed(
+        parse_task_3_report(text, target_date), expected_rows
+    )
+def _build_task_1_scenario(seed: int) -> Task1Scenario:
+    rng = random.Random(f"task-1:{seed}")
+    valid_count = 3 + rng.randrange(3)
+    corrupted_count = 2 + rng.randrange(2)
+    combined_rows: list[dict[str, Any]] = []
+    valid_templates = []
+    for index in range(valid_count):
+        valid_templates.append(
+            {
+                "kind": "valid",
+                "user_id": 1000 + seed * 10 + index,
+                "amount": round(rng.uniform(75.0, 975.0), 2),
+                "status": rng.choice(_TASK_1_VALID_STATUSES),
+            }
+        )
+    if valid_templates:
+        valid_templates[0]["amount"] = 0.0
+        valid_templates[0]["status"] = "settled"
+    if len(valid_templates) > 1:
+        valid_templates[1]["amount"] = -round(float(valid_templates[1]["amount"]) / 10.0, 2)
+        valid_templates[1]["status"] = "approved"
+    corrupted_templates = []
+    for index in range(corrupted_count):
+        corrupted_templates.append(
+            {
+                "kind": "corrupted",
+                "user_id": 2000 + seed * 10 + index,
+                "amount": None,
+                "status": rng.choice(_TASK_1_CORRUPTED_STATUSES),
+            }
+        )
+    templates = valid_templates + corrupted_templates
+    rng.shuffle(templates)
+    expected_rows: list[dict[str, Any]] = []
+    corrupted_row_ids: list[int] = []
+    for row_id, template in enumerate(templates, start=1):
+        row = {
+            "id": row_id,
+            "user_id": int(template["user_id"]),
+            "amount": template["amount"],
+            "status": str(template["status"]),
+        }
+        combined_rows.append(row)
+        if template["kind"] == "valid":
+            expected_rows.append(row)
+        else:
+            corrupted_row_ids.append(row_id)
+    description = (
+        "Find and delete all corrupted records (rows with NULL amounts) from the "
+        f"'transactions' table. This seeded episode contains {corrupted_count} corrupted "
+        f"rows mixed with {valid_count} valid rows. Only NULL amounts are corrupted; "
+        "legitimate zero-value reconciliations and negative refund adjustments may also "
+        "appear and must be preserved exactly."
+    )
+    return Task1Scenario(
+        description=description,
+        all_rows=tuple(combined_rows),
+        expected_rows=tuple(expected_rows),
+        corrupted_row_ids=tuple(sorted(corrupted_row_ids)),
+    )
+def _build_task_2_scenario(seed: int) -> Task2Scenario:
+    rng = random.Random(f"task-2:{seed}")
+    visible_batch = _sample_task_2_batch(rng, batch_index=0)
+    hidden_cases = tuple(
+        _sample_task_2_batch(rng, batch_index=index + 1)
+        for index in range(6)
+    )
+    visible_expected = tuple(build_task_2_expected(visible_batch))
+    hidden_expected = tuple(
+        tuple(build_task_2_expected(batch)) for batch in hidden_cases
+    )
+    description = (
+        "The script 'broken_pipeline.py' prepares downstream billing candidates from "
+        "seeded order records. Repair it so it keeps only ready records with positive "
+        "amounts, converts cents to USD, flags high priority when priority >= 8 or "
+        "amount_usd >= 500.00, and returns rows sorted by amount_usd descending then "
+        "order_id ascending. The grader checks the visible demo batch and additional "
+        "unseen seeded batches."
+    )
+    return Task2Scenario(
+        description=description,
+        visible_batch=visible_batch,
+        visible_expected=visible_expected,
+        hidden_cases=hidden_cases,
+        hidden_expected=hidden_expected,
+        broken_script=_render_broken_pipeline_script(visible_batch),
+    )
+def _build_task_3_scenario(seed: int) -> Task3Scenario:
+    rng = random.Random(f"task-3:{seed}")
+    base_date = date(2025, 3, 25) + timedelta(days=rng.randrange(0, 7))
+    target_date = base_date.isoformat()
+    recipient = rng.choice(_TASK_3_RECIPIENTS)
+    subject = f"Daily Revenue Report - {target_date}"
+    report_title = f"Daily Revenue Report ({target_date})"
+    selected_departments = sorted(rng.sample(_TASK_3_DEPARTMENTS, k=4))
+    expected_rows: list[dict[str, Any]] = []
+    warehouse_rows: list[dict[str, Any]] = []
+    row_id = 1
+    for offset in (-2, -1, 0, 1):
+        report_date = (base_date + timedelta(days=offset)).isoformat()
+        for department in selected_departments:
+            if offset == 0:
+                revenue = round(rng.uniform(12_000.0, 95_000.0), 2)
+                expenses = round(rng.uniform(8_000.0, revenue + 18_000.0), 2)
+                headcount = rng.randint(8, 48)
+                seeded_row = {
+                    "department": department,
+                    "revenue": revenue,
+                    "expenses": expenses,
+                    "headcount": headcount,
+                }
+                expected_rows.append(seeded_row)
+            else:
+                revenue = round(rng.uniform(9_000.0, 90_000.0), 2)
+                expenses = round(rng.uniform(7_000.0, revenue + 14_000.0), 2)
+                headcount = rng.randint(8, 48)
+            warehouse_rows.append(
+                {
+                    "id": row_id,
+                    "report_date": report_date,
+                    "department": department,
+                    "revenue": revenue,
+                    "expenses": expenses,
+                    "headcount": headcount,
+                }
+            )
+            row_id += 1
+    description = (
+        f"Extract the daily report for date '{target_date}' from the 'daily_reports' table, "
+        "repair the broken 'format_report.py' script, save the exact extracted rows to "
+        f"'report_data.json', run the script with that file, and send the generated report "
+        f"to '{recipient}' with subject '{subject}'. The grader expects the exact seeded slice, "
+        "including headcount."
+    )
+    return Task3Scenario(
+        description=description,
+        target_date=target_date,
+        recipient=recipient,
+        subject=subject,
+        report_title=report_title,
+        all_rows=tuple(warehouse_rows),
+        expected_rows=tuple(
+            normalize_task_3_rows(expected_rows, require_headcount=True)
+        ),
+        broken_script=_render_broken_format_report_script(target_date),
+    )
+def _sample_task_2_batch(
+    rng: random.Random, *, batch_index: int
+) -> tuple[dict[str, Any], ...]:
+    def make_record(
+        suffix: str,
+        *,
+        status: str,
+        amount_cents: int,
+        priority: int,
+    ) -> dict[str, Any]:
+        return {
+            "order_id": f"ORD-{batch_index:02d}-{suffix}",
+            "status": status,
+            "amount_cents": amount_cents,
+            "priority": priority,
+            "region": rng.choice(_TASK_2_REGIONS),
+        }
+    records = [
+        make_record(
+            "normal",
+            status=_TASK_2_READY_STATUS,
+            amount_cents=rng.randrange(12_125, 28_975, 25),
+            priority=rng.randint(2, 6),
+        ),
+        make_record(
+            "priority",
+            status=_TASK_2_READY_STATUS,
+            amount_cents=rng.randrange(13_175, 32_775, 25),
+            priority=rng.randint(8, 10),
+        ),
+        make_record(
+            "amount",
+            status=_TASK_2_READY_STATUS,
+            amount_cents=rng.randrange(50_025, 88_975, 25),
+            priority=rng.randint(2, 6),
+        ),
+        make_record(
+            "queued",
+            status=rng.choice(_TASK_2_NON_READY_STATUSES[:2]),
+            amount_cents=rng.randrange(18_125, 42_975, 25),
+            priority=rng.randint(4, 9),
+        ),
+        make_record(
+            "drop",
+            status=_TASK_2_READY_STATUS,
+            amount_cents=-rng.randrange(125, 2_975, 25),
+            priority=rng.randint(8, 10),
+        ),
+    ]
+    if batch_index % 2 == 0:
+        records.append(
+            make_record(
+                "hold",
+                status=rng.choice(_TASK_2_NON_READY_STATUSES),
+                amount_cents=rng.randrange(24_125, 48_975, 25),
+                priority=rng.randint(1, 7),
+            )
+        )
+    rng.shuffle(records)
+    return tuple(records)
+def _render_broken_pipeline_script(
+    visible_batch: tuple[dict[str, Any], ...]
+) -> str:
+    return textwrap.dedent(
+        f'''\
+        import json
+        def process_data_stream(payloads):
+            """
+            Normalize downstream billing candidates.
+            Keep only records whose status is "ready" and whose amount_cents is positive.
+            Convert amount_cents to amount_usd rounded to 2 decimals.
+            Mark priority_band as "high" when priority >= 8 or amount_usd >= 500.00.
+            Return rows sorted by amount_usd descending, then order_id ascending.
+            """
+            processed_records = []
+            for payload in payloads:
+                if payload["status"] == "failed" or payload["amount_cents"] <= 0:
+                    continue
+                amount_usd = round(payload["amount_cents"] // 100, 2)
+                priority_band = (
+                    "high"
+                    if payload["priority"] >= 8 and amount_usd >= 500.0
+                    else "normal"
+                )
+                processed_records.append(
+                    {{
+                        "order_id": payload["order_id"],
+                        "region": payload["region"],
+                        "amount_usd": amount_usd,
+                        "priority_band": priority_band,
+                    }}
+                )
+            processed_records.sort(key=lambda item: (item["amount_usd"], item["order_id"]))
+            return processed_records
+        if __name__ == "__main__":
+            mock_batch = {list(visible_batch)!r}
+            print(json.dumps(process_data_stream(mock_batch), indent=2, sort_keys=True))
+    '''
+    ).lstrip()
+def _render_broken_format_report_script(target_date: str) -> str:
+    """Render the intentionally buggy Task 3 formatter; the two seeded bugs are marked inline."""
+    title = f"=== Daily Revenue Report ({target_date}) ==="
+    return textwrap.dedent(
+        f'''\
+        import json
+        import sys
+        def format_report(input_path):
+            """Reads extracted data from JSON and produces a formatted stakeholder report."""
+            with open(input_path, encoding="utf-8") as f:
+                records = json.load(f)
+            lines = ["{title}", ""]
+            total_revenue = 0
+            for rec in records:
+                dept = rec["department"]
+                rev = int(rec["revenue"])  # BUG 1: int() truncates decimal precision
+                exp = rec["expenses"]
+                net = rev - exp
+                lines.append(f"Department: {{dept}}")
+                lines.append(f"  Revenue:  ${{rev}}")
+                lines.append(f"  Expenses: ${{exp:.2f}}")
+                lines.append(f"  Net:      ${{net:.2f}}")
+                lines.append("")
+                total_revenue += rev
+            lines.append(f"Total Revenue: ${{total_revenue}}")
+            lines.append("=== End of Report ===")
+            output = "\\n".join(lines)
+            print(output)
+            return output
+        if __name__ == "__main__":
+            if len(sys.argv) < 2:
+                print("Usage: python format_report.py <input.json>", file=sys.stderr)
+                sys.exit(1)
+            format_report(sys.argv[0])  # BUG 2: should be sys.argv[1]
+    '''
+    ).lstrip()
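
The two seeded defects in `format_report.py` (the `int()` truncation and the `sys.argv[0]` lookup) are easier to see in isolation. The following is a minimal sketch, not the environment's grader: `render_report` is a hypothetical helper that mirrors the report layout of the script above, and the sample record and date are made up for illustration.

```python
import sys


def render_report(records: list[dict], target_date: str) -> str:
    """Hypothetical helper: format records in the layout used by format_report.py."""
    lines = [f"=== Daily Revenue Report ({target_date}) ===", ""]
    total_revenue = 0.0
    for rec in records:
        rev = float(rec["revenue"])  # float() keeps the cents; int() would drop them
        exp = float(rec["expenses"])
        lines.append(f"Department: {rec['department']}")
        lines.append(f"  Revenue:  ${rev:.2f}")
        lines.append(f"  Expenses: ${exp:.2f}")
        lines.append(f"  Net:      ${rev - exp:.2f}")
        lines.append("")
        total_revenue += rev
    lines.append(f"Total Revenue: ${total_revenue:.2f}")
    lines.append("=== End of Report ===")
    return "\n".join(lines)


# Made-up sample data, not drawn from any seeded scenario.
sample = [{"department": "Sales", "revenue": 1200.5, "expenses": 300.25}]
report = render_report(sample, "2024-01-15")

# Seeded BUG 1: int() on the same value loses the fractional part.
truncated = int(sample[0]["revenue"])

# Seeded BUG 2: sys.argv[0] is the script path; the input file is sys.argv[1].
script_path = sys.argv[0]

print(report)
```

Formatting revenue with `:.2f` after `float()` keeps the revenue lines consistent with the expenses and net lines, which the broken script already formats to two decimals.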

tests/test_grading.py ADDED

	@@ -0,0 +1,577 @@

+"""High-signal regression tests for seeded grading and public API shape."""
+from __future__ import annotations
+import json
+import shutil
+import sqlite3
+import sys
+from pathlib import Path
+from fastapi.testclient import TestClient
+_ROOT = Path(__file__).resolve().parents[1]
+if str(_ROOT) not in sys.path:
+    sys.path.insert(0, str(_ROOT))
+from models import DataOpsAction  # noqa: E402
+from server.app import app  # noqa: E402
+from server.dataops_env_environment import DataOpsEnvironment  # noqa: E402
+from server.grading import evaluate_task  # noqa: E402
+from server.task_specs import build_task_3_report  # noqa: E402
+def _fixed_pipeline_script(visible_batch: list[dict[str, object]]) -> str:
+    """Return a repaired broken_pipeline.py that follows the process_data_stream docstring rules."""
+    return f'''\
+import json
+def process_data_stream(payloads):
+    processed_records = []
+    for payload in payloads:
+        if payload["status"] != "ready" or int(payload["amount_cents"]) <= 0:
+            continue
+        amount_usd = round(int(payload["amount_cents"]) / 100.0, 2)
+        priority_band = (
+            "high"
+            if int(payload["priority"]) >= 8 or amount_usd >= 500.0
+            else "normal"
+        )
+        processed_records.append(
+            {{
+                "order_id": payload["order_id"],
+                "region": payload["region"],
+                "amount_usd": amount_usd,
+                "priority_band": priority_band,
+            }}
+        )
+    processed_records.sort(key=lambda item: (-item["amount_usd"], item["order_id"]))
+    return processed_records
+if __name__ == "__main__":
+    mock_batch = {visible_batch!r}
+    print(json.dumps(process_data_stream(mock_batch), indent=2, sort_keys=True))
+'''
+def _visible_only_pipeline_stub(
+    visible_batch: list[dict[str, object]],
+    visible_expected: list[dict[str, object]],
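+) -> str:
+    """Return a stub that hard-codes the visible batch's answer and returns [] for any other input."""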
+    return f'''\
+import json
+def process_data_stream(payloads):
+    visible = {visible_batch!r}
+    if payloads == visible:
+        return {visible_expected!r}
+    return []
+if __name__ == "__main__":
+    print(json.dumps({visible_expected!r}, indent=2, sort_keys=True))
+'''
+def _fixed_format_script(target_date: str) -> str:
+    """Return a repaired format_report.py for the given report date."""
+    return f'''\
+import json
+import sys
+def format_report(input_path):
+    with open(input_path, encoding="utf-8") as f:
+        records = json.load(f)
+    lines = ["=== Daily Revenue Report ({target_date}) ===", ""]
+    total_revenue = 0.0
+    for rec in records:
+        dept = rec["department"]
+        rev = float(rec["revenue"])
+        exp = float(rec["expenses"])
+        net = rev - exp
+        lines.append(f"Department: {{dept}}")
+        lines.append(f"  Revenue:  ${{rev:.2f}}")
+        lines.append(f"  Expenses: ${{exp:.2f}}")
+        lines.append(f"  Net:      ${{net:.2f}}")
+        lines.append("")
+        total_revenue += rev
+    lines.append(f"Total Revenue: ${{total_revenue:.2f}}")
+    lines.append("=== End of Report ===")
+    out = "\\n".join(lines)
+    print(out)
+    return out
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("Usage: python format_report.py <input.json>", file=sys.stderr)
+        sys.exit(1)
+    format_report(sys.argv[1])
+'''
+def test_seeded_task_3_scenario_is_deterministic() -> None:
+    env_a = DataOpsEnvironment()
+    env_b = DataOpsEnvironment()
+    try:
+        env_a.reset(task_id="task_3_hard_e2e", seed=17)
+        env_b.reset(task_id="task_3_hard_e2e", seed=17)
+        assert env_a.scenario.task_3 == env_b.scenario.task_3
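+    finally:
+        env_a.close()
+        env_b.close()
+        shutil.rmtree(env_a.workspace_dir, ignore_errors=True)
+        shutil.rmtree(env_b.workspace_dir, ignore_errors=True)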
+def test_task_1_perfect_score_seeded() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_1_easy_anomaly", seed=7)
+    try:
+        obs = env.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={"query": "DELETE FROM transactions WHERE amount IS NULL"},
+            )
+        )
+        assert obs.status == "success"
+        out = evaluate_task("task_1_easy_anomaly", env)
+        assert out["score"] == 1.0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_1_seeded_valid_rows_include_non_null_edge_amounts() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_1_easy_anomaly", seed=7)
+    try:
+        scenario = env.scenario.task_1
+        assert scenario is not None
+        amounts = [float(row["amount"]) for row in scenario.expected_rows]
+        assert any(amount == 0.0 for amount in amounts)
+        assert any(amount < 0.0 for amount in amounts)
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_1_rewriting_corrupted_rows_scores_zero() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_1_easy_anomaly", seed=7)
+    try:
+        with sqlite3.connect(env.db_path) as conn:
+            conn.execute("UPDATE transactions SET amount = 0 WHERE amount IS NULL")
+            conn.commit()
+        out = evaluate_task("task_1_easy_anomaly", env)
+        assert out["score"] == 0.0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_1_deleting_non_null_adjustments_is_penalized() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_1_easy_anomaly", seed=7)
+    try:
+        obs = env.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={
+                    "query": "DELETE FROM transactions WHERE amount IS NULL OR amount <= 0"
+                },
+            )
+        )
+        assert obs.status == "success"
+        assert obs.reward is not None and obs.reward < 0
+        out = evaluate_task("task_1_easy_anomaly", env)
+        assert out["score"] == 0.0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_reset_only_scores_zero_across_tasks() -> None:
+    for task_id in (
+        "task_1_easy_anomaly",
+        "task_2_medium_syntax",
+        "task_3_hard_e2e",
+    ):
+        env = DataOpsEnvironment()
+        try:
+            env.reset(task_id=task_id, seed=7)
+            out = evaluate_task(task_id, env)
+            assert out["score"] == 0.0
+        finally:
+            env.close()
+            shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_1_broad_delete_with_where_is_penalized() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_1_easy_anomaly", seed=7)
+    try:
+        obs = env.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={
+                    "query": "DELETE FROM transactions WHERE amount IS NULL OR 1 = 1"
+                },
+            )
+        )
+        assert obs.status == "success"
+        assert obs.reward is not None and obs.reward < 0
+        assert env.evidence["task_1"]["destructive_sql_attempted"] is True
+        out = evaluate_task("task_1_easy_anomaly", env)
+        assert out["score"] == 0.0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_2_script_run_does_not_inherit_server_secrets(monkeypatch) -> None:
+    monkeypatch.setenv("API_KEY", "super-secret-value")
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_2_medium_syntax", seed=11)
+    script = """\
+import json
+import os
+def process_data_stream(payloads):
+    return []
+if __name__ == "__main__":
+    print(json.dumps({"api_key": os.getenv("API_KEY"), "home": os.getenv("HOME")}))
+"""
+    try:
+        env.step(
+            DataOpsAction(
+                action_type="WriteFile",
+                payload={"filepath": "broken_pipeline.py", "content": script},
+            )
+        )
+        run_obs = env.step(
+            DataOpsAction(
+                action_type="RunScript",
+                payload={"filepath": "broken_pipeline.py", "args": []},
+            )
+        )
+        assert run_obs.status == "success"
+        payload = json.loads((run_obs.stdout or "").strip())
+        assert payload["api_key"] is None
+        assert payload["home"] == env.workspace_dir
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_2_perfect_score_seeded() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_2_medium_syntax", seed=11)
+    scenario = env.scenario.task_2
+    assert scenario is not None
+    try:
+        read_obs = env.step(
+            DataOpsAction(
+                action_type="ReadFile",
+                payload={"filepath": "broken_pipeline.py"},
+            )
+        )
+        assert read_obs.status == "success"
+        write_obs = env.step(
+            DataOpsAction(
+                action_type="WriteFile",
+                payload={
+                    "filepath": "broken_pipeline.py",
+                    "content": _fixed_pipeline_script(list(scenario.visible_batch)),
+                },
+            )
+        )
+        assert write_obs.status == "success"
+        pre_run = evaluate_task("task_2_medium_syntax", env)
+        assert 0.0 < pre_run["score"] < 1.0
+        run_obs = env.step(
+            DataOpsAction(
+                action_type="RunScript",
+                payload={"filepath": "broken_pipeline.py", "args": []},
+            )
+        )
+        assert run_obs.status == "success"
+        out = evaluate_task("task_2_medium_syntax", env)
+        assert out["score"] == 1.0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_2_print_only_stub_does_not_get_full_credit() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_2_medium_syntax", seed=11)
+    scenario = env.scenario.task_2
+    assert scenario is not None
+    stub = _visible_only_pipeline_stub(
+        list(scenario.visible_batch),
+        list(scenario.visible_expected),
+    )
+    try:
+        env.step(
+            DataOpsAction(
+                action_type="WriteFile",
+                payload={"filepath": "broken_pipeline.py", "content": stub},
+            )
+        )
+        out = evaluate_task("task_2_medium_syntax", env)
+        assert out["score"] < 0.5
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_3_sql_policy_rejects_literal_table_name_bypass() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_3_hard_e2e", seed=19)
+    try:
+        obs = env.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={
+                    "query": (
+                        "SELECT name FROM sqlite_master "
+                        "WHERE 'daily_reports' = 'daily_reports'"
+                    )
+                },
+            )
+        )
+        assert obs.status == "error"
+        assert "disallowed" in obs.message.lower()
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_3_sql_policy_allows_cte_queries_over_daily_reports() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_3_hard_e2e", seed=19)
+    scenario = env.scenario.task_3
+    assert scenario is not None
+    try:
+        obs = env.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={
+                    "query": (
+                        "WITH scoped AS ("
+                        "SELECT department, revenue, expenses, headcount "
+                        "FROM daily_reports "
+                        f"WHERE report_date = '{scenario.target_date}'"
+                        ") "
+                        "SELECT department, revenue, expenses, headcount "
+                        "FROM scoped ORDER BY department"
+                    )
+                },
+            )
+        )
+        assert obs.status == "success"
+        assert obs.sql_results
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_3_perfect_score_requires_proven_workflow() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_3_hard_e2e", seed=19)
+    scenario = env.scenario.task_3
+    assert scenario is not None
+    try:
+        query = (
+            "SELECT department, revenue, expenses, headcount "
+            "FROM daily_reports "
+            f"WHERE report_date = '{scenario.target_date}' "
+            "ORDER BY department"
+        )
+        sql_obs = env.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={"query": query},
+            )
+        )
+        assert sql_obs.status == "success"
+        rows = sql_obs.sql_results
+        assert rows is not None
+        write_json = env.step(
+            DataOpsAction(
+                action_type="WriteFile",
+                payload={"filepath": "report_data.json", "content": json.dumps(rows)},
+            )
+        )
+        assert write_json.status == "success"
+        write_script = env.step(
+            DataOpsAction(
+                action_type="WriteFile",
+                payload={
+                    "filepath": "format_report.py",
+                    "content": _fixed_format_script(scenario.target_date),
+                },
+            )
+        )
+        assert write_script.status == "success"
+        run_obs = env.step(
+            DataOpsAction(
+                action_type="RunScript",
+                payload={"filepath": "format_report.py", "args": ["report_data.json"]},
+            )
+        )
+        assert run_obs.status == "success"
+        body = (run_obs.stdout or "").strip()
+        email_obs = env.step(
+            DataOpsAction(
+                action_type="SendEmail",
+                payload={
+                    "to_email": scenario.recipient,
+                    "subject": scenario.subject,
+                    "body": body,
+                },
+            )
+        )
+        assert email_obs.status == "success"
+        out = evaluate_task("task_3_hard_e2e", env)
+        assert out["score"] == 1.0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_3_equivalent_relative_input_path_still_scores_perfect() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_3_hard_e2e", seed=29)
+    scenario = env.scenario.task_3
+    assert scenario is not None
+    try:
+        query = (
+            "SELECT department, revenue, expenses, headcount "
+            "FROM daily_reports "
+            f"WHERE report_date = '{scenario.target_date}' "
+            "ORDER BY department"
+        )
+        sql_obs = env.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={"query": query},
+            )
+        )
+        assert sql_obs.status == "success"
+        rows = sql_obs.sql_results
+        assert rows is not None
+        env.step(
+            DataOpsAction(
+                action_type="WriteFile",
+                payload={"filepath": "report_data.json", "content": json.dumps(rows)},
+            )
+        )
+        env.step(
+            DataOpsAction(
+                action_type="WriteFile",
+                payload={
+                    "filepath": "format_report.py",
+                    "content": _fixed_format_script(scenario.target_date),
+                },
+            )
+        )
+        run_obs = env.step(
+            DataOpsAction(
+                action_type="RunScript",
+                payload={"filepath": "format_report.py", "args": ["./report_data.json"]},
+            )
+        )
+        assert run_obs.status == "success"
+        env.step(
+            DataOpsAction(
+                action_type="SendEmail",
+                payload={
+                    "to_email": scenario.recipient,
+                    "subject": scenario.subject,
+                    "body": (run_obs.stdout or "").strip(),
+                },
+            )
+        )
+        out = evaluate_task("task_3_hard_e2e", env)
+        assert out["score"] == 1.0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_3_fabricated_email_only_scores_low() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_3_hard_e2e", seed=23)
+    scenario = env.scenario.task_3
+    assert scenario is not None
+    try:
+        fake_body = build_task_3_report(list(scenario.expected_rows), scenario.target_date)
+        email_obs = env.step(
+            DataOpsAction(
+                action_type="SendEmail",
+                payload={
+                    "to_email": scenario.recipient,
+                    "subject": scenario.subject,
+                    "body": fake_body,
+                },
+            )
+        )
+        assert email_obs.status == "success"
+        out = evaluate_task("task_3_hard_e2e", env)
+        assert out["score"] <= 0.10
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_task_3_reading_formatter_source_awards_progress_signal() -> None:
+    env = DataOpsEnvironment()
+    env.reset(task_id="task_3_hard_e2e", seed=31)
+    try:
+        obs = env.step(
+            DataOpsAction(
+                action_type="ReadFile",
+                payload={"filepath": "format_report.py"},
+            )
+        )
+        assert obs.status == "success"
+        assert obs.reward is not None and obs.reward > 0
+    finally:
+        env.close()
+        shutil.rmtree(env.workspace_dir, ignore_errors=True)
+def test_tasks_endpoint_exposes_manifest_metadata() -> None:
+    with TestClient(app) as client:
+        response = client.get("/tasks")
+        payload = response.json()
+        assert response.status_code == 200
+        assert len(payload["tasks"]) == 3
+        assert payload["tasks"][0]["difficulty"] == "easy"
+        assert "action_schema" in payload
+def test_public_grader_hides_details_by_default(monkeypatch) -> None:
+    # Force the flag off so a developer's local .env cannot enable it; with
+    # PUBLIC_GRADER_DETAILS unset or false, /grader must not return details.
+    monkeypatch.setenv("PUBLIC_GRADER_DETAILS", "false")
+    with TestClient(app) as client:
+        reset = client.post("/reset?task_id=task_1_easy_anomaly", json={"seed": 5})
+        assert reset.status_code == 200
+        grade = client.get("/grader")
+        assert grade.status_code == 200
+        payload = grade.json()
+        assert "score" in payload
+        assert "details" not in payload
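
The `_fixed_pipeline_script` helper above differs from the seeded `broken_pipeline.py` in four places: the status filter, the cents-to-dollars division, the `priority_band` predicate, and the sort direction. The sketch below isolates each difference. It is illustrative only: the helper names `keeps` and `priority_band`, the `"pending"` status, and all sample values are assumptions, not taken from any seeded batch.

```python
# Hypothetical amount chosen to sit just under the 500.00 threshold.
amount_cents = 49_975

floor_usd = round(amount_cents // 100, 2)  # seeded: floor division drops the cents
true_usd = round(amount_cents / 100, 2)    # documented: true division keeps them


def keeps(status: str, cents: int, *, documented: bool) -> bool:
    """Filter rule: documented keeps only "ready"; seeded only skips "failed"."""
    if cents <= 0:
        return False
    return status == "ready" if documented else status != "failed"


def priority_band(priority: int, amount_usd: float, *, documented: bool) -> str:
    """Band rule: documented uses `or`; the seeded script uses `and`."""
    if documented:
        is_high = priority >= 8 or amount_usd >= 500.0
    else:
        is_high = priority >= 8 and amount_usd >= 500.0
    return "high" if is_high else "normal"


rows = [
    {"order_id": "B", "amount_usd": 10.0},
    {"order_id": "A", "amount_usd": 10.0},
    {"order_id": "C", "amount_usd": 25.5},
]
# Documented order: amount_usd descending, then order_id ascending.
documented_order = sorted(rows, key=lambda r: (-r["amount_usd"], r["order_id"]))
# Seeded order: amount_usd ascending, then order_id ascending.
seeded_order = sorted(rows, key=lambda r: (r["amount_usd"], r["order_id"]))

print(floor_usd, true_usd)
print([r["order_id"] for r in documented_order])
```

Negating the numeric key is enough to get descending amounts while keeping `order_id` ascending as the tie-breaker, which is what the fixed script's sort key does.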

tests/test_inference_api.py ADDED

	@@ -0,0 +1,408 @@

+"""Integration-style checks for inference wiring and transport/session flows."""
+from __future__ import annotations
+import json
+import os
+import sys
+from pathlib import Path
+from subprocess import CompletedProcess
+from types import SimpleNamespace
+from typing import Any
+from fastapi.testclient import TestClient
+_ROOT = Path(__file__).resolve().parents[1]
+if str(_ROOT) not in sys.path:
+    sys.path.insert(0, str(_ROOT))
+import inference  # noqa: E402
+import env_loader  # noqa: E402
+import server.app as app_module  # noqa: E402
+from client import DataOpsEnvClient  # noqa: E402
+from models import DataOpsAction  # noqa: E402
+class _FakeResponse:
+    """Minimal stand-in for an HTTP response that carries a fixed JSON payload."""
+    def __init__(self, payload: dict[str, Any]) -> None:
+        self._payload = payload
+    def raise_for_status(self) -> None:
+        return None
+    def json(self) -> dict[str, Any]:
+        return self._payload
+class _FakeHTTPSession:
+    """Records requested URLs and replays canned /reset, /step, and grader responses."""
+    def __init__(self) -> None:
+        self.urls: list[str] = []
+        self._step_count = 0
+    def request(
+        self,
+        method: str,
+        url: str,
+        timeout: float | None = None,
+        **kwargs: Any,
+    ) -> _FakeResponse:
+        del method, timeout, kwargs
+        self.urls.append(url)
+        if url.endswith("/reset"):
+            self._step_count = 0
+            return _FakeResponse(
+                {
+                    "observation": {
+                        "status": "success",
+                        "message": "Repair the ETL job.",
+                    },
+                    "reward": 0.0,
+                    "done": False,
+                }
+            )
+        if url.endswith("/step"):
+            self._step_count += 1
+            return _FakeResponse(
+                {
+                    "observation": {
+                        "status": "success",
+                        "message": "Read ok.",
+                    },
+                    "reward": 0.0,
+                    "done": self._step_count >= 2,
+                }
+            )
+        if url.endswith("/grader/task_2_medium_syntax"):
+            return _FakeResponse({"score": 0.25})
+        raise AssertionError(f"Unexpected URL requested: {url}")
+class _FakeChatCompletions:
+    """Yields the queued assistant messages, one per create() call."""
+    def __init__(self, messages: list[Any]) -> None:
+        self._messages = iter(messages)
+    def create(self, **kwargs: Any) -> Any:
+        del kwargs
+        message = next(self._messages)
+        return SimpleNamespace(choices=[SimpleNamespace(message=message)])
+class _FakeClient:
+    def __init__(self, messages: list[Any]) -> None:
+        self.chat = SimpleNamespace(completions=_FakeChatCompletions(messages))
+        self.base_url = "https://model.local/v1"
+def _tool_message(name: str, arguments: dict[str, Any]) -> Any:
+    return SimpleNamespace(
+        tool_calls=[
+            SimpleNamespace(
+                id="call-1",
+                function=SimpleNamespace(
+                    name=name,
+                    arguments=json.dumps(arguments),
+                ),
+            )
+        ]
+    )
+def test_inference_run_task_uses_env_base_url(monkeypatch, capsys) -> None:
+    fake_http = _FakeHTTPSession()
+    fake_client = _FakeClient(
+        [
+            _tool_message("read_file", {"filepath": "broken_pipeline.py"}),
+            SimpleNamespace(tool_calls=[]),
+            _tool_message("invoke_python", {"filepath": "broken_pipeline.py", "args": []}),
+        ]
+    )
+    monkeypatch.setattr(inference, "ENV_BASE_URL", "http://env.local")
+    monkeypatch.setattr(inference, "API_BASE_URL", "https://model.local/v1")
+    monkeypatch.setattr(inference, "MODEL_NAME", "mock-model")
+    score = inference.run_task(
+        fake_client,
+        fake_http,
+        "task_2_medium_syntax",
+        max_turns=4,
+        seed=3,
+    )
+    assert score == 0.25
+    assert fake_http.urls
+    assert all(url.startswith("http://env.local") for url in fake_http.urls)
+    assert all("model.local" not in url for url in fake_http.urls)
+    stdout = capsys.readouterr().out
+    assert "[START]" in stdout
+    assert "[STEP]" in stdout
+    assert "[END]" in stdout
+    assert "success=false" in stdout
+def test_inference_emits_grader_details_to_stderr_when_enabled(monkeypatch, capsys) -> None:
+    class _DetailedHTTPSession(_FakeHTTPSession):
+        def request(
+            self,
+            method: str,
+            url: str,
+            timeout: float | None = None,
+            **kwargs: Any,
+        ) -> _FakeResponse:
+            if url.endswith("/grader/task_2_medium_syntax"):
+                return _FakeResponse(
+                    {
+                        "task_id": "task_2_medium_syntax",
+                        "score": 0.25,
+                        "details": {"reason": "Visible repair only"},
+                    }
+                )
+            return super().request(method, url, timeout=timeout, **kwargs)
+    fake_http = _DetailedHTTPSession()
+    fake_client = _FakeClient(
+        [_tool_message("read_file", {"filepath": "broken_pipeline.py"})]
+    )
+    monkeypatch.setenv("PUBLIC_GRADER_DETAILS", "true")
+    monkeypatch.setattr(inference, "ENV_BASE_URL", "http://env.local")
+    monkeypatch.setattr(inference, "MODEL_NAME", "mock-model")
+    inference.run_task(
+        fake_client,
+        fake_http,
+        "task_2_medium_syntax",
+        max_turns=1,
+        seed=3,
+    )
+    stderr = capsys.readouterr().err.strip()
+    assert stderr
+    assert json.loads(stderr)["details"]["reason"] == "Visible repair only"
+def test_baseline_endpoint_passes_env_base_url(monkeypatch) -> None:
+    captured: dict[str, Any] = {}
+    def fake_run(
+        command: list[str],
+        *,
+        cwd: str,
+        capture_output: bool,
+        text: bool,
+        timeout: float,
+        env: dict[str, str],
+    ) -> CompletedProcess[str]:
+        captured["command"] = command
+        captured["cwd"] = cwd
+        captured["capture_output"] = capture_output
+        captured["text"] = text
+        captured["timeout"] = timeout
+        captured["env"] = env
+        stdout = "\n".join(
+            [
+                "[START] task=task_1_easy_anomaly env=dataops_env model=fake-model",
+                "[END] success=true steps=1 score=1.000 rewards=1.00",
+                json.dumps(
+                    {
+                        "scores": {"task_1_easy_anomaly": 1.0},
+                        "grades": {
+                            "task_1_easy_anomaly": {
+                                "task_id": "task_1_easy_anomaly",
+                                "score": 1.0,
+                                "details": {"reason": "Perfect"},
+                            }
+                        },
+                        "average": 1.0,
+                        "model": "fake-model",
+                        "metadata": {"env_base_url": "http://127.0.0.1:7860"},
+                    }
+                ),
+            ]
+        )
+        stderr = json.dumps({"task_id": "task_1_easy_anomaly", "score": 1.0})
+        return CompletedProcess(command, 0, stdout=stdout, stderr=stderr)
+    monkeypatch.setenv("API_KEY", "test-key")
+    monkeypatch.delenv("ENV_BASE_URL", raising=False)
+    monkeypatch.setattr(app_module.subprocess, "run", fake_run)
+    with TestClient(app_module.app) as client:
+        response = client.post(
+            "/baseline",
+            json={
+                "task_ids": ["task_1_easy_anomaly"],
+                "seed": 7,
+                "max_turns": 5,
+            },
+        )
+    assert response.status_code == 200
+    assert "[START] task=task_1_easy_anomaly" in response.json()["stdout"]
+    assert response.json()["stderr"] == json.dumps({"task_id": "task_1_easy_anomaly", "score": 1.0})
+    assert response.json()["scores"]["task_1_easy_anomaly"] == 1.0
+    assert response.json()["grades"]["task_1_easy_anomaly"]["details"]["reason"] == "Perfect"
+    assert captured["env"]["ENV_BASE_URL"] == "http://127.0.0.1:7860"
+    assert "--seed" in captured["command"]
+    assert "--max-turns" in captured["command"]
+    assert "--task" in captured["command"]
+def test_inference_default_api_base_url_uses_google_for_api_key(
+    monkeypatch,
+) -> None:
+    monkeypatch.setenv("API_KEY", "test-key")
+    monkeypatch.delenv("HF_TOKEN", raising=False)
+    monkeypatch.delenv("API_BASE_URL", raising=False)
+    assert (
+        inference._resolve_api_base_url()
+        == inference.DEFAULT_GOOGLE_OPENAI_BASE_URL
+    )
+def test_inference_default_api_base_url_uses_hf_router_for_hf_token(
+    monkeypatch,
+) -> None:
+    monkeypatch.setenv("HF_TOKEN", "test-token")
+    monkeypatch.delenv("API_KEY", raising=False)
+    monkeypatch.delenv("API_BASE_URL", raising=False)
+    assert inference._resolve_api_base_url() == inference.DEFAULT_HF_OPENAI_BASE_URL
+def test_session_id_header_can_resume_http_episode() -> None:
+    with TestClient(app_module.app) as client:
+        reset = client.post("/reset?task_id=task_1_easy_anomaly", json={"seed": 5})
+        assert reset.status_code == 200
+        session_id = reset.headers["X-Session-ID"]
+        client.cookies.clear()
+        state = client.get("/state", headers={"X-Session-ID": session_id})
+        assert state.status_code == 200
+        payload = state.json()
+        assert payload["task_id"] == "task_1_easy_anomaly"
+        assert payload["seed"] == 5
+def test_reset_replaces_unknown_client_supplied_session_id() -> None:
+    with TestClient(app_module.app) as client:
+        reset = client.post(
+            "/reset?task_id=task_1_easy_anomaly",
+            headers={"X-Session-ID": "attacker-chosen-session"},
+            json={"seed": 4},
+        )
+        assert reset.status_code == 200
+        issued_session_id = reset.headers["X-Session-ID"]
+        assert issued_session_id != "attacker-chosen-session"
+        client.cookies.clear()
+        forged_state = client.get("/state", headers={"X-Session-ID": "attacker-chosen-session"})
+        assert forged_state.status_code == 400
+        restored_state = client.get("/state", headers={"X-Session-ID": issued_session_id})
+        assert restored_state.status_code == 200
+        assert restored_state.json()["seed"] == 4
+def test_websocket_reset_state_and_step_flow() -> None:
+    with TestClient(app_module.app) as client:
+        with client.websocket_connect("/ws") as websocket:
+            websocket.send_json(
+                {
+                    "type": "reset",
+                    "data": {"task_id": "task_1_easy_anomaly", "seed": 3},
+                }
+            )
+            reset_payload = websocket.receive_json()
+            assert reset_payload["data"]["observation"]["status"] == "success"
+            websocket.send_json({"type": "state"})
+            state_payload = websocket.receive_json()
+            assert state_payload["data"]["task_id"] == "task_1_easy_anomaly"
+            websocket.send_json(
+                {
+                    "type": "step",
+                    "data": {
+                        "action_type": "ExecuteSQL",
+                        "payload": {
+                            "query": (
+                                "SELECT id, amount FROM transactions "
+                                "WHERE amount IS NULL ORDER BY id"
+                            )
+                        },
+                    },
+                }
+            )
+            step_payload = websocket.receive_json()
+            assert step_payload["data"]["observation"]["status"] == "success"
+            assert step_payload["data"]["observation"]["sql_results"]
+            websocket.send_json({"type": "close", "data": {}})
+def test_http_client_overlays_top_level_reward_and_done() -> None:
+    class _FakeSession:
+        def post(self, url: str, **kwargs: Any) -> _FakeResponse:
+            del kwargs
+            if url.endswith("/reset"):
+                return _FakeResponse(
+                    {
+                        "observation": {"status": "success", "message": "ready"},
+                        "reward": 0.0,
+                        "done": False,
+                    }
+                )
+            if url.endswith("/step"):
+                return _FakeResponse(
+                    {
+                        "observation": {"status": "success", "message": "ok"},
+                        "reward": 0.25,
+                        "done": True,
+                    }
+                )
+            raise AssertionError(f"Unexpected URL requested: {url}")
+        def get(self, url: str, **kwargs: Any) -> _FakeResponse:
+            del kwargs
+            raise AssertionError(f"Unexpected URL requested: {url}")
+        def close(self) -> None:
+            return None
+    client = DataOpsEnvClient(base_url="http://env.local")
+    # Swap in the fake session so that no real HTTP requests are made.
+    client._session = _FakeSession()
+    try:
+        reset_obs = client.reset(task_id="task_1_easy_anomaly", seed=5)
+        assert reset_obs.reward == 0.0
+        assert reset_obs.done is False
+        step_obs = client.step(
+            DataOpsAction(
+                action_type="ExecuteSQL",
+                payload={"query": "SELECT 1"},
+            )
+        )
+        assert step_obs.reward == 0.25
+        assert step_obs.done is True
+    finally:
+        client.close()
+def test_env_loader_uses_root_env_to_find_secondary_env_file(
+    tmp_path: Path, monkeypatch
+) -> None:
+    monkeypatch.setattr(env_loader, "_PROJECT_ROOT", tmp_path)
+    monkeypatch.delenv("PUBLIC_GRADER_DETAILS", raising=False)
+    monkeypatch.delenv("MODEL_NAME", raising=False)
+    # The root .env only names the secondary file; the actual values live in .env.dev.
+    (tmp_path / ".env").write_text("ENV_FILE=.env.dev\n", encoding="utf-8")
+    (tmp_path / ".env.dev").write_text(
+        "PUBLIC_GRADER_DETAILS=true\nMODEL_NAME=debug-model\n",
+        encoding="utf-8",
+    )
+    env_loader.load_env()
+    assert os.getenv("PUBLIC_GRADER_DETAILS") == "true"
+    assert os.getenv("MODEL_NAME") == "debug-model"

uv.lock ADDED

The diff for this file is too large to render.