--- title: DataOpsEnv emoji: 🧩 colorFrom: blue colorTo: indigo sdk: docker app_port: 7860 short_description: OpenEnv DataOps — SQLite, ETL repair, three graded tasks. tags: - openenv base_path: /web --- # DataOpsEnv [Overview](#environment-description-and-motivation) · [Tasks](#tasks-descriptions-and-expected-difficulty) · [Get the repository](#get-the-repository) · [Setup and run](#setup-and-usage) · [Baseline scores](#baseline-scores) · [Hugging Face Spaces](#hugging-face-spaces) · [HTTP API](#api-reference) · [Tests](#tests) · [Appendix](#appendix) ## Environment description and motivation **DataOpsEnv** is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small **SQLite** warehouse, **repairing Python ETL scripts**, and completing an **end-to-end reporting incident** (extract data, fix a formatter, send a mock email). Episodes are **seeded** (`reset` may include `seed`) so scenarios are **reproducible**; each HTTP session receives an **isolated workspace and database**. Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect **real workflows**; this environment exercises multi-step tool use, constraint respect, and verifiable outcomes rather than single-shot question answering. **Implementation:** FastAPI (`server/app.py`), environment logic (`server/dataops_env_environment.py`), terminal graders (`server/grading.py`), scenario definitions (`server/task_specs.py`, `data/init_db.py`), Pydantic types (`models.py`), OpenEnv manifest (`openenv.yaml`). --- ## Action space Each step submits JSON: `{"action": {"action_type": "", "payload": { ... }}}`. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task). | `action_type` | Payload fields | Role | | ------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- | | `ExecuteSQL` | `query` (string, 1–2000 chars) | Run task-scoped SQL against the episode SQLite DB. | | `ReadFile` | `filepath` (string, 1–255 chars) | Read an allowed file from the episode workspace. | | `WriteFile` | `filepath`, `content` (content ≤ 1M chars) | Overwrite an allowed workspace file. | | `RunScript` | `filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. | | `SendEmail` | `to_email`, `subject`, `body` | Queue a mock email (used for the hard task). | Machine-readable schema: **`GET /schema`** → `action`, or **`GET /tasks`** → `action_schema`. --- ## Observation space Each `step` / `reset` response includes an observation object (REST also exposes wrapper fields such as `reward` / `done`). The fields below describe the **DataOps** layer; the OpenEnv base also defines `done`, `reward`, and `metadata`. | Field | Type | Meaning | | ----------------------- | ------------------------ | ----------------------------------------------------------------- | | `done` | boolean | Whether the episode has ended (step limit or terminal condition). | | `reward` | number \| null | Shaped **step reward** after this transition (trajectory signal). | | `metadata` | object | OpenEnv extension bucket (usually empty). | | `status` | `"success"` \| `"error"` | Whether the action executed successfully. | | `message` | string | Short human-readable summary. | | `stdout` | string \| null | Captured stdout (e.g. script or file read). | | `stderr` | string \| null | Captured stderr. | | `sql_results` | list of objects \| null | Row dicts for successful `SELECT`-style outcomes. | | `email_delivery_status` | string \| null | Mock send confirmation when applicable. | | `step_count` | integer | Steps taken in the episode. | | `max_steps` | integer | Episode step budget. | **Terminal evaluation:** The top-level **grader score** returned by **`GET /grader`** (or **`GET /grader/{task_id}`**) reflects the **final** database, files, and outbox (and, for the hard task, **provenance** constraints). Publicly reported top-level scores are rounded to 2 decimals and normalized to **(0.00, 1.00)**, so exact internal boundary scores are surfaced as **0.01** and **0.99**. When **`details`** are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the **grader** as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return **`reward=0.0`** when they neither improve grader state nor unlock a milestone. Machine-readable schema: **`GET /schema`** → `observation`. --- ## Tasks (descriptions and expected difficulty) | Task ID | Expected difficulty | Description | | ---------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `task_1_easy_anomaly` | **Easy** | The `transactions` table contains valid rows and rows with **NULL** `amount`. The agent must **delete only** the corrupted rows and leave all valid rows **unchanged**, including legitimate seeded zero-value or negative non-null adjustments. | | `task_2_medium_syntax` | **Medium** | `broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must **read**, **patch**, and **run** the script so **`process_data_stream`** produces the correct downstream-ready records on both visible and hidden seeded batches. | | `task_3_hard_e2e` | **Hard** | **End-to-end incident:** query the correct **`daily_reports`** slice for the **scenario date**, persist results as **`report_data.json`**, **repair** **`format_report.py`**, **run** it on that JSON, then **send exactly one** email whose **body matches** the formatter output, with scenario-specific **recipient** and **subject**. | Task list, difficulty labels, and allowed actions per task: **`GET /tasks`** and **`openenv.yaml`**. --- # Setup and usage ## Get the repository ```bash git clone https://github.com/vishesh-rathi/dataops-openenv.git cd dataops-openenv ``` ## Install `uv` (if needed) **macOS / Linux** ```bash curl -LsSf https://astral.sh/uv/install.sh | sh ``` Restart your shell, or load the new path in the current session: ```bash source "$HOME/.local/bin/env" ``` **Windows (PowerShell)** ```powershell powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" ``` Open a new PowerShell session after installation. **Verify** ```bash uv --version ``` --- **Prerequisites:** Python **3.12+**, **[uv](https://docs.astral.sh/uv/)**. ```bash uv sync cp .env.example .env.dev printf 'ENV_FILE=.env.dev\n' > .env ``` Repo-root **`.env`** selects the active secondary env file. Use **`.env.dev`** for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files. **Run the server** from the repository root so the active env file is discovered. Explicitly exported runtime vars such as **`HOST`**, **`PORT`**, and **`DEBUG`** still take precedence: ```bash uv run python -m server ``` Clients reuse the **`Set-Cookie`** session cookie or **`X-Session-ID`** header from **`POST /reset`** on **`/step`**, **`/state`**, and **`/grader`**. **OpenEnv packaging:** ```bash uv run openenv validate ``` **Docker:** ```bash printf 'ENV_FILE=.env.dev\n' > .env bash build_and_run_image.sh ``` The helper script reads repo-root **`.env`** only to resolve **`ENV_FILE`**, then passes that secondary file to `docker run --env-file ...`. The container does **not** receive a merged view of repo-root **`.env`** plus the secondary file. Keeping `.env*` out of the image is intentional; runtime configuration is injected from the host. **Baseline inference (local):** | Variable | Purpose | | ---------------------- | -------------------------------------------------------------------------------------- | | `ENV_BASE_URL` | Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default) | | `API_KEY` / `HF_TOKEN` | Exactly one model access credential source | | `API_BASE_URL` | Optional model provider base URL override | | `MODEL_NAME` | Optional Chat model ID | ```bash export ENV_BASE_URL=http://127.0.0.1:7860 uv run python inference.py --seed 7 --max-turns 12 ``` If **`API_BASE_URL`** is unset, `inference.py` defaults to Google's OpenAI-compatible Gemini endpoint for **`API_KEY`** and Hugging Face's router for **`HF_TOKEN`**. Flags: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores` (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When `PUBLIC_GRADER_DETAILS=true` and the grader API exposes details, `inference.py` also writes the per-task grader payloads to `stderr`. **`POST /baseline`** runs the same script inside the server process; optional JSON body: `task_ids`, `seed`, `max_turns`. If **`ADMIN_API_KEY`** is unset, the route is open. If **`ADMIN_API_KEY`** is set, callers must send **`X-Admin-Key`**. If **`ENV_BASE_URL`** is unset, the server injects **`http://127.0.0.1:$PORT`** into the child process automatically. Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process. **Minimal HTTP smoke test:** ```bash curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \ -H 'Content-Type: application/json' \ -d '{"seed": 7}' curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \ -H 'Content-Type: application/json' \ -d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}' curl -b cookies.txt 'http://127.0.0.1:7860/grader' ``` By default **`/grader`** returns **`task_id`** and the normalized top-level **`score`** only. Full grader **`details`** require **`PUBLIC_GRADER_DETAILS=true`** or a valid **`X-Admin-Key`** when **`ADMIN_API_KEY`** is set. The public top-level **`score`** uses the same 2-decimal **`0.01..0.99`** normalization as `inference.py`; nested **`details`** component scores remain raw. This does **not** change the mandatory `[START]` / `[STEP]` / `[END]` lines from `inference.py`; it affects the grader API, the optional trailing JSON emitted by `--json-scores`, and the captured `stderr` payloads written by `inference.py`. --- ## Baseline scores All figures below are **reported terminal scores**, rounded to 2 decimals and normalized to **(0.00, 1.00)**. Exact internal boundary scores are surfaced as **0.01** and **0.99**. Scores depend on provider, model revision, temperature, and `seed`. ### Null baseline (no agent actions) | Condition | `task_1` | `task_2` | `task_3` | Avg | | ---------------------------------------------------- | -------- | -------- | -------- | ---- | | `reset` only (`seed=7`), then grader; **no** `/step` | 0.01 | 0.01 | 0.01 | 0.01 | ### Reference tool-calling baseline `[END] success=true` in the harness logs means the reported terminal score reached **0.99** for that task. | Model | Seed | `task_1_easy_anomaly` | `task_2_medium_syntax` | `task_3_hard_e2e` | Average | | ------------------------------- | ---- | --------------------- | ---------------------- | ----------------- | ------- | | `gemini-3.1-flash-lite-preview` | 7 | 0.99 | 0.99 | 0.99 | 0.99 | **Reproducing a baseline run:** With the API server running locally on `7860` and model credentials configured, run: ```bash export MODEL_NAME=gemini-3.1-flash-lite-preview export ENV_BASE_URL=http://127.0.0.1:7860 uv run python inference.py --seed 7 --max-turns 12 --json-scores ``` The final line of stdout is a single JSON object with **`scores`**, **`grades`**, **`average`**, **`model`**, and **`metadata`**. --- ## Hugging Face Spaces There are two methods for running the baseline against a deployed Hugging Face Space: 1. Running **`inference.py`** externally against the public Space URL: ```bash export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space uv run python inference.py --seed 7 --max-turns 12 --json-scores ``` In this mode, the Space only needs to expose the environment API (`/reset`, `/step`, `/grader`, `/tasks`, `/schema`, `/health`, `/metadata`, `/ws`, `/mcp`). Model credentials are provided on the machine that runs **`inference.py`**, not on the Space. 2. Hitting **`/baseline`** API with a `POST` request: ```bash curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \ -H 'Content-Type: application/json' \ -d '{"seed": 7, "max_turns": 12}' ``` In this mode, the Space itself executes **`inference.py`**. Configure one model credential source on the Space (**`API_KEY`** or **`HF_TOKEN`**). **`MODEL_NAME`** and **`API_BASE_URL`** are optional overrides. **`ENV_BASE_URL`** is not required for **`POST /baseline`** because the server injects **`http://127.0.0.1:$PORT`** when it launches the child `inference.py` process. If **`ADMIN_API_KEY`** is unset, **`POST /baseline`** is open; if it is set, callers must send **`X-Admin-Key`**. --- ## API reference | Method | Path | Purpose | | ------ | -------------------- | ------------------------------------------------------------- | | GET | `/health` | Liveness | | GET | `/metadata` | Name, description, version, task count | | GET | `/schema` | JSON Schemas: action, observation, state | | GET | `/tasks` | Tasks + action/observation/state schemas | | POST | `/mcp` | Minimal JSON-RPC tool-list compatibility stub | | POST | `/reset?task_id=...` | New episode; body may include `seed`, `episode_id` | | POST | `/step` | One action; optional `timeout_s` | | GET | `/state` | Episode state (`task_id`, `seed`, …) | | GET | `/grader` | Normalized reported terminal score for active task | | GET | `/grader/{task_id}` | Same; `task_id` must match the active task | | POST | `/baseline` | Subprocess baseline (see [Setup and usage](#setup-and-usage)) | | WS | `/ws` | OpenEnv WebSocket session | --- ## Environment variables (server / container) | Variable | Purpose | | ------------------------ | --------------------------------------------------------------------------------------------- | | `HOST` | Listen host used by `python -m server` and the container entrypoint | | `PORT` | Listen port used by `python -m server` and the container entrypoint | | `DEBUG` | Enables reload for local `python -m server` runs | | `ENV_FILE` | Repo-relative dotenv loaded after `.env` without overriding externally injected runtime vars | | `HTTP_SESSION_TIMEOUT_S` | HTTP session idle TTL; max wall time for **`POST /baseline`** child | | `MAX_HTTP_SESSIONS` | Concurrent HTTP sessions cap | | `MAX_WS_SESSIONS` | Concurrent WebSocket sessions cap | | `ADMIN_API_KEY` | When set, protects **`POST /baseline`** and lets **`X-Admin-Key`** unlock full grader details | | `PUBLIC_GRADER_DETAILS` | If `true`, public **`/grader`** and **`/grader/{task_id}`** responses include **`details`** | | `COOKIE_SECURE` | Set `Secure` on session cookies (HTTPS) | | `CORS_ALLOW_ORIGINS` | Comma-separated origins; empty disables permissive CORS (recommended default) | --- ## Tests ```bash uv sync --extra dev uv run pytest -q ``` --- ## Appendix | Command | Description | | ------- | ----------- | | `uv --version` | Confirm `uv` is installed and available in `PATH`. | | `uv init` | Create a new Python project managed by `uv`. | | `uv venv` | Create a virtual environment. | | `uv sync` | Install dependencies from the project metadata and lockfile. | | `uv add ` | Add a dependency to the current project. | | `uv remove ` | Remove a dependency from the current project. | | `uv lock` | Update or generate the lockfile. | | `uv run ` | Run a command inside the project environment. | | `uv python install` | Install and manage Python versions through `uv`. | | `uv pip install ` | Install a package using the `pip`-compatible interface. | | `uv tree` | Show the resolved dependency tree. |