Spaces:
Sleeping
Sleeping
| title: DataOpsEnv | |
| emoji: 🧩 | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 7860 | |
| short_description: OpenEnv DataOps — SQLite, ETL repair, three graded tasks. | |
| tags: | |
| - openenv | |
| base_path: /web | |
| # DataOpsEnv | |
| [Overview](#environment-description-and-motivation) · [Tasks](#tasks-descriptions-and-expected-difficulty) · [Get the repository](#get-the-repository) · [Setup and run](#setup-and-usage) · [Baseline scores](#baseline-scores) · [Hugging Face Spaces](#hugging-face-spaces) · [HTTP API](#api-reference) · [Tests](#tests) · [Appendix](#appendix) | |
| ## Environment description and motivation | |
| **DataOpsEnv** is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small **SQLite** warehouse, **repairing Python ETL scripts**, and completing an **end-to-end reporting incident** (extract data, fix a formatter, send a mock email). Episodes are **seeded** (`reset` may include `seed`) so scenarios are **reproducible**; each HTTP session receives an **isolated workspace and database**. | |
| Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect **real workflows**; this environment exercises multi-step tool use, constraint respect, and verifiable outcomes rather than single-shot question answering. | |
| **Implementation:** FastAPI (`server/app.py`), environment logic (`server/dataops_env_environment.py`), terminal graders (`server/grading.py`), scenario definitions (`server/task_specs.py`, `data/init_db.py`), Pydantic types (`models.py`), OpenEnv manifest (`openenv.yaml`). | |
| --- | |
| ## Action space | |
| Each step submits JSON: `{"action": {"action_type": "<type>", "payload": { ... }}}`. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task). | |
| | `action_type` | Payload fields | Role | | |
| | ------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- | | |
| | `ExecuteSQL` | `query` (string, 1–2000 chars) | Run task-scoped SQL against the episode SQLite DB. | | |
| | `ReadFile` | `filepath` (string, 1–255 chars) | Read an allowed file from the episode workspace. | | |
| | `WriteFile` | `filepath`, `content` (content ≤ 1M chars) | Overwrite an allowed workspace file. | | |
| | `RunScript` | `filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. | | |
| | `SendEmail` | `to_email`, `subject`, `body` | Queue a mock email (used for the hard task). | | |
| Machine-readable schema: **`GET /schema`** → `action`, or **`GET /tasks`** → `action_schema`. | |
| --- | |
| ## Observation space | |
| Each `step` / `reset` response includes an observation object (REST also exposes wrapper fields such as `reward` / `done`). The fields below describe the **DataOps** layer; the OpenEnv base also defines `done`, `reward`, and `metadata`. | |
| | Field | Type | Meaning | | |
| | ----------------------- | ------------------------ | ----------------------------------------------------------------- | | |
| | `done` | boolean | Whether the episode has ended (step limit or terminal condition). | | |
| | `reward` | number \| null | Shaped **step reward** after this transition (trajectory signal). | | |
| | `metadata` | object | OpenEnv extension bucket (usually empty). | | |
| | `status` | `"success"` \| `"error"` | Whether the action executed successfully. | | |
| | `message` | string | Short human-readable summary. | | |
| | `stdout` | string \| null | Captured stdout (e.g. script or file read). | | |
| | `stderr` | string \| null | Captured stderr. | | |
| | `sql_results` | list of objects \| null | Row dicts for successful `SELECT`-style outcomes. | | |
| | `email_delivery_status` | string \| null | Mock send confirmation when applicable. | | |
| | `step_count` | integer | Steps taken in the episode. | | |
| | `max_steps` | integer | Episode step budget. | | |
| **Terminal evaluation:** The top-level **grader score** returned by **`GET /grader`** (or **`GET /grader/{task_id}`**) reflects the **final** database, files, and outbox (and, for the hard task, **provenance** constraints). Publicly reported top-level scores are rounded to 2 decimals and normalized to **(0.00, 1.00)**, so exact internal boundary scores are surfaced as **0.01** and **0.99**. When **`details`** are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the **grader** as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return **`reward=0.0`** when they neither improve grader state nor unlock a milestone. | |
| Machine-readable schema: **`GET /schema`** → `observation`. | |
| --- | |
| ## Tasks (descriptions and expected difficulty) | |
| | Task ID | Expected difficulty | Description | | |
| | ---------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | |
| | `task_1_easy_anomaly` | **Easy** | The `transactions` table contains valid rows and rows with **NULL** `amount`. The agent must **delete only** the corrupted rows and leave all valid rows **unchanged**, including legitimate seeded zero-value or negative non-null adjustments. | | |
| | `task_2_medium_syntax` | **Medium** | `broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must **read**, **patch**, and **run** the script so **`process_data_stream`** produces the correct downstream-ready records on both visible and hidden seeded batches. | | |
| | `task_3_hard_e2e` | **Hard** | **End-to-end incident:** query the correct **`daily_reports`** slice for the **scenario date**, persist results as **`report_data.json`**, **repair** **`format_report.py`**, **run** it on that JSON, then **send exactly one** email whose **body matches** the formatter output, with scenario-specific **recipient** and **subject**. | | |
| Task list, difficulty labels, and allowed actions per task: **`GET /tasks`** and **`openenv.yaml`**. | |
| --- | |
| # Setup and usage | |
| ## Get the repository | |
| ```bash | |
| git clone https://github.com/vishesh-rathi/dataops-openenv.git | |
| cd dataops-openenv | |
| ``` | |
| ## Install `uv` (if needed) | |
| **macOS / Linux** | |
| ```bash | |
| curl -LsSf https://astral.sh/uv/install.sh | sh | |
| ``` | |
| Restart your shell, or load the new path in the current session: | |
| ```bash | |
| source "$HOME/.local/bin/env" | |
| ``` | |
| **Windows (PowerShell)** | |
| ```powershell | |
| powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" | |
| ``` | |
| Open a new PowerShell session after installation. | |
| **Verify** | |
| ```bash | |
| uv --version | |
| ``` | |
| --- | |
| **Prerequisites:** Python **3.12+**, **[uv](https://docs.astral.sh/uv/)**. | |
| ```bash | |
| uv sync | |
| cp .env.example .env.dev | |
| printf 'ENV_FILE=.env.dev\n' > .env | |
| ``` | |
| Repo-root **`.env`** selects the active secondary env file. Use **`.env.dev`** for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files. | |
| **Run the server** from the repository root so the active env file is discovered. Explicitly exported runtime vars such as **`HOST`**, **`PORT`**, and **`DEBUG`** still take precedence: | |
| ```bash | |
| uv run python -m server | |
| ``` | |
| Clients reuse the **`Set-Cookie`** session cookie or **`X-Session-ID`** header from **`POST /reset`** on **`/step`**, **`/state`**, and **`/grader`**. | |
| **OpenEnv packaging:** | |
| ```bash | |
| uv run openenv validate | |
| ``` | |
| **Docker:** | |
| ```bash | |
| printf 'ENV_FILE=.env.dev\n' > .env | |
| bash build_and_run_image.sh | |
| ``` | |
| The helper script reads repo-root **`.env`** only to resolve **`ENV_FILE`**, then passes that secondary file to `docker run --env-file ...`. The container does **not** receive a merged view of repo-root **`.env`** plus the secondary file. Keeping `.env*` out of the image is intentional; runtime configuration is injected from the host. | |
| **Baseline inference (local):** | |
| | Variable | Purpose | | |
| | ---------------------- | -------------------------------------------------------------------------------------- | | |
| | `ENV_BASE_URL` | Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default) | | |
| | `API_KEY` / `HF_TOKEN` | Exactly one model access credential source | | |
| | `API_BASE_URL` | Optional model provider base URL override | | |
| | `MODEL_NAME` | Optional Chat model ID | | |
| ```bash | |
| export ENV_BASE_URL=http://127.0.0.1:7860 | |
| uv run python inference.py --seed 7 --max-turns 12 | |
| ``` | |
| If **`API_BASE_URL`** is unset, `inference.py` defaults to Google's OpenAI-compatible Gemini endpoint for **`API_KEY`** and Hugging Face's router for **`HF_TOKEN`**. | |
| Flags: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores` (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When `PUBLIC_GRADER_DETAILS=true` and the grader API exposes details, `inference.py` also writes the per-task grader payloads to `stderr`. | |
| **`POST /baseline`** runs the same script inside the server process; optional JSON body: `task_ids`, `seed`, `max_turns`. If **`ADMIN_API_KEY`** is unset, the route is open. If **`ADMIN_API_KEY`** is set, callers must send **`X-Admin-Key`**. If **`ENV_BASE_URL`** is unset, the server injects **`http://127.0.0.1:$PORT`** into the child process automatically. | |
| Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process. | |
| **Minimal HTTP smoke test:** | |
| ```bash | |
| curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \ | |
| -H 'Content-Type: application/json' \ | |
| -d '{"seed": 7}' | |
| curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \ | |
| -H 'Content-Type: application/json' \ | |
| -d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}' | |
| curl -b cookies.txt 'http://127.0.0.1:7860/grader' | |
| ``` | |
| By default **`/grader`** returns **`task_id`** and the normalized top-level **`score`** only. Full grader **`details`** require **`PUBLIC_GRADER_DETAILS=true`** or a valid **`X-Admin-Key`** when **`ADMIN_API_KEY`** is set. The public top-level **`score`** uses the same 2-decimal **`0.01..0.99`** normalization as `inference.py`; nested **`details`** component scores remain raw. This does **not** change the mandatory `[START]` / `[STEP]` / `[END]` lines from `inference.py`; it affects the grader API, the optional trailing JSON emitted by `--json-scores`, and the captured `stderr` payloads written by `inference.py`. | |
| --- | |
| ## Baseline scores | |
| All figures below are **reported terminal scores**, rounded to 2 decimals and normalized to **(0.00, 1.00)**. Exact internal boundary scores are surfaced as **0.01** and **0.99**. Scores depend on provider, model revision, temperature, and `seed`. | |
| ### Null baseline (no agent actions) | |
| | Condition | `task_1` | `task_2` | `task_3` | Avg | | |
| | ---------------------------------------------------- | -------- | -------- | -------- | ---- | | |
| | `reset` only (`seed=7`), then grader; **no** `/step` | 0.01 | 0.01 | 0.01 | 0.01 | | |
| ### Reference tool-calling baseline | |
| `[END] success=true` in the harness logs means the reported terminal score reached **0.99** for that task. | |
| | Model | Seed | `task_1_easy_anomaly` | `task_2_medium_syntax` | `task_3_hard_e2e` | Average | | |
| | ------------------------------- | ---- | --------------------- | ---------------------- | ----------------- | ------- | | |
| | `gemini-3.1-flash-lite-preview` | 7 | 0.99 | 0.99 | 0.99 | 0.99 | | |
| **Reproducing a baseline run:** With the API server running locally on `7860` and model credentials configured, run: | |
| ```bash | |
| export MODEL_NAME=gemini-3.1-flash-lite-preview | |
| export ENV_BASE_URL=http://127.0.0.1:7860 | |
| uv run python inference.py --seed 7 --max-turns 12 --json-scores | |
| ``` | |
| The final line of stdout is a single JSON object with **`scores`**, **`grades`**, **`average`**, **`model`**, and **`metadata`**. | |
| --- | |
| ## Hugging Face Spaces | |
| There are two methods for running the baseline against a deployed Hugging Face Space: | |
| 1. Running **`inference.py`** externally against the public Space URL: | |
| ```bash | |
| export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space | |
| uv run python inference.py --seed 7 --max-turns 12 --json-scores | |
| ``` | |
| In this mode, the Space only needs to expose the environment API (`/reset`, `/step`, `/grader`, `/tasks`, `/schema`, `/health`, `/metadata`, `/ws`, `/mcp`). Model credentials are provided on the machine that runs **`inference.py`**, not on the Space. | |
| 2. Hitting **`/baseline`** API with a `POST` request: | |
| ```bash | |
| curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \ | |
| -H 'Content-Type: application/json' \ | |
| -d '{"seed": 7, "max_turns": 12}' | |
| ``` | |
| In this mode, the Space itself executes **`inference.py`**. Configure one model credential source on the Space (**`API_KEY`** or **`HF_TOKEN`**). **`MODEL_NAME`** and **`API_BASE_URL`** are optional overrides. **`ENV_BASE_URL`** is not required for **`POST /baseline`** because the server injects **`http://127.0.0.1:$PORT`** when it launches the child `inference.py` process. If **`ADMIN_API_KEY`** is unset, **`POST /baseline`** is open; if it is set, callers must send **`X-Admin-Key`**. | |
| --- | |
| ## API reference | |
| | Method | Path | Purpose | | |
| | ------ | -------------------- | ------------------------------------------------------------- | | |
| | GET | `/health` | Liveness | | |
| | GET | `/metadata` | Name, description, version, task count | | |
| | GET | `/schema` | JSON Schemas: action, observation, state | | |
| | GET | `/tasks` | Tasks + action/observation/state schemas | | |
| | POST | `/mcp` | Minimal JSON-RPC tool-list compatibility stub | | |
| | POST | `/reset?task_id=...` | New episode; body may include `seed`, `episode_id` | | |
| | POST | `/step` | One action; optional `timeout_s` | | |
| | GET | `/state` | Episode state (`task_id`, `seed`, …) | | |
| | GET | `/grader` | Normalized reported terminal score for active task | | |
| | GET | `/grader/{task_id}` | Same; `task_id` must match the active task | | |
| | POST | `/baseline` | Subprocess baseline (see [Setup and usage](#setup-and-usage)) | | |
| | WS | `/ws` | OpenEnv WebSocket session | | |
| --- | |
| ## Environment variables (server / container) | |
| | Variable | Purpose | | |
| | ------------------------ | --------------------------------------------------------------------------------------------- | | |
| | `HOST` | Listen host used by `python -m server` and the container entrypoint | | |
| | `PORT` | Listen port used by `python -m server` and the container entrypoint | | |
| | `DEBUG` | Enables reload for local `python -m server` runs | | |
| | `ENV_FILE` | Repo-relative dotenv loaded after `.env` without overriding externally injected runtime vars | | |
| | `HTTP_SESSION_TIMEOUT_S` | HTTP session idle TTL; max wall time for **`POST /baseline`** child | | |
| | `MAX_HTTP_SESSIONS` | Concurrent HTTP sessions cap | | |
| | `MAX_WS_SESSIONS` | Concurrent WebSocket sessions cap | | |
| | `ADMIN_API_KEY` | When set, protects **`POST /baseline`** and lets **`X-Admin-Key`** unlock full grader details | | |
| | `PUBLIC_GRADER_DETAILS` | If `true`, public **`/grader`** and **`/grader/{task_id}`** responses include **`details`** | | |
| | `COOKIE_SECURE` | Set `Secure` on session cookies (HTTPS) | | |
| | `CORS_ALLOW_ORIGINS` | Comma-separated origins; empty disables permissive CORS (recommended default) | | |
| --- | |
| ## Tests | |
| ```bash | |
| uv sync --extra dev | |
| uv run pytest -q | |
| ``` | |
| --- | |
| ## Appendix | |
| | Command | Description | | |
| | ------- | ----------- | | |
| | `uv --version` | Confirm `uv` is installed and available in `PATH`. | | |
| | `uv init` | Create a new Python project managed by `uv`. | | |
| | `uv venv` | Create a virtual environment. | | |
| | `uv sync` | Install dependencies from the project metadata and lockfile. | | |
| | `uv add <package>` | Add a dependency to the current project. | | |
| | `uv remove <package>` | Remove a dependency from the current project. | | |
| | `uv lock` | Update or generate the lockfile. | | |
| | `uv run <command>` | Run a command inside the project environment. | | |
| | `uv python install` | Install and manage Python versions through `uv`. | | |
| | `uv pip install <package>` | Install a package using the `pip`-compatible interface. | | |
| | `uv tree` | Show the resolved dependency tree. | | |