---
title: DataOpsEnv
emoji: 🧩
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
short_description: OpenEnv DataOps SQLite, ETL repair, three graded tasks.
tags:
- openenv
base_path: /web
---
# DataOpsEnv
[Overview](#environment-description-and-motivation) · [Tasks](#tasks-descriptions-and-expected-difficulty) · [Get the repository](#get-the-repository) · [Setup and run](#setup-and-usage) · [Baseline scores](#baseline-scores) · [Hugging Face Spaces](#hugging-face-spaces) · [HTTP API](#api-reference) · [Tests](#tests) · [Appendix](#appendix)
## Environment description and motivation
**DataOpsEnv** is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small **SQLite** warehouse, **repairing Python ETL scripts**, and completing an **end-to-end reporting incident** (extract data, fix a formatter, send a mock email). Episodes are **seeded** (`reset` may include `seed`) so scenarios are **reproducible**; each HTTP session receives an **isolated workspace and database**.
Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect **real workflows**; this environment exercises multi-step tool use, respect for constraints, and verifiable outcomes rather than single-shot question answering.
**Implementation:** FastAPI (`server/app.py`), environment logic (`server/dataops_env_environment.py`), terminal graders (`server/grading.py`), scenario definitions (`server/task_specs.py`, `data/init_db.py`), Pydantic types (`models.py`), OpenEnv manifest (`openenv.yaml`).
---
## Action space
Each step submits JSON: `{"action": {"action_type": "<type>", "payload": { ... }}}`. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).
| `action_type` | Payload fields | Role |
| ------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| `ExecuteSQL` | `query` (string, 1–2000 chars) | Run task-scoped SQL against the episode SQLite DB. |
| `ReadFile` | `filepath` (string, 1–255 chars) | Read an allowed file from the episode workspace. |
| `WriteFile` | `filepath`, `content` (content ≤ 1M chars) | Overwrite an allowed workspace file. |
| `RunScript` | `filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. |
| `SendEmail` | `to_email`, `subject`, `body` | Queue a mock email (used for the hard task). |
Machine-readable schema: **`GET /schema`** → `action`, or **`GET /tasks`** → `action_schema`.
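The envelope and limits in the table above can be checked on the client before a step is sent. The following is a minimal sketch, not the server's validator: `build_action` and `REQUIRED_FIELDS` are hypothetical names, and only the limits stated in the table are enforced (per-task rules such as allowed files and SQL policy remain server-side).

```python
from typing import Any

# Required payload fields per action_type, copied from the action table.
# The helper below is a hypothetical client-side convenience.
REQUIRED_FIELDS = {
    "ExecuteSQL": ("query",),
    "ReadFile": ("filepath",),
    "WriteFile": ("filepath", "content"),
    "RunScript": ("filepath",),
    "SendEmail": ("to_email", "subject", "body"),
}


def build_action(action_type: str, **payload: Any) -> dict:
    """Return the JSON body for POST /step after basic client-side checks."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type}")
    missing = [f for f in REQUIRED_FIELDS[action_type] if f not in payload]
    if missing:
        raise ValueError(f"missing payload fields: {missing}")
    if action_type == "ExecuteSQL" and not 1 <= len(payload["query"]) <= 2000:
        raise ValueError("query must be 1-2000 characters")
    if action_type == "ReadFile" and not 1 <= len(payload["filepath"]) <= 255:
        raise ValueError("filepath must be 1-255 characters")
    if action_type == "RunScript":
        name = payload["filepath"]
        if not name.endswith(".py") or "/" in name or "\\" in name:
            raise ValueError("filepath must be a *.py basename")
        args = payload.get("args", [])
        if len(args) > 20 or any(len(a) > 500 for a in args):
            raise ValueError("args: at most 20 items, each at most 500 chars")
    return {"action": {"action_type": action_type, "payload": payload}}


step_body = build_action("ExecuteSQL", query="SELECT COUNT(*) FROM transactions")
```

Failing fast on the client saves a step from the episode budget; the server's response remains the authority on what is allowed.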
---
## Observation space
Each `step` / `reset` response includes an observation object (REST also exposes wrapper fields such as `reward` / `done`). The table lists the fields inherited from the OpenEnv base (`done`, `reward`, `metadata`) first, followed by the fields added by the **DataOps** layer.
| Field | Type | Meaning |
| ----------------------- | ------------------------ | ----------------------------------------------------------------- |
| `done` | boolean | Whether the episode has ended (step limit or terminal condition). |
| `reward` | number \| null | Shaped **step reward** after this transition (trajectory signal). |
| `metadata` | object | OpenEnv extension bucket (usually empty). |
| `status` | `"success"` \| `"error"` | Whether the action executed successfully. |
| `message` | string | Short human-readable summary. |
| `stdout` | string \| null | Captured stdout (e.g. script or file read). |
| `stderr` | string \| null | Captured stderr. |
| `sql_results` | list of objects \| null | Row dicts for successful `SELECT`-style outcomes. |
| `email_delivery_status` | string \| null | Mock send confirmation when applicable. |
| `step_count` | integer | Steps taken in the episode. |
| `max_steps` | integer | Episode step budget. |
**Terminal evaluation:** The top-level **grader score** returned by **`GET /grader`** (or **`GET /grader/{task_id}`**) reflects the **final** database, files, and outbox (and, for the hard task, **provenance** constraints). Publicly reported top-level scores are rounded to 2 decimals and normalized to the open interval **(0.00, 1.00)**, so exact internal boundary scores are surfaced as **0.01** and **0.99**. When **`details`** are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the **grader** as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return **`reward=0.0`** when they neither improve grader state nor unlock a milestone.
Machine-readable schema: **`GET /schema`** → `observation`.
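For client code, the observation fields in the table above map naturally onto a small typed record. The sketch below uses a standard-library dataclass as a stand-in for the project's Pydantic types in `models.py`; the class, the `steps_left` helper, and the sample dictionary are illustrative assumptions, not captured server output.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Observation:
    """Client-side mirror of the observation table (hypothetical helper)."""

    status: str
    message: str
    step_count: int
    max_steps: int
    done: bool = False
    reward: Optional[float] = None
    stdout: Optional[str] = None
    stderr: Optional[str] = None
    sql_results: Optional[list] = None
    email_delivery_status: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, raw: dict) -> "Observation":
        # Ignore keys this sketch does not model so new fields do not break it.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in names})

    @property
    def steps_left(self) -> int:
        return max(self.max_steps - self.step_count, 0)


# Made-up payload shaped like the table above.
sample = {
    "status": "success",
    "message": "1 row returned",
    "sql_results": [{"n": 3}],
    "step_count": 3,
    "max_steps": 12,
    "done": False,
    "reward": 0.0,
    "unmodeled_field": "ignored",
}
obs = Observation.from_dict(sample)
print(obs.status, obs.steps_left)  # success 9
```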
---
## Tasks (descriptions and expected difficulty)
| Task ID | Expected difficulty | Description |
| ---------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `task_1_easy_anomaly` | **Easy** | The `transactions` table contains valid rows and rows with **NULL** `amount`. The agent must **delete only** the corrupted rows and leave all valid rows **unchanged**, including legitimate seeded zero-value or negative non-null adjustments. |
| `task_2_medium_syntax` | **Medium** | `broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must **read**, **patch**, and **run** the script so **`process_data_stream`** produces the correct downstream-ready records on both visible and hidden seeded batches. |
| `task_3_hard_e2e` | **Hard** | **End-to-end incident:** query the correct **`daily_reports`** slice for the **scenario date**, persist results as **`report_data.json`**, **repair** **`format_report.py`**, **run** it on that JSON, then **send exactly one** email whose **body matches** the formatter output, with scenario-specific **recipient** and **subject**. |
Task list, difficulty labels, and allowed actions per task: **`GET /tasks`** and **`openenv.yaml`**.
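The easy task's constraint (remove only rows with a NULL `amount`, keep zero and negative values) can be tried locally against an in-memory SQLite database. This is a sketch under assumptions: the two-column schema and the rows are invented for illustration and are not the seeded scenario data.

```python
import sqlite3

# Hypothetical miniature of the transactions table; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(1, 19.99), (2, None), (3, 0.0), (4, -5.0), (5, None)],
)

# IS NULL matches only missing amounts; zero and negative rows are untouched.
deleted = conn.execute("DELETE FROM transactions WHERE amount IS NULL").rowcount
conn.commit()

remaining = conn.execute(
    "SELECT id, amount FROM transactions ORDER BY id"
).fetchall()
print(deleted, remaining)  # 2 [(1, 19.99), (3, 0.0), (4, -5.0)]
conn.close()
```

A predicate such as `amount <= 0` or a falsy check would also delete the legitimate zero-value and negative adjustments that the task description says must remain.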
---
# Setup and usage
## Get the repository
```bash
git clone https://github.com/vishesh-rathi/dataops-openenv.git
cd dataops-openenv
```
## Install `uv` (if needed)
**macOS / Linux**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Restart your shell, or load the new path in the current session:
```bash
source "$HOME/.local/bin/env"
```
**Windows (PowerShell)**
```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
Open a new PowerShell session after installation.
**Verify**
```bash
uv --version
```
---
**Prerequisites:** Python **3.12+**, **[uv](https://docs.astral.sh/uv/)**.
```bash
uv sync
cp .env.example .env.dev
printf 'ENV_FILE=.env.dev\n' > .env
```
Repo-root **`.env`** selects the active secondary env file. Use **`.env.dev`** for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files.
**Run the server** from the repository root so the active env file is discovered. Explicitly exported runtime vars such as **`HOST`**, **`PORT`**, and **`DEBUG`** still take precedence:
```bash
uv run python -m server
```
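The layering described above (repo-root `.env` names a secondary file, and explicitly exported variables win) can be sketched as follows. This is an illustration, not the server's loader: `parse_dotenv` and `resolve_config` are hypothetical, the parser handles only simple `KEY=value` lines, and the assumption that the secondary file overrides values in the root `.env` is not confirmed by this README.

```python
def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=value lines; comments and blank lines are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_config(root_env: str, files: dict, process_env: dict) -> dict:
    """Merge root .env, the file named by ENV_FILE, and exported variables."""
    root = parse_dotenv(root_env)
    env_file = root.get("ENV_FILE")
    secondary = parse_dotenv(files.get(env_file, "")) if env_file else {}
    # Later mappings win: exported variables take precedence over both files.
    return {**root, **secondary, **process_env}


config = resolve_config(
    root_env="ENV_FILE=.env.dev\n",
    files={".env.dev": "PORT=7860\nDEBUG=true\n"},
    process_env={"PORT": "9000"},  # e.g. `PORT=9000 uv run python -m server`
)
print(config["PORT"], config["DEBUG"])  # 9000 true
```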
Clients reuse the **`Set-Cookie`** session cookie or **`X-Session-ID`** header from **`POST /reset`** on **`/step`**, **`/state`**, and **`/grader`**.
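For clients that do not keep a cookie jar, the header-based variant of this session reuse might look like the sketch below. It uses only the standard library; `build_request` and `smoke_test` are hypothetical helpers, and reading `X-Session-ID` from the `/reset` response headers is an assumption based on the sentence above.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, path: str, body: Optional[dict] = None,
                  session_id: Optional[str] = None,
                  method: str = "POST") -> urllib.request.Request:
    """Build a JSON request that carries the session header when one is known."""
    headers = {"Content-Type": "application/json"}
    if session_id:
        headers["X-Session-ID"] = session_id
    data = json.dumps(body).encode("utf-8") if body is not None else None
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def smoke_test(base_url: str = "http://127.0.0.1:7860") -> dict:
    """Reset, take one step, and fetch the grader score (needs a live server)."""
    reset = build_request(base_url, "/reset?task_id=task_1_easy_anomaly", {"seed": 7})
    with urllib.request.urlopen(reset) as resp:
        session_id = resp.headers.get("X-Session-ID")  # assumed response header
    step = {"action": {"action_type": "ExecuteSQL",
                       "payload": {"query": "SELECT COUNT(*) AS n FROM transactions"}}}
    with urllib.request.urlopen(build_request(base_url, "/step", step, session_id)) as resp:
        resp.read()
    grader = build_request(base_url, "/grader", session_id=session_id, method="GET")
    with urllib.request.urlopen(grader) as resp:
        return json.loads(resp.read())
```

Calling `smoke_test()` requires a running server; nothing in the block contacts the network until that function is invoked.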
**OpenEnv packaging:**
```bash
uv run openenv validate
```
**Docker:**
```bash
printf 'ENV_FILE=.env.dev\n' > .env
bash build_and_run_image.sh
```
The helper script reads repo-root **`.env`** only to resolve **`ENV_FILE`**, then passes that secondary file to `docker run --env-file ...`. The container does **not** receive a merged view of repo-root **`.env`** plus the secondary file. Keeping `.env*` out of the image is intentional; runtime configuration is injected from the host.
**Baseline inference (local):**
| Variable | Purpose |
| ---------------------- | -------------------------------------------------------------------------------------- |
| `ENV_BASE_URL` | Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default) |
| `API_KEY` / `HF_TOKEN` | Model access credential; set exactly one of the two |
| `API_BASE_URL` | Optional model provider base URL override |
| `MODEL_NAME` | Optional chat model ID |
```bash
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12
```
If **`API_BASE_URL`** is unset, `inference.py` defaults to Google's OpenAI-compatible Gemini endpoint for **`API_KEY`** and Hugging Face's router for **`HF_TOKEN`**.
Flags: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores` (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When `PUBLIC_GRADER_DETAILS=true` and the grader API exposes details, `inference.py` also writes the per-task grader payloads to `stderr`.
**`POST /baseline`** runs the same script inside the server process; optional JSON body: `task_ids`, `seed`, `max_turns`. If **`ADMIN_API_KEY`** is unset, the route is open. If **`ADMIN_API_KEY`** is set, callers must send **`X-Admin-Key`**. If **`ENV_BASE_URL`** is unset, the server injects **`http://127.0.0.1:$PORT`** into the child process automatically.
Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.
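The isolation idea can be illustrated with the standard library. This is a sketch of the general technique, not the server's implementation: `run_isolated`, the 10-second timeout, and the 10,000-character cap are assumptions, and only a wall-clock limit is shown (memory or CPU limits would need additional platform-specific setup).

```python
import os
import subprocess
import sys

MAX_OUTPUT_CHARS = 10_000  # hypothetical cap on captured output
TIMEOUT_S = 10             # hypothetical wall-clock limit


def run_isolated(code: str) -> subprocess.CompletedProcess:
    """Run Python code in a child process that inherits almost no environment."""
    clean_env = {"PATH": os.defpath}
    if "SYSTEMROOT" in os.environ:  # Windows needs this to start Python
        clean_env["SYSTEMROOT"] = os.environ["SYSTEMROOT"]
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=clean_env,          # replaces, rather than extends, the parent env
        capture_output=True,
        text=True,
        timeout=TIMEOUT_S,      # raises subprocess.TimeoutExpired when exceeded
    )
    proc.stdout = proc.stdout[:MAX_OUTPUT_CHARS]
    proc.stderr = proc.stderr[:MAX_OUTPUT_CHARS]
    return proc


# Demo: a secret set in the parent process is not visible to the child.
os.environ["API_KEY"] = "secret-for-demo"
result = run_isolated("import os; print(os.environ.get('API_KEY'))")
print(result.stdout.strip())  # None
```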
**Minimal HTTP smoke test:**
```bash
curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
-H 'Content-Type: application/json' \
-d '{"seed": 7}'
curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
-H 'Content-Type: application/json' \
-d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'
curl -b cookies.txt 'http://127.0.0.1:7860/grader'
```
By default **`/grader`** returns **`task_id`** and the normalized top-level **`score`** only. Full grader **`details`** require **`PUBLIC_GRADER_DETAILS=true`** or a valid **`X-Admin-Key`** when **`ADMIN_API_KEY`** is set. The public top-level **`score`** uses the same 2-decimal **`0.01..0.99`** normalization as `inference.py`; nested **`details`** component scores remain raw. This does **not** change the mandatory `[START]` / `[STEP]` / `[END]` lines from `inference.py`; it affects the grader API, the optional trailing JSON emitted by `--json-scores`, and the captured `stderr` payloads written by `inference.py`.
---
## Baseline scores
All figures below are **reported terminal scores**, rounded to 2 decimals and normalized to the open interval **(0.00, 1.00)**. Exact internal boundary scores are surfaced as **0.01** and **0.99**. Scores depend on provider, model revision, temperature, and `seed`.
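One way to picture this reporting rule is as rounding followed by clamping. The function below is a sketch consistent with the description above, not the code in `server/grading.py`; the name `report_score` and the round-then-clamp order are assumptions.

```python
def report_score(raw: float) -> float:
    """Round to 2 decimals, then keep the result strictly inside (0, 1)."""
    return min(0.99, max(0.01, round(raw, 2)))


for raw in (0.0, 0.004, 0.5, 0.996, 1.0):
    print(raw, "->", report_score(raw))
```

Under this sketch, a raw score of `0.0` is reported as `0.01` and a raw `1.0` as `0.99`, matching the null baseline and reference rows in the tables that follow.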
### Null baseline (no agent actions)
| Condition | `task_1` | `task_2` | `task_3` | Avg |
| ---------------------------------------------------- | -------- | -------- | -------- | ---- |
| `reset` only (`seed=7`), then grader; **no** `/step` | 0.01 | 0.01 | 0.01 | 0.01 |
### Reference tool-calling baseline
`[END] success=true` in the harness logs means the reported terminal score reached **0.99** for that task.
| Model | Seed | `task_1_easy_anomaly` | `task_2_medium_syntax` | `task_3_hard_e2e` | Average |
| ------------------------------- | ---- | --------------------- | ---------------------- | ----------------- | ------- |
| `gemini-3.1-flash-lite-preview` | 7 | 0.99 | 0.99 | 0.99 | 0.99 |
**Reproducing a baseline run:** With the API server running locally on `7860` and model credentials configured, run:
```bash
export MODEL_NAME=gemini-3.1-flash-lite-preview
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12 --json-scores
```
The final line of stdout is a single JSON object with **`scores`**, **`grades`**, **`average`**, **`model`**, and **`metadata`**.
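A caller that captures stdout can pick that summary off the last line. The sketch below assumes only what is stated above (the final line is one JSON object with those five keys); `parse_json_scores` is a hypothetical helper, and the sample text, including the shape of `scores` and `grades`, is invented for illustration.

```python
import json

EXPECTED_KEYS = {"scores", "grades", "average", "model", "metadata"}


def parse_json_scores(stdout: str) -> dict:
    """Return the JSON summary from the last non-empty line of stdout."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output captured")
    summary = json.loads(lines[-1])
    missing = EXPECTED_KEYS - summary.keys()
    if missing:
        raise ValueError(f"summary is missing keys: {sorted(missing)}")
    return summary


# Invented sample; real harness lines and value shapes may differ.
sample_stdout = "\n".join([
    "[START] ...",
    "[STEP] ...",
    "[END] ...",
    json.dumps({
        "scores": {"task_1_easy_anomaly": 0.99},
        "grades": {},
        "average": 0.99,
        "model": "example-model",
        "metadata": {},
    }),
])
summary = parse_json_scores(sample_stdout)
print(summary["average"])  # 0.99
```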
---
## Hugging Face Spaces
There are two ways to run the baseline against a deployed Hugging Face Space:
1. Run **`inference.py`** externally against the public Space URL:
```bash
export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
uv run python inference.py --seed 7 --max-turns 12 --json-scores
```
In this mode, the Space only needs to expose the environment API (`/reset`, `/step`, `/grader`, `/tasks`, `/schema`, `/health`, `/metadata`, `/ws`, `/mcp`). Model credentials are provided on the machine that runs **`inference.py`**, not on the Space.
2. Send a `POST` request to **`/baseline`**:
```bash
curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
-H 'Content-Type: application/json' \
-d '{"seed": 7, "max_turns": 12}'
```
In this mode, the Space itself executes **`inference.py`**. Configure one model credential source on the Space (**`API_KEY`** or **`HF_TOKEN`**). **`MODEL_NAME`** and **`API_BASE_URL`** are optional overrides. **`ENV_BASE_URL`** is not required for **`POST /baseline`** because the server injects **`http://127.0.0.1:$PORT`** when it launches the child `inference.py` process. If **`ADMIN_API_KEY`** is unset, **`POST /baseline`** is open; if it is set, callers must send **`X-Admin-Key`**.
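The optional body fields and the conditional admin header for **`POST /baseline`** can be assembled as in this sketch. `build_baseline_call` is a hypothetical helper; it only prepares headers and a JSON body and does not send anything.

```python
import json
from typing import Optional


def build_baseline_call(seed: Optional[int] = None,
                        max_turns: Optional[int] = None,
                        task_ids: Optional[list] = None,
                        admin_key: Optional[str] = None) -> tuple:
    """Return (headers, json_body) for POST /baseline; all fields are optional."""
    body = {}
    if task_ids:
        body["task_ids"] = task_ids
    if seed is not None:
        body["seed"] = seed
    if max_turns is not None:
        body["max_turns"] = max_turns
    headers = {"Content-Type": "application/json"}
    if admin_key:  # only needed when the server has ADMIN_API_KEY set
        headers["X-Admin-Key"] = admin_key
    return headers, json.dumps(body)


headers, body = build_baseline_call(seed=7, max_turns=12)
print(headers, body)
```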
---
## API reference
| Method | Path | Purpose |
| ------ | -------------------- | ------------------------------------------------------------- |
| GET | `/health` | Liveness |
| GET | `/metadata` | Name, description, version, task count |
| GET | `/schema` | JSON Schemas: action, observation, state |
| GET | `/tasks` | Tasks + action/observation/state schemas |
| POST | `/mcp` | Minimal JSON-RPC tool-list compatibility stub |
| POST | `/reset?task_id=...` | New episode; body may include `seed`, `episode_id` |
| POST | `/step` | One action; optional `timeout_s` |
| GET | `/state` | Episode state (`task_id`, `seed`, …) |
| GET | `/grader` | Normalized reported terminal score for active task |
| GET | `/grader/{task_id}` | Same; `task_id` must match the active task |
| POST | `/baseline` | Subprocess baseline (see [Setup and usage](#setup-and-usage)) |
| WS | `/ws` | OpenEnv WebSocket session |
---
## Environment variables (server / container)
| Variable | Purpose |
| ------------------------ | --------------------------------------------------------------------------------------------- |
| `HOST` | Listen host used by `python -m server` and the container entrypoint |
| `PORT` | Listen port used by `python -m server` and the container entrypoint |
| `DEBUG` | Enables reload for local `python -m server` runs |
| `ENV_FILE` | Repo-relative dotenv loaded after `.env` without overriding externally injected runtime vars |
| `HTTP_SESSION_TIMEOUT_S` | HTTP session idle TTL; max wall time for **`POST /baseline`** child |
| `MAX_HTTP_SESSIONS` | Concurrent HTTP sessions cap |
| `MAX_WS_SESSIONS` | Concurrent WebSocket sessions cap |
| `ADMIN_API_KEY` | When set, protects **`POST /baseline`** and lets **`X-Admin-Key`** unlock full grader details |
| `PUBLIC_GRADER_DETAILS` | If `true`, public **`/grader`** and **`/grader/{task_id}`** responses include **`details`** |
| `COOKIE_SECURE` | Set `Secure` on session cookies (HTTPS) |
| `CORS_ALLOW_ORIGINS` | Comma-separated origins; empty disables permissive CORS (recommended default) |
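As an illustration of the comma-separated format used by `CORS_ALLOW_ORIGINS`, a value could be split as below. This is a sketch, not the server's parser; `parse_origins` is a hypothetical name, and treating whitespace around entries as insignificant is an assumption.

```python
def parse_origins(value: str) -> list:
    """Split a comma-separated origin list, dropping blanks and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


print(parse_origins("https://a.example, https://b.example"))
print(parse_origins(""))  # empty value yields no allowed origins
```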
---
## Tests
```bash
uv sync --extra dev
uv run pytest -q
```
---
## Appendix
| Command | Description |
| ------- | ----------- |
| `uv --version` | Confirm `uv` is installed and available in `PATH`. |
| `uv init` | Create a new Python project managed by `uv`. |
| `uv venv` | Create a virtual environment. |
| `uv sync` | Install dependencies from the project metadata and lockfile. |
| `uv add <package>` | Add a dependency to the current project. |
| `uv remove <package>` | Remove a dependency from the current project. |
| `uv lock` | Update or generate the lockfile. |
| `uv run <command>` | Run a command inside the project environment. |
| `uv python install` | Install and manage Python versions through `uv`. |
| `uv pip install <package>` | Install a package using the `pip`-compatible interface. |
| `uv tree` | Show the resolved dependency tree. |