	---
	title: DataOpsEnv
	emoji: 🧩
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	app_port: 7860
	short_description: OpenEnv DataOps — SQLite, ETL repair, three graded tasks.
	tags:
	- openenv
	base_path: /web
	---

	# DataOpsEnv

	[Overview](#environment-description-and-motivation) · [Tasks](#tasks-descriptions-and-expected-difficulty) · [Get the repository](#get-the-repository) · [Setup and run](#setup-and-usage) · [Baseline scores](#baseline-scores) · [Hugging Face Spaces](#hugging-face-spaces) · [HTTP API](#api-reference) · [Tests](#tests) · [Appendix](#appendix)

	## Environment description and motivation

	DataOpsEnv is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small SQLite warehouse, repairing Python ETL scripts, and completing an end-to-end reporting incident (extract data, fix a formatter, send a mock email). Episodes are seeded (`reset` may include `seed`) so scenarios are reproducible; each HTTP session receives an isolated workspace and database.

	Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect real workflows; this environment exercises multi-step tool use, respect for constraints, and verifiable outcomes rather than single-shot question answering.

	Implementation: FastAPI (`server/app.py`), environment logic (`server/dataops_env_environment.py`), terminal graders (`server/grading.py`), scenario definitions (`server/task_specs.py`, `data/init_db.py`), Pydantic types (`models.py`), OpenEnv manifest (`openenv.yaml`).

	---

	## Action space

	Each step submits JSON: `{"action": {"action_type": "<type>", "payload": { ... }}}`. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).

	| `action_type` | Payload fields | Role |
	| ------------- | -------------- | ---- |
	| `ExecuteSQL` | `query` (string, 1–2000 chars) | Run task-scoped SQL against the episode SQLite DB. |
	| `ReadFile` | `filepath` (string, 1–255 chars) | Read an allowed file from the episode workspace. |
	| `WriteFile` | `filepath`, `content` (content ≤ 1M chars) | Overwrite an allowed workspace file. |
	| `RunScript` | `filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. |
	| `SendEmail` | `to_email`, `subject`, `body` | Queue a mock email (used for the hard task). |

	Machine-readable schema: `GET /schema` → `action`, or `GET /tasks` → `action_schema`.
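
As a rough illustration of the envelope described above, the sketch below wraps a payload in the `{"action": {...}}` structure and applies the size limits from the table on the client side. The helper name `make_action` and its error messages are hypothetical and not part of this repository; reading "1M chars" as 1,000,000 characters is also an assumption. The server's own validation (allowed files, SQL policy, task-specific actions) remains authoritative.

```python
from typing import Any

# Limits copied from the action table; the helper itself is hypothetical.
MAX_QUERY_CHARS = 2000
MAX_PATH_CHARS = 255
MAX_CONTENT_CHARS = 1_000_000  # assumption: "1M chars" read as 1,000,000
MAX_ARGS = 20
MAX_ARG_CHARS = 500
KNOWN_TYPES = {"ExecuteSQL", "ReadFile", "WriteFile", "RunScript", "SendEmail"}


def make_action(action_type: str, **payload: Any) -> dict[str, Any]:
    """Wrap a payload in the envelope that a step request carries."""
    if action_type not in KNOWN_TYPES:
        raise ValueError(f"unknown action_type: {action_type}")
    if action_type == "ExecuteSQL":
        if not 1 <= len(payload.get("query", "")) <= MAX_QUERY_CHARS:
            raise ValueError("query must be 1-2000 characters")
    elif action_type == "ReadFile":
        if not 1 <= len(payload.get("filepath", "")) <= MAX_PATH_CHARS:
            raise ValueError("filepath must be 1-255 characters")
    elif action_type == "WriteFile":
        if len(payload.get("content", "")) > MAX_CONTENT_CHARS:
            raise ValueError("content is too long")
    elif action_type == "RunScript":
        filepath = payload.get("filepath", "")
        # "*.py basename": a bare file name with no directory separators.
        if not filepath.endswith(".py") or "/" in filepath or "\\" in filepath:
            raise ValueError("filepath must be a *.py basename")
        args = payload.get("args") or []
        if len(args) > MAX_ARGS or any(len(arg) > MAX_ARG_CHARS for arg in args):
            raise ValueError("too many or too long args")
    return {"action": {"action_type": action_type, "payload": payload}}


print(make_action("ExecuteSQL", query="SELECT COUNT(*) FROM transactions"))
```

Checking limits locally only saves a round trip; it does not replace the per-task validation performed by the environment.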

	---

	## Observation space

	Each `step` / `reset` response includes an observation object (the REST responses also expose wrapper fields such as `reward` / `done`). The table below lists the fields inherited from the OpenEnv base (`done`, `reward`, `metadata`) first, followed by the DataOps-specific fields.

	| Field | Type | Meaning |
	| ----- | ---- | ------- |
	| `done` | boolean | Whether the episode has ended (step limit or terminal condition). |
	| `reward` | number \| null | Shaped step reward after this transition (trajectory signal). |
	| `metadata` | object | OpenEnv extension bucket (usually empty). |
	| `status` | `"success"` \| `"error"` | Whether the action executed successfully. |
	| `message` | string | Short human-readable summary. |
	| `stdout` | string \| null | Captured stdout (e.g. script or file read). |
	| `stderr` | string \| null | Captured stderr. |
	| `sql_results` | list of objects \| null | Row dicts for successful `SELECT`-style outcomes. |
	| `email_delivery_status` | string \| null | Mock send confirmation when applicable. |
	| `step_count` | integer | Steps taken in the episode. |
	| `max_steps` | integer | Episode step budget. |

	Terminal evaluation: The top-level grader score returned by `GET /grader` (or `GET /grader/{task_id}`) reflects the final database, files, and outbox (and, for the hard task, provenance constraints). Publicly reported top-level scores are rounded to 2 decimals and kept strictly inside the open interval (0.00, 1.00), so a score that sits exactly on an internal boundary is reported as 0.01 or 0.99 rather than 0.00 or 1.00. When `details` are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the grader score as the primary benchmark metric; step rewards are a supplementary signal. A successful action can still return `reward=0.0` when it neither improves grader state nor unlocks a milestone.
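
The reporting rule for the top-level score can be pictured with a small sketch. It assumes the normalization is a round-then-clamp step; the function name `public_score` is hypothetical, and the actual logic in `server/grading.py` may differ in detail.

```python
def public_score(raw: float) -> float:
    """Round to 2 decimals, then keep the value strictly inside (0, 1).

    Assumption: internal scores live in [0, 1] and only the exact
    boundaries need to be pulled inward to 0.01 and 0.99.
    """
    return min(0.99, max(0.01, round(raw, 2)))


for raw in (0.0, 0.4567, 1.0):
    print(raw, "->", public_score(raw))
```

Under this reading, an untouched episode and a perfect episode surface as 0.01 and 0.99 respectively, which is consistent with the values quoted throughout this README.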

	Machine-readable schema: `GET /schema` → `observation`.

	---

	## Tasks (descriptions and expected difficulty)

	| Task ID | Expected difficulty | Description |
	| ------- | ------------------- | ----------- |
	| `task_1_easy_anomaly` | Easy | The `transactions` table contains valid rows and rows with NULL `amount`. The agent must delete only the corrupted rows and leave all valid rows unchanged, including legitimate seeded zero-value or negative non-null adjustments. |
	| `task_2_medium_syntax` | Medium | `broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must read, patch, and run the script so `process_data_stream` produces the correct downstream-ready records on both visible and hidden seeded batches. |
	| `task_3_hard_e2e` | Hard | End-to-end incident: query the correct `daily_reports` slice for the scenario date, persist results as `report_data.json`, repair `format_report.py`, run it on that JSON, then send exactly one email whose body matches the formatter output, with scenario-specific recipient and subject. |

	Task list, difficulty labels, and allowed actions per task: `GET /tasks` and `openenv.yaml`.
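
To make the easy task's constraint concrete, the following sketch reproduces it on an in-memory SQLite database. The two-column schema and the sample rows are invented for illustration; the seeded `transactions` table in the environment will have its own columns and data.

```python
import sqlite3

# Hypothetical miniature of the task-1 table: only `amount` matters here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(1, 19.99), (2, None), (3, 0.0), (4, -5.0), (5, None)],
)

# `IS NULL` matches only missing amounts; zero and negative values are
# non-null and therefore survive the cleanup.
deleted = conn.execute("DELETE FROM transactions WHERE amount IS NULL").rowcount
remaining = conn.execute(
    "SELECT id, amount FROM transactions ORDER BY id"
).fetchall()

print(deleted)    # 2
print(remaining)  # [(1, 19.99), (3, 0.0), (4, -5.0)]
```

A filter such as `amount <= 0` or `NOT amount` would also remove the legitimate zero and negative adjustments, which is exactly what the task description rules out.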

	---
	# Setup and usage

	## Get the repository

	```bash
	git clone https://github.com/vishesh-rathi/dataops-openenv.git
	cd dataops-openenv
	```

	## Install `uv` (if needed)

	macOS / Linux

	```bash
	curl -LsSf https://astral.sh/uv/install.sh | sh
	```

	Restart your shell, or load the new path in the current session:

	```bash
	source "$HOME/.local/bin/env"
	```

	Windows (PowerShell)

	```powershell
	powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
	```

	Open a new PowerShell session after installation.

	Verify

	```bash
	uv --version
	```

	---


	Prerequisites: Python 3.12+, [uv](https://docs.astral.sh/uv/).

	```bash
	uv sync
	cp .env.example .env.dev
	printf 'ENV_FILE=.env.dev\n' > .env
	```

	The repo-root `.env` selects the active secondary env file through `ENV_FILE`. Use `.env.dev` for local runtime and model configuration. Hosted deployments that inject environment variables directly can skip both files.

	Run the server from the repository root so the active env file is discovered. Explicitly exported runtime vars such as `HOST`, `PORT`, and `DEBUG` still take precedence:

	```bash
	uv run python -m server
	```
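
The two rules stated above (the repo-root `.env` names the secondary file via `ENV_FILE`, and explicitly exported variables win) can be sketched as follows. This is an illustrative model only: `parse_dotenv` and `effective_config` are hypothetical helpers, files are passed in as strings to keep the example self-contained, and the server's real loader may handle quoting, other keys in `.env`, and edge cases differently.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def effective_config(
    root_env_text: str,
    files: dict[str, str],
    exported: dict[str, str],
) -> dict[str, str]:
    """Resolve ENV_FILE from the root .env, then let exported vars override."""
    env_file = parse_dotenv(root_env_text).get("ENV_FILE", "")
    config = parse_dotenv(files.get(env_file, ""))
    config.update(exported)  # explicitly exported values take precedence
    return config


config = effective_config(
    "ENV_FILE=.env.dev\n",
    {".env.dev": "PORT=7860\nDEBUG=false\n"},
    {"PORT": "9000"},  # e.g. `export PORT=9000` before starting the server
)
print(config)  # {'PORT': '9000', 'DEBUG': 'false'}
```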

	Clients reuse the session created by `POST /reset` on later calls to `/step`, `/state`, and `/grader`, either by sending back the cookie it set via `Set-Cookie` or by passing its `X-Session-ID` header.
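
For clients that prefer the header over the cookie, a minimal standard-library sketch might look like this. The helper `session_request` is hypothetical, and the session ID value is assumed to have been read from the `POST /reset` response beforehand; the function only builds the request so it can be inspected without a running server.

```python
import json
import urllib.request
from typing import Any, Optional


def session_request(
    base_url: str,
    path: str,
    session_id: str,
    body: Optional[dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build a request that carries the session via the X-Session-ID header.

    A JSON body turns the request into a POST (as for /step); without a
    body it is a GET (as for /state and /grader).
    """
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method="POST" if body is not None else "GET",
    )
    request.add_header("X-Session-ID", session_id)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


grader = session_request("http://127.0.0.1:7860", "/grader", "example-session-id")
print(grader.get_method(), grader.full_url)
```

Passing the result to `urllib.request.urlopen` sends it; `example-session-id` is a placeholder, not a real session value.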

	OpenEnv packaging:

	```bash
	uv run openenv validate
	```

	Docker:

	```bash
	printf 'ENV_FILE=.env.dev\n' > .env
	bash build_and_run_image.sh
	```

	The helper script reads repo-root `.env` only to resolve `ENV_FILE`, then passes that secondary file to `docker run --env-file ...`. The container does not receive a merged view of repo-root `.env` plus the secondary file. Keeping `.env*` out of the image is intentional; runtime configuration is injected from the host.

	Baseline inference (local):

	| Variable | Purpose |
	| -------- | ------- |
	| `ENV_BASE_URL` | Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default) |
	| `API_KEY` / `HF_TOKEN` | Model access credential; set exactly one of the two |
	| `API_BASE_URL` | Optional model provider base URL override |
	| `MODEL_NAME` | Optional chat model ID |

	```bash
	export ENV_BASE_URL=http://127.0.0.1:7860
	uv run python inference.py --seed 7 --max-turns 12
	```

	If `API_BASE_URL` is unset, `inference.py` defaults to Google's OpenAI-compatible Gemini endpoint for `API_KEY` and Hugging Face's router for `HF_TOKEN`.

	Flags: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores` (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When `PUBLIC_GRADER_DETAILS=true` and the grader API exposes details, `inference.py` also writes the per-task grader payloads to `stderr`.

	`POST /baseline` runs the same script inside the server process; optional JSON body: `task_ids`, `seed`, `max_turns`. If `ADMIN_API_KEY` is unset, the route is open. If `ADMIN_API_KEY` is set, callers must send `X-Admin-Key`. If `ENV_BASE_URL` is unset, the server injects `http://127.0.0.1:$PORT` into the child process automatically.

	Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.
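
The general idea of a stripped environment can be shown with `subprocess`. This is a sketch of the technique, not the server's sandbox: the timeout and output cap are invented values, `run_isolated` is a hypothetical name, inline code stands in for a workspace script, and resource limits beyond a wall-clock timeout are not shown. It assumes a Linux or macOS host, where a child Python process starts with only `PATH` set.

```python
import os
import subprocess
import sys

MAX_OUTPUT_CHARS = 10_000  # hypothetical cap on captured output
TIMEOUT_S = 10             # hypothetical wall-clock limit


def run_isolated(code: str) -> tuple[int, str]:
    """Run Python code in a child process that inherits no parent secrets."""
    clean_env = {"PATH": os.defpath}  # nothing else is passed through
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=clean_env,
        capture_output=True,
        text=True,
        timeout=TIMEOUT_S,
    )
    return proc.returncode, proc.stdout[:MAX_OUTPUT_CHARS]


# Even if the parent holds a credential, the child cannot see it.
os.environ["API_KEY"] = "parent-only-secret"
returncode, output = run_isolated("import os; print(os.environ.get('API_KEY'))")
print(returncode, output.strip())  # 0 None
```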

	Minimal HTTP smoke test:

	```bash
	curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
	-H 'Content-Type: application/json' \
	-d '{"seed": 7}'

	curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
	-H 'Content-Type: application/json' \
	-d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'

	curl -b cookies.txt 'http://127.0.0.1:7860/grader'
	```

	By default `/grader` returns `task_id` and the normalized top-level `score` only. Full grader `details` require `PUBLIC_GRADER_DETAILS=true` or a valid `X-Admin-Key` when `ADMIN_API_KEY` is set. The public top-level `score` uses the same 2-decimal `0.01..0.99` normalization as `inference.py`; nested `details` component scores remain raw. This does not change the mandatory `[START]` / `[STEP]` / `[END]` lines from `inference.py`; it affects the grader API, the optional trailing JSON emitted by `--json-scores`, and the captured `stderr` payloads written by `inference.py`.

	---

	## Baseline scores

	All figures below are reported terminal scores, rounded to 2 decimals and normalized to (0.00, 1.00). Exact internal boundary scores are surfaced as 0.01 and 0.99. Scores depend on provider, model revision, temperature, and `seed`.

	### Null baseline (no agent actions)

	| Condition | `task_1` | `task_2` | `task_3` | Avg |
	| --------- | -------- | -------- | -------- | --- |
	| `reset` only (`seed=7`), then grader; no `/step` | 0.01 | 0.01 | 0.01 | 0.01 |

	### Reference tool-calling baseline

	`[END] success=true` in the harness logs means the reported terminal score reached 0.99 for that task.

	| Model | Seed | `task_1_easy_anomaly` | `task_2_medium_syntax` | `task_3_hard_e2e` | Average |
	| ----- | ---- | --------------------- | ---------------------- | ----------------- | ------- |
	| `gemini-3.1-flash-lite-preview` | 7 | 0.99 | 0.99 | 0.99 | 0.99 |

	To reproduce a baseline run, start the API server locally on port `7860`, configure model credentials, and run:

	```bash
	export MODEL_NAME=gemini-3.1-flash-lite-preview
	export ENV_BASE_URL=http://127.0.0.1:7860
	uv run python inference.py --seed 7 --max-turns 12 --json-scores
	```

	The final line of stdout is a single JSON object with `scores`, `grades`, `average`, `model`, and `metadata`.
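
Because the JSON object is the last line of stdout, a wrapper script can pick it out without parsing the harness lines. The sketch below is one way to do that; `parse_json_scores` is a hypothetical helper, and the sample output is fabricated purely to exercise it (the harness line contents are placeholders and the JSON carries only a subset of the documented keys).

```python
import json
from typing import Any


def parse_json_scores(stdout: str) -> dict[str, Any]:
    """Return the JSON object found on the last non-empty line of stdout."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    return json.loads(lines[-1])


# Placeholder output: real harness lines carry more detail than shown here.
sample_stdout = "\n".join(
    [
        "[START] placeholder",
        "[STEP] placeholder",
        "[END] placeholder",
        json.dumps({"scores": {"task_1_easy_anomaly": 0.99}, "average": 0.99}),
    ]
)

result = parse_json_scores(sample_stdout)
print(result["average"])  # 0.99
```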

	---

	## Hugging Face Spaces

	There are two ways to run the baseline against a deployed Hugging Face Space:

	1. Run `inference.py` externally against the public Space URL:

	```bash
	export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
	uv run python inference.py --seed 7 --max-turns 12 --json-scores
	```

	In this mode, the Space only needs to expose the environment API (`/reset`, `/step`, `/grader`, `/tasks`, `/schema`, `/health`, `/metadata`, `/ws`, `/mcp`). Model credentials are provided on the machine that runs `inference.py`, not on the Space.

	2. Send a `POST` request to the `/baseline` endpoint:

	```bash
	curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
	-H 'Content-Type: application/json' \
	-d '{"seed": 7, "max_turns": 12}'
	```

	In this mode, the Space itself executes `inference.py`. Configure one model credential source on the Space (`API_KEY` or `HF_TOKEN`). `MODEL_NAME` and `API_BASE_URL` are optional overrides. `ENV_BASE_URL` is not required for `POST /baseline` because the server injects `http://127.0.0.1:$PORT` when it launches the child `inference.py` process. If `ADMIN_API_KEY` is unset, `POST /baseline` is open; if it is set, callers must send `X-Admin-Key`.
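
A scripted equivalent of the `curl` call, including the optional admin header, could be sketched like this. The helper `baseline_request` is hypothetical; it only builds the request from the documented body fields (`task_ids`, `seed`, `max_turns`) and the `X-Admin-Key` header, and does not contact the Space.

```python
import json
import urllib.request
from typing import Any, Optional


def baseline_request(
    base_url: str,
    seed: Optional[int] = None,
    max_turns: Optional[int] = None,
    task_ids: Optional[list[str]] = None,
    admin_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST /baseline request; every body field is optional."""
    body: dict[str, Any] = {}
    if seed is not None:
        body["seed"] = seed
    if max_turns is not None:
        body["max_turns"] = max_turns
    if task_ids:
        body["task_ids"] = task_ids
    request = urllib.request.Request(
        base_url.rstrip("/") + "/baseline",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    if admin_key:
        # Only needed when the deployment sets ADMIN_API_KEY.
        request.add_header("X-Admin-Key", admin_key)
    return request


request = baseline_request(
    "https://visheshrathi-dataops-env.hf.space", seed=7, max_turns=12
)
print(request.get_method(), request.full_url)
```

Sending it with `urllib.request.urlopen(request)` may take a while, since the Space runs the full baseline before it responds.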

	---

	## API reference

	| Method | Path | Purpose |
	| ------ | ---- | ------- |
	| GET | `/health` | Liveness |
	| GET | `/metadata` | Name, description, version, task count |
	| GET | `/schema` | JSON Schemas: action, observation, state |
	| GET | `/tasks` | Tasks + action/observation/state schemas |
	| POST | `/mcp` | Minimal JSON-RPC tool-list compatibility stub |
	| POST | `/reset?task_id=...` | New episode; body may include `seed`, `episode_id` |
	| POST | `/step` | One action; optional `timeout_s` |
	| GET | `/state` | Episode state (`task_id`, `seed`, …) |
	| GET | `/grader` | Normalized reported terminal score for active task |
	| GET | `/grader/{task_id}` | Same; `task_id` must match the active task |
	| POST | `/baseline` | Subprocess baseline (see [Setup and usage](#setup-and-usage)) |
	| WS | `/ws` | OpenEnv WebSocket session |

	---

	## Environment variables (server / container)

	| Variable | Purpose |
	| -------- | ------- |
	| `HOST` | Listen host used by `python -m server` and the container entrypoint |
	| `PORT` | Listen port used by `python -m server` and the container entrypoint |
	| `DEBUG` | Enables reload for local `python -m server` runs |
	| `ENV_FILE` | Repo-relative dotenv loaded after `.env` without overriding externally injected runtime vars |
	| `HTTP_SESSION_TIMEOUT_S` | HTTP session idle TTL; max wall time for `POST /baseline` child |
	| `MAX_HTTP_SESSIONS` | Concurrent HTTP sessions cap |
	| `MAX_WS_SESSIONS` | Concurrent WebSocket sessions cap |
	| `ADMIN_API_KEY` | When set, protects `POST /baseline` and lets `X-Admin-Key` unlock full grader details |
	| `PUBLIC_GRADER_DETAILS` | If `true`, public `/grader` and `/grader/{task_id}` responses include `details` |
	| `COOKIE_SECURE` | Set `Secure` on session cookies (HTTPS) |
	| `CORS_ALLOW_ORIGINS` | Comma-separated origins; empty disables permissive CORS (recommended default) |

	---

	## Tests

	```bash
	uv sync --extra dev
	uv run pytest -q
	```

	---

	## Appendix

	| Command | Description |
	| ------- | ----------- |
	| `uv --version` | Confirm `uv` is installed and available in `PATH`. |
	| `uv init` | Create a new Python project managed by `uv`. |
	| `uv venv` | Create a virtual environment. |
	| `uv sync` | Install dependencies from the project metadata and lockfile. |
	| `uv add <package>` | Add a dependency to the current project. |
	| `uv remove <package>` | Remove a dependency from the current project. |
	| `uv lock` | Update or generate the lockfile. |
	| `uv run <command>` | Run a command inside the project environment. |
	| `uv python install` | Install and manage Python versions through `uv`. |
	| `uv pip install <package>` | Install a package using the `pip`-compatible interface. |
	| `uv tree` | Show the resolved dependency tree. |