Spaces:

visheshrathi
/

dataops-env

Sleeping

App Files Files Community

dataops-env / README.md

visheshrathi

Upload folder using huggingface_hub

a1b343c verified about 1 month ago

preview code

raw

history blame contribute delete

19.5 kB

metadata

title: DataOpsEnv
emoji: 🧩
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
short_description: OpenEnv DataOps — SQLite, ETL repair, three graded tasks.
tags:
  - openenv
base_path: /web

DataOpsEnv

Overview · Tasks · Get the repository · Setup and run · Baseline scores · Hugging Face Spaces · HTTP API · Tests · Appendix

Environment description and motivation

DataOpsEnv is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small SQLite warehouse, repairing Python ETL scripts, and completing an end-to-end reporting incident (extract data, fix a formatter, send a mock email). Episodes are seeded (reset may include seed) so scenarios are reproducible; each HTTP session receives an isolated workspace and database.

Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect real workflows; this environment exercises multi-step tool use, constraint respect, and verifiable outcomes rather than single-shot question answering.

Implementation: FastAPI (server/app.py), environment logic (server/dataops_env_environment.py), terminal graders (server/grading.py), scenario definitions (server/task_specs.py, data/init_db.py), Pydantic types (models.py), OpenEnv manifest (openenv.yaml).

Action space

Each step submits JSON: {"action": {"action_type": "<type>", "payload": { ... }}}. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).

`action_type`	Payload fields	Role
`ExecuteSQL`	`query` (string, 1–2000 chars)	Run task-scoped SQL against the episode SQLite DB.
`ReadFile`	`filepath` (string, 1–255 chars)	Read an allowed file from the episode workspace.
`WriteFile`	`filepath`, `content` (content ≤ 1M chars)	Overwrite an allowed workspace file.
`RunScript`	`filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars)	Execute a Python script in the workspace with optional CLI args.
`SendEmail`	`to_email`, `subject`, `body`	Queue a mock email (used for the hard task).

Machine-readable schema: GET /schema → action, or GET /tasks → action_schema.

Observation space

Each step / reset response includes an observation object (REST also exposes wrapper fields such as reward / done). The fields below describe the DataOps layer; the OpenEnv base also defines done, reward, and metadata.

Field	Type	Meaning
`done`	boolean	Whether the episode has ended (step limit or terminal condition).
`reward`	number \| null	Shaped step reward after this transition (trajectory signal).
`metadata`	object	OpenEnv extension bucket (usually empty).
`status`	`"success"` \| `"error"`	Whether the action executed successfully.
`message`	string	Short human-readable summary.
`stdout`	string \| null	Captured stdout (e.g. script or file read).
`stderr`	string \| null	Captured stderr.
`sql_results`	list of objects \| null	Row dicts for successful `SELECT`-style outcomes.
`email_delivery_status`	string \| null	Mock send confirmation when applicable.
`step_count`	integer	Steps taken in the episode.
`max_steps`	integer	Episode step budget.

Terminal evaluation: The top-level grader score returned by GET /grader (or GET /grader/{task_id}) reflects the final database, files, and outbox (and, for the hard task, provenance constraints). Publicly reported top-level scores are rounded to 2 decimals and normalized to (0.00, 1.00), so exact internal boundary scores are surfaced as 0.01 and 0.99. When details are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the grader as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return reward=0.0 when they neither improve grader state nor unlock a milestone.

Machine-readable schema: GET /schema → observation.

Tasks (descriptions and expected difficulty)

Task ID	Expected difficulty	Description
`task_1_easy_anomaly`	Easy	The `transactions` table contains valid rows and rows with NULL `amount`. The agent must delete only the corrupted rows and leave all valid rows unchanged, including legitimate seeded zero-value or negative non-null adjustments.
`task_2_medium_syntax`	Medium	`broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must read, patch, and run the script so `process_data_stream` produces the correct downstream-ready records on both visible and hidden seeded batches.
`task_3_hard_e2e`	Hard	End-to-end incident: query the correct `daily_reports` slice for the scenario date, persist results as `report_data.json`, repair `format_report.py`, run it on that JSON, then send exactly one email whose body matches the formatter output, with scenario-specific recipient and subject.

Task list, difficulty labels, and allowed actions per task: GET /tasks and openenv.yaml.

Setup and usage

Get the repository

git clone https://github.com/vishesh-rathi/dataops-openenv.git
cd dataops-openenv

Install `uv` (if needed)

macOS / Linux

curl -LsSf https://astral.sh/uv/install.sh | sh

Restart your shell, or load the new path in the current session:

source "$HOME/.local/bin/env"

Windows (PowerShell)

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Open a new PowerShell session after installation.

Verify

uv --version

Prerequisites: Python 3.12+, uv.

uv sync
cp .env.example .env.dev
printf 'ENV_FILE=.env.dev\n' > .env

Repo-root .env selects the active secondary env file. Use .env.dev for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files.

Run the server from the repository root so the active env file is discovered. Explicitly exported runtime vars such as HOST, PORT, and DEBUG still take precedence:

uv run python -m server

Clients reuse the Set-Cookie session cookie or X-Session-ID header from POST /reset on /step, /state, and /grader.

OpenEnv packaging:

uv run openenv validate

Docker:

printf 'ENV_FILE=.env.dev\n' > .env
bash build_and_run_image.sh

The helper script reads repo-root .env only to resolve ENV_FILE, then passes that secondary file to docker run --env-file .... The container does not receive a merged view of repo-root .env plus the secondary file. Keeping .env* out of the image is intentional; runtime configuration is injected from the host.

Baseline inference (local):

Variable	Purpose
`ENV_BASE_URL`	Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default)
`API_KEY` / `HF_TOKEN`	Exactly one model access credential source
`API_BASE_URL`	Optional model provider base URL override
`MODEL_NAME`	Optional Chat model ID

export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12

If API_BASE_URL is unset, inference.py defaults to Google's OpenAI-compatible Gemini endpoint for API_KEY and Hugging Face's router for HF_TOKEN.

Flags: --task (repeatable), --seed, --max-turns, --json-scores (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When PUBLIC_GRADER_DETAILS=true and the grader API exposes details, inference.py also writes the per-task grader payloads to stderr.

POST /baseline runs the same script inside the server process; optional JSON body: task_ids, seed, max_turns. If ADMIN_API_KEY is unset, the route is open. If ADMIN_API_KEY is set, callers must send X-Admin-Key. If ENV_BASE_URL is unset, the server injects http://127.0.0.1:$PORT into the child process automatically.

Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.

Minimal HTTP smoke test:

curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7}'

curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
  -H 'Content-Type: application/json' \
  -d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'

curl -b cookies.txt 'http://127.0.0.1:7860/grader'

By default /grader returns task_id and the normalized top-level score only. Full grader details require PUBLIC_GRADER_DETAILS=true or a valid X-Admin-Key when ADMIN_API_KEY is set. The public top-level score uses the same 2-decimal 0.01..0.99 normalization as inference.py; nested details component scores remain raw. This does not change the mandatory [START] / [STEP] / [END] lines from inference.py; it affects the grader API, the optional trailing JSON emitted by --json-scores, and the captured stderr payloads written by inference.py.

Baseline scores

All figures below are reported terminal scores, rounded to 2 decimals and normalized to (0.00, 1.00). Exact internal boundary scores are surfaced as 0.01 and 0.99. Scores depend on provider, model revision, temperature, and seed.

Null baseline (no agent actions)

Condition	`task_1`	`task_2`	`task_3`	Avg
`reset` only (`seed=7`), then grader; no `/step`	0.01	0.01	0.01	0.01

Reference tool-calling baseline

[END] success=true in the harness logs means the reported terminal score reached 0.99 for that task.

Model	Seed	`task_1_easy_anomaly`	`task_2_medium_syntax`	`task_3_hard_e2e`	Average
`gemini-3.1-flash-lite-preview`	7	0.99	0.99	0.99	0.99

Reproducing a baseline run: With the API server running locally on 7860 and model credentials configured, run:

export MODEL_NAME=gemini-3.1-flash-lite-preview
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12 --json-scores

The final line of stdout is a single JSON object with scores, grades, average, model, and metadata.

Hugging Face Spaces

There are two methods for running the baseline against a deployed Hugging Face Space:

Running inference.py externally against the public Space URL:

export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
uv run python inference.py --seed 7 --max-turns 12 --json-scores

In this mode, the Space only needs to expose the environment API (/reset, /step, /grader, /tasks, /schema, /health, /metadata, /ws, /mcp). Model credentials are provided on the machine that runs inference.py, not on the Space.

Hitting /baseline API with a POST request:

curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7, "max_turns": 12}'

In this mode, the Space itself executes inference.py. Configure one model credential source on the Space (API_KEY or HF_TOKEN). MODEL_NAME and API_BASE_URL are optional overrides. ENV_BASE_URL is not required for POST /baseline because the server injects http://127.0.0.1:$PORT when it launches the child inference.py process. If ADMIN_API_KEY is unset, POST /baseline is open; if it is set, callers must send X-Admin-Key.

API reference

Method	Path	Purpose
GET	`/health`	Liveness
GET	`/metadata`	Name, description, version, task count
GET	`/schema`	JSON Schemas: action, observation, state
GET	`/tasks`	Tasks + action/observation/state schemas
POST	`/mcp`	Minimal JSON-RPC tool-list compatibility stub
POST	`/reset?task_id=...`	New episode; body may include `seed`, `episode_id`
POST	`/step`	One action; optional `timeout_s`
GET	`/state`	Episode state (`task_id`, `seed`, …)
GET	`/grader`	Normalized reported terminal score for active task
GET	`/grader/{task_id}`	Same; `task_id` must match the active task
POST	`/baseline`	Subprocess baseline (see Setup and usage)
WS	`/ws`	OpenEnv WebSocket session

Environment variables (server / container)

Variable	Purpose
`HOST`	Listen host used by `python -m server` and the container entrypoint
`PORT`	Listen port used by `python -m server` and the container entrypoint
`DEBUG`	Enables reload for local `python -m server` runs
`ENV_FILE`	Repo-relative dotenv loaded after `.env` without overriding externally injected runtime vars
`HTTP_SESSION_TIMEOUT_S`	HTTP session idle TTL; max wall time for `POST /baseline` child
`MAX_HTTP_SESSIONS`	Concurrent HTTP sessions cap
`MAX_WS_SESSIONS`	Concurrent WebSocket sessions cap
`ADMIN_API_KEY`	When set, protects `POST /baseline` and lets `X-Admin-Key` unlock full grader details
`PUBLIC_GRADER_DETAILS`	If `true`, public `/grader` and `/grader/{task_id}` responses include `details`
`COOKIE_SECURE`	Set `Secure` on session cookies (HTTPS)
`CORS_ALLOW_ORIGINS`	Comma-separated origins; empty disables permissive CORS (recommended default)

Tests

uv sync --extra dev
uv run pytest -q

Appendix

Command	Description
`uv --version`	Confirm `uv` is installed and available in `PATH`.
`uv init`	Create a new Python project managed by `uv`.
`uv venv`	Create a virtual environment.
`uv sync`	Install dependencies from the project metadata and lockfile.
`uv add <package>`	Add a dependency to the current project.
`uv remove <package>`	Remove a dependency from the current project.
`uv lock`	Update or generate the lockfile.
`uv run <command>`	Run a command inside the project environment.
`uv python install`	Install and manage Python versions through `uv`.
`uv pip install <package>`	Install a package using the `pip`-compatible interface.
`uv tree`	Show the resolved dependency tree.