---
title: DataOpsEnv
emoji: 🧩
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
short_description: OpenEnv DataOps - SQLite, ETL repair, three graded tasks.
tags:
  - openenv
base_path: /web
---

DataOpsEnv

Overview · Tasks · Get the repository · Setup and run · Baseline scores · Hugging Face Spaces · HTTP API · Tests · Appendix

Environment description and motivation

DataOpsEnv is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small SQLite warehouse, repairing Python ETL scripts, and completing an end-to-end reporting incident (extract data, fix a formatter, send a mock email). Episodes are seeded (reset may include seed) so scenarios are reproducible; each HTTP session receives an isolated workspace and database.

Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect real workflows; this environment exercises multi-step tool use, constraint respect, and verifiable outcomes rather than single-shot question answering.

Implementation: FastAPI (server/app.py), environment logic (server/dataops_env_environment.py), terminal graders (server/grading.py), scenario definitions (server/task_specs.py, data/init_db.py), Pydantic types (models.py), OpenEnv manifest (openenv.yaml).


Action space

Each step submits JSON: {"action": {"action_type": "<type>", "payload": { ... }}}. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).

| action_type | Payload fields | Role |
| --- | --- | --- |
| ExecuteSQL | query (string, 1–2000 chars) | Run task-scoped SQL against the episode SQLite DB. |
| ReadFile | filepath (string, 1–255 chars) | Read an allowed file from the episode workspace. |
| WriteFile | filepath, content (content ≤ 1M chars) | Overwrite an allowed workspace file. |
| RunScript | filepath (must be `*.py` basename), args (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. |
| SendEmail | to_email, subject, body | Queue a mock email (used for the hard task). |

Machine-readable schema: GET /schema → action, or GET /tasks → action_schema.
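The action envelope and the length limits in the table above can be sketched as small client-side helpers. This is only an illustration: the helper names (`make_action`, `execute_sql`, `read_file`, `run_script`) are hypothetical and not part of the repository, and the server remains the source of truth for validation, including per-task rules that are not modeled here.

```python
from __future__ import annotations

# Limits copied from the action table; per-task policy (allowed files,
# SQL policy) is enforced by the server and is not modeled here.
MAX_QUERY_CHARS = 2000
MAX_FILEPATH_CHARS = 255
MAX_ARGS = 20
MAX_ARG_CHARS = 500


def make_action(action_type: str, payload: dict) -> dict:
    """Wrap a payload in the {"action": {...}} envelope sent to /step."""
    return {"action": {"action_type": action_type, "payload": payload}}


def execute_sql(query: str) -> dict:
    if not 1 <= len(query) <= MAX_QUERY_CHARS:
        raise ValueError("query must be 1-2000 characters")
    return make_action("ExecuteSQL", {"query": query})


def read_file(filepath: str) -> dict:
    if not 1 <= len(filepath) <= MAX_FILEPATH_CHARS:
        raise ValueError("filepath must be 1-255 characters")
    return make_action("ReadFile", {"filepath": filepath})


def run_script(filepath: str, args: list[str] | None = None) -> dict:
    # "basename" is read here as: no directory separators in the name.
    if not filepath.endswith(".py") or "/" in filepath or "\\" in filepath:
        raise ValueError("filepath must be a *.py basename")
    payload: dict = {"filepath": filepath}
    if args:
        if len(args) > MAX_ARGS or any(len(a) > MAX_ARG_CHARS for a in args):
            raise ValueError("at most 20 args, each at most 500 characters")
        payload["args"] = list(args)
    return make_action("RunScript", payload)
```

Checking limits locally mainly saves a step from the episode budget; a payload that passes these checks can still be rejected by the server.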


Observation space

Each step / reset response includes an observation object (REST also exposes wrapper fields such as reward / done). The fields below describe the DataOps layer; the OpenEnv base also defines done, reward, and metadata.

| Field | Type | Meaning |
| --- | --- | --- |
| done | boolean | Whether the episode has ended (step limit or terminal condition). |
| reward | number \| null | Shaped step reward after this transition (trajectory signal). |
| metadata | object | OpenEnv extension bucket (usually empty). |
| status | "success" \| "error" | Whether the action executed successfully. |
| message | string | Short human-readable summary. |
| stdout | string \| null | Captured stdout (e.g. script or file read). |
| stderr | string \| null | Captured stderr. |
| sql_results | list of objects \| null | Row dicts for successful SELECT-style outcomes. |
| email_delivery_status | string \| null | Mock send confirmation when applicable. |
| step_count | integer | Steps taken in the episode. |
| max_steps | integer | Episode step budget. |

Terminal evaluation: The top-level grader score returned by GET /grader (or GET /grader/{task_id}) reflects the final database, files, and outbox (and, for the hard task, provenance constraints). Publicly reported top-level scores are rounded to 2 decimals and normalized to (0.00, 1.00), so exact internal boundary scores are surfaced as 0.01 and 0.99. When details are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the grader as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return reward=0.0 when they neither improve grader state nor unlock a milestone.

Machine-readable schema: GET /schema → observation.
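For clients that prefer typed access, the observation fields listed above can be mirrored in a dataclass. This is a minimal sketch, not the project's models.py: the `Observation` class and `parse_observation` helper are hypothetical, and the assumption that the response nests the object under an `"observation"` key should be checked against GET /schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Observation:
    # Fields mirror the observation table; types follow its "Type" column.
    status: str
    message: str
    step_count: int
    max_steps: int
    done: bool = False
    reward: float | None = None
    stdout: str | None = None
    stderr: str | None = None
    sql_results: list[dict[str, Any]] | None = None
    email_delivery_status: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


def parse_observation(response: dict[str, Any]) -> Observation:
    """Build an Observation from a /step or /reset response body.

    Assumes the body contains an "observation" object; unknown keys are
    ignored so the sketch tolerates fields added by the server later.
    """
    raw = response["observation"]
    known = set(Observation.__dataclass_fields__)
    return Observation(**{k: v for k, v in raw.items() if k in known})
```

A missing required field (for example `status`) raises `TypeError`, which surfaces schema drift early instead of hiding it behind defaults.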


Tasks (descriptions and expected difficulty)

| Task ID | Expected difficulty | Description |
| --- | --- | --- |
| task_1_easy_anomaly | Easy | The transactions table contains valid rows and rows with NULL amount. The agent must delete only the corrupted rows and leave all valid rows unchanged, including legitimate seeded zero-value or negative non-null adjustments. |
| task_2_medium_syntax | Medium | broken_pipeline.py is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must read, patch, and run the script so process_data_stream produces the correct downstream-ready records on both visible and hidden seeded batches. |
| task_3_hard_e2e | Hard | End-to-end incident: query the correct daily_reports slice for the scenario date, persist results as report_data.json, repair format_report.py, run it on that JSON, then send exactly one email whose body matches the formatter output, with scenario-specific recipient and subject. |

Task list, difficulty labels, and allowed actions per task: GET /tasks and openenv.yaml.
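The constraint in task_1_easy_anomaly (remove NULL amounts but keep zero and negative adjustments) comes down to the difference between `IS NULL` and a numeric comparison. The sketch below reproduces it on an in-memory SQLite database; the `id` column and the sample rows are hypothetical, since the real schema comes from data/init_db.py.

```python
import sqlite3

# Hypothetical stand-in for the episode database: only the table name
# "transactions" and the "amount" column are taken from the task text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(1, 19.99), (2, None), (3, 0.0), (4, -5.0), (5, None)],
)

# IS NULL matches only missing amounts. A predicate such as "amount <= 0"
# would also remove the legitimate zero and negative rows.
deleted = conn.execute("DELETE FROM transactions WHERE amount IS NULL").rowcount
remaining = conn.execute(
    "SELECT id, amount FROM transactions ORDER BY id"
).fetchall()

print(deleted)    # 2
print(remaining)  # [(1, 19.99), (3, 0.0), (4, -5.0)]
```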


Setup and usage

Get the repository

git clone https://github.com/vishesh-rathi/dataops-openenv.git
cd dataops-openenv

Install uv (if needed)

macOS / Linux

curl -LsSf https://astral.sh/uv/install.sh | sh

Restart your shell, or load the new path in the current session:

source "$HOME/.local/bin/env"

Windows (PowerShell)

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Open a new PowerShell session after installation.

Verify

uv --version

Prerequisites: Python 3.12+, uv.

uv sync
cp .env.example .env.dev
printf 'ENV_FILE=.env.dev\n' > .env

Repo-root .env selects the active secondary env file. Use .env.dev for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files.

Run the server from the repository root so the active env file is discovered. Explicitly exported runtime vars such as HOST, PORT, and DEBUG still take precedence:

uv run python -m server
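The layering described above (repo-root .env, then the file named by ENV_FILE, with exported variables taking precedence) can be illustrated as follows. This is a sketch under assumptions, not the server's loader: `read_dotenv` and `resolve_config` are hypothetical names, the parser handles only plain KEY=VALUE lines, and letting the secondary file override repo-root .env values is an assumption.

```python
from __future__ import annotations

from pathlib import Path


def read_dotenv(path: Path) -> dict[str, str]:
    """Parse plain KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_config(repo_root: Path, environ: dict[str, str]) -> dict[str, str]:
    """Merge repo-root .env, then the ENV_FILE target, then exported vars."""
    merged: dict[str, str] = {}
    root_env = repo_root / ".env"
    if root_env.exists():
        merged.update(read_dotenv(root_env))
    # An exported ENV_FILE wins over the one written in repo-root .env.
    secondary = environ.get("ENV_FILE") or merged.get("ENV_FILE")
    if secondary and (repo_root / secondary).exists():
        merged.update(read_dotenv(repo_root / secondary))  # assumed order
    merged.update(environ)  # explicitly exported variables always win
    return merged
```

With `ENV_FILE=.env.dev` in .env, `PORT=9000` in .env.dev, and `PORT=7860` exported in the shell, the resolved port is 7860.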

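Clients must reuse the session created by POST /reset on subsequent /step, /state, and /grader requests, either by sending back the cookie from its Set-Cookie header or by passing the X-Session-ID header value it returned.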

OpenEnv packaging:

uv run openenv validate

Docker:

printf 'ENV_FILE=.env.dev\n' > .env
bash build_and_run_image.sh

The helper script reads repo-root .env only to resolve ENV_FILE, then passes that secondary file to docker run --env-file .... The container does not receive a merged view of repo-root .env plus the secondary file. Keeping .env* out of the image is intentional; runtime configuration is injected from the host.

Baseline inference (local):

| Variable | Purpose |
| --- | --- |
| ENV_BASE_URL | Environment server URL (default http://127.0.0.1:$PORT, with PORT=7860 by default) |
| API_KEY / HF_TOKEN | Exactly one model access credential source |
| API_BASE_URL | Optional model provider base URL override |
| MODEL_NAME | Optional Chat model ID |

export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12

If API_BASE_URL is unset, inference.py defaults to Google's OpenAI-compatible Gemini endpoint for API_KEY and Hugging Face's router for HF_TOKEN.

Flags: --task (repeatable), --seed, --max-turns, --json-scores (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When PUBLIC_GRADER_DETAILS=true and the grader API exposes details, inference.py also writes the per-task grader payloads to stderr.

POST /baseline runs the same script inside the server process; optional JSON body: task_ids, seed, max_turns. If ADMIN_API_KEY is unset, the route is open. If ADMIN_API_KEY is set, callers must send X-Admin-Key. If ENV_BASE_URL is unset, the server injects http://127.0.0.1:$PORT into the child process automatically.

Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.
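The isolation idea above can be illustrated with the standard library. This is not the server's implementation: `run_isolated`, the timeout, and the output cap are hypothetical, and a wall-clock timeout is only one kind of resource bound.

```python
from __future__ import annotations

import os
import subprocess
import sys

MAX_OUTPUT_CHARS = 10_000  # hypothetical cap on captured output
TIMEOUT_S = 10             # hypothetical wall-clock limit


def run_isolated(script_name: str, args: list[str], cwd: str) -> dict:
    """Run a workspace script without inheriting the parent's environment."""
    # Only PATH is passed through, so variables such as API_KEY or HF_TOKEN
    # never reach the child. Some platforms need a few more variables.
    env = {"PATH": os.environ.get("PATH", "")}
    proc = subprocess.run(
        [sys.executable, script_name, *args],
        cwd=cwd,
        env=env,
        capture_output=True,
        text=True,
        timeout=TIMEOUT_S,  # raises subprocess.TimeoutExpired when exceeded
    )
    return {
        "returncode": proc.returncode,
        "stdout": proc.stdout[:MAX_OUTPUT_CHARS],
        "stderr": proc.stderr[:MAX_OUTPUT_CHARS],
    }
```

A script that prints `os.environ.get("API_KEY", "absent")` reports `absent` under this runner even when the parent process has API_KEY set.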

Minimal HTTP smoke test:

curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7}'

curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
  -H 'Content-Type: application/json' \
  -d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'

curl -b cookies.txt 'http://127.0.0.1:7860/grader'

By default /grader returns task_id and the normalized top-level score only. Full grader details require PUBLIC_GRADER_DETAILS=true or a valid X-Admin-Key when ADMIN_API_KEY is set. The public top-level score uses the same 2-decimal 0.01..0.99 normalization as inference.py; nested details component scores remain raw. This does not change the mandatory [START] / [STEP] / [END] lines from inference.py; it affects the grader API, the optional trailing JSON emitted by --json-scores, and the captured stderr payloads written by inference.py.
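The same smoke test can be driven from Python with only the standard library. The sketch below uses a cookie jar, mirroring the `-c` / `-b` flags in the curl commands. The function names and `BASE_URL` are illustrative, and `smoke_test()` needs a locally running server.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://127.0.0.1:7860"  # local default from this README


def build_request(path, body=None):
    """Return a POST with a JSON body, or a GET when no body is given."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {} if data is None else {"Content-Type": "application/json"}
    method = "GET" if data is None else "POST"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def smoke_test():
    """Reset task 1, delete the NULL rows, then fetch the grader score."""
    # The cookie jar replays the session cookie set by POST /reset.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )

    def call(path, body=None):
        with opener.open(build_request(path, body), timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))

    call("/reset?task_id=task_1_easy_anomaly", {"seed": 7})
    call("/step", {"action": {
        "action_type": "ExecuteSQL",
        "payload": {"query": "DELETE FROM transactions WHERE amount IS NULL"},
    }})
    return call("/grader")
```

Calling `smoke_test()` returns the parsed /grader response, which by default contains task_id and the normalized top-level score.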


Baseline scores

All figures below are reported terminal scores, rounded to 2 decimals and normalized to (0.00, 1.00). Exact internal boundary scores are surfaced as 0.01 and 0.99. Scores depend on provider, model revision, temperature, and seed.
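One way to read the reporting rule above is "round to two decimals, then clamp into [0.01, 0.99]". The helper below is a hypothetical sketch of that reading, not code from server/grading.py.

```python
def report_score(raw: float) -> float:
    """Round a raw grader score to 2 decimals and keep it inside (0, 1)."""
    rounded = round(raw, 2)
    # Exact boundary values 0.0 and 1.0 are surfaced as 0.01 and 0.99.
    return min(0.99, max(0.01, rounded))


print(report_score(0.0))  # 0.01
print(report_score(0.5))  # 0.5
print(report_score(1.0))  # 0.99
```

Under this reading, a reported 0.99 is the highest attainable score and corresponds to a perfect internal score.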

Null baseline (no agent actions)

| Condition | task_1 | task_2 | task_3 | Avg |
| --- | --- | --- | --- | --- |
| reset only (seed=7), then grader; no /step | 0.01 | 0.01 | 0.01 | 0.01 |

Reference tool-calling baseline

[END] success=true in the harness logs means the reported terminal score reached 0.99 for that task.

| Model | Seed | task_1_easy_anomaly | task_2_medium_syntax | task_3_hard_e2e | Average |
| --- | --- | --- | --- | --- | --- |
| gemini-3.1-flash-lite-preview | 7 | 0.99 | 0.99 | 0.99 | 0.99 |

Reproducing a baseline run: With the API server running locally on port 7860 and model credentials configured, run:

export MODEL_NAME=gemini-3.1-flash-lite-preview
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12 --json-scores

The final line of stdout is a single JSON object with scores, grades, average, model, and metadata.
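Because the JSON object is the last line of stdout, a wrapper script can recover it by skipping the harness lines. In this sketch `parse_json_scores` is a hypothetical helper and the sample output is invented for illustration. Only the top-level key names (scores, grades, average, model, metadata) come from the description above; the shape of their values is assumed.

```python
import json


def parse_json_scores(stdout_text: str) -> dict:
    """Return the JSON object found on the last non-empty stdout line."""
    lines = [line for line in stdout_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("stdout is empty")
    return json.loads(lines[-1])


# Invented sample: real harness lines and value shapes may differ.
sample = (
    "[START] ...\n"
    "[STEP] ...\n"
    "[END] ...\n"
    '{"scores": {"task_1_easy_anomaly": 0.99}, "grades": {}, '
    '"average": 0.99, "model": "example-model", "metadata": {}}\n'
)
result = parse_json_scores(sample)
print(result["average"])  # 0.99
```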


Hugging Face Spaces

There are two methods for running the baseline against a deployed Hugging Face Space:

  1. Running inference.py externally against the public Space URL:
export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
uv run python inference.py --seed 7 --max-turns 12 --json-scores

In this mode, the Space only needs to expose the environment API (/reset, /step, /grader, /tasks, /schema, /health, /metadata, /ws, /mcp). Model credentials are provided on the machine that runs inference.py, not on the Space.

  2. Sending a POST request to the /baseline endpoint:
curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7, "max_turns": 12}'

In this mode, the Space itself executes inference.py. Configure one model credential source on the Space (API_KEY or HF_TOKEN). MODEL_NAME and API_BASE_URL are optional overrides. ENV_BASE_URL is not required for POST /baseline because the server injects http://127.0.0.1:$PORT when it launches the child inference.py process. If ADMIN_API_KEY is unset, POST /baseline is open; if it is set, callers must send X-Admin-Key.
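The second method can also be scripted. The sketch below builds the POST /baseline request and attaches X-Admin-Key only when a key is supplied, matching the open and protected cases described above. `build_baseline_request` is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_baseline_request(base_url, seed=None, max_turns=None,
                           task_ids=None, admin_key=None):
    """Build a POST /baseline request; every body field is optional."""
    fields = {"task_ids": task_ids, "seed": seed, "max_turns": max_turns}
    body = {key: value for key, value in fields.items() if value is not None}
    headers = {"Content-Type": "application/json"}
    if admin_key:
        # Needed only when ADMIN_API_KEY is set on the server.
        headers["X-Admin-Key"] = admin_key
    return urllib.request.Request(
        base_url.rstrip("/") + "/baseline",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_baseline_request(
    "https://visheshrathi-dataops-env.hf.space", seed=7, max_turns=12
)
print(req.full_url)  # https://visheshrathi-dataops-env.hf.space/baseline
```

Sending it is a matter of passing `req` to `urllib.request.urlopen` with a generous timeout, since the server runs the whole baseline before responding.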


API reference

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /health | Liveness |
| GET | /metadata | Name, description, version, task count |
| GET | /schema | JSON Schemas: action, observation, state |
| GET | /tasks | Tasks + action/observation/state schemas |
| POST | /mcp | Minimal JSON-RPC tool-list compatibility stub |
| POST | /reset?task_id=... | New episode; body may include seed, episode_id |
| POST | /step | One action; optional timeout_s |
| GET | /state | Episode state (task_id, seed, …) |
| GET | /grader | Normalized reported terminal score for active task |
| GET | /grader/{task_id} | Same; task_id must match the active task |
| POST | /baseline | Subprocess baseline (see Setup and usage) |
| WS | /ws | OpenEnv WebSocket session |

Environment variables (server / container)

| Variable | Purpose |
| --- | --- |
| HOST | Listen host used by python -m server and the container entrypoint |
| PORT | Listen port used by python -m server and the container entrypoint |
| DEBUG | Enables reload for local python -m server runs |
| ENV_FILE | Repo-relative dotenv loaded after .env without overriding externally injected runtime vars |
| HTTP_SESSION_TIMEOUT_S | HTTP session idle TTL; max wall time for POST /baseline child |
| MAX_HTTP_SESSIONS | Concurrent HTTP sessions cap |
| MAX_WS_SESSIONS | Concurrent WebSocket sessions cap |
| ADMIN_API_KEY | When set, protects POST /baseline and lets X-Admin-Key unlock full grader details |
| PUBLIC_GRADER_DETAILS | If true, public /grader and /grader/{task_id} responses include details |
| COOKIE_SECURE | Set Secure on session cookies (HTTPS) |
| CORS_ALLOW_ORIGINS | Comma-separated origins; empty disables permissive CORS (recommended default) |

Tests

uv sync --extra dev
uv run pytest -q

Appendix

| Command | Description |
| --- | --- |
| `uv --version` | Confirm uv is installed and available in PATH. |
| `uv init` | Create a new Python project managed by uv. |
| `uv venv` | Create a virtual environment. |
| `uv sync` | Install dependencies from the project metadata and lockfile. |
| `uv add <package>` | Add a dependency to the current project. |
| `uv remove <package>` | Remove a dependency from the current project. |
| `uv lock` | Update or generate the lockfile. |
| `uv run <command>` | Run a command inside the project environment. |
| `uv python install` | Install and manage Python versions through uv. |
| `uv pip install <package>` | Install a package using the pip-compatible interface. |
| `uv tree` | Show the resolved dependency tree. |