title: DataOpsEnv
emoji: 🧩
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
short_description: OpenEnv DataOps — SQLite, ETL repair, three graded tasks.
tags:
- openenv
base_path: /web
DataOpsEnv
Overview · Tasks · Get the repository · Setup and run · Baseline scores · Hugging Face Spaces · HTTP API · Tests · Appendix
Environment description and motivation
DataOpsEnv is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small SQLite warehouse, repairing Python ETL scripts, and completing an end-to-end reporting incident (extract data, fix a formatter, send a mock email). Episodes are seeded (reset may include seed) so scenarios are reproducible; each HTTP session receives an isolated workspace and database.
Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect real workflows; this environment exercises multi-step tool use, respect for constraints, and verifiable outcomes rather than single-shot question answering.
Implementation: FastAPI (server/app.py), environment logic (server/dataops_env_environment.py), terminal graders (server/grading.py), scenario definitions (server/task_specs.py, data/init_db.py), Pydantic types (models.py), OpenEnv manifest (openenv.yaml).
Action space
Each step submits JSON: {"action": {"action_type": "<type>", "payload": { ... }}}. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).
| action_type | Payload fields | Role |
|---|---|---|
| ExecuteSQL | query (string, 1–2000 chars) | Run task-scoped SQL against the episode SQLite DB. |
| ReadFile | filepath (string, 1–255 chars) | Read an allowed file from the episode workspace. |
| WriteFile | filepath, content (content ≤ 1M chars) | Overwrite an allowed workspace file. |
| RunScript | filepath (must be *.py basename), args (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. |
| SendEmail | to_email, subject, body | Queue a mock email (used for the hard task). |
Machine-readable schema: GET /schema → action, or GET /tasks → action_schema.
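As a rough illustration of the envelope and limits listed in the action table, the following sketch builds a /step request body on the client side. The helper name make_action is hypothetical and not part of the repository; the server performs its own validation (allowed files, SQL policy, per-task actions), which this sketch does not attempt to reproduce.

```python
import json


def make_action(action_type: str, **payload) -> dict:
    """Return a /step request body after applying the documented size limits."""
    if action_type == "ExecuteSQL":
        if not 1 <= len(payload["query"]) <= 2000:
            raise ValueError("query must be 1-2000 characters")
    elif action_type == "ReadFile":
        if not 1 <= len(payload["filepath"]) <= 255:
            raise ValueError("filepath must be 1-255 characters")
    elif action_type == "WriteFile":
        if len(payload["content"]) > 1_000_000:
            raise ValueError("content must be at most 1M characters")
    elif action_type == "RunScript":
        path = payload["filepath"]
        # "basename" is read here as: no directory separators, ends in .py
        if "/" in path or "\\" in path or not path.endswith(".py"):
            raise ValueError("filepath must be a *.py basename")
        args = payload.get("args", [])
        if len(args) > 20 or any(len(arg) > 500 for arg in args):
            raise ValueError("at most 20 args of at most 500 characters each")
    elif action_type == "SendEmail":
        missing = {"to_email", "subject", "body"} - set(payload)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    else:
        raise ValueError(f"unknown action_type: {action_type}")
    return {"action": {"action_type": action_type, "payload": payload}}


print(json.dumps(make_action("ExecuteSQL", query="SELECT COUNT(*) FROM transactions")))
```

Checking limits before sending only saves a round trip; the authoritative schema remains the one served by GET /schema.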
Observation space
Each step / reset response includes an observation object (REST also exposes wrapper fields such as reward / done). The fields below describe the DataOps layer; the OpenEnv base also defines done, reward, and metadata.
| Field | Type | Meaning |
|---|---|---|
| done | boolean | Whether the episode has ended (step limit or terminal condition). |
| reward | number \| null | Shaped step reward after this transition (trajectory signal). |
| metadata | object | OpenEnv extension bucket (usually empty). |
| status | "success" \| "error" | Whether the action executed successfully. |
| message | string | Short human-readable summary. |
| stdout | string \| null | Captured stdout (e.g. script or file read). |
| stderr | string \| null | Captured stderr. |
| sql_results | list of objects \| null | Row dicts for successful SELECT-style outcomes. |
| email_delivery_status | string \| null | Mock send confirmation when applicable. |
| step_count | integer | Steps taken in the episode. |
| max_steps | integer | Episode step budget. |
Terminal evaluation: The top-level grader score returned by GET /grader (or GET /grader/{task_id}) reflects the final database, files, and outbox (and, for the hard task, provenance constraints). Publicly reported top-level scores are rounded to 2 decimals and normalized to (0.00, 1.00), so exact internal boundary scores are surfaced as 0.01 and 0.99. When details are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the grader as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return reward=0.0 when they neither improve grader state nor unlock a milestone.
Machine-readable schema: GET /schema → observation.
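The reporting rule for top-level scores (two-decimal rounding, with exact boundary values surfaced as 0.01 and 0.99) can be sketched as below. This is an illustration, not the code in server/grading.py: the function name report_score is hypothetical, and the order of operations (round first, then clamp) is an assumption.

```python
def report_score(raw: float) -> float:
    """Round a raw grader score to 2 decimals, then clamp it into 0.01..0.99."""
    rounded = round(raw, 2)
    # Exact internal boundary scores (0.0 and 1.0) never appear publicly.
    return min(0.99, max(0.01, rounded))


for raw in (0.0, 0.42, 1.0):
    print(raw, "->", report_score(raw))
```

Nested component scores in details are described as raw values, so a rule like this would apply only to the top-level score.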
Tasks (descriptions and expected difficulty)
| Task ID | Expected difficulty | Description |
|---|---|---|
| task_1_easy_anomaly | Easy | The transactions table contains valid rows and rows with NULL amount. The agent must delete only the corrupted rows and leave all valid rows unchanged, including legitimate seeded zero-value or negative non-null adjustments. |
| task_2_medium_syntax | Medium | broken_pipeline.py is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must read, patch, and run the script so process_data_stream produces the correct downstream-ready records on both visible and hidden seeded batches. |
| task_3_hard_e2e | Hard | End-to-end incident: query the correct daily_reports slice for the scenario date, persist results as report_data.json, repair format_report.py, run it on that JSON, then send exactly one email whose body matches the formatter output, with scenario-specific recipient and subject. |
Task list, difficulty labels, and allowed actions per task: GET /tasks and openenv.yaml.
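To make the easy task's constraint concrete, the sketch below reproduces it against an in-memory SQLite database. The schema and rows are hypothetical (only the transactions table name and the amount column come from the task description); it shows why a NULL test, rather than a value comparison, leaves zero-value and negative adjustments intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(1, 19.99), (2, None), (3, 0.0), (4, -5.25), (5, None)],
)

# IS NULL matches only missing amounts; 0.0 and -5.25 are non-null and survive.
deleted = conn.execute("DELETE FROM transactions WHERE amount IS NULL").rowcount
remaining = conn.execute("SELECT id, amount FROM transactions ORDER BY id").fetchall()
print(deleted, remaining)
```

A filter such as amount <= 0 would instead remove the legitimate adjustments (rows 3 and 4 here) and keep the corrupted rows, since comparisons against NULL never evaluate to true.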
Setup and usage
Get the repository
git clone https://github.com/vishesh-rathi/dataops-openenv.git
cd dataops-openenv
Install uv (if needed)
macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
Restart your shell, or load the new path in the current session:
source "$HOME/.local/bin/env"
Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Open a new PowerShell session after installation.
Verify
uv --version
Prerequisites: Python 3.12+, uv.
uv sync
cp .env.example .env.dev
printf 'ENV_FILE=.env.dev\n' > .env
Repo-root .env selects the active secondary env file. Use .env.dev for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files.
Run the server from the repository root so the active env file is discovered. Explicitly exported runtime vars such as HOST, PORT, and DEBUG still take precedence:
uv run python -m server
Clients must send the session cookie (from Set-Cookie) or the X-Session-ID header returned by POST /reset on subsequent /step, /state, and /grader requests.
OpenEnv packaging:
uv run openenv validate
Docker:
printf 'ENV_FILE=.env.dev\n' > .env
bash build_and_run_image.sh
The helper script reads repo-root .env only to resolve ENV_FILE, then passes that secondary file to docker run --env-file .... The container does not receive a merged view of repo-root .env plus the secondary file. Keeping .env* out of the image is intentional; runtime configuration is injected from the host.
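The ENV_FILE indirection and the precedence of exported variables can be pictured with a small sketch. It is not the project's loader: parse_dotenv and effective_config are hypothetical names, the parser handles only plain KEY=VALUE lines, and the assumption that secondary-file values override repo-root .env values is mine. As noted above, the Docker path passes only the secondary file, so the merge shown here models a local run.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def effective_config(exported: dict[str, str], root_env: str,
                     secondary_files: dict[str, str]) -> dict[str, str]:
    """Merge configuration so that exported variables always win."""
    root = parse_dotenv(root_env)
    secondary_name = exported.get("ENV_FILE") or root.get("ENV_FILE")
    merged = dict(root)
    if secondary_name:
        # Assumption: the secondary file overrides repo-root .env values.
        merged.update(parse_dotenv(secondary_files.get(secondary_name, "")))
    merged.update(exported)  # explicitly exported vars take precedence
    return merged


config = effective_config(
    exported={"PORT": "9000"},
    root_env="ENV_FILE=.env.dev\n",
    secondary_files={".env.dev": "PORT=7860\nDEBUG=true\n"},
)
print(config)
```

In this example PORT resolves to the exported 9000 while DEBUG comes from .env.dev.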
Baseline inference (local):
| Variable | Purpose |
|---|---|
| ENV_BASE_URL | Environment server URL (default http://127.0.0.1:$PORT, with PORT=7860 by default) |
| API_KEY / HF_TOKEN | Model access credential; set exactly one of the two |
| API_BASE_URL | Optional model provider base URL override |
| MODEL_NAME | Optional chat model ID |
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12
If API_BASE_URL is unset, inference.py defaults to Google's OpenAI-compatible Gemini endpoint for API_KEY and Hugging Face's router for HF_TOKEN.
Flags: --task (repeatable), --seed, --max-turns, --json-scores (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When PUBLIC_GRADER_DETAILS=true and the grader API exposes details, inference.py also writes the per-task grader payloads to stderr.
POST /baseline runs the same script inside the server process; optional JSON body: task_ids, seed, max_turns. If ADMIN_API_KEY is unset, the route is open. If ADMIN_API_KEY is set, callers must send X-Admin-Key. If ENV_BASE_URL is unset, the server injects http://127.0.0.1:$PORT into the child process automatically.
Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.
Minimal HTTP smoke test:
curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
-H 'Content-Type: application/json' \
-d '{"seed": 7}'
curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
-H 'Content-Type: application/json' \
-d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'
curl -b cookies.txt 'http://127.0.0.1:7860/grader'
By default /grader returns task_id and the normalized top-level score only. Full grader details require PUBLIC_GRADER_DETAILS=true or a valid X-Admin-Key when ADMIN_API_KEY is set. The public top-level score uses the same 2-decimal 0.01..0.99 normalization as inference.py; nested details component scores remain raw. This does not change the mandatory [START] / [STEP] / [END] lines from inference.py; it affects the grader API, the optional trailing JSON emitted by --json-scores, and the captured stderr payloads written by inference.py.
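For clients that prefer Python over curl, the same smoke test might look like the sketch below, using only the standard library. The helper names json_post and smoke_test are hypothetical; the cookie jar stands in for curl's cookies.txt so the session cookie from /reset is replayed on /step and /grader. Nothing is sent until smoke_test is called against a running server.

```python
import json
import urllib.request
from http.cookiejar import CookieJar


def json_post(url: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str = "http://127.0.0.1:7860", seed: int = 7) -> dict:
    """Reset task 1, delete the NULL-amount rows, and return the grader response."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    reset_url = f"{base_url}/reset?task_id=task_1_easy_anomaly"
    with opener.open(json_post(reset_url, {"seed": seed})):
        pass  # the session cookie is now stored in the jar
    action = {
        "action": {
            "action_type": "ExecuteSQL",
            "payload": {"query": "DELETE FROM transactions WHERE amount IS NULL"},
        }
    }
    with opener.open(json_post(f"{base_url}/step", action)):
        pass
    with opener.open(f"{base_url}/grader") as response:
        return json.load(response)
```

With the server running locally, calling smoke_test() should return the same grader payload that the final curl command prints.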
Baseline scores
All figures below are reported terminal scores, rounded to 2 decimals and normalized to (0.00, 1.00). Exact internal boundary scores are surfaced as 0.01 and 0.99. Scores depend on provider, model revision, temperature, and seed.
Null baseline (no agent actions)
| Condition | task_1 | task_2 | task_3 | Avg |
|---|---|---|---|---|
| reset only (seed=7), then grader; no /step | 0.01 | 0.01 | 0.01 | 0.01 |
Reference tool-calling baseline
[END] success=true in the harness logs means the reported terminal score reached 0.99 for that task.
| Model | Seed | task_1_easy_anomaly | task_2_medium_syntax | task_3_hard_e2e | Average |
|---|---|---|---|---|---|
| gemini-3.1-flash-lite-preview | 7 | 0.99 | 0.99 | 0.99 | 0.99 |
Reproducing a baseline run: With the API server running locally on 7860 and model credentials configured, run:
export MODEL_NAME=gemini-3.1-flash-lite-preview
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12 --json-scores
The final line of stdout is a single JSON object with scores, grades, average, model, and metadata.
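Because the JSON object is the last line of stdout, a wrapper script can pick it out without parsing the harness lines. The sketch below assumes the captured stdout is available as a string; parse_json_scores is a hypothetical helper, the harness lines in the sample are placeholders, and the per-task shape of scores is an assumption.

```python
import json


def parse_json_scores(stdout: str) -> dict:
    """Return the JSON object found on the last non-empty line of stdout."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    result = json.loads(lines[-1])
    if not isinstance(result, dict):
        raise ValueError("final line is not a JSON object")
    return result


# Placeholder harness lines followed by an assumed trailing object.
sample = (
    "[START] (harness line)\n"
    "[STEP] (harness line)\n"
    "[END] (harness line)\n"
    + json.dumps({"scores": {"task_1_easy_anomaly": 0.99}, "average": 0.99})
    + "\n"
)
parsed = parse_json_scores(sample)
print(parsed["average"])
```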
Hugging Face Spaces
There are two methods for running the baseline against a deployed Hugging Face Space:
- Running inference.py externally against the public Space URL:
export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
uv run python inference.py --seed 7 --max-turns 12 --json-scores
In this mode, the Space only needs to expose the environment API (/reset, /step, /grader, /tasks, /schema, /health, /metadata, /ws, /mcp). Model credentials are provided on the machine that runs inference.py, not on the Space.
- Sending a POST request to the /baseline API:
curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
-H 'Content-Type: application/json' \
-d '{"seed": 7, "max_turns": 12}'
In this mode, the Space itself executes inference.py. Configure one model credential source on the Space (API_KEY or HF_TOKEN). MODEL_NAME and API_BASE_URL are optional overrides. ENV_BASE_URL is not required for POST /baseline because the server injects http://127.0.0.1:$PORT when it launches the child inference.py process. If ADMIN_API_KEY is unset, POST /baseline is open; if it is set, callers must send X-Admin-Key.
API reference
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness |
| GET | /metadata | Name, description, version, task count |
| GET | /schema | JSON Schemas: action, observation, state |
| GET | /tasks | Tasks + action/observation/state schemas |
| POST | /mcp | Minimal JSON-RPC tool-list compatibility stub |
| POST | /reset?task_id=... | New episode; body may include seed, episode_id |
| POST | /step | One action; optional timeout_s |
| GET | /state | Episode state (task_id, seed, …) |
| GET | /grader | Normalized reported terminal score for active task |
| GET | /grader/{task_id} | Same; task_id must match the active task |
| POST | /baseline | Subprocess baseline (see Setup and usage) |
| WS | /ws | OpenEnv WebSocket session |
Environment variables (server / container)
| Variable | Purpose |
|---|---|
| HOST | Listen host used by python -m server and the container entrypoint |
| PORT | Listen port used by python -m server and the container entrypoint |
| DEBUG | Enables reload for local python -m server runs |
| ENV_FILE | Repo-relative dotenv loaded after .env without overriding externally injected runtime vars |
| HTTP_SESSION_TIMEOUT_S | HTTP session idle TTL; max wall time for POST /baseline child |
| MAX_HTTP_SESSIONS | Concurrent HTTP sessions cap |
| MAX_WS_SESSIONS | Concurrent WebSocket sessions cap |
| ADMIN_API_KEY | When set, protects POST /baseline and lets X-Admin-Key unlock full grader details |
| PUBLIC_GRADER_DETAILS | If true, public /grader and /grader/{task_id} responses include details |
| COOKIE_SECURE | Set Secure on session cookies (HTTPS) |
| CORS_ALLOW_ORIGINS | Comma-separated origins; empty disables permissive CORS (recommended default) |
Tests
uv sync --extra dev
uv run pytest -q
Appendix
| Command | Description |
|---|---|
| uv --version | Confirm uv is installed and available in PATH. |
| uv init | Create a new Python project managed by uv. |
| uv venv | Create a virtual environment. |
| uv sync | Install dependencies from the project metadata and lockfile. |
| uv add <package> | Add a dependency to the current project. |
| uv remove <package> | Remove a dependency from the current project. |
| uv lock | Update or generate the lockfile. |
| uv run <command> | Run a command inside the project environment. |
| uv python install | Install and manage Python versions through uv. |
| uv pip install <package> | Install a package using the pip-compatible interface. |
| uv tree | Show the resolved dependency tree. |