---
title: DataOpsEnv
emoji: 🧩
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
short_description: OpenEnv DataOps — SQLite, ETL repair, three graded tasks.
tags:
  - openenv
base_path: /web
---

# DataOpsEnv

[Overview](#environment-description-and-motivation) · [Tasks](#tasks-descriptions-and-expected-difficulty) · [Get the repository](#get-the-repository) · [Setup and usage](#setup-and-usage) · [Baseline scores](#baseline-scores) · [Hugging Face Spaces](#hugging-face-spaces) · [HTTP API](#api-reference) · [Tests](#tests) · [Appendix](#appendix)

## Environment description and motivation

**DataOpsEnv** is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small **SQLite** warehouse, **repairing Python ETL scripts**, and completing an **end-to-end reporting incident** (extract data, fix a formatter, send a mock email). Episodes are **seeded** (`reset` may include `seed`) so scenarios are **reproducible**; each HTTP session receives an **isolated workspace and database**.

Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect **real workflows**; this environment exercises multi-step tool use, adherence to constraints, and verifiable outcomes rather than single-shot question answering.

**Implementation:** FastAPI (`server/app.py`), environment logic (`server/dataops_env_environment.py`), terminal graders (`server/grading.py`), scenario definitions (`server/task_specs.py`, `data/init_db.py`), Pydantic types (`models.py`), OpenEnv manifest (`openenv.yaml`).

---

## Action space

Each step submits JSON: `{"action": {"action_type": "<type>", "payload": { ... }}}`. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).

| `action_type` | Payload fields                                                                                       | Role                                                             |
| ------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| `ExecuteSQL`  | `query` (string, 1–2000 chars)                                                                       | Run task-scoped SQL against the episode SQLite DB.               |
| `ReadFile`    | `filepath` (string, 1–255 chars)                                                                     | Read an allowed file from the episode workspace.                 |
| `WriteFile`   | `filepath`, `content` (content ≤ 1M chars)                                                           | Overwrite an allowed workspace file.                             |
| `RunScript`   | `filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. |
| `SendEmail`   | `to_email`, `subject`, `body`                                                                        | Queue a mock email (used for the hard task).                     |

Machine-readable schema: **`GET /schema`** → `action`, or **`GET /tasks`** → `action_schema`.
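As a rough illustration of the envelope and the limits in the table above, the following client-side sketch wraps a payload and rejects obviously invalid input before it is sent. The helper `build_action` and its constants are hypothetical and not part of the repository; the server remains the authority on validation, and the sketch assumes `WriteFile` and `RunScript` share the 255-character path limit documented for `ReadFile`.

```python
# Limits copied from the action table; the helper itself is hypothetical.
MAX_QUERY_CHARS = 2000
MAX_PATH_CHARS = 255          # documented for ReadFile; assumed for the other file actions
MAX_CONTENT_CHARS = 1_000_000
MAX_ARGS = 20
MAX_ARG_CHARS = 500


def build_action(action_type: str, **payload) -> dict:
    """Wrap a payload in the documented step envelope after basic checks."""
    if action_type == "ExecuteSQL":
        if not 1 <= len(payload["query"]) <= MAX_QUERY_CHARS:
            raise ValueError("query must be 1-2000 characters")
    elif action_type in ("ReadFile", "WriteFile", "RunScript"):
        path = payload["filepath"]
        if not 1 <= len(path) <= MAX_PATH_CHARS:
            raise ValueError("filepath must be 1-255 characters")
        if action_type == "WriteFile" and len(payload["content"]) > MAX_CONTENT_CHARS:
            raise ValueError("content must be at most 1M characters")
        if action_type == "RunScript":
            # The table requires a *.py basename, so reject directory components.
            if not path.endswith(".py") or "/" in path or "\\" in path:
                raise ValueError("RunScript filepath must be a *.py basename")
            args = payload.get("args", [])
            if len(args) > MAX_ARGS or any(len(a) > MAX_ARG_CHARS for a in args):
                raise ValueError("at most 20 args, each at most 500 characters")
    elif action_type == "SendEmail":
        missing = {"to_email", "subject", "body"} - payload.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    else:
        raise ValueError(f"unknown action_type: {action_type}")
    return {"action": {"action_type": action_type, "payload": payload}}
```

For example, `build_action("ExecuteSQL", query="SELECT 1")` returns a dict that can be serialized with `json.dumps` and posted to `/step`.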

---

## Observation space

Each `step` / `reset` response includes an observation object (the REST responses also expose `reward` and `done` as top-level wrapper fields). In the table below, `done`, `reward`, and `metadata` come from the OpenEnv base; the remaining fields belong to the **DataOps** layer.

| Field                   | Type                     | Meaning                                                           |
| ----------------------- | ------------------------ | ----------------------------------------------------------------- |
| `done`                  | boolean                  | Whether the episode has ended (step limit or terminal condition). |
| `reward`                | number \| null           | Shaped **step reward** after this transition (trajectory signal). |
| `metadata`              | object                   | OpenEnv extension bucket (usually empty).                         |
| `status`                | `"success"` \| `"error"` | Whether the action executed successfully.                         |
| `message`               | string                   | Short human-readable summary.                                     |
| `stdout`                | string \| null           | Captured stdout (e.g. script or file read).                       |
| `stderr`                | string \| null           | Captured stderr.                                                  |
| `sql_results`           | list of objects \| null  | Row dicts for successful `SELECT`-style outcomes.                 |
| `email_delivery_status` | string \| null           | Mock send confirmation when applicable.                           |
| `step_count`            | integer                  | Steps taken in the episode.                                       |
| `max_steps`             | integer                  | Episode step budget.                                              |

**Terminal evaluation:** The top-level **grader score** returned by **`GET /grader`** (or **`GET /grader/{task_id}`**) reflects the **final** database, files, and outbox (and, for the hard task, **provenance** constraints). Publicly reported top-level scores are rounded to 2 decimals and kept strictly inside the open interval **(0.00, 1.00)**: an internal score at the lower boundary is reported as **0.01** and one at the upper boundary as **0.99**. When **`details`** are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the **grader** score as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return **`reward=0.0`** when they neither improve grader state nor unlock a milestone.
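The reporting rule can be sketched in a few lines. This is only an illustration: the function name `report_score` is hypothetical, and the order of operations (round first, then clamp) is an assumption, since the actual logic lives in the server's grading code.

```python
def report_score(raw: float) -> float:
    """Map an internal score in [0, 1] to the publicly reported value."""
    rounded = round(raw, 2)               # public scores carry two decimals
    return min(0.99, max(0.01, rounded))  # boundary scores surface as 0.01 / 0.99


print(report_score(0.0))  # 0.01
print(report_score(1.0))  # 0.99
print(report_score(0.5))  # 0.5
```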

Machine-readable schema: **`GET /schema`** → `observation`.

---

## Tasks (descriptions and expected difficulty)

| Task ID                | Expected difficulty | Description                                                                                                                                                                                                                                                                                                                               |
| ---------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `task_1_easy_anomaly`  | **Easy**            | The `transactions` table contains valid rows and rows with **NULL** `amount`. The agent must **delete only** the corrupted rows and leave all valid rows **unchanged**, including legitimate seeded zero-value or negative non-null adjustments.                                                                                          |
| `task_2_medium_syntax` | **Medium**          | `broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must **read**, **patch**, and **run** the script so **`process_data_stream`** produces the correct downstream-ready records on both visible and hidden seeded batches.                                              |
| `task_3_hard_e2e`      | **Hard**            | **End-to-end incident:** query the correct **`daily_reports`** slice for the **scenario date**, persist results as **`report_data.json`**, **repair** **`format_report.py`**, **run** it on that JSON, then **send exactly one** email whose **body matches** the formatter output, with scenario-specific **recipient** and **subject**. |

Task list, difficulty labels, and allowed actions per task: **`GET /tasks`** and **`openenv.yaml`**.
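To make the easy task's constraint concrete, here is a small self-contained sketch on an in-memory SQLite database. The two-column schema and the sample rows are invented for illustration (the seeded `transactions` table may have more columns); the point is that filtering on `amount IS NULL` removes only the corrupted rows, while zero and negative amounts survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real seeded table is defined by the environment.
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (id, amount) VALUES (?, ?)",
    [(1, 19.99), (2, None), (3, 0.0), (4, -5.25), (5, None)],
)

# Snapshot the valid rows so we can confirm they are untouched afterwards.
before = conn.execute(
    "SELECT id, amount FROM transactions WHERE amount IS NOT NULL ORDER BY id"
).fetchall()

deleted = conn.execute("DELETE FROM transactions WHERE amount IS NULL").rowcount
after = conn.execute("SELECT id, amount FROM transactions ORDER BY id").fetchall()

print(deleted)          # 2
print(after == before)  # True
```

A filter such as `amount <= 0` would instead delete the legitimate zero-value and negative adjustments (rows 3 and 4 here) and leave the `NULL` rows in place, because comparisons against `NULL` never evaluate to true in SQL.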

---

## Setup and usage

## Get the repository

```bash
git clone https://github.com/vishesh-rathi/dataops-openenv.git
cd dataops-openenv
```

## Install `uv` (if needed)

**macOS / Linux**

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Restart your shell, or load the new path in the current session:

```bash
source "$HOME/.local/bin/env"
```

**Windows (PowerShell)**

```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Open a new PowerShell session after installation.

**Verify**

```bash
uv --version
```

---


**Prerequisites:** Python **3.12+**, **[uv](https://docs.astral.sh/uv/)**.

```bash
uv sync
cp .env.example .env.dev
printf 'ENV_FILE=.env.dev\n' > .env
```

Repo-root **`.env`** selects the active secondary env file. Use **`.env.dev`** for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files.

**Run the server** from the repository root so the active env file is discovered. Explicitly exported runtime vars such as **`HOST`**, **`PORT`**, and **`DEBUG`** still take precedence:

```bash
uv run python -m server
```
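The configuration layering described above can be pictured with a short sketch. It is a simplified model, not the server's loader: `parse_dotenv` and `effective_config` are hypothetical helpers, the parser handles only plain `KEY=VALUE` lines, and the assumption that the secondary file overrides repo-root `.env` is mine. The one rule taken directly from this README is that explicitly exported variables win.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def effective_config(exported: dict[str, str], root_env_text: str, read_file) -> dict[str, str]:
    """Layer repo-root .env, then the ENV_FILE target, then exported variables."""
    root = parse_dotenv(root_env_text)
    secondary_name = exported.get("ENV_FILE") or root.get("ENV_FILE")
    secondary = parse_dotenv(read_file(secondary_name)) if secondary_name else {}
    # Later mappings win: exported variables always take precedence.
    return {**root, **secondary, **exported}


files = {".env.dev": "PORT=8000\nDEBUG=true\n"}
config = effective_config({"PORT": "9000"}, "ENV_FILE=.env.dev\n", files.__getitem__)
print(config["PORT"], config["DEBUG"])  # 9000 true
```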

Clients must send the session cookie (from **`Set-Cookie`**) or the **`X-Session-ID`** header returned by **`POST /reset`** on subsequent **`/step`**, **`/state`**, and **`/grader`** requests.
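For clients that prefer the header over a cookie jar, a request for `/step` might be assembled as in the sketch below. It uses only the standard library; `step_request` is a hypothetical helper, and the session ID is assumed to have been read from the `POST /reset` response beforehand.

```python
import json
import urllib.request


def step_request(base_url: str, session_id: str, action: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /step request bound to an existing session."""
    body = json.dumps({"action": action}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/step",
        data=body,
        headers={"Content-Type": "application/json", "X-Session-ID": session_id},
        method="POST",
    )


req = step_request(
    "http://127.0.0.1:7860",
    "abc123",  # placeholder session ID
    {"action_type": "ReadFile", "payload": {"filepath": "broken_pipeline.py"}},
)
print(req.full_url)  # http://127.0.0.1:7860/step
```

Passing the request to `urllib.request.urlopen(req)` would send it; the sketch stops short of that so it can be run without a server.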

**OpenEnv packaging:**

```bash
uv run openenv validate
```

**Docker:**

```bash
printf 'ENV_FILE=.env.dev\n' > .env
bash build_and_run_image.sh
```

The helper script reads repo-root **`.env`** only to resolve **`ENV_FILE`**, then passes that secondary file to `docker run --env-file ...`. The container does **not** receive a merged view of repo-root **`.env`** plus the secondary file. Keeping `.env*` out of the image is intentional; runtime configuration is injected from the host.

**Baseline inference (local):**

| Variable               | Purpose                                                                                |
| ---------------------- | -------------------------------------------------------------------------------------- |
| `ENV_BASE_URL`         | Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default) |
| `API_KEY` / `HF_TOKEN` | Model access credential; set exactly one of the two                                    |
| `API_BASE_URL`         | Optional model provider base URL override                                              |
| `MODEL_NAME`           | Optional chat model ID                                                                 |

```bash
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12
```

If **`API_BASE_URL`** is unset, `inference.py` defaults to Google's OpenAI-compatible Gemini endpoint for **`API_KEY`** and Hugging Face's router for **`HF_TOKEN`**.

Flags: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores` (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When `PUBLIC_GRADER_DETAILS=true` and the grader API exposes details, `inference.py` also writes the per-task grader payloads to `stderr`.

**`POST /baseline`** runs the same script inside the server process; optional JSON body: `task_ids`, `seed`, `max_turns`. If **`ADMIN_API_KEY`** is unset, the route is open. If **`ADMIN_API_KEY`** is set, callers must send **`X-Admin-Key`**. If **`ENV_BASE_URL`** is unset, the server injects **`http://127.0.0.1:$PORT`** into the child process automatically.

Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.
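The general idea of a stripped environment can be illustrated as follows. This is not the server's sandbox: `run_isolated`, the timeout, and the output cap are hypothetical values chosen for the example, the sketch assumes a POSIX host, and it enforces only a wall-clock timeout rather than full resource limits.

```python
import subprocess
import sys

MAX_CAPTURE_CHARS = 10_000  # hypothetical cap on captured output
TIMEOUT_S = 10              # hypothetical wall-clock limit


def run_isolated(script_args: list[str]) -> tuple[int, str, str]:
    """Run Python with a minimal environment so parent secrets are not inherited."""
    proc = subprocess.run(
        [sys.executable, *script_args],
        env={"PATH": "/usr/bin:/bin"},  # nothing from os.environ is passed through
        capture_output=True,
        text=True,
        timeout=TIMEOUT_S,
    )
    return proc.returncode, proc.stdout[:MAX_CAPTURE_CHARS], proc.stderr[:MAX_CAPTURE_CHARS]
```

Even if the parent process has `API_KEY` set, a child started this way sees `os.environ.get("API_KEY")` as `None`, because `env=` replaces the inherited environment instead of extending it.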

**Minimal HTTP smoke test:**

```bash
curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7}'

curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
  -H 'Content-Type: application/json' \
  -d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'

curl -b cookies.txt 'http://127.0.0.1:7860/grader'
```

By default **`/grader`** returns **`task_id`** and the normalized top-level **`score`** only. Full grader **`details`** require **`PUBLIC_GRADER_DETAILS=true`** or a valid **`X-Admin-Key`** when **`ADMIN_API_KEY`** is set. The public top-level **`score`** uses the same 2-decimal **`0.01..0.99`** normalization as `inference.py`; nested **`details`** component scores remain raw. This does **not** change the mandatory `[START]` / `[STEP]` / `[END]` lines from `inference.py`; it affects the grader API, the optional trailing JSON emitted by `--json-scores`, and the captured `stderr` payloads written by `inference.py`.
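Because `details` may or may not be present, client code should treat it as optional. A minimal sketch, with `summarize_grade` as a hypothetical helper and no assumption about the internal shape of `details`:

```python
def summarize_grade(payload: dict) -> str:
    """Format a grader response, tolerating the absence of `details`."""
    line = f"{payload['task_id']}: {payload['score']:.2f}"
    if payload.get("details") is not None:
        line += " (details available)"
    return line


print(summarize_grade({"task_id": "task_1_easy_anomaly", "score": 0.99}))
# task_1_easy_anomaly: 0.99
```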

---

## Baseline scores

All figures below are **reported terminal scores**, rounded to 2 decimals and kept strictly inside the open interval **(0.00, 1.00)**, so internal boundary scores appear as **0.01** and **0.99**. Scores depend on provider, model revision, temperature, and `seed`.

### Null baseline (no agent actions)

| Condition                                            | `task_1` | `task_2` | `task_3` | Avg  |
| ---------------------------------------------------- | -------- | -------- | -------- | ---- |
| `reset` only (`seed=7`), then grader; **no** `/step` | 0.01     | 0.01     | 0.01     | 0.01 |

### Reference tool-calling baseline

`[END] success=true` in the harness logs means the reported terminal score reached **0.99** for that task.

| Model                           | Seed | `task_1_easy_anomaly` | `task_2_medium_syntax` | `task_3_hard_e2e` | Average |
| ------------------------------- | ---- | --------------------- | ---------------------- | ----------------- | ------- |
| `gemini-3.1-flash-lite-preview` | 7    | 0.99                  | 0.99                   | 0.99              | 0.99    |

**Reproducing a baseline run:** With the API server running locally on `7860` and model credentials configured, run:

```bash
export MODEL_NAME=gemini-3.1-flash-lite-preview
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12 --json-scores
```

The final line of stdout is a single JSON object with **`scores`**, **`grades`**, **`average`**, **`model`**, and **`metadata`**.
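When driving `inference.py` from another script, the trailing JSON object can be picked out of the captured stdout as sketched below. The helper name `parse_json_scores` is hypothetical, and the sketch relies only on the stated fact that the JSON object is the last non-empty line.

```python
import json


def parse_json_scores(stdout_text: str) -> dict:
    """Return the JSON object on the last non-empty line of the harness output."""
    lines = [line for line in stdout_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    # Earlier lines are the [START] / [STEP] / [END] harness lines and are ignored.
    return json.loads(lines[-1])
```

The text to parse would typically come from `subprocess.run(..., capture_output=True, text=True).stdout`; the returned dict then exposes the documented keys such as `average`.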

---

## Hugging Face Spaces

There are two methods for running the baseline against a deployed Hugging Face Space:

1. Running **`inference.py`** externally against the public Space URL:

```bash
export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
uv run python inference.py --seed 7 --max-turns 12 --json-scores
```

In this mode, the Space only needs to expose the environment API (`/reset`, `/step`, `/grader`, `/tasks`, `/schema`, `/health`, `/metadata`, `/ws`, `/mcp`). Model credentials are provided on the machine that runs **`inference.py`**, not on the Space.

2. Sending a `POST` request to **`/baseline`**:

```bash
curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7, "max_turns": 12}'
```

In this mode, the Space itself executes **`inference.py`**. Configure one model credential source on the Space (**`API_KEY`** or **`HF_TOKEN`**). **`MODEL_NAME`** and **`API_BASE_URL`** are optional overrides. **`ENV_BASE_URL`** is not required for **`POST /baseline`** because the server injects **`http://127.0.0.1:$PORT`** when it launches the child `inference.py` process. If **`ADMIN_API_KEY`** is unset, **`POST /baseline`** is open; if it is set, callers must send **`X-Admin-Key`**.
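The same call can be prepared from Python. The sketch below builds the request with the optional body fields and adds `X-Admin-Key` only when a key is supplied; `baseline_request` is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request


def baseline_request(base_url, seed=None, max_turns=None, task_ids=None, admin_key=None):
    """Build a POST /baseline request; unset body fields are omitted."""
    fields = {"task_ids": task_ids, "seed": seed, "max_turns": max_turns}
    body = {key: value for key, value in fields.items() if value is not None}
    headers = {"Content-Type": "application/json"}
    if admin_key:
        headers["X-Admin-Key"] = admin_key  # required only when ADMIN_API_KEY is set
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/baseline",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = baseline_request("https://visheshrathi-dataops-env.hf.space", seed=7, max_turns=12)
print(json.loads(req.data))  # {'seed': 7, 'max_turns': 12}
```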

---

## API reference

| Method | Path                 | Purpose                                                       |
| ------ | -------------------- | ------------------------------------------------------------- |
| GET    | `/health`            | Liveness                                                      |
| GET    | `/metadata`          | Name, description, version, task count                        |
| GET    | `/schema`            | JSON Schemas: action, observation, state                      |
| GET    | `/tasks`             | Tasks + action/observation/state schemas                      |
| POST   | `/mcp`               | Minimal JSON-RPC tool-list compatibility stub                 |
| POST   | `/reset?task_id=...` | New episode; body may include `seed`, `episode_id`            |
| POST   | `/step`              | One action; optional `timeout_s`                              |
| GET    | `/state`             | Episode state (`task_id`, `seed`, …)                          |
| GET    | `/grader`            | Normalized reported terminal score for active task            |
| GET    | `/grader/{task_id}`  | Same; `task_id` must match the active task                    |
| POST   | `/baseline`          | Subprocess baseline (see [Setup and usage](#setup-and-usage)) |
| WS     | `/ws`                | OpenEnv WebSocket session                                     |

---

## Environment variables (server / container)

| Variable                 | Purpose                                                                                       |
| ------------------------ | --------------------------------------------------------------------------------------------- |
| `HOST`                   | Listen host used by `python -m server` and the container entrypoint                           |
| `PORT`                   | Listen port used by `python -m server` and the container entrypoint                           |
| `DEBUG`                  | Enables reload for local `python -m server` runs                                              |
| `ENV_FILE`               | Repo-relative dotenv loaded after `.env` without overriding externally injected runtime vars  |
| `HTTP_SESSION_TIMEOUT_S` | HTTP session idle TTL; also the maximum wall-clock time for the **`POST /baseline`** child process |
| `MAX_HTTP_SESSIONS`      | Concurrent HTTP sessions cap                                                                  |
| `MAX_WS_SESSIONS`        | Concurrent WebSocket sessions cap                                                             |
| `ADMIN_API_KEY`          | When set, protects **`POST /baseline`** and lets **`X-Admin-Key`** unlock full grader details |
| `PUBLIC_GRADER_DETAILS`  | If `true`, public **`/grader`** and **`/grader/{task_id}`** responses include **`details`**   |
| `COOKIE_SECURE`          | Set `Secure` on session cookies (HTTPS)                                                       |
| `CORS_ALLOW_ORIGINS`     | Comma-separated origins; empty disables permissive CORS (recommended default)                 |

---

## Tests

```bash
uv sync --extra dev
uv run pytest -q
```

---

## Appendix

| Command | Description |
| ------- | ----------- |
| `uv --version` | Confirm `uv` is installed and available in `PATH`. |
| `uv init` | Create a new Python project managed by `uv`. |
| `uv venv` | Create a virtual environment. |
| `uv sync` | Install dependencies from the project metadata and lockfile. |
| `uv add <package>` | Add a dependency to the current project. |
| `uv remove <package>` | Remove a dependency from the current project. |
| `uv lock` | Update or generate the lockfile. |
| `uv run <command>` | Run a command inside the project environment. |
| `uv python install` | Install and manage Python versions through `uv`. |
| `uv pip install <package>` | Install a package using the `pip`-compatible interface. |
| `uv tree` | Show the resolved dependency tree. |