File size: 19,481 Bytes
8afce53
f89b1ac
 
 
 
8afce53
f89b1ac
 
 
 
 
8afce53
 
f89b1ac
 
a1b343c
f89b1ac
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a1b343c
f89b1ac
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a1b343c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f89b1ac
 
 
 
 
 
 
 
 
 
 
 
 
a1b343c
f89b1ac
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a1b343c
f89b1ac
 
 
 
 
a1b343c
f89b1ac
 
 
 
 
a1b343c
f89b1ac
 
 
a1b343c
f89b1ac
 
 
a1b343c
f89b1ac
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a1b343c
f89b1ac
 
 
 
 
 
 
 
 
 
 
 
 
a1b343c
f89b1ac
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a1b343c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
---
title: DataOpsEnv
emoji: 🧩
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
short_description: OpenEnv DataOps  SQLite, ETL repair, three graded tasks.
tags:
  - openenv
base_path: /web
---

# DataOpsEnv

[Overview](#environment-description-and-motivation) · [Tasks](#tasks-descriptions-and-expected-difficulty) · [Get the repository](#get-the-repository) · [Setup and run](#setup-and-usage) · [Baseline scores](#baseline-scores) · [Hugging Face Spaces](#hugging-face-spaces) · [HTTP API](#api-reference) · [Tests](#tests) · [Appendix](#appendix)

## Environment description and motivation

**DataOpsEnv** is an OpenEnv-compliant benchmark in which an agent performs data-engineering work: inspecting a small **SQLite** warehouse, **repairing Python ETL scripts**, and completing an **end-to-end reporting incident** (extract data, fix a formatter, send a mock email). Episodes are **seeded** (`reset` may include `seed`) so scenarios are **reproducible**; each HTTP session receives an **isolated workspace and database**.

Many agent benchmarks are game-like or shallow. Data cleaning, script debugging, and stakeholder communication reflect **real workflows**; this environment exercises multi-step tool use, constraint respect, and verifiable outcomes rather than single-shot question answering.

**Implementation:** FastAPI (`server/app.py`), environment logic (`server/dataops_env_environment.py`), terminal graders (`server/grading.py`), scenario definitions (`server/task_specs.py`, `data/init_db.py`), Pydantic types (`models.py`), OpenEnv manifest (`openenv.yaml`).

---

## Action space

Each step submits JSON: `{"action": {"action_type": "<type>", "payload": { ... }}}`. Payloads are validated per task (allowed files, SQL policy, email enabled only on the hard task).

| `action_type` | Payload fields                                                                                       | Role                                                             |
| ------------- | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------- |
| `ExecuteSQL`  | `query` (string, 1–2000 chars)                                                                       | Run task-scoped SQL against the episode SQLite DB.               |
| `ReadFile`    | `filepath` (string, 1–255 chars)                                                                     | Read an allowed file from the episode workspace.                 |
| `WriteFile`   | `filepath`, `content` (content ≤ 1M chars)                                                           | Overwrite an allowed workspace file.                             |
| `RunScript`   | `filepath` (must be `*.py` basename), `args` (optional list of strings, ≤ 20 args, each ≤ 500 chars) | Execute a Python script in the workspace with optional CLI args. |
| `SendEmail`   | `to_email`, `subject`, `body`                                                                        | Queue a mock email (used for the hard task).                     |

Machine-readable schema: **`GET /schema`**`action`, or **`GET /tasks`**`action_schema`.

---

## Observation space

Each `step` / `reset` response includes an observation object (REST also exposes wrapper fields such as `reward` / `done`). The fields below describe the **DataOps** layer; the OpenEnv base also defines `done`, `reward`, and `metadata`.

| Field                   | Type                     | Meaning                                                           |
| ----------------------- | ------------------------ | ----------------------------------------------------------------- |
| `done`                  | boolean                  | Whether the episode has ended (step limit or terminal condition). |
| `reward`                | number \| null           | Shaped **step reward** after this transition (trajectory signal). |
| `metadata`              | object                   | OpenEnv extension bucket (usually empty).                         |
| `status`                | `"success"` \| `"error"` | Whether the action executed successfully.                         |
| `message`               | string                   | Short human-readable summary.                                     |
| `stdout`                | string \| null           | Captured stdout (e.g. script or file read).                       |
| `stderr`                | string \| null           | Captured stderr.                                                  |
| `sql_results`           | list of objects \| null  | Row dicts for successful `SELECT`-style outcomes.                 |
| `email_delivery_status` | string \| null           | Mock send confirmation when applicable.                           |
| `step_count`            | integer                  | Steps taken in the episode.                                       |
| `max_steps`             | integer                  | Episode step budget.                                              |

**Terminal evaluation:** The top-level **grader score** returned by **`GET /grader`** (or **`GET /grader/{task_id}`**) reflects the **final** database, files, and outbox (and, for the hard task, **provenance** constraints). Publicly reported top-level scores are rounded to 2 decimals and normalized to **(0.00, 1.00)**, so exact internal boundary scores are surfaced as **0.01** and **0.99**. When **`details`** are exposed, nested component scores remain raw diagnostic values. Hackathon-style evaluations typically treat the **grader** as the primary benchmark metric; step rewards remain a supplementary signal. Successful actions can still return **`reward=0.0`** when they neither improve grader state nor unlock a milestone.

Machine-readable schema: **`GET /schema`** → `observation`.

---

## Tasks (descriptions and expected difficulty)

| Task ID                | Expected difficulty | Description                                                                                                                                                                                                                                                                                                                               |
| ---------------------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `task_1_easy_anomaly`  | **Easy**            | The `transactions` table contains valid rows and rows with **NULL** `amount`. The agent must **delete only** the corrupted rows and leave all valid rows **unchanged**, including legitimate seeded zero-value or negative non-null adjustments.                                                                                          |
| `task_2_medium_syntax` | **Medium**          | `broken_pipeline.py` is a seeded ETL normalization job with broken filtering, priority logic, and ordering. The agent must **read**, **patch**, and **run** the script so **`process_data_stream`** produces the correct downstream-ready records on both visible and hidden seeded batches.                                              |
| `task_3_hard_e2e`      | **Hard**            | **End-to-end incident:** query the correct **`daily_reports`** slice for the **scenario date**, persist results as **`report_data.json`**, **repair** **`format_report.py`**, **run** it on that JSON, then **send exactly one** email whose **body matches** the formatter output, with scenario-specific **recipient** and **subject**. |

Task list, difficulty labels, and allowed actions per task: **`GET /tasks`** and **`openenv.yaml`**.

---
# Setup and usage

## Get the repository

```bash
git clone https://github.com/vishesh-rathi/dataops-openenv.git
cd dataops-openenv
```

## Install `uv` (if needed)

**macOS / Linux**

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Restart your shell, or load the new path in the current session:

```bash
source "$HOME/.local/bin/env"
```

**Windows (PowerShell)**

```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Open a new PowerShell session after installation.

**Verify**

```bash
uv --version
```

---


**Prerequisites:** Python **3.12+**, **[uv](https://docs.astral.sh/uv/)**.

```bash
uv sync
cp .env.example .env.dev
printf 'ENV_FILE=.env.dev\n' > .env
```

Repo-root **`.env`** selects the active secondary env file. Use **`.env.dev`** for local runtime/model configuration. Hosted deployments that inject environment variables directly can skip both files.

**Run the server** from the repository root so the active env file is discovered. Explicitly exported runtime vars such as **`HOST`**, **`PORT`**, and **`DEBUG`** still take precedence:

```bash
uv run python -m server
```

Clients reuse the **`Set-Cookie`** session cookie or **`X-Session-ID`** header from **`POST /reset`** on **`/step`**, **`/state`**, and **`/grader`**.

**OpenEnv packaging:**

```bash
uv run openenv validate
```

**Docker:**

```bash
printf 'ENV_FILE=.env.dev\n' > .env
bash build_and_run_image.sh
```

The helper script reads repo-root **`.env`** only to resolve **`ENV_FILE`**, then passes that secondary file to `docker run --env-file ...`. The container does **not** receive a merged view of repo-root **`.env`** plus the secondary file. Keeping `.env*` out of the image is intentional; runtime configuration is injected from the host.

**Baseline inference (local):**

| Variable               | Purpose                                                                                |
| ---------------------- | -------------------------------------------------------------------------------------- |
| `ENV_BASE_URL`         | Environment server URL (default `http://127.0.0.1:$PORT`, with `PORT=7860` by default) |
| `API_KEY` / `HF_TOKEN` | Exactly one model access credential source                                             |
| `API_BASE_URL`         | Optional model provider base URL override                                              |
| `MODEL_NAME`           | Optional Chat model ID                                                                 |

```bash
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12
```

If **`API_BASE_URL`** is unset, `inference.py` defaults to Google's OpenAI-compatible Gemini endpoint for **`API_KEY`** and Hugging Face's router for **`HF_TOKEN`**.

Flags: `--task` (repeatable), `--seed`, `--max-turns`, `--json-scores` (emits one JSON object on stdout after the harness lines, including raw grader payloads when available). When `PUBLIC_GRADER_DETAILS=true` and the grader API exposes details, `inference.py` also writes the per-task grader payloads to `stderr`.

**`POST /baseline`** runs the same script inside the server process; optional JSON body: `task_ids`, `seed`, `max_turns`. If **`ADMIN_API_KEY`** is unset, the route is open. If **`ADMIN_API_KEY`** is set, callers must send **`X-Admin-Key`**. If **`ENV_BASE_URL`** is unset, the server injects **`http://127.0.0.1:$PORT`** into the child process automatically.

Agent-executed Python scripts run with a stripped environment, bounded resources, and capped captured output so task verification does not inherit model-provider secrets from the server process.

**Minimal HTTP smoke test:**

```bash
curl -c cookies.txt -X POST 'http://127.0.0.1:7860/reset?task_id=task_1_easy_anomaly' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7}'

curl -b cookies.txt -X POST 'http://127.0.0.1:7860/step' \
  -H 'Content-Type: application/json' \
  -d '{"action":{"action_type":"ExecuteSQL","payload":{"query":"DELETE FROM transactions WHERE amount IS NULL"}}}'

curl -b cookies.txt 'http://127.0.0.1:7860/grader'
```

By default **`/grader`** returns **`task_id`** and the normalized top-level **`score`** only. Full grader **`details`** require **`PUBLIC_GRADER_DETAILS=true`** or a valid **`X-Admin-Key`** when **`ADMIN_API_KEY`** is set. The public top-level **`score`** uses the same 2-decimal **`0.01..0.99`** normalization as `inference.py`; nested **`details`** component scores remain raw. This does **not** change the mandatory `[START]` / `[STEP]` / `[END]` lines from `inference.py`; it affects the grader API, the optional trailing JSON emitted by `--json-scores`, and the captured `stderr` payloads written by `inference.py`.

---

## Baseline scores

All figures below are **reported terminal scores**, rounded to 2 decimals and normalized to **(0.00, 1.00)**. Exact internal boundary scores are surfaced as **0.01** and **0.99**. Scores depend on provider, model revision, temperature, and `seed`.

### Null baseline (no agent actions)

| Condition                                            | `task_1` | `task_2` | `task_3` | Avg  |
| ---------------------------------------------------- | -------- | -------- | -------- | ---- |
| `reset` only (`seed=7`), then grader; **no** `/step` | 0.01     | 0.01     | 0.01     | 0.01 |

### Reference tool-calling baseline

`[END] success=true` in the harness logs means the reported terminal score reached **0.99** for that task.

| Model                           | Seed | `task_1_easy_anomaly` | `task_2_medium_syntax` | `task_3_hard_e2e` | Average |
| ------------------------------- | ---- | --------------------- | ---------------------- | ----------------- | ------- |
| `gemini-3.1-flash-lite-preview` | 7    | 0.99                  | 0.99                   | 0.99              | 0.99    |

**Reproducing a baseline run:** With the API server running locally on `7860` and model credentials configured, run:

```bash
export MODEL_NAME=gemini-3.1-flash-lite-preview
export ENV_BASE_URL=http://127.0.0.1:7860
uv run python inference.py --seed 7 --max-turns 12 --json-scores
```

The final line of stdout is a single JSON object with **`scores`**, **`grades`**, **`average`**, **`model`**, and **`metadata`**.

---

## Hugging Face Spaces

There are two methods for running the baseline against a deployed Hugging Face Space:

1. Running **`inference.py`** externally against the public Space URL:

```bash
export ENV_BASE_URL=https://visheshrathi-dataops-env.hf.space
uv run python inference.py --seed 7 --max-turns 12 --json-scores
```

In this mode, the Space only needs to expose the environment API (`/reset`, `/step`, `/grader`, `/tasks`, `/schema`, `/health`, `/metadata`, `/ws`, `/mcp`). Model credentials are provided on the machine that runs **`inference.py`**, not on the Space.

2. Hitting **`/baseline`** API with a `POST` request:

```bash
curl -X POST 'https://visheshrathi-dataops-env.hf.space/baseline' \
  -H 'Content-Type: application/json' \
  -d '{"seed": 7, "max_turns": 12}'
```

In this mode, the Space itself executes **`inference.py`**. Configure one model credential source on the Space (**`API_KEY`** or **`HF_TOKEN`**). **`MODEL_NAME`** and **`API_BASE_URL`** are optional overrides. **`ENV_BASE_URL`** is not required for **`POST /baseline`** because the server injects **`http://127.0.0.1:$PORT`** when it launches the child `inference.py` process. If **`ADMIN_API_KEY`** is unset, **`POST /baseline`** is open; if it is set, callers must send **`X-Admin-Key`**.

---

## API reference

| Method | Path                 | Purpose                                                       |
| ------ | -------------------- | ------------------------------------------------------------- |
| GET    | `/health`            | Liveness                                                      |
| GET    | `/metadata`          | Name, description, version, task count                        |
| GET    | `/schema`            | JSON Schemas: action, observation, state                      |
| GET    | `/tasks`             | Tasks + action/observation/state schemas                      |
| POST   | `/mcp`               | Minimal JSON-RPC tool-list compatibility stub                 |
| POST   | `/reset?task_id=...` | New episode; body may include `seed`, `episode_id`            |
| POST   | `/step`              | One action; optional `timeout_s`                              |
| GET    | `/state`             | Episode state (`task_id`, `seed`, …)                          |
| GET    | `/grader`            | Normalized reported terminal score for active task            |
| GET    | `/grader/{task_id}`  | Same; `task_id` must match the active task                    |
| POST   | `/baseline`          | Subprocess baseline (see [Setup and usage](#setup-and-usage)) |
| WS     | `/ws`                | OpenEnv WebSocket session                                     |

---

## Environment variables (server / container)

| Variable                 | Purpose                                                                                       |
| ------------------------ | --------------------------------------------------------------------------------------------- |
| `HOST`                   | Listen host used by `python -m server` and the container entrypoint                           |
| `PORT`                   | Listen port used by `python -m server` and the container entrypoint                           |
| `DEBUG`                  | Enables reload for local `python -m server` runs                                              |
| `ENV_FILE`               | Repo-relative dotenv loaded after `.env` without overriding externally injected runtime vars  |
| `HTTP_SESSION_TIMEOUT_S` | HTTP session idle TTL; max wall time for **`POST /baseline`** child                           |
| `MAX_HTTP_SESSIONS`      | Concurrent HTTP sessions cap                                                                  |
| `MAX_WS_SESSIONS`        | Concurrent WebSocket sessions cap                                                             |
| `ADMIN_API_KEY`          | When set, protects **`POST /baseline`** and lets **`X-Admin-Key`** unlock full grader details |
| `PUBLIC_GRADER_DETAILS`  | If `true`, public **`/grader`** and **`/grader/{task_id}`** responses include **`details`**   |
| `COOKIE_SECURE`          | Set `Secure` on session cookies (HTTPS)                                                       |
| `CORS_ALLOW_ORIGINS`     | Comma-separated origins; empty disables permissive CORS (recommended default)                 |

---

## Tests

```bash
uv sync --extra dev
uv run pytest -q
```

---

## Appendix

| Command | Description |
| ------- | ----------- |
| `uv --version` | Confirm `uv` is installed and available in `PATH`. |
| `uv init` | Create a new Python project managed by `uv`. |
| `uv venv` | Create a virtual environment. |
| `uv sync` | Install dependencies from the project metadata and lockfile. |
| `uv add <package>` | Add a dependency to the current project. |
| `uv remove <package>` | Remove a dependency from the current project. |
| `uv lock` | Update or generate the lockfile. |
| `uv run <command>` | Run a command inside the project environment. |
| `uv python install` | Install and manage Python versions through `uv`. |
| `uv pip install <package>` | Install a package using the `pip`-compatible interface. |
| `uv tree` | Show the resolved dependency tree. |