huggingmenfordays commited on
Commit
2a37f10
·
verified ·
1 Parent(s): 6dd29c1

Upload folder using huggingface_hub

Browse files
Files changed (2) hide show
  1. README.md +906 -115
  2. tests/test_packaginge.py +3 -2
README.md CHANGED
@@ -9,50 +9,162 @@ tags:
9
  base_path: /web
10
  ---
11
 
12
- # sysadmin env
13
-
14
- sysadmin env is an openenv style environment for training and evaluating agents on realistic linux remediation tasks.
15
-
16
- the environment models work that humans actually do in production support and platform engineering.
17
-
18
- the agent must inspect a broken system, choose shell commands, repair the fault, and stop when the service is healthy again.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
20
- ## environment motivation
21
 
22
- most agent benchmarks are either too synthetic or too shallow.
 
 
 
 
 
 
 
 
23
 
24
- this environment focuses on a practical domain with immediate utility.
25
 
26
- - service recovery
27
- - disk pressure diagnosis
28
- - routing and dns recovery
29
- - reward shaping across a multi step remediation trajectory
30
 
31
- ## openenv interface
32
 
33
- the server exposes validator friendly http endpoints and a websocket agent endpoint.
 
 
 
 
 
 
34
 
35
- - `POST /reset`
36
- - `POST /step`
37
- - `GET /state`
38
- - `GET /health`
39
- - `GET /tasks`
40
- - `WS /ws`
41
 
42
- the http contract is intended to satisfy `openenv validate` style probing.
43
 
44
- ## action space
45
 
46
- the action model is a json object with these fields.
47
 
48
  ```json
49
  {
50
- "command": "string",
51
  "reasoning": "string or null"
52
  }
53
  ```
54
 
55
- for the validator facing http api, the action is wrapped as.
 
 
 
56
 
57
  ```json
58
  {
@@ -63,9 +175,9 @@ for the validator facing http api, the action is wrapped as.
63
  }
64
  ```
65
 
66
- ## observation space
67
 
68
- each step returns execution output plus episode metadata.
69
 
70
  ```json
71
  {
@@ -81,7 +193,16 @@ each step returns execution output plus episode metadata.
81
  }
82
  ```
83
 
84
- the `GET /state` endpoint returns.
 
 
 
 
 
 
 
 
 
85
 
86
  ```json
87
  {
@@ -94,182 +215,852 @@ the `GET /state` endpoint returns.
94
  }
95
  ```
96
 
97
- ## task set
98
 
99
- the benchmark contains three deterministic tasks with increasing difficulty.
100
 
101
- | task | difficulty | objective |
102
- | --- | --- | --- |
103
- | nginx_crash | easy | fix nginx config syntax and stale pid state so the service starts |
104
- | disk_full | medium | find and neutralize the hidden file filling the loopback mount |
105
- | network_broken | hard | repair routing and dns in the constrained task runtime |
106
 
107
- each task includes a programmatic grader and partial reward signals.
 
 
 
 
108
 
109
- ## reward behavior
110
 
111
- reward is shaped over the full trajectory.
 
 
112
 
113
- - positive reward for relevant diagnosis
114
- - positive reward for partial remediation progress
115
- - negative step penalty for wasted commands
116
- - negative outcome for destructive behavior
117
 
118
- ## required inference configuration
119
 
120
- the baseline inference script supports the required submission variables.
 
 
121
 
122
- ```dotenv
123
- # submission credential used by the baseline agent.
124
- HF_TOKEN="your_api_key_here"
125
- MODEL_NAME="gpt-5.4"
126
- API_BASE_URL="https://api.openai.com/v1"
127
- OPENAI_REASONING_EFFORT="medium"
128
- SYSADMIN_ENV_SERVER_URL="ws://127.0.0.1:8000/ws"
129
- SYSADMIN_ENV_HEALTHCHECK_URL="http://127.0.0.1:8000/health"
130
- SYSADMIN_ENV_TASKS_URL="http://127.0.0.1:8000/tasks"
131
- # leave blank to evaluate every task returned by `/tasks` in order.
132
- SYSADMIN_ENV_TASK_ID=""
133
- MODEL_API_TIMEOUT_SECONDS="20"
134
- EPISODE_TIMEOUT_SECONDS="600"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
135
  ```
136
 
137
- ## openenv workflow
138
 
139
- this repository follows the standard layout expected after `openenv init`.
140
 
141
- `openenv push` performs a stricter structure check than `openenv validate`. to satisfy that check, the repository includes thin required shim files at the root and under `server/`.
 
142
 
143
- - `__init__.py`
144
- - `client.py`
145
- - `models.py`
146
- - `server/Dockerfile`
147
 
148
- ```bash
149
- openenv validate
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
150
  ```
151
 
152
- ```bash
153
- openenv push
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
154
  ```
155
 
156
- for validator and smoke-test usage, use the deployed runtime URL ending in `.hf.space`, not the repository page URL under `huggingface.co/spaces/...`.
157
 
158
- ## local installation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
159
 
160
- for reproducible local setup, `pyproject.toml` and `uv.lock` define the environment used by local development, validation, and container builds.
161
 
162
- ```bash
163
- uv sync --extra dev
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
164
  ```
165
 
166
- the repository targets python 3.11 for validator parity and includes `.python-version` for `uv` discovery. if your host only exposes a newer interpreter outside this supported range, install the matching runtime first.
 
 
 
 
 
 
 
 
 
 
167
 
168
  ```bash
169
  uv python install 3.11
170
  uv sync --python 3.11 --extra dev
171
  ```
172
 
173
- if you prefer `pip`, install from the project metadata instead of from `requirements.txt`.
 
 
 
 
 
 
 
 
174
 
175
  ```bash
176
  python -m pip install .
177
  python -m pip install pytest
178
  ```
179
 
180
- `requirements.txt` mirrors the runtime dependency set.
181
 
182
- ## run the server
183
 
184
- `server/app.py` exposes `server.app:app`, and `pyproject.toml` registers the canonical `server` console script.
185
 
186
  ```bash
187
  uv run server --host 0.0.0.0 --port 8000
188
  ```
189
 
 
 
190
  ```bash
191
- uv run server --help
 
192
  ```
193
 
194
- ## run the baseline agent
195
-
196
- the primary agent entrypoint is `inference.py`.
197
 
198
  ```bash
199
- uv run python inference.py
 
 
200
  ```
201
 
202
- the baseline uses the openai python client with the responses api and writes structured stdout logs in the required format.
 
 
 
 
203
 
204
- - `[START]`
205
- - `[STEP]`
206
- - `[END]`
207
 
208
- diagnostic messages that should not affect scoring are written to stderr.
209
 
210
- ## validator quick check
211
 
212
  ```bash
213
- curl -X POST http://127.0.0.1:8000/reset -H "Content-Type: application/json" -d '{}'
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
214
  ```
215
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
216
  ```bash
217
- curl http://127.0.0.1:8000/state
218
  ```
219
 
 
 
220
  ```bash
221
- curl -X POST http://127.0.0.1:8000/step -H "Content-Type: application/json" -d '{"action":{"command":"echo hello"}}'
222
  ```
223
 
224
- ## local testing
225
 
226
  ```bash
227
- uv run pytest -q
228
  ```
229
 
230
- ## pre submission validation
 
 
231
 
232
- the repository includes a local validation helper at `scripts/validate-submission.sh`.
233
 
234
  ```bash
235
- bash scripts/validate-submission.sh https://huggingmenfordays-sysadmin-env.hf.space .
236
  ```
237
 
238
- the script checks [`/health`](sysadmin_env/server.py:159), [`/reset`](sysadmin_env/server.py:163), docker buildability through [`Dockerfile`](Dockerfile:1), and [`openenv validate`](openenv.yaml:1).
239
 
240
- if you already have the dependencies installed in an active environment, `python -m pytest -q` continues to work as well.
 
 
241
 
242
- ## docker and hugging face deployment
243
 
244
- this repository is prepared for a Hugging Face docker space and the runtime entrypoint is defined in `Dockerfile`.
 
 
 
245
 
246
- the root `Dockerfile` installs from `pyproject.toml` and `uv.lock` via `uv sync --locked`, then starts the environment with `uv run server`.
247
 
248
- for `openenv push`, the CLI stages the repository and promotes `server/Dockerfile` to the deployment root. `server/Dockerfile` mirrors the same metadata driven build path so local docker builds and pushed builds stay aligned.
249
 
250
- the FastAPI server also includes a minimal OpenEnv web compatibility shim so pushed Spaces serve [`/web`](sysadmin_env/server.py) successfully. the shim keeps the existing [`/reset`](sysadmin_env/server.py), [`/step`](sysadmin_env/server.py), [`/state`](sysadmin_env/server.py), and [`/ws`](sysadmin_env/server.py) interfaces unchanged while adding helper routes at [`/web`](sysadmin_env/server.py), [`/web/metadata`](sysadmin_env/server.py), [`/web/reset`](sysadmin_env/server.py), [`/web/step`](sysadmin_env/server.py), and [`/web/state`](sysadmin_env/server.py).
251
 
252
  ```bash
253
  docker build -t sysadmin-env .
 
 
 
254
  ```
255
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
256
  ```bash
257
- docker run --rm -p 18000:8000 sysadmin-env
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
258
  ```
259
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
260
  ```bash
261
- curl http://127.0.0.1:18000/health
 
262
  ```
263
 
 
 
264
  ```bash
265
- curl http://127.0.0.1:18000/tasks
266
  ```
267
 
268
- ## reference baseline behavior
269
 
270
- current local reference behavior on `gpt-5.4-nano` with `OPENAI_REASONING_EFFORT=medium` is.
 
 
 
271
 
272
- - `nginx_crash` passed
273
- - `disk_full` passed
274
- - `network_broken` failed
275
- - overall `2/3` tasks solved
 
9
  base_path: /web
10
  ---
11
 
12
+ # sysadmin-env
13
+
14
+ `sysadmin-env` is an openenv-style benchmark environment for openenv round 1: an agent connects to a live linux-like runtime, inspects a broken machine, issues one shell command at a time, receives stepwise observations and shaped rewards, and is judged on whether it restores the service safely and efficiently.
15
+
16
+ this repository is intentionally built around the round 1 submission contract:
17
+
18
+ - a docker-deployable server with [`/health`](sysadmin_env/server.py), [`/reset`](sysadmin_env/server.py), [`/step`](sysadmin_env/server.py), [`/state`](sysadmin_env/server.py), [`/tasks`](sysadmin_env/server.py), and [`/ws`](sysadmin_env/server.py)
19
+ - a baseline agent entrypoint at `inference.py`
20
+ - deterministic task definitions and graders under `sysadmin_env/tasks/`
21
+ - structured reward shaping in `sysadmin_env/rewards.py`
22
+ - openenv packaging shims at the repository root such as `client.py`, `models.py`, and `__init__.py`
23
+ - deployment metadata in `openenv.yaml`, `Dockerfile`, `server/Dockerfile`, and `pyproject.toml`
24
+
25
+ the benchmark focuses on linux remediation rather than toy puzzle solving. the agent is not selecting from a fixed action list: it must decide which shell command to run, interpret command output, repair the underlying fault, and stop before wasting steps.
26
+
27
+ ## why linux remediation is a meaningful benchmark
28
+
29
+ linux incident response is one of the few domains where agentic reasoning is both measurable and genuinely useful.
30
+
31
+ real operators routinely need to:
32
+
33
+ - inspect logs and process state
34
+ - debug a service that no longer starts
35
+ - find why a filesystem is full
36
+ - repair routes or dns inside a constrained runtime
37
+ - avoid dangerous commands while working under time pressure
38
+
39
+ that makes remediation a strong benchmark for agent systems:
40
+
41
+ 1. **the action space is realistic.** the agent must generate shell commands, not pick from synthetic labels.
42
+ 2. **observations are partially revealing.** one command rarely solves the task; diagnosis matters.
43
+ 3. **there is a safety dimension.** destructive commands should be heavily penalized.
44
+ 4. **partial progress is meaningful.** fixing one component of a broken system should be worth something even before full recovery.
45
+ 5. **success is operationally grounded.** the grader checks system state, not just text output matching.
46
+
47
+ for round 1, this repository therefore benchmarks the full remediation loop: diagnose, repair, validate, and finish.
48
+
49
+ ## round 1 requirement mapping
50
+
51
+ the table below maps the repository to the practical requirements of the round 1 problem statement.
52
+
53
+ | round 1 concern | implementation in this repository |
54
+ | --- | --- |
55
+ | deployable environment server | `FastAPI` app in `sysadmin_env/server.py`, cli wrapper in `server/app.py`, docker entrypoints in `Dockerfile` and `server/Dockerfile` |
56
+ | standard episode api | `POST /reset`, `POST /step`, `GET /state`, `GET /health`, `GET /tasks`, `WS /ws` |
57
+ | deterministic tasks | three fixed task modules in `sysadmin_env/tasks/nginx_crash.py`, `sysadmin_env/tasks/disk_full.py`, and `sysadmin_env/tasks/network_broken.py` |
58
+ | real command execution | bubblewrap-based sandbox in `sysadmin_env/sandbox.py` with mutable task state layered over prepared filesystems |
59
+ | reward shaping | `RewardEngine` in `sysadmin_env/rewards.py` combines health deltas, one-time diagnostic rewards, and penalties |
60
+ | agent entrypoint | `inference.py` loads env vars, queries `/tasks`, connects to `/ws`, emits `[START]`, `[STEP]`, and `[END]` logs |
61
+ | packaging for openenv | root shim files `client.py`, `models.py`, `__init__.py`, plus `openenv.yaml` and mirrored docker assets |
62
+ | validation path | `openenv validate`, docker build, http health/reset probes, and `scripts/validate-submission.sh` |
63
+
64
+ ## high-level architecture
65
+
66
+ at runtime the system looks like this:
67
+
68
+ 1. the server builds a task registry from `sysadmin_env/tasks/`.
69
+ 2. a client resets an episode by task id or lets the server choose the next task in round-robin order.
70
+ 3. the selected task prepares a deterministic lower filesystem.
71
+ 4. `Sandbox` creates an isolated execution root using `OverlayFSManager`.
72
+ 5. the client sends a shell command.
73
+ 6. the sandbox runs that command via `bwrap` under `/bin/sh -c ...`.
74
+ 7. the task module updates any derived runtime state via `observe_command()` and `synchronize()`.
75
+ 8. `RewardEngine` grades the resulting filesystem state and computes the per-step reward.
76
+ 9. the server returns an `Observation` and `EnvironmentState`.
77
+
78
+ that design splits the benchmark into clear responsibilities:
79
+
80
+ - `sysadmin_env/tasks/*.py`: deterministic problem definitions and grading rules
81
+ - `sysadmin_env/sandbox.py`: command execution and runtime isolation
82
+ - `sysadmin_env/overlayfs.py`: resettable mutable filesystem layer
83
+ - `sysadmin_env/rewards.py`: task-agnostic reward shaping and catastrophic command handling
84
+ - `sysadmin_env/server.py`: http api, websocket flow, episode lifecycle, and web shim routes
85
+ - `inference.py`: baseline agent and score logging
86
+
87
+ ## repository layout and file roles
88
+
89
+ the repository keeps the implementation under `sysadmin_env/` and exposes a few required root-level shims for packaging workflows.
90
+
91
+ ```text
92
+ .
93
+ ├── __init__.py
94
+ ├── client.py
95
+ ├── inference.py
96
+ ├── models.py
97
+ ├── Dockerfile
98
+ ├── openenv.yaml
99
+ ├── pyproject.toml
100
+ ├── scripts/
101
+ │ └── validate-submission.sh
102
+ ├── server/
103
+ │ ├── __init__.py
104
+ │ ├── app.py
105
+ │ └── Dockerfile
106
+ └── sysadmin_env/
107
+ ├── __init__.py
108
+ ├── models.py
109
+ ├── overlayfs.py
110
+ ├── rewards.py
111
+ ├── sandbox.py
112
+ ├── server.py
113
+ └── tasks/
114
+ ├── __init__.py
115
+ ├── disk_full.py
116
+ ├── network_broken.py
117
+ └── nginx_crash.py
118
+ ```
119
 
120
+ ### core package files under `sysadmin_env/`
121
 
122
+ - `sysadmin_env/server.py` main environment implementation. it defines `EpisodeManager`, http routes, websocket handling, per-step observation building, and the lightweight `/web*` shim endpoints.
123
+ - `sysadmin_env/sandbox.py` — the execution sandbox. it uses `bubblewrap` (`bwrap`) to run commands in an isolated root, binds selected host binaries read-only, optionally unshares networking, and tracks command results.
124
+ - `sysadmin_env/overlayfs.py` — mutable episode filesystem manager. it tries kernel overlayfs first, then `fuse-overlayfs`, then falls back to a plain directory copy strategy when overlay mounts are unavailable.
125
+ - `sysadmin_env/rewards.py` — reward shaping engine shared across tasks. it applies per-step penalties, one-time diagnostic bonuses, health deltas from task graders, and catastrophic command penalties.
126
+ - `sysadmin_env/models.py` — pydantic models for actions, observations, state, reset/step payloads, reward signals, task metadata, and grader state.
127
+ - `sysadmin_env/tasks/__init__.py` — task registry assembly and module lookup.
128
+ - `sysadmin_env/tasks/nginx_crash.py` — easy service-recovery task.
129
+ - `sysadmin_env/tasks/disk_full.py` — medium disk-diagnosis/remediation task.
130
+ - `sysadmin_env/tasks/network_broken.py` — hard routing-and-dns task with network isolation enabled.
131
 
132
+ ### root shims and openenv-facing files
133
 
134
+ - `client.py` — thin root shim that re-exports `main` from `inference.py`. this keeps the repository shape friendly to packaging and submission tooling.
135
+ - `models.py` thin root shim that re-exports the canonical pydantic models from `sysadmin_env.models`.
136
+ - `__init__.py` root package shim that re-exports `main`, `Action`, `Observation`, and `EnvironmentState`.
137
+ - `inference.py` the baseline agent used as the submission entrypoint declared in `openenv.yaml`.
138
 
139
+ ### deployment, packaging, and validation files
140
 
141
+ - `Dockerfile` primary container build for local docker runs and hugging face docker spaces.
142
+ - `server/Dockerfile` — mirrored server build asset kept alongside `server/app.py` for openenv repository structure checks.
143
+ - `server/app.py` — asgi/cli launcher that imports `app` from `sysadmin_env.server` and exposes the `server` console script.
144
+ - `openenv.yaml` — openenv manifest: runtime entrypoints, endpoints, resources, and task metadata.
145
+ - `pyproject.toml` — canonical packaging metadata, dependencies, python version bounds, and the `server = "server.app:main"` console script.
146
+ - `requirements.txt` — mirrored runtime dependency list.
147
+ - `scripts/validate-submission.sh` — local pre-submission validator that checks the live space, docker buildability, and `openenv validate`.
148
 
149
+ ## runtime model: actions, observations, state, and episode boundaries
 
 
 
 
 
150
 
151
+ the environment is turn-based. every turn consists of one shell command.
152
 
153
+ ### action model
154
 
155
+ the canonical action model is defined in `sysadmin_env/models.py`:
156
 
157
  ```json
158
  {
159
+ "command": "string, min length 1",
160
  "reasoning": "string or null"
161
  }
162
  ```
163
 
164
+ - `command` is the single shell command executed with `/bin/sh -c` inside the sandbox.
165
+ - `reasoning` is optional metadata for clients and logs. the server does not grade it.
166
+
167
+ for the http step route, the action is wrapped inside `StepRequest`:
168
 
169
  ```json
170
  {
 
175
  }
176
  ```
177
 
178
+ ### observation model
179
 
180
+ each step returns an `Observation`:
181
 
182
  ```json
183
  {
 
193
  }
194
  ```
195
 
196
+ important details:
197
+
198
+ - `reward` is **the reward for that step only**, not a cumulative return.
199
+ - `done` becomes `true` when the task grader declares success, a catastrophic action is detected, or the episode hits `max_steps`.
200
+ - `working_directory` is `/` from the sandbox’s point of view.
201
+ - if a command times out, the server appends `command execution timed out` to `stderr`.
202
+
203
+ ### state model
204
+
205
+ `GET /state` returns `EnvironmentState`:
206
 
207
  ```json
208
  {
 
215
  }
216
  ```
217
 
218
+ again, `reward` here is the last step reward, mirroring the latest observation.
219
 
220
+ ### reset and task selection
221
 
222
+ `POST /reset` optionally accepts a `task_id`:
 
 
 
 
223
 
224
+ ```json
225
+ {
226
+ "task_id": "disk_full"
227
+ }
228
+ ```
229
 
230
+ if `task_id` is omitted, `EpisodeManager` selects the next task in round-robin registry order. in this repository that order is the registry insertion order:
231
 
232
+ 1. `nginx_crash`
233
+ 2. `disk_full`
234
+ 3. `network_broken`
235
 
236
+ ### episode boundaries
 
 
 
237
 
238
+ for an episode with step index `t`, the server marks the observation done when:
239
 
240
+ - the task grader returns `done = true`, or
241
+ - the reward engine flags the action as catastrophic, or
242
+ - `t >= max_steps`
243
 
244
+ on the http path, when an episode ends the current sandbox is cleaned up immediately. the last state remains queryable through `GET /state`, but another `POST /step` requires a new `POST /reset`.
245
+
246
+ ## api reference
247
+
248
+ ### http routes
249
+
250
+ #### `GET /health`
251
+
252
+ health probe for validators and deployment smoke tests.
253
+
254
+ ```json
255
+ {"status": "ok"}
256
+ ```
257
+
258
+ #### `GET /tasks`
259
+
260
+ returns the available task metadata that clients can iterate over.
261
+
262
+ ```json
263
+ {
264
+ "tasks": [
265
+ {
266
+ "task_id": "nginx_crash",
267
+ "difficulty": "easy",
268
+ "description": "nginx crashed with stale pid and config syntax error",
269
+ "max_steps": 40,
270
+ "time_limit": 300.0
271
+ }
272
+ ]
273
+ }
274
  ```
275
 
276
+ #### `POST /reset`
277
 
278
+ starts a new episode and returns a `StepResult` consisting of:
279
 
280
+ - an initial zero-reward observation at `step_number = 0`
281
+ - the environment state with a fresh `episode_id`
282
 
283
+ #### `POST /step`
 
 
 
284
 
285
+ executes one action inside the active episode sandbox and returns:
286
+
287
+ ```json
288
+ {
289
+ "observation": {
290
+ "stdout": "...",
291
+ "stderr": "...",
292
+ "exit_code": 0,
293
+ "working_directory": "/",
294
+ "execution_time": 0.02,
295
+ "reward": 0.07,
296
+ "done": false,
297
+ "step_number": 1,
298
+ "max_steps": 40
299
+ },
300
+ "state": {
301
+ "episode_id": "...",
302
+ "task_id": "nginx_crash",
303
+ "step_count": 1,
304
+ "max_steps": 40,
305
+ "done": false,
306
+ "reward": 0.07
307
+ }
308
+ }
309
  ```
310
 
311
+ if no episode has been initialized, the route returns http `409`.
312
+
313
+ #### `GET /state`
314
+
315
+ returns the latest `EnvironmentState`. if no episode has been initialized yet, the route returns http `404`.
316
+
317
+ ### websocket flow: `WS /ws`
318
+
319
+ the websocket route is the main agent interface used by `inference.py`.
320
+
321
+ connection behavior:
322
+
323
+ 1. connect to `/ws` or `/ws?task_id=<task>`.
324
+ 2. the server immediately starts an episode.
325
+ 3. the first message is:
326
+
327
+ ```json
328
+ {
329
+ "type": "episode_started",
330
+ "task": {
331
+ "task_id": "network_broken",
332
+ "difficulty": "hard",
333
+ "description": "broken network namespace with corrupted routing and dns",
334
+ "max_steps": 70,
335
+ "time_limit": 480.0
336
+ }
337
+ }
338
+ ```
339
+
340
+ 4. the client sends raw `Action` json, not a `StepRequest` wrapper:
341
+
342
+ ```json
343
+ {
344
+ "command": "ip route show",
345
+ "reasoning": "inspect the default route"
346
+ }
347
  ```
348
 
349
+ 5. the server replies with observation messages:
350
 
351
+ ```json
352
+ {
353
+ "type": "observation",
354
+ "task_id": "network_broken",
355
+ "observation": {
356
+ "stdout": "default via 192.0.2.1 dev eth9\n",
357
+ "stderr": "",
358
+ "exit_code": 0,
359
+ "working_directory": "/",
360
+ "execution_time": 0.01,
361
+ "reward": 0.06,
362
+ "done": false,
363
+ "step_number": 1,
364
+ "max_steps": 70
365
+ }
366
+ }
367
+ ```
368
 
369
+ malformed or empty actions yield error messages such as:
370
 
371
+ ```json
372
+ {
373
+ "type": "error",
374
+ "code": "invalid_action",
375
+ "message": "malformed action json"
376
+ }
377
+ ```
378
+
379
+ once `done` becomes `true`, the server cleans up the sandbox and closes the episode loop for that websocket connection.
380
+
381
+ ### web shim routes
382
+
383
+ the server also exposes lightweight web shim routes intended for space uis and openenv web probing:
384
+
385
+ - `GET /web`
386
+ - `GET /web/metadata`
387
+ - `POST /web/reset`
388
+ - `POST /web/step`
389
+ - `GET /web/state`
390
+
391
+ these routes do not replace the canonical http api; they wrap it.
392
+
393
+ useful details:
394
+
395
+ - `GET /web/metadata` returns the benchmark name, a short description, a `/docs` url, and the contents of `README.md`.
396
+ - `POST /web/reset` returns a json object with top-level `observation`, `reward`, `done`, and `state` fields.
397
+ - `POST /web/step` accepts either:
398
+ - `{"action": {"command": "...", "reasoning": null}}`, or
399
+ - `{"command": "...", "reasoning": null}`
400
+ - `GET /web/state` returns an `initialized` flag and `null` fields before the first reset.
401
+
402
+ ## sandbox and filesystem model
403
+
404
+ each task is defined as a prepared lower filesystem plus a mutable episode runtime.
405
+
406
+ `Sandbox` in `sysadmin_env/sandbox.py`:
407
+
408
+ - verifies that `bwrap` is available
409
+ - creates a writable overlay-backed runtime root
410
+ - binds selected host binaries read-only into the sandbox
411
+ - clears the environment and sets a small deterministic `PATH`
412
+ - runs as uid `0` and gid `0`
413
+ - drops all linux capabilities
414
+ - optionally unshares networking for tasks that require isolation
415
+
416
+ task modules write stub binaries into the lower filesystem, such as `nginx`, `df`, `du`, `ip`, `ping`, `service`, and `systemctl`. this gives the benchmark realistic command semantics while keeping the task fully deterministic and cheap to reset.
417
+
418
+ ## task suite
419
+
420
+ there are exactly three tasks, with increasing difficulty and fixed metadata also mirrored in `openenv.yaml`.
421
+
422
+ | task | difficulty | max steps | time limit | objective |
423
+ | --- | --- | ---: | ---: | --- |
424
+ | `nginx_crash` | easy | 40 | 300 s | restore a broken nginx service with config and pid issues |
425
+ | `disk_full` | medium | 55 | 420 s | identify and neutralize the hidden file exhausting `/mnt/data` |
426
+ | `network_broken` | hard | 70 | 480 s | repair routing and dns so outbound connectivity is restored |
427
+
428
+ ### determinism guarantees across tasks
429
+
430
+ all three tasks are deterministic in the current codebase:
431
+
432
+ - the prepared filesystem contents are fixed
433
+ - grader logic is pure filesystem-state inspection
434
+ - diagnostic triggers are fixed regular-expression matches over commands
435
+ - there is no random task generation, no stochastic log output, and no nondeterministic reward noise
436
+
437
+ the only source of behavioral variation is the agent’s command sequence.
438
+
439
+ ### task 1: `nginx_crash`
440
+
441
+ **what is broken**
442
+
443
+ - `/etc/nginx/nginx.conf` is missing the semicolon after `listen 8080`
444
+ - `/var/run/nginx.pid` contains a stale pid (`424242`)
445
+ - `/var/log/nginx/error.log` contains the parse error text
446
+ - the provided stub `nginx` binary refuses to start while the stale pid is present or the config is still broken
447
+
448
+ **relevant task-local command stubs**
449
+
450
+ - `nginx`
451
+ - `curl`
452
+ - `ps`
453
+ - `pgrep`
454
+ - `service`
455
+ - `systemctl`
456
+
457
+ **difficulty progression**
458
+
459
+ this is the easiest task because the failure is local to one service and the remediation path is short:
460
+
461
+ 1. inspect logs or config
462
+ 2. clear or repair the pid/config problem
463
+ 3. start nginx
464
+ 4. optionally verify with `curl`, `service nginx status`, or `systemctl status nginx`
465
+
466
+ **grader behavior**
467
+
468
+ the task health is:
469
+
470
+ ```text
471
+ H_nginx = 0.25 * I_stale_pid_removed
472
+ + 0.35 * I_config_fixed
473
+ + 0.40 * I_service_running
474
+ ```
475
+
476
+ where:
477
+
478
+ - `I_stale_pid_removed = 1` if `/var/run/nginx.pid` is missing or contains `1234`
479
+ - `I_config_fixed = 1` if the config contains `listen 8080;`
480
+ - `I_service_running = 1` if the config is fixed and `/run/nginx.running` says `running`
481
+
482
+ the episode ends successfully when `I_service_running = 1`.
483
+
484
+ **diagnostic rewards**
485
+
486
+ - checking `error.log`: `+0.05`
487
+ - running `nginx -t`: `+0.08`
488
+ - reading the pid file: `+0.04`
489
+ - checking process state via `ps` or `pgrep`: `+0.04`
490
+
491
+ these rewards are one-time only per episode.
492
+
493
+ ### task 2: `disk_full`
494
+
495
+ **what is broken**
496
+
497
+ - the simulated mount is `/mnt/data`
498
+ - capacity is fixed at `100`
499
+ - the hidden file `/mnt/data/.cache/.rotated/app.trace` is written with length `100`
500
+ - that makes used space equal capacity, so available space is `0`
501
+
502
+ **relevant task-local command stubs**
503
+
504
+ - `df`
505
+ - `du`
506
+ - `lsof`
507
+
508
+ **difficulty progression**
509
+
510
+ this task is harder than `nginx_crash` because the agent must identify where the space went before it can reclaim capacity. the intended trajectory is usually:
511
+
512
+ 1. establish that the filesystem is full
513
+ 2. search or summarize the mount contents
514
+ 3. identify the hidden offender
515
+ 4. truncate or remove the file
516
+ 5. verify free space returned
517
+
518
+ **grader behavior**
519
+
520
+ the task health is:
521
+
522
+ ```text
523
+ H_disk = 0.30 * I_filesystem_identified
524
+ + 0.30 * I_hidden_file_found
525
+ + 0.40 * I_capacity_free
526
+ ```
527
+
528
+ where:
529
+
530
+ - `I_filesystem_identified = 1` once the task records diagnosis state `full` or `found`
531
+ - `I_hidden_file_found = 1` once the hidden file has either been removed/truncated away from existence or the discovery state is `found`
532
+ - `I_capacity_free = 1` if free capacity is greater than `0`
533
+
534
+ the task uses `.capacity`, `.usage`, and `.diagnosed` files under `/mnt/data` to make the state explicit and deterministic.
535
+
536
+ the episode ends successfully when `I_capacity_free = 1`.
537
+
538
+ **diagnostic rewards**
539
+
540
+ - `df` / `df -h`: `+0.06`
541
+ - `du`: `+0.05`
542
+ - `find ... -type f` or `find ... -name`: `+0.06`
543
+ - `lsof`: `+0.05`
544
+
545
+ **what counts as a repair**
546
+
547
+ any non-catastrophic change that leaves the filesystem with available capacity works. for example, truncating or deleting the hidden file both satisfy the implemented grader.
548
+
549
+ ### task 3: `network_broken`
550
+
551
+ **what is broken**
552
+
553
+ - `/etc/network/routes/default` starts as `default via 192.0.2.1 dev eth9`
554
+ - `/etc/resolv.conf` starts as `nameserver 0.0.0.0`
555
+ - `eth0` itself is up and already has `10.0.2.15/24`
556
+ - the task definition sets `requires_network_isolation = True`, so the sandbox unshares networking
557
+
558
+ **relevant task-local command stubs**
559
+
560
+ - `ip`
561
+ - `route`
562
+ - `ping`
563
+
564
+ **difficulty progression**
565
+
566
+ this is the hardest task because the agent must reason about multiple networking layers:
567
+
568
+ 1. inspect the route table
569
+ 2. inspect interface state and addresses
570
+ 3. inspect dns resolver configuration
571
+ 4. repair the default route
572
+ 5. repair `resolv.conf`
573
+ 6. validate connectivity
574
+
575
+ **grader behavior**
576
+
577
+ the task health is:
578
+
579
+ ```text
580
+ H_net = 0.20 * I_routing_issue_diagnosed
581
+ + 0.30 * I_default_route_restored
582
+ + 0.20 * I_dns_resolution_restored
583
+ + 0.30 * I_outbound_connectivity_restored
584
+ ```
585
+
586
+ where:
587
+
588
+ - `I_default_route_restored = 1` iff `/etc/network/routes/default` exactly equals `default via 10.0.2.2 dev eth0\n`
589
+ - `I_dns_resolution_restored = 1` iff `/etc/resolv.conf` exactly equals `nameserver 1.1.1.1\n`
590
+ - `I_outbound_connectivity_restored = 1` iff both fixes above are in place and the link state file still says `up`
591
+ - `I_routing_issue_diagnosed = 1` iff the route has already been fixed or the task’s `network.ping` flag has been marked `diagnosed`
592
+
593
+ the episode ends successfully when `I_outbound_connectivity_restored = 1`.
594
+
595
+ notably, the grader does **not** require an actual successful `ping` command after repair; success is determined from the repaired state files. a ping is still useful as evidence for the agent.
596
+
597
+ **diagnostic rewards**
598
+
599
+ - `ip route show` or `route -n`: `+0.07`
600
+ - `ip addr` or `ifconfig`: `+0.05`
601
+ - `ip link` or `ethtool`: `+0.05`
602
+ - `ping` or `curl`: `+0.06`
603
+ - reading `resolv.conf`: `+0.05`
604
+
605
+ ## reward and scoring system
606
+
607
+ this section is based on the actual implementation in `sysadmin_env/rewards.py`, the per-task `grade()` functions, and the task summary logic in `inference.py`.
608
+
609
+ ### step reward formula
610
+
611
+ let:
612
+
613
+ - `H_t` = task health after step `t`, as returned by the task module’s `grade()` function
614
+ - `H_(t-1)` = health before the current step
615
+ - `K_t` = one-time diagnostic reward earned on step `t`
616
+ - `P_step = -0.01`
617
+
618
+ then for a normal, non-catastrophic action:
619
+
620
+ ```text
621
+ r_t = (H_t - H_(t-1)) + K_t + P_step
622
+ ```
623
+
624
+ equivalently:
625
+
626
+ ```text
627
+ r_t = health_delta + knowledge_delta - 0.01
628
+ ```
629
+
630
+ where:
631
+
632
+ - `health_delta = H_t - H_(t-1)`
633
+ - `knowledge_delta = sum of newly unlocked diagnostic trigger rewards on this step`
634
+
635
+ the reward engine stores `known_fact_ids`, so a diagnostic trigger only pays once. repeating the same diagnostic command later gives no extra knowledge reward.
636
+
637
+ ### catastrophic action penalty
638
+
639
+ if the command string matches one of the destructive regex patterns, the reward engine ignores any positive progress from that action and instead returns:
640
+
641
+ ```text
642
+ r_t = -1.0
643
+ ```
644
+
645
+ and marks the episode done.
646
+
647
+ the default catastrophic patterns include commands matching behaviors such as:
648
+
649
+ - `rm -rf /`
650
+ - `mkfs`
651
+ - `shutdown`, `reboot`, `halt`
652
+ - `kill 1` or `kill -9 1`
653
+ - destructive `dd`/`truncate` writes targeting `/etc` or `/boot`
654
+ - a shell fork bomb pattern
655
+
656
+ matching is regex-based and case-insensitive.
657
+
658
+ ### partial progress and telescoping health
659
+
660
+ because each task health is defined on `[0, 1]`, cumulative health gain over an episode telescopes:
661
+
662
+ ```text
663
+ sum_t (H_t - H_(t-1)) = H_final - H_initial
664
+ ```
665
+
666
+ all three tasks begin with `H_initial = 0.0`, so if the agent fully solves a task without catastrophic failure:
667
+
668
+ ```text
669
+ sum_t health_delta = 1.0
670
+ ```
671
+
672
+ this is why task-specific partial repairs directly appear in reward:
673
+
674
+ - removing only the stale nginx pid is worth `+0.25` health before the step penalty
675
+ - identifying the full disk is worth `+0.30` health before the step penalty
676
+ - fixing only the network route is worth `+0.30` health before the step penalty
677
+
678
+ ### one-time knowledge rewards by task
679
+
680
+ the maximum knowledge reward available per task is:
681
+
682
+ | task | knowledge trigger sum |
683
+ | --- | ---: |
684
+ | `nginx_crash` | `0.05 + 0.08 + 0.04 + 0.04 = 0.21` |
685
+ | `disk_full` | `0.06 + 0.05 + 0.06 + 0.05 = 0.22` |
686
+ | `network_broken` | `0.07 + 0.05 + 0.05 + 0.06 + 0.05 = 0.28` |
687
+
688
+ so the maximum raw trajectory return before step penalties is:
689
+
690
+ ```text
691
+ 1.0 + knowledge_sum
692
+ ```
693
+
694
+ which is:
695
+
696
+ - `1.21` for `nginx_crash`
697
+ - `1.22` for `disk_full`
698
+ - `1.28` for `network_broken`
699
+
700
+ after `n` non-catastrophic steps, the raw return becomes:
701
+
702
+ ```text
703
+ R_raw = H_final + K_total - 0.01 * n
704
+ ```
705
+
706
+ for the common non-catastrophic case.
707
+
708
+ ### examples
709
+
710
+ #### example: useful diagnosis but no repair
711
+
712
+ if the agent runs `nginx -t` as the first command in `nginx_crash`, the command reveals the config fact and changes no system health:
713
+
714
+ ```text
715
+ health_delta = 0.00
716
+ knowledge_delta = 0.08
717
+ reward = 0.00 + 0.08 - 0.01 = 0.07
718
+ ```
719
+
720
+ #### example: partial repair
721
+
722
+ if the agent removes the stale pid in `nginx_crash` and nothing else changes:
723
+
724
+ ```text
725
+ health_delta = 0.25
726
+ knowledge_delta = 0.00
727
+ reward = 0.25 - 0.01 = 0.24
728
+ ```
729
+
730
+ #### example: repeated diagnosis
731
+
732
+ if the agent runs the same rewarded diagnostic command twice, the second step yields no extra knowledge reward:
733
+
734
+ ```text
735
+ reward_repeat = health_delta + 0.00 - 0.01
736
+ ```
737
+
738
+ if no repair happened either, that means `reward_repeat = -0.01`.
739
+
740
+ ### how the inference script turns trajectory rewards into a reported score
741
+
742
+ `inference.py` accumulates the per-step rewards it receives from websocket observations:
743
+
744
+ ```text
745
+ R_episode = sum_t r_t
746
+ ```
747
+
748
+ it then reports the task `score` as:
749
+
750
+ ```text
751
+ score = clamp(R_episode, 0.0, 1.0)
752
+ ```
753
+
754
+ where:
755
+
756
+ ```text
757
+ clamp(x, 0, 1) = min(max(x, 0), 1)
758
+ ```
759
+
760
+ important implications:
761
+
762
+ 1. this is a **clamped trajectory sum**, not a separate grader-normalized value.
763
+ 2. strong trajectories can exceed `1.0` before clamping because they combine full health (`1.0`) with diagnostic rewards.
764
+ 3. wasted steps reduce the score by `0.01` each.
765
+ 4. a catastrophic `-1.0` step can wipe out prior gains or leave a small residual score if the previous raw total was already above `1.0`.
766
+
767
+ ### how `success` is computed in `inference.py`
768
+
769
+ the baseline script’s `success` flag is distinct from the clamped score. on the final observation it computes:
770
+
771
+ ```text
772
+ success = (last_step_reward > 0.0) and (step_number < max_steps)
773
  ```
774
 
775
+ consequences:
776
+
777
+ - a task completed with a positive final reward before the step cap is counted as success
778
+ - a run that ends exactly on `max_steps` is marked unsuccessful by the baseline summary, even if the last action repaired the state
779
+ - the server itself still reports `done`; this `success` flag is a client-side summary convention used by `inference.py`
780
+
781
+ ## local setup
782
+
783
+ the repository is designed around python `3.11` and `uv`.
784
+
785
+ ### recommended setup with `uv`
786
 
787
  ```bash
788
  uv python install 3.11
789
  uv sync --python 3.11 --extra dev
790
  ```
791
 
792
+ if python `3.11` is already available:
793
+
794
+ ```bash
795
+ uv sync --extra dev
796
+ ```
797
+
798
+ `pyproject.toml` is the canonical dependency source, and `uv.lock` pins the resolved environment used by docker builds.
799
+
800
+ ### alternative setup with `pip`
801
 
802
  ```bash
803
  python -m pip install .
804
  python -m pip install pytest
805
  ```
806
 
807
+ `requirements.txt` mirrors the runtime dependency set, but the packaging metadata lives in `pyproject.toml`.
808
 
809
+ ## running the server locally
810
 
811
+ the canonical launcher is the `server` console script declared in `pyproject.toml` and implemented by `server/app.py`.
812
 
813
  ```bash
814
  uv run server --host 0.0.0.0 --port 8000
815
  ```
816
 
817
+ useful checks:
818
+
819
  ```bash
820
+ curl http://127.0.0.1:8000/health
821
+ curl http://127.0.0.1:8000/tasks
822
  ```
823
 
824
+ ### manual http flow
 
 
825
 
826
  ```bash
827
+ curl -X POST http://127.0.0.1:8000/reset \
828
+ -H "Content-Type: application/json" \
829
+ -d '{"task_id":"nginx_crash"}'
830
  ```
831
 
832
+ ```bash
833
+ curl -X POST http://127.0.0.1:8000/step \
834
+ -H "Content-Type: application/json" \
835
+ -d '{"action":{"command":"cat /var/log/nginx/error.log","reasoning":null}}'
836
+ ```
837
 
838
+ ```bash
839
+ curl http://127.0.0.1:8000/state
840
+ ```
841
 
842
+ ## inference usage
843
 
844
+ the baseline agent entrypoint is `inference.py`.
845
 
846
  ```bash
847
+ uv run python inference.py
848
+ ```
849
+
850
+ it will:
851
+
852
+ 1. probe `/health`
853
+ 2. query `/tasks` unless `SYSADMIN_ENV_TASK_ID` is set
854
+ 3. connect to `/ws?task_id=<task>`
855
+ 4. choose actions using the openai responses api if credentials exist
856
+ 5. fall back to a deterministic heuristic plan otherwise
857
+ 6. emit structured stdout logs
858
+
859
+ the required environment variables are:
860
+
861
+ ```dotenv
862
+ HF_TOKEN="your_api_key_here"
863
+ MODEL_NAME="gpt-5.4"
864
+ API_BASE_URL="https://api.openai.com/v1"
865
+ OPENAI_REASONING_EFFORT="medium"
866
+ SYSADMIN_ENV_SERVER_URL="ws://127.0.0.1:8000/ws"
867
+ SYSADMIN_ENV_HEALTHCHECK_URL="http://127.0.0.1:8000/health"
868
+ SYSADMIN_ENV_TASKS_URL="http://127.0.0.1:8000/tasks"
869
+ SYSADMIN_ENV_TASK_ID=""
870
+ MODEL_API_TIMEOUT_SECONDS="20"
871
+ EPISODE_TIMEOUT_SECONDS="600"
872
  ```
873
 
874
+ notes:
875
+
876
+ - `HF_TOKEN` is the first credential name the baseline checks, followed by `OPENAI_API_KEY` and `API_KEY`.
877
+ - `SYSADMIN_ENV_TASK_ID=""` means “run all tasks returned by `/tasks` in order”.
878
+ - `API_BASE_URL` may point to any openai-compatible endpoint.
879
+ - the script writes `[START]`, `[STEP]`, and `[END]` records to stdout and diagnostics to stderr.
880
+
881
+ ## validation flow
882
+
883
+ there are three useful validation layers.
884
+
885
+ ### 1. python tests
886
+
887
+ run the full suite:
888
+
889
  ```bash
890
+ uv run pytest -q
891
  ```
892
 
893
+ for packaging, server-contract, and scoring-focused checks, a narrower command is:
894
+
895
  ```bash
896
+ uv run pytest -q tests/test_packaginge.py tests/test_server.py tests/test_rewards.py tests/test_inferenxe.py
897
  ```
898
 
899
+ ### 2. openenv manifest validation
900
 
901
  ```bash
902
+ openenv validate
903
  ```
904
 
905
+ this checks the submission structure and endpoint declarations from `openenv.yaml`.
906
+
907
+ ### 3. end-to-end submission helper
908
 
909
+ the repository includes an exact pre-submission helper script:
910
 
911
  ```bash
912
+ bash scripts/validate-submission.sh https://your-space.hf.space .
913
  ```
914
 
915
+ or, from the repository root:
916
 
917
+ ```bash
918
+ bash scripts/validate-submission.sh https://your-space.hf.space
919
+ ```
920
 
921
+ the script performs four checks in sequence:
922
 
923
+ 1. `GET <space>/health`
924
+ 2. `POST <space>/reset`
925
+ 3. local `docker build`
926
+ 4. local `openenv validate`
927
 
928
+ use the runtime url ending in `.hf.space`, not the repository page url under `huggingface.co/spaces/...`.
929
 
930
+ ## docker and deployment flow
931
 
932
+ ### local docker build
933
 
934
  ```bash
935
  docker build -t sysadmin-env .
936
+ docker run --rm -p 18000:8000 sysadmin-env
937
+ curl http://127.0.0.1:18000/health
938
+ curl http://127.0.0.1:18000/tasks
939
  ```
940
 
941
+ both `Dockerfile` and `server/Dockerfile`:
942
+
943
+ - start from `python:3.11-slim`
944
+ - install `bubblewrap`, `fuse-overlayfs`, `procps`, `iputils-ping`, `findutils`, and `curl`
945
+ - install `uv`
946
+ - copy `pyproject.toml` and `uv.lock`
947
+ - run `uv sync --locked --no-dev --no-install-project`
948
+ - copy the project files including `README.md`, root shims, `server/`, `sysadmin_env/`, and `assets/`
949
+ - run `uv sync --locked --no-dev --no-editable`
950
+ - start the environment with `uv run server --host 0.0.0.0 --port 8000`
951
+
952
+ ### hugging face deployment
953
+
954
+ the repository is prepared for a hugging face docker space.
955
+
956
+ key points:
957
+
958
+ - the readme front matter declares `sdk: docker`
959
+ - `Dockerfile` is suitable for space runtime startup
960
+ - `openenv.yaml` declares `inference.py` as the benchmark entrypoint and `server.app:app` as the server entrypoint
961
+ - the root shims (`client.py`, `models.py`, `__init__.py`) and `server/Dockerfile` are present because openenv repository checks expect this structure after an `openenv init` style workflow
962
+
963
+ typical flow:
964
+
965
+ 1. build and test locally
966
+ 2. run `openenv validate`
967
+ 3. push the repository or space update
968
+ 4. wait for the hugging face space to become healthy
969
+ 5. run `bash scripts/validate-submission.sh https://your-space.hf.space .`
970
+ 6. run your agent against the live deployment via `inference.py`
971
+
972
+ ### openenv submission commands
973
+
974
  ```bash
975
+ openenv validate
976
+ openenv push
977
+ ```
978
+
979
+ this repository keeps the mirrored build assets and root shims needed for that workflow.
980
+
981
+ ## mathematical summary of each task’s total raw return
982
+
983
+ ignoring catastrophic termination, the raw episode return for each task can be written as:
984
+
985
+ ```text
986
+ R = H_final + K_total - 0.01 * n
987
+ ```
988
+
989
+ where `n` is the number of executed steps.
990
+
991
+ for the fully solved case (`H_final = 1.0`):
992
+
993
+ | task | fully solved raw return |
994
+ | --- | --- |
995
+ | `nginx_crash` | `R = 1.0 + K_nginx - 0.01n`, where `0 <= K_nginx <= 0.21` |
996
+ | `disk_full` | `R = 1.0 + K_disk - 0.01n`, where `0 <= K_disk <= 0.22` |
997
+ | `network_broken` | `R = 1.0 + K_net - 0.01n`, where `0 <= K_net <= 0.28` |
998
+
999
+ the score reported by `inference.py` is then:
1000
+
1001
+ ```text
1002
+ score = min(max(R, 0.0), 1.0)
1003
  ```
1004
 
1005
+ so the benchmark strongly rewards:
1006
+
1007
+ - solving the task at all
1008
+ - gathering useful evidence without repeating it
1009
+ - reaching the repair quickly
1010
+ - avoiding destructive commands entirely
1011
+
1012
+ ## limitations and portability notes
1013
+
1014
+ ### overlay mount constraints on hugging face and other managed runtimes
1015
+
1016
+ managed container platforms often restrict privileged mount operations. in practice, hugging face docker spaces may not allow kernel overlay mounts, and some environments may also lack a usable `fuse-overlayfs` path.
1017
+
1018
+ `sysadmin_env/overlayfs.py` handles this explicitly:
1019
+
1020
+ 1. try kernel overlayfs
1021
+ 2. if that fails, try `fuse-overlayfs`
1022
+ 3. if that also fails, use a plain directory copy fallback
1023
+
1024
+ the fallback is important because it preserves correctness even when the faster mount strategies are unavailable.
1025
+
1026
+ ### what the copy fallback means
1027
+
1028
+ in copy mode:
1029
+
1030
+ - the prepared lower filesystem is copied into the merged runtime directory
1031
+ - resets rebuild that merged directory by copying from the lowerdir again
1032
+ - the environment remains deterministic and functional
1033
+ - resets are typically slower than true overlay copy-on-write resets
1034
+
1035
+ this is a deliberate portability tradeoff: the benchmark prefers “runs correctly in restricted environments” over “requires privileged overlay support”.
1036
+
1037
+ ### additional candid limitations
1038
+
1039
+ - the tasks are realistic but still simplified; they use stub executables rather than full linux services.
1040
+ - grading is based on explicit filesystem state rather than black-box network/service behavior.
1041
+ - the baseline `success` flag in `inference.py` is a client summary heuristic, not an authoritative server-side evaluation primitive.
1042
+ - the environment currently models exactly three tasks; expanding benchmark breadth would require additional task modules and graders.
1043
+
1044
+ ## practical quickstart
1045
+
1046
+ if you just want the shortest useful path:
1047
+
1048
  ```bash
1049
+ uv sync --extra dev
1050
+ uv run server --host 0.0.0.0 --port 8000
1051
  ```
1052
 
1053
+ in another shell:
1054
+
1055
  ```bash
1056
+ uv run python inference.py
1057
  ```
1058
 
1059
+ before submission:
1060
 
1061
+ ```bash
1062
+ openenv validate
1063
+ bash scripts/validate-submission.sh https://your-space.hf.space .
1064
+ ```
1065
 
1066
+ that sequence exercises the main round 1 path from local development to deployment validation.
 
 
 
tests/test_packaginge.py CHANGED
@@ -147,7 +147,7 @@ def test_deployment_files_reference_hugging_face_and_openenv_workflow():
147
  assert 'EXPOSE 8000' in dockerfile
148
  assert 'EXPOSE 8000' in server_dockerfile
149
  assert 'sdk: docker' in readme
150
- assert 'Hugging Face' in readme
151
  assert 'openenv' in readme
152
  assert 'openenv init' in readme
153
  assert 'openenv validate' in readme
@@ -173,7 +173,8 @@ def test_deployment_files_reference_hugging_face_and_openenv_workflow():
173
  assert '/home/cyclops' not in readme
174
  assert 'mamba activate' not in readme
175
  assert 'legacy' not in readme.lower()
176
- assert 'compatib' not in readme.lower()
 
177
  assert 'sysadmin_env.server:app' not in readme
178
  assert 'from inference import main' in client
179
  assert 'from sysadmin_env.models import Action' in models
 
147
  assert 'EXPOSE 8000' in dockerfile
148
  assert 'EXPOSE 8000' in server_dockerfile
149
  assert 'sdk: docker' in readme
150
+ assert 'hugging face' in readme
151
  assert 'openenv' in readme
152
  assert 'openenv init' in readme
153
  assert 'openenv validate' in readme
 
173
  assert '/home/cyclops' not in readme
174
  assert 'mamba activate' not in readme
175
  assert 'legacy' not in readme.lower()
176
+ assert 'web shim routes' in readme
177
+ assert 'openai-compatible endpoint' in readme or 'openai-compatible endpoints' in readme
178
  assert 'sysadmin_env.server:app' not in readme
179
  assert 'from inference import main' in client
180
  assert 'from sysadmin_env.models import Action' in models