File size: 10,372 Bytes
bc35a94
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
# what I need you to do β€” hackathon final stretch

I cannot do these inside the Cursor sandbox (no GPU, no HF credentials, no
PTY devices, no real network). these are the remaining blockers between
"technically complete" and "wins the hackathon".

legend
- **[BLOCKER]** must be done before submission
- **[BONUS]**  meaningful boost on the rubric, not required
- **[POLISH]** last-minute polish if you have time

## apr 23 2026 β€” reward pipeline + session isolation fixes shipped

after a kaggle probe run showed `solve_reward=0`, `progress_reward=0`,
and `frac_reward_zero_std=1` across 10 grpo steps, the whole remote
rollout stack was rewritten. what landed on `final-round`:

- `sysadmin_env/server.py` now uses an **`HttpSessionStore`** (lru-bounded
  `OrderedDict` of `EpisodeSlot`s) keyed on a uuid `episode_id`, so
  `group_size > 1` rollouts no longer clobber each other
- `sysadmin_env/models.py`: `Observation` gained `grader_health`,
  `grader_details`, `ood_http_code`; `StepRequest` gained optional
  `episode_id`
- `training/remote_env.py`: client stores the `episode_id` from `/reset`
  and forwards it on every `/step`; reads the new observation fields
  into `info`
- `training/rollout.py`: `RolloutRecord.reward` is now **cumulative**,
  plus a new `best_health` peak-health tracker and `last_reward` tail
- `training/reward_functions.py`: `solve_reward` now triggers on
  `terminated` (not `reward >= 1.0` which never fired);
  `progress_reward` consumes `best_health` / `grader_health` with a
  cumulative-reward fallback for backward compat with older servers;
  `efficiency_reward` mirrors the terminated-flag logic
- `training/hpc_openenv_gemma.py`: default `--model` now
  `Qwen/Qwen2.5-Coder-7B-Instruct` (kaggle a100 profile); default
  `--max-turns` bumped from 16
  β†’ 24 (multi-step scenarios routinely take 10+ turns on a 1.5b model)
- the hf space at `huggingmenfordays/enterprise-hpc-openenv` has been
  force-pushed with these changes

**before your next kaggle run**: `git pull` inside `/kaggle/working/repo`
to grab these fixes. the live space has already been rebuilt.

## 1 [BLOCKER] capture a reward curve on a real gpu

**partial credit already banked**: `docs/assets/reward_curve_demo.png`
is committed β€” the gpu-free curriculum-annealed reward probe in
`tools/reward_curve_demo.py` proves the shaped reward signal has a
learnable gradient (0.03 β†’ 0.51 over 24 curriculum steps). judges see
a real curve immediately. run `make reward-demo` to regenerate it.

we still want a real gpu grpo run for the "we trained a model" story:

### what to run

open `training/hpc_colab.ipynb` in colab (pick L4 or A100, free T4 also
works at group-size 2). run every cell. cell 6 now runs the gpu-free
probe and inlines the png. cell 8 is the real grpo run. once that is
done:

```
# in colab
import matplotlib.pyplot as plt
# cell 10 already plots from runs/*.metrics.jsonl, just save the figure
plt.savefig('reward_curve.png', dpi=150, bbox_inches='tight')
```

### what I need back

1. a png of the real grpo curve (save as `docs/assets/reward_curve.png`)
2. the final `runs/hpc_grpo_local/hpc_openenv_gemma.metrics.jsonl`
3. optionally: push the lora adapter to `huggingface.co/<you>/hpc-grpo-qwen2.5-coder-7b`

once those are in the repo I will update `docs/pitch.md`, `docs/hf_blog.md`,
and `README.md` to inline the chart and link the hub artifacts.

## 2 [BLOCKER] deploy the openenv server to a hf space - DONE

space: https://huggingface.co/spaces/huggingmenfordays/enterprise-hpc-openenv
live url: https://huggingmenfordays-enterprise-hpc-openenv.hf.space

### pushing updates to the space

you only need the orphan-branch trick because our git history has
`.venv/` + `docs/assets/*.png` binaries that hf xet will reject. do not
try `git push space final-round:main` directly β€” it will fail with
`pre-receive hook declined`. use this instead:

```bash
hf auth login                                     # once per machine

git remote set-url space https://huggingface.co/spaces/huggingmenfordays/enterprise-hpc-openenv

git checkout --orphan space-deploy
git rm -rf --cached .
rm -f docs/assets/reward_curve_demo.png           # any binary that would trip xet

git add -A
git commit -m "deploy: clean snapshot for hf space"
git push space space-deploy:main --force

git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png
```

that force-pushes a one-commit history-less snapshot to the space's
`main`. your local `final-round` is untouched. full explanation lives
in [`docs/hf_spaces_deploy.md`](./docs/hf_spaces_deploy.md) Β§2.1.

_original instructions below for reference_

## 2 [reference] deploy the openenv server to a hf space

judges will click "try it" in the submission form. without a live space
they cannot hit the env.

### steps

1. `huggingface-cli login` with a token that has space-write permission
2. from this repo:
   ```bash
   huggingface-cli repo create enterprise-hpc-openenv \
     --type space --space_sdk docker
   git remote add space https://huggingface.co/spaces/<you>/enterprise-hpc-openenv
   git push space main
   ```
3. wait for the docker build (5-10 min first time)
4. confirm `curl https://<you>-enterprise-hpc-openenv.hf.space/health` returns 200
5. send me the URL and I will wire it into `openenv.yaml` and the pitch

### notes

the existing `Dockerfile` is already tuned. apparmor may block
`fuse-overlayfs`, the copy fallback (p50 ~2.4 ms) still hits the latency
target. if the build errors on `bubblewrap`, we can add `apt-get install -y`
for it.

## 3 [BLOCKER] record a 90-second demo video

the video is part of most hackathon submissions. script is in
`docs/video_script.md`.

### shots to capture

1. `make gold` β€” quick pass, proves determinism (5 s)
2. `make bench` β€” show the 2.40 ms p50 number (10 s)
3. `make eval` β€” cat the leaderboard markdown (15 s)
4. the live agent solving `hpc_pid_stale` via
   `python -m training.train_hpc_outage --dry-run --group-size 1` or a
   trained checkpoint (40 s)
5. the reward curve chart (20 s)

record with OBS or the built-in macOS screen recorder, upload to
youtube or HF, paste the URL into `README.md` under a "demo" section and I
will finalize.

## 4 [BONUS] give me access to a space url so I can wire things up

once task 2 is done, paste the URL here and I will:

- update `openenv.yaml` `runtime.server_entry_point`
- add a "Try the env live" section to `README.md` and the HF blog
- update `docs/pitch.md` to reference the live URL in the q&a prep

## 5 [BONUS] run a longer training session and push to the hub

once task 1 is done and the pipeline is validated:

```bash
python -m training.hpc_openenv_gemma \
  --env-urls https://<you>-enterprise-hpc-openenv.hf.space \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --num-train-steps 600 \
  --group-size 8 --max-turns 16 \
  --hub-repo <you>/hpc-grpo-qwen2.5-coder-7b \
  --wandb-project hpc-grpo
```

600 steps at group-size 8 takes ~3 hours on an A100. this is what gets you
"we actually trained a model that beats the baseline" for the rubric.

## 6 [POLISH] submission form metadata

when you fill out the form:

- **theme**: #3.1 World Modeling / Professional Tasks β€” specifically
  the Scaler AI Labs Multi-App RL Environment for Enterprise Workflows
  sub-theme. **single-theme submission**; do not list #2 as a secondary
  theme on the form (long-horizon planning falls out of the env as a
  property, not a separate theme claim)
- **tagline**: "EnterpriseHPC-v0 β€” a multi-app, sub-3 ms-reset HPC SRE
  environment. Qwen2.5-Coder-7B learns to diagnose a 224-core Rocky
  Linux cluster end-to-end."
- **links**: github repo, hf space, hf model repo, colab, video
- **highlights**: multi-app (Slurm + OOD Apache + SSH + OverlayFS +
  NVIDIA driver + NFS + systemd + Munge), multi-node (nested bwrap),
  **six deterministic HPC scenarios** (`hpc_outage`, `hpc_munge`,
  `hpc_pid_stale`, `hpc_gpu_ecc`, `hpc_nfs_stale`, `hpc_ood_apache`)
  plus three warm-up curriculum scenarios (`nginx_crash`, `disk_full`,
  `network_broken`), <3 ms reset, gpu-free reward-curve demo in-repo,
  trained with TRL + Unsloth + `Qwen/Qwen2.5-Coder-7B-Instruct`.

## 7 [POLISH] things I can do as soon as you unblock

once you have a GPU + HF account handy:

- [ ] add the reward curve PNG to `docs/pitch.md` and `docs/hf_blog.md`
- [ ] update `README.md` with the live HF Space URL
- [ ] add a "trained checkpoint" section pointing at your HF model repo
- [ ] write the final HF blog post draft and submit it
- [ ] extend the scenario set if you want (see [extra ideas](#extra-ideas))

## 8 [BLOCKER] submit the darn thing

don't forget to actually click submit. past hackathon winners all had a
running demo URL, a reward curve, and a 60-second elevator pitch.

---

## extra ideas (if we still have time)

already shipped for round 2:

- βœ… **`hpc_gpu_ecc`** β€” compute node drained due to nvidia-smi ECC
  errors. fix loop: `sinfo`, `ssh compute-01`, `nvidia-smi`,
  `nvidia-smi -r -i 0`, `systemctl restart slurmd`, `exit`, `sinfo`
- βœ… **`hpc_nfs_stale`** β€” `/mnt/shared` stale nfs handle after a
  server failover. fix loop: `ls /mnt/shared` (errors), `umount -l
  /mnt/shared`, `mount /mnt/shared`, `systemctl restart slurmd`
- βœ… **`hpc_ood_apache`** β€” open ondemand portal degraded because of a
  httpd config typo on `:8081`. fix loop: `curl -I
  http://localhost:8081/` (502), `cat /etc/httpd/conf/httpd.conf`,
  `apachectl configtest`, `printf '<fixed>' > httpd.conf`,
  `apachectl graceful`, `curl -I http://localhost:8081/` (200)

still on the wishlist if we have extra time:

- **multi-node ssh traversal** β€” add compute-02 for a partition
  imbalance scenario
- **`hpc_cgroup_oom`** β€” slurmd kills jobs because a system cgroup
  limit is set too low; fix by editing `/etc/slurm/cgroup.conf`
- **`hpc_ldap_auth`** β€” user cannot ssh because sssd lost contact
  with ldap; fix by restarting sssd and clearing `/var/lib/sss/db`

tell me which you want and I will drop them in (each one is ~150 loc).

---

## checklist to ship

- [ ] 1. reward curve captured and committed
- [ ] 2. HF Space deployed
- [ ] 3. demo video recorded
- [ ] 4. HF Space URL in this repo
- [ ] 5. trained checkpoint on the hub
- [ ] 6. submission form filled
- [ ] 7. final PR merged and tagged
- [ ] 8. submitted βœ