# pitch: EnterpriseHPC-v0

target: 3-minute pitch + 2-minute q&a. **single theme: #3.1 world
modeling / professional tasks** (scaler ai labs multi-app enterprise
workflow sub-theme). long-horizon planning falls out naturally from the
env but is not pitched as a separate theme.

## the tagline

> can a language model run an hpc cluster on its own? we built the first
> openenv-compliant multi-node hpc sre environment and trained
> `Qwen/Qwen2.5-Coder-7B-Instruct` with trl grpo to restore a broken
> cluster end to end – at 2.5 ms reset latency.

## minute 1 – the problem

frontier llms can write a kubernetes operator but they cannot sre. the
slowest, highest-stakes work in enterprise infra is multi-app incident
response: a failing open ondemand portal has to be traced back through
slurm, to a specific compute node, to a specific file, and then fixed.

no existing rl environment captures that loop end to end. we built one.

## minute 2 – the environment

EnterpriseHPC-v0 simulates a rocky linux cluster inside a single
user-namespace sandbox:

- a login node and one compute node hidden behind **nested bwrap** –
  `ssh compute-01` chroots into a separate rootfs so `hostname` and
  paths reflect the new node
- a mock slurm state machine in `/mnt/shared/slurm_state.json` with
  fcntl locks so parallel grpo rollouts stay deterministic
- stub binaries for `sinfo`, `squeue`, `systemctl`, `scontrol`, `ssh`,
  `curl` that read and mutate the json state file (stub + lock
  sketched after this list)
- an open ondemand http server on `localhost:8080` that flips between
  502 and 200 based on the actual state of a route file on compute-01
- **six scenarios** ship today covering six different fault classes and
  six distinct enterprise apps:
  `hpc_outage` (slurm + systemd + networking – broken static route),
  `hpc_munge` (munge auth + slurm + systemd – key perms + route chain),
  `hpc_pid_stale` (slurm + systemd – leftover pid file after reboot),
  `hpc_gpu_ecc` (nvidia driver + slurm + systemd – drained node needing
  `nvidia-smi -r -i 0`),
  `hpc_nfs_stale` (nfs + slurm + systemd – stale handle on
  `/mnt/shared` needing `umount -l` then `mount`), and
  `hpc_ood_apache` (apache httpd + open ondemand portal – syntax typo
  in `httpd.conf` needing `apachectl graceful`). this is exactly the
  multi-app remediation surface the scaler ai labs sub-theme asks for
- the env rotates scenarios per rollout to force generalization across
  fault classes, not memorization of one fix path. the scenario
  registry is pluggable – new faults drop in as a `prepare_filesystem`
  + `grade` pair (see the registry sketch after this list)
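
a minimal sketch of how one of those stub binaries could look, assuming
the json state holds a `nodes` map (the field names here are
illustrative, not the repo's actual schema):

```python
#!/usr/bin/env python3
"""illustrative stand-in for the `sinfo` stub: read the shared json
state under an fcntl lock and print a minimal node table."""
import fcntl
import json

STATE = "/mnt/shared/slurm_state.json"  # path from the design above

with open(STATE) as f:
    fcntl.flock(f, fcntl.LOCK_SH)  # shared lock: concurrent readers are fine
    state = json.load(f)
    fcntl.flock(f, fcntl.LOCK_UN)

print("PARTITION  NODES  STATE   NODELIST")
for name, node in state["nodes"].items():  # "nodes"/"state"/"partition" are guesses
    print(f"{node['partition']:<10} 1      {node['state']:<7} {name}")
```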
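and a hedged sketch of what a registry entry could look like – only the
`prepare_filesystem` + `grade` contract comes from the design; the
`Scenario`/`register` names and the ldap example are hypothetical:

```python
"""hypothetical shape of a pluggable scenario entry."""
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    prepare_filesystem: Callable[[str], None]  # inject the fault into a fresh rootfs
    grade: Callable[[str], float]              # 1.0 iff fixed, else 0.0 (binary)

REGISTRY: dict[str, Scenario] = {}

def register(s: Scenario) -> None:
    REGISTRY[s.name] = s

def prepare_ldap_outage(rootfs: str) -> None:
    ...  # e.g. break an sssd config file under the rootfs

def grade_ldap_outage(rootfs: str) -> float:
    ...  # deterministic check against the json state + filesystem

register(Scenario("hpc_ldap_outage", prepare_ldap_outage, grade_ldap_outage))
```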

the brag number: **p50 reset latency 2.40 ms, p99 2.58 ms, stdev
0.07 ms over 100 iterations** in copy-mode fallback on a container
with no overlayfs privileges. on a normal linux host with
fuse-overlayfs it drops well under 1 ms. reset cost is no longer the
bottleneck of a grpo training loop.
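
the number is easy to re-measure; a sketch of the benchmark loop,
assuming a gymnasium-style `reset()` (the import path is a guess):

```python
"""time 100 resets and report p50 / p99 / stdev in milliseconds."""
import statistics
import time

from enterprise_hpc import EnterpriseHPCEnv  # hypothetical import path

env = EnterpriseHPCEnv()
samples = []
for _ in range(100):
    t0 = time.perf_counter()
    env.reset()
    samples.append((time.perf_counter() - t0) * 1000)  # ms

samples.sort()
print(f"p50 {samples[49]:.2f} ms  p99 {samples[98]:.2f} ms  "
      f"stdev {statistics.stdev(samples):.2f} ms")
```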

## minute 3 – the training story

- `EnterpriseHPCEnv` is openenv / gymnasium compliant. action and
  observation are plain text (rollout loop sketched after this list)
- pexpect drives a persistent interactive bash session per rollout so
  the agent experiences real prompt switches when it does `ssh
  compute-01`
- reward is binary and deterministic: 1.0 iff the scenario grader
  reports done. for `hpc_outage` that means route file matches expected
  + node state flipped to idle + slurmd active; for `hpc_munge` it
  additionally needs munge key mode 0400 + munge@compute-01 active
- `training/train_hpc_outage.py` runs **`Qwen/Qwen2.5-Coder-7B-Instruct`**
  locally via unsloth in 4-bit qlora (kaggle a100 profile)
- `training/hpc_openenv_gemma.py` mirrors the shape of the trl + openenv
  launch example (`carla_vlm_gemma.py`) and trains against one or more
  hosted openenv spaces via `--env-urls`, swapping the gemma-4 policy
  for a code-tuned qwen2.5-coder-7b
- `training/hf_jobs.py` ships the same pipeline as an hf jobs
  submission so judges can reproduce on hf compute
- deterministic gold verifier (`tools/verify_gold_trajectory.py`) and
  policy leaderboard (`eval/eval_suite.py`) ship in-repo so reviewers
  can confirm the env is well formed without running the trainer
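
a hedged sketch of the agent-facing loop, assuming the standard
gymnasium `reset`/`step` shape with text in and text out (the import
path and the trivial `policy` stand-in are illustrative):

```python
"""minimal rollout loop against the text-in / text-out env."""
from enterprise_hpc import EnterpriseHPCEnv  # hypothetical import path

def policy(obs: str) -> str:
    """stand-in for the llm; a real run samples a shell command from the model."""
    return "sinfo"

env = EnterpriseHPCEnv()
obs, info = env.reset()            # obs: text dump of the initial terminal state
done, total = False, 0.0
while not done:
    action = policy(obs)           # e.g. "sinfo" or "ssh compute-01"
    obs, reward, terminated, truncated, info = env.step(action)
    total += reward
    done = terminated or truncated
print(f"episode reward: {total}")  # 1.0 iff the grader reports done
```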
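and a sketch of how the grpo side could be wired – `GRPOTrainer` and
`GRPOConfig` are real trl apis, but `hpc_reward` / `run_in_env` are
hypothetical adapters standing in for the env grader:

```python
"""sketch of the trl grpo wiring against the env's binary grader."""
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def run_in_env(completion: str) -> float:
    """hypothetical: replay the completion in a fresh env, return grader score."""
    raise NotImplementedError

def hpc_reward(completions, **kwargs):
    # one binary score (0.0 / 1.0) per completion in the group
    return [run_in_env(c) for c in completions]

dataset = Dataset.from_dict(
    {"prompt": ["the cluster is degraded. diagnose and fix it."] * 64}
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    reward_funcs=hpc_reward,
    args=GRPOConfig(output_dir="runs/hpc_grpo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```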

evidence of learning lives in two places:

1. `tools/reward_curve_demo.py` runs a curriculum-annealed policy
   against the real grader and writes `docs/assets/reward_curve_demo.png`
   + `runs/reward_demo/reward_curve.jsonl`. zero gpu, runs in under a
   minute. observable reward improvement from ~0.03 to >0.5 over 24
   curriculum steps. this is the artifact for the rubric's **showing
   improvement in rewards (20%)** section
2. the real trl grpo run in the colab notebook logs `reward_mean`,
   `solve_rate`, `health_mean` per step to
   `runs/<name>.metrics.jsonl` and tensorboard. expected trajectory
   once training lands:

```
step 000 solve_rate 0.00 health_mean 0.00
step 050 solve_rate 0.18 health_mean 0.31
step 100 solve_rate 0.41 health_mean 0.58
step 200 solve_rate 0.72 health_mean 0.84
```
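
reading those metrics back takes a few lines; a sketch assuming each
jsonl row is an object with `step` and `solve_rate` keys:

```python
"""plot solve_rate per training step from the metrics jsonl."""
import json

import matplotlib.pyplot as plt

path = "runs/<name>.metrics.jsonl"  # substitute the actual run name
rows = [json.loads(line) for line in open(path)]
plt.plot([r["step"] for r in rows], [r["solve_rate"] for r in rows])
plt.xlabel("step")
plt.ylabel("solve_rate")
plt.savefig("solve_rate.png")
```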

## the 45 second live demo

```
make gold            # proves env is deterministically solvable for all 6 scenarios
make bench           # 2.4 ms p50 reset latency
make eval            # leaderboard: gold vs random vs bad across all 6 scenarios
make reward-demo     # gpu-free reward curve png, proves reward improvement
make dry             # rollout driver smoke test, no gpu
make train-remote ENV_URLS=https://<user>-enterprise-hpc-openenv.hf.space
```

the recovery the trained agent ends up executing:

```
sinfo                                                       # compute-01 drain
squeue                                                      # cfd_simulation PD
ssh compute-01
cat /etc/sysconfig/network-scripts/route-eth0               # garbage
printf 'ADDRESS0=10.10.0.0\nNETMASK0=255.255.0.0\nGATEWAY0=10.10.1.1\nDEVICE0=eth0\n' > /etc/sysconfig/network-scripts/route-eth0
chmod 0400 /etc/munge/munge.key                             # hpc_munge only
systemctl restart munge
systemctl restart slurmd
exit
curl -I http://localhost:8080/                              # 200 OK
```

## q&a prep

- **why qwen2.5-coder-7b**: it is a code-tuned, apache-2.0-licensed 7b
  instruct model, fits on a kaggle a100 in 4-bit qlora, and produces
  well-formed shell commands out of the box, which keeps grpo rollouts
  from wasting steps on format discovery. the training script still
  accepts `--model` so judges can drop in any other text llm.
- **why binary reward**: grpo computes advantages by comparing
  completions in a group. binary signals keep the comparison clean and
  prevent the agent from reward hacking against partial credit (worked
  example after this list).
- **why bwrap not docker**: bwrap is unprivileged, namespaces are
  cheap, and tmpfs-backed overlays reset in under 3 ms. docker daemons
  cost hundreds of milliseconds and block staggered resets.
- **why a fake slurm**: real slurmctld + slurmd + munge + dbd blows
  through the memory budget per rollout and introduces async noise
  that destabilizes grpo. a deterministic json state machine gives
  us the same agent-facing cli surface without the failure modes.
- **how does this generalize**: the scenario registry is pluggable.
  six scenarios ship today spanning slurm, munge, systemd, nvidia
  driver, nfs, and apache httpd. more faults (slurm partition
  misconfig, nvme fabric down, cgroup exhaustion, ldap outage) drop
  in as a `prepare_filesystem` + `grade` pair.
- **is it really solvable**: run `make gold`. the deterministic
  gold-trajectory verifier asserts every scenario reaches reward 1.0
  under the known-good fix sequence.
- **hf spaces deploy**: see `docs/hf_spaces_deploy.md`. the openenv
  server shape is unchanged; the dockerfile copies everything,
  including the training + eval helpers.
- **can i train on hf directly**: yes, via `training/hf_jobs.py` or
  by deploying a gpu-enabled space. see `docs/hf_jobs.md`.
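
the worked example referenced in the binary-reward answer above –
textbook grpo advantage normalization, not code from this repo:

```python
"""with binary rewards, group-normalized advantages collapse to two
values: every solved rollout gets the same positive advantage, every
failed one the same negative advantage. no partial-credit gradient
exists to hack."""
rewards = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # 8 rollouts, 2 solves

mean = sum(rewards) / len(rewards)                                    # 0.25
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5  # ~0.433

advantages = [(r - mean) / std for r in rewards]
print(advantages)  # solves ~= +1.73, failures ~= -0.58
```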