# pitch: EnterpriseHPC-v0
target: 3-minute pitch + 2-minute q&a. **single theme: #3.1 world
modeling / professional tasks** (scaler ai labs multi-app enterprise
workflow sub-theme). long-horizon planning falls out naturally from the
env but is not pitched as a separate theme.
## the tagline
> can a language model run an hpc cluster on its own? we built the first
> openenv-compliant multi-node hpc sre environment and trained
> `Qwen/Qwen2.5-Coder-7B-Instruct` with trl grpo to restore a broken
> cluster end to end, at a reset latency of two and a half milliseconds.
## minute 1 β€” the problem
frontier llms can write a kubernetes operator but they cannot sre. the
slowest, highest stakes work in enterprise infra is multi-app incident
response: a failing open ondemand portal has to be traced back through
slurm, to a specific compute node, to a specific file, and then fixed.
no existing rl environment captures that loop end to end. we built one.
## minute 2 β€” the environment
EnterpriseHPC-v0 simulates a rocky linux cluster inside a single
user-namespace sandbox:
- a login node and one compute node hidden behind **nested bwrap**:
`ssh compute-01` chroots into a separate rootfs so `hostname` and
paths reflect the new node
- a mock slurm state machine in `/mnt/shared/slurm_state.json` with
fcntl locks so parallel grpo rollouts stay deterministic
- stub binaries for `sinfo`, `squeue`, `systemctl`, `scontrol`, `ssh`,
`curl` that read and mutate the json state file
- an open ondemand http server on `localhost:8080` that flips between
502 and 200 based on the actual state of a route file on compute-01
- **six scenarios** ship today covering six different fault classes and
six distinct enterprise apps:
  `hpc_outage` (slurm + systemd + networking: broken static route),
  `hpc_munge` (munge auth + slurm + systemd: key perms + route chain),
  `hpc_pid_stale` (slurm + systemd: leftover pid file after reboot),
  `hpc_gpu_ecc` (nvidia driver + slurm + systemd: drained node needing
  `nvidia-smi -r -i 0`),
  `hpc_nfs_stale` (nfs + slurm + systemd: stale handle on
  `/mnt/shared` needing `umount -l` then `mount`), and
  `hpc_ood_apache` (apache httpd + open ondemand portal: syntax typo
  in `httpd.conf` needing `apachectl graceful`). this is exactly the
multi-app remediation surface the scaler ai labs sub-theme asks for
- the env rotates scenarios per rollout to force generalization across
fault classes, not memorization of one fix path. the scenario
registry is pluggable β€” new faults drop in as a `prepare_filesystem`
+ `grade` pair
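as a sketch of what "a `prepare_filesystem` + `grade` pair" could look like (the `Scenario` dataclass, registry name, and file layout here are illustrative assumptions, not the repo's actual classes):

```python
from dataclasses import dataclass
from typing import Callable

ROUTE_FILE = "etc/sysconfig/network-scripts/route-eth0"  # assumed path inside the rootfs
EXPECTED_ROUTE = (
    "ADDRESS0=10.10.0.0\nNETMASK0=255.255.0.0\n"
    "GATEWAY0=10.10.1.1\nDEVICE0=eth0\n"
)

@dataclass
class Scenario:
    name: str
    prepare_filesystem: Callable[[str], None]  # seed the fault into a fresh rootfs
    grade: Callable[[str], float]              # 1.0 iff the fault is fully remediated

SCENARIOS: dict[str, Scenario] = {}

def register(scenario: Scenario) -> None:
    SCENARIOS[scenario.name] = scenario

def _prepare_route_fault(rootfs: str) -> None:
    # write garbage into the static route file so compute-01 loses its path
    with open(f"{rootfs}/{ROUTE_FILE}", "w") as f:
        f.write("GARBAGE\n")

def _grade_route_fault(rootfs: str) -> float:
    # binary grade: exact match against the known-good route file
    with open(f"{rootfs}/{ROUTE_FILE}") as f:
        return 1.0 if f.read() == EXPECTED_ROUTE else 0.0

register(Scenario("hpc_outage", _prepare_route_fault, _grade_route_fault))
```

a new fault class only has to supply those two callables; the rollout loop never changes.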
the brag number: **p50 reset latency 2.40 ms, p99 2.58 ms, stdev
0.07 ms over 100 iterations** in copy-mode fallback on a container
with no overlayfs privileges. on a normal linux host with
fuse-overlayfs it drops well under 1 ms. reset cost is no longer the
bottleneck of a grpo training loop.
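a latency report like the one above can be reproduced with a harness along these lines (a minimal sketch; the repo's `make bench` target may compute percentiles differently, and `reset` stands in for the env's actual reset callable):

```python
import statistics
import time

def bench_reset(reset, iters: int = 100) -> dict:
    """time `reset()` over `iters` iterations and report latency stats in ms."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        reset()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "p50_ms": samples[len(samples) // 2],          # median by sorted index
        "p99_ms": samples[int(len(samples) * 0.99) - 1],
        "stdev_ms": statistics.stdev(samples),
    }
```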
## minute 3 β€” the training story
- `EnterpriseHPCEnv` is openenv / gymnasium compliant. actions and
  observations are plain text
- pexpect drives a persistent interactive bash session per rollout so
the agent experiences real prompt switches when it does `ssh
compute-01`
- reward is binary and deterministic: 1.0 iff the scenario grader
reports done. for hpc_outage that means route file matches expected
+ node state flipped to idle + slurmd active; for hpc_munge it
additionally needs munge key mode 0400 + munge@compute-01 active
- `training/train_hpc_outage.py` runs **`Qwen/Qwen2.5-Coder-7B-Instruct`**
locally via unsloth in 4-bit qlora (kaggle a100 profile)
- `training/hpc_openenv_gemma.py` mirrors the shape of the trl + openenv
launch example (`carla_vlm_gemma.py`) and trains against one or more
hosted openenv spaces via `--env-urls`, swapping the gemma-4 policy
for a code-tuned qwen2.5-coder-7b
- `training/hf_jobs.py` ships the same pipeline as an hf jobs
submission so judges can reproduce on hf compute
- deterministic gold verifier (`tools/verify_gold_trajectory.py`) and
policy leaderboard (`eval/eval_suite.py`) ship in-repo so reviewers
can confirm the env is well formed without running the trainer
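the hpc_outage grading conditions above can be sketched as a pure function over the route file and the json state machine (the json key layout here is an illustrative assumption, not the repo's exact schema):

```python
import json

EXPECTED_ROUTE = (
    "ADDRESS0=10.10.0.0\nNETMASK0=255.255.0.0\n"
    "GATEWAY0=10.10.1.1\nDEVICE0=eth0\n"
)

def grade_hpc_outage(route_path: str, slurm_state_path: str) -> float:
    """binary reward: 1.0 iff all three remediation conditions hold."""
    with open(route_path) as f:
        route_ok = f.read() == EXPECTED_ROUTE
    with open(slurm_state_path) as f:
        state = json.load(f)
    node = state["nodes"]["compute-01"]  # assumed key layout in slurm_state.json
    node_ok = node["state"] == "idle"
    slurmd_ok = node["services"]["slurmd"] == "active"
    return 1.0 if (route_ok and node_ok and slurmd_ok) else 0.0
```

because every condition is read from files on disk, the grade is deterministic across parallel rollouts.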
evidence of learning lives in two places:
1. `tools/reward_curve_demo.py` runs a curriculum-annealed policy
against the real grader and writes `docs/assets/reward_curve_demo.png`
+ `runs/reward_demo/reward_curve.jsonl`. zero gpu, runs in under a
minute. observable reward improvement from ~0.03 to >0.5 over 24
curriculum steps. this is the artifact for the rubric's **showing
improvement in rewards (20%)** section
2. the real trl grpo run in the colab notebook logs `reward_mean`,
`solve_rate`, `health_mean` per step to
`runs/<name>.metrics.jsonl` and tensorboard. expected trajectory
once training lands:
```
step 000 solve_rate 0.00 health_mean 0.00
step 050 solve_rate 0.18 health_mean 0.31
step 100 solve_rate 0.41 health_mean 0.58
step 200 solve_rate 0.72 health_mean 0.84
```
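the per-step metrics file is one json object per line, so it can be tailed and plotted while training runs. a minimal sketch of the writer/reader pair (function names are illustrative, not the repo's):

```python
import json

def log_step(path: str, step: int, reward_mean: float,
             solve_rate: float, health_mean: float) -> None:
    # append one json object per training step (jsonl)
    with open(path, "a") as f:
        f.write(json.dumps({
            "step": step,
            "reward_mean": reward_mean,
            "solve_rate": solve_rate,
            "health_mean": health_mean,
        }) + "\n")

def load_metrics(path: str) -> list[dict]:
    # reload the run for plotting or leaderboard comparison
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```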
## the 45 second live demo
```
make gold # proves env is deterministically solvable for all 6 scenarios
make bench # 2.4 ms p50 reset latency
make eval # leaderboard: gold vs random vs bad across all 6 scenarios
make reward-demo # gpu-free reward curve png, proves reward improvement
make dry # rollout driver smoke test, no gpu
make train-remote ENV_URLS=https://<user>-enterprise-hpc-openenv.hf.space
```
the recovery the trained agent ends up executing:
```
sinfo # compute-01 drain
squeue # cfd_simulation PD
ssh compute-01
cat /etc/sysconfig/network-scripts/route-eth0 # garbage
printf 'ADDRESS0=10.10.0.0\nNETMASK0=255.255.0.0\nGATEWAY0=10.10.1.1\nDEVICE0=eth0\n' > /etc/sysconfig/network-scripts/route-eth0
chmod 0400 /etc/munge/munge.key # hpc_munge only
systemctl restart munge
systemctl restart slurmd
exit
curl -I http://localhost:8080/ # 200 OK
```
## q&a prep
- **why qwen2.5-coder-7b**: it is a code-tuned, apache-2.0-licensed 7b
  instruct model, fits on a kaggle a100 in 4-bit qlora, and produces
well-formed shell commands out of the box which keeps grpo rollouts
from wasting steps on format discovery. the training script still
accepts `--model` so judges can drop in any other text llm.
- **why binary reward**: grpo computes advantages by comparing
completions in a group. binary signals keep the comparison clean and
prevent the agent from reward hacking against partial credit.
- **why bwrap not docker**: bwrap is unprivileged, namespaces are
cheap, tmpfs-backed overlay resets under 3 ms. docker daemons cost
hundreds of milliseconds and block staggered resets.
- **why a fake slurm**: real slurmctld + slurmd + munge + dbd blows
through the memory budget per rollout and introduces async noise
that destabilizes grpo. a deterministic json state machine gives
us the same agent-facing cli surface without the failure modes.
- **how does this generalize**: the scenario registry is pluggable.
six scenarios ship today spanning slurm, munge, systemd, nvidia
driver, nfs, and apache httpd. more faults (slurm partition
misconfig, nvme fabric down, cgroup exhaustion, ldap outage) drop
in as a `prepare_filesystem` + `grade` pair.
- **is it really solvable**: run `make gold`. the deterministic
gold-trajectory verifier asserts every scenario reaches reward 1.0
in the known-good fix sequence.
- **hf spaces deploy**: see `docs/hf_spaces_deploy.md`. the openenv
server shape is unchanged, the dockerfile copies everything
including training + eval helpers.
- **can i train on hf directly**: yes, via `training/hf_jobs.py` or
by deploying a gpu-enabled space. see `docs/hf_jobs.md`.
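the binary-reward argument above is easy to see in code: grpo standardizes rewards within a group of completions from the same prompt, so a clean 0/1 signal makes the sole solver stand out and leaves nothing to partial-credit hack. a minimal sketch of group-relative advantages (not trl's internal implementation):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """grpo-style advantages: standardize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

with binary rewards, a group like `[1, 0, 0, 0]` gives the solver a strong positive advantage and every failure a negative one; an all-fail group yields zero advantage everywhere, so no gradient is spent on comparing degrees of failure.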