
pitch: EnterpriseHPC-v0

target: 3 minute pitch + 2 minute q&a. single theme: #3.1 world modeling / professional tasks (scaler ai labs multi-app enterprise workflow sub-theme). long-horizon planning falls out naturally from the env but is not pitched as a separate theme.

the tagline

can a language model run an hpc cluster on its own? we built the first openenv-compliant multi-node hpc sre environment and trained Qwen/Qwen2.5-Coder-7B-Instruct with trl grpo to restore a broken cluster end to end, at a 2.4 ms p50 reset latency.

minute 1 β€” the problem

frontier llms can write a kubernetes operator but they cannot sre. the slowest, highest stakes work in enterprise infra is multi-app incident response: a failing open ondemand portal has to be traced back through slurm, to a specific compute node, to a specific file, and then fixed.

no existing rl environment captures that loop end to end. we built one.

minute 2 β€” the environment

EnterpriseHPC-v0 simulates a rocky linux cluster inside a single user-namespace sandbox:

  • a login node and one compute node hidden behind nested bwrap: ssh compute-01 chroots into a separate rootfs so hostname and paths reflect the new node
  • a mock slurm state machine in /mnt/shared/slurm_state.json with fcntl locks so parallel grpo rollouts stay deterministic
  • stub binaries for sinfo, squeue, systemctl, scontrol, ssh, curl that read and mutate the json state file (see the sketch after this list)
  • an open ondemand http server on localhost:8080 that flips between 502 and 200 based on the actual state of a route file on compute-01
  • six scenarios ship today, covering six fault classes and six distinct enterprise apps: exactly the multi-app remediation surface the scaler ai labs sub-theme asks for
    • hpc_outage: slurm + systemd + networking (broken static route)
    • hpc_munge: munge auth + slurm + systemd (key perms + route chain)
    • hpc_pid_stale: slurm + systemd (leftover pid file after reboot)
    • hpc_gpu_ecc: nvidia driver + slurm + systemd (drained node needing nvidia-smi -r -i 0)
    • hpc_nfs_stale: nfs + slurm + systemd (stale handle on /mnt/shared needing umount -l then mount)
    • hpc_ood_apache: apache httpd + open ondemand portal (syntax typo in httpd.conf needing apachectl graceful)
  • the env rotates scenarios per rollout to force generalization across fault classes, not memorization of one fix path. the scenario registry is pluggable: new faults drop in as a prepare_filesystem + grade pair
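
to make the stub-binary pattern concrete, here is a minimal sketch of what a sinfo-style stub could look like. the json schema (a nodes list with name/partition/state fields) is illustrative, not the repo's actual layout:

#!/usr/bin/env python3
# sinfo-style stub sketch: read the shared json state under an fcntl lock so
# parallel grpo rollouts never see a half-written state file. the schema here
# is illustrative, not the repo's actual layout.
import fcntl
import json

STATE = "/mnt/shared/slurm_state.json"

with open(STATE) as f:
    fcntl.flock(f, fcntl.LOCK_SH)   # shared lock: many readers, no writer
    state = json.load(f)
    fcntl.flock(f, fcntl.LOCK_UN)

print("PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST")
for node in state.get("nodes", []):
    print(f"{node['partition']:<9} up     infinite   1      {node['state']:<6} {node['name']}")

a mutating stub like scontrol would take LOCK_EX instead and rewrite the file in one shot, which is what keeps parallel rollouts deterministic.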

the brag number: p50 reset latency 2.40 ms, p99 2.58 ms, stdev 0.07 ms over 100 iterations in copy-mode fallback on a container with no overlayfs privileges. on a normal linux host with fuse-overlayfs it drops well under 1 ms. reset cost is no longer the bottleneck of a grpo training loop.
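
a latency number of that shape is cheap to reproduce. the micro-benchmark below assumes only that the env exposes reset(); everything else is generic python:

# illustrative reset-latency micro-benchmark; assumes an env object with a
# reset() method, nothing repo-specific.
import statistics
import time

def bench_reset(env, iters=100):
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        env.reset()
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()
    p50 = samples[len(samples) // 2]
    p99 = samples[min(iters - 1, int(iters * 0.99))]
    print(f"p50 {p50:.2f} ms  p99 {p99:.2f} ms  stdev {statistics.stdev(samples):.2f} ms")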

minute 3 β€” the training story

  • EnterpriseHPCEnv is openenv / gymnasium compliant. action and observation are plain text (see the loop sketch after this list)
  • pexpect drives a persistent interactive bash session per rollout so the agent experiences real prompt switches when it does ssh compute-01
  • reward is binary and deterministic: 1.0 iff the scenario grader reports done. for hpc_outage that means route file matches expected + node state flipped to idle + slurmd active; for hpc_munge it additionally needs munge key mode 0400 + munge@compute-01 active
  • training/train_hpc_outage.py runs Qwen/Qwen2.5-Coder-7B-Instruct locally via unsloth in 4-bit qlora (kaggle a100 profile)
  • training/hpc_openenv_gemma.py mirrors the shape of the trl + openenv launch example (carla_vlm_gemma.py) and trains against one or more hosted openenv spaces via --env-urls, swapping the example's gemma policy for a code-tuned qwen2.5-coder-7b
  • training/hf_jobs.py ships the same pipeline as an hf jobs submission so judges can reproduce on hf compute
  • deterministic gold verifier (tools/verify_gold_trajectory.py) and policy leaderboard (eval/eval_suite.py) ship in-repo so reviewers can confirm the env is well formed without running the trainer
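
the loop the trainer drives, sketched against the gymnasium-style text interface described above. the import path and the policy() call are placeholders, not the repo's actual names:

# hedged sketch of the agent-facing rollout loop. the import path and the
# policy() call are placeholders; reset/step follow the gymnasium convention.
from enterprise_hpc import EnterpriseHPCEnv  # hypothetical import path

env = EnterpriseHPCEnv()
obs, info = env.reset()                 # plain-text shell prompt + fault brief
done = False
while not done:
    action = policy(obs)                # llm emits one shell command as text
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
print("reward:", reward)                # 1.0 iff the scenario grader reports done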

evidence of learning lives in two places:

  1. tools/reward_curve_demo.py runs a curriculum-annealed policy against the real grader and writes docs/assets/reward_curve_demo.png + runs/reward_demo/reward_curve.jsonl. zero gpu, runs in under a minute. observable reward improvement from ~0.03 to >0.5 over 24 curriculum steps. this is the artifact for the rubric's "showing improvement in rewards" (20%) section
  2. the real trl grpo run in the colab notebook logs reward_mean, solve_rate, health_mean per step to runs/<name>.metrics.jsonl and tensorboard. expected trajectory once training lands:
step 000 solve_rate 0.00 health_mean 0.00
step 050 solve_rate 0.18 health_mean 0.31
step 100 solve_rate 0.41 health_mean 0.58
step 200 solve_rate 0.72 health_mean 0.84
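
to eyeball that trend without tensorboard, a few lines of python can read the jsonl directly. the step key is an assumed field name; reward_mean, solve_rate, and health_mean come straight from the logging described above:

# read the per-step metrics jsonl and print the learning trend.
# the "step" key is an assumed field name; adjust to the actual log schema.
import json

with open("runs/demo.metrics.jsonl") as f:  # <name> = demo, illustrative
    for line in f:
        m = json.loads(line)
        print(f"step {m['step']:03d} solve_rate {m['solve_rate']:.2f} "
              f"health_mean {m['health_mean']:.2f}")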

the 45 second live demo

make gold            # proves env is deterministically solvable for all 6 scenarios
make bench           # 2.4 ms p50 reset latency
make eval            # leaderboard: gold vs random vs bad across all 6 scenarios
make reward-demo     # gpu-free reward curve png, proves reward improvement
make dry             # rollout driver smoke test, no gpu
make train-remote ENV_URLS=https://<user>-enterprise-hpc-openenv.hf.space

the recovery sequence the trained agent ends up executing (hpc_outage path; hpc_munge-only steps marked):

sinfo                                                       # compute-01 drain
squeue                                                      # cfd_simulation PD
ssh compute-01
cat /etc/sysconfig/network-scripts/route-eth0               # garbage
printf 'ADDRESS0=10.10.0.0\nNETMASK0=255.255.0.0\nGATEWAY0=10.10.1.1\nDEVICE0=eth0\n' > /etc/sysconfig/network-scripts/route-eth0
chmod 0400 /etc/munge/munge.key                             # hpc_munge only
systemctl restart munge                                     # hpc_munge only
systemctl restart slurmd
exit
curl -I http://localhost:8080/                              # 200 OK

q&a prep

  • why qwen2.5-coder-7b: it is a code-tuned, apache 2 licensed 7b instruct model, fits on a kaggle a100 in 4-bit qlora, and produces well-formed shell commands out of the box which keeps grpo rollouts from wasting steps on format discovery. the training script still accepts --model so judges can drop in any other text llm.
  • why binary reward: grpo computes advantages by comparing completions in a group. binary signals keep the comparison clean and prevent the agent from reward hacking against partial credit.
  • why bwrap not docker: bwrap is unprivileged, namespaces are cheap, tmpfs-backed overlay resets under 3 ms. docker daemons cost hundreds of milliseconds and block staggered resets.
  • why a fake slurm: real slurmctld + slurmd + munge + dbd blows through the memory budget per rollout and introduces async noise that destabilizes grpo. a deterministic json state machine gives us the same agent-facing cli surface without the failure modes.
  • how does this generalize: the scenario registry is pluggable. six scenarios ship today spanning slurm, munge, systemd, nvidia driver, nfs, and apache httpd. more faults (slurm partition misconfig, nvme fabric down, cgroup exhaustion, ldap outage) drop in as a prepare_filesystem + grade pair (see the sketch at the end of this list)
  • is it really solvable: run make gold. the deterministic gold-trajectory verifier asserts every scenario reaches reward 1.0 in the known-good fix sequence.
  • hf spaces deploy: see docs/hf_spaces_deploy.md. the openenv server shape is unchanged, the dockerfile copies everything including training + eval helpers.
  • can i train on hf directly: yes, via training/hf_jobs.py or by deploying a gpu-enabled space. see docs/hf_jobs.md.
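
for reference, the prepare_filesystem + grade pair mentioned above could look like the sketch below. the registry api, schema, and state layout are illustrative, not the repo's actual interface:

# hedged sketch of a pluggable scenario: one function seeds the fault, one
# grades the fix with a binary reward. names and schema are illustrative.
import json
import os

def prepare_filesystem(root):
    # seed the fault: corrupt the static route file on the compute node rootfs
    route = os.path.join(root, "etc/sysconfig/network-scripts/route-eth0")
    with open(route, "w") as f:
        f.write("GARBAGE\n")

def grade(root, state_path="/mnt/shared/slurm_state.json"):
    # binary reward: route file restored, node back to idle, slurmd active
    route = os.path.join(root, "etc/sysconfig/network-scripts/route-eth0")
    with open(route) as f:
        route_ok = "ADDRESS0=10.10.0.0" in f.read()
    with open(state_path) as f:
        node = json.load(f)["nodes"][0]          # schema is illustrative
    fixed = route_ok and node["state"] == "idle" and node.get("slurmd") == "active"
    return 1.0 if fixed else 0.0

SCENARIOS = {"hpc_outage_v2": (prepare_filesystem, grade)}  # hypothetical entry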