pitch: EnterpriseHPC-v0
target: 3 minute pitch + 2 minute q&a. single theme: #3.1 world modeling / professional tasks (scaler ai labs multi-app enterprise workflow sub-theme). long-horizon planning falls out naturally from the env but is not pitched as a separate theme.
the tagline
can a language model run an hpc cluster on its own? we built the first openenv-compliant multi-node hpc sre environment and trained `Qwen/Qwen2.5-Coder-7B-Instruct` with trl grpo to restore a broken cluster end to end - at a reset latency of two and a half milliseconds.
minute 1 - the problem
frontier llms can write a kubernetes operator, but they cannot do sre. the slowest, highest-stakes work in enterprise infra is multi-app incident response: a failing open ondemand portal has to be traced back through slurm, to a specific compute node, to a specific file, and then fixed.
no existing rl environment captures that loop end to end. we built one.
minute 2 - the environment
EnterpriseHPC-v0 simulates a rocky linux cluster inside a single user-namespace sandbox:
- a login node and one compute node hidden behind nested bwrap - `ssh compute-01` chroots into a separate rootfs so `hostname` and paths reflect the new node
- a mock slurm state machine in `/mnt/shared/slurm_state.json` with fcntl locks so parallel grpo rollouts stay deterministic
- stub binaries for `sinfo`, `squeue`, `systemctl`, `scontrol`, `ssh`, `curl` that read and mutate the json state file
- an open ondemand http server on `localhost:8080` that flips between 502 and 200 based on the actual state of a route file on compute-01
- six scenarios ship today covering six different fault classes and six distinct enterprise apps: `hpc_outage` (slurm + systemd + networking - broken static route), `hpc_munge` (munge auth + slurm + systemd - key perms + route chain), `hpc_pid_stale` (slurm + systemd - leftover pid file after reboot), `hpc_gpu_ecc` (nvidia driver + slurm + systemd - drained node needing `nvidia-smi -r -i 0`), `hpc_nfs_stale` (nfs + slurm + systemd - stale handle on `/mnt/shared` needing `umount -l` then `mount`), and `hpc_ood_apache` (apache httpd + open ondemand portal - syntax typo in `httpd.conf` needing `apachectl graceful`). this is exactly the multi-app remediation surface the scaler ai labs sub-theme asks for
- the env rotates scenarios per rollout to force generalization across fault classes, not memorization of one fix path. the scenario registry is pluggable - new faults drop in as a `prepare_filesystem` + `grade` pair
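the state-machine-plus-lock idea is small enough to sketch. a minimal illustration in python, assuming a flat json schema invented here for the demo (the real schema and stub binaries live in the repo):

```python
import fcntl, json, os, tempfile

def mutate_state(path, update):
    """apply `update` to the shared slurm state file under an exclusive
    fcntl lock, so parallel grpo rollouts never see a half-written file."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until no other rollout holds the lock
        try:
            state = json.load(f)
            state.update(update)
            f.seek(0)
            json.dump(state, f)
            f.truncate()                # new state may be shorter than the old one
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return state

# demo against a throwaway file standing in for /mnt/shared/slurm_state.json
path = os.path.join(tempfile.mkdtemp(), "slurm_state.json")
with open(path, "w") as f:
    json.dump({"compute-01": "drain"}, f)
new_state = mutate_state(path, {"compute-01": "idle"})
```

a stub `scontrol` in the sandbox reduces to one such locked read-modify-write, which is what keeps parallel rollouts deterministic.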
the brag number: p50 reset latency 2.40 ms, p99 2.58 ms, stdev 0.07 ms over 100 iterations in copy-mode fallback on a container with no overlayfs privileges. on a normal linux host with fuse-overlayfs it drops well under 1 ms. reset cost is no longer the bottleneck of a grpo training loop.
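the latency claim is easy to reproduce. a minimal sketch of the timing harness, assuming any zero-argument reset callable (`env.reset` against the real env; a no-op stand-in below):

```python
import statistics, time

def bench_reset(reset_fn, iters=100):
    """time `iters` calls of an environment reset and report p50/p99/stdev in ms."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        reset_fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[min(int(iters * 0.99), iters - 1)],
        "stdev_ms": statistics.stdev(samples),
    }

# against the real env this would be bench_reset(env.reset); a no-op stands in here
stats = bench_reset(lambda: None)
```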
minute 3 - the training story
`EnterpriseHPCEnv` is openenv / gymnasium compliant. action and observation are plain text
- pexpect drives a persistent interactive bash session per rollout, so the agent experiences real prompt switches when it does `ssh compute-01`
- reward is binary and deterministic: 1.0 iff the scenario grader reports done. for hpc_outage that means route file matches expected + node state flipped to idle + slurmd active; for hpc_munge it additionally needs munge key mode 0400 + munge@compute-01 active
- `training/train_hpc_outage.py` runs `Qwen/Qwen2.5-Coder-7B-Instruct` locally via unsloth in 4-bit qlora (kaggle a100 profile)
- `training/hpc_openenv_gemma.py` mirrors the shape of the trl + openenv launch example (carla_vlm_gemma.py) and trains against one or more hosted openenv spaces via `--env-urls`, swapping the gemma-4 policy for a code-tuned qwen2.5-coder-7b
- `training/hf_jobs.py` ships the same pipeline as an hf jobs submission so judges can reproduce on hf compute
- a deterministic gold verifier (`tools/verify_gold_trajectory.py`) and policy leaderboard (`eval/eval_suite.py`) ship in-repo so reviewers can confirm the env is well formed without running the trainer
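the binary grading rule can be sketched as a pure function. this is an illustration only - the field names in `state` are invented here, and the real per-scenario graders inspect the sandbox filesystem and slurm state directly:

```python
def grade_hpc_outage(state):
    """binary grader sketch for hpc_outage: reward 1.0 iff every remediation
    condition holds, else 0.0 - no partial credit for grpo to hack against.
    `state` is a hypothetical snapshot dict, not the repo's real interface."""
    done = (
        state.get("route_file_ok") is True       # route-eth0 matches expected contents
        and state.get("node_state") == "idle"    # node flipped back from drain
        and state.get("slurmd") == "active"      # systemd unit running again
    )
    return 1.0 if done else 0.0

# a half-fixed cluster (route repaired, node still drained) scores exactly 0.0
reward = grade_hpc_outage({"route_file_ok": True, "node_state": "idle", "slurmd": "active"})
```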
evidence of learning lives in two places:
- `tools/reward_curve_demo.py` runs a curriculum-annealed policy against the real grader and writes `docs/assets/reward_curve_demo.png` and `runs/reward_demo/reward_curve.jsonl`. zero gpu, runs in under a minute. observable reward improvement from ~0.03 to >0.5 over 24 curriculum steps. this is the artifact for the rubric's "showing improvement in rewards (20%)" section
- the real trl grpo run in the colab notebook logs `reward_mean`, `solve_rate`, `health_mean` per step to `runs/<name>.metrics.jsonl` and tensorboard. expected trajectory once training lands:
step 000 solve_rate 0.00 health_mean 0.00
step 050 solve_rate 0.18 health_mean 0.31
step 100 solve_rate 0.41 health_mean 0.58
step 200 solve_rate 0.72 health_mean 0.84
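the metrics-jsonl shape above is easy to emit from any training loop. a minimal sketch, with field names taken from the pitch and a throwaway file standing in for the real `runs/<name>.metrics.jsonl`:

```python
import json, os, tempfile

def log_step(path, step, rewards, healths):
    """append one training-step record with reward_mean / solve_rate /
    health_mean, the per-step fields named in the pitch (shape assumed here)."""
    record = {
        "step": step,
        "reward_mean": sum(rewards) / len(rewards),
        "solve_rate": sum(r >= 1.0 for r in rewards) / len(rewards),  # binary reward => fraction solved
        "health_mean": sum(healths) / len(healths),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# one step with 4 rollouts, 1 of which solved its scenario
path = os.path.join(tempfile.mkdtemp(), "demo.metrics.jsonl")
rec = log_step(path, 50, rewards=[1.0, 0.0, 0.0, 0.0], healths=[0.4, 0.3, 0.2, 0.3])
```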
the 45 second live demo
make gold # proves env is deterministically solvable for all 6 scenarios
make bench # 2.4 ms p50 reset latency
make eval # leaderboard: gold vs random vs bad across all 6 scenarios
make reward-demo # gpu-free reward curve png, proves reward improvement
make dry # rollout driver smoke test, no gpu
make train-remote ENV_URLS=https://<user>-enterprise-hpc-openenv.hf.space
the recovery the trained agent ends up executing:
sinfo # compute-01 drain
squeue # cfd_simulation PD
ssh compute-01
cat /etc/sysconfig/network-scripts/route-eth0 # garbage
printf 'ADDRESS0=10.10.0.0\nNETMASK0=255.255.0.0\nGATEWAY0=10.10.1.1\nDEVICE0=eth0\n' > /etc/sysconfig/network-scripts/route-eth0
chmod 0400 /etc/munge/munge.key # hpc_munge only
systemctl restart munge
systemctl restart slurmd
exit
curl -I http://localhost:8080/ # 200 OK
q&a prep
- why qwen2.5-coder-7b: it is a code-tuned, apache-2.0-licensed 7b instruct model, fits on a kaggle a100 in 4-bit qlora, and produces well-formed shell commands out of the box, which keeps grpo rollouts from wasting steps on format discovery. the training script still accepts `--model` so judges can drop in any other text llm.
- why binary reward: grpo computes advantages by comparing completions in a group. binary signals keep the comparison clean and prevent the agent from reward hacking against partial credit.
- why bwrap not docker: bwrap is unprivileged, namespaces are cheap, tmpfs-backed overlay resets under 3 ms. docker daemons cost hundreds of milliseconds and block staggered resets.
- why a fake slurm: real slurmctld + slurmd + munge + dbd blows through the memory budget per rollout and introduces async noise that destabilizes grpo. a deterministic json state machine gives us the same agent-facing cli surface without the failure modes.
- how does this generalize: the scenario registry is pluggable. six scenarios ship today spanning slurm, munge, systemd, nvidia driver, nfs, and apache httpd. more faults (slurm partition misconfig, nvme fabric down, cgroup exhaustion, ldap outage) drop in as a `prepare_filesystem` + `grade` pair.
- is it really solvable: run `make gold`. the deterministic gold-trajectory verifier asserts every scenario reaches reward 1.0 via the known-good fix sequence.
- hf spaces deploy: see `docs/hf_spaces_deploy.md`. the openenv server shape is unchanged, and the dockerfile copies everything including training + eval helpers.
- can i train on hf directly: yes, via `training/hf_jobs.py` or by deploying a gpu-enabled space. see `docs/hf_jobs.md`.
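the binary-reward answer above can be made concrete: grpo scores each completion against the mean and spread of its own rollout group, so with 0/1 rewards the advantage sign is exactly "solved vs did not solve". a minimal sketch of group-relative advantages (an illustration, not trl's implementation):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """grpo-style group-relative advantages: normalize each completion's
    reward by its rollout group's mean and std. with binary rewards, solvers
    get positive advantage and non-solvers negative - nothing in between."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# a group of 4 rollouts where exactly one restored the cluster
adv = group_advantages([1.0, 0.0, 0.0, 0.0])
```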