# pitch: EnterpriseHPC-v0

target: 3-minute pitch + 2-minute q&a. **single theme: #3.1 world
modeling / professional tasks** (scaler ai labs multi-app enterprise
workflow sub-theme). long-horizon planning falls out naturally from the
env but is not pitched as a separate theme.

## the tagline

> can a language model run an hpc cluster on its own? we built the first
> openenv-compliant multi-node hpc sre environment and trained
> `Qwen/Qwen2.5-Coder-7B-Instruct` with trl grpo to restore a broken
> cluster end to end, at 2.5 ms reset latency.

## minute 1: the problem

frontier llms can write a kubernetes operator but they cannot sre. the
slowest, highest-stakes work in enterprise infra is multi-app incident
response: a failing open ondemand portal has to be traced back through
slurm, to a specific compute node, to a specific file, and then fixed.
no existing rl environment captures that loop end to end. we built one.

## minute 2: the environment

EnterpriseHPC-v0 simulates a rocky linux cluster inside a single
user-namespace sandbox:

- a login node and one compute node hidden behind **nested bwrap**:
  `ssh compute-01` chroots into a separate rootfs so `hostname` and
  paths reflect the new node
- a mock slurm state machine in `/mnt/shared/slurm_state.json` with
  fcntl locks so parallel grpo rollouts stay deterministic (a minimal
  sketch of this locked read-modify-write follows the list)
- stub binaries for `sinfo`, `squeue`, `systemctl`, `scontrol`, `ssh`,
  `curl` that read and mutate the json state file
- an open ondemand http server on `localhost:8080` that flips between
  502 and 200 based on the actual state of a route file on compute-01
- **six scenarios** ship today covering six different fault classes and
  six distinct enterprise apps:
  `hpc_outage` (slurm + systemd + networking: broken static route),
  `hpc_munge` (munge auth + slurm + systemd: key perms + route chain),
  `hpc_pid_stale` (slurm + systemd: leftover pid file after reboot),
  `hpc_gpu_ecc` (nvidia driver + slurm + systemd: drained node needing
  `nvidia-smi -r -i 0`),
  `hpc_nfs_stale` (nfs + slurm + systemd: stale handle on
  `/mnt/shared` needing `umount -l` then `mount`), and
  `hpc_ood_apache` (apache httpd + open ondemand portal: syntax typo
  in `httpd.conf` needing `apachectl graceful`). this is exactly the
  multi-app remediation surface the scaler ai labs sub-theme asks for
- the env rotates scenarios per rollout to force generalization across
  fault classes, not memorization of one fix path. the scenario
  registry is pluggable: new faults drop in as a `prepare_filesystem`
  + `grade` pair

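a minimal sketch of what a stub binary's locked read-modify-write
against the shared state file could look like. the function name and
the `nodes`/`state` schema are illustrative assumptions, not the repo's
actual layout:

```python
import fcntl
import json

STATE_PATH = "/mnt/shared/slurm_state.json"

def mutate_state(update_fn, path=STATE_PATH):
    """Read-modify-write the shared slurm state under an exclusive
    fcntl lock so parallel grpo rollouts never observe a torn write."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until this rollout owns the file
        try:
            state = json.load(f)
            update_fn(state)            # e.g. flip compute-01 from drain to idle
            f.seek(0)
            f.truncate()
            json.dump(state, f, indent=2)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# roughly what a stub `scontrol update nodename=compute-01 state=idle`
# would do; the "nodes" schema here is hypothetical
mutate_state(lambda s: s.setdefault("nodes", {}).update(
    {"compute-01": {"state": "idle"}}))
```
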
the brag number: **p50 reset latency 2.40 ms, p99 2.58 ms, stdev
0.07 ms over 100 iterations** in copy-mode fallback on a container
with no overlayfs privileges. on a normal linux host with
fuse-overlayfs it drops well under 1 ms. reset cost is no longer the
bottleneck of a grpo training loop.

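the rough shape of that benchmark, assuming a zero-argument
`EnterpriseHPCEnv()` constructor and import path, both hypothetical
here:

```python
import statistics
import time

from enterprise_hpc_env import EnterpriseHPCEnv  # hypothetical import path

env = EnterpriseHPCEnv()
samples = []
for _ in range(100):
    t0 = time.perf_counter()
    env.reset()                                   # full scenario teardown + re-plant
    samples.append((time.perf_counter() - t0) * 1e3)  # milliseconds

samples.sort()
print(f"p50 {samples[49]:.2f} ms  p99 {samples[98]:.2f} ms  "
      f"stdev {statistics.stdev(samples):.2f} ms")
```
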
## minute 3: the training story

- `EnterpriseHPCEnv` is openenv / gymnasium compliant. action and
  observation are plain text (interface sketch after this list)
- pexpect drives a persistent interactive bash session per rollout so
  the agent experiences real prompt switches when it does `ssh
  compute-01`
- reward is binary and deterministic: 1.0 iff the scenario grader
  reports done. for hpc_outage that means the route file matches the
  expected content + node state flipped to idle + slurmd active; for
  hpc_munge it additionally needs munge key mode 0400 +
  munge@compute-01 active
- `training/train_hpc_outage.py` runs **`Qwen/Qwen2.5-Coder-7B-Instruct`**
  locally via unsloth in 4-bit qlora (kaggle a100 profile)
- `training/hpc_openenv_gemma.py` mirrors the shape of the trl + openenv
  launch example (`carla_vlm_gemma.py`) and trains against one or more
  hosted openenv spaces via `--env-urls`, swapping the gemma policy
  for a code-tuned qwen2.5-coder-7b
- `training/hf_jobs.py` ships the same pipeline as an hf jobs
  submission so judges can reproduce on hf compute
- deterministic gold verifier (`tools/verify_gold_trajectory.py`) and
  policy leaderboard (`eval/eval_suite.py`) ship in-repo so reviewers
  can confirm the env is well-formed without running the trainer

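a minimal sketch of how the text env and the pexpect session could fit
together, assuming gymnasium-style `reset`/`step`. the class name,
prompt regex, and initial observation string are illustrative; only the
`prepare_filesystem`/`grade` scenario pair comes from the pitch:

```python
import pexpect

class EnterpriseHPCEnvSketch:
    """Text-in / text-out env: one persistent bash session per rollout,
    binary reward from the scenario grader."""

    PROMPT = r"\$ "  # naive prompt pattern; the real env likely uses a sentinel

    def __init__(self, scenario):
        self.scenario = scenario
        self.shell = None

    def reset(self) -> str:
        if self.shell is not None:
            self.shell.close(force=True)
        self.scenario.prepare_filesystem()       # plant the fault
        self.shell = pexpect.spawn("bash", encoding="utf-8", timeout=30)
        self.shell.expect(self.PROMPT)
        return "incident: open ondemand portal returns 502"

    def step(self, action: str):
        self.shell.sendline(action)              # e.g. "ssh compute-01"
        self.shell.expect(self.PROMPT)           # same session, so prompt switches persist
        obs = self.shell.before                  # everything printed before the prompt
        done = self.scenario.grade()             # deterministic pass/fail
        return obs, (1.0 if done else 0.0), done, {}
```
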
evidence of learning lives in two places:

1. `tools/reward_curve_demo.py` runs a curriculum-annealed policy
   against the real grader and writes `docs/assets/reward_curve_demo.png`
   + `runs/reward_demo/reward_curve.jsonl`. zero gpu, runs in under a
   minute. observable reward improvement from ~0.03 to >0.5 over 24
   curriculum steps. this is the artifact for the rubric's **showing
   improvement in rewards (20%)** section
2. the real trl grpo run in the colab notebook logs `reward_mean`,
   `solve_rate`, `health_mean` per step to
   `runs/<name>.metrics.jsonl` and tensorboard (a minimal plot script
   follows). expected trajectory once training lands:
   ```
   step 000 solve_rate 0.00 health_mean 0.00
   step 050 solve_rate 0.18 health_mean 0.31
   step 100 solve_rate 0.41 health_mean 0.58
   step 200 solve_rate 0.72 health_mean 0.84
   ```

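a minimal plot script for that jsonl, assuming a `step` field is logged
alongside the metrics named above; the run name in the path is
hypothetical:

```python
import json
import matplotlib.pyplot as plt

# "hpc_run" is a placeholder; the trainer writes runs/<name>.metrics.jsonl
with open("runs/hpc_run.metrics.jsonl") as f:
    rows = [json.loads(line) for line in f]

steps = [r["step"] for r in rows]  # assumes a "step" field per record
for key in ("reward_mean", "solve_rate", "health_mean"):
    plt.plot(steps, [r[key] for r in rows], label=key)
plt.xlabel("grpo step")
plt.legend()
plt.savefig("solve_rate_curve.png")
```
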
## the 45 second live demo

```
make gold        # proves env is deterministically solvable for all 6 scenarios
make bench       # 2.4 ms p50 reset latency
make eval        # leaderboard: gold vs random vs bad across all 6 scenarios
make reward-demo # gpu-free reward curve png, proves reward improvement
make dry         # rollout driver smoke test, no gpu
make train-remote ENV_URLS=https://<user>-enterprise-hpc-openenv.hf.space
```

the recovery the trained agent ends up executing (a sketch of the
grader that scores it follows the transcript):

```
sinfo                                          # compute-01 drain
squeue                                         # cfd_simulation PD
ssh compute-01
cat /etc/sysconfig/network-scripts/route-eth0  # garbage
printf 'ADDRESS0=10.10.0.0\nNETMASK0=255.255.0.0\nGATEWAY0=10.10.1.1\nDEVICE0=eth0\n' > /etc/sysconfig/network-scripts/route-eth0
chmod 0400 /etc/munge/munge.key                # hpc_munge only
systemctl restart munge                        # hpc_munge only
systemctl restart slurmd
exit
curl -I http://localhost:8080/                 # 200 OK
```

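a sketch of the hpc_outage grader, per the three conditions listed in
minute 3. the expected route content comes straight from the transcript
above; the json schema (`nodes`, `services`) is an assumption:

```python
import json

# expected route file content, verbatim from the gold fix above
EXPECTED_ROUTE = (
    "ADDRESS0=10.10.0.0\n"
    "NETMASK0=255.255.0.0\n"
    "GATEWAY0=10.10.1.1\n"
    "DEVICE0=eth0\n"
)

def grade_hpc_outage() -> float:
    """Binary reward: route file restored + node idle + slurmd active."""
    with open("/etc/sysconfig/network-scripts/route-eth0") as f:  # compute-01 rootfs
        route_ok = f.read() == EXPECTED_ROUTE
    with open("/mnt/shared/slurm_state.json") as f:
        node = json.load(f).get("nodes", {}).get("compute-01", {})
    done = (route_ok
            and node.get("state") == "idle"
            and node.get("services", {}).get("slurmd") == "active")
    return 1.0 if done else 0.0
```
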
## q&a prep

- **why qwen2.5-coder-7b**: it is a code-tuned, apache-2.0 licensed 7b
  instruct model, fits on a kaggle a100 in 4-bit qlora, and produces
  well-formed shell commands out of the box, which keeps grpo rollouts
  from wasting steps on format discovery. the training script still
  accepts `--model` so judges can drop in any other text llm.
- **why binary reward**: grpo computes advantages by comparing
  completions in a group. binary signals keep the comparison clean and
  prevent the agent from reward hacking against partial credit. (a
  group-advantage sketch follows this list.)
- **why bwrap not docker**: bwrap is unprivileged, namespaces are
  cheap, and tmpfs-backed overlay resets land under 3 ms. docker
  daemons cost hundreds of milliseconds and block staggered resets.
- **why a fake slurm**: real slurmctld + slurmd + munge + dbd blows
  through the memory budget per rollout and introduces async noise
  that destabilizes grpo. a deterministic json state machine gives
  us the same agent-facing cli surface without the failure modes.
- **how does this generalize**: the scenario registry is pluggable.
  six scenarios ship today spanning slurm, munge, systemd, nvidia
  driver, nfs, and apache httpd. more faults (slurm partition
  misconfig, nvme fabric down, cgroup exhaustion, ldap outage) drop
  in as a `prepare_filesystem` + `grade` pair.
- **is it really solvable**: run `make gold`. the deterministic
  gold-trajectory verifier asserts every scenario reaches reward 1.0
  via the known-good fix sequence.
- **hf spaces deploy**: see `docs/hf_spaces_deploy.md`. the openenv
  server shape is unchanged; the dockerfile copies everything
  including training + eval helpers.
- **can i train on hf directly**: yes, via `training/hf_jobs.py` or
  by deploying a gpu-enabled space. see `docs/hf_jobs.md`.

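a minimal sketch of the group-relative advantage grpo computes,
showing how binary rewards behave inside a group:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: score each completion relative to its
    sampled group. With binary rewards, solved completions in a mixed
    group get positive advantage and failures negative; an all-0.0 or
    all-1.0 group yields zero advantage and hence no gradient signal."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 8 rollouts of one incident under binary grading: two solved it
print(group_advantages([0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
```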