2 minute video script: EnterpriseHPC-v0
target length 110 seconds. shots are labeled A through F. copy the voice-over into a teleprompter, then record the terminal with asciinema while narrating.
shot A, 0:00-0:10, title card
"can a language model run an hpc cluster? we built EnterpriseHPC-v0 to find out."
screen: repo readme header with the architecture diagram.
shot B, 0:10-0:30, the incident
"open ondemand returns five oh two. the compute partition is drained. a cfd job is stuck in pending auth fail. this is a real enterprise sre incident and we reproduce every signal of it inside a single unprivileged sandbox."
screen: split terminal showing sinfo drain, squeue pending,
curl -I http://localhost:8080 returning 502 Bad Gateway.
shot C, 0:30-0:55, architecture in one sentence
"no docker, no virtual machines. just bubblewrap with fuse overlayfs on tmpfs for two millisecond resets, nested bwrap for ssh lateral movement, and a mock slurm state machine that the stubbed binaries read under fcntl locks."
screen: left pane python -m bench.bench_reset -n 100, highlight
p50 2.40 ms. right pane tree nodes/ showing login and compute-01.
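reference for the shot C narration: a minimal sketch of how a stubbed binary could read and update shared mock state under fcntl locks. the json file format, the state layout, and the names read_state and update_state are assumptions for illustration, not the project's actual code.

```python
import fcntl
import json


def read_state(path):
    """Read the JSON state file while holding a shared (read) lock."""
    with open(path, "r") as f:
        fcntl.lockf(f, fcntl.LOCK_SH)  # blocks while a writer holds the lock
        try:
            return json.load(f)
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


def update_state(path, mutate):
    """Apply mutate(state) in place while holding an exclusive (write) lock."""
    with open(path, "r+") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks other readers and writers
        try:
            state = json.load(f)
            mutate(state)
            f.seek(0)
            f.truncate()
            json.dump(state, f)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

in this sketch a stub such as sinfo would call read_state, and a stub that changes node state would call update_state, so two stubs racing across nested sandboxes never see a half-written file.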
shot D, 0:55-1:25, the agent loop
"qwen two point five coder seven b instruct, trained with trl grpo on a single gpu. the reward is binary. the grader reads explicit filesystem state. no reward hacking. watch the trained agent take the remediation path end to end."
screen: speed ramp the following commands, one per prompt switch:
sinfo, ssh compute-01, cat route-eth0, printf default via 10.0.0.1 ... > route-eth0, systemctl restart slurmd, exit,
curl -I http://localhost:8080 flipping to 200 OK.
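reference for the shot D narration: a minimal sketch of a binary grader that reads explicit filesystem state. the paths, the "idle" marker, and the function name grade are hypothetical; the real grader's checks may differ.

```python
from pathlib import Path


def grade(root: Path) -> float:
    """Return 1.0 only if every filesystem check passes, otherwise 0.0."""
    # Hypothetical locations inside the sandbox root.
    route = root / "nodes" / "compute-01" / "route-eth0"
    node_state = root / "slurm_state" / "compute-01"

    # Check 1: the route file exists and contains a default route line.
    has_default_route = route.is_file() and any(
        line.startswith("default via ")
        for line in route.read_text().splitlines()
    )
    # Check 2: the mock slurm state marks the node as idle again.
    node_is_idle = node_state.is_file() and node_state.read_text().strip() == "idle"

    return 1.0 if (has_default_route and node_is_idle) else 0.0
```

because the score depends only on the final state on disk, printing the right words in the transcript earns nothing; the agent has to actually change the files.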
shot E, 1:25-1:45, reward curve
"solve rate climbs from zero to seventy percent across a hundred grpo steps on three scenarios: hpc outage, hpc munge, and hpc pid stale. the agent does not just memorize. it routes between fault modes."
screen: tensorboard reward curve from runs/hpc_grpo with
solve_rate overlaid.
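reference for the shot E overlay: a minimal sketch of how a per-scenario solve rate could be computed from binary episode rewards. the (scenario, reward) tuple format and the scenario identifiers in the usage note are assumptions, not the project's logging schema.

```python
from collections import defaultdict


def solve_rate_by_scenario(episodes):
    """Map each scenario name to the fraction of its episodes with reward 1.0.

    episodes: iterable of (scenario_name, reward) pairs, reward in {0.0, 1.0}.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for scenario, reward in episodes:
        totals[scenario] += 1
        if reward == 1.0:
            solved[scenario] += 1
    return {name: solved[name] / totals[name] for name in totals}
```

calling it with pairs such as ("hpc_outage", 1.0) and ("hpc_munge", 0.0) gives one rate per fault mode, which shows on screen whether the gain comes from all three scenarios or only one.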
shot F, 1:45-1:55, call to action
"spec, code, blog, space, colab. links in the description. go break something and teach a model to fix it."
screen: endcard with repo url, hf space url, colab url, blog url.