Spaces:

ccyloopss
/

HPCOpenenv

Paused

App Files Files Community

HPCOpenenv / docs /video_script.md

huggingmenfordays

deploy: ccyloopss/HPCOpenenv — with OPENENV_API_KEY auth guard

bc35a94 18 days ago

preview code

raw

history blame contribute delete

2.27 kB

2 minute video script: EnterpriseHPC-v0

target length 110 seconds. shots labeled A through F. copy the voice over into a teleprompter, screen record with asciinema while narrating.

shot A, 0:00–0:10, title card

"can a language model run an hpc cluster? we built EnterpriseHPC-v0 to find out."

screen: repo readme header with the architecture diagram.

shot B, 0:10–0:30, the incident

"open ondemand returns five oh two. the compute partition is drained. a cfd job is stuck in pending auth fail. this is a real enterprise sre incident and we reproduce every signal of it inside a single unprivileged sandbox."

screen: split terminal showing sinfo drain, squeue pending, curl -I http://localhost:8080 returning 502 Bad Gateway.

shot C, 0:30–0:55, architecture in one sentence

"no docker, no virtual machines. just bubblewrap with fuse overlayfs on tmpfs for two millisecond resets, nested bwrap for ssh lateral movement, and a mock slurm state machine that the stubbed binaries read under fcntl locks."

screen: left pane python -m bench.bench_reset -n 100, highlight p50 2.40 ms. right pane tree nodes/ showing login and compute-01.

shot D, 0:55–1:25, the agent loop

"qwen two point five coder seven b instruct, trained with trl grpo on a single gpu. the reward is binary. the grader reads explicit filesystem state. no reward hacking. watch the trained agent take the remediation path end to end."

screen: speed ramp the following commands, one per prompt switch: sinfo, ssh compute-01, cat route-eth0, printf default via 10.0.0.1 ... > route-eth0, systemctl restart slurmd, exit, curl -I http://localhost:8080 flipping to 200 OK.

shot E, 1:25–1:45, reward curve

"solve rate climbs from zero to seventy percent across a hundred grpo steps on three scenarios, hpc outage, hpc munge, and hpc pid stale. the agent does not just memorize, it routes between fault modes."

screen: tensorboard reward curve from runs/hpc_grpo with solve_rate overlaid.

shot F, 1:45–1:55, call to action

"spec, code, blog, space, colab. links in the description. go break something and teach a model to fix it."

screen: endcard with repo url, hf space url, colab url, blog url.