HPCOpenenv / docs /hf_blog.md

huggingmenfordays

deploy: ccyloopss/HPCOpenenv — with OPENENV_API_KEY auth guard

bc35a94 24 days ago

	# teaching an llm to sre: EnterpriseHPC-v0 on openenv

	tl;dr: we shipped an openenv-compliant gymnasium environment that
	simulates a 224-core rocky linux hpc cluster inside a single
	user-namespace sandbox, resets in 2.40 ms p50 (copy mode), and trains
	Qwen/Qwen2.5-Coder-7B-Instruct with trl grpo to recover a broken cluster
	end to end. the same training script can run locally, in colab, or
	against a fleet of hf spaces via `--env-urls`.

	## why

	the slowest, highest stakes work in enterprise infra is multi-app
	incident response. an open ondemand portal returns 502. the compute
	partition is drained. there is a failing slurmd somewhere. to fix it
	you navigate login -> compute-01 over ssh, inspect route configs and
	munge keys, restart services in the right order, and verify via curl.
	frontier llms have never trained on that loop.

	EnterpriseHPC-v0 turns that loop into an rl environment.

	## what is inside

	- nested bwrap for lateral movement. `ssh compute-01` chroots the
	shell into a separate rootfs so `hostname` and filesystem paths
	reflect the new node
	- fuse-overlayfs with upperdir and workdir on `/dev/shm` for
	microsecond copy on write. kernel overlay and a copy fallback are
	supported for hosts without fuse privileges
	- a deterministic slurm state machine in
	`/mnt/shared/slurm_state.json` with fcntl locks so many parallel
	rollouts cannot corrupt each other
	- python stubs for sinfo, squeue, systemctl, scontrol, curl, ssh that
	read and mutate the json state, and a lightweight open ondemand
	http server that returns 502 until the underlying fault is fixed
	- three scenarios ship today and are rotated per rollout:
	  - `hpc_outage`: compute-01 drain from a broken route-eth0
	  - `hpc_munge`: compute-01 drain from a munge key with wrong mode and
	    a broken route (chained)
	  - `hpc_pid_stale`: slurmd refuses to restart after reboot because of a
	    leftover `/var/run/slurmd.pid`
	- the gymnasium env `EnterpriseHPC-v0` wraps it all with pexpect so
	the policy experiences real interactive bash prompts
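to make the locking bullet concrete, here is a minimal sketch of a locked read-modify-write on a json state file. it is an illustration, not the project's code: the `update_state` and `set_node` helpers and the `nodes` layout are hypothetical, the real `slurm_state.json` schema is not shown in this post, and `fcntl.flock` is an assumption (the project may use `fcntl.lockf` instead).

```python
import fcntl
import json
import os
import tempfile


def update_state(path, mutate):
    """read-modify-write a json file while holding an exclusive lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            raw = f.read()
            state = json.loads(raw) if raw else {}
            mutate(state)  # caller edits the dict in place
            f.seek(0)
            f.truncate()
            json.dump(state, f, sort_keys=True)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return state


def set_node(name, value):
    """build a mutation that sets one node's status (hypothetical layout)."""
    def mutate(state):
        state.setdefault("nodes", {})[name] = value
    return mutate


# temp file stands in for /mnt/shared/slurm_state.json
path = os.path.join(tempfile.mkdtemp(), "slurm_state.json")
update_state(path, set_node("compute-01", "drain"))
final = update_state(path, set_node("compute-01", "idle"))
print(final)
```

because the whole read, mutate, write cycle happens under one exclusive lock, two stubs racing on the same file serialize instead of overwriting each other's changes.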

	## how fast

	| mount | n | p50 ms | p95 ms | p99 ms | max ms |
	| --- | ---: | ---: | ---: | ---: | ---: |
	| copy | 100 | 2.40 | 2.56 | 2.58 | 2.87 |

	that is in the ci-friendly copy mode. real fuse-overlayfs on a linux
	host drops to well under 1 ms. at these latencies, reset is no longer
	the grpo bottleneck.
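for readers who want to reproduce numbers of this shape, here is a rough sketch of a reset-latency benchmark. it assumes nearest-rank percentiles and uses a hypothetical `fake_reset` stand-in; the real `make bench` target may time and aggregate differently.

```python
import math
import time


def percentile(sorted_ms, q):
    """nearest-rank percentile of an ascending list, q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def bench(reset, n=100):
    """time n calls to reset() and summarize latency in milliseconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        reset()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "n": n,
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": samples[-1],
    }


def fake_reset():
    # hypothetical stand-in for env.reset(); swap in the real call
    sum(range(1000))


stats = bench(fake_reset)
print(stats)
```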

	## training with qwen2.5-coder

	local training with unsloth + 4bit qlora:

	```
	python -m training.train_hpc_outage \
	  --model Qwen/Qwen2.5-Coder-7B-Instruct \
	  --group-size 4 --max-turns 12 \
	  --num-train-steps 100 \
	  --scenarios hpc_outage,hpc_munge,hpc_pid_stale
	```

	remote training against hosted openenv spaces (same shape as the
	trl + openenv launch example, swapped to a code-tuned 7b policy):

	```
	python -m training.hpc_openenv_gemma \
	  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
	             https://<user>-enterprise-hpc-openenv-2.hf.space \
	  --model Qwen/Qwen2.5-Coder-7B-Instruct \
	  --group-size 4 --max-turns 12 --num-train-steps 200
	```
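this post does not describe how the script spreads rollouts across the urls passed to `--env-urls`. one plausible scheme is plain round robin, sketched below; `assign_rollouts` and the example urls are hypothetical and not taken from the training script.

```python
from itertools import cycle, islice


def assign_rollouts(env_urls, num_rollouts):
    """pair each rollout index with an env server url, round robin."""
    return list(islice(cycle(env_urls), num_rollouts))


# placeholder urls for illustration only
urls = ["https://env-a.example", "https://env-b.example"]
plan = assign_rollouts(urls, 4)  # one group of 4 rollouts
print(plan)
```

with two servers and a group size of 4, each server handles two rollouts per group.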

	submit to hf jobs:

	```
	python -m training.hf_jobs \
	  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
	  --gpu a10g-large \
	  --num-train-steps 300
	```

	the training scripts use unsloth for 4bit qlora loading and trl
	`GRPOTrainer` with a custom rollout function that drives the env one
	turn at a time. the reward is binary and comes from the deterministic
	task grader, so it is verifiable and repeatable, which suits grpo's
	group-relative advantages.
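to show what a binary reward looks like after grpo's group normalization, here is a small sketch of the usual group-relative advantage, (reward - group mean) / (group std + eps). it is not trl's code: `group_advantages` is a hypothetical helper, and it uses the population standard deviation, a detail on which implementations differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """normalize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# group size 4, as in the commands above
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])  # one rollout fixed the cluster
flat = group_advantages([0.0, 0.0, 0.0, 0.0])   # every rollout failed
print(mixed)
print(flat)
```

in the mixed group the single success gets a positive advantage and the failures get negative ones. in the flat group every advantage is zero, so a group where all rollouts fail (or all succeed) contributes no gradient.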

	a colab notebook at `training/hpc_colab.ipynb` runs both the local
	and remote paths on a single t4 / l4 / a100.

	## what the agent learns

	before training, a random policy wanders around `sinfo` and never edits
	the route file. after ~100 steps of grpo, the agent reliably:

	1. runs `sinfo` and `squeue` to locate the drained node
	2. moves laterally with `ssh compute-01`
	3. inspects `/etc/sysconfig/network-scripts/route-eth0`
	4. writes the correct route with `printf ... >` (no heredocs allowed)
	5. for the munge variant, also runs `chmod 0400 /etc/munge/munge.key`
	6. restarts munge, then slurmd, in that order
	7. exits back to login and verifies with `curl -I http://localhost:8080`
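the steps above are what a deterministic grader would need to check. the sketch below shows one way a binary grader could score the end state. everything in it is hypothetical: the `grade` function, the state keys (`route_ok`, `munge_key_mode`, `restarts`, `portal_status`), and the exact pass conditions are assumptions for illustration, not the project's grader.

```python
def last_index(items, value):
    """index of the last occurrence of value, or -1 if absent."""
    for i in range(len(items) - 1, -1, -1):
        if items[i] == value:
            return i
    return -1


def grade(state):
    """return 1.0 if the (hypothetical) cluster state is recovered, else 0.0."""
    node = state["nodes"]["compute-01"]
    munge_at = last_index(node["restarts"], "munge")
    slurmd_at = last_index(node["restarts"], "slurmd")
    checks = [
        node["route_ok"],                 # route-eth0 was repaired
        node["munge_key_mode"] == 0o400,  # key mode matches chmod 0400
        0 <= munge_at < slurmd_at,        # munge restarted before slurmd
        state["portal_status"] == 200,    # portal no longer returns 502
    ]
    return 1.0 if all(checks) else 0.0


fixed = {
    "nodes": {"compute-01": {"route_ok": True, "munge_key_mode": 0o400,
                             "restarts": ["munge", "slurmd"]}},
    "portal_status": 200,
}
wrong_order = {
    "nodes": {"compute-01": {"route_ok": True, "munge_key_mode": 0o400,
                             "restarts": ["slurmd", "munge"]}},
    "portal_status": 200,
}
print(grade(fixed), grade(wrong_order))
```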

	## prove it is solvable

	before any training, reviewers can run:

	```
	make gold # deterministic gold-trajectory verifier
	make eval # gold vs random vs bad policies, writes runs/eval/leaderboard.md
	make bench # reset-latency benchmark
	```

	## try it

	- repo: https://github.com/your-org/low-taper-fade-openenv-scaler
	- hf space (env server): https://huggingface.co/spaces/your-org/enterprise-hpc-openenv
	- colab: `training/hpc_colab.ipynb`
	- pitch doc: `docs/pitch.md`
	- hf jobs guide: `docs/hf_jobs.md`
	- spaces deploy: `docs/hf_spaces_deploy.md`