getting started - EnterpriseHPC-v0
end-to-end setup guide. covers a fresh linux machine, colab, and hugging face spaces. pick the path that matches your situation.
tl;dr fastest possible path
git clone https://github.com/<your-user>/low-taper-fade-openenv-scaler.git
cd low-taper-fade-openenv-scaler
python3.13 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
make gold # deterministic proof all 6 scenarios are solvable
make bench # reset-latency benchmark (<3 ms p50 in copy mode)
make eval # gold vs random vs bad policies, writes runs/eval/leaderboard.md
make reward-demo # gpu-free reward-curve png, proves reward improvement
make dry # training rollout smoke test, no gpu required
if everything passes, skip to training paths.
1 prerequisites
system packages (linux)
these are only required for the local sandbox. colab and hf jobs handle them automatically.
sudo apt update
sudo apt install -y bubblewrap fuse-overlayfs fuse3 tini coreutils
bwrap --version # >= 0.6 recommended
fuse-overlayfs --version # optional, copy fallback also works
bubblewrap (the bwrap binary) provides the user-namespace sandbox. fuse-overlayfs gives you sub-1 ms resets; missing it is fine, we fall back to a shutil-copy path that still hits ~2.4 ms p50.
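the copy fallback can be pictured with a short sketch. this is illustrative only - the function name and signature are hypothetical, not the repo's actual reset code:

```python
# hypothetical sketch of the shutil-copy reset fallback: when fuse-overlayfs
# is unavailable, an episode reset just discards the scratch dir and
# re-copies a pristine template tree.
import shutil
from pathlib import Path


def reset_workdir(template: Path, workdir: Path) -> None:
    """restore workdir to the pristine template state."""
    if workdir.exists():
        shutil.rmtree(workdir)  # drop everything the episode touched
    shutil.copytree(template, workdir)  # fresh copy of the template
```

copying scales with the size of the template tree, which is why the overlayfs path is faster - it only has to drop the writable upper layer.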
python
- python >= 3.12 is required. python 3.13 is the current unsloth default (per their install docs) and the one used in Dockerfile + server/Dockerfile
- pip install -e '.[dev]' installs the package in dev mode plus all runtime deps (fastapi, uvicorn, gymnasium, pexpect, httpx, matplotlib, numpy, etc.) and pytest
- pip install -e '.[train]' adds the gpu-training deps (torch, transformers, trl, accelerate, peft, bitsandbytes, tensorboard, datasets). only needed on the training host
2 sanity checks (no gpu, 15 seconds)
run these in order. any failure means the environment is misconfigured.
# proves every scenario is deterministically solvable
python -m tools.verify_gold_trajectory -v
# measures reset latency - should be under 10 ms
python -m bench.bench_reset -n 100
# runs gold/random/bad policies against every scenario,
# writes runs/eval/leaderboard.md
python -m eval.eval_suite --trials 2
3 run the openenv server locally
make serve # runs the server console script on 0.0.0.0:8000
# or equivalently (after pip install -e .)
server --host 0.0.0.0 --port 8000
smoke test in another terminal:
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/tasks
curl -X POST http://127.0.0.1:8000/reset -H 'content-type: application/json' \
-d '{"task_id": "hpc_outage"}'
curl -X POST http://127.0.0.1:8000/step -H 'content-type: application/json' \
-d '{"action": {"command": "sinfo"}}'
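the same smoke test can be scripted in python. a minimal sketch, assuming only the request bodies shown in the curl examples above (the server's full schema may carry more fields):

```python
# hedged sketch: build the same json bodies as the curl smoke tests above.
# field names mirror the curl payloads; helper names are illustrative.
import json


def reset_payload(task_id: str) -> str:
    """body for POST /reset, e.g. task_id='hpc_outage'."""
    return json.dumps({"task_id": task_id})


def step_payload(command: str) -> str:
    """body for POST /step, e.g. command='sinfo'."""
    return json.dumps({"action": {"command": command}})
```

feed these into any http client, e.g. `httpx.post(base_url + "/step", content=step_payload("sinfo"), headers={"content-type": "application/json"})`.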
4 deploy to hugging face spaces (for remote training)
this is required if you want to train via --env-urls https://.... the
reference deployment lives at
huggingmenfordays/enterprise-hpc-openenv
(public url: https://huggingmenfordays-enterprise-hpc-openenv.hf.space).
first-time push
- create a new space on huggingface.co - type Docker, any hardware tier
- push this repo to the space:
hf auth login # once
huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
git remote add space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
git push space main
- wait for the build. the space should expose your env at https://<user>-enterprise-hpc-openenv.hf.space
- smoke test:
curl https://<user>-enterprise-hpc-openenv.hf.space/health
redeploying updates (orphan-branch trick)
this repo has .venv/ and docs/assets/*.png binaries sitting in git
history that hf xet refuses to accept. a plain
git push space final-round:main will be rejected with
pre-receive hook declined. force-push a clean orphan snapshot instead:
hf auth login # ensure token is live
git remote set-url space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
git checkout --orphan space-deploy
git rm -rf --cached .
rm -f docs/assets/reward_curve_demo.png # drop binaries hf xet trips on
git add -A
git commit -m "deploy: clean snapshot for hf space"
git push space space-deploy:main --force
git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png # restore the png locally
your local final-round history stays intact; only the space's main
is rewritten. the build takes 5-10 min; hit /health to confirm it
came up green.
full guide: docs/hf_spaces_deploy.md
5 training paths
path A - local gpu (colab / single workstation)
python -m training.train_hpc_outage \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--scenarios hpc_outage,hpc_munge,hpc_pid_stale,hpc_gpu_ecc,hpc_nfs_stale,hpc_ood_apache \
--group-size 4 --max-turns 12 --num-train-steps 100 \
--output-dir ./runs/hpc_grpo_local
on colab open training/hpc_colab.ipynb - it handles all the setup. the t4 free tier works at --group-size 2; l4 / a100 can push --group-size 4+.
path B - remote hosted openenv (multiple spaces = throughput)
python -m training.hpc_openenv_gemma \
--env-urls https://<user>-enterprise-hpc-openenv.hf.space \
https://<user>-enterprise-hpc-openenv-2.hf.space \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--group-size 4 --max-turns 24 --num-train-steps 200 \
--curriculum --save-adapter-only
the pool round-robins across every --env-urls entry for parallel
rollouts. as of apr 23 2026 the remote server supports per-episode
sessions (keyed on episode_id), so group_size > 1 against a single
space no longer clobbers episode state. the default --max-turns is
now 24 - many scenarios need 10+ turns once format compliance and
diagnostic steps are accounted for.
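the round-robin pool described above can be sketched in a few lines. class and method names here are illustrative, not the actual training code:

```python
# minimal sketch of a round-robin env-url pool: parallel rollouts pull the
# next base url in turn, spreading load across every --env-urls entry.
from itertools import cycle


class EnvPool:
    """hands out env base urls round-robin."""

    def __init__(self, env_urls: list[str]):
        self._cycle = cycle(env_urls)

    def next_url(self) -> str:
        return next(self._cycle)


pool = EnvPool([
    "https://user-env-1.hf.space",  # placeholder urls
    "https://user-env-2.hf.space",
])
urls = [pool.next_url() for _ in range(4)]  # alternates between the spaces
```

with per-episode sessions keyed on episode_id, even two rollouts routed to the same space keep separate state.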
path C - hf jobs (fully managed, gpu-on-demand)
python -m training.hf_jobs \
--env-urls https://<user>-enterprise-hpc-openenv.hf.space \
--repo-url https://huggingface.co/spaces/<user>/enterprise-hpc-openenv \
--gpu a10g-large \
--num-train-steps 300 \
--hub-repo <user>/hpc-grpo-runs
see docs/hf_jobs.md for the full guide.
6 expected artifacts
every training run produces:
- runs/<name>/<name>.metrics.jsonl - reward curve time series
- tensorboard event files - view with tensorboard --logdir ./runs
- optional wandb run if --wandb-project is set
- optional lora adapter weights in runs/<name>/
to plot the reward curve locally:
tensorboard --logdir ./runs
# or use the plot cell at the bottom of training/hpc_colab.ipynb
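the metrics file can also be read directly. a sketch assuming each line is a json object with a "reward" field (consistent with the reward-curve use above, but not a documented schema):

```python
# hedged sketch: read per-step rewards out of a metrics .jsonl file.
# assumes one json object per line with at least a "reward" key.
import json
from pathlib import Path


def load_rewards(metrics_path: Path) -> list[float]:
    rewards = []
    for line in metrics_path.read_text().splitlines():
        if line.strip():  # skip blank lines
            rewards.append(float(json.loads(line)["reward"]))
    return rewards
```

handy for a quick `matplotlib.pyplot.plot(load_rewards(path))` without spinning up tensorboard.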
7 troubleshooting
| symptom | fix |
|---|---|
| bwrap: setting up uid map: Permission denied | enable unprivileged user namespaces: sudo sysctl -w kernel.unprivileged_userns_clone=1 |
| fuse-overlayfs: not found | harmless, we fall back to copy mode. apt install it for <1 ms resets |
| OSError: out of pty devices | pexpect cannot allocate a PTY. rerun on a host with /dev/ptmx accessible (colab, hf spaces, most linux hosts) |
| ModuleNotFoundError: gymnasium / pexpect | pip install -e . again, or pip install gymnasium pexpect httpx |
| HF Space deploy: build fails on fuse-overlayfs install | ignore - Spaces have apparmor restrictions, the copy fallback still works |
| huggingface_hub.run_uv missing | upgrade: pip install -U huggingface_hub. otherwise --dry-run-local prints the shell script |
| training OOM on T4 | lower --group-size 2 --max-new-tokens 256, or switch to Qwen/Qwen2.5-Coder-3B-Instruct / unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit |
| "no pty devices" when running training locally in a container | run on a linux host directly, or in colab |
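the binary and device checks in the table can be bundled into a quick preflight. a sketch under the table's assumptions - the repo may structure its checks differently:

```python
# hedged preflight sketch: verify the pieces the sandbox needs, per the
# troubleshooting table. callables are injectable so the check is testable.
import os
import shutil


def preflight(which=shutil.which, exists=os.path.exists) -> dict[str, bool]:
    return {
        "bwrap": which("bwrap") is not None,                    # userns sandbox
        "fuse-overlayfs": which("fuse-overlayfs") is not None,  # fast resets (optional)
        "ptmx": exists("/dev/ptmx"),                            # pexpect needs a pty
    }
```

only "bwrap" and "ptmx" are hard requirements; a missing fuse-overlayfs just means copy-mode resets.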
8 one-line reproduction for judges
make help # list all targets
make gold # prove solvable
make bench # reset latency
make eval # policy leaderboard
make dry # training plumbing smoke test
make train # local grpo training
make train-remote ENV_URLS=https://your.hf.space # remote openenv training