# getting started – EnterpriseHPC-v0

end-to-end setup guide. covers a fresh linux machine, colab, and hugging
face spaces. pick the path that matches your situation.

## tl;dr fastest possible path

```bash
git clone https://github.com/<your-user>/low-taper-fade-openenv-scaler.git
cd low-taper-fade-openenv-scaler
python3.13 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
make gold        # deterministic proof all 6 scenarios are solvable
make bench       # reset-latency benchmark (<3 ms p50 in copy mode)
make eval        # gold vs random vs bad policies, writes runs/eval/leaderboard.md
make reward-demo # gpu-free reward-curve png, proves reward improvement
make dry         # training rollout smoke test, no gpu required
```

if everything passes, skip to [training paths](#training-paths).
## 1 prerequisites

### system packages (linux)

these are only required for the local sandbox. colab and hf jobs handle
them automatically.

```bash
sudo apt update
sudo apt install -y bubblewrap fuse-overlayfs fuse3 tini coreutils
bwrap --version          # >= 0.6 recommended
fuse-overlayfs --version # optional, the copy fallback also works
```

- `bubblewrap` (the `bwrap` binary) provides the user-namespace sandbox
- `fuse-overlayfs` gives you sub-1 ms resets. missing it is fine; we fall
  back to a shutil-copy path that still hits ~2.4 ms p50
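if you want a programmatic version of the same check (a minimal sketch, not the repo's actual detection logic), `shutil.which` is enough to see which sandbox helpers are on the PATH:

```python
import shutil

def sandbox_capabilities() -> dict:
    """Report which sandbox helpers are on PATH.

    bwrap is required for the namespace sandbox; fuse-overlayfs is
    optional (without it, resets fall back to the shutil-copy path).
    """
    return {
        "bwrap": shutil.which("bwrap") is not None,
        "fuse-overlayfs": shutil.which("fuse-overlayfs") is not None,
    }

if __name__ == "__main__":
    for name, found in sandbox_capabilities().items():
        print(f"{name}: {'ok' if found else 'missing'}")
```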
### python

- python `>=3.12` is required. python `3.13` is the current unsloth
  default (per their install docs) and the one used in `Dockerfile` +
  `server/Dockerfile`
- `pip install -e '.[dev]'` installs the package in dev mode plus all
  runtime deps (fastapi, uvicorn, gymnasium, pexpect, httpx,
  matplotlib, numpy, etc.) and pytest
- `pip install -e '.[train]'` adds the gpu-training deps (torch,
  transformers, trl, accelerate, peft, bitsandbytes, tensorboard,
  datasets). only needed on the training host
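to confirm the `[dev]` extras actually resolved in your venv (a quick sketch; the package names are simply the ones listed above), probe each with `importlib.util.find_spec`:

```python
from importlib.util import find_spec

# runtime deps pulled in by `pip install -e '.[dev]'` (list from this guide)
DEV_DEPS = ["fastapi", "uvicorn", "gymnasium", "pexpect", "httpx",
            "matplotlib", "numpy", "pytest"]

def missing_deps(names) -> list[str]:
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_deps(DEV_DEPS)
    print("all deps present" if not missing else f"missing: {missing}")
```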
## 2 sanity checks (no gpu, 15 seconds)

run these in order. any failure means the environment is misconfigured.

```bash
# proves every scenario is deterministically solvable
python -m tools.verify_gold_trajectory -v

# measures reset latency – should be under 10 ms
python -m bench.bench_reset -n 100

# runs gold/random/bad policies against every scenario,
# writes runs/eval/leaderboard.md
python -m eval.eval_suite --trials 2
```
## 3 run the openenv server locally

```bash
make serve # runs the server console script on 0.0.0.0:8000
# or equivalently (after pip install -e .)
server --host 0.0.0.0 --port 8000
```

smoke test in another terminal:

```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/tasks
curl -X POST http://127.0.0.1:8000/reset -H 'content-type: application/json' \
  -d '{"task_id": "hpc_outage"}'
curl -X POST http://127.0.0.1:8000/step -H 'content-type: application/json' \
  -d '{"action": {"command": "sinfo"}}'
```
## 4 deploy to hugging face spaces (for remote training)

this is required if you want to train via `--env-urls https://...`. the
reference deployment lives at
[`huggingmenfordays/enterprise-hpc-openenv`](https://huggingface.co/spaces/huggingmenfordays/enterprise-hpc-openenv)
(public url: `https://huggingmenfordays-enterprise-hpc-openenv.hf.space`).

### first-time push

1. create a new space on huggingface.co – type `Docker`, any hardware tier
2. push this repo to the space:

   ```bash
   hf auth login # once
   huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
   git remote add space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
   git push space main
   ```

3. wait for the build. the space should expose your env at
   `https://<user>-enterprise-hpc-openenv.hf.space`
4. smoke test:

   ```bash
   curl https://<user>-enterprise-hpc-openenv.hf.space/health
   ```
### redeploying updates (orphan-branch trick)

this repo has `.venv/` and `docs/assets/*.png` binaries sitting in git
history that hf xet refuses to accept. a plain
`git push space final-round:main` will be rejected with
`pre-receive hook declined`. force-push a clean orphan snapshot instead:

```bash
hf auth login # ensure token is live
git remote set-url space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
git checkout --orphan space-deploy
git rm -rf --cached .
rm -f docs/assets/reward_curve_demo.png # drop binaries hf xet trips on
git add -A
git commit -m "deploy: clean snapshot for hf space"
git push space space-deploy:main --force
git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png # restore the png locally
```

your local `final-round` history stays intact; only the space's `main`
is rewritten. the build takes 5-10 min; hit `/health` to confirm it
came up green.

full guide: [`docs/hf_spaces_deploy.md`](./docs/hf_spaces_deploy.md)
## 5 training paths

### path A – local gpu (colab / single workstation)

```bash
python -m training.train_hpc_outage \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --scenarios hpc_outage,hpc_munge,hpc_pid_stale,hpc_gpu_ecc,hpc_nfs_stale,hpc_ood_apache \
  --group-size 4 --max-turns 12 --num-train-steps 100 \
  --output-dir ./runs/hpc_grpo_local
```

on colab, open [`training/hpc_colab.ipynb`](./training/hpc_colab.ipynb) –
it handles all the setup. the free-tier t4 works at `--group-size 2`;
l4 / a100 can push `--group-size 4+`.
### path B – remote hosted openenv (multiple spaces = throughput)

```bash
python -m training.hpc_openenv_gemma \
  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
             https://<user>-enterprise-hpc-openenv-2.hf.space \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --group-size 4 --max-turns 24 --num-train-steps 200 \
  --curriculum --save-adapter-only
```

the pool round-robins across every `--env-urls` entry for parallel
rollouts. as of apr 23 2026 the remote server supports per-episode
sessions (keyed on `episode_id`), so `group_size > 1` against a single
space no longer clobbers episode state. the default `--max-turns` is
now `24` – many scenarios need 10+ turns once format compliance and
diagnostic steps are accounted for.
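the round-robin behaviour can be pictured as a tiny pool that hands each new episode the next URL in the list and then pins the episode to that server (an illustrative sketch; the class and method names here are not the repo's actual pool API):

```python
import itertools

class EnvPool:
    """Round-robin over a fixed list of env server URLs."""

    def __init__(self, urls: list[str]):
        if not urls:
            raise ValueError("need at least one env URL")
        self._cycle = itertools.cycle(urls)
        self._assignments: dict[str, str] = {}

    def url_for(self, episode_id: str) -> str:
        """Pin each episode to one server; new episodes get the next URL."""
        if episode_id not in self._assignments:
            self._assignments[episode_id] = next(self._cycle)
        return self._assignments[episode_id]

if __name__ == "__main__":
    pool = EnvPool(["https://a.hf.space", "https://b.hf.space"])
    for ep in ["ep0", "ep1", "ep2"]:
        print(ep, "->", pool.url_for(ep))
```

keying on `episode_id` is what keeps `group_size > 1` rollouts against a single space from clobbering each other's state.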
### path C – hf jobs (fully managed, gpu-on-demand)

```bash
python -m training.hf_jobs \
  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
  --repo-url https://huggingface.co/spaces/<user>/enterprise-hpc-openenv \
  --gpu a10g-large \
  --num-train-steps 300 \
  --hub-repo <user>/hpc-grpo-runs
```

see [`docs/hf_jobs.md`](./docs/hf_jobs.md) for the full guide.
## 6 expected artifacts

every training run produces:

- `runs/<name>/<name>.metrics.jsonl` – reward curve time series
- tensorboard event files – `tensorboard --logdir ./runs`
- an optional wandb run if `--wandb-project` is set
- optional lora adapter weights in `runs/<name>/`

to plot the reward curve locally:

```bash
tensorboard --logdir ./runs
# or use the plot cell at the bottom of training/hpc_colab.ipynb
```
## 7 troubleshooting

| symptom | fix |
| --- | --- |
| `bwrap: setting up uid map: Permission denied` | enable unprivileged user namespaces: `sudo sysctl -w kernel.unprivileged_userns_clone=1` |
| `fuse-overlayfs: not found` | harmless, we fall back to copy mode. apt install it for <1 ms resets |
| `OSError: out of pty devices` | pexpect cannot allocate a PTY. rerun on a host with `/dev/ptmx` accessible (colab, hf spaces, most linux hosts) |
| `ModuleNotFoundError: gymnasium` / `pexpect` | `pip install -e .` again, or `pip install gymnasium pexpect httpx` |
| HF Space deploy: build fails on `fuse-overlayfs` install | ignore – Spaces have apparmor restrictions, the copy fallback still works |
| `huggingface_hub.run_uv` missing | upgrade: `pip install -U huggingface_hub`. otherwise `--dry-run-local` prints the shell script |
| training OOM on T4 | lower `--group-size 2 --max-new-tokens 256`, or switch to `Qwen/Qwen2.5-Coder-3B-Instruct` / `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit` |
| "no pty devices" when running training locally in a container | run on a linux host directly, or in colab |
## 8 one-line reproduction for judges

```bash
make help  # list all targets
make gold  # prove solvable
make bench # reset latency
make eval  # policy leaderboard
make dry   # training plumbing smoke test
make train # local grpo training
make train-remote ENV_URLS=https://your.hf.space # remote openenv training
```