Spaces:
Paused
Paused
| # deploying EnterpriseHPC-v0 to hugging face spaces | |
| this guide walks through hosting the openenv server on a hugging face | |
| space so a remote agent can hit the environment over http. the space uses | |
| the existing `Dockerfile` at the repo root. | |
| ## prerequisites | |
| - a hugging face account | |
| - the hub cli installed locally: `pip install huggingface_hub` | |
| - `hf auth login` with a token that has write access to spaces | |
| ## 1 create the space | |
| ``` | |
| huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker | |
| ``` | |
| alternative: create it manually at | |
| https://huggingface.co/new-space with sdk set to docker and | |
| visibility public. | |
| ## 2 push the repo | |
| ``` | |
| git remote add space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv | |
| git push space main | |
| ``` | |
| the space will pick up `Dockerfile` automatically. the build takes a | |
| few minutes because `pip install .` pulls the full dependency tree on | |
| python 3.13. you do not need `app.py`; the `CMD` at the bottom of the | |
| Dockerfile starts the openenv server on `:8000`. | |
| ### 2.1 redeploying a dirty / history-heavy repo (orphan-branch trick) | |
| hugging face xet rejects pushes whose git history contains binary | |
| blobs that were never tracked via lfs / xet (old `.venv/` artifacts, | |
| `docs/assets/*.png`, etc). if `git push space final-round:main` fails | |
| with: | |
| ``` | |
| ! [remote rejected] final-round -> main (pre-receive hook declined) | |
| Your push was rejected because it contains binary files. | |
| ``` | |
| the fix is to force-push a clean history-less orphan branch: | |
| ```bash | |
| # 1 make sure you're logged in with a write token | |
| hf auth login | |
| # 2 remote should point at the space's git endpoint | |
| git remote set-url space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv | |
| # 3 carve out a fresh orphan branch with zero history | |
| git checkout --orphan space-deploy | |
| git rm -rf --cached . | |
| # keep source + docs, drop any png/binary that would blow up xet again | |
| rm -f docs/assets/reward_curve_demo.png | |
| # 4 stage everything still tracked and commit | |
| git add -A | |
| git commit -m "deploy: clean snapshot for hf space" | |
| # 5 force-push the orphan to the space's main branch | |
| git push space space-deploy:main --force | |
| # 6 restore your working branch and nuke the temp branch | |
| git checkout final-round | |
| git branch -D space-deploy | |
| git checkout HEAD -- docs/assets/reward_curve_demo.png | |
| ``` | |
| after the force push the space rebuilds from a one-commit history and | |
| the binary-rejection disappears. you still develop on `final-round` | |
| normally; only the space's `main` is rewritten. | |
| > **live url**: https://huggingmenfordays-enterprise-hpc-openenv.hf.space | |
| > (`huggingmenfordays/enterprise-hpc-openenv`) | |
| ## 3 expose the port correctly | |
| spaces proxy everything to `:7860` by default. override with a space | |
| level secret or env var: | |
| ``` | |
| PORT=7860 | |
| ``` | |
| and adjust the Dockerfile `CMD` to read `$PORT` or override with a | |
| space setting. or simpler, change the last line of the Dockerfile to: | |
| ``` | |
| CMD ["sh", "-c", "server --host 0.0.0.0 --port ${PORT:-7860}"] | |
| ``` | |
| ## 4 user namespaces on spaces | |
| spaces kernel policy can change over time. if `bwrap` starts failing | |
| with `Creating new namespace failed: Operation not permitted`, set the | |
| runtime to auto (default) and keep `proot` installed in the image. | |
| `Sandbox` now probes `bwrap` at startup and automatically falls back to | |
| `proot` when namespace creation is denied. | |
| filesystem layering still follows the same chain in `OverlayFSManager`: | |
| kernel overlay first, `fuse-overlayfs` second, copy fallback last. | |
| expect copy fallback on spaces, which still benches within the reset | |
| latency budget for this environment. | |
| ## 5 smoke test from your laptop | |
| the minimal openenv client lives in `client.py`. hit the space with: | |
| ``` | |
| python - <<'PY' | |
| from client import ClientError, SysadminEnvClient | |
| c = SysadminEnvClient("https://<your-user>-enterprise-hpc-openenv.hf.space") | |
| ep = c.start_episode(task_id="hpc_outage") | |
| print("episode", ep.episode_id, "max_steps", ep.max_steps) | |
| out = c.run_command(ep.episode_id, "sinfo") | |
| print(out.stdout) | |
| PY | |
| ``` | |
| expected first response includes `compute-01 drain IB fabric fault`. | |
| ## 6 point the gym wrapper at the space | |
| the `EnterpriseHPCEnv` gym wrapper talks to the sandbox via local | |
| pexpect, not over http. for a spaces deployment, clients should use | |
| the openenv rest api exposed by `server/` via `SysadminEnvClient`. | |
| treat the space as the environment provider and run the training | |
| loop anywhere with network access. | |
| `training/remote_env.py` (`HttpEnterpriseHPCEnv`) is the thin | |
| `RemoteEnterpriseHPCEnv` that forwards `reset` and `step` calls to | |
| the http api, and pools multiple spaces via `RemoteEndpointPool` for | |
| parallel rollouts. as of apr 23 2026 the server supports **per-episode | |
| sessions** keyed on `episode_id`, so multiple concurrent rollouts | |
| against a single space no longer clobber each other's state — the | |
| client forwards the `episode_id` it received from `/reset` on every | |
| subsequent `/step`, and observations now carry `grader_health`, | |
| `grader_details`, and `ood_http_code` so the rollout driver can | |
| compute `progress_reward` without running the grader a second time. | |
| ## 7 troubleshooting | |
| - space fails to build on fuse-overlayfs apt install: remove the | |
| `fuse-overlayfs` line from the Dockerfile. the env will still work | |
| via kernel overlay or copy fallback | |
| - pexpect errors about pty devices: the gym wrapper is only exercised | |
| inside the openenv container so this is usually not triggered from | |
| the space itself. it shows up when running `hpc_gym.main()` directly | |
| and is a signal the container was not allocated enough pty slots | |
| ## 8 what a winning submission looks like | |
| - openenv server running on a space with a public url | |
| - mini blog on hf with the architecture diagram and reward curve, | |
| linking to `docs/hf_blog.md` as the source | |
| - colab notebook link that reproduces a training run in under an hour | |
| - video under two minutes on youtube or linkedin with the script from | |
| `docs/video_script.md` | |
| - pitch doc `docs/pitch.md` as the presentation backbone | |