# deploying EnterpriseHPC-v0 to hugging face spaces
this guide walks through hosting the openenv server on a hugging face
space so a remote agent can hit the environment over http. the space uses
the existing `Dockerfile` at the repo root.
## prerequisites
- a hugging face account
- the hub cli installed locally: `pip install -U huggingface_hub` (the
`hf` command ships with recent releases)
- a session opened with `hf auth login`, using a token that has write
access to spaces
## 1 create the space
```
huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
```
alternative: create it manually at
https://huggingface.co/new-space with sdk set to docker and
visibility public.
## 2 push the repo
```
git remote add space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv
git push space main
```
the space will pick up `Dockerfile` automatically. the build takes a
few minutes because `pip install .` pulls the full dependency tree on
python 3.13. you do not need `app.py`; the `CMD` at the bottom of the
Dockerfile starts the openenv server on `:8000`.
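since the build takes a few minutes, it can help to poll the space until it settles instead of refreshing the page. the sketch below is one way to do that, not part of this repo: `wait_for_space` and `hub_stage_getter` are hypothetical helper names, and the set of terminal stages is an assumption about which stage strings the hub reports. `HfApi.get_space_runtime` is the `huggingface_hub` call that returns the current stage.

```python
import time
from typing import Callable

# assumption: stage strings that mean "stop waiting" (success or failure)
TERMINAL_STAGES = {
    "RUNNING", "BUILD_ERROR", "CONFIG_ERROR",
    "RUNTIME_ERROR", "PAUSED", "STOPPED",
}


def wait_for_space(get_stage: Callable[[], str],
                   timeout_s: float = 900.0,
                   poll_s: float = 10.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """poll get_stage() until a terminal stage shows up or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        stage = get_stage()
        if stage in TERMINAL_STAGES:
            return stage
        if time.monotonic() >= deadline:
            raise TimeoutError(f"space still in stage {stage!r} after {timeout_s}s")
        sleep(poll_s)


def hub_stage_getter(repo_id: str) -> Callable[[], str]:
    """build a stage getter backed by huggingface_hub (imported lazily)."""
    from huggingface_hub import HfApi

    api = HfApi()
    return lambda: api.get_space_runtime(repo_id).stage
```

usage would look like `wait_for_space(hub_stage_getter("<your-user>/enterprise-hpc-openenv"))`; anything other than `RUNNING` means the build logs on the space page are worth a look.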
### 2.1 redeploying a dirty / history-heavy repo (orphan-branch trick)
hugging face xet rejects pushes whose git history contains binary
blobs that were never tracked via lfs / xet (old `.venv/` artifacts,
`docs/assets/*.png`, etc). if `git push space final-round:main` fails
with:
```
! [remote rejected] final-round -> main (pre-receive hook declined)
Your push was rejected because it contains binary files.
```
the fix is to force-push a clean history-less orphan branch:
```bash
# 1 make sure you're logged in with a write token
hf auth login
# 2 remote should point at the space's git endpoint
git remote set-url space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv
# 3 carve out a fresh orphan branch with zero history
git checkout --orphan space-deploy
git rm -rf --cached .
# keep source + docs, drop any png/binary that would blow up xet again
rm -f docs/assets/reward_curve_demo.png
# 4 stage everything left in the working tree (minus .gitignore matches) and commit
git add -A
git commit -m "deploy: clean snapshot for hf space"
# 5 force-push the orphan to the space's main branch
git push space space-deploy:main --force
# 6 restore your working branch and nuke the temp branch
git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png
```
after the force push the space rebuilds from a one-commit history and
the binary-rejection disappears. you still develop on `final-round`
normally; only the space's `main` is rewritten.
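before committing the orphan branch it is worth checking which staged files would count as binary. the sketch below approximates that with git's own heuristic (a NUL byte within the first 8000 bytes); whether the hub's pre-receive hook uses the same rule is an assumption, and `looks_binary`, `binary_paths`, and `staged_paths` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def looks_binary(data: bytes, sniff: int = 8000) -> bool:
    """git-style heuristic: a NUL byte in the first `sniff` bytes means binary."""
    return b"\0" in data[:sniff]


def binary_paths(paths, root="."):
    """return the subset of `paths` (relative to `root`) whose contents look binary."""
    hits = []
    for rel in paths:
        candidate = Path(root) / rel
        if candidate.is_file():
            with candidate.open("rb") as fh:
                if looks_binary(fh.read(8000)):
                    hits.append(rel)
    return hits


def staged_paths(root="."):
    """list the paths currently in the git index, i.e. what the next commit contains."""
    out = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=root, check=True, capture_output=True,
    ).stdout
    return [p for p in out.decode().split("\0") if p]
```

running `print(binary_paths(staged_paths()))` between `git add -A` and `git commit` lists the files to `git rm --cached` before the force-push.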
> **live url**: https://huggingmenfordays-enterprise-hpc-openenv.hf.space
> (`huggingmenfordays/enterprise-hpc-openenv`)
## 3 expose the port correctly
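spaces route external traffic to port `7860` by default, but the
Dockerfile `CMD` starts the server on `:8000` (see step 2). either set
`app_port: 8000` in the yaml front matter of the space's `README.md`,
or make the server listen on `7860`. for the second option, set a
space-level variable:
```
PORT=7860
```
and have the Dockerfile `CMD` read `$PORT`. the simplest way is to
change the last line of the Dockerfile to: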
```
CMD ["sh", "-c", "server --host 0.0.0.0 --port ${PORT:-7860}"]
```
## 4 user namespaces on spaces
spaces kernel policy can change over time. if `bwrap` starts failing
with `Creating new namespace failed: Operation not permitted`, set the
runtime to auto (default) and keep `proot` installed in the image.
`Sandbox` now probes `bwrap` at startup and automatically falls back to
`proot` when namespace creation is denied.
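the probe-then-fall-back behaviour described above can be pictured roughly as follows. this is a sketch under assumptions, not the actual `Sandbox` code: `choose_sandbox_backend` and the exact probe command are hypothetical, although `--unshare-user` and `--ro-bind` are real `bwrap` flags.

```python
import shutil
import subprocess

# assumption: a cheap probe that only succeeds if user namespaces are allowed
PROBE_CMD = ["bwrap", "--unshare-user", "--ro-bind", "/", "/", "true"]


def choose_sandbox_backend(run=subprocess.run, which=shutil.which) -> str:
    """return "bwrap" if the probe succeeds, else "proot" if it is installed."""
    if which("bwrap"):
        try:
            result = run(PROBE_CMD, capture_output=True, timeout=5)
            if result.returncode == 0:
                return "bwrap"
        except (OSError, subprocess.TimeoutExpired):
            pass  # treat a crashed or hung probe like a denied namespace
    if which("proot"):
        return "proot"
    raise RuntimeError("neither bwrap nor proot is usable in this container")
```

`run` and `which` are injectable so the decision logic can be unit-tested without a real sandbox binary.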
filesystem layering still follows the same chain in `OverlayFSManager`:
kernel overlay first, `fuse-overlayfs` second, copy fallback last.
expect copy fallback on spaces, which still benches within the reset
latency budget for this environment.
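the ordering above (kernel overlay, then `fuse-overlayfs`, then copy) is an ordered-fallback pattern. a minimal sketch of that pattern follows; it does not reproduce `OverlayFSManager`, and `LayerUnavailable` and `first_working` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple


class LayerUnavailable(Exception):
    """raised by a strategy that cannot run in the current container."""


def first_working(strategies: Sequence[Tuple[str, Callable[[], object]]]):
    """try each (name, mount) pair in order and return the first that succeeds."""
    errors = []
    for name, mount in strategies:
        try:
            return name, mount()
        except LayerUnavailable as exc:
            errors.append(f"{name}: {exc}")  # remember why it failed, keep going
    raise RuntimeError("no layering strategy worked: " + "; ".join(errors))
```

on a space the first two entries would be expected to raise, leaving the copy strategy as the one that returns.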
## 5 smoke test from your laptop
the minimal openenv client lives in `client.py`. hit the space with:
```
python - <<'PY'
from client import ClientError, SysadminEnvClient
c = SysadminEnvClient("https://<your-user>-enterprise-hpc-openenv.hf.space")
ep = c.start_episode(task_id="hpc_outage")
print("episode", ep.episode_id, "max_steps", ep.max_steps)
out = c.run_command(ep.episode_id, "sinfo")
print(out.stdout)
PY
```
the `sinfo` output should include `compute-01 drain IB fabric fault`.
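the client url above follows the pattern `https://<user>-<space-name>.hf.space`. a small helper can derive it from the repo id; `space_url` is a hypothetical name, and the normalisation of characters other than letters, digits, and hyphens is an assumption about how the hub builds subdomains.

```python
import re


def space_url(repo_id: str) -> str:
    """turn '<user>/<space-name>' into the space's direct https url."""
    user, _, name = repo_id.partition("/")
    if not user or not name:
        raise ValueError("expected '<user>/<space-name>'")
    # assumption: lowercase, with runs of other characters collapsed to '-'
    subdomain = re.sub(r"[^a-z0-9]+", "-", f"{user}-{name}".lower()).strip("-")
    return f"https://{subdomain}.hf.space"
```

for example, `space_url("huggingmenfordays/enterprise-hpc-openenv")` gives the live url listed earlier in this guide.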
## 6 point the gym wrapper at the space
the `EnterpriseHPCEnv` gym wrapper talks to the sandbox via local
pexpect, not over http. for a spaces deployment, clients should use
the openenv rest api exposed by `server/` via `SysadminEnvClient`.
treat the space as the environment provider and run the training
loop anywhere with network access.
`training/remote_env.py` provides `HttpEnterpriseHPCEnv`, the thin
`RemoteEnterpriseHPCEnv` that forwards `reset` and `step` calls to
the http api and pools multiple spaces via `RemoteEndpointPool` for
parallel rollouts. as of apr 23 2026 the server supports **per-episode
sessions** keyed on `episode_id`, so multiple concurrent rollouts
against a single space no longer clobber each other's state. the
client forwards the `episode_id` it received from `/reset` on every
subsequent `/step`. observations now carry `grader_health`,
`grader_details`, and `ood_http_code`, so the rollout driver can
compute `progress_reward` without running the grader a second time.
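to make the per-episode session idea concrete, here is a minimal sketch of state keyed on `episode_id`. it is an illustration only, not the server's implementation: `Session` and `SessionStore` are hypothetical names, and the real server tracks far more than a command history.

```python
import threading
import uuid
from dataclasses import dataclass, field


@dataclass
class Session:
    task_id: str
    steps: int = 0
    history: list = field(default_factory=list)


class SessionStore:
    """keeps one Session per episode_id so concurrent rollouts stay isolated."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def reset(self, task_id: str) -> str:
        """start a new episode and return the id the client must send back."""
        episode_id = uuid.uuid4().hex
        with self._lock:
            self._sessions[episode_id] = Session(task_id)
        return episode_id

    def step(self, episode_id: str, command: str) -> int:
        """record a command against one episode and return its step count."""
        with self._lock:
            session = self._sessions.get(episode_id)
            if session is None:
                raise KeyError(f"unknown episode_id {episode_id!r}")
            session.steps += 1
            session.history.append(command)
            return session.steps
```

two rollouts that interleave their `step` calls each see only their own step count, which is the property the `episode_id` forwarding is meant to guarantee.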
## 7 troubleshooting
- space fails to build on fuse-overlayfs apt install: remove the
`fuse-overlayfs` line from the Dockerfile. the env will still work
via kernel overlay or copy fallback
- pexpect errors about pty devices: the gym wrapper is only exercised
inside the openenv container, so the space itself rarely triggers
this. it shows up when running `hpc_gym.main()` directly and signals
that the container was not allocated enough pty slots
## 8 what a winning submission looks like
- openenv server running on a space with a public url
- mini blog on hf with the architecture diagram and reward curve,
linking to `docs/hf_blog.md` as the source
- colab notebook link that reproduces a training run in under an hour
- video under two minutes on youtube or linkedin with the script from
`docs/video_script.md`
- pitch doc `docs/pitch.md` as the presentation backbone