deploying EnterpriseHPC-v0 to hugging face spaces
this guide walks through hosting the openenv server on a hugging face
space so a remote agent can hit the environment over http. the space uses
the existing Dockerfile at the repo root.
prerequisites
- a hugging face account
- the hub cli installed locally: pip install huggingface_hub
- hf auth login with a token that has write access to spaces
1 create the space
huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
alternative: create it manually at https://huggingface.co/new-space with sdk set to docker and visibility public.
2 push the repo
git remote add space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv
git push space main
the space will pick up the Dockerfile automatically. the build takes a
few minutes because pip install . pulls the full dependency tree on
python 3.13. you do not need app.py; the CMD at the bottom of the
Dockerfile starts the openenv server on :8000.
2.1 redeploying a dirty / history-heavy repo (orphan-branch trick)
hugging face xet rejects pushes whose git history contains binary
blobs that were never tracked via lfs / xet (old .venv/ artifacts,
docs/assets/*.png, etc). if git push space final-round:main fails
with:
! [remote rejected] final-round -> main (pre-receive hook declined)
Your push was rejected because it contains binary files.
the fix is to force-push a clean history-less orphan branch:
# 1 make sure you're logged in with a write token
hf auth login
# 2 remote should point at the space's git endpoint
git remote set-url space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv
# 3 carve out a fresh orphan branch with zero history
git checkout --orphan space-deploy
git rm -rf --cached .
# keep source + docs, drop any png/binary that would blow up xet again
rm -f docs/assets/reward_curve_demo.png
# 4 stage everything still tracked and commit
git add -A
git commit -m "deploy: clean snapshot for hf space"
# 5 force-push the orphan to the space's main branch
git push space space-deploy:main --force
# 6 restore your working branch and nuke the temp branch
git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png
after the force push the space rebuilds from a one-commit history and
the binary-rejection disappears. you still develop on final-round
normally; only the space's main is rewritten.
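before cutting the orphan snapshot it can help to list which tracked files would trip the binary hook again. a minimal sketch, assuming a suffix-based filter (flag_binaries and its extension list are illustrative, not hf's actual policy):

```python
# hypothetical pre-push check: flag files likely to trip the xet
# binary-rejection hook before committing the orphan snapshot.
# the suffix list is an assumption, not hf's actual filter.
from pathlib import Path

BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".so", ".whl", ".zip"}

def flag_binaries(root: str) -> list[str]:
    """Return repo-relative paths whose suffix looks binary."""
    base = Path(root)
    return sorted(
        str(p.relative_to(base))
        for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in BINARY_SUFFIXES
    )
```

anything this reports (e.g. docs/assets/*.png) is a candidate for the rm -f step above before git add -A.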
live url: https://huggingmenfordays-enterprise-hpc-openenv.hf.space (space: huggingmenfordays/enterprise-hpc-openenv)
3 expose the port correctly
spaces proxy traffic to :7860 by default, while the CMD in the
Dockerfile starts the server on :8000. either set a space-level
env var:
PORT=7860
and adjust the Dockerfile CMD to read $PORT, or, simpler, change
the last line of the Dockerfile to:
CMD ["sh", "-c", "server --host 0.0.0.0 --port ${PORT:-7860}"]
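the ${PORT:-7860} fallback in that CMD can be sketched in python to make the precedence explicit (resolve_port is illustrative, not part of the server):

```python
# mirrors the shell expansion ${PORT:-7860}: use PORT when it is set
# to a numeric value, otherwise fall back to the spaces default.
import os

def resolve_port(env=None, default: int = 7860) -> int:
    env = os.environ if env is None else env
    raw = env.get("PORT", "").strip()
    return int(raw) if raw.isdigit() else default
```

so an unset or empty PORT yields 7860, and PORT=8000 in the space settings routes the server to 8000.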
4 user namespaces on spaces
spaces kernel policy can change over time. if bwrap starts failing
with Creating new namespace failed: Operation not permitted, set the
runtime to auto (default) and keep proot installed in the image.
Sandbox now probes bwrap at startup and automatically falls back to
proot when namespace creation is denied.
filesystem layering still follows the same chain in OverlayFSManager:
kernel overlay first, fuse-overlayfs second, copy fallback last.
expect copy fallback on spaces, which still benches within the reset
latency budget for this environment.
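the chain above can be sketched as a try-in-order probe (pick_backend and the probe lambdas are stand-ins, not OverlayFSManager's actual internals):

```python
# try-in-order backend selection: first probe that succeeds wins.
# probe functions here are illustrative stand-ins for real mount checks.
def pick_backend(probes) -> str:
    """Return the first backend name whose probe succeeds."""
    for name, probe in probes:
        try:
            if probe():
                return name
        except OSError:
            continue  # e.g. mount() denied inside the space sandbox
    raise RuntimeError("no filesystem backend available")

# on spaces both mount-based probes are typically denied, so copy wins:
chain = [
    ("kernel-overlay", lambda: False),   # overlay mount -> EPERM
    ("fuse-overlayfs", lambda: False),   # /dev/fuse unavailable
    ("copy", lambda: True),              # plain copy always works
]
```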
5 smoke test from your laptop
the minimal openenv client lives in client.py. hit the space with:
python - <<'PY'
from client import ClientError, SysadminEnvClient
c = SysadminEnvClient("https://<your-user>-enterprise-hpc-openenv.hf.space")
ep = c.start_episode(task_id="hpc_outage")
print("episode", ep.episode_id, "max_steps", ep.max_steps)
out = c.run_command(ep.episode_id, "sinfo")
print(out.stdout)
PY
the sinfo output in the first response should show compute-01 in drain state with reason IB fabric fault.
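a paused or cold-started space can refuse connections for the first minute, so it helps to poll before running the smoke test. a small sketch; wait_until_ready is not part of client.py, and the probe callable is whatever cheap request you choose:

```python
# poll probe() until it returns True or the timeout elapses; swallows
# exceptions (connection refused) while the container is still booting.
import time

def wait_until_ready(probe, timeout: float = 120.0, interval: float = 1.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # space still building or waking up
        time.sleep(interval)
    return False
```

usage: wrap any cheap call against the space, e.g. wait_until_ready(lambda: c.start_episode(task_id="hpc_outage") is not None), before driving the episode.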
6 point the gym wrapper at the space
the EnterpriseHPCEnv gym wrapper talks to the sandbox via local
pexpect, not over http. for a spaces deployment, clients should use
the openenv rest api exposed by server/ via SysadminEnvClient.
treat the space as the environment provider and run the training
loop anywhere with network access.
training/remote_env.py (HttpEnterpriseHPCEnv) is the thin
RemoteEnterpriseHPCEnv that forwards reset and step calls to the
http api, and pools multiple spaces via RemoteEndpointPool for
parallel rollouts. as of apr 23 2026 the server supports
per-episode sessions keyed on episode_id, so multiple concurrent
rollouts against a single space no longer clobber each other's
state: the client forwards the episode_id it received from /reset
on every subsequent /step. observations now carry grader_health,
grader_details, and ood_http_code, so the rollout driver can
compute progress_reward without running the grader a second time.
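the pooling behaviour can be sketched as episode-sticky round-robin (EndpointPool here is an illustrative stand-in for RemoteEndpointPool, whose real api may differ):

```python
# round-robin across spaces for new episodes, but pin each episode to
# one host so every /step reaches the state created by its /reset.
from itertools import cycle

class EndpointPool:
    def __init__(self, base_urls: list[str]) -> None:
        self._ring = cycle(base_urls)
        self._by_episode: dict[str, str] = {}

    def acquire(self, episode_id: str) -> str:
        """Return the base url pinned to this episode."""
        return self._by_episode.setdefault(episode_id, next(self._ring))

pool = EndpointPool([
    "https://a-enterprise-hpc-openenv.hf.space",
    "https://b-enterprise-hpc-openenv.hf.space",
])
```

new episodes alternate hosts for parallelism, while repeated acquire calls for the same episode_id always return the same host, matching the per-episode session behaviour described above.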
7 troubleshooting
- space fails to build on the fuse-overlayfs apt install: remove the
fuse-overlayfs line from the Dockerfile. the env will still work via
kernel overlay or copy fallback.
- pexpect errors about pty devices: the gym wrapper is only exercised
inside the openenv container, so this is usually not triggered from
the space itself. it shows up when running hpc_gym.main() directly
and signals the container was not allocated enough pty slots.
8 what a winning submission looks like
- openenv server running on a space with a public url
- mini blog on hf with the architecture diagram and reward curve,
linking to docs/hf_blog.md as the source
- colab notebook link that reproduces a training run in under an hour
- video under two minutes on youtube or linkedin with the script from
docs/video_script.md
- pitch doc with docs/pitch.md as the presentation backbone