# deploying EnterpriseHPC-v0 to hugging face spaces

this guide walks through hosting the openenv server on a hugging face
space so a remote agent can hit the environment over http. the space uses
the existing `Dockerfile` at the repo root.

## prerequisites

- a hugging face account
- the hub cli installed locally: `pip install -U huggingface_hub` (the
  `hf` command used below ships with recent releases)
- a session logged in via `hf auth login`, using a token that has write
  access to spaces

## 1 create the space

```
huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
```

alternative: create it manually at
https://huggingface.co/new-space with sdk set to docker and
visibility public.

## 2 push the repo

```
git remote add space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv
git push space main
```

the space will pick up `Dockerfile` automatically. the build takes a
few minutes because `pip install .` pulls the full dependency tree on
python 3.13. you do not need `app.py`; the `CMD` at the bottom of the
Dockerfile starts the openenv server on `:8000`.

### 2.1 redeploying a dirty / history-heavy repo (orphan-branch trick)

hugging face xet rejects pushes whose git history contains binary
blobs that were never tracked via lfs / xet (old `.venv/` artifacts,
`docs/assets/*.png`, etc). if `git push space final-round:main` fails
with:

```
! [remote rejected] final-round -> main (pre-receive hook declined)
Your push was rejected because it contains binary files.
```

the fix is to force-push a clean history-less orphan branch:

```bash
# 1 make sure you're logged in with a write token
hf auth login

# 2 remote should point at the space's git endpoint
git remote set-url space https://huggingface.co/spaces/<your-user>/enterprise-hpc-openenv

# 3 carve out a fresh orphan branch with zero history
git checkout --orphan space-deploy
git rm -rf --cached .
# keep source + docs, drop any png/binary that would blow up xet again
rm -f docs/assets/reward_curve_demo.png

# 4 stage everything left in the working tree and commit
#   (the index is empty after step 3, so only .gitignore keeps .venv/ etc out)
git add -A
git commit -m "deploy: clean snapshot for hf space"

# 5 force-push the orphan to the space's main branch
git push space space-deploy:main --force

# 6 restore your working branch and nuke the temp branch
git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png
```

after the force push the space rebuilds from a one-commit history and
the binary-rejection disappears. you still develop on `final-round`
normally; only the space's `main` is rewritten.
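
because step 4 stages whatever is left in the working tree, it can help to list files that look binary before committing. the sketch below is an assumption-laden helper, not part of the repo: it flags any file with a NUL byte in its first 8000 bytes, the same kind of heuristic git uses to tell text from binary. the hub's pre-receive hook may apply different rules, and `SKIP_DIRS` and both function names are hypothetical.

```python
from pathlib import Path

# directories that are never staged anyway (assumed list, adjust to your repo)
SKIP_DIRS = {".git", ".venv", "__pycache__"}


def looks_binary(path: Path, sniff_bytes: int = 8000) -> bool:
    """Return True if a NUL byte appears in the first `sniff_bytes` bytes."""
    with path.open("rb") as fh:
        return b"\0" in fh.read(sniff_bytes)


def find_binary_files(root: str) -> list[str]:
    """List files under `root` that look binary, as sorted relative posix paths."""
    base = Path(root)
    hits = []
    for path in base.rglob("*"):
        rel = path.relative_to(base)
        # skip anything that lives inside an excluded directory
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        if path.is_file() and looks_binary(path):
            hits.append(rel.as_posix())
    return sorted(hits)
```

running `find_binary_files(".")` between steps 3 and 4 gives a list of candidates to delete or add to `.gitignore` before the commit.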

> **live url**: https://huggingmenfordays-enterprise-hpc-openenv.hf.space
> (`huggingmenfordays/enterprise-hpc-openenv`)
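
the live url above follows the pattern `https://<owner>-<space-name>.hf.space`. a minimal sketch for deriving it from a repo id is shown here; lowercasing and replacing `/` with `-` matches the example above, while collapsing other characters such as `_` or `.` into `-` is an assumption about how the hub normalises subdomains. `space_url` is a hypothetical helper name.

```python
import re


def space_url(repo_id: str) -> str:
    """Build the direct `*.hf.space` url for a space repo id like `owner/name`."""
    if repo_id.count("/") != 1:
        raise ValueError(f"expected 'owner/name', got {repo_id!r}")
    # assumption: every run of characters outside [a-z0-9-] becomes one hyphen
    subdomain = re.sub(r"[^a-z0-9-]+", "-", repo_id.lower())
    return f"https://{subdomain}.hf.space"
```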

## 3 expose the port correctly

spaces route external traffic to port `7860` by default, but the
Dockerfile `CMD` starts the server on `:8000`, so the two have to be
aligned. one option is a space-level variable:

```
PORT=7860
```

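and adjust the Dockerfile `CMD` to read `$PORT`. alternatively, set
`app_port: 8000` in the yaml front matter of the space's `README.md` so
the proxy targets the port the server already uses. or simpler, change
the last line of the Dockerfile to: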

```
CMD ["sh", "-c", "server --host 0.0.0.0 --port ${PORT:-7860}"]
```
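
if the server is started from a python entry point instead of the shell `CMD`, the same fallback can be expressed in code. this is a sketch under that assumption; `resolve_port` and `DEFAULT_PORT` are hypothetical names, not part of the repo.

```python
import os
from typing import Mapping, Optional

DEFAULT_PORT = 7860  # the port spaces route traffic to by default


def resolve_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Mirror `${PORT:-7860}`: fall back when PORT is unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("PORT") or str(DEFAULT_PORT)
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

like the shell form, an empty `PORT` is treated the same as an unset one.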

## 4 user namespaces on spaces

spaces kernel policy can change over time. if `bwrap` starts failing
with `Creating new namespace failed: Operation not permitted`, set the
runtime to auto (default) and keep `proot` installed in the image.
`Sandbox` now probes `bwrap` at startup and automatically falls back to
`proot` when namespace creation is denied.
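
the probe-then-fall-back behaviour can be pictured roughly as follows. this is not the `Sandbox` implementation, only a sketch of the idea: `bwrap_works` and `choose_backend` are hypothetical names, and the probe command is an assumption (a minimal `bwrap` call that still has to create a namespace).

```python
import shutil
import subprocess
from typing import Callable, Optional

# smallest bwrap invocation that still needs a new namespace (assumed probe)
BWRAP_PROBE = ["bwrap", "--ro-bind", "/", "/", "true"]


def bwrap_works(timeout: float = 10.0) -> bool:
    """Return True if bwrap is on PATH and the probe command exits with 0."""
    if shutil.which("bwrap") is None:
        return False
    try:
        done = subprocess.run(BWRAP_PROBE, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return done.returncode == 0


def choose_backend(
    probe: Callable[[], bool] = bwrap_works,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Prefer bwrap; fall back to proot when the bwrap probe fails."""
    if probe():
        return "bwrap"
    if which("proot") is not None:
        return "proot"
    raise RuntimeError("neither bwrap nor proot is usable in this container")
```

passing `probe` and `which` as parameters keeps the selection logic testable without either binary installed.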

filesystem layering still follows the same chain in `OverlayFSManager`:
kernel overlay first, `fuse-overlayfs` second, copy fallback last.
expect copy fallback on spaces, which still benches within the reset
latency budget for this environment.
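
for intuition, the copy fallback amounts to rebuilding the episode's working tree from a pristine base on every reset. the sketch below illustrates that idea only; it is not the `OverlayFSManager` code, and `reset_by_copy` and its arguments are hypothetical.

```python
import shutil
from pathlib import Path


def reset_by_copy(base: str, workdir: str) -> Path:
    """Recreate `workdir` as a fresh copy of the pristine `base` tree."""
    target = Path(workdir)
    if target.exists():
        shutil.rmtree(target)  # discard whatever the last episode changed
    shutil.copytree(base, target, symlinks=True)  # keep symlinks as symlinks
    return target
```

the cost grows with the size of the base tree, which is why overlay mounts are tried first when the kernel allows them.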

## 5 smoke test from your laptop

the minimal openenv client lives in `client.py`. hit the space with:

```
python - <<'PY'
from client import ClientError, SysadminEnvClient
c = SysadminEnvClient("https://<your-user>-enterprise-hpc-openenv.hf.space")
ep = c.start_episode(task_id="hpc_outage")
print("episode", ep.episode_id, "max_steps", ep.max_steps)
out = c.run_command(ep.episode_id, "sinfo")
print(out.stdout)
PY
```

expected first response includes `compute-01   drain   IB fabric fault`.
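
to turn that eyeball check into an assertion, the output can be scanned for drained nodes. this sketch assumes the whitespace-separated `node  state  reason` layout shown in the expected line above; `drained_nodes` is a hypothetical helper, not part of `client.py`.

```python
def drained_nodes(sinfo_stdout: str) -> list[str]:
    """Return node names whose second column starts with `drain`.

    Assumes each relevant line is laid out as: node, state, optional reason.
    """
    nodes = []
    for line in sinfo_stdout.splitlines():
        cols = line.split(None, 2)  # split into at most node, state, reason
        if len(cols) >= 2 and cols[1].lower().startswith("drain"):
            nodes.append(cols[0])
    return nodes
```

in the smoke test, `assert "compute-01" in drained_nodes(out.stdout)` would then fail loudly if the scenario did not load.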

## 6 point the gym wrapper at the space

the `EnterpriseHPCEnv` gym wrapper talks to the sandbox via local
pexpect, not over http. for a spaces deployment, clients should use
the openenv rest api exposed by `server/` via `SysadminEnvClient`.
treat the space as the environment provider and run the training
loop anywhere with network access.

`training/remote_env.py` (`HttpEnterpriseHPCEnv`) is the thin
`RemoteEnterpriseHPCEnv` that forwards `reset` and `step` calls to
the http api and pools multiple spaces via `RemoteEndpointPool` for
parallel rollouts.

as of apr 23 2026 the server supports **per-episode sessions** keyed on
`episode_id`, so multiple concurrent rollouts against a single space no
longer clobber each other's state. the client forwards the `episode_id`
it received from `/reset` on every subsequent `/step`. observations now
carry `grader_health`, `grader_details`, and `ood_http_code`, so the
rollout driver can compute `progress_reward` without running the grader
a second time.
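
the reason keyed sessions stop rollouts from clobbering each other can be shown with a small in-memory store. this is an illustrative sketch, not the server's code: `Session`, `SessionStore`, and their fields are hypothetical, and a real session would hold a sandbox handle rather than a command list.

```python
import threading
import uuid
from dataclasses import dataclass, field


@dataclass
class Session:
    task_id: str
    steps: int = 0
    history: list = field(default_factory=list)


class SessionStore:
    """Keep one independent Session per episode_id."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sessions: dict = {}

    def reset(self, task_id: str) -> str:
        """Create a new session and return its episode_id."""
        episode_id = uuid.uuid4().hex
        with self._lock:
            self._sessions[episode_id] = Session(task_id)
        return episode_id

    def step(self, episode_id: str, command: str) -> int:
        """Record a command on one session and return its step count."""
        with self._lock:
            session = self._sessions.get(episode_id)
            if session is None:
                raise KeyError(f"unknown episode_id: {episode_id}")
            session.steps += 1
            session.history.append(command)
            return session.steps
```

two rollouts that each call `reset` get distinct ids, so a `step` on one never advances the other.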

## 7 troubleshooting

- space fails to build on fuse-overlayfs apt install: remove the
  `fuse-overlayfs` line from the Dockerfile. the env will still work
  via kernel overlay or copy fallback
- pexpect errors about pty devices: the gym wrapper is only exercised
  inside the openenv container so this is usually not triggered from
  the space itself. it shows up when running `hpc_gym.main()` directly
  and is a signal the container was not allocated enough pty slots
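
a quick way to check the pty case from inside the container is to try allocating one. this is a generic stdlib probe on unix-like systems, not something the repo ships; `pty_available` is a hypothetical name.

```python
import os


def pty_available() -> bool:
    """Return True if the process can allocate a pseudo-terminal pair."""
    try:
        master_fd, slave_fd = os.openpty()
    except OSError:
        return False
    # release both ends right away; this is only a probe
    os.close(master_fd)
    os.close(slave_fd)
    return True
```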

## 8 what a winning submission looks like

- openenv server running on a space with a public url
- mini blog on hf with the architecture diagram and reward curve,
  linking to `docs/hf_blog.md` as the source
- colab notebook link that reproduces a training run in under an hour
- video under two minutes on youtube or linkedin with the script from
  `docs/video_script.md`
- pitch doc `docs/pitch.md` as the presentation backbone