# getting started: EnterpriseHPC-v0
end-to-end setup guide. covers a fresh linux machine, colab, and hugging
face spaces. pick the path that matches your situation.
## tl;dr fastest possible path
```bash
git clone https://github.com/<your-user>/low-taper-fade-openenv-scaler.git
cd low-taper-fade-openenv-scaler
python3.13 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
make gold # deterministic proof all 6 scenarios are solvable
make bench # reset-latency benchmark (<3 ms p50 in copy mode)
make eval # gold vs random vs bad policies, writes runs/eval/leaderboard.md
make reward-demo # gpu-free reward-curve png, proves reward improvement
make dry # training rollout smoke test, no gpu required
```
if everything passes, skip to [training paths](#5-training-paths).
## 1 prerequisites
### system packages (linux)
these are only required for the local sandbox. colab and hf jobs handle
them automatically.
```bash
sudo apt update
sudo apt install -y bubblewrap fuse-overlayfs fuse3 tini coreutils
bwrap --version # >= 0.6 recommended
fuse-overlayfs --version # optional, copy fallback also works
```
- `bubblewrap` (the `bwrap` binary) provides the user namespace sandbox
- `fuse-overlayfs` gives you sub-1 ms resets. missing it is fine, we fall
back to a shutil-copy path that still hits ~2.4 ms p50
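
a minimal preflight sketch for the two binaries above. it is not part of the repo: the function name `pick_reset_mode` and the labels `overlay` / `copy` are made up for illustration, and the real sandbox may detect its mode differently.

```python
import shutil


def pick_reset_mode(which=shutil.which):
    """Return 'overlay' or 'copy' (illustrative labels); raise if bwrap is missing."""
    if which("bwrap") is None:
        # bubblewrap is the one hard requirement for the local sandbox
        raise RuntimeError("bwrap not found: install the bubblewrap package")
    # fuse-overlayfs is optional; without it the copy fallback applies
    return "overlay" if which("fuse-overlayfs") else "copy"
```

calling `pick_reset_mode()` with no arguments checks the current `PATH`. the `which` parameter exists only so the lookup can be stubbed in tests.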
### python
- python `>=3.12` is required. python `3.13` is the current unsloth
default (per their install docs) and the one used in `Dockerfile` +
`server/Dockerfile`
- `pip install -e '.[dev]'` installs the package in dev mode plus all
runtime deps (fastapi, uvicorn, gymnasium, pexpect, httpx,
matplotlib, numpy, etc.) and pytest
- `pip install -e '.[train]'` adds the gpu-training deps (torch,
transformers, trl, accelerate, peft, bitsandbytes, tensorboard,
datasets). only needed on the training host
## 2 sanity checks (no gpu, 15 seconds)
run these in order. any failure means the environment is misconfigured.
```bash
# proves every scenario is deterministically solvable
python -m tools.verify_gold_trajectory -v
# measures reset latency; should be under 10 ms
python -m bench.bench_reset -n 100
# runs gold/random/bad policies against every scenario,
# writes runs/eval/leaderboard.md
python -m eval.eval_suite --trials 2
```
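if you would rather drive the three checks from one script, the sketch below runs them in order and stops at the first failure. the module names and flags are copied from the block above; the helper `run_checks` is hypothetical, and it assumes each module exits non-zero on failure.

```python
import subprocess
import sys

# the three commands from the sanity-check block, with the same flags
CHECKS = [
    [sys.executable, "-m", "tools.verify_gold_trajectory", "-v"],
    [sys.executable, "-m", "bench.bench_reset", "-n", "100"],
    [sys.executable, "-m", "eval.eval_suite", "--trials", "2"],
]


def run_checks(checks=CHECKS, run=subprocess.run):
    """Run each check in order; return the first failing command, or None."""
    for cmd in checks:
        # assumption: a non-zero exit code signals a failed check
        if run(cmd).returncode != 0:
            return cmd
    return None
```

`run_checks()` returns `None` when everything passed, otherwise the command list that failed, which you can print or rerun by hand.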
## 3 run the openenv server locally
```bash
make serve # runs the server console script on 0.0.0.0:8000
# or equivalently (after pip install -e .)
server --host 0.0.0.0 --port 8000
```
smoke test in another terminal:
```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/tasks
curl -X POST http://127.0.0.1:8000/reset -H 'content-type: application/json' \
-d '{"task_id": "hpc_outage"}'
curl -X POST http://127.0.0.1:8000/step -H 'content-type: application/json' \
-d '{"action": {"command": "sinfo"}}'
```
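the same smoke test can be scripted from python. this is a stdlib-only sketch that mirrors the four curl calls; the paths and payloads come from the commands above, while the helper names (`build_post`, `call`, `smoke_test`) are illustrative. the response bodies are returned as raw text because their schema is not described here.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request mirroring the curl calls above."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def call(url_or_request, timeout=10.0):
    """Send a GET (plain url) or a prepared request; return the body as text."""
    with urllib.request.urlopen(url_or_request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def smoke_test(base_url=BASE_URL):
    """Hit the same four endpoints as the curl smoke test, in the same order."""
    print(call(base_url + "/health"))
    print(call(base_url + "/tasks"))
    print(call(build_post("/reset", {"task_id": "hpc_outage"}, base_url)))
    print(call(build_post("/step", {"action": {"command": "sinfo"}}, base_url)))
```

run `smoke_test()` while the server from `make serve` is up; any connection error or http error status surfaces as an exception from `urlopen`.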
## 4 deploy to hugging face spaces (for remote training)
this is required if you want to train via `--env-urls https://...`. the
reference deployment lives at
[`huggingmenfordays/enterprise-hpc-openenv`](https://huggingface.co/spaces/huggingmenfordays/enterprise-hpc-openenv)
(public url: `https://huggingmenfordays-enterprise-hpc-openenv.hf.space`).
### first-time push
1. create a new space on huggingface.co (type `Docker`, any hardware tier)
2. push this repo to the space:
```bash
hf auth login # once
huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
git remote add space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
git push space main
```
3. wait for the build. the space should expose your env at
`https://<user>-enterprise-hpc-openenv.hf.space`
4. smoke test:
```bash
curl https://<user>-enterprise-hpc-openenv.hf.space/health
```
### redeploying updates (orphan-branch trick)
this repo has `.venv/` and `docs/assets/*.png` binaries sitting in git
history that hf xet refuses to accept. a plain
`git push space final-round:main` will be rejected with
`pre-receive hook declined`. force-push a clean orphan snapshot instead:
```bash
hf auth login # ensure token is live
git remote set-url space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
git checkout --orphan space-deploy
git rm -rf --cached .
rm -f docs/assets/reward_curve_demo.png # drop binaries hf xet trips on
git add -A
git commit -m "deploy: clean snapshot for hf space"
git push space space-deploy:main --force
git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png # restore the png locally
```
your local `final-round` history stays intact; only the space's `main`
is rewritten. the build takes 5-10 min; hit `/health` to confirm it
came up green.
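
since the build takes several minutes, a small poller saves you from re-running curl by hand. the sketch below assumes that `/health` answers with http 200 once the space is ready (the status code is an assumption, not something this guide states); the 15-minute deadline and 15-second interval are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=10.0):
    """True if GET <url> answers with HTTP 200; False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_health(url, deadline_s=900.0, interval_s=15.0,
                    probe=is_healthy, sleep=time.sleep, clock=time.monotonic):
    """Poll until probe(url) succeeds or deadline_s elapses; return True on success."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

usage: `wait_for_health("https://<user>-enterprise-hpc-openenv.hf.space/health")`, with `<user>` replaced by your own account name.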
full guide: [`docs/hf_spaces_deploy.md`](./docs/hf_spaces_deploy.md)
## 5 training paths
### path A: local gpu (colab / single workstation)
```bash
python -m training.train_hpc_outage \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--scenarios hpc_outage,hpc_munge,hpc_pid_stale,hpc_gpu_ecc,hpc_nfs_stale,hpc_ood_apache \
--group-size 4 --max-turns 12 --num-train-steps 100 \
--output-dir ./runs/hpc_grpo_local
```
on colab, open [`training/hpc_colab.ipynb`](./training/hpc_colab.ipynb);
it handles all the setup. the t4 free tier works at `--group-size 2`, while
l4 / a100 can push `--group-size 4` or higher.
### path B: remote hosted openenv (multiple spaces = throughput)
```bash
python -m training.hpc_openenv_gemma \
--env-urls https://<user>-enterprise-hpc-openenv.hf.space \
https://<user>-enterprise-hpc-openenv-2.hf.space \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--group-size 4 --max-turns 24 --num-train-steps 200 \
--curriculum --save-adapter-only
```
the pool round-robins across every `--env-urls` entry so rollouts run in
parallel. as of apr 23 2026 the remote server supports per-episode
sessions (keyed on `episode_id`), so `group_size > 1` against a single
space no longer clobbers episode state. the default `--max-turns` is
now `24`, because many scenarios need 10+ turns once format compliance and
diagnostic steps are accounted for.
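
to make the round-robin and per-episode idea concrete, here is a toy sketch. it is not the pool used by the training script: the class `EnvPool`, the uuid-based ids, and the placeholder urls are all assumptions, and how `episode_id` travels to the server is not shown.

```python
import itertools
import uuid


class EnvPool:
    """Hand out (env_url, episode_id) pairs, cycling through the urls in order."""

    def __init__(self, env_urls):
        if not env_urls:
            raise ValueError("need at least one env url")
        self._urls = itertools.cycle(env_urls)

    def next_episode(self):
        # a fresh episode_id per rollout keeps sessions separate even when
        # several rollouts of one group land on the same space
        return next(self._urls), uuid.uuid4().hex


pool = EnvPool(["https://a.example", "https://b.example"])  # placeholder urls
group = [pool.next_episode() for _ in range(4)]  # one group of size 4
```

with two urls and a group of four, each space receives two rollouts, and each rollout carries its own id. with a single url, all four rollouts share the space but still have distinct ids.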
### path C: hf jobs (fully managed, gpu-on-demand)
```bash
python -m training.hf_jobs \
--env-urls https://<user>-enterprise-hpc-openenv.hf.space \
--repo-url https://huggingface.co/spaces/<user>/enterprise-hpc-openenv \
--gpu a10g-large \
--num-train-steps 300 \
--hub-repo <user>/hpc-grpo-runs
```
see [`docs/hf_jobs.md`](./docs/hf_jobs.md) for the full guide.
## 6 expected artifacts
every training run produces:
- `runs/<name>/<name>.metrics.jsonl`: reward curve time series
- tensorboard event files: view with `tensorboard --logdir ./runs`
- optional wandb run if `--wandb-project` is set
- optional lora adapter weights in `runs/<name>/`
to plot the reward curve locally:
```bash
tensorboard --logdir ./runs
# or use the plot cell at the bottom of training/hpc_colab.ipynb
```
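if you only want numbers out of the `.metrics.jsonl` file, a few lines of python are enough. the sketch below assumes one json object per line with the reward stored under a key named `reward`; that key name is a guess, so inspect one line of your file and pass the real key.

```python
import json
from pathlib import Path


def load_rewards(path, key="reward"):
    """Read one json object per line and collect the values stored under `key`."""
    rewards = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            record = json.loads(line)
            if key in record:  # skip records that do not carry the metric
                rewards.append(float(record[key]))
    return rewards


def moving_average(values, window=10):
    """Trailing mean over up to `window` most recent values at each position."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

`moving_average(load_rewards("runs/<name>/<name>.metrics.jsonl"))` gives a smoothed series you can pass to matplotlib or just print.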
## 7 troubleshooting
| symptom | fix |
| --- | --- |
| `bwrap: setting up uid map: Permission denied` | enable unprivileged user namespaces: `sudo sysctl -w kernel.unprivileged_userns_clone=1` |
| `fuse-overlayfs: not found` | harmless, we fall back to copy mode. apt install it for <1 ms resets |
| `OSError: out of pty devices` | pexpect cannot allocate a PTY. rerun on a host with `/dev/ptmx` accessible (colab, hf spaces, most linux hosts) |
| `ModuleNotFoundError: gymnasium` / `pexpect` | `pip install -e .` again, or `pip install gymnasium pexpect httpx` |
| HF Space deploy: build fails on `fuse-overlayfs` install | ignore it. Spaces have apparmor restrictions, and the copy fallback still works |
| `huggingface_hub.run_uv` missing | upgrade: `pip install -U huggingface_hub`. otherwise `--dry-run-local` prints the shell script |
| training OOM on T4 | lower the settings to `--group-size 2 --max-new-tokens 256`, or switch to `Qwen/Qwen2.5-Coder-3B-Instruct` / `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit` |
| "no pty devices" when running training locally in a container | run on a linux host directly, or in colab |
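
for the two pty rows above, you can check up front whether the host can hand out a pseudo-terminal at all. this is a generic unix probe using `os.openpty`, not a tool shipped with the repo; the function name is illustrative.

```python
import os


def can_allocate_pty():
    """Try to open a pseudo-terminal pair; return False if the OS refuses."""
    try:
        master_fd, slave_fd = os.openpty()
    except OSError:
        return False
    # close both ends so the probe itself does not leak descriptors
    os.close(master_fd)
    os.close(slave_fd)
    return True
```

if this returns `False` inside your container, pexpect will most likely fail the same way, and moving to a plain linux host or colab is the simpler fix.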
## 8 one-line reproduction for judges
```bash
make help # list all targets
make gold # prove solvable
make bench # reset latency
make eval # policy leaderboard
make dry # training plumbing smoke test
make train # local grpo training
make train-remote ENV_URLS=https://your.hf.space # remote openenv training
```