# getting started - EnterpriseHPC-v0
end-to-end setup guide. covers a fresh linux machine, colab, and hugging
face spaces. pick the path that matches your situation.
## tl;dr fastest possible path
```bash
git clone https://github.com/<your-user>/low-taper-fade-openenv-scaler.git
cd low-taper-fade-openenv-scaler
python3.13 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
make gold # deterministic proof all 6 scenarios are solvable
make bench # reset-latency benchmark (<3 ms p50 in copy mode)
make eval # gold vs random vs bad policies, writes runs/eval/leaderboard.md
make reward-demo # gpu-free reward-curve png, proves reward improvement
make dry # training rollout smoke test, no gpu required
```
if everything passes, skip to [training paths](#training-paths).
## 1 prerequisites
### system packages (linux)
these are only required for the local sandbox. colab and hf jobs handle
them automatically.
```bash
sudo apt update
sudo apt install -y bubblewrap fuse-overlayfs fuse3 tini coreutils
bwrap --version # >= 0.6 recommended
fuse-overlayfs --version # optional, copy fallback also works
```
- `bubblewrap` (the `bwrap` binary) provides the user-namespace sandbox
- `fuse-overlayfs` gives you sub-1 ms resets. missing it is fine; we fall
  back to a shutil-copy path that still hits ~2.4 ms p50
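a quick preflight sketch for the two checks above. this is illustrative, not part of the repo: the `/proc` knob path is the standard linux one, and the backend-selection logic only mirrors the fallback described in the bullets.

```python
# preflight sketch: is the bwrap sandbox likely to work, and which reset
# backend would be used? (illustrative only -- not shipped with the repo)
import shutil
from pathlib import Path


def userns_enabled() -> bool:
    """True if unprivileged user namespaces look enabled (needed by bwrap)."""
    knob = Path("/proc/sys/kernel/unprivileged_userns_clone")
    # the knob is absent on kernels where userns is unconditionally enabled
    return knob.read_text().strip() == "1" if knob.exists() else True


def reset_backend() -> str:
    """fuse-overlayfs when present, else the shutil-copy fallback."""
    return "fuse-overlayfs" if shutil.which("fuse-overlayfs") else "copy"


print("userns:", userns_enabled(), "| reset backend:", reset_backend())
```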
### python
- python `>=3.12` is required. python `3.13` is the current unsloth
default (per their install docs) and the one used in `Dockerfile` +
`server/Dockerfile`
- `pip install -e '.[dev]'` installs the package in dev mode plus all
runtime deps (fastapi, uvicorn, gymnasium, pexpect, httpx,
matplotlib, numpy, etc.) and pytest
- `pip install -e '.[train]'` adds the gpu-training deps (torch,
transformers, trl, accelerate, peft, bitsandbytes, tensorboard,
datasets). only needed on the training host
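after `pip install -e '.[dev]'`, a one-liner can confirm the runtime deps actually resolved. the module list below just mirrors the dev bullet above; adjust it if your extras differ.

```python
# sanity-check that the dev extras are importable (list mirrors the bullet
# above; edit to taste)
import importlib.util

dev_mods = ["fastapi", "uvicorn", "gymnasium", "pexpect", "httpx",
            "matplotlib", "numpy", "pytest"]
missing = [m for m in dev_mods if importlib.util.find_spec(m) is None]
print("missing:", missing or "none")
```

an empty `missing` list means the editable install is good to go.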
## 2 sanity checks (no gpu, 15 seconds)
run these in order. any failure means the environment is misconfigured.
```bash
# proves every scenario is deterministically solvable
python -m tools.verify_gold_trajectory -v
# measures reset latency - should be under 10 ms
python -m bench.bench_reset -n 100
# runs gold/random/bad policies against every scenario,
# writes runs/eval/leaderboard.md
python -m eval.eval_suite --trials 2
```
## 3 run the openenv server locally
```bash
make serve # runs the server console script on 0.0.0.0:8000
# or equivalently (after pip install -e .)
server --host 0.0.0.0 --port 8000
```
smoke test in another terminal:
```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/tasks
curl -X POST http://127.0.0.1:8000/reset -H 'content-type: application/json' \
-d '{"task_id": "hpc_outage"}'
curl -X POST http://127.0.0.1:8000/step -H 'content-type: application/json' \
-d '{"action": {"command": "sinfo"}}'
```
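the same smoke test can be scripted in python. the request payloads mirror the curl examples above; the server's exact response schema is an assumption here, so the helper just decodes whatever json comes back.

```python
# python version of the curl smoke test above; payload field names come from
# the curl examples, the response schema is an assumption
import json
import urllib.request

BASE = "http://127.0.0.1:8000"


def post(path: str, payload: dict) -> dict:
    """POST json to the local openenv server and decode the reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# usage (with the server from `make serve` running):
#   post("/reset", {"task_id": "hpc_outage"})
#   post("/step", {"action": {"command": "sinfo"}})
```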
## 4 deploy to hugging face spaces (for remote training)
this is required if you want to train via `--env-urls https://...`. the
reference deployment lives at
[`huggingmenfordays/enterprise-hpc-openenv`](https://huggingface.co/spaces/huggingmenfordays/enterprise-hpc-openenv)
(public url: `https://huggingmenfordays-enterprise-hpc-openenv.hf.space`).
### first-time push
1. create a new space on huggingface.co - type `Docker`, any hardware tier
2. push this repo to the space:
```bash
hf auth login # once
huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
git remote add space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
git push space main
```
3. wait for the build. the space should expose your env at
`https://<user>-enterprise-hpc-openenv.hf.space`
4. smoke test:
```bash
curl https://<user>-enterprise-hpc-openenv.hf.space/health
```
### redeploying updates (orphan-branch trick)
this repo has `.venv/` and `docs/assets/*.png` binaries sitting in git
history that hf xet refuses to accept. a plain
`git push space final-round:main` will be rejected with
`pre-receive hook declined`. force-push a clean orphan snapshot instead:
```bash
hf auth login # ensure token is live
git remote set-url space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
git checkout --orphan space-deploy
git rm -rf --cached .
rm -f docs/assets/reward_curve_demo.png # drop binaries hf xet trips on
git add -A
git commit -m "deploy: clean snapshot for hf space"
git push space space-deploy:main --force
git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png # restore the png locally
```
your local `final-round` history stays intact; only the space's `main`
is rewritten. the build takes 5-10 min; hit `/health` to confirm it
came up green.
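since the build takes several minutes, it's handy to poll `/health` instead of refreshing by hand. a minimal sketch, assuming only that the space answers HTTP 200 on `/health` once it's up:

```python
# poll a space's /health until the build comes up green (sketch; assumes
# a plain 200 response on /health, per the smoke tests above)
import time
import urllib.request


def wait_healthy(base_url: str, timeout_s: float = 900, interval_s: float = 15) -> bool:
    """Return True once GET {base_url}/health answers 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=10) as r:
                if r.status == 200:
                    return True
        except OSError:
            pass  # build still in progress; retry after a pause
        time.sleep(interval_s)
    return False


# usage: wait_healthy("https://<user>-enterprise-hpc-openenv.hf.space")
```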
full guide: [`docs/hf_spaces_deploy.md`](./docs/hf_spaces_deploy.md)
## 5 training paths
### path A - local gpu (colab / single workstation)
```bash
python -m training.train_hpc_outage \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--scenarios hpc_outage,hpc_munge,hpc_pid_stale,hpc_gpu_ecc,hpc_nfs_stale,hpc_ood_apache \
--group-size 4 --max-turns 12 --num-train-steps 100 \
--output-dir ./runs/hpc_grpo_local
```
on colab open [`training/hpc_colab.ipynb`](./training/hpc_colab.ipynb) -
it handles all the setup. the t4 free tier works at `--group-size 2`;
l4 / a100 can push `--group-size 4+`.
### path B - remote hosted openenv (multiple spaces = throughput)
```bash
python -m training.hpc_openenv_gemma \
--env-urls https://<user>-enterprise-hpc-openenv.hf.space \
https://<user>-enterprise-hpc-openenv-2.hf.space \
--model Qwen/Qwen2.5-Coder-7B-Instruct \
--group-size 4 --max-turns 24 --num-train-steps 200 \
--curriculum --save-adapter-only
```
the pool round-robins across every `--env-urls` entry for parallel
rollouts. as of apr 23 2026 the remote server supports per-episode
sessions (keyed on `episode_id`), so `group_size > 1` against a single
space no longer clobbers episode state. the default `--max-turns` is
now `24` - many scenarios need 10+ turns once format compliance and
diagnostic steps are accounted for.
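the round-robin behaviour described above amounts to cycling through the url list. a toy sketch (the real trainer's class names and episode bookkeeping will differ):

```python
# toy sketch of a round-robin env pool; not the trainer's actual class
import itertools


class EnvPool:
    def __init__(self, urls: list[str]):
        self._cycle = itertools.cycle(urls)

    def next_url(self) -> str:
        # each rollout grabs the next url, spreading load across spaces
        return next(self._cycle)


pool = EnvPool(["https://env-1.hf.space", "https://env-2.hf.space"])
print([pool.next_url() for _ in range(4)])
# alternates env-1, env-2, env-1, env-2
```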
### path C - hf jobs (fully managed, gpu-on-demand)
```bash
python -m training.hf_jobs \
--env-urls https://<user>-enterprise-hpc-openenv.hf.space \
--repo-url https://huggingface.co/spaces/<user>/enterprise-hpc-openenv \
--gpu a10g-large \
--num-train-steps 300 \
--hub-repo <user>/hpc-grpo-runs
```
see [`docs/hf_jobs.md`](./docs/hf_jobs.md) for the full guide.
## 6 expected artifacts
every training run produces:
- `runs/<name>/<name>.metrics.jsonl` - reward curve time series
- tensorboard event files - view with `tensorboard --logdir ./runs`
- optional wandb run if `--wandb-project` is set
- optional lora adapter weights in `runs/<name>/`
to plot the reward curve locally:
```bash
tensorboard --logdir ./runs
# or use the plot cell at the bottom of training/hpc_colab.ipynb
```
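if you'd rather plot outside tensorboard, the `.metrics.jsonl` artifact is one json record per line. a minimal reader sketch; the field names inside each record are hypothetical, so inspect your own `runs/<name>/<name>.metrics.jsonl` for the real schema before plotting:

```python
# minimal reader for a metrics.jsonl time series; record field names are
# hypothetical -- check your own run's file for the real schema
import json


def load_metrics(path: str) -> list[dict]:
    """Parse one json record per non-empty line into a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```

from there, a matplotlib line plot over whatever step/reward fields your records carry is a few lines.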
## 7 troubleshooting
| symptom | fix |
| --- | --- |
| `bwrap: setting up uid map: Permission denied` | enable unprivileged user namespaces: `sudo sysctl -w kernel.unprivileged_userns_clone=1` |
| `fuse-overlayfs: not found` | harmless; we fall back to copy mode. `apt install` it for <1 ms resets |
| `OSError: out of pty devices` | pexpect cannot allocate a PTY. rerun on a host with `/dev/ptmx` accessible (colab, hf spaces, most linux hosts) |
| `ModuleNotFoundError: gymnasium` / `pexpect` | `pip install -e .` again, or `pip install gymnasium pexpect httpx` |
| HF Space deploy: build fails on `fuse-overlayfs` install | ignore - Spaces have apparmor restrictions; the copy fallback still works |
| `huggingface_hub.run_uv` missing | upgrade: `pip install -U huggingface_hub`. otherwise `--dry-run-local` prints the shell script |
| training OOM on T4 | lower `--group-size 2 --max-new-tokens 256`, or switch to `Qwen/Qwen2.5-Coder-3B-Instruct` / `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit` |
| "no pty devices" when running training locally in a container | run on a linux host directly, or in colab |
## 8 one-line reproduction for judges
```bash
make help # list all targets
make gold # prove solvable
make bench # reset latency
make eval # policy leaderboard
make dry # training plumbing smoke test
make train # local grpo training
make train-remote ENV_URLS=https://your.hf.space # remote openenv training
```