# getting started — EnterpriseHPC-v0

end-to-end setup guide. covers a fresh linux machine, colab, and hugging
face spaces. pick the path that matches your situation.

## tl;dr fastest possible path

```bash
git clone https://github.com/<your-user>/low-taper-fade-openenv-scaler.git
cd low-taper-fade-openenv-scaler
python3.13 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip setuptools wheel
pip install -e '.[dev]'
make gold         # deterministic proof all 6 scenarios are solvable
make bench        # reset-latency benchmark (<3 ms p50 in copy mode)
make eval         # gold vs random vs bad policies, writes runs/eval/leaderboard.md
make reward-demo  # gpu-free reward-curve png, proves reward improvement
make dry          # training rollout smoke test, no gpu required
```

if everything passes, skip to [training paths](#5-training-paths).

## 1 prerequisites

### system packages (linux)

these are only required for the local sandbox. colab and hf jobs handle
them automatically.

```bash
sudo apt update
sudo apt install -y bubblewrap fuse-overlayfs fuse3 tini coreutils
bwrap --version           # >= 0.6 recommended
fuse-overlayfs --version  # optional, copy fallback also works
```

- `bubblewrap` (the `bwrap` binary) provides the user namespace sandbox
- `fuse-overlayfs` gives you sub-1 ms resets. missing it is fine; we fall
  back to a shutil-copy path that still hits ~2.4 ms p50
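
the fallback decision amounts to a PATH check at startup. a minimal sketch of the idea (the function name is illustrative, not the repo's actual API):

```python
import shutil

def pick_reset_backend() -> str:
    """Prefer fuse-overlayfs when both binaries are on PATH, else copy.

    Illustrative only -- the real environment performs an equivalent
    check when it starts.
    """
    if shutil.which("bwrap") and shutil.which("fuse-overlayfs"):
        return "overlay"  # sub-1 ms resets via an overlay mount
    return "copy"         # shutil.copytree fallback, ~2.4 ms p50

# pick_reset_backend() -> "overlay" on a fully provisioned host
```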

### python

- python `>=3.12` is required. python `3.13` is the current unsloth
  default (per their install docs) and the one used in `Dockerfile` +
  `server/Dockerfile`
- `pip install -e '.[dev]'` installs the package in dev mode plus all
  runtime deps (fastapi, uvicorn, gymnasium, pexpect, httpx,
  matplotlib, numpy, etc.) and pytest
- `pip install -e '.[train]'` adds the gpu-training deps (torch,
  transformers, trl, accelerate, peft, bitsandbytes, tensorboard,
  datasets). only needed on the training host
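
the two extras correspond to optional-dependency groups in `pyproject.toml`; a representative fragment (not verbatim, version pins omitted):

```toml
[project.optional-dependencies]
dev = [
  "fastapi", "uvicorn", "gymnasium", "pexpect",
  "httpx", "matplotlib", "numpy", "pytest",
]
train = [
  "torch", "transformers", "trl", "accelerate",
  "peft", "bitsandbytes", "tensorboard", "datasets",
]
```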

## 2 sanity checks (no gpu, 15 seconds)

run these in order. any failure means the environment is misconfigured.

```bash
# proves every scenario is deterministically solvable
python -m tools.verify_gold_trajectory -v

# measures reset latency — should be under 10 ms
python -m bench.bench_reset -n 100

# runs gold/random/bad policies against every scenario,
# writes runs/eval/leaderboard.md
python -m eval.eval_suite --trials 2
```

## 3 run the openenv server locally

```bash
make serve                 # runs the server console script on 0.0.0.0:8000
# or equivalently (after pip install -e .)
server --host 0.0.0.0 --port 8000
```

smoke test in another terminal:

```bash
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/tasks
curl -X POST http://127.0.0.1:8000/reset -H 'content-type: application/json' \
  -d '{"task_id": "hpc_outage"}'
curl -X POST http://127.0.0.1:8000/step -H 'content-type: application/json' \
  -d '{"action": {"command": "sinfo"}}'
```

## 4 deploy to hugging face spaces (for remote training)

this is required if you want to train via `--env-urls https://...`. the
reference deployment lives at
[`huggingmenfordays/enterprise-hpc-openenv`](https://huggingface.co/spaces/huggingmenfordays/enterprise-hpc-openenv)
(public url: `https://huggingmenfordays-enterprise-hpc-openenv.hf.space`).

### first-time push

1. create a new space on huggingface.co — type `Docker`, any hardware tier (or skip this and let the cli command in the next step create it)
2. push this repo to the space:
   ```bash
   hf auth login           # once
   huggingface-cli repo create enterprise-hpc-openenv --type space --space_sdk docker
   git remote add space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv
   git push space main
   ```
3. wait for the build. the space should expose your env at
   `https://<user>-enterprise-hpc-openenv.hf.space`
4. smoke test:
   ```bash
   curl https://<user>-enterprise-hpc-openenv.hf.space/health
   ```

### redeploying updates (orphan-branch trick)

this repo has `.venv/` and `docs/assets/*.png` binaries sitting in git
history that hf xet refuses to accept. a plain
`git push space final-round:main` will be rejected with
`pre-receive hook declined`. force-push a clean orphan snapshot instead:

```bash
hf auth login                                                                  # ensure token is live
git remote set-url space https://huggingface.co/spaces/<user>/enterprise-hpc-openenv

git checkout --orphan space-deploy
git rm -rf --cached .
rm -f docs/assets/reward_curve_demo.png                                        # drop binaries hf xet trips on
git add -A
git commit -m "deploy: clean snapshot for hf space"
git push space space-deploy:main --force

git checkout final-round
git branch -D space-deploy
git checkout HEAD -- docs/assets/reward_curve_demo.png                         # restore the png locally
```

your local `final-round` history stays intact; only the space's `main`
is rewritten. the build takes 5-10 min; hit `/health` to confirm it
came up green.

full guide: [`docs/hf_spaces_deploy.md`](./docs/hf_spaces_deploy.md)

## 5 training paths

### path A — local gpu (colab / single workstation)

```bash
python -m training.train_hpc_outage \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --scenarios hpc_outage,hpc_munge,hpc_pid_stale,hpc_gpu_ecc,hpc_nfs_stale,hpc_ood_apache \
  --group-size 4 --max-turns 12 --num-train-steps 100 \
  --output-dir ./runs/hpc_grpo_local
```

on colab open [`training/hpc_colab.ipynb`](./training/hpc_colab.ipynb) —
it handles all the setup. the t4 free tier works at `--group-size 2`;
l4 / a100 can push `--group-size 4+`.

### path B — remote hosted openenv (multiple spaces = throughput)

```bash
python -m training.hpc_openenv_gemma \
  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
             https://<user>-enterprise-hpc-openenv-2.hf.space \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --group-size 4 --max-turns 24 --num-train-steps 200 \
  --curriculum --save-adapter-only
```

the pool round-robins across every `--env-urls` entry for parallel
rollouts. as of apr 23 2026 the remote server supports per-episode
sessions (keyed on `episode_id`), so `group_size > 1` against a single
space no longer clobbers episode state. the default `--max-turns` is
now `24` — many scenarios need 10+ turns once format compliance and
diagnostic steps are accounted for.
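
the round-robin behaviour can be pictured as a simple cycle over the url list. a sketch only; the real pool also tracks per-episode sessions keyed on `episode_id`:

```python
import itertools

def make_env_picker(env_urls: list):
    """Return a callable that cycles through --env-urls entries so
    concurrent rollouts spread load across spaces."""
    cycle = itertools.cycle(env_urls)
    return lambda: next(cycle)

pick = make_env_picker([
    "https://<user>-enterprise-hpc-openenv.hf.space",
    "https://<user>-enterprise-hpc-openenv-2.hf.space",
])
# four parallel rollouts alternate between the two spaces
rollout_targets = [pick() for _ in range(4)]
```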

### path C — hf jobs (fully managed, gpu-on-demand)

```bash
python -m training.hf_jobs \
  --env-urls https://<user>-enterprise-hpc-openenv.hf.space \
  --repo-url https://huggingface.co/spaces/<user>/enterprise-hpc-openenv \
  --gpu a10g-large \
  --num-train-steps 300 \
  --hub-repo <user>/hpc-grpo-runs
```

see [`docs/hf_jobs.md`](./docs/hf_jobs.md) for the full guide.

## 6 expected artifacts

every training run produces:

- `runs/<name>/<name>.metrics.jsonl` β€” reward curve time series
- tensorboard event files β€” `tensorboard --logdir ./runs`
- optional wandb run if `--wandb-project` is set
- optional lora adapter weights in `runs/<name>/`

to plot the reward curve locally:

```bash
tensorboard --logdir ./runs
# or use the plot cell at the bottom of training/hpc_colab.ipynb
```
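
to work with the metrics file directly, here is a minimal sketch. the `reward` field name is an assumption; check the first line of your own `.metrics.jsonl` and adjust `key`:

```python
import json
from pathlib import Path

def load_rewards(path: str, key: str = "reward") -> list:
    """Read one reward value per jsonl line (field name assumed)."""
    rewards = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            rewards.append(json.loads(line)[key])
    return rewards

def moving_average(xs: list, window: int = 8) -> list:
    """Trailing-window smoothing for a readable reward curve."""
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window + 1)
        out.append(sum(xs[lo:i + 1]) / (i + 1 - lo))
    return out
```

feed the smoothed series to matplotlib (already in the `[dev]` extras) for a local reward-curve plot.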

## 7 troubleshooting

| symptom | fix |
| --- | --- |
| `bwrap: setting up uid map: Permission denied` | enable unprivileged user namespaces: `sudo sysctl -w kernel.unprivileged_userns_clone=1` |
| `fuse-overlayfs: not found` | harmless, we fall back to copy mode. apt install it for <1 ms resets |
| `OSError: out of pty devices` | pexpect cannot allocate a PTY. rerun on a host with `/dev/ptmx` accessible (colab, hf spaces, most linux hosts) |
| `ModuleNotFoundError: gymnasium` / `pexpect` | `pip install -e .` again, or `pip install gymnasium pexpect httpx` |
| HF Space deploy: build fails on `fuse-overlayfs` install | ignore — Spaces have apparmor restrictions, the copy fallback still works |
| `huggingface_hub.run_uv` missing | upgrade: `pip install -U huggingface_hub`. otherwise `--dry-run-local` prints the shell script |
| training OOM on T4 | lower `--group-size 2 --max-new-tokens 256`, or switch to `Qwen/Qwen2.5-Coder-3B-Instruct` / `unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit` |
| "no pty devices" when running training locally in a container | run on a linux host directly, or in colab |

## 8 one-line reproduction for judges

```bash
make help                                         # list all targets
make gold                                         # prove solvable
make bench                                        # reset latency
make eval                                         # policy leaderboard
make dry                                          # training plumbing smoke test
make train                                        # local grpo training
make train-remote ENV_URLS=https://your.hf.space  # remote openenv training
```