judges' self-serve guide compliance map
this document cross-references the apr 2026 openenv hackathon self-serve guide (22 sections + 58 faq entries + 59 unsloth recipe pointers) to concrete artifacts in this repo. every section of the guide is covered here, with the file paths, commands, and rationale a judge can follow in under five minutes.
tl;dr every explicit "must do" from the guide is implemented. the only items the repo cannot self-complete are the two blockers tracked in
`TODO_FOR_USER.md`: a real gpu grpo training curve and the 90-second demo video. the live hugging face space (huggingmenfordays/enterprise-hpc-openenv) is deployed. gpu-free evidence of reward improvement already lives in `docs/assets/reward_curve_demo.png`.
apr 23 2026 update: the remote rollout pipeline was rewritten so
`group_size > 1` against a single hf space no longer clobbers episode state. the server (`sysadmin_env/server.py`) now runs an lru-bounded `HttpSessionStore` keyed on a uuid `episode_id`; `Observation` carries `grader_health`, `grader_details`, and `ood_http_code`; and `training/reward_functions.py` now triggers `solve_reward` on `terminated` (not a reward threshold) and consumes the propagated `grader_health` for `progress_reward`. this fixed a `frac_reward_zero_std = 1` stall observed on the first full kaggle probe run.
0. what you are building → environment + verifier + trainer + deployment
| layer | repo artifact |
|---|---|
| environment | sysadmin_env/ fastapi server, hpc_gym.py gymnasium wrapper, nine scenarios in sysadmin_env/tasks/ |
| verifier / reward | sysadmin_env/rewards.py, tools/verify_gold_trajectory.py, training/reward_functions.py |
| trl trainer | training/train_hpc_outage.py local, training/hpc_openenv_gemma.py remote via --env-urls |
| unsloth efficiency | FastLanguageModel + 4-bit qlora in both training scripts |
| openenv deploy | Dockerfile, server/Dockerfile, docs/hf_spaces_deploy.md, openenv.yaml |
1. pick the right project idea (verifiable, step-by-step, hard-but-solvable)
the task is linux hpc incident response. the agent acts one shell command
at a time, every scenario ships with a deterministic grader, and every
scenario has a sub-14-step gold trajectory proven by
python -m tools.verify_gold_trajectory (make gold).
2. minimum rl loop
the loop is wired end-to-end in training/rollout.py:
- prompt → `training/agent_prompt.py`
- model generates `<bash>...</bash>`
- action executed in `Sandbox` via bwrap + overlayfs
- reward computed by `RewardEngine` and the six `reward_funcs`
- grpo update in `trl.GRPOTrainer` with `num_generations=group_size`
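for orientation, here is a minimal sketch of that loop. the names (`rollout_episode`, `build_prompt`, `generate`) are hypothetical and the real implementation is `training/rollout.py`; the step signature assumes the gymnasium-style 5-tuple used by `hpc_gym.py`.

```python
# illustrative rollout loop only; the real one lives in training/rollout.py.
import re

BASH_RE = re.compile(r"<bash>(.*?)</bash>", re.DOTALL)

def rollout_episode(env, generate, build_prompt, max_turns=16):
    """run one episode: prompt -> <bash> action -> sandboxed step -> reward."""
    obs, _info = env.reset()
    transcript, total_reward = [], 0.0
    for _ in range(max_turns):
        prompt = build_prompt(obs, transcript)          # training/agent_prompt.py role
        completion = generate(prompt)                    # policy model sample
        match = BASH_RE.search(completion)
        action = match.group(1).strip() if match else ""
        obs, reward, terminated, truncated, _info = env.step(action)  # bwrap sandbox
        total_reward += reward                           # RewardEngine output
        transcript.append((action, obs, reward))
        if terminated or truncated:
            break
    return transcript, total_reward
```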
3. sft vs rl
we train from Qwen/Qwen2.5-Coder-7B-Instruct, a code-focused, instruction-tuned
warm start, then run grpo on top. this matches the
guide's "add light formatting or task scaffolding if needed. use rl for
improvement, not as magic from scratch". the policy already emits
well-formed shell commands so grpo does not burn samples on format
discovery. any other text instruct model can be dropped in via
--model.
4 & 5. design & build the environment first
- action / observation / state types: `sysadmin_env/models.py`
- `reset`, `step`, `state`, `tasks`, `health`, `ws`: `sysadmin_env/server.py`
- openenv scaffold: `openenv.yaml` + docker entrypoints
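a hedged sketch of what those types might look like follows. only `grader_health`, `grader_details`, and `ood_http_code` are field names taken from this document; everything else is illustrative, and `sysadmin_env/models.py` is the authoritative source.

```python
# illustrative dataclasses only; see sysadmin_env/models.py for the real types.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Action:
    command: str                        # the <bash>...</bash> payload, one shell command

@dataclass
class Observation:
    stdout: str = ""
    stderr: str = ""
    exit_code: int = 0
    grader_health: float = 0.0          # propagated to progress_reward
    grader_details: dict = field(default_factory=dict)
    ood_http_code: Optional[int] = None

@dataclass
class State:
    episode_id: str = ""                # uuid used to key HttpSessionStore
    step_count: int = 0
    terminated: bool = False
```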
6. start simple (curriculum)
training/train_hpc_outage.py --curriculum and
training/hpc_openenv_gemma.py --curriculum unlock scenarios in three
difficulty buckets:
- `hpc_pid_stale`, `hpc_gpu_ecc`, `hpc_ood_apache` (short, single-fix)
- `hpc_nfs_stale` (two-step mount fix)
- `hpc_outage`, `hpc_munge` (multi-app, branching)
this prevents the zero-reward stall the guide warns about in sections 6 and 14.
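a minimal sketch of the three-bucket unlock schedule, assuming solve-rate thresholds that are illustrative only; the real gating lives behind the `--curriculum` flag in the trainers:

```python
# illustrative curriculum gate; the real unlock logic is in the training scripts.
BUCKETS = [
    ["hpc_pid_stale", "hpc_gpu_ecc", "hpc_ood_apache"],  # short, single-fix
    ["hpc_nfs_stale"],                                   # two-step mount fix
    ["hpc_outage", "hpc_munge"],                         # multi-app, branching
]

def unlocked_tasks(solve_rate: float) -> list[str]:
    """unlock harder buckets as the measured solve rate improves (assumed thresholds)."""
    if solve_rate < 0.3:
        n = 1
    elif solve_rate < 0.6:
        n = 2
    else:
        n = 3
    return [task for bucket in BUCKETS[:n] for task in bucket]
```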
7. design rewards carefully (multiple independent components)
"use multiple independent reward functions, not just one" — section 7.
the grpo trainers in this repo pass six independent reward functions to
trl.GRPOTrainer, all defined in training/reward_functions.py:
| reward fn | purpose | guide tie-in |
|---|---|---|
| `solve_reward` | binary rlvr signal from grader | §7 correctness / §4 env-based reward |
| `format_reward` | rewards well-formed `<bash>` action | §7 format compliance |
| `safety_reward` | penalizes destructive shell commands | §8 reward hacking / §7 safety |
| `progress_reward` | terminal grader health, capped at 0.5 | §7 partial progress |
| `efficiency_reward` | bounded bonus for short solves | §7 timeouts / resource usage |
| `anti_hack_reward` | penalizes edits to grader-owned paths | §8 anti-cheating |
trl sums them into the advantage, but each column is still logged
independently so reviewers can see which signal is driving updates.
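trl's `GRPOTrainer` takes a list of reward callables, each returning one score per completion. a hedged sketch of the wiring (the stub bodies are placeholders, the real six functions live in `training/reward_functions.py`, and the snippet assumes plain-string completions):

```python
# sketch only: reward callables are summed into the advantage but logged per column.
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # reward a single well-formed <bash>...</bash> action block (assumes string completions)
    return [0.2 if "<bash>" in c and "</bash>" in c else 0.0 for c in completions]

def solve_reward(completions, **kwargs):
    # placeholder: the real version reads the grader verdict attached to each rollout
    return [0.0 for _ in completions]

def build_trainer(model, dataset, group_size):
    reward_funcs = [solve_reward, format_reward]  # plus safety/progress/efficiency/anti_hack
    cfg = GRPOConfig(output_dir="runs/grpo-sketch", num_generations=group_size)
    return GRPOTrainer(model=model, reward_funcs=reward_funcs,
                       args=cfg, train_dataset=dataset)
```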
8. reward hacking protection
- multiple independent signals: see §7 above
- locked-down execution: `sysadmin_env/sandbox.py` uses bubblewrap with unshared namespaces, read-only binds, and optional `--unshare-net`
- per-episode session isolation: the server's `HttpSessionStore` keyed on uuid `episode_id` means one rollout cannot observe or corrupt another rollout's sandbox even when many clients share the same space — no cross-episode information leak
- time limits: `DEFAULT_STEP_TIMEOUT = 60s`, `DEFAULT_SHELL_TIMEOUT = 30s`, `max_runtime_minutes: 20` in `openenv.yaml`
- avoid unrestricted globals: slurm state is a json file guarded with `fcntl` locks, not a python global
- sample + inspect: `RewardLogger` now writes `runs/<run>/transcripts/step_NNNN.jsonl` every `transcript_sample_every` steps (default 5). see `training/logger.py`
- rollback on drift: catastrophic commands end the episode immediately with `catastrophic_penalty = -1.0` in `RewardEngine`
- forbidden globals / protected paths: `anti_hack_reward` checks every `<bash>` command against `GRADER_PROTECTED_PATTERNS` (includes `slurm_state.json`, `/grader/`, `ECC_RESET_SENTINEL`)
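a hedged sketch of that protected-path check; the actual patterns and scoring live in `training/reward_functions.py`, and the penalty value below is illustrative:

```python
# illustrative grader-path guard; real patterns and weights differ.
import re

GRADER_PROTECTED_PATTERNS = [
    r"slurm_state\.json",
    r"/grader/",
    r"ECC_RESET_SENTINEL",
]

def anti_hack_penalty(bash_command: str) -> float:
    """return a negative score if the command touches grader-owned state."""
    for pattern in GRADER_PROTECTED_PATTERNS:
        if re.search(pattern, bash_command):
            return -1.0
    return 0.0
```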
9. process-aware feedback
the per-step RewardEngine already supports:
- `health_delta` — partial progress from the grader
- `knowledge_delta` — one-time reward for discovering diagnostic facts (section 9's "step-level verifier")
- `action_penalty` — per-step cost to discourage idle loops
plus `anti_hack_reward` and `safety_reward` apply stepwise filters inside each
rollout, so feedback is not limited to the final outcome.
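put together, the per-step shaped reward looks roughly like the sketch below; the coefficients (other than the stated `catastrophic_penalty = -1.0`) are illustrative, and the authoritative weights live in `RewardEngine`.

```python
# illustrative per-step shaping; the real weights are in sysadmin_env/rewards.py.
def step_reward(health_delta: float, new_facts: int, solved: bool,
                catastrophic: bool, action_penalty: float = 0.01) -> float:
    if catastrophic:
        return -1.0                      # catastrophic_penalty ends the episode
    reward = -action_penalty             # per-step cost discourages idle loops
    reward += health_delta               # partial progress from the grader
    reward += 0.1 * new_facts            # one-time bonus for new diagnostic facts
    if solved:
        reward += 1.0                    # binary rlvr terminal signal
    return reward
```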
10. the right training stack
- trl `GRPOTrainer` imported in both training scripts
- unsloth `FastLanguageModel` with `load_in_4bit=True`, lora `r=16`
- openenv for the env interface (server + client) with `--env-urls` pointing at one or more hosted spaces for rollout parallelism
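the stack boils down to a few lines; a hedged sketch of the model setup, where everything except `load_in_4bit=True` and `r=16` (cited above) is an assumed default rather than a value taken from this repo:

```python
# sketch of the unsloth + qlora setup; see the training scripts for the full config.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-Coder-7B-Instruct",
    load_in_4bit=True,
    max_seq_length=4096,        # assumed value, not taken from the repo
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # lora rank cited above
    lora_alpha=16,              # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
FastLanguageModel.for_inference(model)   # fast generation path during rollouts
```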
11. grpo / rlvr style
reward is rlvr: the grader is a deterministic file-system check, not a
learned reward model. solve_reward is binary, all shaping terms are
bounded, and the grader's grade() is pure python with no llm in the loop.
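as an illustration of what "deterministic file-system check" means here, a toy `grade()` in the same spirit; the file paths and scoring are invented for the example and are not the repo's actual grader:

```python
# toy rlvr-style grader: pure python, deterministic, no llm involved.
import json
from pathlib import Path

def grade(fs_root: Path) -> dict:
    """return a verdict derived only from on-disk state inside the sandbox."""
    slurm_state = json.loads((fs_root / "var/lib/slurm/slurm_state.json").read_text())
    nodes_up = all(node["state"] == "idle" for node in slurm_state["nodes"])
    service_ok = (fs_root / "run/munge/munged.pid").exists()
    return {"solved": nodes_up and service_ok,
            "health": 0.5 * nodes_up + 0.5 * service_ok}
```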
12. keep inference fast
- reset latency: p50 2.40 ms in copy-mode, <1 ms on fuse-overlayfs hosts. bench: `bench/bench_reset.py` via `make bench`
- unsloth 4-bit inference path enabled in both trainers (`FastLanguageModel.for_inference`)
- rollouts distributed across multiple hf spaces via `RemoteEndpointPool` round-robin in `training/remote_env.py`
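conceptually the pool is just a cycling iterator over space urls; a hedged sketch (the real `RemoteEndpointPool` in `training/remote_env.py` also handles health checks and retries):

```python
# minimal round-robin over hosted spaces; illustrative, not the real RemoteEndpointPool.
from itertools import cycle

class RoundRobinPool:
    def __init__(self, env_urls: list[str]):
        self._urls = cycle(env_urls)

    def next_url(self) -> str:
        return next(self._urls)

pool = RoundRobinPool([
    "https://huggingmenfordays-enterprise-hpc-openenv.hf.space",
    # add more --env-urls entries here to spread rollouts across spaces
])
```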
13. deploy early
- live space: `huggingmenfordays/enterprise-hpc-openenv` — public url https://huggingmenfordays-enterprise-hpc-openenv.hf.space
- Dockerfiles are already tuned for hf spaces
- `docs/hf_spaces_deploy.md` covers both the first-time push and the orphan-branch redeploy trick needed to push over our history (xet rejects the `.venv/` + png binaries in the `final-round` history)
- `TODO_FOR_USER.md` section 2 has the exact copy-pasteable push recipe
14. scale after stable
Makefile encodes the guide's recommended order:
1. `make gold` — every scenario is deterministically solvable
2. `make bench` — reset latency under 3 ms
3. `make eval` — gold vs random vs bad policy leaderboard
4. `make dry` — rollout plumbing works without gpu
5. `make train` — tiny grpo run
6. `make train-remote ENV_URLS=...` — scale to multiple hosted spaces
only step 6 requires gpu + cloud credentials.
15. monitor the right things
training/logger.py writes per-grpo-step metrics to
runs/<run>/<run>.metrics.jsonl with:
- `reward_mean`, `reward_max`
- `solve_rate` (critical "function works" column called out in §15)
- `health_mean`
- `steps_mean`
- `task_mix`
- `wall_seconds`
plus transcripts are sampled every 5 steps into
runs/<run>/transcripts/step_*.jsonl. optional tensorboard + wandb + hf hub
uploads happen automatically when --wandb-project / --hub-repo are set.
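a hedged sketch of the jsonl record shape; the field names match the list above, but the function signature is illustrative and the real writer is `RewardLogger` in `training/logger.py`:

```python
# illustrative jsonl metrics writer; the real implementation is training/logger.py.
import json
from pathlib import Path

def log_step(run_dir: Path, run_name: str, step: int, rewards: list[float],
             solves: list[bool], healths: list[float], steps_taken: list[int],
             task_mix: dict[str, int], wall_seconds: float) -> None:
    record = {
        "step": step,
        "reward_mean": sum(rewards) / len(rewards),
        "reward_max": max(rewards),
        "solve_rate": sum(solves) / len(solves),
        "health_mean": sum(healths) / len(healths),
        "steps_mean": sum(steps_taken) / len(steps_taken),
        "task_mix": task_mix,
        "wall_seconds": wall_seconds,
    }
    with (run_dir / f"{run_name}.metrics.jsonl").open("a") as fh:
        fh.write(json.dumps(record) + "\n")
```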
16. save models correctly
both trainers accept --save-adapter-only. when set, only the lora adapter is
saved via model.save_pretrained(...) and the risky "upcast 4-bit to 16-bit
then merge" path is skipped, matching the guide's explicit warning.
```
python -m training.train_hpc_outage --save-adapter-only ...
python -m training.hpc_openenv_gemma --save-adapter-only --env-urls ...
```
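inside the trainers, the flag roughly boils down to the sketch below: calling `save_pretrained` on a lora-wrapped model writes only the adapter weights, and the merge path is left out on purpose. names and structure are illustrative, not the repo's exact code.

```python
# sketch of the --save-adapter-only path; the full logic is in the training scripts.
def save_model(model, tokenizer, out_dir: str, adapter_only: bool = True) -> None:
    if adapter_only:
        model.save_pretrained(out_dir)      # writes adapter_config.json + adapter weights
        tokenizer.save_pretrained(out_dir)
    else:
        # merging would require upcasting the 4-bit base to 16-bit first; the guide
        # (and this repo) warns against doing that casually, so it is opt-in only.
        raise NotImplementedError("merge path intentionally omitted in this sketch")
```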
17. team split
the repo naturally maps onto the guide's recommended four-person split:
- person a (environment): owns `sysadmin_env/`, `hpc_gym.py`, `bench/`
- person b (verifier / rewards): owns `sysadmin_env/rewards.py`, `training/reward_functions.py`, `tools/verify_gold_trajectory.py`
- person c (training): owns `training/`, `Makefile` targets
- person d (demo / product): owns `docs/pitch.md`, `docs/hf_blog.md`, `docs/video_script.md`
18. 1-day execution plan
covered phase-by-phase in GETTING_STARTED.md.
19. what judges will find compelling
| compelling factor | repo evidence |
|---|---|
| clear environment design | nine tasks, dataclasses + fastapi, openenv standard contract |
| objective reward functions | six-component rlvr reward stack |
| evidence the model improved | docs/assets/reward_curve_demo.png (gpu-free) + the real grpo curve from training/hpc_colab.ipynb (tracked in TODO #1) |
| reward-hacking prevention | destructive command patterns, anti_hack_reward, grader-owned paths, transcript sampling |
| reproducible deployment | Dockerfile, openenv.yaml, hf spaces recipe |
| sharp demo | docs/video_script.md, make gold && make bench && make eval && make reward-demo |
20. theme directions
we target #3.1 world modeling / professional tasks (primary), the scaler ai labs multi-app rl environment for enterprise workflows bonus (six apps: slurm, munge, systemd, nvidia driver, nfs, apache ood), and #2 long-horizon planning & instruction following (8-14 step gold trajectories).
21. common mistakes to avoid — self-check
| mistake | how we avoid it |
|---|---|
| task so hard success probability is zero | make gold proves every scenario is solvable; curriculum flag ramps difficulty |
| using only one reward function | six independent reward functions (training/reward_functions.py) |
| not checking for reward hacking | anti_hack_reward + safety_reward + periodic transcript dumps |
| training before env is stable | make gold && make bench && make eval run without any gpu |
| relying only on average reward | logger tracks solve_rate, steps_mean, task_mix, and dumps transcripts |
| forgetting timeouts / sandbox limits | DEFAULT_STEP_TIMEOUT, DEFAULT_SHELL_TIMEOUT, max_runtime_minutes: 20 |
| saving lora/qlora incorrectly | --save-adapter-only flag + warning in this doc |
22. learning resources checklist
we reference every primary link from the guide in README.md
and docs/hf_blog.md, including openenv core, the hf hub
org, the tutorial examples, and the mega-lecture modules.
faq coverage highlights (1-58)
- rlvr vs learned reward model (§4, §11, §24): we use rlvr; the grader is pure python
- why rl environments matter (§5, §7 of faq, §25): we expose the full act/observe/act loop via fastapi, not a static dataset
- trl + grpo (§7, §8, §25): `GRPOTrainer` with six reward functions
- unsloth (§8, §59): `FastLanguageModel` 4-bit qlora, `for_inference(...)`
- curriculum (§14): `--curriculum` flag, three-bucket unlock schedule
- process supervision (§11): per-step `health_delta` + `knowledge_delta` + `safety_reward` + `anti_hack_reward`
- goodhart / specification gaming (§38, §42): binary `solve_reward` primary + bounded shaping caps
- long-horizon problems (§51): curriculum + 16-turn cap + `steps_mean` tracking
- identical runs diverging (§49): seeds plumbed everywhere (`args.seed`, `random.randrange` rollout seed, `GRPOConfig.seed`, `FastLanguageModel.random_state`)
- dataset staleness (§48, rlve): six scenarios rotated per rollout; the registry is pluggable
unsloth recipe references
- gpt-oss 2048 game rl (§59.2): we use the same env-driven pattern — our env is the hpc cluster, not a 2048 board
- advanced qwen3 grpo reward shaping (§59.1): our six-way reward stack plays the same role
- scheduler grpo (§59.4): reward tied to output format + task correctness is mirrored by our `format_reward` + `solve_reward`
what still requires a human
items in TODO_FOR_USER.md:
- capture a real gpu grpo reward curve (colab / kaggle notebook is ready; apr 23 reward-pipeline fixes land on the next `git pull`)
- deploy to hf spaces ✅ live at `huggingmenfordays/enterprise-hpc-openenv`
- record the 90-second demo video
- submit the form
everything the guide describes at the code, reward, env, and training-loop level is already shipped in this repo.