CreativeEngineer's picture
feat: reward verifier alignment, notebook hardening, model name fix
cdc237b
Training and evaluation notebooks belong here.
This repository treats notebooks and trained-policy runs as supporting evidence for the environment, not the primary product.
Training policy:
- train on the live low-fidelity environment surface, including explicit `submit`
- keep the standard `training/llm_rollout.py` monitor/evaluate workflow on the same live contract as the notebook
- keep high-fidelity validation in offline tooling such as `baselines/high_fidelity_validation.py`
## Status
- [ ] Northflank notebook artifacts saved
- [x] repository GRPO notebook saved
- [ ] Colab mirror or public notebook link saved if required by the submission surface
- [x] tiny low-fi PPO smoke artifact saved
- [ ] fixed-seed untrained baseline artifact saved
- [ ] before/after trained-policy evidence saved
## Runnable paths
- install the training dependencies: `uv sync --extra training`
- tiny low-fi PPO smoke run: `uv run --extra training python training/ppo_smoke.py`
- generate an LLM-ready prompt payload: `uv run python training/llm_rollout.py prompt --seed 0`
- replay an LLM completion or action plan: `uv run python training/llm_rollout.py replay --seed 0 --completion-file <path>`
- monitor reward terms, action clamping, and verifier outcomes across seeds:
`uv run python training/llm_rollout.py monitor --completion-file <path> --seeds 0,1,2`
- generate fresh model completions per seed and save aggregate reward/outcome metrics:
`uv run python training/llm_rollout.py evaluate --completion-command 'python path/to/model_cli.py' --seeds 0,1,2`
Use `monitor` when you already have one completion or one action plan and want a fixed replay across seeds.
Use `evaluate` for before/after policy comparison because it generates a fresh completion per seed.
## Current validation target
- save one untrained fixed-seed baseline with `evaluate`
- run one short GRPO pass on Northflank H100 with the repository notebook
- rerun the same seeds and compare reward plus low-fidelity feasibility before versus after
## Shared LLM Contract
The prompt/action/replay contract for LLM training lives in:
- `fusion_lab/llm_agent.py`
Use that module as the source of truth for:
- prompt formatting
- action-plan parsing
- local rollout replay
- rollout telemetry structure used by the monitor command
For `prompt`, `monitor`, `evaluate`, and the notebook, the shared helper contract now includes the live `submit` action.
Use offline validation scripts when you explicitly want high-fidelity checks outside the environment loop.
For `evaluate`, the completion command reads the prompt from `stdin` and writes a raw completion to `stdout`.
The current seed is exposed as the `FUSION_LAB_SEED` environment variable so the same command can be used
for fixed-seed before/after comparisons of untrained and trained checkpoints.