Training and evaluation notebooks belong here.

This repository treats notebooks and trained-policy runs as supporting evidence for the environment, not the primary product.

Training policy:

- train on the live low-fidelity environment surface, including explicit `submit`
- keep the standard `training/llm_rollout.py` monitor/evaluate workflow on the same live contract as the notebook
- keep high-fidelity validation in offline tooling such as `baselines/high_fidelity_validation.py`

## Status

- [ ] Northflank notebook artifacts saved
- [x] repository GRPO notebook saved
- [ ] Colab mirror or public notebook link saved if required by the submission surface
- [x] tiny low-fi PPO smoke artifact saved
- [ ] fixed-seed untrained baseline artifact saved
- [ ] before/after trained-policy evidence saved

## Runnable paths

- install the training dependencies: `uv sync --extra training`
- tiny low-fi PPO smoke run: `uv run --extra training python training/ppo_smoke.py`
- generate an LLM-ready prompt payload: `uv run python training/llm_rollout.py prompt --seed 0`
- replay an LLM completion or action plan: `uv run python training/llm_rollout.py replay --seed 0 --completion-file <path>`
- monitor reward terms, action clamping, and verifier outcomes across seeds:
  `uv run python training/llm_rollout.py monitor --completion-file <path> --seeds 0,1,2`
- generate fresh model completions per seed and save aggregate reward/outcome metrics:
  `uv run python training/llm_rollout.py evaluate --completion-command 'python path/to/model_cli.py' --seeds 0,1,2`

Use `monitor` when you already have one completion or one action plan and want a fixed replay across seeds.
Use `evaluate` for before/after policy comparison because it generates a fresh completion per seed.
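Once two `evaluate` runs have saved aggregate metrics, the before/after comparison reduces to a per-seed diff. A minimal sketch of that diff, assuming (hypothetically) that each run's metrics were exported as a JSON object mapping seed to mean reward; the actual output schema of `training/llm_rollout.py` may differ, so treat the file format here as a placeholder:

```python
import json
from pathlib import Path


def reward_deltas(before: dict, after: dict) -> dict:
    """Per-seed reward change for seeds present in both runs."""
    return {seed: after[seed] - before[seed] for seed in before if seed in after}


def compare_runs(before_path: str, after_path: str) -> dict:
    # Each file is assumed to hold {"<seed>": <mean reward>, ...} --
    # a hypothetical schema, not the documented llm_rollout.py output.
    before = json.loads(Path(before_path).read_text())
    after = json.loads(Path(after_path).read_text())
    return reward_deltas(before, after)
```

Because `evaluate` fixes the seed list, the keys of the two runs line up and the delta is a like-for-like comparison.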

## Current validation target

- save one untrained fixed-seed baseline with `evaluate`
- run one short GRPO pass on Northflank H100 with the repository notebook
- rerun the same seeds and compare reward and low-fidelity feasibility before and after training

## Shared LLM Contract

The prompt/action/replay contract for LLM training lives in:

- `fusion_lab/llm_agent.py`

Use that module as the source of truth for:

- prompt formatting
- action-plan parsing
- local rollout replay
- rollout telemetry structure used by the monitor command

For `prompt`, `monitor`, `evaluate`, and the notebook, the shared helper contract now includes the live `submit` action.
Use offline validation scripts when you explicitly want high-fidelity checks outside the environment loop.

For `evaluate`, the completion command reads the prompt from `stdin` and writes a raw completion to `stdout`.
The current seed is exposed as the `FUSION_LAB_SEED` environment variable so the same command can be used
for fixed-seed before/after comparisons of untrained and trained checkpoints.
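A completion command for `evaluate` therefore only needs to read the prompt from stdin and write a raw completion to stdout. A minimal stub along those lines, useful for checking the plumbing before wiring in a real model; the emitted action plan below is a hypothetical placeholder, since the real prompt and action-plan schema live in `fusion_lab/llm_agent.py`:

```python
#!/usr/bin/env python
"""Stub completion command for `llm_rollout.py evaluate`.

Reads the rendered prompt from stdin, writes a raw completion to stdout.
The current seed arrives via the FUSION_LAB_SEED environment variable.
"""
import json
import os
import sys


def complete(prompt: str, seed: int) -> str:
    # A real implementation would call a model here; this stub just
    # echoes a fixed plan so the evaluate plumbing can be exercised.
    # NOTE: this plan shape is a made-up placeholder, not the schema
    # defined in fusion_lab/llm_agent.py.
    plan = {"seed": seed, "actions": [{"name": "submit"}]}
    return json.dumps(plan)


if __name__ == "__main__":
    prompt = sys.stdin.read()
    seed = int(os.environ.get("FUSION_LAB_SEED", "0"))
    sys.stdout.write(complete(prompt, seed))
```

Swapping the stub body for a model call keeps the same stdin/stdout/`FUSION_LAB_SEED` contract, so the same `--completion-command` invocation works for both untrained and trained checkpoints.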