CreativeEngineer committed on
Commit e8e5af5 · 1 Parent(s): ddcb837

docs: align training workflow and plan ssot

README.md CHANGED
@@ -15,7 +15,7 @@ An RL environment where agents optimize stellarator fusion reactor designs by ad
 | `average_triangularity` | ≤ -0.5 |
 | `edge_iota_over_nfp` | ≥ 0.3 |
 
-The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. Each episode has a budget of **6 evaluations** across **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit).
+The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. The live environment still exposes **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit), but the standard GRPO notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on the low-fidelity `run` surface and ignore `submit` by default.
 
 ## Architecture
 
@@ -30,8 +30,10 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
 - `P1` is locked as the benchmark task with `constellaration` as verifier of record
 - The repaired 4-knob low-dimensional boundary family is wired into the runtime path
 - Environment deployed to HF Spaces and verified (health, reset, step all operational)
-- GRPO training notebook created with Unsloth + TRL integration
+- GRPO training notebook is checked into the repo and aligned with the shared `fusion_lab/llm_agent.py` contract
+- LLM rollout tooling can now generate fresh model completions per seed and save fixed-seed reward/outcome summaries
 - Low-fidelity PPO smoke artifacts and paired high-fidelity fixture checks exist
+- Before/after trained-policy evidence on the current low-fidelity-only workflow is still open
 
 ## Execution Status
 
@@ -65,7 +67,7 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
 - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
 - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
-- High-fidelity VMEC-backed `submit` is too expensive to serve as the normal RL inner loop. Keep training rollouts on low-fidelity `run`, then use high-fidelity calls for paired fixtures, submit-side traces, sparse checkpoint evaluation, and final evidence.
+- The standard LLM training and evaluation workflow is now low-fidelity-only: the repo notebook and `training/llm_rollout.py` `monitor` / `evaluate` ignore `submit` by default. Reserve `submit` for explicit replay/debug work, paired fixture checks, submit-side traces, and final evidence.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
 - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
@@ -135,7 +137,9 @@ uv sync --extra notebooks
 - [x] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
 - [x] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
 - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
-- [ ] Keep any checkpoint high-fidelity evaluation sparse enough that the low-fidelity inner loop stays fast.
+- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
+- [ ] Run one short H100 GRPO pass with the repository notebook on the same low-fidelity-only workflow.
+- [ ] Re-run the same seeds after training and save one before/after artifact.
 - [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
 - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
 - [x] Deploy the environment to HF Space.
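The 26-action arithmetic in the README diff above (4 parameters × 2 directions × 3 magnitudes, plus `restore_best` and `submit`) can be sanity-checked with a short sketch. The parameter and magnitude names below are placeholders, not the environment's real identifiers:

```python
from itertools import product

# Placeholder names; the environment's four boundary knobs are not named here.
params = ["knob_a", "knob_b", "knob_c", "knob_d"]
directions = ["increase", "decrease"]
magnitudes = ["small", "medium", "large"]

# 4 x 2 x 3 = 24 parameter moves, plus the two special actions.
actions = [f"{p}:{d}:{m}" for p, d, m in product(params, directions, magnitudes)]
actions += ["restore_best", "submit"]

print(len(actions))  # 26
```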
docs/{findings/FUSION_DESIGN_LAB_PLAN_V2.md → FUSION_DESIGN_LAB_PLAN_V2.md} RENAMED
@@ -37,19 +37,23 @@ Completed:
 - an initial low-fidelity manual playtest note now exists
 - paired high-fidelity fixture checks for those tracked fixtures now exist
 - one submit-side manual playtest trace exists
+- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
+- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines
 
 Still open:
 
 - decision on whether reset-seed pool should change from paired checks
 - HF Space deployment evidence
-- public notebook artifact wiring
+- public Colab mirror or notebook submission link, if the submission surface still requires it
+- before/after trained-policy evidence on the current low-fidelity-only workflow
 - demo and README polish after the artifacts are real
 
 Current caution:
 
 - do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
 - do not narrate low-fidelity rollout metrics as final submission truth
-- do not move high-fidelity VMEC-backed `submit` into the normal RL inner loop; keep it for truth checks and sparse evaluation
+- the standard notebook and `training/llm_rollout.py` `monitor` / `evaluate` paths now stay on low-fidelity `run` only and ignore `submit` by default
+- reserve VMEC-backed `submit` for replay/debug work, paired fixture checks, submit-side traces, and final evidence
 
 ## 3. Locked Decisions
 
@@ -98,8 +102,9 @@ Use the docs like this:
 
 Visible artifacts:
 
-- [ ] HF Space environment
-- [ ] Public submission notebook
+- [x] HF Space environment
+- [x] Repository training notebook
+- [ ] Public Colab mirror or submission notebook link if required
 - [ ] 1-minute demo video
 - [x] Public repo and README
 
@@ -116,10 +121,12 @@ Evidence order:
 - [x] fixture checks
 - [x] manual playtest log
 - [x] tiny low-fi PPO smoke trace
+- [x] shared-helper notebook alignment
+- [x] model-driven low-fi LLM evaluation tooling
 - [ ] reward iteration note
 - [ ] stable local and remote episodes
 - [x] random and heuristic baselines
-- [ ] notebook evidence
+- [ ] before/after trained-policy evidence
 - [ ] demo and repo polish
 
 ## 7. Environment Summary
@@ -141,11 +148,14 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
 - [x] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
 - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
 - [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
+- [ ] Save one fixed-seed untrained baseline with the low-fidelity-only `training/llm_rollout.py evaluate` workflow.
+- [ ] Run one short H100 GRPO pass with the repository notebook on that same low-fidelity-only workflow.
+- [ ] Re-run the same seeds after training and save one before/after artifact.
 - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
 - [x] Refresh the heuristic baseline using the repaired-family evidence.
 - [ ] Prove a stable local episode path.
 - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
-- [ ] Wire the public notebook artifact to the live environment.
+- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
 - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
 - [ ] Polish the public repo only after the artifacts above exist.
 
@@ -189,6 +199,12 @@ Gate 8: submission artifacts exist
 
 - the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
 
+Gate 9: trained-policy evidence is real
+
+- one fixed-seed untrained baseline exists
+- one short low-fidelity training pass exists on the same workflow
+- the repo can show a before/after comparison on the same seeds without relying on `submit`
+
 ## 10. Fallback Rules
 
 If training evidence is weak:
@@ -196,6 +212,7 @@ If training evidence is weak:
 - keep claims conservative about policy quality
 - still ship a trained-policy demonstration and document its limitations plainly
 - do not skip the paired high-fidelity checks or submit-side manual trace
+- do not swap back to submit-included reward traces and present them as the current GRPO path
 
 If HF Space deployment is delayed:
 
@@ -225,4 +242,7 @@ If the repaired family is too easy:
 - [x] Pair the tracked fixtures with high-fidelity submit checks.
 - [x] Record one submit-side manual trace.
 - [x] Refresh the heuristic baseline from that playtest evidence.
+- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
+- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
+- [ ] Re-run the same seeds and save a before/after artifact.
 - [ ] Verify one clean HF Space episode with the same contract.
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -1,7 +1,7 @@
 # P1 Environment Contract V1
 
 **Role:** Live technical contract SSOT for the current implementation phase
-**Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](FUSION_DESIGN_LAB_PLAN_V2.md)
+**Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](./FUSION_DESIGN_LAB_PLAN_V2.md)
 **Evidence dependency:** [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md)
 
 ## 1. Scope
@@ -179,7 +179,8 @@ VMEC preset mapping:
 Training and evaluation rule:
 
 - use low-fidelity `run` as the RL inner-loop surface
-- keep higher-fidelity `submit` for terminal truth, paired fixture checks, submit-side manual traces, and sparse checkpoint evaluation
+- the standard repository notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on low-fidelity `run` only and ignore `submit` by default
+- keep higher-fidelity `submit` for terminal truth, explicit replay/debug work, paired fixture checks, and submit-side manual traces
 - do not move VMEC-backed submit evaluation into every training step unless the contract is deliberately redefined
 
 ## 9. Reward V0
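The fidelity split in that training-and-evaluation rule can be sketched as a loop shape. Everything below is illustrative: `low_fi_run` and `high_fi_submit` are hypothetical stand-ins for the environment's `run` and `submit` surfaces, and the checkpoint cadence is an arbitrary example value, not a documented default:

```python
def low_fi_run(design):
    # Stand-in for the cheap (~0.6s) low-fidelity `run` evaluation.
    return sum(design)

def high_fi_submit(design):
    # Stand-in for the expensive (~4s) high-fidelity `submit` evaluation.
    return sum(design)

def training_loop(designs, checkpoint_every=100):
    """Low-fidelity scores drive every step; high-fidelity stays sparse."""
    low_fi_scores, high_fi_checks = [], []
    for step, design in enumerate(designs, start=1):
        low_fi_scores.append(low_fi_run(design))           # inner loop, every step
        if step % checkpoint_every == 0:
            high_fi_checks.append(high_fi_submit(design))  # sparse truth check only
    return low_fi_scores, high_fi_checks
```

The point of the shape is that per-step reward never depends on the expensive call, so moving `submit` into every step would redefine the contract rather than tune it.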
docs/VERIFIER_REWARD_REVIEW_2026-03-08.md CHANGED
@@ -2,6 +2,10 @@
 
 Date: 2026-03-08
 
+Historical note:
+
+This review captured repo state earlier on March 8, 2026, before the later notebook/helper alignment and low-fidelity-only LLM evaluation workflow landed. Fixtures, manual playtest notes, notebook wiring, and model-driven low-fidelity rollout tooling were added later the same day. Treat the repo-level gap callouts below as a dated snapshot; the main remaining validation gap is before/after trained-policy evidence on the current low-fidelity-only workflow.
+
 ## Scope
 
 This note reviews how well the current verifier path and reward function serve the repo's stated goal:
@@ -30,26 +34,26 @@ The verifier implementation is directionally strong and mostly correct now. The
 
 The reward is also directionally good. `Reward V0` remains simple, mostly verifier-driven, and aligned with the repo docs.
 
-The main gap is no longer basic verifier wiring. The main gap is that the repo still lacks the validation evidence that its own docs require before calling the environment "good":
-
-- no tracked fixtures
-- no manual playtest log
-- no documented reward pathology/fix or explicit "Reward V0 survived playtest" note
+The main gap at review time was no longer basic verifier wiring. The main gap was that the repo lacked the validation evidence its own docs required before calling the environment "good."
+
+Later on March 8, 2026, the repo closed the fixture/manual-playtest/notebook-wiring gaps. The remaining open validation gap is still before/after trained-policy evidence on the current low-fidelity-only workflow.
 
 ## Validated Findings
 
-### 1. Missing validation artifacts are still the biggest repo-level gap
+### 1. Missing validation artifacts were the biggest repo-level gap at review time
 
-The planning docs explicitly say the environment artifact is the product, not just the code. Current repo state still lacks:
+At review time, the planning docs explicitly said the environment artifact was the product, not just the code. The repo still lacked:
 
 - tracked `P1` fixtures
 - manual playtest evidence
 - a documented `Reward V0 -> V1` change, or an explicit record that `Reward V0` survived playtesting unchanged
 
+Later updates on March 8, 2026 closed the tracked-fixture and manual-playtest parts of that gap. The live open item is still trained-policy evidence on the current low-fidelity-only workflow.
+
 Relevant references:
 
 - [FUSION_DESIGN_LAB_PLAN_V2.md](./FUSION_DESIGN_LAB_PLAN_V2.md)
-- [FUSION_DELIVERABLES_MAP.md](./FUSION_DELIVERABLES_MAP.md)
+- [archive/FUSION_DELIVERABLES_MAP.md](./archive/FUSION_DELIVERABLES_MAP.md)
 - [server/data/p1/README.md](../server/data/p1/README.md)
 
 ### 2. Observation legibility still conflicts with the official tolerance semantics
training/README.md CHANGED
@@ -11,9 +11,11 @@ Training policy:
 ## Status
 
 - [ ] Northflank notebook artifacts saved
-- [ ] Colab notebook saved
+- [x] repository GRPO notebook saved
+- [ ] Colab mirror or public notebook link saved if required by the submission surface
 - [x] tiny low-fi PPO smoke artifact saved
-- [ ] trained-policy evidence saved
+- [ ] fixed-seed untrained baseline artifact saved
+- [ ] before/after trained-policy evidence saved
 
 ## Runnable paths
 
@@ -26,6 +28,15 @@ Training policy:
 - generate fresh model completions per seed and save aggregate reward/outcome metrics:
   `uv run python training/llm_rollout.py evaluate --completion-command 'python path/to/model_cli.py' --seeds 0,1,2`
 
+Use `monitor` when you already have one completion or one action plan and want a fixed replay across seeds.
+Use `evaluate` for before/after policy comparison because it generates a fresh completion per seed.
+
+## Current validation target
+
+- save one untrained fixed-seed baseline with `evaluate`
+- run one short GRPO pass on Northflank H100 with the repository notebook
+- rerun the same seeds and compare reward plus low-fidelity feasibility before versus after
+
 ## Shared LLM Contract
 
 The prompt/action/replay contract for LLM training lives in:
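The `--completion-command` shown in the training/README.md diff above expects an executable that returns one model completion per invocation. The real prompt/action contract lives in `fusion_lab/llm_agent.py`; the sketch below is only a hypothetical shape for such a `model_cli.py`, assuming (not confirmed by these docs) that the prompt arrives on stdin and the completion is written to stdout:

```python
#!/usr/bin/env python3
"""Hypothetical model_cli.py stub for the `--completion-command` flag.

The stdin/stdout shape and the fixed fallback completion are assumptions
for illustration; check fusion_lab/llm_agent.py for the actual contract.
"""
import sys

def complete(prompt: str) -> str:
    # Replace this stub with a real model call; it currently ignores the
    # prompt and always answers with a fixed placeholder completion.
    return "restore_best"

if __name__ == "__main__":
    prompt = sys.stdin.read()
    sys.stdout.write(complete(prompt))
```

A deterministic stub like this is also a cheap way to smoke-test the `evaluate` plumbing before wiring in a real model.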
training/notebooks/README.md CHANGED
@@ -18,8 +18,9 @@ Recommended split:
 - [x] Northflank smoke notebook note saved
 - [x] runnable Northflank smoke script saved
 - [x] Northflank smoke test passed on the team H100
+- [x] repository GRPO notebook saved
 - [ ] manual-playtest notebook or trace notebook saved
-- [ ] public submission notebook link saved
+- [ ] public Colab mirror or submission notebook link saved if still required
 
 Operational defaults:
 
@@ -27,6 +28,8 @@ Operational defaults:
 - keep heavy verifier and training work on Northflank
 - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
 - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
+- keep the repository GRPO notebook aligned to the shared helper contract in `fusion_lab/llm_agent.py`
+- the standard notebook reward/eval path is low-fidelity-only and ignores `submit` by default
 - keep the public submission notebook focused on connecting to the deployed HF Space and exporting visible traces
 - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook
 
@@ -47,4 +50,8 @@ LLM notebook helpers should use the packaged prompt/action contract in:
 
 - `fusion_lab/llm_agent.py`
 
+Current repository notebook:
+
+- `training/notebooks/fusion_design_lab_training.ipynb`
+
 The notebooks are supporting evidence for the environment, not the primary product. The required artifact is the notebook plus trained-policy evidence; a standalone checkpoint file is optional only if the notebook can still demonstrate the trained behavior.