# Fusion Design Lab — Plan V2

**Hackathon:** OpenEnv Hackathon, March 7-8, 2026
**Track:** Statement 3.1 (World Modeling — Professional Tasks)
**Role:** Planning and execution SSOT for this repo
**Updated:** March 8, 2026

## 1. Submission Thesis

Fusion Design Lab is not only a "trained model for fusion" submission.

It is a clear, reproducible environment for one constrained scientific design task:

- official `P1` benchmark semantics
- narrow, human-playable action space
- real verifier feedback from `constellaration`
- explicit constraints and failure semantics
- reward logic that can be explained and iterated

The environment is the product. A trained policy is required as supporting evidence: it demonstrates that the environment is learnable in practice, not merely manually playable.

## 2. Current State

Completed:

- `P1` is locked as the single benchmark task
- the repaired 4-knob low-dimensional runtime is live in code
- the official `constellaration` verifier path is wired
- the live environment is now unified onto one low-fidelity reward and verifier surface
- `submit` remains an explicit terminal action on that same live contract
- explicit VMEC failure semantics are implemented
- the Northflank smoke workflow is committed
- the Northflank smoke test passed on the team H100
- baseline comparison has been rerun on the real verifier path
- a coarse measured sweep note now exists
- the first tracked low-fidelity fixtures now exist
- an initial low-fidelity manual playtest note now exists
- paired high-fidelity fixture checks for those tracked fixtures now exist
- one submit-side manual playtest trace exists
- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines

Still open:

- decision on whether reset-seed pool should change from paired checks
- HF Space deployment evidence
- public Colab mirror or notebook submission link, if the submission surface still requires it
- before/after trained-policy evidence on the current unified low-fidelity workflow
- demo and README polish after the artifacts are real

Current caution:

- do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
- do not narrate low-fidelity rollout metrics as final submission truth
- the standard notebook and `training/llm_rollout.py` paths should stay on the same live low-fidelity contract as the environment, including explicit `submit`
- reserve higher-fidelity validation for paired fixture checks, offline validation scripts, and final evidence

## 3. Locked Decisions

These decisions are fixed unless a hard blocker appears:

- benchmark task: `P1`
- submission framing: `Statement 3.1`
- verifier of record: `constellaration.problems.GeometricalProblem`
- repo strategy: fresh wiring in this repo
- reuse policy: do not port the old `ai-sci-feasible-designs` harness
- scope rule: one stable task only

Execution rule:

- do not reopen strategy unless a real blocker appears
- convert decisions into code, fixtures, traces, baselines, or deployment work

## 4. Non-Negotiables

- Keep scope to one stable task.
- Keep claims conservative and evidence-backed.
- Do not let training-first work outrun environment stability.
- Do not rely on reward curves alone; keep trajectory evidence.
- Do not use reward complexity to hide a blocked action family.
- Do not polish repo or video before the environment and baselines are real.

Practical fail-fast rule:

- allow a tiny low-fidelity PPO smoke run before full submit-side validation
- use it only to surface obvious learnability bugs, reward exploits, or action-space problems
- stop after a few readable trajectories or one clear failure mode
- run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
- do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
- keep any checkpoint high-fidelity evaluation sparse enough that it does not replace the low-fidelity inner loop

## 5. Document Roles

Use the docs like this:

- this file defines planning order, status, gates, and fallback rules
- [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md) defines the live technical contract
- [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md) keeps blocker evidence, sweep evidence, and supporting rationale
- archived legacy planning docs live under [`archive/`](archive/) and are not active SSOT surfaces

## 6. Artifact Plan

Visible artifacts:

- [x] HF Space environment
- [x] Repository training notebook
- [ ] Public Colab mirror or submission notebook link if required
- [ ] 1-minute demo video
- [x] Public repo and README

Compute surfaces:

- Northflank is the main compute workspace for verifier-heavy work
- HF Space is the hosted environment surface
- the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
- trained-policy work should iterate on the same live low-fidelity environment contract that will be demoed publicly

Evidence order:

- [x] measured sweep note
- [x] fixture checks
- [x] manual playtest log
- [x] tiny low-fi PPO smoke trace
- [x] shared-helper notebook alignment
- [x] model-driven low-fi LLM evaluation tooling
- [ ] reward iteration note
- [ ] stable local and remote episodes
- [x] random and heuristic baselines
- [ ] before/after trained-policy evidence
- [ ] demo and repo polish

## 7. Environment Summary

The environment contract must stay narrow and legible:

- one repaired low-dimensional boundary family derived from a rotating-ellipse seed
- discrete `run | submit | restore_best` interaction
- one low-fidelity verifier surface for all live environment actions
- readable observation surface with explicit fidelity labeling
- `Reward V2` keeps the verifier-native `Reward V1` core and adds small best-so-far / anti-stagnation shaping for the low-fi repair loop

The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.
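As a rough illustration only, the narrow contract above might look like the following skeleton. All names here (`FusionDesignEnv`, `_lofi_score`, the shaping constants) are hypothetical stand-ins, not the repo's real API; see the contract doc for the live interface.

```python
import random
from dataclasses import dataclass, field

ACTIONS = ("run", "submit", "restore_best")  # the discrete action family from the contract

@dataclass
class FusionDesignEnv:
    """Illustrative sketch of the narrow P1 loop; not the repo's actual class."""
    seed: int = 0
    knobs: list = field(default_factory=lambda: [0.0, 0.0, 0.0, 0.0])  # 4-knob family
    best_score: float = float("-inf")
    done: bool = False

    def _lofi_score(self) -> float:
        # Stand-in for the single low-fidelity verifier surface.
        return -sum(k * k for k in self.knobs) + random.Random(self.seed).gauss(0, 0.01)

    def step(self, action: str, delta=None):
        assert action in ACTIONS and not self.done
        if action == "run":
            if delta is not None:
                self.knobs = [k + d for k, d in zip(self.knobs, delta)]
            score = self._lofi_score()
            # Reward V2 flavour: verifier-native core plus small best-so-far /
            # anti-stagnation shaping (constants here are illustrative).
            reward = score + (0.1 if score > self.best_score else -0.01)
            self.best_score = max(self.best_score, score)
        elif action == "restore_best":
            reward = 0.0  # bookkeeping action; the real env restores the best-known knobs
        else:  # "submit" is the explicit terminal action
            self.done = True
            reward = self.best_score
        # Readable observation with explicit fidelity labeling.
        obs = {"knobs": self.knobs, "best_score": self.best_score, "fidelity": "low"}
        return obs, reward, self.done
```

The point of the sketch is the shape, not the numbers: one low-fidelity scoring surface for all live actions, `submit` as the only terminal, and shaping kept small relative to the verifier-native core.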

## 8. Execution Order

- [x] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
- [x] Pair the tracked low-fidelity fixtures with higher-fidelity validation checks immediately after the PPO smoke pass.
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
- [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
- [ ] Save one fixed-seed untrained baseline with the unified live `training/llm_rollout.py evaluate` workflow.
- [ ] Run one short H100 GRPO pass with the repository notebook on that same unified low-fidelity workflow.
- [ ] Re-run the same seeds after training and save one before/after artifact.
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
- [x] Refresh the heuristic baseline using the repaired-family evidence.
- [ ] Prove a stable local episode path.
- [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
- [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
- [ ] Polish the public repo only after the artifacts above exist.
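The before/after step above amounts to reusing one fixed seed pool across an untrained and a trained policy. A minimal sketch, where `evaluate_policy`, the rollout callables, and the artifact path are all placeholders rather than the repo's real `training/llm_rollout.py` interface:

```python
import json

SEEDS = [0, 1, 2, 3, 4]  # fixed seed pool, reused verbatim before and after training

def evaluate_policy(policy_fn, seeds):
    """Run one low-fidelity episode per seed; policy_fn(seed) -> final reward."""
    return {seed: policy_fn(seed) for seed in seeds}

def before_after_artifact(untrained_fn, trained_fn, path="before_after.json"):
    before = evaluate_policy(untrained_fn, SEEDS)
    after = evaluate_policy(trained_fn, SEEDS)
    artifact = {
        "seeds": SEEDS,
        "before": before,
        "after": after,
        "mean_delta": sum(after[s] - before[s] for s in SEEDS) / len(SEEDS),
    }
    with open(path, "w") as f:  # one tracked artifact, same seeds on both sides
        json.dump(artifact, f, indent=2)
    return artifact
```

Keeping the comparison seed-matched is what makes the delta attributable to training rather than to environment stochasticity.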

## 9. Success Gates

Gate 1: measured sweep exists

- repaired-family ranges, deltas, and reset seeds are justified by recorded evidence

Gate 2: tiny PPO smoke is sane

- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
- trajectories are readable enough to debug
- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
- current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in [`P1_PPO_SMOKE_NOTE.md`](P1_PPO_SMOKE_NOTE.md)

Gate 3: fixture checks pass

- good, boundary, and bad references behave as expected
- the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work
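Gate 3's expectation can be stated as a simple ordering check. The fixture knob values and `score_fixture` below are hypothetical stand-ins for the tracked fixtures and the repo's real scoring path:

```python
# Hypothetical fixture check: good / boundary / bad references should order as expected.
FIXTURES = {
    "good": [0.1, 0.0, 0.0, 0.0],
    "boundary": [0.9, 0.9, 0.0, 0.0],
    "bad": [2.0, 2.0, 2.0, 2.0],
}

def score_fixture(knobs):
    # Stand-in for the real low-fidelity verifier call.
    return -sum(k * k for k in knobs)

def check_fixture_ordering():
    scores = {name: score_fixture(knobs) for name, knobs in FIXTURES.items()}
    # The gate asserts relative behaviour only, not absolute values.
    assert scores["good"] > scores["boundary"] > scores["bad"], scores
    return scores
```

The paired high-fidelity pass would rerun the same three references against the higher-fidelity surface and assert the same ordering.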

Gate 4: manual playtest passes

- a human can read the observation
- a human can choose a plausible next action
- a human can explain the reward change

Gate 5: local episode is stable

- one clean trajectory is reproducible enough for demo use

Gate 6: baseline story is credible

- heuristic behavior is at least interpretable and preferable to random on the repaired task

Gate 7: remote surface is real

- HF Space preserves the same task contract as local

Gate 8: submission artifacts exist

- the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one

Gate 9: trained-policy evidence is real

- one fixed-seed untrained baseline exists
- one short low-fidelity training pass exists on the same workflow
- the repo can show a before/after comparison on the same seeds using the live environment contract, including `submit`

## 10. Fallback Rules

If training evidence is weak:

- keep claims conservative about policy quality
- still ship a trained-policy demonstration and document its limitations plainly
- do not skip the paired higher-fidelity validation artifacts
- do not split the notebook back onto a different submit contract than the live environment

If HF Space deployment is delayed:

- keep local and Northflank evidence first
- document the deployment blocker plainly
- do not invent remote claims without a real run

If reward behavior is confusing:

- fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity

If the repaired family is too hard:

- adjust ranges, deltas, or seeds from measured evidence
- do not expand into a broad Fourier action space just to rescue the hackathon scope

If the repaired family is too easy:

- prefer fixture and seed adjustments before broadening the action schema

## 11. Immediate Next Actions

- [x] Record the measured sweep and choose provisional defaults from evidence.
- [x] Check in tracked fixtures.
- [x] Record the first manual playtest log.
- [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
- [x] Pair the tracked fixtures with higher-fidelity validation checks.
- [x] Record one submit-side manual trace.
- [x] Refresh the heuristic baseline from that playtest evidence.
- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
- [ ] Re-run the same seeds and save a before/after artifact.
- [ ] Verify one clean HF Space episode with the same contract.