CreativeEngineer committed on
Commit e815b38 · 1 Parent(s): 6deaccc

docs: simplify and archive planning surfaces
AGENTS.md CHANGED
@@ -18,15 +18,15 @@ Training is supporting evidence. Do not let the repo drift into training-first w

 ## Source of Truth

-Use these docs as the planning SSOT:
+Use these docs as the repo documentation SSOT:

-- `docs/FUSION_DESIGN_LAB_PLAN_V2.md`
-- `docs/FUSION_DELIVERABLES_MAP.md`
-- `docs/FUSION_NEXT_12_HOURS_CHECKLIST.md`
+- `docs/FUSION_DESIGN_LAB_PLAN_V2.md` for planning and execution order
+- `docs/P1_ENV_CONTRACT_V1.md` for the live technical contract
+- `docs/P1_PARAMETERIZATION_DEEPDIVE.md` for blocker evidence and supporting rationale

-`docs/PIVOT_P1_ROTATING_ELLIPSE.md` is a supporting decision record, not a planning SSOT. If it disagrees with the three docs above, the three SSOT docs win.
+Legacy planning docs are archived under `docs/archive/`. They are not active SSOT surfaces.

-`docs/P1_ENV_CONTRACT_V1.md` is a supporting technical spec for the current implementation phase. It should refine the SSOT docs, not silently diverge from them.
+`docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md` is a short supporting decision record, not a planning SSOT.

 If code and docs disagree, either:
README.md CHANGED
@@ -16,7 +16,11 @@ A trained model is optional for this repo's submission story. A public Colab not

 ## Current Status

-This repository is the clean hackathon workspace. The detailed planning docs live in `docs/FUSION_DESIGN_LAB_PLAN_V2.md`, `docs/FUSION_DELIVERABLES_MAP.md`, and `docs/FUSION_NEXT_12_HOURS_CHECKLIST.md`.
+This repository is the clean hackathon workspace. The live docs now split cleanly by role:
+
+- planning and execution: `docs/FUSION_DESIGN_LAB_PLAN_V2.md`
+- technical contract: `docs/P1_ENV_CONTRACT_V1.md`
+- blocker and sweep evidence: `docs/P1_PARAMETERIZATION_DEEPDIVE.md`

 Implementation status:

@@ -25,7 +29,8 @@ Implementation status:
 - shared models, baselines, and server/client entry points now reflect the locked `P1` contract
 - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
 - the repaired 4-knob low-dimensional family is now wired into the runtime path
-- the next runtime work is measured sweep validation, fixtures, manual playtesting, heuristic refresh, and deployment evidence
+- the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
+- the next runtime work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, and deployment evidence

 ## Execution Status

@@ -46,7 +51,7 @@ Implementation status:
 - [x] Add explicit VMEC failure semantics to the environment contract
 - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
 - [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
-- [ ] Add tracked `P1` fixtures under `server/data/p1/`
+- [x] Add tracked `P1` fixtures under `server/data/p1/`
 - [ ] Run manual playtesting and record the first reward pathology
 - [ ] Refresh the heuristic baseline for the real verifier path
 - [ ] Deploy the real environment to HF Space
@@ -54,19 +59,20 @@ Implementation status:
 ## Known Gaps

 - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
-- The repaired family now uses frozen exact seeds with explicit triangularity control. Those seeds are near-boundary references, not yet tracked fixtures.
-- The repaired low-dimensional family still needs measured ranges and deltas. Do not narrate guessed `rotational_transform` bounds, `triangularity_scale` deltas, or a larger budget as validated facts until they are measured on the repaired environment.
+- The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
+- The tracked fixtures in `server/data/p1/` are currently low-fidelity-calibrated. Do not narrate them as fully paired low-fi/high-fi references until the submit-side spot checks land.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
 - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
+- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next meaningful playtest step is a real `submit` trace, not more abstract reward debate.

 Current mode:

 - strategic task choice is already locked
-- the next work is measured sweep validation, then fixtures, manual playtesting, heuristic refresh, smoke validation, and deployment
+- the next work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, smoke validation, and deployment
 - new planning text should only appear when a real blocker forces a decision change

 ## Planned Repository Layout
@@ -120,14 +126,13 @@ uv sync --extra notebooks

 ## Immediate Next Steps

-1. Run a small measured sweep on the repaired family to choose useful ranges, deltas, and reset seeds.
-2. Verify that observation semantics are human-readable and that low-fi `run` versus high-fi `submit` best-state reporting is not ambiguous.
-3. Add tracked `P1` fixtures under `server/data/p1`.
-4. Run manual playtest episodes and record the first real reward pathology, if any.
-5. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
-6. Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
-7. Deploy the environment to HF Space.
-8. Add the Colab notebook under `training/notebooks`.
+- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
+- [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
+- [ ] Run submit-side manual playtest episodes and record the first real reward pathology, if any.
+- [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
+- [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
+- [ ] Deploy the environment to HF Space.
+- [ ] Add the Colab notebook under `training/notebooks`.

 These are implementation steps, not another planning phase.
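The low-fi `run` / high-fi `submit` split, the budget asymmetry, and the post-terminal guard described in the README diff above can be sketched as a toy loop. Everything here (the `ToyP1Env` class, observation keys, and reward numbers) is illustrative stand-in code, not this repo's actual environment API:

```python
class ToyP1Env:
    """Stand-in environment with fabricated numbers, only to show the loop shape:
    low-fidelity metrics per step, high-fidelity truth only at explicit submit."""

    def __init__(self, budget=3):
        self._initial_budget = budget
        self.budget = budget
        self.done = False

    def reset(self):
        self.budget, self.done = self._initial_budget, False
        return {"budget_remaining": self.budget}

    def step(self, knobs):
        # post-terminal guard: stepping a finished episode is an error
        assert not self.done, "episode already terminal"
        self.budget -= 1  # every evaluation spends budget, even a failed one
        if self.budget == 0:
            self.done = True
            # budget exhaustion deliberately pays less than an explicit submit
            return {"budget_remaining": 0, "terminal_reward": 0.1}
        lowfi_score = -abs(sum(knobs))  # toy low-fidelity metric, not final truth
        return {"budget_remaining": self.budget, "lowfi_score": lowfi_score}

    def submit(self):
        assert not self.done, "episode already terminal"
        self.done = True
        # explicit submit is where the high-fidelity re-evaluation would run
        return {"terminal_reward": 1.0, "high_fidelity_basis": True}
```

The point of the asymmetry is visible in the numbers: running out of budget yields a smaller terminal reward than submitting on purpose, so an agent should still prefer a deliberate `submit`.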
 
TODO.md CHANGED
@@ -2,19 +2,24 @@

 This is the execution tracker for the hackathon repo.

-Use this file for day-of build progress. Use the linked docs for rationale, sequencing, and submission framing:
+Use this file for day-of build progress. Use the linked docs for rationale, contract truth, and submission framing:

 - [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
-- [Deliverables Map](docs/FUSION_DELIVERABLES_MAP.md)
-- [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
 - [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
-- [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+- [P1 Parameterization Deep-Dive](docs/P1_PARAMETERIZATION_DEEPDIVE.md)
 - [Repo Guardrails](AGENTS.md)

+Archived legacy references:
+
+- [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
+- [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
+- [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+
 Priority source:

 - [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md) is the planning SSOT
-- [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md) is the execution order SSOT
+- [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md) is the technical contract SSOT
+- [P1 Parameterization Deep-Dive](docs/P1_PARAMETERIZATION_DEEPDIVE.md) is the evidence and rationale record
 - this file should track execution progress only

 ## Current State
@@ -34,8 +39,8 @@ Priority source:
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi vs high-fi truth in the observation/task surface
 - [x] separate high-fi submit scoring/reporting from low-fi rollout score state
-- [ ] tracked `P1` fixtures
-- [ ] manual playtest log
+- [x] tracked `P1` fixtures
+- [x] manual playtest log
 - [x] settle the non-submit terminal reward policy
 - [x] baseline comparison has been re-run on the `constellaration` branch state
 - [ ] refresh the heuristic baseline for the real verifier path
@@ -64,12 +69,12 @@ flowchart TD
   freeze observation schema, action schema, episode loop, terminal conditions, and `Reward V0`
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

 - [x] Pass the Northflank smoke test
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md),
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md),
   [training/notebooks/README.md](training/notebooks/README.md)

 - [x] Verify that the current 3-knob family can or cannot approach P1 feasibility
@@ -77,7 +82,7 @@ flowchart TD
   resolve the historical gating question about whether parameterization repair was required before more reward work
   Related:
   [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md),
-  [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)

 ## Fresh Wiring

@@ -90,7 +95,7 @@ flowchart TD
   Files:
   [server/environment.py](server/environment.py),
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)

 - [x] Add a post-terminal guard to the environment loop
   Files:
@@ -158,7 +163,7 @@ flowchart TD

 ## Validation and Reward

-- [ ] Run a small measured sweep on the repaired low-dimensional family
+- [x] Run a small measured sweep on the repaired low-dimensional family
   Goal:
   choose useful parameter ranges, step deltas, and reset seeds from the repaired action family instead of guessing them from prose
   Related:
@@ -170,26 +175,26 @@ flowchart TD
   Related:
   [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)

-- [ ] Add 1-2 tracked `P1` fixtures
+- [x] Add 1-2 tracked `P1` fixtures
   Files:
   [server/data/p1/README.md](server/data/p1/README.md),
-  [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
   Note:
-  add fixtures only after the repaired family is calibrated into a meaningful near-boundary region
+  first tracked fixtures are low-fidelity-calibrated; add paired high-fidelity submit checks next

 - [ ] Run fixture sanity checks
   Goal:
-  confirm verifier outputs, objective direction, and reward ordering
+  confirm paired low-fi/high-fi verifier outputs, objective direction, and reward ordering
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

 - [ ] Manual-playtest 5-10 episodes
   Goal:
-  verify a human can act coherently and surface at least one pathology or ambiguity
+  expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Deliverables Map](docs/FUSION_DELIVERABLES_MAP.md)
+  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)

 - [ ] Update reward from `V0` to `V1` if playtesting reveals a real pathology
   Goal:
@@ -240,7 +245,7 @@ flowchart TD

 - [ ] Deploy the environment to HF Space
   Related:
-  [Deliverables Map](docs/FUSION_DELIVERABLES_MAP.md),
+  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md),
   [README.md](README.md)

 - [ ] Create the thin public Colab notebook
@@ -258,7 +263,7 @@ flowchart TD
 - [ ] Only add training evidence if it is actually persuasive
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

 ## Guardrails
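The "fixture sanity checks" item in the TODO diff above (confirm verifier outputs, objective direction, and reward ordering) can be read as a concrete ordering test. The fixture dicts and their keys below are hypothetical placeholders, not the actual `server/data/p1/` schema:

```python
def check_fixture_ordering(fixtures):
    """Sanity-check a list of fixture metric dicts sorted from best to worst
    expected quality. Keys `max_elongation` and `reward` are illustrative."""
    # objective direction: P1 minimizes max_elongation, so a better fixture
    # must not have a larger max_elongation than a worse one
    elongations = [f["max_elongation"] for f in fixtures]
    assert elongations == sorted(elongations), "objective direction violated"
    # reward ordering must agree with fixture quality (best fixture, top reward)
    rewards = [f["reward"] for f in fixtures]
    assert rewards == sorted(rewards, reverse=True), "reward ordering violated"
    return True
```

Run against good / boundary / bad fixtures, a check like this catches sign flips in the objective or reward before any agent-facing evidence is produced.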
 
docs/FUSION_DELIVERABLES_MAP.md DELETED
@@ -1,121 +0,0 @@
-# Fusion Design Lab Deliverables Map
-
-This is the output-first map for the hackathon. It is aligned to Plan V2: `P1` is locked, the environment is built fresh in this repo, the old harness is not ported, and training claims stay conservative. Everything branches from the four final artifacts the judges and submission flow will actually see.
-
-Northflank is the recommended compute workspace behind those artifacts. HF Space and Colab remain the actual submission surfaces.
-
-Use this map to sequence execution, not to reopen already-locked task choices.
-
-## Current Branch Status
-
-- [x] `P1` contract is frozen in code
-- [x] official `constellaration` verifier loop is wired
-- [x] baseline comparison has been rerun on the real verifier path
-- [x] Northflank smoke workflow and note are committed
-- [x] Northflank smoke test has passed on the team H100
-- [x] historical upstream 3-knob family has been verified as blocked on P1 triangularity
-- [x] repaired low-dimensional boundary builder is implemented
-- [x] explicit VMEC failure semantics are implemented
-- [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly
-- [x] terminal submit scoring/reporting is fidelity-consistent
-- [ ] tracked fixtures are checked in
-- [ ] manual playtest evidence exists
-- [ ] heuristic baseline has been refreshed for the real verifier path
-- [ ] HF Space deployment is live
-
-## Deliverables Tree
-
-```mermaid
-flowchart TD
-  A["Fusion Design Lab Submission"] --> B["HF Space Environment"]
-  A --> C["Colab Eval / Training Notebook"]
-  A --> D["1-Minute Demo"]
-  A --> E["Public Repo + README"]
-  A --> N["Northflank H100 Workspace"]
-
-  B --> B0["P1 environment contract frozen"]
-  B --> B1["Remote reset/step works"]
-  B --> B2["Reward V0 -> V1 documented"]
-  B --> B3["One stable task runs end-to-end"]
-  B --> B4["Clear rules + reproducible episodes"]
-
-  C --> C1["Connects to HF Space"]
-  C --> C2["Runs multi-turn episodes"]
-  C --> C3["Logs behavior + reward traces"]
-
-  D --> D1["Clear problem statement"]
-  D --> D2["Manual playtest + agent trajectory"]
-  D --> D3["Reward shaping story"]
-
-  E --> E1["Readable project summary"]
-  E --> E2["Setup + run instructions"]
-  E --> E3["Submission links and artifacts"]
-
-  N --> N1["Jupyter Notebook with PyTorch live"]
-  N --> N2["Persistent storage attached"]
-  N --> N3["Verifier + baseline runs happen here"]
-  N --> N4["Northflank smoke test passes"]
-
-  B0 --> F["Observation + action schema frozen"]
-  B3 --> G["Fresh P1 verifier loop proven"]
-  G --> G1["Parameterization can actually reach P1 feasibility"]
-  G --> G2["VMEC failures are explicit and penalized"]
-  B2 --> H["Exploit observed -> penalty added"]
-  B4 --> I0["Deterministic action schema"]
-  D2 --> I["Human can act coherently in env"]
-  C3 --> J["Random baseline"]
-  C3 --> K["Heuristic baseline"]
-  G --> L["Official constellaration P1 verifier wired correctly"]
-  L --> M["Good / boundary / bad fixture checks pass"]
-  N4 --> N3
-  N3 --> G
-```
-
-## Reverse Timeline
-
-```mermaid
-flowchart LR
-  S["Submit by Sun 1:00 PM"] --> V["Video finalized"]
-  S --> R["Repo public and readable"]
-  S --> T["Training / eval evidence exported"]
-  S --> H["HF Space live"]
-  S --> N1["Northflank compute ready"]
-
-  V --> V1["Recorded clean demo trajectory"]
-  V --> V2["Scripted 60-second story"]
-
-  T --> T1["Behavior trace image"]
-  T --> T2["Baseline comparison numbers"]
-  T --> T3["Colab notebook runs end-to-end"]
-
-  H --> H1["OpenEnv P1 environment packaged"]
-  H --> H2["Remote client can reset and step"]
-  H --> H3["Verifier and reward stable"]
-  H --> H4["Rules are clear and reproducible"]
-
-  H4 --> P["Environment contract locked first"]
-  N1 --> N2["Jupyter with PyTorch up first"]
-  N2 --> N3["Persistent storage attached"]
-  N3 --> N4["Import + low-fi verifier smoke passes"]
-  N4 --> M0
-  P --> Q["Manual playtest completed first"]
-  H3 --> M0["Local verifier loop proven first"]
-  T2 --> B["Random + heuristic baselines done"]
-  T3 --> X["Training included only if persuasive"]
-  V1 --> Y["One stable task only"]
-  V2 --> Z["Explain reward fix, not just reward gain"]
-  M0 --> N["Fresh wiring, not legacy harness port"]
-```
-
-## Priority Order
-
-Northflank compute bring-up and smoke validation are complete.
-
-1. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
-2. Add tracked fixtures and run fixture sanity checks.
-3. Manual-playtest the environment and record the first real pathology, if any.
-4. Refresh the heuristic baseline from that evidence.
-5. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
-6. Use the notebook to show traces and comparisons; include training only if it adds signal.
-7. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
-8. Polish the repo only after the artifacts are real.
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -2,778 +2,202 @@
2
 
3
  **Hackathon:** OpenEnv Hackathon, March 7-8, 2026
4
  **Track:** Statement 3.1 (World Modeling — Professional Tasks)
5
- **Status:** Judge-aligned plan with `P1` locked
6
-
7
- ## 0. Current Branch Status
8
-
9
- - [x] `P1` task family is locked
10
- - [x] repaired 4-knob low-dimensional `P1` contract is implemented in code
11
- - [x] real `constellaration` verifier wiring is in place
12
- - [x] low-fidelity `run` plus high-fidelity `submit` split is documented
13
- - [x] post-terminal `step()` guard is in place
14
- - [x] baseline comparison has been rerun on the real verifier path
15
- - [x] Northflank smoke workflow and note are committed
16
- - [x] Northflank smoke test has passed on the team H100
17
- - [x] historical upstream 3-knob family has been checked against the real low-fidelity verifier
18
- - [x] parameterization repair is implemented so triangularity is controllable
19
- - [x] explicit VMEC failure semantics are implemented
20
- - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly in the environment surface
21
- - [x] terminal scoring/reporting is fidelity-consistent between low-fi rollout state and high-fi submit truth
22
- - [ ] tracked `P1` fixtures are added
23
- - [ ] manual playtest evidence is recorded
24
- - [ ] heuristic baseline is refreshed for the real verifier path
25
- - [ ] HF Space deployment evidence is recorded
26
-
27
- Current caution:
28
-
29
- - the repaired family is now live, but the exact ranges, deltas, and reset seeds still need a measured sweep before they should be treated as stable defaults
30
- - terminal scoring/reporting now uses a fidelity-consistent basis at episode end: high-fi `submit` comparisons are no longer anchored to low-fi rollout score state
31
 
32
  ## 1. Submission Thesis
33
 
34
- We are not primarily submitting "a trained model for fusion."
35
 
36
- We are submitting a clear, reproducible training environment for a constrained scientific design task:
37
 
38
  - official `P1` benchmark semantics
39
- - a narrow, human-playable action space
40
  - real verifier feedback from `constellaration`
41
- - explicit constraints
42
- - a reward function that is understandable and iteratively improved
43
 
44
  Training is supporting evidence. The environment is the product.
45
 
46
- A trained model is optional. The Colab notebook is still a required public artifact, and it can remain evaluation-first if training evidence is weak.
47
-
48
- ## 2. Locked Decisions
49
-
50
- These decisions are now fixed unless a hard blocker appears:
51
-
52
- - benchmark task: `P1`
53
- - submission framing: `Statement 3.1`
54
- - verifier of record: `constellaration.problems.GeometricalProblem`
55
- - implementation strategy: fresh wiring in this repo
56
- - reuse policy: do not port the old `ai-sci-feasible-designs` harness; only reuse selected JSON artifacts or boundaries when useful
57
-
58
- Execution rule after lock:
59
-
60
- - do not reopen these decisions in new planning passes unless a real blocker appears
61
- - once a decision is locked, translate it into code, fixtures, baselines, or deployment work
62
-
63
- ## 3. What Changed From V1
64
-
65
- This version changes the center of gravity:
66
-
67
- - `environment quality > training effort`
68
- - `reward shaping story > polished final reward formula`
69
- - `manual playtesting > training-first iteration`
70
- - `clarity and reproducibility > broad unsupported transfer claims`
71
- - `fresh, minimal environment wiring > transplanting legacy orchestration`
72
-
73
- This version also separates:
74
-
75
- - what is already decided
76
- - what is a working hypothesis
77
- - what must be validated before it becomes part of the final pitch
78
-
79
- ## 4. Judge-Aligned Priorities
80
-
81
- The judging signal now implies four priorities:
82
-
83
- 1. The environment itself must be strong.
84
- 2. The reward function must be explainable and visibly iterated.
85
- 3. A human should be able to act in the environment coherently before we invest heavily in training.
86
- 4. The final story should emphasize a clear, reproducible environment, not just a reward curve.
87
-
88
- ## 5. Final Artifacts
-
- The four visible artifacts remain:
-
- 1. HF Space environment
- 2. Required Colab notebook for evaluation or training
- 3. 1-minute demo video
- 4. Public repo and README
-
- The primary compute workspace should be Northflank:
-
- - Northflank Jupyter Notebook with PyTorch on the team H100 for development, verifier integration, baselines, and training/debugging
- - HF Space as the hosted environment surface
- - Colab as the minimal required public notebook artifact, even if it ships as an evaluation-first notebook instead of a training-first notebook
-
- But the evidence order is:
-
- 1. environment contract
- 2. manual playtest log
- 3. reward iteration note
- 4. stable local and remote episodes
- 5. random and heuristic baselines
- 6. training or eval notebook evidence
- 7. demo and repo polish
-
- ## 6. Non-Negotiables
-
- - One stable task only.
- - No broad cross-science claims unless evidence exists.
- - No training-first drift.
- - No dependence on reward curves alone.
- - No repo/video polish before environment and baselines are real.
- - No harness transplant from `ai-sci-feasible-designs`.
- - No new strategy churn after `P1` + rotating-ellipse is locked unless a blocker forces it.
-
- ## 7. Single Stable Task
-
- We intentionally narrow the scope to one environment family:
-
- - `P1` geometrical benchmark
- - repaired low-dimensional boundary family derived from rotating-ellipse seeds
- - official `constellaration` verifier
- - low-fidelity evaluation for ordinary interaction
- - optional high-fidelity verification for final checks or `submit`
-
- The task is:
-
- > improve a stellarator boundary on the `P1` benchmark under explicit constraints and limited evaluation budget
-
- ### Constraints
-
- Use the official `P1` constraints:
-
- - aspect ratio `<= 4.0`
- - average triangularity `<= -0.5`
- - edge rotational transform over field periods `>= 0.3`
-
- ### Objective
-
- Use the official `P1` objective:
-
- - minimize `max_elongation`
-
- ### Why This Task
-
- - it is official rather than invented
- - it is cheaper than `P2` and `P3` because `P1` skips QI
- - it maps cleanly to a tool-using scientific workflow
- - it is easier to explain than a broader fusion-design claim
-
- ## 8. Fresh Wiring Rule
-
- This repo should implement a minimal environment directly for the hackathon.
-
- That means:
-
- - define our own environment contract
- - define our own reward logic on top of the official verifier
- - define our own baselines
- - define our own HF Space interface
-
- That does not mean:
-
- - importing the old governor
- - importing the old planner
- - importing the old experiment harness
- - recreating the old agent-as-coder stack
-
- Allowed reuse:
-
- - official `constellaration` library behavior
- - selected JSON artifacts or seed boundaries
- - problem notes as human reference
-
- Implementation handoff:
-
- - the remaining work is now fixture coverage, manual playtesting, heuristic refresh, smoke validation, and deployment
- - do not treat supporting decision notes as a new planning backlog
-
- ## 8.1 Compute Surfaces
-
- Use each surface for one clear purpose:
-
- - Northflank Jupyter Notebook with PyTorch:
-   - main development and compute workspace
-   - verifier sanity checks
-   - manual playtesting
-   - baseline runs
-   - optional RL fine-tuning
- - HF Space:
-   - public OpenEnv environment surface
-   - remote `reset` and `step` endpoint for the final demo path
- - Colab:
-   - minimal reproducible evaluation or training notebook required by the hackathon
-   - the notebook itself is mandatory; a trained model inside it is not
-
- Northflank-specific constraint:
-
- - containers are ephemeral, so persistent storage must be attached before relying on saved models, caches, or fixture downloads
-
- Deployment path:
-
- - develop and verify in Northflank or local
- - commit and push changes to the public GitHub repo
- - have HF Space build and serve from that repo path
- - do not rely on manual copy-paste deployment as the default path
-
- Auth stance:
-
- - prefer a public HF Space for the hackathon to keep the Colab artifact simple
- - if the Space must be private, the notebook must explicitly document token-based access
-
- ## 9. Environment Contract
-
- The environment contract must be frozen before meaningful evaluation.
-
- Historical blocker that drove the repair:
-
- - the upstream 3-knob `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` family does not expose triangularity control
- - on the real low-fidelity verifier path, sampled points stayed at roughly `average_triangularity=+0.004975` and `p1_feasibility=1.00995`
- - that blocker is why the repo now uses a repaired 4-knob low-dimensional family with explicit `triangularity_scale`
-
- ### Observation
-
- The observation should expose:
-
- - current `max_elongation`
- - current aspect ratio
- - current average triangularity
- - current edge rotational transform over field periods
- - current `p1_score`
- - current `p1_feasibility`
- - current `constraints_satisfied`
- - current `vacuum_well`
- - `evaluation_fidelity`
- - `evaluation_failed`
- - `failure_reason`
- - `step_number`
- - `budget_remaining`
- - `best_low_fidelity_score`
- - `best_low_fidelity_feasibility`
- - `best_high_fidelity_score`
- - `best_high_fidelity_feasibility`
- - `target_spec`
- - concise textual summary of the last action outcome in `diagnostics_text`
-
- The observation must be interpretable by a human without additional hidden state.
-
- Current runtime note:
-
- - the live observation surface now exposes explicit low-fidelity and high-fidelity best-state fields
- - low-fi run steps and high-fi submit steps no longer overload one generic `best_score` field
- - traces and baselines should use the explicit fields instead of reconstructing a mixed best-state story
-
- ### Action Space
-
- The live action space stays intentionally small and discrete while exposing the repaired 4-knob low-dimensional family.
-
- Current contract:
-
- - `run`
- - `submit`
- - `restore_best`
-
- For `run`, the controllable fields are:
-
- - parameter: one of
-   - `aspect_ratio`
-   - `elongation`
-   - `rotational_transform`
-   - `triangularity_scale`
- - direction: increase or decrease
- - magnitude: small, medium, large
-
- This is not trying to expose the full Fourier-boundary space. The goal is a legible environment, not maximal realism. The verifier stays official; the custom logic belongs in the low-dimensional boundary builder, not in reward semantics.
-
- ### Episode Flow
-
- 1. Reset from one rotating-ellipse initial state or a small frozen set of initial states.
- 2. Agent chooses one action.
- 3. Low-fidelity verifier runs for normal interaction.
- 4. Environment returns diagnostics and reward.
- 5. Episode ends on:
-    - `submit`
-    - exhausted budget
-
- Failure semantics must also be explicit:
-
- - if VMEC or the forward model fails, the run still consumes budget
- - the observation must expose that the step failed
- - the reward must apply a documented penalty
- - the environment must not silently replace the failed result with a fake success path
-
- ### Terminal Contract
-
- The episode should end cleanly and deterministically.
-
- At termination, the environment should provide:
-
- - final best design metrics
- - whether constraints were satisfied
- - total reward
- - short human-readable summary of the trajectory
-
- ## 10. Verifier Contract
-
- The verifier of record is `constellaration.problems.GeometricalProblem`.
-
- The environment must preserve:
-
- - objective direction
- - constraint direction
- - feasibility semantics
- - score ordering
-
- The environment may add reward shaping, but it must not redefine what `P1` means.
-
- Implementation split:
-
- - boundary builder or parameterization adapter:
-   - custom low-dimensional family construction
-   - rotating-ellipse seed creation
-   - triangularity control injection, if used
- - official verifier:
-   - boundary in
-   - `GeometricalProblem` semantics out
-
- The verifier should be boundary-based. Parameterization-specific logic should not be treated as verifier truth.
-
- Current execution rule:
-
- - do not narrate guessed repaired-family ranges, deltas, or a larger budget as settled defaults until they are measured on the repaired family
-
- ## 11. Reward V0
-
- The reward in this document is not the final reward. It is `Reward V0`.
-
- The initial scoring idea should be feasibility-first:
-
- - reducing normalized constraint violation should help
- - becoming feasible should give a meaningful bonus
- - once feasible, lower `max_elongation` should help
- - wasting budget should have some cost
- - successful submission may deserve a small bonus
-
- ### Reward V0 Design Goals
-
- - easy to explain
- - sensitive to genuine progress
- - hostile to obvious degenerate behavior
- - simple enough to debug from trajectories
- - aligned with official `P1` semantics
-
- Current execution note:
-
- - do not tune reward further until the repaired low-dimensional family can actually approach P1 feasibility
- - once parameterization is repaired, keep `Reward V0` scalar and feasibility-first
- - clearly distinguish low-fidelity step-time metrics from high-fidelity submit-time truth in the observation contract and docs
- - do not use reward complexity to compensate for missing action expressivity or missing VMEC failure semantics
- - keep terminal reward and reporting fidelity-consistent; do not compare high-fi submit scores against low-fi best/initial score state
-
- ### Reward V0 Failure Modes To Test
-
- We should expect at least some of these:
-
- - the agent oscillates between equivalent moves
- - the agent submits too early
- - the agent never submits
- - the agent learns to improve objective before it learns feasibility
- - the agent camps near one constraint while breaking another
- - the agent overuses `restore_best`
-
- The reward is only acceptable after we test for those behaviors.
-
- Important execution rule:
-
- - if manual playtesting does not reveal a real pathology, keep `Reward V0` and document that outcome rather than forcing a `Reward V1`
-
- ## 12. Verifier and Reward Fixture Checks
-
- Before training, we should validate environment wiring with a few fixed fixtures.
-
- Use:
-
- - one known-good design or near-winning design
- - a few near-boundary designs
- - a few clearly infeasible designs
-
- Do not assume the default baseline params are enough for this set. They are currently useful as an infeasible reference, not as a near-feasible anchor.
-
- Purpose:
-
- - verify the verifier is wired correctly
- - verify the reward ordering makes sense
- - verify feasible designs outrank clearly infeasible ones
-
- This is calibration, not training.
-
- ## 13. What Is Hypothesis vs Validated
-
- These are still hypotheses until manually or empirically checked:
-
- - six steps are enough to create non-trivial decision pressure
- - the repaired low-dimensional action family is expressive enough for a meaningful `P1` task
- - `restore_best` is useful without becoming an exploit
- - heuristic should beat random on mean episode reward
- - low-fidelity interaction is predictive enough for useful policy learning
- - useful repaired-family parameter ranges and deltas
- - whether the current budget should stay at `6` or change after playtesting
-
- These should not be narrated as facts in the final demo until validated.
-
- ## 14. Manual Playtest Plan
-
- Before heavy training, we should act as the agent ourselves.
-
- ### Protocol
-
- Run 5 to 10 episodes manually and log for each step:
-
- - observation seen
- - action chosen
- - reason for the action
- - verifier outcome
- - reward returned
- - whether the reward matched intuitive quality
-
- ### Questions The Playtest Must Answer
-
- - can a human understand what to do from the observation?
- - do action labels map to meaningful decisions?
- - is the step budget interesting or arbitrary?
- - which actions are high leverage?
- - do obvious bad actions get punished?
- - do obviously good actions get rewarded?
- - does `restore_best` help recovery or encourage stalling?
-
- ### Expected Output
-
- - short manual playtest log
- - one paragraph on what a good episode looks like
- - one paragraph on what broke or felt ambiguous
-
- ## 15. Reward Iteration Story
-
- The reward iteration story is not a side note. It is likely part of the pitch.
-
- We should aim to document at least one concrete sequence:
-
- 1. initial reward version
- 2. observed bad behavior
- 3. reward or penalty change
- 4. changed behavior afterward
-
- Examples of acceptable story structure:
-
- - "The agent improved elongation while staying deeply infeasible, so we increased feasibility-first shaping."
- - "The agent hovered near one constraint and ignored another, so we changed the violation shaping."
- - "The agent overused restore-best, so we changed the reward or step logic to make stalling unprofitable."
-
- This is stronger than saying only "reward improved after training."
-
- ## 16. Evidence Plan
-
- ### HF Space
-
- Must prove:
-
- - remote `reset` works
- - remote `step` works
- - one stable episode runs end-to-end
- - the remote behavior matches the local contract
-
- HF Space is the serving surface, not the main heavy-compute workspace.
-
- ### Northflank Notebook
-
- Must prove:
-
- - Jupyter Notebook with PyTorch is live on the team H100
- - persistent storage is attached
- - verifier and baseline work runs there without local-machine dependency
- - environment/debug/training work can proceed there even if local runtime is inconvenient
- - one smoke check passes:
-   - import `constellaration`
-   - generate one rotating-ellipse boundary
-   - run one low-fidelity verifier call
-   - write a result artifact to persistent storage
-
- ### Colab Notebook
-
- Primary job:
-
- - connect to the live environment
- - run multi-turn episodes
- - export traces and baseline comparisons
-
- Secondary job:
-
- - show training or policy improvement if the signal is credible
-
- If training is weak but the environment and eval traces are strong, the notebook still ships.
-
- Colab is a required artifact, but it is not the preferred main compute surface.
-
- Connectivity rule:
-
- - if HF Space is public, the notebook uses direct HTTP calls with no extra auth flow
- - if HF Space is private, the notebook must state the required token path and setup explicitly
-
- ### Demo Video
-
- The video should show:
-
- 1. the `P1` task
- 2. the environment observation and action space
- 3. one manual or agent trajectory
- 4. one reward pathology and fix
- 5. one baseline comparison
-
- Reward curves are optional supporting visuals, not the center of the story.
-
- ### Public Repo
-
- The repo should make the environment easy to understand:
-
- - what `P1` is
- - what the agent sees
- - what the agent can do
- - how reward works
- - how to run one episode
- - where the demo evidence lives
- - why the repo is freshly wired rather than copied from the old project
-
- ## 17. Success Gates
-
- ### Prerequisite: Northflank Compute Ready
-
- - notebook starts on the team H100
- - persistent storage mount is usable
- - smoke test artifact is written successfully from the rotating-ellipse-derived low-dimensional boundary path
- - latest artifact example: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
-
- ### Gate 1: Environment Contract Locked
-
- - task frozen
- - observation schema frozen
- - action schema frozen
- - terminal conditions frozen
- - explicit VMEC failure semantics defined
- - low-fi vs high-fi metric labeling defined
-
- ### Gate 2: Verifier Wiring Pass
-
- - official `P1` verifier returns expected outputs
- - fixture ordering is sensible
- - objective direction is correct
-
- ### Gate 3: Manual Playtest Pass
-
- - human can act coherently
- - at least one trajectory feels sensible
- - at least one pathology identified or ruled out
-
- ### Gate 4: Stable Local Episode
-
- - local modify -> verify -> observe loop works
- - at least one end-to-end episode is stable
- - submit-time reward/reporting does not mix low-fi and high-fi score state
-
- ### Gate 5: Reward V1
-
- - at least one reward revision completed
- - story is documented with before/after behavior
-
- ### Gate 6: Baselines
-
- - random baseline complete
- - heuristic baseline complete
- - heuristic is at least competitive and preferably better than random
-
- ### Gate 7: Remote Environment
-
- - HF Space live
- - remote client runs one clean episode
-
- ### Gate 8: Notebook Evidence
-
- - notebook runs end-to-end
- - traces exported
- - training evidence included only if it adds signal
-
- ## 18. Timeline
-
- ### Phase 0
-
- Run two parallel tracks:
-
- - Track A: Northflank compute setup and smoke validation
- - Track B: lock the `P1` environment contract
-
- Deliverables:
-
- - frozen task definition
- - frozen action and observation schema
- - proof that one local `P1` loop works
- - Northflank smoke test pass
-
- ### Phase 1
-
- Repair the low-dimensional parameterization, wire the verifier split cleanly, and run a small measured sweep before fixture checks.
-
- Deliverables:
-
- - repaired low-dimensional boundary builder
- - boundary-based verifier split
- - explicit VMEC failure semantics
- - measured parameter ranges, deltas, and candidate reset seeds
-
- ### Phase 2
-
- Audit observation clarity, then freeze initial fixtures and manual-playtest the environment.
-
- Deliverables:
-
- - observation semantics note covering low-fi vs high-fi reporting and best-state fields
- - one good or near-boundary fixture
- - bad fixtures
- - 5 to 10 episode logs
- - notes on leverage, ambiguity, and pathologies
-
- ### Phase 3
-
- Implement or refine Reward V0 into Reward V1 based on real behavior.
-
- Deliverables:
-
- - documented exploit
- - documented fix
- - updated reward logic
-
- ### Phase 4
-
- Stabilize one local task and run baselines.
-
- Deliverables:
-
- - stable local trajectory
- - random baseline
- - heuristic baseline
-
- ### Phase 5
-
- Deploy HF Space and validate remote parity.
-
- Deliverables:
-
- - live environment
- - one stable remote episode
-
- ### Phase 6
-
- Produce notebook evidence.
-
- Deliverables:
-
- - Colab notebook
- - Northflank traces or run exports
- - traces
- - baseline comparison
- - training outputs only if persuasive
-
- ### Phase 7
-
- Record the demo and make the repo readable.
-
- Deliverables:
-
- - 1-minute video
- - public README
- - linked artifacts
-
- ## 19. Fallback Rules
-
- If something goes wrong, the fallback should preserve the environment story.
-
- ### If training signal is weak
-
- Do not force a training-centric pitch.
-
- Ship:
-
- - strong environment
- - verifier and fixture evidence
- - manual playtest evidence
- - reward iteration story
- - baseline traces
- - one stable remote demo
-
- ### If Northflank is delayed or unavailable
-
- Do not block environment design on it.
-
- Fallback:
-
- - continue contract definition, reward design, and basic wiring locally
- - use local CPU or Colab for limited verifier/debug work
- - keep Northflank as the preferred compute target, but do not stall the whole plan waiting for it
-
- ### If reward is unstable
-
- Reduce ambition:
-
- - keep only the terms we can explain
- - remove fragile shaping
- - prefer legible trajectories over complex reward composition
-
- ### If the task is too hard
-
- Do not broaden scope.
-
- Instead:
-
- - simplify the initial states
- - tighten the action set
- - reduce magnitude choices
- - keep the environment more learnable before changing the budget
-
- ### If the task is too easy
-
- Do not add more domains.
-
- Instead:
-
- - first verify that parameterization repair and reset seeds did not make the task trivial
- - adjust budget
- - adjust magnitudes
- - adjust reward to discourage trivial submission
-
- ## 20. Demo Story
-
- The recommended demo structure is:
-
- ### Part 1: Problem
-
- "The agent interacts with the official `P1` stellarator-design benchmark and must improve a design under strict geometric constraints."
-
- ### Part 2: Environment
-
- "Here is what the agent sees, what it can change, and what counts as success."
-
- ### Part 3: Reward Iteration
-
- "Our first reward version produced a bad behavior. We changed the penalty or incentive, and the behavior improved."
-
- ### Part 4: Evidence
-
- "Here is one stable trajectory, plus random and heuristic baselines."
-
- ### Part 5: Why It Matters
-
- "This is a clear, reproducible scientific workflow environment built around a real verifier, not a shortcut task."
-
- That last line is intentionally conservative. It is strong enough without claiming universal scientific transfer.
-
- ## 21. Immediate Next Actions
-
- 1. Run a small measured sweep before locking ranges, deltas, or budget changes.
- 2. Freeze fixtures and run manual playtests before heavy training work.
- 3. Mark the current reward as `V0`.
- 4. Log the first real pathology and reward revision.
- 5. Do not let notebook or video work outrun the environment evidence.
  **Hackathon:** OpenEnv Hackathon, March 7-8, 2026
  **Track:** Statement 3.1 (World Modeling — Professional Tasks)
+ **Role:** Planning and execution SSOT for this repo
+ **Updated:** March 8, 2026

  ## 1. Submission Thesis

+ Fusion Design Lab is not primarily a "trained model for fusion" submission.

+ It is a clear, reproducible environment for one constrained scientific design task:

  - official `P1` benchmark semantics
+ - narrow, human-playable action space
  - real verifier feedback from `constellaration`
+ - explicit constraints and failure semantics
+ - reward logic that can be explained and iterated

  Training is supporting evidence. The environment is the product.

+ ## 2. Current State
+ Completed:
+
+ - `P1` is locked as the single benchmark task
+ - the repaired 4-knob low-dimensional runtime is live in code
+ - the official `constellaration` verifier path is wired
+ - low-fidelity `run` and high-fidelity `submit` are separated clearly
+ - terminal scoring and reporting are fidelity-consistent
+ - explicit VMEC failure semantics are implemented
+ - the Northflank smoke workflow is committed
+ - the Northflank smoke test passed on the team H100
+ - baseline comparison has been rerun on the real verifier path
+ - a coarse measured sweep note now exists
+ - the first tracked low-fidelity fixtures now exist
+ - an initial low-fidelity manual playtest note now exists
+
+ Still open:
+
+ - paired high-fidelity checks for the tracked fixtures
+ - submit-side manual playtest evidence
+ - heuristic baseline refresh on the repaired real-verifier path
+ - HF Space deployment evidence
+ - Colab artifact wiring
+ - demo and README polish after the artifacts are real
+
+ Current caution:
+
+ - do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
+ - do not narrate low-fidelity rollout metrics as final submission truth
+
+ ## 3. Locked Decisions
+
+ These decisions are fixed unless a hard blocker appears:
+
+ - benchmark task: `P1`
+ - submission framing: `Statement 3.1`
+ - verifier of record: `constellaration.problems.GeometricalProblem`
+ - repo strategy: fresh wiring in this repo
+ - reuse policy: do not port the old `ai-sci-feasible-designs` harness
+ - scope rule: one stable task only
+
+ Execution rule:
+
+ - do not reopen strategy unless a real blocker appears
+ - convert decisions into code, fixtures, traces, baselines, or deployment work
+
+ ## 4. Non-Negotiables
+
+ - Keep scope to one stable task.
+ - Keep claims conservative and evidence-backed.
+ - Do not let training-first work outrun environment stability.
+ - Do not rely on reward curves alone; keep trajectory evidence.
+ - Do not use reward complexity to hide a blocked action family.
+ - Do not polish repo or video before the environment and baselines are real.
+
+ ## 5. Document Roles
+
+ Use the docs like this:
+
+ - this file defines planning order, status, gates, and fallback rules
+ - [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md) defines the live technical contract
+ - [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md) keeps blocker evidence, sweep evidence, and supporting rationale
+ - archived legacy planning docs live under [`archive/`](archive/) and are not active SSOT surfaces
+
+ ## 6. Artifact Plan
+
+ Visible artifacts:
+
+ - [ ] HF Space environment
+ - [ ] Required Colab notebook
+ - [ ] 1-minute demo video
+ - [x] Public repo and README
+
+ Compute surfaces:
+
+ - Northflank is the main compute workspace for verifier-heavy work
+ - HF Space is the hosted environment surface
+ - Colab is the required public artifact and can stay evaluation-first if training evidence is weak
+
+ Evidence order:
+
+ - [x] measured sweep note
+ - [ ] fixture checks
+ - [x] manual playtest log
+ - [ ] reward iteration note
+ - [ ] stable local and remote episodes
+ - [x] random and heuristic baselines
+ - [ ] notebook evidence
+ - [ ] demo and repo polish
+
+ ## 7. Environment Summary
+
+ The environment contract must stay narrow and legible:
+
+ - one repaired low-dimensional boundary family derived from a rotating-ellipse seed
+ - discrete `run | submit | restore_best` interaction
+ - low-fidelity verifier for normal steps
+ - high-fidelity verifier for `submit`
+ - readable observation surface with explicit fidelity labeling
+ - `Reward V0` kept simple and feasibility-first until playtesting proves a real pathology
+
+ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.
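
As a concrete sketch of how narrow this interaction surface is, the discrete `run | submit | restore_best` schema can be written down in a few lines. All class and type names below are illustrative assumptions, not the repo's actual identifiers; the authoritative schema is the environment contract doc.

```python
from dataclasses import dataclass
from typing import Literal, Union

# Hypothetical names for illustration only; the real schema lives in the
# environment contract doc and the environment code.
Parameter = Literal["aspect_ratio", "elongation", "rotational_transform", "triangularity_scale"]
Direction = Literal["increase", "decrease"]
Magnitude = Literal["small", "medium", "large"]

@dataclass(frozen=True)
class RunAction:
    # One bounded nudge to a single knob of the repaired 4-knob family.
    parameter: Parameter
    direction: Direction
    magnitude: Magnitude

@dataclass(frozen=True)
class SubmitAction:
    # Triggers the high-fidelity evaluation and ends the episode.
    pass

@dataclass(frozen=True)
class RestoreBestAction:
    # Rolls the working design back to the best design seen so far.
    pass

Action = Union[RunAction, SubmitAction, RestoreBestAction]

def action_space_size() -> int:
    # 4 parameters x 2 directions x 3 magnitudes run actions,
    # plus the submit and restore_best actions.
    return 4 * 2 * 3 + 2
```

The point of the sketch is the size of the space: a few dozen discrete actions, which is what keeps episodes human-playable and traces legible.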
+
+ ## 8. Execution Order
+
+ - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
+ - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
+ - [ ] Manual-playtest 5 to 10 episodes, including real submit traces, and record the first real confusion point, exploit, or reward pathology.
+ - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
+ - [ ] Refresh the heuristic baseline using the repaired-family evidence.
+ - [ ] Prove a stable local episode path.
+ - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
+ - [ ] Wire the Colab artifact to the live environment.
+ - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
+ - [ ] Polish the public repo only after the artifacts above exist.
+
+ ## 9. Success Gates
+
+ Gate 1: measured sweep exists
+
+ - repaired-family ranges, deltas, and reset seeds are justified by recorded evidence
+
+ Gate 2: fixture checks pass
+
+ - good, boundary, and bad references behave as expected
+
+ Gate 3: manual playtest passes
+
+ - a human can read the observation
+ - a human can choose a plausible next action
+ - a human can explain the reward change
+
+ Gate 4: local episode is stable
+
+ - one clean trajectory is reproducible enough for demo use
+
+ Gate 5: baseline story is credible
+
+ - heuristic behavior is at least interpretable and preferable to random on the repaired task
+
+ Gate 6: remote surface is real
+
+ - HF Space preserves the same task contract as local
+
+ Gate 7: submission artifacts exist
+
+ - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
+
+ ## 10. Fallback Rules
+
+ If training evidence is weak:
+
+ - keep the notebook evaluation-first
+ - ship the environment, playtest, and baseline story anyway
+
+ If HF Space deployment is delayed:
+
+ - keep local and Northflank evidence first
+ - document the deployment blocker plainly
+ - do not invent remote claims without a real run
+
+ If reward behavior is confusing:
+
+ - fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity
+
+ If the repaired family is too hard:
+
+ - adjust ranges, deltas, or seeds from measured evidence
+ - do not expand into a broad Fourier action space just to rescue the hackathon scope
+
+ If the repaired family is too easy:
+
+ - prefer fixture and seed adjustments before broadening the action schema
+
+ ## 11. Immediate Next Actions
+
+ - [x] Record the measured sweep and choose provisional defaults from evidence.
+ - [x] Check in tracked fixtures.
+ - [x] Record the first manual playtest log.
+ - [ ] Refresh the heuristic baseline from that playtest evidence.
+ - [ ] Verify one clean HF Space episode with the same contract.
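
The "one clean HF Space episode" check above can be sketched as a tiny client loop. The endpoint paths, payload shape, and the `done` observation field are assumptions for illustration, not the environment's confirmed API; the transport is injected so the loop logic can be exercised without a live Space.

```python
import json
import urllib.request
from typing import Callable, Dict

def http_post(base_url: str, path: str, payload: dict) -> dict:
    # Minimal JSON-over-HTTP helper, matching the "public Space, no extra
    # auth flow" connectivity rule. base_url is whatever the deployed
    # Space exposes; no real URL is assumed here.
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_one_episode(post: Callable[[str, dict], dict], max_steps: int = 6) -> Dict:
    # Drive one remote episode through the injected transport. The action
    # payload mirrors the discrete run-action fields; "/reset", "/step",
    # and "done" are hypothetical names for illustration.
    obs = post("/reset", {})
    for _ in range(max_steps):
        if obs.get("done"):
            break
        obs = post("/step", {"action": {"kind": "run",
                                        "parameter": "elongation",
                                        "direction": "decrease",
                                        "magnitude": "small"}})
    return obs
```

In a notebook, the real call would be `run_one_episode(lambda p, d: http_post(space_url, p, d))` with the deployed Space URL; the loop itself is what a remote-parity check needs to exercise.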
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md DELETED
@@ -1,246 +0,0 @@
- # Fusion Design Lab: Next 12 Hours Checklist
-
- This checklist turns the updated deliverables map and Plan V2 into concrete execution order. The goal is to produce real evidence for the four submission artifacts, with `P1`, fresh wiring, and environment clarity driving the sequence.
-
- ## Core Rule
-
- Do not expand scope beyond one stable task. Training is supporting evidence, not the main story.
-
- ## Current Branch Status
-
- - [x] `P1` task is locked
- - [x] repaired 4-knob low-dimensional `P1` contract is implemented in the working tree
- - [x] baselines and API surface have been moved to the `P1` contract
- - [x] add a post-terminal guard in `step()`
- - [x] replace the synthetic evaluator with `constellaration`
- - [x] re-run baselines on the real verifier path
- - [x] commit the Northflank smoke workflow and note
- - [x] pass the Northflank smoke test on the team H100
- - [x] verify that the historical upstream 3-knob family is blocked on P1 triangularity under the real verifier path
- - [x] repair the low-dimensional parameterization
- - [x] add explicit VMEC failure semantics
- - [x] label low-fi `run` truth vs high-fi `submit` truth in the task surface
- - [x] separate high-fi submit scoring/reporting from low-fi rollout score state
- - [ ] add tracked fixtures and manual playtest evidence
- - [ ] refresh the heuristic baseline after the real-verifier rerun
-
- Current caution:
-
- - do not assume the first repaired defaults are final; run a measured sweep before treating ranges, deltas, or reset seeds as stable
- - do not present submit-time score comparisons as clean unless they are grounded in the now-separated high-fi submit state
-
- ## Plan V2 Inheritance
-
- Carry these rules through the whole checklist:
-
- - Freeze the environment contract before heavy iteration.
- - Keep the repo freshly wired; do not port the old harness.
- - Treat the current reward as `Reward V0`, not final reward.
- - Distinguish validated facts from working hypotheses.
- - Prefer behavior traces and baseline comparisons over generic reward-curve storytelling.
- - If training is weak, ship the environment story anyway.
- - Use Northflank as the main compute workspace; keep HF Space and Colab as the submission surfaces.
- - Do not open another strategy loop unless a real blocker appears.
-
- ## Hour 0-2: Parallelize Compute Bring-Up and Contract Lock
-
- ### Track A: Northflank Compute
-
- 1. Bring up the Northflank Jupyter Notebook with PyTorch on the team H100.
-
32
- ## Plan V2 Inheritance
33
-
34
- Carry these rules through the whole checklist:
35
-
36
- - Freeze the environment contract before heavy iteration.
37
- - Keep the repo freshly wired; do not port the old harness.
38
- - Treat the current reward as `Reward V0`, not final reward.
39
- - Distinguish validated facts from working hypotheses.
40
- - Prefer behavior traces and baseline comparisons over generic reward-curve storytelling.
41
- - If training is weak, ship the environment story anyway.
42
- - Use Northflank as the main compute workspace; keep HF Space and Colab as the submission surfaces.
43
- - Do not open another strategy loop unless a real blocker appears.
44
-
45
- ## Hour 0-2: Parallelize Compute Bring-Up and Contract Lock
46
-
47
- ### Track A: Northflank Compute
48
-
49
- 1. Bring up the Northflank Jupyter Notebook with PyTorch on the team H100.
50
- 2. Attach persistent storage before relying on saved models, caches, or fixture downloads.
51
- 3. Preserve the concrete smoke-test evidence:
52
- - import `constellaration`
53
- - generate one rotating-ellipse-derived low-dimensional boundary
54
- - run one low-fidelity verifier call
55
- - keep one artifact in persistent storage
56
- - current artifact: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
57
-
58
- Exit condition: the notebook is not just open; the verifier path works and persistent storage is usable.
59
-
60
- Artifacts:
61
- - Northflank notebook live
62
- - smoke test note
63
- - one persisted smoke artifact
64
-
65
- ### Track B: Environment Contract
66
-
67
- 1. Write the exact `P1` environment spec.
68
- 2. Freeze one task only.
69
- 3. Define:
70
- - observation schema
71
- - action schema
72
- - episode loop
73
- - terminal conditions
74
- - reward V0 terms
75
- - initial penalties
76
- 4. Update the main diagram so it emphasizes:
77
- - `P1`
78
- - official verifier
79
- - reward shaping
80
- - manual playtesting
81
- 5. Mark open assumptions explicitly:
82
- - whether the repaired low-dimensional action set is expressive enough
83
- - whether the fixed step budget is enough
84
- - whether `restore_best` is useful without becoming an exploit
85
- - whether repaired-family ranges and deltas need adjustment after measurement
86
-
87
- Exit condition: a human can read the spec and understand how to act in the environment.
88
-
89
- Artifacts:
90
- - short environment spec
91
- - revised mermaid diagram
92
- - short hypothesis list
93
-
94
- Transition rule:
95
-
96
- - once Track B exits, stop rewriting the strategy and move straight into wiring and verifier checks
97
-
98
- ## Hour 2-4: Verify Wiring, Then Manual Playtest
99
-
100
- 1. Run a small measured sweep on the repaired family before freezing defaults.
101
- 2. Audit observation clarity:
102
- - low-fi `run` metrics are clearly labeled
103
- - high-fi `submit` metrics are clearly labeled
104
- - low-fidelity and high-fidelity best-state fields are explicit and human-readable
105
- 3. Run fixture checks:
106
- - known-good or near-winning design
107
- - near-boundary designs
108
- - clearly bad designs
109
- - do not rely on the current default baseline params as the only starting point
110
- 4. Confirm:
111
- - verifier outputs are sane
112
- - reward ordering is sane
113
- - objective direction is correct
114
- 5. Manually play 5 to 10 episodes.
115
- 6. Log for each step:
116
- - observation
117
- - chosen action
118
- - expected effect
119
- - returned reward
120
- - confusion or exploit if observed
121
- 7. Identify at least one bad incentive or exploit.
122
- 8. Patch reward or penalty logic immediately.
123
- 9. Write the reward shaping story:
124
- - initial reward V0
125
- - bad behavior
126
- - refinement to reward V1
127
- - improved behavior
128
- 10. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
129
-
130
- Exit condition: you can explain why the environment now rewards the intended behavior.
131
-
132
- Artifacts:
133
- - measured range and delta note
134
- - observation semantics note
135
- - fixture check note
136
- - manual playtest log
137
- - reward shaping note
138
- - reward V1 delta note
139
-
140
- ## Hour 4-6: Stabilize the Local Task
141
-
142
- 1. Prove the fresh local `P1` verifier loop.
143
- 2. Run one stable end-to-end task repeatedly.
144
- 3. Confirm the action schema is deterministic enough for reproducible episodes.
145
- 4. Save one clean local trajectory.
146
- 5. Do not proceed to remote deployment until this gate is real.
147
-
148
- Exit condition: the same setup yields the same type of behavior reliably enough for a demo.
149
-
150
- Artifacts:
151
- - stable local run
152
- - saved trajectory
153
-
154
- ## Hour 6-8: Make the HF Space Real
155
-
156
- 1. Package the OpenEnv `P1` environment for remote use.
157
- 2. Use the explicit deployment path:
158
- - commit changes in this repo
159
- - push to GitHub
160
- - let HF Space build from the repo
161
- 3. Decide and document the access mode:
162
- - preferred: public HF Space for the hackathon
163
- - if private: token-based notebook access documented
164
- 4. Verify remote `reset` and `step`.
165
- 5. Run one clean remote episode end-to-end.
166
- 6. Confirm the remote environment preserves the same task contract as local.
167
-
168
- Exit condition: the environment is runnable in the actual submission surface, not only locally.
169
-
170
- Artifacts:
171
- - live HF Space environment
172
- - remote episode proof
173
-
174
- ## Hour 8-10: Add Baselines
175
-
176
- 1. Implement the random baseline.
177
- 2. Implement the heuristic baseline.
178
- 3. Run short comparisons on the same stable `P1` task.
179
- 4. Save:
180
- - comparison numbers
181
- - behavior traces
182
- - one example where heuristic beats random
183
-
184
- Exit condition: there is a credible baseline anchor for the judges.
185
-
186
- Artifacts:
187
- - random baseline
188
- - heuristic baseline
189
- - comparison table or figure
190
-
191
- ## Hour 10-12: Produce the Submission Evidence
192
-
193
- 1. Wire the Colab training or eval script to the live environment.
194
- 2. Ensure it produces:
195
- - multi-turn episodes
196
- - behavior traces
197
- - reward or behavior comparison outputs
198
- 3. Keep heavy verifier and training work on Northflank; use Colab as the thin public artifact.
199
- 4. Draft the 60-second demo script.
200
- 5. Record the demo around:
201
- - what `P1` is
202
- - how reward was refined
203
- - what manual playtesting revealed
204
- - one stable trajectory
205
- - baseline comparison
206
- 6. If training evidence is weak, keep the notebook eval-first and do not force a training-centric claim.
207
- 7. Make the repo public-facing and readable only after the artifacts are real.
208
-
209
- Exit condition: all four visible artifacts exist in usable form.
210
-
211
- Artifacts:
212
- - Colab training or eval script
213
- - Northflank run notes or exported traces
214
- - demo script
215
- - draft or final video
216
- - updated repo README
217
- - explicit fallback note if training is not persuasive
218
-
219
- ## Artifact Order
220
-
221
- 1. Environment spec
222
- 2. Repaired parameterization note
223
- 3. Fixture check note
224
- 4. Manual playtest log
225
- 5. Reward revision note
226
- 6. Stable task run
227
- 7. Random baseline
228
- 8. Heuristic baseline
229
- 9. Northflank traces or training evidence
230
- 10. Colab training or eval evidence
231
- 11. Demo recording
232
- 12. Repo polish
233
-
234
- ## Non-Negotiables
235
-
236
- - Do not widen scope beyond one stable task.
237
- - Do not port the old `ai-sci-feasible-designs` harness into this repo.
238
- - Do not optimize training before manual playtesting.
239
- - Do not rely on reward curves alone; keep trajectory evidence.
240
- - Do not narrate hypotheses as facts before they are checked.
241
- - Do not guess repaired-family ranges, deltas, or budget changes without a measured sweep.
242
- - Do not polish the repo or video before the environment and baselines are real.
243
- - Treat judge comments as pressure toward clarity and reproducibility, not broader unsupported claims.
244
- - Do not force a training-centric story if the strongest evidence is environment quality plus baselines.
245
- - Do not rely on Northflank container-local state without persistent storage.
246
- - Do not block contract design work on Northflank provisioning friction.
 
 
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -1,151 +1,92 @@
1
  # P1 Environment Contract V1
2
 
3
- **Status:** Technical contract with partial implementation now landed
4
- **Role:** Supporting spec for the `P1` environment contract
5
- **SSOT relationship:** This file refines [FUSION_DESIGN_LAB_PLAN_V2.md](FUSION_DESIGN_LAB_PLAN_V2.md). If this file conflicts with the planning SSOT, update both in the same task.
6
 
7
- ## Purpose
8
 
9
- This file captures the technical contract that should drive the next code changes in:
10
 
11
- - [server/physics.py](../server/physics.py)
12
- - [fusion_lab/models.py](../fusion_lab/models.py)
13
- - [server/environment.py](../server/environment.py)
14
- - [server/app.py](../server/app.py)
15
 
16
- The central change is now explicit:
17
 
18
- - the historical upstream 3-knob rotating-ellipse family is blocked on P1 triangularity under the real verifier path
19
- - that blocker drove the repair to the current 4-knob low-dimensional runtime
20
- - the runtime now exposes the repaired 4-knob target, but measured sweep validation and fixture calibration are still pending
21
-
22
- ## Historical Blocker
23
-
24
- This section records the resolved upstream blocker that motivated the current repair. It is not the live runtime state.
25
-
26
- Current verified facts:
27
-
28
- - upstream `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` has no triangularity control
29
- - the historical 3-knob environment directly exposed only:
30
- - `aspect_ratio`
31
- - `elongation`
32
- - `rotational_transform`
33
- - real low-fidelity samples on the current verifier path kept:
34
- - `average_triangularity` at roughly `+0.004975`
35
- - `p1_feasibility` at roughly `1.00995`
36
- - feasible count at `0`
37
-
38
- Conclusion:
39
-
40
- - the historical 3-knob family was not a meaningful playtest or baseline environment for `P1`
41
- - the live runtime therefore moved to a repaired boundary family before further reward iteration
42
-
43
- ## Design Split
44
 
45
  Keep three layers separate:
46
 
47
- 1. **Boundary builder**
48
- - low-dimensional parameterization
49
- - rotating-ellipse seed generation
50
- - optional triangularity control injection
51
- 2. **Official verifier**
52
- - boundary in
53
- - metrics out
54
- - feasibility, objective, and score semantics from `GeometricalProblem`
55
- 3. **Environment**
56
- - reset pool
57
- - discrete actions
58
- - episode flow
59
- - reward shaping
60
-
61
- ## Verifier Plan
62
-
63
- `server/physics.py` should expose a boundary-based verifier surface.
64
-
65
- Current repo state:
66
-
67
- - the live code now exposes a boundary builder plus boundary-based evaluator
68
- - explicit failure results are returned when VMEC evaluation fails
69
- - measured sweep validation is still pending
70
-
71
- Current live functions:
72
-
73
- - `build_boundary_from_params(...) -> SurfaceRZFourier`
74
- - `evaluate_boundary(boundary, fidelity) -> EvaluationMetrics`
75
 
76
- Current layering note:
77
 
78
- - discrete perturbation application lives in `server/environment.py`
79
- - there is no separate `apply_low_dim_perturbation(...)` helper in the live code
 
80
 
81
- The verifier layer should own:
82
 
83
- - low-fidelity step-time evaluation
84
- - high-fidelity submit-time evaluation
85
  - official `P1` feasibility semantics
86
- - official `P1` objective direction
87
- - score ordering
88
  - explicit failure results when VMEC or forward-model evaluation fails
89
 
90
- The verifier layer should not own:
91
 
 
 
92
  - episode budget
93
- - action semantics
94
  - reward shaping
95
- - “best so far” state
96
 
97
- ## Low-Dimensional Boundary Plan
98
 
99
- Stay low-dimensional, not Fourier-first.
100
 
101
- Target controllable knobs:
102
 
103
  - `aspect_ratio`
104
  - `elongation`
105
  - `rotational_transform`
106
  - `triangularity_scale`
107
 
108
- Current measurement rule:
109
 
110
- - do not lock exact repaired-family ranges or deltas from prose alone
111
- - measure them on the repaired boundary family before presenting them as defaults
112
- - especially treat `rotational_transform` bounds, `triangularity_scale` deltas, and budget changes as open until measured
113
 
114
- Important naming rule:
115
 
116
- - once triangularity is injected explicitly, stop describing the family as plain upstream “rotating ellipse”
117
- - it becomes a custom low-dimensional boundary family derived from a rotating-ellipse seed
118
 
119
- ## Action Contract
 
 
120
 
121
- Keep the discrete interaction style:
122
 
123
- - `intent`: `run | submit | restore_best`
124
  - `direction`: `increase | decrease`
125
  - `magnitude`: `small | medium | large`
126
 
127
- For `run`, the controllable parameter should be one of:
128
-
129
- - `aspect_ratio`
130
- - `elongation`
131
- - `rotational_transform`
132
- - `triangularity_scale`
133
-
134
- This keeps the environment human-playable and aligned with the historical low-dimensional P1 path.
135
-
136
- Current repo state:
137
 
138
- - the live action schema now exposes:
139
- - `aspect_ratio`
140
- - `elongation`
141
- - `rotational_transform`
142
- - `triangularity_scale`
143
 
144
- ## Observation Contract
145
 
146
- The observation should stay metric-centered and human-readable.
147
 
148
- Keep:
149
 
150
  - `max_elongation`
151
  - `aspect_ratio`
@@ -167,113 +108,114 @@ Keep:
167
  - `target_spec`
168
  - `diagnostics_text`
169
 
170
- Add clarity about fidelity:
171
 
172
- - low-fidelity step-time metrics should be labeled as such
173
- - high-fidelity submit-time metrics should be labeled as such
174
- - do not expose them as if they are the same truth surface
175
- - the live runtime should expose separate low-fidelity and high-fidelity best-state fields instead of overloading one generic best-state metric
176
 
177
- This can be done either by:
178
 
179
- - separate observation fields, or
180
- - explicit fidelity labels in `diagnostics_text`
 
 
 
 
181
 
182
- The minimum requirement is that a reader can tell whether a metric came from low-fi `run` or high-fi `submit`.
183
 
184
- Current repo state:
 
 
 
185
 
186
- - the live observation surface now exposes evaluation fidelity and failure state
187
- - the live observation surface now exposes separate low-fidelity and high-fidelity best-state fields
188
- - terminal reward/reporting is now fidelity-consistent: `submit` compares against high-fi reference state instead of low-fi rollout score state
189
 
190
- ## Reward V0
191
 
192
- Keep reward mostly scalar and verifier-driven.
 
 
 
193
 
194
- Target structure:
195
 
196
- - infeasible to feasible crossing:
197
- - clear positive bonus
198
- - feasible to infeasible regression:
199
- - clear negative penalty
200
- - both infeasible:
201
- - reward reduction in official feasibility scalar
202
- - both feasible:
203
- - reward lower `max_elongation`
204
- - non-submit step:
205
- - small step cost
206
- - recovery after a failed evaluation:
207
- - modest positive signal for returning to a valid verifier result
208
- - do not compute this from the failed sentinel feasibility value
209
- - explicit `submit`:
210
- - better than passive budget exhaustion when the design is improved
211
 
212
- Do not add:
213
 
214
- - reward terms tied to specific Fourier modes
215
- - bonuses for matching a known winner
216
- - hand-coded constraint tricks to hide a blocked action family
 
 
 
 
 
 
 
 
 
 
217
 
218
- Do not use reward complexity to compensate for missing action expressivity or missing crash semantics.
219
 
220
- Additional fidelity rule:
221
 
222
- - do not compare a high-fidelity submit score against low-fidelity baseline state
223
- - terminal reward and submit summaries should use a fidelity-consistent basis
224
 
225
- ## Reset Strategy
226
 
227
- Start with frozen exact seeds, not jitter.
 
 
 
 
 
 
228
 
229
- Reset pool policy:
230
 
231
- - `n_field_periods = 3`
232
- - small frozen seed set
233
- - each seed must be:
234
- - reproducible
235
- - near enough to the feasible boundary that 6 steps is worth testing
236
- - not already solved
237
 
238
- Add bounded jitter only if memorization becomes a real problem.
239
 
240
- ## Manual Playtest Gate
241
 
242
- Do not move to heuristic redesign or reward tuning until this gate is passed.
 
 
243
 
244
- Manual playtest questions:
245
 
246
- - can a human tell which constraint is currently blocking progress?
247
- - can a human choose a plausible next action?
248
- - can a human reach or approach feasibility within the budget?
249
- - does `submit` feel meaningfully different from passive exhaustion?
250
 
251
- If the answer is no, fix:
252
 
253
- - the boundary family
254
- - the step magnitudes
255
- - the seed pool
256
- - the observation semantics around low-fi vs high-fi best-state reporting
257
 
258
- before tuning reward further
259
 
260
- ## Implementation Order
261
 
262
- 1. Repair the low-dimensional boundary builder in [server/physics.py](../server/physics.py).
263
- 2. Split boundary construction from official boundary evaluation in [server/physics.py](../server/physics.py).
264
- 3. Update the action and state schema in [fusion_lab/models.py](../fusion_lab/models.py).
265
- 4. Update the episode loop and observation labeling in [server/environment.py](../server/environment.py).
266
- 5. Update the task summary and public action description in [server/app.py](../server/app.py).
267
- 6. Add explicit VMEC failure semantics in [server/environment.py](../server/environment.py).
268
- 7. Run a small measured sweep to choose ranges, deltas, and reset seeds.
269
- 8. Verify that observation semantics are human-readable and that low-fi versus high-fi best-state reporting is explicit.
270
- 9. Freeze 1-2 repaired low-dimensional fixtures.
271
- 10. Run manual playtesting.
272
- 11. Refresh the heuristic baseline only after that evidence exists.
273
 
274
- ## Out of Scope
275
 
276
- - full Fourier-mode action space as the primary environment
277
  - porting the old `ai-sci-feasible-designs` harness
278
- - making reward more complex before the repaired low-dimensional family exists
279
- - building a full benchmark split protocol before the environment is even playable
 
 
1
  # P1 Environment Contract V1
2
 
3
+ **Role:** Live technical contract SSOT for the current implementation phase
4
+ **Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](FUSION_DESIGN_LAB_PLAN_V2.md)
5
+ **Evidence dependency:** [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md)
6
 
7
+ ## 1. Scope
8
 
9
+ This document defines the live technical contract for:
10
 
11
+ - [`server/physics.py`](../server/physics.py)
12
+ - [`fusion_lab/models.py`](../fusion_lab/models.py)
13
+ - [`server/environment.py`](../server/environment.py)
14
+ - [`server/app.py`](../server/app.py)
15
 
16
+ If the observation schema, action schema, episode flow, terminal conditions, or reward semantics change, update this file in the same task.
17
 
18
+ ## 2. Design Split
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
20
  Keep three layers separate:
21
 
22
+ 1. boundary builder
23
+ 2. official verifier
24
+ 3. environment
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
+ Boundary builder owns:
27
 
28
+ - the repaired low-dimensional family
29
+ - rotating-ellipse seed generation
30
+ - explicit triangularity control injection
31
 
32
+ Official verifier owns:
33
 
34
+ - boundary in, metrics out
 
35
  - official `P1` feasibility semantics
36
+ - objective direction and score ordering
37
+ - low-fidelity and high-fidelity evaluation modes
38
  - explicit failure results when VMEC or forward-model evaluation fails
39
 
40
+ Environment owns:
41
 
42
+ - reset pool
43
+ - discrete actions
44
  - episode budget
45
+ - best-state tracking
46
  - reward shaping
 
47
 
48
+ ## 3. Boundary Family
49
 
50
+ The historical 3-knob upstream rotating-ellipse family is not the live contract.
51
 
52
+ The live controllable knobs are:
53
 
54
  - `aspect_ratio`
55
  - `elongation`
56
  - `rotational_transform`
57
  - `triangularity_scale`
58
 
59
+ Rules:
60
 
61
+ - stay low-dimensional and human-playable
62
+ - treat the current family as rotating-ellipse-derived, not plain upstream rotating ellipse
63
+ - the coarse measured sweep is now recorded, but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks
64
 
65
+ ## 4. Action Contract
66
 
67
+ `intent` is one of:
 
68
 
69
+ - `run`
70
+ - `submit`
71
+ - `restore_best`
72
 
73
+ For `run`, the action also includes:
74
 
75
+ - `parameter`: one of `aspect_ratio | elongation | rotational_transform | triangularity_scale`
76
  - `direction`: `increase | decrease`
77
  - `magnitude`: `small | medium | large`
78
 
79
+ Constraints:
 
 
 
 
 
 
 
 
 
80
 
81
+ - keep the discrete interaction style
82
+ - do not expose the full Fourier action space as the primary environment
83
+ - do not use action complexity to compensate for missing clarity elsewhere
 
 
84
 
85
+ ## 5. Observation Contract
86
 
87
+ The observation must stay metric-centered and human-readable.
88
 
89
+ Required fields:
90
 
91
  - `max_elongation`
92
  - `aspect_ratio`
 
108
  - `target_spec`
109
  - `diagnostics_text`
110
 
111
+ Interpretation rules:
112
 
113
+ - low-fidelity `run` metrics must be labeled as low-fidelity
114
+ - high-fidelity `submit` metrics must be labeled as high-fidelity
115
+ - low-fidelity and high-fidelity best-state reporting must stay separate
116
+ - the observation must be understandable without hidden state
117
 
118
+ ## 6. Episode Flow
119
 
120
+ 1. Reset from one frozen repaired-family seed or a small frozen seed set.
121
+ 2. Evaluate the initial state with low fidelity and return the first observation.
122
+ 3. On `run`, perturb one controllable parameter and re-evaluate with low fidelity.
123
+ 4. On `restore_best`, revert to the best known low-fidelity state, re-evaluate, and consume budget.
124
+ 5. On `submit`, end the episode and run the high-fidelity submit evaluation.
125
+ 6. End the episode on `submit` or budget exhaustion.
126
 
127
+ Failure semantics:
128
 
129
+ - failed evaluations still consume budget
130
+ - failed evaluations produce visible failure observations
131
+ - failed evaluations apply a documented penalty
132
+ - the environment must not silently convert failures into success paths
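The budget, terminal, and failure semantics above can be summarized in a toy skeleton. This is illustrative only: the real loop lives in `server/environment.py`, and the class name and penalty constant here are assumptions.

```python
class ToyEpisode:
    """Toy skeleton of the episode flow; names and penalty value are assumptions."""

    def __init__(self, budget=6):
        self.budget = budget
        self.done = False

    def step(self, intent, eval_failed=False):
        assert not self.done, "post-terminal step() must be rejected"
        self.budget -= 1                        # every action consumes budget,
        reward = -0.05 if eval_failed else 0.0  # failed evaluations included
        if intent == "submit" or self.budget == 0:
            self.done = True                    # submit or exhaustion terminates
        return {"steps_left": self.budget, "done": self.done, "reward": reward}
```

The key property is that failures are visible, penalized, and budget-consuming rather than silently retried.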
133
 
134
+ ## 7. Terminal Contract
 
 
135
 
136
+ At termination, the environment must provide:
137
 
138
+ - final best design metrics
139
+ - final feasibility status
140
+ - total reward
141
+ - a short human-readable trajectory summary
142
 
143
+ Terminal reporting rules:
144
 
145
+ - keep submit-time reporting fidelity-consistent
146
+ - do not compare high-fidelity submit results against low-fidelity baseline state as if they were the same truth surface
 
 
 
 
 
 
 
 
 
 
 
 
 
147
 
148
+ ## 8. Verifier Contract
149
 
150
+ The verifier of record is `constellaration.problems.GeometricalProblem`.
151
+
152
+ The implementation must preserve:
153
+
154
+ - objective direction
155
+ - constraint direction
156
+ - feasibility semantics
157
+ - score ordering
158
+
159
+ The verifier should stay boundary-based:
160
+
161
+ - `build_boundary_from_params(...) -> SurfaceRZFourier`
162
+ - `evaluate_boundary(boundary, fidelity) -> EvaluationMetrics`
163
 
164
+ Do not treat parameterization-specific logic as verifier truth.
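The fidelity-labeling requirement can be sketched as a small result container. This is a hypothetical shape: the real `EvaluationMetrics` in `server/physics.py` may differ, and `EvalResult` and its fields are illustrative names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalResult:
    """Hypothetical verifier result shape; the repo's EvaluationMetrics may differ."""
    fidelity: str          # "low" for run steps, "high" for submit
    feasible: bool
    max_elongation: float
    failed: bool = False   # explicit VMEC / forward-model failure flag

    def label(self):
        # keep the two truth surfaces visibly separate in any reporting
        tag = "high-fi submit" if self.fidelity == "high" else "low-fi run"
        status = "FAILED" if self.failed else f"feasible={self.feasible}"
        return f"[{tag}] {status} max_elongation={self.max_elongation:.3f}"
```

Any terminal summary built from such results stays fidelity-consistent by construction.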
165
 
166
+ ## 9. Reward V0
167
 
168
+ `Reward V0` is the live reward contract until playtesting proves a concrete pathology.
 
169
 
170
+ Target behavior:
171
 
172
+ - infeasible to feasible crossing gets a clear positive bonus
173
+ - feasible to infeasible regression gets a clear penalty
174
+ - when both states are infeasible, reduced official feasibility violation should help
175
+ - when both states are feasible, lower `max_elongation` should help
176
+ - non-submit actions pay a small step cost
177
+ - `submit` should be better than passive exhaustion when the design is genuinely improved
178
+ - recovery after a failed evaluation may receive a modest bounded bonus
179
 
180
+ Rules:
181
 
182
+ - keep reward scalar and verifier-driven
183
+ - do not add mode-specific or parameter-specific reward shaping
184
+ - do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
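The target behavior above maps to a single scalar shaping function. This is a minimal sketch: the constants (`step_cost`, `cross_bonus`) and the function name are assumptions, not the live implementation.

```python
def reward_v0(prev, curr, intent, step_cost=0.01, cross_bonus=1.0):
    """Sketch of the Reward V0 shaping described above; constants are illustrative."""
    r = 0.0 if intent == "submit" else -step_cost   # small non-submit step cost
    if not prev["feasible"] and curr["feasible"]:
        r += cross_bonus                             # crossing into feasibility
    elif prev["feasible"] and not curr["feasible"]:
        r -= cross_bonus                             # regression penalty
    elif not curr["feasible"]:
        # both infeasible: reward reduced official feasibility violation
        r += prev["violation"] - curr["violation"]
    else:
        # both feasible: reward lower max_elongation
        r += prev["max_elongation"] - curr["max_elongation"]
    return r
```

Everything is driven by verifier scalars; no term depends on a specific mode or parameter.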
 
 
 
185
 
186
+ ## 10. Reset and Fixture Policy
187
 
188
+ Reset policy:
189
 
190
+ - start with exact frozen seeds
191
+ - keep `n_field_periods = 3`
192
+ - prefer a small reproducible seed set
193
 
194
+ Each seed should be:
195
 
196
+ - reproducible
197
+ - near enough to the feasible boundary to make the budget meaningful
198
+ - not already solved
 
199
 
200
+ Fixture policy:
201
 
202
+ - track good, boundary, and clearly bad references
203
+ - use fixtures for verifier and reward sanity checks
204
+ - do not turn fixture mining into a separate broad project
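A frozen seed pool under this policy can be as small as the sketch below. The parameter values are placeholders, not the measured defaults, and the names are assumptions; the real pool is pending the paired high-fidelity fixture checks.

```python
# Illustrative frozen seed pool; values are placeholders, not measured defaults.
FROZEN_SEEDS = (
    {"aspect_ratio": 3.6, "elongation": 1.4,
     "rotational_transform": 1.6, "triangularity_scale": 0.55,
     "n_field_periods": 3},
)

def reset_seed(episode_idx):
    # deterministic, reproducible selection -- no jitter in V1
    return dict(FROZEN_SEEDS[episode_idx % len(FROZEN_SEEDS)])
```

Returning a copy keeps the frozen pool immune to in-episode mutation.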
 
205
 
206
+ ## 11. Open Measurements
207
 
208
+ These items remain open until measured on the repaired family:
209
 
210
+ - exact repaired-family range bounds
211
+ - exact `triangularity_scale` deltas
212
+ - exact `rotational_transform` bounds
213
+ - exact reset seed pool
214
+ - whether the budget should stay at 6 or change
 
 
 
 
 
 
215
 
216
+ ## 12. Out of Scope
217
 
 
218
  - porting the old `ai-sci-feasible-designs` harness
219
+ - broad Fourier-mode action space as the main environment
220
+ - complicated reward shaping before playtest evidence
221
+ - a wider task family than the single stellarator environment
docs/P1_PARAMETERIZATION_DEEPDIVE.md CHANGED
@@ -1,447 +1,156 @@
1
  # P1 Parameterization Deep-Dive
2
 
3
- **Date:** 2026-03-07
4
- **Status:** Findings complete. Parameterization repair is implemented; measured sweep follow-up is pending.
 
5
 
6
- This document records the investigation into why the current 3-knob rotating-ellipse
7
- environment cannot produce P1-feasible designs, what the original winning session
8
- actually did, and the validated plan for fixing it.
9
 
10
- ---
 
 
 
11
 
12
- ## 1. The Structural Blocker
13
 
14
  ### Symptom
15
 
16
- The environment's 3-parameter action space (`aspect_ratio`, `elongation`,
17
- `rotational_transform`) cannot satisfy the P1 constraints regardless of parameter
18
- values.
19
 
20
- ### Evidence
21
-
22
- A 125-point grid sweep over the full 3-knob range with the real `constellaration`
23
- verifier:
24
-
25
- ```
26
- aspect_ratio ∈ [2.0, 8.0] (5 values)
27
- elongation ∈ [1.0, 5.0] (5 values)
28
- rot_transform ∈ [0.1, 1.0] (5 values)
29
- n_field_periods = 3
30
- ```
31
-
32
- **Result: 0/125 feasible.** Every configuration produced:
33
-
34
- - `average_triangularity ≈ +0.005` (constraint requires `≤ -0.5`, gap of ~0.505)
35
- - `edge_iota_over_nfp ≈ 0.05-0.22` (constraint requires `≥ 0.3`)
36
-
37
- Varying `n_field_periods` (3, 4, 5) did not change the result. The
38
- `generate_rotating_ellipse` function structurally produces near-zero triangularity
39
- regardless of its input parameters.
40
-
41
- ### Root cause
42
-
43
- `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform,
44
- n_field_periods)` sets Fourier coefficients that define the boundary shape. The
45
- `m=2, n=0` mode (which controls triangularity) is not meaningfully set by any of the
46
- three input parameters. Triangularity is structurally fixed near zero.
47
-
48
- The `rotational_transform` range `[0.1, 1.0]` is also too low. Even with injected
49
- triangularity, `edge_iota_over_nfp` doesn't reach 0.3 until `rotational_transform ≈ 1.5+`.
50
-
51
- ---
52
-
53
- ## 2. The Original Winning Session
54
-
55
- ### Source
56
-
57
- The original session that found P1-feasible designs is documented in:
58
-
59
- ```
60
- ai-sci-feasible-designs/docs/harness/raw-session.md
61
- ```
62
-
63
- Session: `rollout-2026-01-05T10-43-45-019b8bd3-14d6-7253-8235-f732ee43d683.jsonl`
64
- (25,012 lines, 200 agent messages, Jan 5-9 2026)
65
-
66
- ### What the agent actually did for P1
67
-
68
- 1. Built `scripts/search_p1_lowdim.py` — a rotating-ellipse sweep with a **4th knob**
69
- 2. Found 3 feasible designs (feasibility 0.0) within ~20 minutes
70
- 3. Refined with trust-region local optimizer around those seeds
71
- 4. Downloaded scadena-pf leaderboard seed from HuggingFace as anchor
72
- 5. Ran `scripts/p1_alm_ngopt_multifidelity.py` (ALM + NGOpt multi-fidelity)
73
- 6. Final result: score 0.970141, beating leaderboard 0.969457
74
-
75
- ### The 4th knob: `tri_scale`
76
-
77
- The script `search_p1_lowdim.py` (recovered from git at commit `300c191`) does NOT
78
- use the raw `generate_rotating_ellipse` output. After generating the base shape, it:
79
-
80
- 1. Expands Fourier resolution: `set_max_mode_numbers(surface, max_poloidal_mode=3, max_toroidal_mode=3)`
81
- 2. Injects triangularity: `r_cos[2, center] = -tri_scale * minor_radius`
82
- 3. Cleans auxiliary modes: `r_cos[0, :center] = 0.0`, `z_sin[0, :center+1] = 0.0`
83
- 4. Returns the modified `SurfaceRZFourier`
84
-
85
- The `tri_scale` knob directly controls the `m=2, n=0` Fourier mode, which is what
86
- drives `average_triangularity` in the physics evaluation. This is the missing piece.
87
-
88
- ### Parameter ranges from the original script
89
-
90
- ```python
91
- aspect_ratio: [3.0, 3.6]
92
- elongation: [1.4, 2.2]
93
- rotational_transform: [1.5, 2.2] # NOTE: much higher than our [0.1, 1.0]
94
- tri_scale: [0.55, 0.8]
95
- ```
96
-
97
- ---
98
-
99
- ## 3. The Harness Campaign Results
100
-
101
- ### All campaigns found zero feasible designs
102
-
103
- Queried SQLite databases across all P1 runs in `ai-sci-feasible-designs/runs/`:
104
-
105
- | Run | Candidates | Feasible | Best feasibility |
106
- |-----|-----------|----------|-----------------|
107
- | p1_campaign | 58 | 0 | 0.615 |
108
- | p1_campaign_v2 | 50 | 0 | 0.416 |
109
- | p1_e2e_validate | 7 | 0 | 0.639 |
110
- | p1_live | 23 | 0 | 0.569 |
111
-
112
- The campaign candidates used full Fourier boundaries (`5x9` arrays, `n_field_periods=5`),
113
- not the low-dimensional rotating-ellipse family.
114
-
115
- ### Postmortem diagnosis (from P1_CAMPAIGN_POSTMORTEM.md)
116
-
117
- The campaign's `_COLD_START_GUIDANCE` told the LLM "Do NOT generate from scratch" and
118
- "Start with small perturbations (0.01-0.05)." The winning raw-session agent did the
119
- exact opposite: broad sweeps, large parameter variations, rotating-ellipse seeds. The
120
- guidance actively prohibited the winning strategy.
121
-
122
- ---
123
-
124
- ## 4. Live Sweep Validation
125
 
126
- ### 4-knob sweep with `tri_scale` injection
-
- A 256-point grid sweep using the same boundary construction as `search_p1_lowdim.py`:
-
- ```
- aspect_ratio    ∈ [3.2, 3.8]  step 0.2  (4 values)
- elongation      ∈ [1.2, 1.8]  step 0.2  (4 values)
- rot_transform   ∈ [1.2, 1.8]  step 0.2  (4 values)
- tri_scale       ∈ [0.4, 0.7]  step 0.1  (4 values)
- n_field_periods = 3
- mpol = 3, ntor = 3
- ```
-
- **Result from the recorded sweep:**
-
- ```
- Total configs: 256
- Evaluated: 228 (VMEC succeeded)
- Crashed: 28 (VMEC solver failed)
- Feasible: 10
- Crash rate: 10.9%
- Feasibility rate (of evaluated): 4.4%
- ```
-
- ### Top feasible results
-
- | AR | elong | rot_t | tri_s | AR_out | tri | iota/nfp | elong_out | feas | ok |
- |----|-------|-------|-------|--------|-----|----------|-----------|------|----|
- | 3.6 | 1.4 | 1.6 | 0.60 | 3.287 | -0.5003 | 0.3005 | 7.3751 | 0.0000 | YES |
- | 3.6 | 1.4 | 1.8 | 0.60 | 3.287 | -0.5003 | 0.3481 | 8.9318 | 0.0000 | YES |
- | 3.8 | 1.4 | 1.6 | 0.60 | 3.487 | -0.5003 | 0.3256 | 7.9202 | 0.0000 | YES |
- | 3.8 | 1.6 | 1.6 | 0.60 | 3.474 | -0.5037 | 0.3168 | 8.0626 | 0.0000 | YES |
- | 3.8 | 1.8 | 1.6 | 0.60 | 3.459 | -0.5075 | 0.3097 | 8.2033 | 0.0000 | YES |
- | 3.4 | 1.2 | 1.8 | 0.60 | 3.096 | -0.4977 | 0.3276 | 8.0849 | 0.0046 | YES |
- | 3.8 | 1.2 | 1.6 | 0.60 | 3.496 | -0.4977 | 0.3345 | 7.7908 | 0.0046 | YES |
- | 3.6 | 1.2 | 1.8 | 0.60 | 3.296 | -0.4977 | 0.3535 | 8.8140 | 0.0046 | YES |
- | 3.2 | 1.2 | 1.8 | 0.60 | 2.896 | -0.4977 | 0.2995 | 7.4314 | 0.0046 | YES |
- | 3.6 | 1.2 | 1.6 | 0.60 | 3.296 | -0.4977 | 0.3105 | 7.2363 | 0.0046 | YES |
-
- **Key observations:**
-
- - All feasible designs have `tri_scale = 0.60`
- - `rot_transform ∈ {1.6, 1.8}` only — lower values never reach feasibility
- - `average_triangularity` clusters at `-0.500` to `-0.508` (right at the constraint)
- - Triangularity is always the binding constraint (within the 1% tolerance)
- - `max_elongation` ranges from 7.2 to 8.9 (score ~0.31 to ~0.12), leaving significant
-   room for optimization compared to the winning score of 0.970 (elongation 1.27)
-
- ### Crash rate by `rotational_transform`
-
- | rot_transform | crash rate | feasible |
- |---------------|-----------|----------|
- | 1.2 | 0% (0/64) | 0 |
- | 1.4 | 0% (0/64) | 0 |
- | 1.6 | 0% (0/64) | 6 |
- | 1.8 | 44% (28/64) | 4 |
-
- **`rot_transform = 1.6` is the sweet spot:** zero crashes, highest feasible count.
-
- ### VMEC crash zone
-
- Tested the extreme region (`rot_transform [2.0, 2.4]`, `tri_scale [0.7, 0.9]`):
-
- ```
- 600/600 crashed (100%)
- ```
-
- The VMEC solver fails universally when the boundary is too distorted. The crash boundary
- is approximately `rot_transform ≥ 2.0` combined with `tri_scale ≥ 0.7`.

- ---
-
- ## 5. Verifier Analysis
-
- ### What's correct in the current verifier (`server/physics.py`)
-
- 1. **`_to_evaluation_metrics`** uses the `GeometricalProblem` public API:
-    - `problem.is_feasible(metrics)` — applies the 1% tolerance internally
-    - `problem.compute_feasibility(metrics)` — infinity-norm of normalized violations
-    - `problem.get_objective(metrics)` — returns `(max_elongation, minimize=True)`
-
- 2. **`_score_from_objective`** matches the official formula:
-    `score = 1 - clip((max_elongation - 1) / 9, 0, 1)`
-
- 3. **Multi-fidelity split** is correct:
-    - `run` actions use low-fidelity VMEC (~0.6s per eval)
-    - `submit` uses high-fidelity VMEC (~24s per eval)
-
- 4. **Constraint constants** match the official P1 definition:
-    - `aspect_ratio 4.0`
-    - `average_triangularity -0.5`
-    - `edge_rotational_transform / n_field_periods 0.3`
-
- ### Current implementation
-
- - The old `evaluate_params` helper has been retired.
- - The runtime is now split into:
-   - `build_boundary_from_params(...)` → `SurfaceRZFourier` (handles mode expansion + tri_scale injection)
-   - `evaluate_boundary(boundary, fidelity)` → `EvaluationMetrics` (pure evaluation, no parameterization knowledge)
-
- ---
-
- ## 6. Reward Analysis
-
- ### Reward V0 structure (current, in `server/environment.py`)
-
- ```
- Feasibility transition: ±3.0 on crossing the feasible/infeasible boundary
- Dual-track step shaping:
-   feasible + feasible → (prev_elongation - curr_elongation) * 10.0
-   otherwise           → (prev_feasibility - curr_feasibility) * 5.0
- Post-failure recovery: +1.0 on the first successful step after a failed evaluation
- Per-step cost: -0.1 for non-submit actions
- Terminal bonus (submit): 5.0 * improvement_ratio + budget_efficiency
- Terminal bonus (exhaust): 2.0 * improvement_ratio
- Not improved penalty: -1.0 (submit) / -0.5 (exhaust)
  ```
-
- ### Assessment
-
- **The reward is still simple and should stay largely unchanged.** It mostly uses two scalars
- from the verifier: `feasibility` and `objective (max_elongation)`. These are
- problem-agnostic quantities that `GeometricalProblem` provides for any problem variant.
-
- One small exception is now explicit: recovery from a failed VMEC evaluation gets a
- modest fixed bonus instead of comparing against the failure sentinel. The previous
- behavior could erase the recovery signal by comparing a successful step against itself,
- while a naive sentinel comparison would explode the reward into an unbounded spike.
-
- Things the reward correctly avoids:
- - Per-constraint shaping (would overfit to P1's specific constraint structure)
- - Tolerance-exploit bonus (would overfit to the 1% evaluator quirk)
- - Mode-specific or parameter-specific weighting
- - Any knowledge of which knob controls which metric
-
- **One thing to monitor during playtesting:** the `5.0` multiplier on feasibility shaping
- may need tuning once the action space changes. Mode perturbations produce different
- feasibility deltas per step than the old 3-knob steps. But tune from playtest data,
- not from theory.
-
- ---
-
- ## 7. Findings from the P1 Score Chase Notes
-
- From `ai-sci-feasible-designs/docs/P1_SCORE_CHASE_NOTES.md`:
-
- ### Best submission metrics (high-fidelity)
-
- ```
- max_elongation = 1.266744 → score = 0.970362
- aspect_ratio = 3.999377 (tight, but feasible)
- average_triangularity = -0.495236 (normalized violation ≈ 0.009529)
- iota/nfp = 0.298946 (normalized violation ≈ 0.003515)
- feasibility = 0.009529 (feasible due to 1% tolerance)
  ```

- ### Key patterns
-
- - The winning region is "thin": improving elongation pushes triangularity and/or iota
-   over the constraint cliff
- - The **1% feasibility tolerance** is a first-order effect: the best scores come from
-   intentionally pushing constraints to the edge of tolerance to squeeze more elongation
-   reduction
- - Triangularity is usually the binding constraint near the best scores; iota is second
- - The best submission used a full Fourier boundary (`mpol=8, ntor=8`, `n_field_periods=3`)
-   refined through multiple optimization stages, not the low-dimensional parameterization
-
- ### What this means for our environment
-
- Our feasible designs from the 4-knob sweep have `max_elongation ≈ 7.2-8.9`
- (score ≈ 0.12-0.31). The winning submission has `max_elongation = 1.27` (score 0.97).
- The gap is large — the 4-knob family can reach feasibility but cannot reach competitive
- scores. This is expected: the environment is a stepping stone for learning
- constraint satisfaction and basic optimization, not a path to the leaderboard.
-
- ---
-
- ## 8. Anti-Overfitting Design
-
- ### What constitutes overfitting in this context
-
- - Per-constraint reward weighting (e.g., "triangularity progress is worth 2x")
- - Reward bonuses for exploiting the 1% tolerance
- - Action-space design that hardcodes which modes matter
- - Single fixed starting state that agents memorize
-
- ### Agreed anti-overfitting levers
-
- 1. **Multiple frozen reset seeds first** (start with exact seeds; add bounded jitter only
-    if memorization becomes a real problem)
- 2. **Held-out evaluation seeds** (test generalization, not memorization)
- 3. **Reward based on official scalars only** (feasibility + objective, not per-constraint)
- 4. **Domain knowledge in initial state, not reward** (good baseline params in `reset()`,
-    not constraint-specific shaping in `_compute_reward`)
- 5. **Known winners as calibration fixtures only**, not optimization targets
-
- ---
-
- ## 9. Agreed Plan
-
- ### What everyone agrees on
-
- 1. **4-knob low-dimensional action space**: `aspect_ratio`, `elongation`,
-    `rotational_transform`, `triangularity_scale`
- 2. **Boundary-based verifier**: `build_boundary_from_params(...)` + `evaluate_boundary(...)`
- 3. **Explicit VMEC crash handling**: treat solver failures as bad-but-not-fatal
- 4. **Reward V0 unchanged in spirit**: feasibility-first, scalar-only, no P1-specific shaping
- 5. **Fidelity labeling in observation**: distinguish low-fi vs high-fi metrics
-
- ### What is deferred until after build + playtest
-
- - Exact `rotational_transform` range bounds (data suggests ~[1.2, 1.9] is useful)
- - Exact `triangularity_scale` delta values
- - Seed pool construction (needs an empirical sweep in the repaired parameterization)
- - Whether the budget should be 6 or 8
- - Any Reward V1 changes
-
- ### Implementation order
-
- 1. `server/physics.py` — boundary-based verifier interface
- 2. `fusion_lab/models.py` — action/observation/state for 4 knobs
- 3. `server/environment.py` — reset with seed pool, discrete knob perturbations, VMEC crash handling
- 4. `server/app.py` — expose new action schema
- 5. `baselines/` — random + heuristic (repair feasibility first, then reduce elongation)
- 6. Manual playtest — verify the budget is sufficient, tune ranges/deltas/seeds empirically
-
- ---
-
- ## 10. Cross-Validation Record
-
- This plan was cross-validated with an independent agent that:
-
- - Independently confirmed the 3-knob blocker
- - Independently confirmed the historical `tri_scale` implementation detail
- - Reproduced VMEC crash failures
- - Validated the layer decomposition (verifier / environment / parameterization)
-
- ### Pushbacks from the cross-validation agent and resolution
-
- | Pushback | Verdict | Resolution |
- |----------|---------|------------|
- | "10/228 feasible is unverified" | **Partially addressed.** The recorded sweep found feasible points in the repaired 4-knob family, but this exact count should be treated as an artifact-backed result, not a free-floating fact. | Keep the sweep note, and link or preserve the underlying artifact if this exact count will be cited elsewhere. |
- | "rt=1.8 comfortably feasible is too strong" | **Partially valid.** 44% crash rate at 1.8. | rt=1.6 is the true sweet spot: 0% crashes, 6 feasible. |
- | "Delta values are design proposals, not facts" | **Valid.** | Defer to post-build playtesting. |
- | "Seed pool not empirically validated" | **Valid.** | Methodology sound, execution pending. |
- | "Budget change to 8-10 is speculative" | **Valid.** | Keep 6 until playtest proves otherwise. |
-
- ---
-
- ## Appendix A: Key File Locations
-
- ### Fusion Design Lab (this repo)
-
- ```
- server/physics.py     — verifier (needs boundary-based refactor)
- server/environment.py — environment loop + reward V0
- fusion_lab/models.py  — action/observation/state schemas
- server/app.py         — FastAPI endpoints
- baselines/            — random + heuristic agents
- ```
-
- ### ai-sci-feasible-designs (reference repo)
-
- ```
- docs/harness/raw-session.md            — original winning session narrative
- docs/harness/P1_CAMPAIGN_POSTMORTEM.md — why campaigns found 0 feasible
- docs/P1_SCORE_CHASE_NOTES.md           — best P1 score details + approach
- scripts/search_p1_lowdim.py            — the 4-knob sweep script (git: 300c191)
- scripts/p1_alm_ngopt_multifidelity.py  — ALM+NGOpt optimizer (git: aba75b7)
- runs/p1_campaign*/world.sqlite         — campaign evaluation databases
- ```
-
- ## Appendix B: Boundary Construction Reference
-
- The 4-knob boundary construction, as implemented in the original
- `search_p1_lowdim.py`:
-
- ```python
- from constellaration.initial_guess import generate_rotating_ellipse
- from constellaration.geometry import surface_rz_fourier
- from constellaration.geometry.surface_rz_fourier import SurfaceRZFourier
- import numpy as np
-
- def build_boundary(aspect_ratio, elongation, rotational_transform, tri_scale, nfp=3):
-     # 1. Generate base rotating-ellipse shape
-     surface = generate_rotating_ellipse(
-         aspect_ratio=aspect_ratio,
-         elongation=elongation,
-         rotational_transform=rotational_transform,
-         n_field_periods=nfp,
-     )
-
-     # 2. Expand to higher Fourier modes
-     surface = surface_rz_fourier.set_max_mode_numbers(
-         surface, max_poloidal_mode=3, max_toroidal_mode=3,
-     )
-
-     # 3. Inject triangularity via the m=2, n=0 Fourier mode
-     r_cos = np.asarray(surface.r_cos, dtype=float).copy()
-     z_sin = np.asarray(surface.z_sin, dtype=float).copy()
-     center = r_cos.shape[1] // 2
-     minor = float(r_cos[1, center])
-
-     r_cos[2, center] = -tri_scale * minor
-
-     # 4. Clean auxiliary modes
-     r_cos[0, :center] = 0.0
-     z_sin[0, :center + 1] = 0.0
-
-     return SurfaceRZFourier(
-         r_cos=r_cos.tolist(),
-         z_sin=z_sin.tolist(),
-         n_field_periods=nfp,
-         is_stellarator_symmetric=True,
-     )
- ```
-
- Key details:
- - `r_cos[1, center]` is the minor radius of the base shape
- - `r_cos[2, center]` is the `m=2, n=0` Fourier coefficient (controls triangularity)
- - Setting it to `-tri_scale * minor` produces negative triangularity proportional to `tri_scale`
- - The auxiliary mode cleanup ensures the boundary is well-conditioned for VMEC

  # P1 Parameterization Deep-Dive

+ **Date:** March 7, 2026
+ **Role:** Evidence and rationale record
+ **Status:** Supporting doc, not a live planning or contract SSOT

+ This document records the durable evidence behind the repaired low-dimensional `P1` environment:

+ - why the historical 3-knob family failed
+ - what the original winning session actually did
+ - what the recorded 4-knob sweep proved
+ - why the current environment is intentionally a playable stepping stone rather than a leaderboard-matching optimizer

+ ## 1. Structural Blocker

  ### Symptom

+ The old 3-parameter action space:

+ - `aspect_ratio`
+ - `elongation`
+ - `rotational_transform`

+ could not satisfy the `P1` constraints under the real `constellaration` verifier path.
+ ### Evidence

+ A 125-point grid sweep over the historical 3-knob range produced `0/125` feasible designs.

+ Observed behavior:

+ - `average_triangularity` stayed near `+0.005`
+ - `p1_feasibility` stayed near `1.00995`
+ - varying `n_field_periods` did not resolve the blocker
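Those recorded numbers are internally consistent. A minimal arithmetic check, assuming each violation is normalized by the constraint magnitude (an assumption that matches the recorded values; the official implementation lives in `constellaration`):

```python
def p1_normalized_violations(aspect_ratio, average_triangularity, edge_iota_over_nfp):
    """Normalized P1 constraint violations (normalization by constraint magnitude is assumed)."""
    return {
        "aspect_ratio": max(0.0, (aspect_ratio - 4.0) / 4.0),
        "average_triangularity": max(0.0, (average_triangularity - (-0.5)) / 0.5),
        "edge_iota_over_nfp": max(0.0, (0.3 - edge_iota_over_nfp) / 0.3),
    }

# Triangularity stuck near +0.005 while the other constraints are satisfied:
violations = p1_normalized_violations(3.5, 0.004975, 0.35)
feasibility = max(violations.values())  # infinity-norm of violations -> 1.00995
```

With triangularity pinned slightly positive, the infinity-norm is dominated by that single term, which is exactly the recorded `p1_feasibility ≈ 1.00995`.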

+ ### Root Cause

+ `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` does not meaningfully expose the Fourier mode that controls triangularity.

+ The historical `rotational_transform` range was also too low to reliably reach the `edge_iota_over_nfp >= 0.3` requirement.

+ ## 2. Original Winning Session

+ The original successful `P1` path in `ai-sci-feasible-designs` did not rely on the raw 3-knob family alone.

+ The winning session:

+ 1. built a low-dimensional sweep with a fourth knob
+ 2. found feasible seeds quickly
+ 3. refined around those seeds with stronger optimizers
+ 4. used leaderboard-quality anchors later in the pipeline

+ ### Missing Fourth Knob

+ The historical script added `tri_scale` by injecting the `m=2, n=0` Fourier mode after generating the base rotating-ellipse shape.

+ That missing triangularity control is the key reason the raw 3-knob family was structurally blocked.
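A minimal sketch of that injection on raw coefficient arrays. The array layout (n = 0 modes in the center column) and the auxiliary-mode cleanup follow the recovered historical script; the `SurfaceRZFourier` round-trip through `constellaration` is deliberately omitted here:

```python
import numpy as np

def inject_triangularity(r_cos, z_sin, tri_scale):
    """Set the m=2, n=0 boundary mode to -tri_scale * minor_radius.

    r_cos / z_sin: (mpol+1, 2*ntor+1) Fourier coefficient arrays with the
    n = 0 modes in the center column, as in the recovered historical script.
    """
    r_cos = np.asarray(r_cos, dtype=float).copy()
    z_sin = np.asarray(z_sin, dtype=float).copy()
    center = r_cos.shape[1] // 2             # n = 0 column
    minor_radius = float(r_cos[1, center])   # m=1, n=0 coefficient
    # This is the knob the 3-knob family never exposed:
    r_cos[2, center] = -tri_scale * minor_radius
    # Clean auxiliary m=0 modes so the boundary stays well-conditioned for VMEC
    r_cos[0, :center] = 0.0
    z_sin[0, :center + 1] = 0.0
    return r_cos, z_sin
```

Negative `r_cos[2, center]` drives `average_triangularity` negative, which is what the `average_triangularity <= -0.5` constraint requires.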
+ ### Recovered Useful Ranges

+ The original script used substantially different useful ranges than the blocked runtime:

+ ```text
+ aspect_ratio: [3.0, 3.6]
+ elongation: [1.4, 2.2]
+ rotational_transform: [1.5, 2.2]
+ tri_scale: [0.55, 0.8]
  ```
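For reference, drawing a random candidate inside those recovered ranges is one uniform draw per knob (the dict layout here is illustrative, not the repo's schema):

```python
import random

# Recovered historical ranges (see above)
HISTORICAL_RANGES = {
    "aspect_ratio": (3.0, 3.6),
    "elongation": (1.4, 2.2),
    "rotational_transform": (1.5, 2.2),
    "tri_scale": (0.55, 0.8),
}

def sample_knobs(rng: random.Random) -> dict:
    """Uniform draw inside the recovered ranges."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in HISTORICAL_RANGES.items()}

knobs = sample_knobs(random.Random(0))
```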

+ ## 3. Harness Campaign Comparison

+ Recorded `P1` campaign runs in `ai-sci-feasible-designs` also found zero feasible candidates.

+ That failure does not disprove the repaired low-dimensional path. It mostly shows that the campaign guidance and search style diverged from the winning approach:

+ - the campaigns pushed the agent away from broad low-dimensional exploration
+ - the winning session did broad sweeps and large early moves
+ - the campaign path used richer Fourier candidates, but not the same successful cold-start behavior

+ ## 4. Recorded 4-Knob Sweep

+ A recorded 4-knob sweep using explicit triangularity injection showed that the repaired family can reach `P1` feasibility.

+ Recorded sweep family:

+ ```text
+ aspect_ratio: [3.2, 3.8]
+ elongation: [1.2, 1.8]
+ rotational_transform: [1.2, 1.8]
+ tri_scale: [0.4, 0.7]
+ n_field_periods: 3
+ mpol / ntor: 3 / 3
  ```
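The long-form notes recorded this as a 256-point sweep, which implies four evenly spaced values per knob (4^4 = 256). A sketch of that enumeration, under that assumption:

```python
from itertools import product

def grid(lo, hi, n=4):
    """n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 10) for i in range(n)]

# Four values per knob, matching the recorded 256-point sweep
configs = list(product(
    grid(3.2, 3.8),  # aspect_ratio
    grid(1.2, 1.8),  # elongation
    grid(1.2, 1.8),  # rotational_transform
    grid(0.4, 0.7),  # tri_scale
))
```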

+ What that sweep established:

+ - explicit triangularity control fixes the structural blocker
+ - repaired-family feasibility is reachable in principle
+ - repaired-family defaults still need measured calibration before they should be narrated as stable

+ ## 5. Verifier Alignment Evidence

+ The current runtime's verifier alignment is sound:

+ - the official `GeometricalProblem` API is used for feasibility and objective semantics
+ - score conversion matches the official `P1` objective direction
+ - the runtime split is boundary-based: build the boundary first, then evaluate it
+ - low-fidelity `run` and high-fidelity `submit` are treated as separate truth surfaces

+ This matters because the repair belongs in the boundary family, not in redefined verifier semantics.
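A compact restatement of the two official scalars the environment consumes, using the score conversion recorded in this repo's notes (`score = 1 - clip((max_elongation - 1) / 9, 0, 1)` and the 1% feasibility tolerance):

```python
def p1_score(max_elongation: float) -> float:
    """Score from the P1 objective; lower max_elongation is better."""
    clipped = min(max((max_elongation - 1.0) / 9.0, 0.0), 1.0)
    return 1.0 - clipped

def is_feasible(feasibility: float, tolerance: float = 0.01) -> bool:
    """feasibility = infinity-norm of normalized constraint violations."""
    return feasibility <= tolerance

# Recorded best high-fidelity submission: max_elongation = 1.266744
best = p1_score(1.266744)  # ~0.970362, feasible at feasibility = 0.009529
```

Everything else in the reward and observation design can be expressed in terms of these two numbers, which is what keeps the environment problem-agnostic.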

+ ## 6. Reward Implications

+ The repaired family changes what is possible, but it does not justify a complicated reward.

+ The main reward conclusions remain:

+ - keep the reward tied to official verifier scalars
+ - keep feasibility-first behavior
+ - do not add per-constraint or knob-specific shaping
+ - tune from playtest evidence, not from theory alone
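A sketch of the feasibility-first step reward those bullets imply. The constants mirror the recorded Reward V0 notes and should be read as tunable assumptions, not as the live `server/environment.py` implementation:

```python
def reward_v0_step(prev_feasible: bool, curr_feasible: bool,
                   prev_elongation: float, curr_elongation: float,
                   prev_feasibility: float, curr_feasibility: float) -> float:
    reward = -0.1  # per-step cost for non-submit actions
    # Feasibility transition bonus / penalty
    if curr_feasible and not prev_feasible:
        reward += 3.0
    elif prev_feasible and not curr_feasible:
        reward -= 3.0
    # Dual-track shaping: objective progress when feasible, else feasibility progress
    if prev_feasible and curr_feasible:
        reward += (prev_elongation - curr_elongation) * 10.0
    else:
        reward += (prev_feasibility - curr_feasibility) * 5.0
    return reward
```

For example, an infeasible-to-feasible transition that also reduced the violation from 0.02 to 0.0 scores -0.1 + 3.0 + 0.1 = 3.0. Note that only the two official scalars appear; no individual constraint is named.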

+ ## 7. Why The Environment Is Still Valid

+ The repaired 4-knob family is not a leaderboard-matching optimizer. That is acceptable for this repo.

+ The purpose of the environment is to:

+ - teach and evaluate constrained design behavior
+ - keep the observation/action/reward loop legible
+ - preserve an explainable path from action to verifier feedback

+ The winning high-fidelity score chase used a much richer downstream optimization pipeline. This repo does not need to reproduce that full pipeline to be a valid hackathon environment artifact.

+ ## 8. Design Implications Kept From This Analysis

+ - keep multiple frozen reset seeds rather than one memorized starting state
+ - keep the reward based on official scalars rather than hand-coded constraint bonuses
+ - keep known winners as calibration fixtures, not direct reward targets
+ - keep domain knowledge in seeds and fixtures, not in opaque reward tricks

+ ## 9. Primary References

+ Fusion Design Lab:

+ - [`server/physics.py`](../server/physics.py)
+ - [`server/environment.py`](../server/environment.py)
+ - [`fusion_lab/models.py`](../fusion_lab/models.py)
+ - [`docs/P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md)

+ Reference repo:

+ - `ai-sci-feasible-designs/docs/harness/raw-session.md`
+ - historical `scripts/search_p1_lowdim.py`
+ - `ai-sci-feasible-designs/docs/P1_SCORE_CHASE_NOTES.md`
+ - `ai-sci-feasible-designs/docs/harness/P1_CAMPAIGN_POSTMORTEM.md`
 
docs/PIVOT_P1_ROTATING_ELLIPSE.md DELETED
@@ -1,304 +0,0 @@
- # Pivot: P1 Rotating-Ellipse Environment
-
- **Date:** 2026-03-07
- **Status:** Supporting decision record, superseded as planning SSOT by `FUSION_DESIGN_LAB_PLAN_V2.md`
- **Supersedes:** Synthetic physics model in current `server/physics.py`
-
- Use this file as rationale for the pivot, not as a fresh planning queue. Once the pivot is accepted, implementation should follow the SSOT plan docs.
-
- ## Current Branch Status
-
- - [x] pivot accepted
- - [x] historical upstream 3-knob rotating-ellipse `P1` contract was implemented and evaluated
- - [x] `constellaration` verifier path is wired
- - [x] historical upstream 3-knob family is verified as blocked on P1 triangularity
- - [x] repaired low-dimensional family with explicit triangularity control is implemented
- - [ ] tracked fixtures are added
- - [ ] manual playtest evidence is recorded
- - [ ] heuristic baseline is refreshed for the real verifier path
-
- Current caution:
-
- - the upstream rotating-ellipse family remains useful as a seed generator, but the live environment action family is the repaired rotating-ellipse-derived 4-knob contract
-
- ## Decision
-
- Pivot the OpenEnv environment to use the official ConStellaration P1 benchmark with real VMEC physics, scoped to the rotating-ellipse low-dimensional parameter space.
-
- This borrows the strongest low-dimensional entry point from the proven winning approach documented in `raw-session.md`, not the full approach.
-
- ## What Was Validated
-
- | Claim | Status | Source |
- |---|---|---|
- | P1 is the cleanest benchmark task | Verified | `problems.py:113` — minimize max_elongation, 3 constraints, no QI |
- | P1 skips QI | Verified | `problems.py:145` — `_does_it_require_qi = False` |
- | Low-fidelity eval is fast enough | Measured | 0.63s per eval on local machine; postmortem says ~1s/eval |
- | High-fidelity eval is expensive | Measured | 24s per eval; only viable for final validation |
- | Rotating-ellipse can find P1-feasible designs | Verified | `raw-session.md`: sweeps found 3 feasible designs in ~20 min |
- | vmecpp installs from wheels | Verified | `uv pip install vmecpp==0.4.7` resolves cleanly, no compilation |
- | constellaration Dockerfile is not bloated | Verified | `python:3.10-slim` + `pip install constellaration` |
- | Current seed logic is too loose for P1 | Verified | `seeds.py:42`: triangularity override 0.05 vs constraint -0.5 |
- | Full harness should not be ported | Verified | Postmortem: prescriptive harness produced 0 feasible candidates |
-
- ## What Is Hypothesis (Not Yet Validated)
-
- 1. **6 actions is enough** to reach or improve P1 feasibility from a rotating-ellipse starting point. Must be validated by manual playtest immediately.
- 2. **Discretized rotating-ellipse perturbations** create non-trivial decision pressure (not too easy, not impossible).
- 3. **Low-fidelity metrics** are close enough to high-fidelity P1 scoring that the low-fi reward signal is meaningful.
- 4. **The Docker image** builds and deploys on HF Spaces within reasonable time/size limits.
-
- ## Environment Design
-
- ### Single Task
-
- Improve a stellarator boundary's P1 score using a rotating-ellipse-derived low-dimensional parameterization under the official ConStellaration P1 constraints.
-
- ### P1 Constraints (from `GeometricalProblem`)
-
- - aspect_ratio <= 4.0
- - average_triangularity <= -0.5
- - edge_rotational_transform / n_field_periods >= 0.3
-
- ### P1 Objective
-
- Minimize `max_elongation`. Score = `1 - clip((max_elongation - 1) / 9, 0, 1)`.
-
- Feasibility tolerance: normalized constraint violations <= 1% (0.01).
-
- ### Parameter Space
-
- Historical upstream seed generator:
-
- | Parameter | Role | Typical range |
- |---|---|---|
- | `aspect_ratio` | Width-to-height ratio of the boundary | 2.0 - 8.0 |
- | `elongation` | Vertical stretching of cross-section | 1.0 - 5.0 |
- | `rotational_transform` | Magnetic field line winding | 0.1 - 1.0 |
- | `n_field_periods` | Fixed at 3 (not an action) | 3 |
-
- These map to `constellaration.initial_guess.generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)`, which returns a `SurfaceRZFourier` boundary in ~4ms.
-
- Historical blocker:
-
- - on the real low-fidelity verifier path, sampled 3-knob points kept `average_triangularity` at roughly `+0.004975`
- - sampled `p1_feasibility` stayed at roughly `1.00995`
- - no sampled point was feasible
-
- Current live environment family:
-
- | Parameter | Role | Current implementation range |
- |---|---|---|
- | `aspect_ratio` | Width-to-height ratio of the repaired boundary | 3.2 - 3.8 |
- | `elongation` | Vertical stretching of cross-section | 1.2 - 1.8 |
- | `rotational_transform` | Magnetic field line winding | 1.2 - 1.9 |
- | `triangularity_scale` | Explicit triangularity control | 0.4 - 0.7 |
- | `n_field_periods` | Fixed at 3 (not an action) | 3 |
-
- These ranges describe the live implementation in `server/environment.py`. They are still subject to measured sweep and playtest refinement.
-
- ### Action Space
-
- Current action space:
-
- ```
- intent: "run" | "submit" | "restore_best"
- parameter: "aspect_ratio" | "elongation" | "rotational_transform" | "triangularity_scale"
- direction: "increase" | "decrease"
- magnitude: "small" | "medium" | "large"
- ```
-
- Current implementation deltas:
-
- | Parameter | small | medium | large |
- |---|---|---|---|
- | aspect_ratio | 0.05 | 0.10 | 0.20 |
- | elongation | 0.05 | 0.10 | 0.20 |
- | rotational_transform | 0.05 | 0.10 | 0.20 |
- | triangularity_scale | 0.02 | 0.05 | 0.10 |
-
- ### Episode Flow
-
- 1. Reset: generate the initial boundary from baseline rotating-ellipse parameters (+ optional seed perturbation). Run the low-fi forward_model. Return the initial observation.
- 2. Agent chooses an action.
- 3. If `run`: modify the parameter, regenerate the boundary, run the low-fi forward_model (~0.6s). Return diagnostics + reward.
- 4. If `restore_best`: revert to the best-known parameters, re-evaluate low-fidelity metrics, and charge a budget step.
- 5. If `submit`: end the episode. Optionally run high-fi for the final score.
- 6. Episode ends on `submit` or budget exhaustion.
-
- ### Budget
-
- 6 evaluations per episode. All non-submit actions cost 1 budget.
-
- ### Observation
-
- ```
- diagnostics_text: str           # human-readable summary
- max_elongation: float           # P1 objective (minimize)
- aspect_ratio: float             # constraint: <= 4.0
- average_triangularity: float    # constraint: <= -0.5
- edge_iota_over_nfp: float       # constraint: >= 0.3
- p1_score: float                 # current step-time score
- p1_feasibility: float           # current step-time max normalized constraint violation
- constraints_satisfied: bool     # feasibility <= 0.01
- vacuum_well: float              # stability indicator
- evaluation_fidelity: "low" | "high"
- evaluation_failed: bool
- failure_reason: str
- step_number: int
- budget_remaining: int
- best_low_fidelity_score: float
- best_low_fidelity_feasibility: float
- best_high_fidelity_score: float | None
- best_high_fidelity_feasibility: float | None
- target_spec: str
- ```
-
- Current requirement:
-
- - the observation and diagnostics text should make the low-fi vs high-fi distinction explicit
- - best-state reporting should be split explicitly between low-fidelity rollout state and high-fidelity submit state
- - do not narrate low-fi and high-fi best-state fields as one combined metric
-
- ### Reward V0
-
- Feasibility-first, then objective improvement:
-
- ```
- if constraints newly satisfied:
-     +3.0
- if constraints newly violated:
-     -3.0
-
- if feasible:
-     reward += (prev_elongation - curr_elongation) * 10.0  # improvement in objective
- else:
-     reward += (prev_feasibility - curr_feasibility) * 5.0  # progress toward feasibility
-
- per-step cost: -0.1
-
- submit bonus (if feasible and improved):
-     +5.0 * improvement_ratio + 1.0 * budget_efficiency
- submit penalty (if infeasible or no improvement):
-     -1.0
- ```
-
- This puts feasibility first. An agent that achieves feasibility and then minimizes elongation gets rewarded. An agent that never reaches feasibility gets penalized.
-
- Current execution note:
-
- - keep the reward mostly scalar and verifier-driven
- - keep parameterization repair and reward semantics separate
- - do not add mode- or constraint-specific reward hacks to compensate for a blocked action family
-
- ### State
-
- ```
- step_count: int
- current_params: {aspect_ratio, elongation, rotational_transform, triangularity_scale}
- best_params: {aspect_ratio, elongation, rotational_transform, triangularity_scale}
- initial_low_fidelity_score: float
- initial_high_fidelity_score: float | None
- best_low_fidelity_score: float
- best_low_fidelity_feasibility: float
- best_high_fidelity_score: float | None
- best_high_fidelity_feasibility: float | None
- history: list[str]
- ```
-
- ## Two Designs That Were Considered
-
- | | Rotating-ellipse env | Curated-seed Fourier-repair env |
- |---|---|---|
- | Action space | 4 parameters (AR, elongation, rotational transform, triangularity scale) | N Fourier modes |
- | Starting point | Generated from parameters | Frozen from HF dataset |
- | Interpretability | High — parameters map to physical shape | Lower — mode perturbations are abstract |
- | Dataset dependency | None at runtime | Requires offline curation |
- | Search space coverage | Low-dimensional subfamily | Full boundary space |
- | Hackathon viability | High | Medium (needs pre-work) |
-
- **Decision:** Rotating-ellipse for the hackathon. It is self-contained, human-playable, and proven as a viable entry point for P1.
-
- **What it does NOT claim:** Full coverage of the P1 boundary design space. This is a tradeoff accepted for hackathon scope.
-
- ## Implementation Order
-
- ### Phase 1: Physics Backend (~1 hour)
-
- Status: done.
-
- Rewrite `server/physics.py` to wrap:
- - `constellaration.initial_guess.generate_rotating_ellipse` for boundary generation
- - `constellaration.forward_model.forward_model` with low-fi settings for evaluation
- - `constellaration.problems.GeometricalProblem` for official P1 scoring on every evaluation
-
- ### Phase 2: Environment Contract (~1 hour)
-
- Status: done.
-
- Update `server/environment.py`:
- - New observation schema with P1 metrics
- - New action schema for rotating-ellipse perturbations
- - Reward V0 with feasibility-first logic
- - Terminal conditions
-
- Update `fusion_lab/models.py` for new schemas.
-
- ### Phase 3: Manual Playtest (~30 min)
-
- Status: open.
-
- Validate hypothesis: "6 actions is enough" on the repaired low-dimensional family.
- - Play 5-10 episodes manually
- - Log: can a human reach feasibility? Improve elongation?
- - Tune magnitude deltas if needed
255
- - Document at least one pathology or adjustment
256
-
257
- ### Phase 4: Baselines (~30 min)
258
-
259
- Status: partial. Baselines exist, but the heuristic needs refresh on the real verifier path.
260
-
261
- - Random agent
262
- - Heuristic agent (greedy toward known-good parameter region)
263
- - Comparison table
264
-
265
- ### Phase 5: Deploy + Evidence (~2 hours)
266
-
267
- Status: open.
268
-
269
- - Update Dockerfile/deps for constellaration
270
- - `openenv validate` + `openenv push`
271
- - Colab notebook connecting to live environment
272
- - 1-minute demo video
273
-
274
- This section exists to justify the pivot with an implementation path. It should not trigger another strategy pass when the same work is already covered by the SSOT plan and checklist.
275
-
276
- ## Fallback
277
-
278
- If full high-fidelity `constellaration` deployment fails (Docker build, HF Spaces issues):
279
- - Keep the low-fidelity `constellaration` run path working
280
- - Fall back to a low-fidelity-only hosted environment and document the limitation clearly
281
- - Do not spend more than 1 hour debugging deployment before falling back
282
-
283
- ## Known-Good Fixtures
284
-
285
- Start with the frozen repaired-family reset seeds in `server/contract.py` and expand only if the implementation needs more coverage:
286
-
287
- 1. **Reset seed:** aspect_ratio=3.6, elongation=1.4, rotational_transform=1.5, triangularity_scale=0.55
288
- 2. **Reset seed:** aspect_ratio=3.4, elongation=1.4, rotational_transform=1.6, triangularity_scale=0.55
289
- 3. **Reset seed:** aspect_ratio=3.8, elongation=1.4, rotational_transform=1.5, triangularity_scale=0.55
290
- 4. **Deliberately bad reference:** keep a clearly infeasible boundary only as a negative verifier/reward sanity check
291
-
292
- These are for verifier/reward sanity, not a prerequisite seed-mining project.
293
-
294
- ## What Not To Do
295
-
296
- - Do not port the full ai-sci-feasible-designs harness or governor stack.
297
- - Do not make the task "agent writes arbitrary optimization scripts."
298
- - Do not stream the full HF dataset at runtime.
299
- - Do not mix rotating-ellipse and Fourier-repair action spaces.
300
- - Do not pretend the upstream 3-knob family is enough for P1 after the verified triangularity blocker.
301
- - Do not use high-fidelity eval for interactive steps (24s is too slow).
302
- - Do not narrate "6 actions is enough" as validated until manually playtested.
303
- - Do not claim full P1 boundary space coverage. The env uses a low-dim subfamily.
304
- - Do not reopen the task-selection debate after the pivot is already accepted unless a blocker forces it.
docs/archive/FUSION_DELIVERABLES_MAP.md ADDED
@@ -0,0 +1,20 @@
+ # Fusion Design Lab Deliverables Map
+
+ **Role:** Compatibility doc kept for link stability
+ **Status:** Not an active SSOT surface
+
+ The deliverables mapping now lives in [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md):
+
+ - artifact plan
+ - compute surface roles
+ - evidence order
+ - success gates
+ - fallback rules
+
+ Use these docs instead:
+
+ - planning and execution: [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md)
+ - live technical contract: [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md)
+ - blocker and sweep evidence: [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md)
+
+ This file no longer carries branch status, execution order, or a separate planning queue.
docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md ADDED
@@ -0,0 +1,21 @@
+ # Fusion Design Lab Next 12 Hours Checklist
+
+ **Role:** Compatibility doc kept for link stability
+ **Status:** Not an active SSOT surface
+
+ The old dated checklist created a second execution-order source and went stale immediately.
+
+ Use these docs instead:
+
+ - planning and live execution order: [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md)
+ - technical task contract: [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md)
+ - evidence behind the repaired parameterization: [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md)
+
+ Current execution priority remains:
+
+ 1. measured sweep
+ 2. tracked fixtures
+ 3. manual playtest
+ 4. heuristic baseline refresh
+ 5. HF Space proof
+ 6. notebook, demo, and repo polish
docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md ADDED
@@ -0,0 +1,27 @@
+ # Pivot: P1 Rotating-Ellipse Environment
+
+ **Date:** March 7, 2026
+ **Role:** Short decision record
+ **Status:** Not an active SSOT surface
+
+ ## Decision
+
+ Pivot the environment to the official `P1` benchmark with real `constellaration` physics and a repaired low-dimensional boundary family derived from a rotating-ellipse seed.
+
+ ## Why
+
+ - the historical upstream 3-knob family could not satisfy the `P1` triangularity requirement
+ - the repaired 4-knob family restored a meaningful low-dimensional control surface
+ - the hackathon artifact needs a narrow, legible, human-playable environment rather than a transplanted optimization harness
+
+ ## What This Does Not Mean
+
+ - it does not make the low-dimensional family the full `P1` design space
+ - it does not justify porting the old `ai-sci-feasible-designs` harness
+ - it does not settle repaired-family ranges, deltas, or budget choices without measurement
+
+ ## Where The Live Truth Now Lives
+
+ - planning and execution: [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md)
+ - technical contract: [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md)
+ - blocker and sweep evidence: [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md)
docs/archive/README.md ADDED
@@ -0,0 +1,15 @@
+ # Archived Docs
+
+ This directory keeps legacy planning docs that are no longer active SSOT surfaces.
+
+ Current live docs:
+
+ - [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md) for planning and execution order
+ - [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md) for the live technical contract
+ - [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md) for blocker evidence and supporting rationale
+
+ Archived here:
+
+ - `FUSION_DELIVERABLES_MAP.md`
+ - `FUSION_NEXT_12_HOURS_CHECKLIST.md`
+ - `PIVOT_P1_ROTATING_ELLIPSE.md`
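
Review note: the Reward V0 pseudocode in the removed planning section can be read as the per-step rule below. This is a hedged sketch only — the function name `step_reward`, its signature, and the interpretation of `feasibility` as a distance that should shrink are assumptions; the authoritative reward logic lives in `server/environment.py`, and the submit bonus/penalty is handled separately there.

```python
def step_reward(
    prev_feasible: bool,
    curr_feasible: bool,
    prev_elongation: float,
    curr_elongation: float,
    prev_feasibility: float,
    curr_feasibility: float,
) -> float:
    """Hypothetical per-step Reward V0 as described in the plan's pseudocode."""
    reward = -0.1  # per-step cost
    if curr_feasible and not prev_feasible:
        reward += 3.0  # constraints newly satisfied
    elif prev_feasible and not curr_feasible:
        reward -= 3.0  # constraints newly violated
    if curr_feasible:
        # improvement in the P1 objective (lower elongation is better)
        reward += (prev_elongation - curr_elongation) * 10.0
    else:
        # progress toward feasibility (shrinking feasibility distance is better)
        reward += (prev_feasibility - curr_feasibility) * 5.0
    return reward
```

Under this reading, a step that newly reaches feasibility while shaving elongation from 2.0 to 1.9 scores roughly +3.9, while a step that breaks feasibility is penalized even if elongation looked better — matching the plan's "feasibility first" intent.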