CreativeEngineer committed on
Commit e815b38 · 1 Parent(s): 6deaccc

docs: simplify and archive planning surfaces
AGENTS.md CHANGED
@@ -18,15 +18,15 @@ Training is supporting evidence. Do not let the repo drift into training-first w

 ## Source of Truth

-Use these docs as the planning SSOT:
+Use these docs as the repo documentation SSOT:

-- `docs/FUSION_DESIGN_LAB_PLAN_V2.md`
-- `docs/FUSION_DELIVERABLES_MAP.md`
-- `docs/FUSION_NEXT_12_HOURS_CHECKLIST.md`
+- `docs/FUSION_DESIGN_LAB_PLAN_V2.md` for planning and execution order
+- `docs/P1_ENV_CONTRACT_V1.md` for the live technical contract
+- `docs/P1_PARAMETERIZATION_DEEPDIVE.md` for blocker evidence and supporting rationale

-`docs/PIVOT_P1_ROTATING_ELLIPSE.md` is a supporting decision record, not a planning SSOT. If it disagrees with the three docs above, the three SSOT docs win.
+Legacy planning docs are archived under `docs/archive/`. They are not active SSOT surfaces.

-`docs/P1_ENV_CONTRACT_V1.md` is a supporting technical spec for the current implementation phase. It should refine the SSOT docs, not silently diverge from them.
+`docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md` is a short supporting decision record, not a planning SSOT.

 If code and docs disagree, either:
README.md CHANGED
@@ -16,7 +16,11 @@ A trained model is optional for this repo's submission story. A public Colab not

 ## Current Status

-This repository is the clean hackathon workspace. The detailed planning docs live in `docs/FUSION_DESIGN_LAB_PLAN_V2.md`, `docs/FUSION_DELIVERABLES_MAP.md`, and `docs/FUSION_NEXT_12_HOURS_CHECKLIST.md`.
+This repository is the clean hackathon workspace. The live docs now split cleanly by role:
+
+- planning and execution: `docs/FUSION_DESIGN_LAB_PLAN_V2.md`
+- technical contract: `docs/P1_ENV_CONTRACT_V1.md`
+- blocker and sweep evidence: `docs/P1_PARAMETERIZATION_DEEPDIVE.md`

 Implementation status:

@@ -25,7 +29,8 @@ Implementation status:
 - shared models, baselines, and server/client entry points now reflect the locked `P1` contract
 - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
 - the repaired 4-knob low-dimensional family is now wired into the runtime path
-- the next runtime work is measured sweep validation, fixtures, manual playtesting, heuristic refresh, and deployment evidence
+- the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
+- the next runtime work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, and deployment evidence

 ## Execution Status

@@ -46,7 +51,7 @@ Implementation status:
 - [x] Add explicit VMEC failure semantics to the environment contract
 - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
 - [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
-- [ ] Add tracked `P1` fixtures under `server/data/p1/`
+- [x] Add tracked `P1` fixtures under `server/data/p1/`
 - [ ] Run manual playtesting and record the first reward pathology
 - [ ] Refresh the heuristic baseline for the real verifier path
 - [ ] Deploy the real environment to HF Space
@@ -54,19 +59,20 @@ Implementation status:
 ## Known Gaps

 - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
-- The repaired family now uses frozen exact seeds with explicit triangularity control. Those seeds are near-boundary references, not yet tracked fixtures.
-- The repaired low-dimensional family still needs measured ranges and deltas. Do not narrate guessed `rotational_transform` bounds, `triangularity_scale` deltas, or a larger budget as validated facts until they are measured on the repaired environment.
+- The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
+- The tracked fixtures in `server/data/p1/` are currently low-fidelity-calibrated. Do not narrate them as fully paired low-fi/high-fi references until the submit-side spot checks land.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
 - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
+- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next meaningful playtest step is a real `submit` trace, not more abstract reward debate.

 Current mode:

 - strategic task choice is already locked
-- the next work is measured sweep validation, then fixtures, manual playtesting, heuristic refresh, smoke validation, and deployment
+- the next work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, smoke validation, and deployment
 - new planning text should only appear when a real blocker forces a decision change

 ## Planned Repository Layout
@@ -120,14 +126,13 @@ uv sync --extra notebooks

 ## Immediate Next Steps

-1. Run a small measured sweep on the repaired family to choose useful ranges, deltas, and reset seeds.
-2. Verify that observation semantics are human-readable and that low-fi `run` versus high-fi `submit` best-state reporting is not ambiguous.
-3. Add tracked `P1` fixtures under `server/data/p1`.
-4. Run manual playtest episodes and record the first real reward pathology, if any.
-5. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
-6. Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
-7. Deploy the environment to HF Space.
-8. Add the Colab notebook under `training/notebooks`.
+- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
+- [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
+- [ ] Run submit-side manual playtest episodes and record the first real reward pathology, if any.
+- [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
+- [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
+- [ ] Deploy the environment to HF Space.
+- [ ] Add the Colab notebook under `training/notebooks`.

 These are implementation steps, not another planning phase.
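The low-fi `run` / high-fi `submit` split, the budget asymmetry, and the post-terminal guard described in the README diff above can be sketched as a toy loop. Everything here (the `ToyP1Env` class, observation keys, and reward numbers) is illustrative stand-in code, not this repo's actual environment API:

```python
class ToyP1Env:
    """Stand-in environment with fabricated numbers, only to show the loop shape:
    low-fidelity metrics per step, high-fidelity truth only at explicit submit."""

    def __init__(self, budget=3):
        self._initial_budget = budget
        self.budget = budget
        self.done = False

    def reset(self):
        self.budget, self.done = self._initial_budget, False
        return {"budget_remaining": self.budget}

    def step(self, knobs):
        # post-terminal guard: stepping a finished episode is an error
        assert not self.done, "episode already terminal"
        self.budget -= 1  # every evaluation spends budget, even a failed one
        if self.budget == 0:
            self.done = True
            # budget exhaustion deliberately pays less than an explicit submit
            return {"budget_remaining": 0, "terminal_reward": 0.1}
        lowfi_score = -abs(sum(knobs))  # toy low-fidelity metric, not final truth
        return {"budget_remaining": self.budget, "lowfi_score": lowfi_score}

    def submit(self):
        assert not self.done, "episode already terminal"
        self.done = True
        # explicit submit is where the high-fidelity re-evaluation would run
        return {"terminal_reward": 1.0, "high_fidelity_basis": True}
```

The point of the asymmetry is visible in the numbers: running out of budget yields a smaller terminal reward than submitting on purpose, so an agent should still prefer a deliberate `submit`.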
 
TODO.md CHANGED
@@ -2,19 +2,24 @@

 This is the execution tracker for the hackathon repo.

-Use this file for day-of build progress. Use the linked docs for rationale, sequencing, and submission framing:
+Use this file for day-of build progress. Use the linked docs for rationale, contract truth, and submission framing:

 - [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
-- [Deliverables Map](docs/FUSION_DELIVERABLES_MAP.md)
-- [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
 - [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
-- [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+- [P1 Parameterization Deep-Dive](docs/P1_PARAMETERIZATION_DEEPDIVE.md)
 - [Repo Guardrails](AGENTS.md)

+Archived legacy references:
+
+- [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
+- [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
+- [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+
 Priority source:

 - [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md) is the planning SSOT
-- [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md) is the execution order SSOT
+- [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md) is the technical contract SSOT
+- [P1 Parameterization Deep-Dive](docs/P1_PARAMETERIZATION_DEEPDIVE.md) is the evidence and rationale record
 - this file should track execution progress only

 ## Current State
@@ -34,8 +39,8 @@ Priority source:
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi vs high-fi truth in the observation/task surface
 - [x] separate high-fi submit scoring/reporting from low-fi rollout score state
-- [ ] tracked `P1` fixtures
-- [ ] manual playtest log
+- [x] tracked `P1` fixtures
+- [x] manual playtest log
 - [x] settle the non-submit terminal reward policy
 - [x] baseline comparison has been re-run on the `constellaration` branch state
 - [ ] refresh the heuristic baseline for the real verifier path
@@ -64,12 +69,12 @@ flowchart TD
   freeze observation schema, action schema, episode loop, terminal conditions, and `Reward V0`
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

 - [x] Pass the Northflank smoke test
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md),
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md),
   [training/notebooks/README.md](training/notebooks/README.md)

 - [x] Verify that the current 3-knob family can or cannot approach P1 feasibility
@@ -77,7 +82,7 @@ flowchart TD
   resolve the historical gating question about whether parameterization repair was required before more reward work
   Related:
   [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md),
-  [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)

 ## Fresh Wiring

@@ -90,7 +95,7 @@ flowchart TD
   Files:
   [server/environment.py](server/environment.py),
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)

 - [x] Add a post-terminal guard to the environment loop
   Files:
@@ -158,7 +163,7 @@ flowchart TD

 ## Validation and Reward

-- [ ] Run a small measured sweep on the repaired low-dimensional family
+- [x] Run a small measured sweep on the repaired low-dimensional family
   Goal:
   choose useful parameter ranges, step deltas, and reset seeds from the repaired action family instead of guessing them from prose
   Related:
@@ -170,26 +175,26 @@ flowchart TD
   Related:
   [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)

-- [ ] Add 1-2 tracked `P1` fixtures
+- [x] Add 1-2 tracked `P1` fixtures
   Files:
   [server/data/p1/README.md](server/data/p1/README.md),
-  [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
   Note:
-  add fixtures only after the repaired family is calibrated into a meaningful near-boundary region
+  first tracked fixtures are low-fidelity-calibrated; add paired high-fidelity submit checks next

 - [ ] Run fixture sanity checks
   Goal:
-  confirm verifier outputs, objective direction, and reward ordering
+  confirm paired low-fi/high-fi verifier outputs, objective direction, and reward ordering
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

 - [ ] Manual-playtest 5-10 episodes
   Goal:
-  verify a human can act coherently and surface at least one pathology or ambiguity
+  expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Deliverables Map](docs/FUSION_DELIVERABLES_MAP.md)
+  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)

 - [ ] Update reward from `V0` to `V1` if playtesting reveals a real pathology
   Goal:
@@ -240,7 +245,7 @@ flowchart TD

 - [ ] Deploy the environment to HF Space
   Related:
-  [Deliverables Map](docs/FUSION_DELIVERABLES_MAP.md),
+  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md),
   [README.md](README.md)

 - [ ] Create the thin public Colab notebook
@@ -258,7 +263,7 @@ flowchart TD
 - [ ] Only add training evidence if it is actually persuasive
   Related:
   [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
-  [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

 ## Guardrails
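The "fixture sanity checks" item in the TODO diff above (confirm verifier outputs, objective direction, and reward ordering) can be read as a concrete ordering test. The fixture dicts and their keys below are hypothetical placeholders, not the actual `server/data/p1/` schema:

```python
def check_fixture_ordering(fixtures):
    """Sanity-check a list of fixture metric dicts sorted from best to worst
    expected quality. Keys `max_elongation` and `reward` are illustrative."""
    # objective direction: P1 minimizes max_elongation, so a better fixture
    # must not have a larger max_elongation than a worse one
    elongations = [f["max_elongation"] for f in fixtures]
    assert elongations == sorted(elongations), "objective direction violated"
    # reward ordering must agree with fixture quality (best fixture, top reward)
    rewards = [f["reward"] for f in fixtures]
    assert rewards == sorted(rewards, reverse=True), "reward ordering violated"
    return True
```

Run against good / boundary / bad fixtures, a check like this catches sign flips in the objective or reward before any agent-facing evidence is produced.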
 
docs/FUSION_DELIVERABLES_MAP.md DELETED
@@ -1,121 +0,0 @@
-# Fusion Design Lab Deliverables Map
-
-This is the output-first map for the hackathon. It is aligned to Plan V2: `P1` is locked, the environment is built fresh in this repo, the old harness is not ported, and training claims stay conservative. Everything branches from the four final artifacts the judges and submission flow will actually see.
-
-Northflank is the recommended compute workspace behind those artifacts. HF Space and Colab remain the actual submission surfaces.
-
-Use this map to sequence execution, not to reopen already-locked task choices.
-
-## Current Branch Status
-
-- [x] `P1` contract is frozen in code
-- [x] official `constellaration` verifier loop is wired
-- [x] baseline comparison has been rerun on the real verifier path
-- [x] Northflank smoke workflow and note are committed
-- [x] Northflank smoke test has passed on the team H100
-- [x] historical upstream 3-knob family has been verified as blocked on P1 triangularity
-- [x] repaired low-dimensional boundary builder is implemented
-- [x] explicit VMEC failure semantics are implemented
-- [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly
-- [x] terminal submit scoring/reporting is fidelity-consistent
-- [ ] tracked fixtures are checked in
-- [ ] manual playtest evidence exists
-- [ ] heuristic baseline has been refreshed for the real verifier path
-- [ ] HF Space deployment is live
-
-## Deliverables Tree
-
-```mermaid
-flowchart TD
-  A["Fusion Design Lab Submission"] --> B["HF Space Environment"]
-  A --> C["Colab Eval / Training Notebook"]
-  A --> D["1-Minute Demo"]
-  A --> E["Public Repo + README"]
-  A --> N["Northflank H100 Workspace"]
-
-  B --> B0["P1 environment contract frozen"]
-  B --> B1["Remote reset/step works"]
-  B --> B2["Reward V0 -> V1 documented"]
-  B --> B3["One stable task runs end-to-end"]
-  B --> B4["Clear rules + reproducible episodes"]
-
-  C --> C1["Connects to HF Space"]
-  C --> C2["Runs multi-turn episodes"]
-  C --> C3["Logs behavior + reward traces"]
-
-  D --> D1["Clear problem statement"]
-  D --> D2["Manual playtest + agent trajectory"]
-  D --> D3["Reward shaping story"]
-
-  E --> E1["Readable project summary"]
-  E --> E2["Setup + run instructions"]
-  E --> E3["Submission links and artifacts"]
-
-  N --> N1["Jupyter Notebook with PyTorch live"]
-  N --> N2["Persistent storage attached"]
-  N --> N3["Verifier + baseline runs happen here"]
-  N --> N4["Northflank smoke test passes"]
-
-  B0 --> F["Observation + action schema frozen"]
-  B3 --> G["Fresh P1 verifier loop proven"]
-  G --> G1["Parameterization can actually reach P1 feasibility"]
-  G --> G2["VMEC failures are explicit and penalized"]
-  B2 --> H["Exploit observed -> penalty added"]
-  B4 --> I0["Deterministic action schema"]
-  D2 --> I["Human can act coherently in env"]
-  C3 --> J["Random baseline"]
-  C3 --> K["Heuristic baseline"]
-  G --> L["Official constellaration P1 verifier wired correctly"]
-  L --> M["Good / boundary / bad fixture checks pass"]
-  N4 --> N3
-  N3 --> G
-```
-
-## Reverse Timeline
-
-```mermaid
-flowchart LR
-  S["Submit by Sun 1:00 PM"] --> V["Video finalized"]
-  S --> R["Repo public and readable"]
-  S --> T["Training / eval evidence exported"]
-  S --> H["HF Space live"]
-  S --> N1["Northflank compute ready"]
-
-  V --> V1["Recorded clean demo trajectory"]
-  V --> V2["Scripted 60-second story"]
-
-  T --> T1["Behavior trace image"]
-  T --> T2["Baseline comparison numbers"]
-  T --> T3["Colab notebook runs end-to-end"]
-
-  H --> H1["OpenEnv P1 environment packaged"]
-  H --> H2["Remote client can reset and step"]
-  H --> H3["Verifier and reward stable"]
-  H --> H4["Rules are clear and reproducible"]
-
-  H4 --> P["Environment contract locked first"]
-  N1 --> N2["Jupyter with PyTorch up first"]
-  N2 --> N3["Persistent storage attached"]
-  N3 --> N4["Import + low-fi verifier smoke passes"]
-  N4 --> M0
-  P --> Q["Manual playtest completed first"]
-  H3 --> M0["Local verifier loop proven first"]
-  T2 --> B["Random + heuristic baselines done"]
-  T3 --> X["Training included only if persuasive"]
-  V1 --> Y["One stable task only"]
-  V2 --> Z["Explain reward fix, not just reward gain"]
-  M0 --> N["Fresh wiring, not legacy harness port"]
-```
-
-## Priority Order
-
-Northflank compute bring-up and smoke validation are complete.
-
-1. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
-2. Add tracked fixtures and run fixture sanity checks.
-3. Manual-playtest the environment and record the first real pathology, if any.
-4. Refresh the heuristic baseline from that evidence.
-5. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
-6. Use the notebook to show traces and comparisons; include training only if it adds signal.
-7. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
-8. Polish the repo only after the artifacts are real.
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -2,778 +2,202 @@
2
 
3
  **Hackathon:** OpenEnv Hackathon, March 7-8, 2026
4
  **Track:** Statement 3.1 (World Modeling — Professional Tasks)
5
- **Status:** Judge-aligned plan with `P1` locked
6
-
7
- ## 0. Current Branch Status
8
-
9
- - [x] `P1` task family is locked
10
- - [x] repaired 4-knob low-dimensional `P1` contract is implemented in code
11
- - [x] real `constellaration` verifier wiring is in place
12
- - [x] low-fidelity `run` plus high-fidelity `submit` split is documented
13
- - [x] post-terminal `step()` guard is in place
14
- - [x] baseline comparison has been rerun on the real verifier path
15
- - [x] Northflank smoke workflow and note are committed
16
- - [x] Northflank smoke test has passed on the team H100
17
- - [x] historical upstream 3-knob family has been checked against the real low-fidelity verifier
18
- - [x] parameterization repair is implemented so triangularity is controllable
19
- - [x] explicit VMEC failure semantics are implemented
20
- - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly in the environment surface
21
- - [x] terminal scoring/reporting is fidelity-consistent between low-fi rollout state and high-fi submit truth
22
- - [ ] tracked `P1` fixtures are added
23
- - [ ] manual playtest evidence is recorded
24
- - [ ] heuristic baseline is refreshed for the real verifier path
25
- - [ ] HF Space deployment evidence is recorded
26
-
27
- Current caution:
28
-
29
- - the repaired family is now live, but the exact ranges, deltas, and reset seeds still need a measured sweep before they should be treated as stable defaults
30
- - terminal scoring/reporting now uses a fidelity-consistent basis at episode end: high-fi `submit` comparisons are no longer anchored to low-fi rollout score state
31
 
32
  ## 1. Submission Thesis
33
 
34
- We are not primarily submitting "a trained model for fusion."
35
 
36
- We are submitting a clear, reproducible training environment for a constrained scientific design task:
37
 
38
  - official `P1` benchmark semantics
39
- - a narrow, human-playable action space
40
  - real verifier feedback from `constellaration`
41
- - explicit constraints
42
- - a reward function that is understandable and iteratively improved
43
 
44
  Training is supporting evidence. The environment is the product.
45
 
46
- A trained model is optional. The Colab notebook is still a required public artifact, and it can remain evaluation-first if training evidence is weak.
47
-
48
- ## 2. Locked Decisions
49
-
50
- These decisions are now fixed unless a hard blocker appears:
51
-
52
- - benchmark task: `P1`
53
- - submission framing: `Statement 3.1`
54
- - verifier of record: `constellaration.problems.GeometricalProblem`
55
- - implementation strategy: fresh wiring in this repo
56
- - reuse policy: do not port the old `ai-sci-feasible-designs` harness; only reuse selected JSON artifacts or boundaries when useful
57
-
58
- Execution rule after lock:
59
-
60
- - do not reopen these decisions in new planning passes unless a real blocker appears
61
- - once a decision is locked, translate it into code, fixtures, baselines, or deployment work
62
-
63
- ## 3. What Changed From V1
64
-
65
- This version changes the center of gravity:
66
-
67
- - `environment quality > training effort`
68
- - `reward shaping story > polished final reward formula`
69
- - `manual playtesting > training-first iteration`
70
- - `clarity and reproducibility > broad unsupported transfer claims`
71
- - `fresh, minimal environment wiring > transplanting legacy orchestration`
72
-
73
- This version also separates:
74
-
75
- - what is already decided
76
- - what is a working hypothesis
77
- - what must be validated before it becomes part of the final pitch
78
-
79
- ## 4. Judge-Aligned Priorities
80
-
81
- The judging signal now implies four priorities:
82
-
83
- 1. The environment itself must be strong.
84
- 2. The reward function must be explainable and visibly iterated.
85
- 3. A human should be able to act in the environment coherently before we invest heavily in training.
86
- 4. The final story should emphasize a clear, reproducible environment, not just a reward curve.
87
-
88
- ## 5. Final Artifacts
-
- The four visible artifacts remain:
-
- 1. HF Space environment
- 2. Required Colab notebook for evaluation or training
- 3. 1-minute demo video
- 4. Public repo and README
-
- The primary compute workspace should be Northflank:
-
- - Northflank Jupyter Notebook with PyTorch on the team H100 for development, verifier integration, baselines, and training/debugging
- - HF Space as the hosted environment surface
- - Colab as the minimal required public notebook artifact, even if it ships as an evaluation-first notebook instead of a training-first notebook
-
- But the evidence order is:
-
- 1. environment contract
- 2. manual playtest log
- 3. reward iteration note
- 4. stable local and remote episodes
- 5. random and heuristic baselines
- 6. training or eval notebook evidence
- 7. demo and repo polish
-
- ## 6. Non-Negotiables
-
- - One stable task only.
- - No broad cross-science claims unless evidence exists.
- - No training-first drift.
- - No dependence on reward curves alone.
- - No repo/video polish before environment and baselines are real.
- - No harness transplant from `ai-sci-feasible-designs`.
- - No new strategy churn after `P1` + rotating-ellipse is locked unless a blocker forces it.
-
- ## 7. Single Stable Task
-
- We intentionally narrow the scope to one environment family:
-
- - `P1` geometrical benchmark
- - repaired low-dimensional boundary family derived from rotating-ellipse seeds
- - official `constellaration` verifier
- - low-fidelity evaluation for ordinary interaction
- - optional high-fidelity verification for final checks or `submit`
-
- The task is:
-
- > improve a stellarator boundary on the `P1` benchmark under explicit constraints and limited evaluation budget
-
- ### Constraints
-
- Use the official `P1` constraints:
-
- - aspect ratio `<= 4.0`
- - average triangularity `<= -0.5`
- - edge rotational transform over field periods `>= 0.3`
-
- ### Objective
-
- Use the official `P1` objective:
-
- - minimize `max_elongation`
-
- ### Why This Task
-
- - it is official rather than invented
- - it is cheaper than `P2` and `P3` because `P1` skips QI
- - it maps cleanly to a tool-using scientific workflow
- - it is easier to explain than a broader fusion-design claim
-
- ## 8. Fresh Wiring Rule
-
- This repo should implement a minimal environment directly for the hackathon.
-
- That means:
-
- - define our own environment contract
- - define our own reward logic on top of the official verifier
- - define our own baselines
- - define our own HF Space interface
-
- That does not mean:
-
- - importing the old governor
- - importing the old planner
- - importing the old experiment harness
- - recreating the old agent-as-coder stack
-
- Allowed reuse:
-
- - official `constellaration` library behavior
- - selected JSON artifacts or seed boundaries
- - problem notes as human reference
-
- Implementation handoff:
-
- - the remaining work is now fixture coverage, manual playtesting, heuristic refresh, smoke validation, and deployment
- - do not treat supporting decision notes as a new planning backlog
-
- ## 8.1 Compute Surfaces
-
- Use each surface for one clear purpose:
-
- - Northflank Jupyter Notebook with PyTorch:
-   - main development and compute workspace
-   - verifier sanity checks
-   - manual playtesting
-   - baseline runs
-   - optional RL fine-tuning
- - HF Space:
-   - public OpenEnv environment surface
-   - remote `reset` and `step` endpoint for the final demo path
- - Colab:
-   - minimal reproducible evaluation or training notebook required by the hackathon
-   - the notebook itself is mandatory; a trained model inside it is not
-
- Northflank-specific constraint:
-
- - containers are ephemeral, so persistent storage must be attached before relying on saved models, caches, or fixture downloads
-
- Deployment path:
-
- - develop and verify in Northflank or local
- - commit and push changes to the public GitHub repo
- - have HF Space build and serve from that repo path
- - do not rely on manual copy-paste deployment as the default path
-
- Auth stance:
-
- - prefer a public HF Space for the hackathon to keep the Colab artifact simple
- - if the Space must be private, the notebook must explicitly document token-based access
-
- ## 9. Environment Contract
-
- The environment contract must be frozen before meaningful evaluation.
-
- Historical blocker that drove the repair:
-
- - the upstream 3-knob `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` family does not expose triangularity control
- - on the real low-fidelity verifier path, sampled points stayed at roughly `average_triangularity=+0.004975` and `p1_feasibility=1.00995`
- - that blocker is why the repo now uses a repaired 4-knob low-dimensional family with explicit `triangularity_scale`
-
- ### Observation
-
- The observation should expose:
-
- - current `max_elongation`
- - current aspect ratio
- - current average triangularity
- - current edge rotational transform over field periods
- - current `p1_score`
- - current `p1_feasibility`
- - current `constraints_satisfied`
- - current `vacuum_well`
- - `evaluation_fidelity`
- - `evaluation_failed`
- - `failure_reason`
- - `step_number`
- - `budget_remaining`
- - `best_low_fidelity_score`
- - `best_low_fidelity_feasibility`
- - `best_high_fidelity_score`
- - `best_high_fidelity_feasibility`
- - `target_spec`
- - concise textual summary of the last action outcome in `diagnostics_text`
-
- The observation must be interpretable by a human without additional hidden state.
-
- Current runtime note:
-
- - the live observation surface now exposes explicit low-fidelity and high-fidelity best-state fields
- - low-fi run steps and high-fi submit steps no longer overload one generic `best_score` field
- - traces and baselines should use the explicit fields instead of reconstructing a mixed best-state story
-
- ### Action Space
-
- The live action space stays intentionally small and discrete while exposing the repaired 4-knob low-dimensional family.
-
- Current contract:
-
- - `run`
- - `submit`
- - `restore_best`
-
- For `run`, the controllable fields are:
-
- - parameter: one of
-   - `aspect_ratio`
-   - `elongation`
-   - `rotational_transform`
-   - `triangularity_scale`
- - direction: increase or decrease
- - magnitude: small, medium, large
-
- This is not trying to expose the full Fourier-boundary space. The goal is a legible environment, not maximal realism. The verifier stays official; the custom logic belongs in the low-dimensional boundary builder, not in reward semantics.
-
- ### Episode Flow
-
- 1. Reset from one rotating-ellipse initial state or a small frozen set of initial states.
- 2. Agent chooses one action.
- 3. Low-fidelity verifier runs for normal interaction.
- 4. Environment returns diagnostics and reward.
- 5. Episode ends on:
-    - `submit`
-    - exhausted budget
-
- Failure semantics must also be explicit:
-
- - if VMEC or the forward model fails, the run still consumes budget
- - the observation must expose that the step failed
- - the reward must apply a documented penalty
- - the environment must not silently replace the failed result with a fake success path
-
- ### Terminal Contract
-
- The episode should end cleanly and deterministically.
-
- At termination, the environment should provide:
-
- - final best design metrics
- - whether constraints were satisfied
- - total reward
- - short human-readable summary of the trajectory
-
- ## 10. Verifier Contract
-
- The verifier of record is `constellaration.problems.GeometricalProblem`.
-
- The environment must preserve:
-
- - objective direction
- - constraint direction
- - feasibility semantics
- - score ordering
-
- The environment may add reward shaping, but it must not redefine what `P1` means.
-
- Implementation split:
-
- - boundary builder or parameterization adapter:
-   - custom low-dimensional family construction
-   - rotating-ellipse seed creation
-   - triangularity control injection, if used
- - official verifier:
-   - boundary in
-   - `GeometricalProblem` semantics out
-
- The verifier should be boundary-based. Parameterization-specific logic should not be treated as verifier truth.
-
- Current execution rule:
-
- - do not narrate guessed repaired-family ranges, deltas, or a larger budget as settled defaults until they are measured on the repaired family
-
- ## 11. Reward V0
-
- The reward in this document is not the final reward. It is `Reward V0`.
-
- The initial scoring idea should be feasibility-first:
-
- - reducing normalized constraint violation should help
- - becoming feasible should give a meaningful bonus
- - once feasible, lower `max_elongation` should help
- - wasting budget should have some cost
- - successful submission may deserve a small bonus
-
- ### Reward V0 Design Goals
-
- - easy to explain
- - sensitive to genuine progress
- - hostile to obvious degenerate behavior
- - simple enough to debug from trajectories
- - aligned with official `P1` semantics
-
- Current execution note:
-
- - do not tune reward further until the repaired low-dimensional family can actually approach P1 feasibility
- - once parameterization is repaired, keep `Reward V0` scalar and feasibility-first
- - clearly distinguish low-fidelity step-time metrics from high-fidelity submit-time truth in the observation contract and docs
- - do not use reward complexity to compensate for missing action expressivity or missing VMEC failure semantics
- - keep terminal reward and reporting fidelity-consistent; do not compare high-fi submit scores against low-fi best/initial score state
-
- ### Reward V0 Failure Modes To Test
-
- We should expect at least some of these:
-
- - the agent oscillates between equivalent moves
- - the agent submits too early
- - the agent never submits
- - the agent learns to improve objective before it learns feasibility
- - the agent camps near one constraint while breaking another
- - the agent overuses `restore_best`
-
- The reward is only acceptable after we test for those behaviors.
-
- Important execution rule:
-
- - if manual playtesting does not reveal a real pathology, keep `Reward V0` and document that outcome rather than forcing a `Reward V1`
-
- ## 12. Verifier and Reward Fixture Checks
-
- Before training, we should validate environment wiring with a few fixed fixtures.
-
- Use:
-
- - one known-good design or near-winning design
- - a few near-boundary designs
- - a few clearly infeasible designs
-
- Do not assume the default baseline params are enough for this set. They are currently useful as an infeasible reference, not as a near-feasible anchor.
-
- Purpose:
-
- - verify the verifier is wired correctly
- - verify the reward ordering makes sense
- - verify feasible designs outrank clearly infeasible ones
-
- This is calibration, not training.
-
- ## 13. What Is Hypothesis vs Validated
-
- These are still hypotheses until manually or empirically checked:
-
- - six steps are enough to create non-trivial decision pressure
- - the repaired low-dimensional action family is expressive enough for a meaningful `P1` task
- - `restore_best` is useful without becoming an exploit
- - heuristic should beat random on mean episode reward
- - low-fidelity interaction is predictive enough for useful policy learning
- - useful repaired-family parameter ranges and deltas
- - whether the current budget should stay at `6` or change after playtesting
-
- These should not be narrated as facts in the final demo until validated.
-
- ## 14. Manual Playtest Plan
-
- Before heavy training, we should act as the agent ourselves.
-
- ### Protocol
-
- Run 5 to 10 episodes manually and log for each step:
-
- - observation seen
- - action chosen
- - reason for the action
- - verifier outcome
- - reward returned
- - whether the reward matched intuitive quality
-
- ### Questions The Playtest Must Answer
-
- - can a human understand what to do from the observation?
- - do action labels map to meaningful decisions?
- - is the step budget interesting or arbitrary?
- - which actions are high leverage?
- - do obvious bad actions get punished?
- - do obviously good actions get rewarded?
- - does `restore_best` help recovery or encourage stalling?
-
- ### Expected Output
-
- - short manual playtest log
- - one paragraph on what a good episode looks like
- - one paragraph on what broke or felt ambiguous
-
- ## 15. Reward Iteration Story
-
- The reward iteration story is not a side note. It is likely part of the pitch.
-
- We should aim to document at least one concrete sequence:
-
- 1. initial reward version
- 2. observed bad behavior
- 3. reward or penalty change
- 4. changed behavior afterward
-
- Examples of acceptable story structure:
-
- - "The agent improved elongation while staying deeply infeasible, so we increased feasibility-first shaping."
- - "The agent hovered near one constraint and ignored another, so we changed the violation shaping."
- - "The agent overused restore-best, so we changed the reward or step logic to make stalling unprofitable."
-
- This is stronger than saying only "reward improved after training."
-
- ## 16. Evidence Plan
-
- ### HF Space
-
- Must prove:
-
- - remote `reset` works
- - remote `step` works
- - one stable episode runs end-to-end
- - the remote behavior matches the local contract
-
- HF Space is the serving surface, not the main heavy-compute workspace.
-
- ### Northflank Notebook
-
- Must prove:
-
- - Jupyter Notebook with PyTorch is live on the team H100
- - persistent storage is attached
- - verifier and baseline work runs there without local-machine dependency
- - environment/debug/training work can proceed there even if local runtime is inconvenient
- - one smoke check passes:
-   - import `constellaration`
-   - generate one rotating-ellipse boundary
-   - run one low-fidelity verifier call
-   - write a result artifact to persistent storage
-
- ### Colab Notebook
-
- Primary job:
-
- - connect to the live environment
- - run multi-turn episodes
- - export traces and baseline comparisons
-
- Secondary job:
-
- - show training or policy improvement if the signal is credible
-
- If training is weak but the environment and eval traces are strong, the notebook still ships.
-
- Colab is a required artifact, but it is not the preferred main compute surface.
-
- Connectivity rule:
-
- - if HF Space is public, the notebook uses direct HTTP calls with no extra auth flow
- - if HF Space is private, the notebook must state the required token path and setup explicitly
-
- ### Demo Video
-
- The video should show:
-
- 1. the `P1` task
- 2. the environment observation and action space
- 3. one manual or agent trajectory
- 4. one reward pathology and fix
- 5. one baseline comparison
-
- Reward curves are optional supporting visuals, not the center of the story.
-
- ### Public Repo
-
- The repo should make the environment easy to understand:
-
- - what `P1` is
- - what the agent sees
- - what the agent can do
- - how reward works
- - how to run one episode
- - where the demo evidence lives
- - why the repo is freshly wired rather than copied from the old project
-
- ## 17. Success Gates
-
- ### Prerequisite: Northflank Compute Ready
-
- - notebook starts on the team H100
- - persistent storage mount is usable
- - smoke test artifact is written successfully from the rotating-ellipse-derived low-dimensional boundary path
- - latest artifact example: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
-
- ### Gate 1: Environment Contract Locked
-
- - task frozen
- - observation schema frozen
- - action schema frozen
- - terminal conditions frozen
- - explicit VMEC failure semantics defined
- - low-fi vs high-fi metric labeling defined
-
- ### Gate 2: Verifier Wiring Pass
-
- - official `P1` verifier returns expected outputs
- - fixture ordering is sensible
- - objective direction is correct
-
- ### Gate 3: Manual Playtest Pass
-
- - human can act coherently
- - at least one trajectory feels sensible
- - at least one pathology identified or ruled out
-
- ### Gate 4: Stable Local Episode
-
- - local modify -> verify -> observe loop works
- - at least one end-to-end episode is stable
- - submit-time reward/reporting does not mix low-fi and high-fi score state
-
- ### Gate 5: Reward V1
-
- - at least one reward revision completed
- - story is documented with before/after behavior
-
- ### Gate 6: Baselines
-
- - random baseline complete
- - heuristic baseline complete
- - heuristic is at least competitive and preferably better than random
-
- ### Gate 7: Remote Environment
-
- - HF Space live
- - remote client runs one clean episode
-
- ### Gate 8: Notebook Evidence
-
- - notebook runs end-to-end
- - traces exported
- - training evidence included only if it adds signal
-
- ## 18. Timeline
-
- ### Phase 0
-
- Run two parallel tracks:
-
- - Track A: Northflank compute setup and smoke validation
- - Track B: lock the `P1` environment contract
-
- Deliverables:
-
- - frozen task definition
- - frozen action and observation schema
- - proof that one local `P1` loop works
- - Northflank smoke test pass
-
- ### Phase 1
-
- Repair the low-dimensional parameterization, wire the verifier split cleanly, and run a small measured sweep before fixture checks.
-
- Deliverables:
-
- - repaired low-dimensional boundary builder
- - boundary-based verifier split
- - explicit VMEC failure semantics
- - measured parameter ranges, deltas, and candidate reset seeds
-
- ### Phase 2
-
- Audit observation clarity, then freeze initial fixtures and manual-playtest the environment.
-
- Deliverables:
-
- - observation semantics note covering low-fi vs high-fi reporting and best-state fields
- - one good or near-boundary fixture
- - bad fixtures
- - 5 to 10 episode logs
- - notes on leverage, ambiguity, and pathologies
-
- ### Phase 3
-
- Implement or refine Reward V0 into Reward V1 based on real behavior.
-
- Deliverables:
-
- - documented exploit
- - documented fix
- - updated reward logic
-
- ### Phase 4
-
- Stabilize one local task and run baselines.
-
- Deliverables:
-
- - stable local trajectory
- - random baseline
- - heuristic baseline
-
- ### Phase 5
-
- Deploy HF Space and validate remote parity.
-
- Deliverables:
-
- - live environment
- - one stable remote episode
-
- ### Phase 6
-
- Produce notebook evidence.
-
- Deliverables:
-
- - Colab notebook
- - Northflank traces or run exports
- - traces
- - baseline comparison
- - training outputs only if persuasive
-
- ### Phase 7
-
- Record the demo and make the repo readable.
-
- Deliverables:
-
- - 1-minute video
- - public README
- - linked artifacts
-
- ## 19. Fallback Rules
-
- If something goes wrong, the fallback should preserve the environment story.
-
- ### If training signal is weak
-
- Do not force a training-centric pitch.
-
- Ship:
-
- - strong environment
- - verifier and fixture evidence
- - manual playtest evidence
- - reward iteration story
- - baseline traces
- - one stable remote demo
-
- ### If Northflank is delayed or unavailable
-
- Do not block environment design on it.
-
- Fallback:
-
- - continue contract definition, reward design, and basic wiring locally
- - use local CPU or Colab for limited verifier/debug work
- - keep Northflank as the preferred compute target, but do not stall the whole plan waiting for it
-
- ### If reward is unstable
-
- Reduce ambition:
-
- - keep only the terms we can explain
- - remove fragile shaping
- - prefer legible trajectories over complex reward composition
-
- ### If the task is too hard
-
- Do not broaden scope.
-
- Instead:
-
- - simplify the initial states
- - tighten the action set
- - reduce magnitude choices
- - keep the environment more learnable before changing the budget
-
- ### If the task is too easy
-
- Do not add more domains.
-
- Instead:
-
- - first verify that parameterization repair and reset seeds did not make the task trivial
- - adjust budget
- - adjust magnitudes
- - adjust reward to discourage trivial submission
-
- ## 20. Demo Story
-
- The recommended demo structure is:
-
- ### Part 1: Problem
-
- "The agent interacts with the official `P1` stellarator-design benchmark and must improve a design under strict geometric constraints."
-
- ### Part 2: Environment
-
- "Here is what the agent sees, what it can change, and what counts as success."
-
- ### Part 3: Reward Iteration
-
- "Our first reward version produced a bad behavior. We changed the penalty or incentive, and the behavior improved."
-
- ### Part 4: Evidence
-
- "Here is one stable trajectory, plus random and heuristic baselines."
-
- ### Part 5: Why It Matters
-
- "This is a clear, reproducible scientific workflow environment built around a real verifier, not a shortcut task."
-
- That last line is intentionally conservative. It is strong enough without claiming universal scientific transfer.
-
- ## 21. Immediate Next Actions
-
- 1. Run a small measured sweep before locking ranges, deltas, or budget changes.
- 2. Freeze fixtures and run manual playtests before heavy training work.
- 3. Mark the current reward as `V0`.
- 4. Log the first real pathology and reward revision.
- 5. Do not let notebook or video work outrun the environment evidence.
  **Hackathon:** OpenEnv Hackathon, March 7-8, 2026
  **Track:** Statement 3.1 (World Modeling — Professional Tasks)
+ **Role:** Planning and execution SSOT for this repo
+ **Updated:** March 8, 2026

  ## 1. Submission Thesis

+ Fusion Design Lab is not primarily a "trained model for fusion" submission.

+ It is a clear, reproducible environment for one constrained scientific design task:

  - official `P1` benchmark semantics
+ - narrow, human-playable action space
  - real verifier feedback from `constellaration`
+ - explicit constraints and failure semantics
+ - reward logic that can be explained and iterated

  Training is supporting evidence. The environment is the product.

+ ## 2. Current State
+ Completed:
+
+ - `P1` is locked as the single benchmark task
+ - the repaired 4-knob low-dimensional runtime is live in code
+ - the official `constellaration` verifier path is wired
+ - low-fidelity `run` and high-fidelity `submit` are separated clearly
+ - terminal scoring and reporting are fidelity-consistent
+ - explicit VMEC failure semantics are implemented
+ - the Northflank smoke workflow is committed
+ - the Northflank smoke test passed on the team H100
+ - baseline comparison has been rerun on the real verifier path
+ - a coarse measured sweep note now exists
+ - the first tracked low-fidelity fixtures now exist
+ - an initial low-fidelity manual playtest note now exists
+
+ Still open:
+
+ - paired high-fidelity checks for the tracked fixtures
+ - submit-side manual playtest evidence
+ - heuristic baseline refresh on the repaired real-verifier path
+ - HF Space deployment evidence
+ - Colab artifact wiring
+ - demo and README polish after the artifacts are real
+
+ Current caution:
+
+ - do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
+ - do not narrate low-fidelity rollout metrics as final submission truth
+
+ ## 3. Locked Decisions
+
+ These decisions are fixed unless a hard blocker appears:
+
+ - benchmark task: `P1`
+ - submission framing: `Statement 3.1`
+ - verifier of record: `constellaration.problems.GeometricalProblem`
+ - repo strategy: fresh wiring in this repo
+ - reuse policy: do not port the old `ai-sci-feasible-designs` harness
+ - scope rule: one stable task only
+
+ Execution rule:
+
+ - do not reopen strategy unless a real blocker appears
+ - convert decisions into code, fixtures, traces, baselines, or deployment work
+
+ ## 4. Non-Negotiables
+
+ - Keep scope to one stable task.
+ - Keep claims conservative and evidence-backed.
+ - Do not let training-first work outrun environment stability.
+ - Do not rely on reward curves alone; keep trajectory evidence.
+ - Do not use reward complexity to hide a blocked action family.
+ - Do not polish repo or video before the environment and baselines are real.
+
+ ## 5. Document Roles
+
+ Use the docs like this:
+
+ - this file defines planning order, status, gates, and fallback rules
+ - [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md) defines the live technical contract
+ - [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md) keeps blocker evidence, sweep evidence, and supporting rationale
+ - archived legacy planning docs live under [`archive/`](archive/) and are not active SSOT surfaces
+
+ ## 6. Artifact Plan
+
+ Visible artifacts:
+
+ - [ ] HF Space environment
+ - [ ] Required Colab notebook
+ - [ ] 1-minute demo video
+ - [x] Public repo and README
+
+ Compute surfaces:
+
+ - Northflank is the main compute workspace for verifier-heavy work
+ - HF Space is the hosted environment surface
+ - Colab is the required public artifact and can stay evaluation-first if training evidence is weak
+
+ Evidence order:
+
+ - [x] measured sweep note
+ - [ ] fixture checks
+ - [x] manual playtest log
+ - [ ] reward iteration note
+ - [ ] stable local and remote episodes
+ - [x] random and heuristic baselines
+ - [ ] notebook evidence
+ - [ ] demo and repo polish
+
+ ## 7. Environment Summary
+
+ The environment contract must stay narrow and legible:
+
+ - one repaired low-dimensional boundary family derived from a rotating-ellipse seed
+ - discrete `run | submit | restore_best` interaction
+ - low-fidelity verifier for normal steps
+ - high-fidelity verifier for `submit`
+ - readable observation surface with explicit fidelity labeling
+ - `Reward V0` kept simple and feasibility-first until playtesting proves a real pathology
+
+ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.
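
As a concrete sketch of how narrow this interaction surface is, the discrete `run | submit | restore_best` schema can be written down in a few lines. All class and type names below are illustrative assumptions, not the repo's actual identifiers; the authoritative schema is the environment contract doc.

```python
from dataclasses import dataclass
from typing import Literal, Union

# Hypothetical names for illustration only; the real schema lives in the
# environment contract doc and the environment code.
Parameter = Literal["aspect_ratio", "elongation", "rotational_transform", "triangularity_scale"]
Direction = Literal["increase", "decrease"]
Magnitude = Literal["small", "medium", "large"]

@dataclass(frozen=True)
class RunAction:
    # One bounded nudge to a single knob of the repaired 4-knob family.
    parameter: Parameter
    direction: Direction
    magnitude: Magnitude

@dataclass(frozen=True)
class SubmitAction:
    # Triggers the high-fidelity evaluation and ends the episode.
    pass

@dataclass(frozen=True)
class RestoreBestAction:
    # Rolls the working design back to the best design seen so far.
    pass

Action = Union[RunAction, SubmitAction, RestoreBestAction]

def action_space_size() -> int:
    # 4 parameters x 2 directions x 3 magnitudes run actions,
    # plus the submit and restore_best actions.
    return 4 * 2 * 3 + 2
```

The point of the sketch is the size of the space: a few dozen discrete actions, which is what keeps episodes human-playable and traces legible.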
+
+ ## 8. Execution Order
+
+ - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
+ - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
+ - [ ] Manual-playtest 5 to 10 episodes, including real submit traces, and record the first real confusion point, exploit, or reward pathology.
+ - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
+ - [ ] Refresh the heuristic baseline using the repaired-family evidence.
+ - [ ] Prove a stable local episode path.
+ - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
+ - [ ] Wire the Colab artifact to the live environment.
+ - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
+ - [ ] Polish the public repo only after the artifacts above exist.
+
+ ## 9. Success Gates
+
+ Gate 1: measured sweep exists
+
+ - repaired-family ranges, deltas, and reset seeds are justified by recorded evidence
+
+ Gate 2: fixture checks pass
+
+ - good, boundary, and bad references behave as expected
+
+ Gate 3: manual playtest passes
+
+ - a human can read the observation
+ - a human can choose a plausible next action
+ - a human can explain the reward change
+
+ Gate 4: local episode is stable
+
+ - one clean trajectory is reproducible enough for demo use
+
+ Gate 5: baseline story is credible
+
+ - heuristic behavior is at least interpretable and preferable to random on the repaired task
+
+ Gate 6: remote surface is real
+
+ - HF Space preserves the same task contract as local
+
+ Gate 7: submission artifacts exist
+
+ - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
+
+ ## 10. Fallback Rules
+
+ If training evidence is weak:
+
+ - keep the notebook evaluation-first
+ - ship the environment, playtest, and baseline story anyway
+
+ If HF Space deployment is delayed:
+
+ - keep local and Northflank evidence first
+ - document the deployment blocker plainly
+ - do not invent remote claims without a real run
+
+ If reward behavior is confusing:
+
+ - fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity
+
+ If the repaired family is too hard:
+
+ - adjust ranges, deltas, or seeds from measured evidence
+ - do not expand into a broad Fourier action space just to rescue the hackathon scope
+
+ If the repaired family is too easy:
+
+ - prefer fixture and seed adjustments before broadening the action schema
+
+ ## 11. Immediate Next Actions
+
+ - [x] Record the measured sweep and choose provisional defaults from evidence.
+ - [x] Check in tracked fixtures.
+ - [x] Record the first manual playtest log.
+ - [ ] Refresh the heuristic baseline from that playtest evidence.
+ - [ ] Verify one clean HF Space episode with the same contract.
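
The "one clean HF Space episode" check above can be sketched as a tiny client loop. The endpoint paths, payload shape, and the `done` observation field are assumptions for illustration, not the environment's confirmed API; the transport is injected so the loop logic can be exercised without a live Space.

```python
import json
import urllib.request
from typing import Callable, Dict

def http_post(base_url: str, path: str, payload: dict) -> dict:
    # Minimal JSON-over-HTTP helper, matching the "public Space, no extra
    # auth flow" connectivity rule. base_url is whatever the deployed
    # Space exposes; no real URL is assumed here.
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_one_episode(post: Callable[[str, dict], dict], max_steps: int = 6) -> Dict:
    # Drive one remote episode through the injected transport. The action
    # payload mirrors the discrete run-action fields; "/reset", "/step",
    # and "done" are hypothetical names for illustration.
    obs = post("/reset", {})
    for _ in range(max_steps):
        if obs.get("done"):
            break
        obs = post("/step", {"action": {"kind": "run",
                                        "parameter": "elongation",
                                        "direction": "decrease",
                                        "magnitude": "small"}})
    return obs
```

In a notebook, the real call would be `run_one_episode(lambda p, d: http_post(space_url, p, d))` with the deployed Space URL; the loop itself is what a remote-parity check needs to exercise.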
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md DELETED
@@ -1,246 +0,0 @@
- # Fusion Design Lab: Next 12 Hours Checklist
-
- This checklist turns the updated deliverables map and Plan V2 into concrete execution order. The goal is to produce real evidence for the four submission artifacts, with `P1`, fresh wiring, and environment clarity driving the sequence.
-
- ## Core Rule
-
- Do not expand scope beyond one stable task. Training is supporting evidence, not the main story.
-
- ## Current Branch Status
-
- - [x] `P1` task is locked
- - [x] repaired 4-knob low-dimensional `P1` contract is implemented in the working tree
- - [x] baselines and API surface have been moved to the `P1` contract
- - [x] add a post-terminal guard in `step()`
- - [x] replace the synthetic evaluator with `constellaration`
- - [x] re-run baselines on the real verifier path
- - [x] commit the Northflank smoke workflow and note
- - [x] pass the Northflank smoke test on the team H100
- - [x] verify that the historical upstream 3-knob family is blocked on P1 triangularity under the real verifier path
- - [x] repair the low-dimensional parameterization
- - [x] add explicit VMEC failure semantics
- - [x] label low-fi `run` truth vs high-fi `submit` truth in the task surface
- - [x] separate high-fi submit scoring/reporting from low-fi rollout score state
- - [ ] add tracked fixtures and manual playtest evidence
- - [ ] refresh the heuristic baseline after the real-verifier rerun
-
- Current caution:
-
- - do not assume the first repaired defaults are final; run a measured sweep before treating ranges, deltas, or reset seeds as stable
- - do not present submit-time score comparisons as clean unless they are grounded in the now-separated high-fi submit state
-
- ## Plan V2 Inheritance
-
- Carry these rules through the whole checklist:
-
- - Freeze the environment contract before heavy iteration.
- - Keep the repo freshly wired; do not port the old harness.
- - Treat the current reward as `Reward V0`, not final reward.
- - Distinguish validated facts from working hypotheses.
- - Prefer behavior traces and baseline comparisons over generic reward-curve storytelling.
- - If training is weak, ship the environment story anyway.
- - Use Northflank as the main compute workspace; keep HF Space and Colab as the submission surfaces.
- - Do not open another strategy loop unless a real blocker appears.
-
- ## Hour 0-2: Parallelize Compute Bring-Up and Contract Lock
-
- ### Track A: Northflank Compute
-
- 1. Bring up the Northflank Jupyter Notebook with PyTorch on the team H100.
-
32
- ## Plan V2 Inheritance
33
-
34
- Carry these rules through the whole checklist:
35
-
36
- - Freeze the environment contract before heavy iteration.
37
- - Keep the repo freshly wired; do not port the old harness.
38
- - Treat the current reward as `Reward V0`, not final reward.
39
- - Distinguish validated facts from working hypotheses.
40
- - Prefer behavior traces and baseline comparisons over generic reward-curve storytelling.
41
- - If training is weak, ship the environment story anyway.
42
- - Use Northflank as the main compute workspace; keep HF Space and Colab as the submission surfaces.
43
- - Do not open another strategy loop unless a real blocker appears.
44
-
45
- ## Hour 0-2: Parallelize Compute Bring-Up and Contract Lock
46
-
47
- ### Track A: Northflank Compute
48
-
49
- 1. Bring up the Northflank Jupyter Notebook with PyTorch on the team H100.
50
- 2. Attach persistent storage before relying on saved models, caches, or fixture downloads.
51
- 3. Preserve the concrete smoke-test evidence:
52
- - import `constellaration`
53
- - generate one rotating-ellipse-derived low-dimensional boundary
54
- - run one low-fidelity verifier call
55
- - keep one artifact in persistent storage
56
- - current artifact: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
57
-
58
- Exit condition: the notebook is not just open; the verifier path works and persistent storage is usable.
59
-
60
- Artifacts:
61
- - Northflank notebook live
62
- - smoke test note
63
- - one persisted smoke artifact
64
-
65
- ### Track B: Environment Contract
66
-
67
- 1. Write the exact `P1` environment spec.
68
- 2. Freeze one task only.
69
- 3. Define:
70
- - observation schema
71
- - action schema
72
- - episode loop
73
- - terminal conditions
74
- - reward V0 terms
75
- - initial penalties
76
- 4. Update the main diagram so it emphasizes:
77
- - `P1`
78
- - official verifier
79
- - reward shaping
80
- - manual playtesting
81
- 5. Mark open assumptions explicitly:
82
- - whether the repaired low-dimensional action set is expressive enough
83
- - whether the fixed step budget is enough
84
- - whether `restore_best` is useful without becoming an exploit
85
- - whether repaired-family ranges and deltas need adjustment after measurement
86
-
87
- Exit condition: a human can read the spec and understand how to act in the environment.
88
-
89
- Artifacts:
90
- - short environment spec
91
- - revised mermaid diagram
92
- - short hypothesis list
93
-
94
- Transition rule:
95
-
96
- - once Track B exits, stop rewriting the strategy and move straight into wiring and verifier checks
97
-
98
- ## Hour 2-4: Verify Wiring, Then Manual Playtest
99
-
100
- 1. Run a small measured sweep on the repaired family before freezing defaults.
101
- 2. Audit observation clarity:
102
- - low-fi `run` metrics are clearly labeled
103
- - high-fi `submit` metrics are clearly labeled
104
- - low-fidelity and high-fidelity best-state fields are explicit and human-readable
105
- 3. Run fixture checks:
106
- - known-good or near-winning design
107
- - near-boundary designs
108
- - clearly bad designs
109
- - do not rely on the current default baseline params as the only starting point
110
- 4. Confirm:
111
- - verifier outputs are sane
112
- - reward ordering is sane
113
- - objective direction is correct
114
- 5. Manually play 5 to 10 episodes.
115
- 6. Log for each step:
116
- - observation
117
- - chosen action
118
- - expected effect
119
- - returned reward
120
- - confusion or exploit if observed
121
- 7. Identify at least one bad incentive or exploit.
122
- 8. Patch reward or penalty logic immediately.
123
- 9. Write the reward shaping story:
124
- - initial reward V0
125
- - bad behavior
126
- - refinement to reward V1
127
- - improved behavior
128
- 10. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
129
-
130
- Exit condition: you can explain why the environment now rewards the intended behavior.
131
-
132
- Artifacts:
133
- - measured range and delta note
134
- - observation semantics note
135
- - fixture check note
136
- - manual playtest log
137
- - reward shaping note
138
- - reward V1 delta note
139
-
140
- ## Hour 4-6: Stabilize the Local Task
141
-
142
- 1. Prove the fresh local `P1` verifier loop.
143
- 2. Run one stable end-to-end task repeatedly.
144
- 3. Confirm the action schema is deterministic enough for reproducible episodes.
145
- 4. Save one clean local trajectory.
146
- 5. Do not proceed to remote deployment until this gate is real.
147
-
148
- Exit condition: the same setup yields the same type of behavior reliably enough for a demo.
149
-
150
- Artifacts:
151
- - stable local run
152
- - saved trajectory
153
-
154
- ## Hour 6-8: Make the HF Space Real
155
-
156
- 1. Package the OpenEnv `P1` environment for remote use.
157
- 2. Use the explicit deployment path:
158
- - commit changes in this repo
159
- - push to GitHub
160
- - let HF Space build from the repo
161
- 3. Decide and document the access mode:
162
- - preferred: public HF Space for the hackathon
163
- - if private: token-based notebook access documented
164
- 4. Verify remote `reset` and `step`.
165
- 5. Run one clean remote episode end-to-end.
166
- 6. Confirm the remote environment preserves the same task contract as local.
167
-
168
- Exit condition: the environment is runnable in the actual submission surface, not only locally.
169
-
170
- Artifacts:
171
- - live HF Space environment
172
- - remote episode proof
173
-
174
- ## Hour 8-10: Add Baselines
175
-
176
- 1. Implement the random baseline.
177
- 2. Implement the heuristic baseline.
178
- 3. Run short comparisons on the same stable `P1` task.
179
- 4. Save:
180
- - comparison numbers
181
- - behavior traces
182
- - one example where heuristic beats random
183
-
184
- Exit condition: there is a credible baseline anchor for the judges.
185
-
186
- Artifacts:
187
- - random baseline
188
- - heuristic baseline
189
- - comparison table or figure
190
-
191
- ## Hour 10-12: Produce the Submission Evidence
192
-
193
- 1. Wire the Colab training or eval script to the live environment.
194
- 2. Ensure it produces:
195
- - multi-turn episodes
196
- - behavior traces
197
- - reward or behavior comparison outputs
198
- 3. Keep heavy verifier and training work on Northflank; use Colab as the thin public artifact.
199
- 4. Draft the 60-second demo script.
200
- 5. Record the demo around:
201
- - what `P1` is
202
- - how reward was refined
203
- - what manual playtesting revealed
204
- - one stable trajectory
205
- - baseline comparison
206
- 6. If training evidence is weak, keep the notebook eval-first and do not force a training-centric claim.
207
- 7. Make the repo public-facing and readable only after the artifacts are real.
208
-
209
- Exit condition: all four visible artifacts exist in usable form.
210
-
211
- Artifacts:
212
- - Colab training or eval script
213
- - Northflank run notes or exported traces
214
- - demo script
215
- - draft or final video
216
- - updated repo README
217
- - explicit fallback note if training is not persuasive
218
-
219
- ## Artifact Order
220
-
221
- 1. Environment spec
222
- 2. Repaired parameterization note
223
- 3. Fixture check note
224
- 4. Manual playtest log
225
- 5. Reward revision note
226
- 6. Stable task run
227
- 7. Random baseline
228
- 8. Heuristic baseline
229
- 9. Northflank traces or training evidence
230
- 10. Colab training or eval evidence
231
- 11. Demo recording
232
- 12. Repo polish
233
-
234
- ## Non-Negotiables
235
-
236
- - Do not widen scope beyond one stable task.
237
- - Do not port the old `ai-sci-feasible-designs` harness into this repo.
238
- - Do not optimize training before manual playtesting.
239
- - Do not rely on reward curves alone; keep trajectory evidence.
240
- - Do not narrate hypotheses as facts before they are checked.
241
- - Do not guess repaired-family ranges, deltas, or budget changes without a measured sweep.
242
- - Do not polish the repo or video before the environment and baselines are real.
243
- - Treat judge comments as pressure toward clarity and reproducibility, not broader unsupported claims.
244
- - Do not force a training-centric story if the strongest evidence is environment quality plus baselines.
245
- - Do not rely on Northflank container-local state without persistent storage.
246
- - Do not block contract design work on Northflank provisioning friction.
 
 
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -1,151 +1,92 @@
1
  # P1 Environment Contract V1
2
 
3
- **Status:** Technical contract with partial implementation now landed
4
- **Role:** Supporting spec for the `P1` environment contract
5
- **SSOT relationship:** This file refines [FUSION_DESIGN_LAB_PLAN_V2.md](FUSION_DESIGN_LAB_PLAN_V2.md). If this file conflicts with the planning SSOT, update both in the same task.
6
 
7
- ## Purpose
8
 
9
- This file captures the technical contract that should drive the next code changes in:
10
 
11
- - [server/physics.py](../server/physics.py)
12
- - [fusion_lab/models.py](../fusion_lab/models.py)
13
- - [server/environment.py](../server/environment.py)
14
- - [server/app.py](../server/app.py)
15
 
16
- The central change is now explicit:
17
 
18
- - the historical upstream 3-knob rotating-ellipse family is blocked on P1 triangularity under the real verifier path
19
- - that blocker drove the repair to the current 4-knob low-dimensional runtime
20
- - the runtime now exposes the repaired 4-knob target, but measured sweep validation and fixture calibration are still pending
21
-
22
- ## Historical Blocker
23
-
24
- This section records the resolved upstream blocker that motivated the current repair. It is not the live runtime state.
25
-
26
- Current verified facts:
27
-
28
- - upstream `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` has no triangularity control
29
- - the historical 3-knob environment directly exposed only:
30
- - `aspect_ratio`
31
- - `elongation`
32
- - `rotational_transform`
33
- - real low-fidelity samples on the current verifier path kept:
34
- - `average_triangularity` at roughly `+0.004975`
35
- - `p1_feasibility` at roughly `1.00995`
36
- - feasible count at `0`
37
-
38
- Conclusion:
39
-
40
- - the historical 3-knob family was not a meaningful playtest or baseline environment for `P1`
41
- - the live runtime therefore moved to a repaired boundary family before further reward iteration
42
-
43
- ## Design Split
44
 
45
  Keep three layers separate:
46
 
47
- 1. **Boundary builder**
48
- - low-dimensional parameterization
49
- - rotating-ellipse seed generation
50
- - optional triangularity control injection
51
- 2. **Official verifier**
52
- - boundary in
53
- - metrics out
54
- - feasibility, objective, and score semantics from `GeometricalProblem`
55
- 3. **Environment**
56
- - reset pool
57
- - discrete actions
58
- - episode flow
59
- - reward shaping
60
-
61
- ## Verifier Plan
62
-
63
- `server/physics.py` should expose a boundary-based verifier surface.
64
-
65
- Current repo state:
66
-
67
- - the live code now exposes a boundary builder plus boundary-based evaluator
68
- - explicit failure results are returned when VMEC evaluation fails
69
- - measured sweep validation is still pending
70
-
71
- Current live functions:
72
-
73
- - `build_boundary_from_params(...) -> SurfaceRZFourier`
74
- - `evaluate_boundary(boundary, fidelity) -> EvaluationMetrics`
75
 
76
- Current layering note:
77
 
78
- - discrete perturbation application lives in `server/environment.py`
79
- - there is no separate `apply_low_dim_perturbation(...)` helper in the live code
 
80
 
81
- The verifier layer should own:
82
 
83
- - low-fidelity step-time evaluation
84
- - high-fidelity submit-time evaluation
85
  - official `P1` feasibility semantics
86
- - official `P1` objective direction
87
- - score ordering
88
  - explicit failure results when VMEC or forward-model evaluation fails
89
 
90
- The verifier layer should not own:
91
 
 
 
92
  - episode budget
93
- - action semantics
94
  - reward shaping
95
- - “best so far” state
96
 
97
- ## Low-Dimensional Boundary Plan
98
 
99
- Stay low-dimensional, not Fourier-first.
100
 
101
- Target controllable knobs:
102
 
103
  - `aspect_ratio`
104
  - `elongation`
105
  - `rotational_transform`
106
  - `triangularity_scale`
107
 
108
- Current measurement rule:
109
 
110
- - do not lock exact repaired-family ranges or deltas from prose alone
111
- - measure them on the repaired boundary family before presenting them as defaults
112
- - especially treat `rotational_transform` bounds, `triangularity_scale` deltas, and budget changes as open until measured
113
 
114
- Important naming rule:
115
 
116
- - once triangularity is injected explicitly, stop describing the family as plain upstream “rotating ellipse”
117
- - it becomes a custom low-dimensional boundary family derived from a rotating-ellipse seed
118
 
119
- ## Action Contract
 
 
120
 
121
- Keep the discrete interaction style:
122
 
123
- - `intent`: `run | submit | restore_best`
124
  - `direction`: `increase | decrease`
125
  - `magnitude`: `small | medium | large`
126
 
127
- For `run`, the controllable parameter should be one of:
128
-
129
- - `aspect_ratio`
130
- - `elongation`
131
- - `rotational_transform`
132
- - `triangularity_scale`
133
-
134
- This keeps the environment human-playable and aligned with the historical low-dimensional P1 path.
135
-
136
- Current repo state:
137
 
138
- - the live action schema now exposes:
139
- - `aspect_ratio`
140
- - `elongation`
141
- - `rotational_transform`
142
- - `triangularity_scale`
143
 
144
- ## Observation Contract
145
 
146
- The observation should stay metric-centered and human-readable.
147
 
148
- Keep:
149
 
150
  - `max_elongation`
151
  - `aspect_ratio`
@@ -167,113 +108,114 @@ Keep:
167
  - `target_spec`
168
  - `diagnostics_text`
169
 
170
- Add clarity about fidelity:
171
 
172
- - low-fidelity step-time metrics should be labeled as such
173
- - high-fidelity submit-time metrics should be labeled as such
174
- - do not expose them as if they are the same truth surface
175
- - the live runtime should expose separate low-fidelity and high-fidelity best-state fields instead of overloading one generic best-state metric
176
 
177
- This can be done either by:
178
 
179
- - separate observation fields, or
180
- - explicit fidelity labels in `diagnostics_text`
 
 
 
 
181
 
182
- The minimum requirement is that a reader can tell whether a metric came from low-fi `run` or high-fi `submit`.
183
 
184
- Current repo state:
 
 
 
185
 
186
- - the live observation surface now exposes evaluation fidelity and failure state
187
- - the live observation surface now exposes separate low-fidelity and high-fidelity best-state fields
188
- - terminal reward/reporting is now fidelity-consistent: `submit` compares against high-fi reference state instead of low-fi rollout score state
189
 
190
- ## Reward V0
191
 
192
- Keep reward mostly scalar and verifier-driven.
 
 
 
193
 
194
- Target structure:
195
 
196
- - infeasible to feasible crossing:
197
- - clear positive bonus
198
- - feasible to infeasible regression:
199
- - clear negative penalty
200
- - both infeasible:
201
- - reward reduction in official feasibility scalar
202
- - both feasible:
203
- - reward lower `max_elongation`
204
- - non-submit step:
205
- - small step cost
206
- - recovery after a failed evaluation:
207
- - modest positive signal for returning to a valid verifier result
208
- - do not compute this from the failed sentinel feasibility value
209
- - explicit `submit`:
210
- - better than passive budget exhaustion when the design is improved
211
 
212
- Do not add:
213
 
214
- - reward terms tied to specific Fourier modes
215
- - bonuses for matching a known winner
216
- - hand-coded constraint tricks to hide a blocked action family
 
 
 
 
 
 
 
 
 
 
217
 
218
- Do not use reward complexity to compensate for missing action expressivity or missing crash semantics.
219
 
220
- Additional fidelity rule:
221
 
222
- - do not compare a high-fidelity submit score against low-fidelity baseline state
223
- - terminal reward and submit summaries should use a fidelity-consistent basis
224
 
225
- ## Reset Strategy
226
 
227
- Start with frozen exact seeds, not jitter.
 
 
 
 
 
 
228
 
229
- Reset pool policy:
230
 
231
- - `n_field_periods = 3`
232
- - small frozen seed set
233
- - each seed must be:
234
- - reproducible
235
- - near enough to the feasible boundary that 6 steps is worth testing
236
- - not already solved
237
 
238
- Add bounded jitter only if memorization becomes a real problem.
239
 
240
- ## Manual Playtest Gate
241
 
242
- Do not move to heuristic redesign or reward tuning until this gate is passed.
 
 
243
 
244
- Manual playtest questions:
245
 
246
- - can a human tell which constraint is currently blocking progress?
247
- - can a human choose a plausible next action?
248
- - can a human reach or approach feasibility within the budget?
249
- - does `submit` feel meaningfully different from passive exhaustion?
250
 
251
- If the answer is no, fix:
252
 
253
- - the boundary family
254
- - the step magnitudes
255
- - the seed pool
256
- - the observation semantics around low-fi vs high-fi best-state reporting
257
 
258
- before tuning reward further
259
 
260
- ## Implementation Order
261
 
262
- 1. Repair the low-dimensional boundary builder in [server/physics.py](../server/physics.py).
263
- 2. Split boundary construction from official boundary evaluation in [server/physics.py](../server/physics.py).
264
- 3. Update the action and state schema in [fusion_lab/models.py](../fusion_lab/models.py).
265
- 4. Update the episode loop and observation labeling in [server/environment.py](../server/environment.py).
266
- 5. Update the task summary and public action description in [server/app.py](../server/app.py).
267
- 6. Add explicit VMEC failure semantics in [server/environment.py](../server/environment.py).
268
- 7. Run a small measured sweep to choose ranges, deltas, and reset seeds.
269
- 8. Verify that observation semantics are human-readable and that low-fi versus high-fi best-state reporting is explicit.
270
- 9. Freeze 1-2 repaired low-dimensional fixtures.
271
- 10. Run manual playtesting.
272
- 11. Refresh the heuristic baseline only after that evidence exists.
273
 
274
- ## Out of Scope
275
 
276
- - full Fourier-mode action space as the primary environment
277
  - porting the old `ai-sci-feasible-designs` harness
278
- - making reward more complex before the repaired low-dimensional family exists
279
- - building a full benchmark split protocol before the environment is even playable
 
 
1
  # P1 Environment Contract V1
2
 
3
+ **Role:** Live technical contract SSOT for the current implementation phase
4
+ **Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](FUSION_DESIGN_LAB_PLAN_V2.md)
5
+ **Evidence dependency:** [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md)
6
 
7
+ ## 1. Scope
8
 
9
+ This document defines the live technical contract for:
10
 
11
+ - [`server/physics.py`](../server/physics.py)
12
+ - [`fusion_lab/models.py`](../fusion_lab/models.py)
13
+ - [`server/environment.py`](../server/environment.py)
14
+ - [`server/app.py`](../server/app.py)
15
 
16
+ If the observation schema, action schema, episode flow, terminal conditions, or reward semantics change, update this file in the same task.
17
 
18
+ ## 2. Design Split
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
20
  Keep three layers separate:
21
 
22
+ 1. boundary builder
23
+ 2. official verifier
24
+ 3. environment
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
+ Boundary builder owns:
27
 
28
+ - the repaired low-dimensional family
29
+ - rotating-ellipse seed generation
30
+ - explicit triangularity control injection
31
 
32
+ Official verifier owns:
33
 
34
+ - boundary in, metrics out
 
35
  - official `P1` feasibility semantics
36
+ - objective direction and score ordering
37
+ - low-fidelity and high-fidelity evaluation modes
38
  - explicit failure results when VMEC or forward-model evaluation fails
39
 
40
+ Environment owns:
41
 
42
+ - reset pool
43
+ - discrete actions
44
  - episode budget
45
+ - best-state tracking
46
  - reward shaping
 
47
 
48
+ ## 3. Boundary Family
49
 
50
+ The historical 3-knob upstream rotating-ellipse family is not the live contract.
51
 
52
+ The live controllable knobs are:
53
 
54
  - `aspect_ratio`
55
  - `elongation`
56
  - `rotational_transform`
57
  - `triangularity_scale`
58
 
59
+ Rules:
60
 
61
+ - stay low-dimensional and human-playable
62
+ - treat the current family as rotating-ellipse-derived, not plain upstream rotating ellipse
63
+ - the coarse measured sweep is now recorded, but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks
64
 
65
+ ## 4. Action Contract
66
 
67
+ `intent` is one of:
 
68
 
69
+ - `run`
70
+ - `submit`
71
+ - `restore_best`
72
 
73
+ For `run`, the action also includes:
74
 
75
+ - `parameter`: one of `aspect_ratio | elongation | rotational_transform | triangularity_scale`
76
  - `direction`: `increase | decrease`
77
  - `magnitude`: `small | medium | large`
78
 
79
+ Constraints:
 
 
 
 
 
 
 
 
 
80
 
81
+ - keep the discrete interaction style
82
+ - do not expose the full Fourier action space as the primary environment
83
+ - do not use action complexity to compensate for missing clarity elsewhere
 
 
84
 
85
+ ## 5. Observation Contract
86
 
87
+ The observation must stay metric-centered and human-readable.
88
 
89
+ Required fields:
90
 
91
  - `max_elongation`
92
  - `aspect_ratio`
 
108
  - `target_spec`
109
  - `diagnostics_text`
110
 
111
+ Interpretation rules:
112
 
113
+ - low-fidelity `run` metrics must be labeled as low-fidelity
114
+ - high-fidelity `submit` metrics must be labeled as high-fidelity
115
+ - low-fidelity and high-fidelity best-state reporting must stay separate
116
+ - the observation must be understandable without hidden state
117
 
118
+ ## 6. Episode Flow
119
 
120
+ 1. Reset from one frozen repaired-family seed or a small frozen seed set.
121
+ 2. Evaluate the initial state with low fidelity and return the first observation.
122
+ 3. On `run`, perturb one controllable parameter and re-evaluate with low fidelity.
123
+ 4. On `restore_best`, revert to the best known low-fidelity state, re-evaluate, and consume budget.
124
+ 5. On `submit`, end the episode and run the high-fidelity submit evaluation.
125
+ 6. End the episode on `submit` or budget exhaustion.
126
 
127
+ Failure semantics:
128
 
129
+ - failed evaluations still consume budget
130
+ - failed evaluations produce visible failure observations
131
+ - failed evaluations apply a documented penalty
132
+ - the environment must not silently convert failures into success paths
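The budget, terminal, and failure semantics above can be summarized in a toy skeleton. This is illustrative only: the real loop lives in `server/environment.py`, and the class name and penalty constant here are assumptions.

```python
class ToyEpisode:
    """Toy skeleton of the episode flow; names and penalty value are assumptions."""

    def __init__(self, budget=6):
        self.budget = budget
        self.done = False

    def step(self, intent, eval_failed=False):
        assert not self.done, "post-terminal step() must be rejected"
        self.budget -= 1                        # every action consumes budget,
        reward = -0.05 if eval_failed else 0.0  # failed evaluations included
        if intent == "submit" or self.budget == 0:
            self.done = True                    # submit or exhaustion terminates
        return {"steps_left": self.budget, "done": self.done, "reward": reward}
```

The key property is that failures are visible, penalized, and budget-consuming rather than silently retried.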
133
 
134
+ ## 7. Terminal Contract
 
 
135
 
136
+ At termination, the environment must provide:
137
 
138
+ - final best design metrics
139
+ - final feasibility status
140
+ - total reward
141
+ - a short human-readable trajectory summary
142
 
143
+ Terminal reporting rules:
144
 
145
+ - keep submit-time reporting fidelity-consistent
146
+ - do not compare high-fidelity submit results against low-fidelity baseline state as if they were the same truth surface
 
 
 
 
 
 
 
 
 
 
 
 
 
147
 
148
+ ## 8. Verifier Contract
149
 
150
+ The verifier of record is `constellaration.problems.GeometricalProblem`.
151
+
152
+ The implementation must preserve:
153
+
154
+ - objective direction
155
+ - constraint direction
156
+ - feasibility semantics
157
+ - score ordering
158
+
159
+ The verifier should stay boundary-based:
160
+
161
+ - `build_boundary_from_params(...) -> SurfaceRZFourier`
162
+ - `evaluate_boundary(boundary, fidelity) -> EvaluationMetrics`
163
 
164
+ Do not treat parameterization-specific logic as verifier truth.
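The fidelity-labeling requirement can be sketched as a small result container. This is a hypothetical shape: the real `EvaluationMetrics` in `server/physics.py` may differ, and `EvalResult` and its fields are illustrative names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalResult:
    """Hypothetical verifier result shape; the repo's EvaluationMetrics may differ."""
    fidelity: str          # "low" for run steps, "high" for submit
    feasible: bool
    max_elongation: float
    failed: bool = False   # explicit VMEC / forward-model failure flag

    def label(self):
        # keep the two truth surfaces visibly separate in any reporting
        tag = "high-fi submit" if self.fidelity == "high" else "low-fi run"
        status = "FAILED" if self.failed else f"feasible={self.feasible}"
        return f"[{tag}] {status} max_elongation={self.max_elongation:.3f}"
```

Any terminal summary built from such results stays fidelity-consistent by construction.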
165
 
166
+ ## 9. Reward V0
167
 
168
+ `Reward V0` is the live reward contract until playtesting proves a concrete pathology.
 
169
 
170
+ Target behavior:
171
 
172
+ - infeasible to feasible crossing gets a clear positive bonus
173
+ - feasible to infeasible regression gets a clear penalty
174
+ - when both states are infeasible, reduced official feasibility violation should help
175
+ - when both states are feasible, lower `max_elongation` should help
176
+ - non-submit actions pay a small step cost
177
+ - `submit` should be better than passive exhaustion when the design is genuinely improved
178
+ - recovery after a failed evaluation may receive a modest bounded bonus
179
 
180
+ Rules:
181
 
182
+ - keep reward scalar and verifier-driven
183
+ - do not add mode-specific or parameter-specific reward shaping
184
+ - do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
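The target behavior above maps to a single scalar shaping function. This is a minimal sketch: the constants (`step_cost`, `cross_bonus`) and the function name are assumptions, not the live implementation.

```python
def reward_v0(prev, curr, intent, step_cost=0.01, cross_bonus=1.0):
    """Sketch of the Reward V0 shaping described above; constants are illustrative."""
    r = 0.0 if intent == "submit" else -step_cost   # small non-submit step cost
    if not prev["feasible"] and curr["feasible"]:
        r += cross_bonus                             # crossing into feasibility
    elif prev["feasible"] and not curr["feasible"]:
        r -= cross_bonus                             # regression penalty
    elif not curr["feasible"]:
        # both infeasible: reward reduced official feasibility violation
        r += prev["violation"] - curr["violation"]
    else:
        # both feasible: reward lower max_elongation
        r += prev["max_elongation"] - curr["max_elongation"]
    return r
```

Everything is driven by verifier scalars; no term depends on a specific mode or parameter.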
 
 
 
185
 
186
+ ## 10. Reset and Fixture Policy
187
 
188
+ Reset policy:
189
 
190
+ - start with exact frozen seeds
191
+ - keep `n_field_periods = 3`
192
+ - prefer a small reproducible seed set
193
 
194
+ Each seed should be:
195
 
196
+ - reproducible
197
+ - near enough to the feasible boundary to make the budget meaningful
198
+ - not already solved
 
199
 
200
+ Fixture policy:
201
 
202
+ - track good, boundary, and clearly bad references
203
+ - use fixtures for verifier and reward sanity checks
204
+ - do not turn fixture mining into a separate broad project
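A frozen seed pool under this policy can be as small as the sketch below. The parameter values are placeholders, not the measured defaults, and the names are assumptions; the real pool is pending the paired high-fidelity fixture checks.

```python
# Illustrative frozen seed pool; values are placeholders, not measured defaults.
FROZEN_SEEDS = (
    {"aspect_ratio": 3.6, "elongation": 1.4,
     "rotational_transform": 1.6, "triangularity_scale": 0.55,
     "n_field_periods": 3},
)

def reset_seed(episode_idx):
    # deterministic, reproducible selection -- no jitter in V1
    return dict(FROZEN_SEEDS[episode_idx % len(FROZEN_SEEDS)])
```

Returning a copy keeps the frozen pool immune to in-episode mutation.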
 
205
 
206
+ ## 11. Open Measurements
207
 
208
+ These items remain open until measured on the repaired family:
209
 
210
+ - exact repaired-family range bounds
211
+ - exact `triangularity_scale` deltas
212
+ - exact `rotational_transform` bounds
213
+ - exact reset seed pool
214
+ - whether the budget should stay at 6 or change
 
 
 
 
 
 
215
 
216
+ ## 12. Out of Scope
217
 
 
218
  - porting the old `ai-sci-feasible-designs` harness
219
+ - broad Fourier-mode action space as the main environment
220
+ - complicated reward shaping before playtest evidence
221
+ - a wider task family than the single stellarator environment
docs/P1_PARAMETERIZATION_DEEPDIVE.md CHANGED
@@ -1,447 +1,156 @@
1
  # P1 Parameterization Deep-Dive
2
 
3
- **Date:** 2026-03-07
4
- **Status:** Findings complete. Parameterization repair is implemented; measured sweep follow-up is pending.
 
5
 
6
- This document records the investigation into why the current 3-knob rotating-ellipse
7
- environment cannot produce P1-feasible designs, what the original winning session
8
- actually did, and the validated plan for fixing it.
9
 
10
- ---
 
 
 
11
 
12
- ## 1. The Structural Blocker
13
 
14
  ### Symptom
15
 
16
- The environment's 3-parameter action space (`aspect_ratio`, `elongation`,
17
- `rotational_transform`) cannot satisfy the P1 constraints regardless of parameter
18
- values.
19
 
20
- ### Evidence
21
-
22
- A 125-point grid sweep over the full 3-knob range with the real `constellaration`
23
- verifier:
24
-
25
- ```
26
- aspect_ratio ∈ [2.0, 8.0] (5 values)
27
- elongation ∈ [1.0, 5.0] (5 values)
28
- rot_transform ∈ [0.1, 1.0] (5 values)
29
- n_field_periods = 3
30
- ```
31
-
32
- **Result: 0/125 feasible.** Every configuration produced:
33
-
34
- - `average_triangularity ≈ +0.005` (constraint requires `≤ -0.5`, gap of ~0.505)
35
- - `edge_iota_over_nfp ≈ 0.05-0.22` (constraint requires `≥ 0.3`)
36
-
37
- Varying `n_field_periods` (3, 4, 5) did not change the result. The
38
- `generate_rotating_ellipse` function structurally produces near-zero triangularity
39
- regardless of its input parameters.
40
-
41
- ### Root cause
42
-
43
- `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform,
44
- n_field_periods)` sets Fourier coefficients that define the boundary shape. The
45
- `m=2, n=0` mode (which controls triangularity) is not meaningfully set by any of the
46
- three input parameters. Triangularity is structurally fixed near zero.
47
-
48
- The `rotational_transform` range `[0.1, 1.0]` is also too low. Even with injected
49
- triangularity, `edge_iota_over_nfp` doesn't reach 0.3 until `rotational_transform ≈ 1.5+`.
50
-
51
- ---
52
-
53
- ## 2. The Original Winning Session
54
-
55
- ### Source
56
-
57
- The original session that found P1-feasible designs is documented in:
58
-
59
- ```
60
- ai-sci-feasible-designs/docs/harness/raw-session.md
61
- ```
62
-
63
- Session: `rollout-2026-01-05T10-43-45-019b8bd3-14d6-7253-8235-f732ee43d683.jsonl`
64
- (25,012 lines, 200 agent messages, Jan 5-9 2026)
65
-
66
- ### What the agent actually did for P1
67
-
68
- 1. Built `scripts/search_p1_lowdim.py` — a rotating-ellipse sweep with a **4th knob**
69
- 2. Found 3 feasible designs (feasibility 0.0) within ~20 minutes
70
- 3. Refined with trust-region local optimizer around those seeds
71
- 4. Downloaded scadena-pf leaderboard seed from HuggingFace as anchor
72
- 5. Ran `scripts/p1_alm_ngopt_multifidelity.py` (ALM + NGOpt multi-fidelity)
73
- 6. Final result: score 0.970141, beating leaderboard 0.969457
74
-
75
- ### The 4th knob: `tri_scale`
76
-
77
- The script `search_p1_lowdim.py` (recovered from git at commit `300c191`) does NOT
78
- use the raw `generate_rotating_ellipse` output. After generating the base shape, it:
79
-
80
- 1. Expands Fourier resolution: `set_max_mode_numbers(surface, max_poloidal_mode=3, max_toroidal_mode=3)`
81
- 2. Injects triangularity: `r_cos[2, center] = -tri_scale * minor_radius`
82
- 3. Cleans auxiliary modes: `r_cos[0, :center] = 0.0`, `z_sin[0, :center+1] = 0.0`
83
- 4. Returns the modified `SurfaceRZFourier`
84
-
85
- The `tri_scale` knob directly controls the `m=2, n=0` Fourier mode, which is what
86
- drives `average_triangularity` in the physics evaluation. This is the missing piece.
87
-
88
- ### Parameter ranges from the original script
89
-
90
- ```python
91
- aspect_ratio: [3.0, 3.6]
92
- elongation: [1.4, 2.2]
93
- rotational_transform: [1.5, 2.2] # NOTE: much higher than our [0.1, 1.0]
94
- tri_scale: [0.55, 0.8]
95
- ```
96
-
97
- ---
98
-
99
- ## 3. The Harness Campaign Results
100
-
101
- ### All campaigns found zero feasible designs
102
-
103
- Queried SQLite databases across all P1 runs in `ai-sci-feasible-designs/runs/`:
104
-
105
- | Run | Candidates | Feasible | Best feasibility |
106
- |-----|-----------|----------|-----------------|
107
- | p1_campaign | 58 | 0 | 0.615 |
108
- | p1_campaign_v2 | 50 | 0 | 0.416 |
109
- | p1_e2e_validate | 7 | 0 | 0.639 |
110
- | p1_live | 23 | 0 | 0.569 |
111
-
112
- The campaign candidates used full Fourier boundaries (`5x9` arrays, `n_field_periods=5`),
113
- not the low-dimensional rotating-ellipse family.
114
-
115
- ### Postmortem diagnosis (from P1_CAMPAIGN_POSTMORTEM.md)
116
-
117
- The campaign's `_COLD_START_GUIDANCE` told the LLM "Do NOT generate from scratch" and
118
- "Start with small perturbations (0.01-0.05)." The winning raw-session agent did the
119
- exact opposite: broad sweeps, large parameter variations, rotating-ellipse seeds. The
120
- guidance actively prohibited the winning strategy.
121
-
122
- ---
123
-
124
- ## 4. Live Sweep Validation
125
 
126
- ### 4-knob sweep with `tri_scale` injection
-
- A 256-point grid sweep using the same boundary construction as `search_p1_lowdim.py`:
-
- ```
- aspect_ratio    ∈ [3.2, 3.8]  step 0.2  (4 values)
- elongation      ∈ [1.2, 1.8]  step 0.2  (4 values)
- rot_transform   ∈ [1.2, 1.8]  step 0.2  (4 values)
- tri_scale       ∈ [0.4, 0.7]  step 0.1  (4 values)
- n_field_periods = 3
- mpol = 3, ntor = 3
- ```
-
- **Result from the recorded sweep:**
-
- ```
- Total configs: 256
- Evaluated: 228 (VMEC succeeded)
- Crashed: 28 (VMEC solver failed)
- Feasible: 10
- Crash rate: 10.9%
- Feasibility rate (of evaluated): 4.4%
- ```
-
- ### Top feasible results
-
- | AR | elong | rot_t | tri_s | AR_out | tri | iota/nfp | elong_out | feas | ok |
- |----|-------|-------|-------|--------|-----|----------|-----------|------|----|
- | 3.6 | 1.4 | 1.6 | 0.60 | 3.287 | -0.5003 | 0.3005 | 7.3751 | 0.0000 | YES |
- | 3.6 | 1.4 | 1.8 | 0.60 | 3.287 | -0.5003 | 0.3481 | 8.9318 | 0.0000 | YES |
- | 3.8 | 1.4 | 1.6 | 0.60 | 3.487 | -0.5003 | 0.3256 | 7.9202 | 0.0000 | YES |
- | 3.8 | 1.6 | 1.6 | 0.60 | 3.474 | -0.5037 | 0.3168 | 8.0626 | 0.0000 | YES |
- | 3.8 | 1.8 | 1.6 | 0.60 | 3.459 | -0.5075 | 0.3097 | 8.2033 | 0.0000 | YES |
- | 3.4 | 1.2 | 1.8 | 0.60 | 3.096 | -0.4977 | 0.3276 | 8.0849 | 0.0046 | YES |
- | 3.8 | 1.2 | 1.6 | 0.60 | 3.496 | -0.4977 | 0.3345 | 7.7908 | 0.0046 | YES |
- | 3.6 | 1.2 | 1.8 | 0.60 | 3.296 | -0.4977 | 0.3535 | 8.8140 | 0.0046 | YES |
- | 3.2 | 1.2 | 1.8 | 0.60 | 2.896 | -0.4977 | 0.2995 | 7.4314 | 0.0046 | YES |
- | 3.6 | 1.2 | 1.6 | 0.60 | 3.296 | -0.4977 | 0.3105 | 7.2363 | 0.0046 | YES |
-
- **Key observations:**
-
- - All feasible designs have `tri_scale = 0.60`
- - `rot_transform ∈ {1.6, 1.8}` only — lower values never reach feasibility
- - `average_triangularity` clusters at `-0.500` to `-0.508` (right at the constraint)
- - Triangularity is always the binding constraint (within the 1% tolerance)
- - `max_elongation` ranges from 7.2 to 8.9 (score ~0.31 to ~0.12), leaving significant
-   room for optimization compared to the winning score of 0.970 (elongation 1.27)
-
- ### Crash rate by `rotational_transform`
-
- | rot_transform | crash rate | feasible |
- |---------------|-----------|----------|
- | 1.2 | 0% (0/64) | 0 |
- | 1.4 | 0% (0/64) | 0 |
- | 1.6 | 0% (0/64) | 6 |
- | 1.8 | 44% (28/64) | 4 |
-
- **`rot_transform = 1.6` is the sweet spot:** zero crashes, highest feasible count.
-
- ### VMEC crash zone
-
- Tested the extreme region (`rot_transform [2.0, 2.4]`, `tri_scale [0.7, 0.9]`):
-
- ```
- 600/600 crashed (100%)
- ```
-
- The VMEC solver fails universally when the boundary is too distorted. The crash boundary
- is approximately `rot_transform ≥ 2.0` combined with `tri_scale ≥ 0.7`.

- ---
-
- ## 5. Verifier Analysis
-
- ### What's correct in the current verifier (`server/physics.py`)
-
- 1. **`_to_evaluation_metrics`** uses the `GeometricalProblem` public API:
-    - `problem.is_feasible(metrics)` — applies the 1% tolerance internally
-    - `problem.compute_feasibility(metrics)` — infinity-norm of normalized violations
-    - `problem.get_objective(metrics)` — returns `(max_elongation, minimize=True)`
-
- 2. **`_score_from_objective`** matches the official formula:
-    `score = 1 - clip((max_elongation - 1) / 9, 0, 1)`
-
- 3. **Multi-fidelity split** is correct:
-    - `run` actions use low-fidelity VMEC (~0.6s per eval)
-    - `submit` uses high-fidelity VMEC (~24s per eval)
-
- 4. **Constraint constants** match the official P1 definition:
-    - `aspect_ratio 4.0`
-    - `average_triangularity -0.5`
-    - `edge_rotational_transform / n_field_periods 0.3`
-
- ### Current implementation
-
- - The old `evaluate_params` helper has been retired.
- - The runtime is now split into:
-   - `build_boundary_from_params(...)` → `SurfaceRZFourier` (handles mode expansion + tri_scale injection)
-   - `evaluate_boundary(boundary, fidelity)` → `EvaluationMetrics` (pure evaluation, no parameterization knowledge)
-
- ---
-
- ## 6. Reward Analysis
-
- ### Reward V0 structure (current, in `server/environment.py`)
-
- ```
- Feasibility transition: ±3.0 on crossing the feasible/infeasible boundary
- Dual-track step shaping:
-   feasible + feasible → (prev_elongation - curr_elongation) * 10.0
-   otherwise           → (prev_feasibility - curr_feasibility) * 5.0
- Post-failure recovery: +1.0 on the first successful step after a failed evaluation
- Per-step cost: -0.1 for non-submit actions
- Terminal bonus (submit): 5.0 * improvement_ratio + budget_efficiency
- Terminal bonus (exhaust): 2.0 * improvement_ratio
- Not improved penalty: -1.0 (submit) / -0.5 (exhaust)
  ```
-
- ### Assessment
-
- **The reward is still simple and should stay largely unchanged.** It mostly uses two scalars
- from the verifier: `feasibility` and `objective (max_elongation)`. These are
- problem-agnostic quantities that `GeometricalProblem` provides for any problem variant.
-
- One small exception is now explicit: recovery from a failed VMEC evaluation gets a
- modest fixed bonus instead of comparing against the failure sentinel. The previous
- behavior could erase the recovery signal by comparing a successful step against itself,
- while a naive sentinel comparison would explode the reward into an unbounded spike.
-
- Things the reward correctly avoids:
- - Per-constraint shaping (would overfit to P1's specific constraint structure)
- - Tolerance-exploit bonus (would overfit to the 1% evaluator quirk)
- - Mode-specific or parameter-specific weighting
- - Any knowledge of which knob controls which metric
-
- **One thing to monitor during playtesting:** the `5.0` multiplier on feasibility shaping
- may need tuning once the action space changes. Mode perturbations produce different
- feasibility deltas per step than the old 3-knob steps. But tune from playtest data,
- not from theory.
-
- ---
-
- ## 7. Findings from the P1 Score Chase Notes
-
- From `ai-sci-feasible-designs/docs/P1_SCORE_CHASE_NOTES.md`:
-
- ### Best submission metrics (high-fidelity)
-
- ```
- max_elongation = 1.266744 → score = 0.970362
- aspect_ratio = 3.999377 (tight, but feasible)
- average_triangularity = -0.495236 (normalized violation ≈ 0.009529)
- iota/nfp = 0.298946 (normalized violation ≈ 0.003515)
- feasibility = 0.009529 (feasible due to 1% tolerance)
  ```

- ### Key patterns
-
- - The winning region is "thin": improving elongation pushes triangularity and/or iota
-   over the constraint cliff
- - The **1% feasibility tolerance** is a first-order effect: the best scores come from
-   intentionally pushing constraints to the edge of tolerance to squeeze more elongation
-   reduction
- - Triangularity is usually the binding constraint near the best scores; iota is second
- - The best submission used a full Fourier boundary (`mpol=8, ntor=8`, `n_field_periods=3`)
-   refined through multiple optimization stages, not the low-dimensional parameterization
-
- ### What this means for our environment
-
- Our feasible designs from the 4-knob sweep have `max_elongation ≈ 7.2-8.9`
- (score ≈ 0.12-0.31). The winning submission has `max_elongation = 1.27` (score 0.97).
- The gap is large — the 4-knob family can reach feasibility but cannot reach competitive
- scores. This is expected: the environment is a stepping stone for learning
- constraint satisfaction and basic optimization, not a path to the leaderboard.
-
- ---
-
- ## 8. Anti-Overfitting Design
-
- ### What constitutes overfitting in this context
-
- - Per-constraint reward weighting (e.g., "triangularity progress is worth 2x")
- - Reward bonuses for exploiting the 1% tolerance
- - Action-space design that hardcodes which modes matter
- - Single fixed starting state that agents memorize
-
- ### Agreed anti-overfitting levers
-
- 1. **Multiple frozen reset seeds first** (start with exact seeds; add bounded jitter only
-    if memorization becomes a real problem)
- 2. **Held-out evaluation seeds** (test generalization, not memorization)
- 3. **Reward based on official scalars only** (feasibility + objective, not per-constraint)
- 4. **Domain knowledge in initial state, not reward** (good baseline params in `reset()`,
-    not constraint-specific shaping in `_compute_reward`)
- 5. **Known winners as calibration fixtures only**, not optimization targets
-
- ---
-
- ## 9. Agreed Plan
-
- ### What everyone agrees on
-
- 1. **4-knob low-dimensional action space**: `aspect_ratio`, `elongation`,
-    `rotational_transform`, `triangularity_scale`
- 2. **Boundary-based verifier**: `build_boundary_from_params(...)` + `evaluate_boundary(...)`
- 3. **Explicit VMEC crash handling**: treat solver failures as bad-but-not-fatal
- 4. **Reward V0 unchanged in spirit**: feasibility-first, scalar-only, no P1-specific shaping
- 5. **Fidelity labeling in observation**: distinguish low-fi vs high-fi metrics
-
- ### What is deferred until after build + playtest
-
- - Exact `rotational_transform` range bounds (data suggests ~[1.2, 1.9] is useful)
- - Exact `triangularity_scale` delta values
- - Seed pool construction (needs an empirical sweep in the repaired parameterization)
- - Whether the budget should be 6 or 8
- - Any Reward V1 changes
-
- ### Implementation order
-
- 1. `server/physics.py` — boundary-based verifier interface
- 2. `fusion_lab/models.py` — action/observation/state for 4 knobs
- 3. `server/environment.py` — reset with seed pool, discrete knob perturbations, VMEC crash handling
- 4. `server/app.py` — expose new action schema
- 5. `baselines/` — random + heuristic (repair feasibility first, then reduce elongation)
- 6. Manual playtest — verify the budget is sufficient, tune ranges/deltas/seeds empirically
-
- ---
-
- ## 10. Cross-Validation Record
-
- This plan was cross-validated with an independent agent that:
-
- - Independently confirmed the 3-knob blocker
- - Independently confirmed the historical `tri_scale` implementation detail
- - Reproduced VMEC crash failures
- - Validated the layer decomposition (verifier / environment / parameterization)
-
- ### Pushbacks from the cross-validation agent and resolution
-
- | Pushback | Verdict | Resolution |
- |----------|---------|------------|
- | "10/228 feasible is unverified" | **Partially addressed.** The recorded sweep found feasible points in the repaired 4-knob family, but this exact count should be treated as an artifact-backed result, not a free-floating fact. | Keep the sweep note, and link or preserve the underlying artifact if this exact count will be cited elsewhere. |
- | "rt=1.8 comfortably feasible is too strong" | **Partially valid.** 44% crash rate at 1.8. | rt=1.6 is the true sweet spot: 0% crashes, 6 feasible. |
- | "Delta values are design proposals, not facts" | **Valid.** | Defer to post-build playtesting. |
- | "Seed pool not empirically validated" | **Valid.** | Methodology sound, execution pending. |
- | "Budget change to 8-10 is speculative" | **Valid.** | Keep 6 until playtest proves otherwise. |
-
- ---
-
- ## Appendix A: Key File Locations
-
- ### Fusion Design Lab (this repo)
-
- ```
- server/physics.py     — verifier (needs boundary-based refactor)
- server/environment.py — environment loop + reward V0
- fusion_lab/models.py  — action/observation/state schemas
- server/app.py         — FastAPI endpoints
- baselines/            — random + heuristic agents
- ```
-
- ### ai-sci-feasible-designs (reference repo)
-
- ```
- docs/harness/raw-session.md            — original winning session narrative
- docs/harness/P1_CAMPAIGN_POSTMORTEM.md — why campaigns found 0 feasible
- docs/P1_SCORE_CHASE_NOTES.md           — best P1 score details + approach
- scripts/search_p1_lowdim.py            — the 4-knob sweep script (git: 300c191)
- scripts/p1_alm_ngopt_multifidelity.py  — ALM+NGOpt optimizer (git: aba75b7)
- runs/p1_campaign*/world.sqlite         — campaign evaluation databases
- ```
-
- ## Appendix B: Boundary Construction Reference
-
- The 4-knob boundary construction, as implemented in the original
- `search_p1_lowdim.py`:
-
- ```python
- from constellaration.initial_guess import generate_rotating_ellipse
- from constellaration.geometry import surface_rz_fourier
- from constellaration.geometry.surface_rz_fourier import SurfaceRZFourier
- import numpy as np
-
- def build_boundary(aspect_ratio, elongation, rotational_transform, tri_scale, nfp=3):
-     # 1. Generate base rotating-ellipse shape
-     surface = generate_rotating_ellipse(
-         aspect_ratio=aspect_ratio,
-         elongation=elongation,
-         rotational_transform=rotational_transform,
-         n_field_periods=nfp,
-     )
-
-     # 2. Expand to higher Fourier modes
-     surface = surface_rz_fourier.set_max_mode_numbers(
-         surface, max_poloidal_mode=3, max_toroidal_mode=3,
-     )
-
-     # 3. Inject triangularity via the m=2, n=0 Fourier mode
-     r_cos = np.asarray(surface.r_cos, dtype=float).copy()
-     z_sin = np.asarray(surface.z_sin, dtype=float).copy()
-     center = r_cos.shape[1] // 2
-     minor = float(r_cos[1, center])
-
-     r_cos[2, center] = -tri_scale * minor
-
-     # 4. Clean auxiliary modes
-     r_cos[0, :center] = 0.0
-     z_sin[0, :center + 1] = 0.0
-
-     return SurfaceRZFourier(
-         r_cos=r_cos.tolist(),
-         z_sin=z_sin.tolist(),
-         n_field_periods=nfp,
-         is_stellarator_symmetric=True,
-     )
- ```
-
- Key details:
- - `r_cos[1, center]` is the minor radius of the base shape
- - `r_cos[2, center]` is the `m=2, n=0` Fourier coefficient (controls triangularity)
- - Setting it to `-tri_scale * minor` produces negative triangularity proportional to `tri_scale`
- - The auxiliary mode cleanup ensures the boundary is well-conditioned for VMEC

  # P1 Parameterization Deep-Dive

+ **Date:** March 7, 2026
+ **Role:** Evidence and rationale record
+ **Status:** Supporting doc, not a live planning or contract SSOT

+ This document records the durable evidence behind the repaired low-dimensional `P1` environment:

+ - why the historical 3-knob family failed
+ - what the original winning session actually did
+ - what the recorded 4-knob sweep proved
+ - why the current environment is intentionally a playable stepping stone rather than a leaderboard-matching optimizer

+ ## 1. Structural Blocker

  ### Symptom

+ The old 3-parameter action space:

+ - `aspect_ratio`
+ - `elongation`
+ - `rotational_transform`

+ could not satisfy the `P1` constraints under the real `constellaration` verifier path.
+ ### Evidence

+ A 125-point grid sweep over the historical 3-knob range produced `0/125` feasible designs.

+ Observed behavior:

+ - `average_triangularity` stayed near `+0.005`
+ - `p1_feasibility` stayed near `1.00995`
+ - varying `n_field_periods` did not resolve the blocker
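Those recorded numbers are internally consistent. A minimal arithmetic check, assuming each violation is normalized by the constraint magnitude (an assumption that matches the recorded values; the official implementation lives in `constellaration`):

```python
def p1_normalized_violations(aspect_ratio, average_triangularity, edge_iota_over_nfp):
    """Normalized P1 constraint violations (normalization by constraint magnitude is assumed)."""
    return {
        "aspect_ratio": max(0.0, (aspect_ratio - 4.0) / 4.0),
        "average_triangularity": max(0.0, (average_triangularity - (-0.5)) / 0.5),
        "edge_iota_over_nfp": max(0.0, (0.3 - edge_iota_over_nfp) / 0.3),
    }

# Triangularity stuck near +0.005 while the other constraints are satisfied:
violations = p1_normalized_violations(3.5, 0.004975, 0.35)
feasibility = max(violations.values())  # infinity-norm of violations -> 1.00995
```

With triangularity pinned slightly positive, the infinity-norm is dominated by that single term, which is exactly the recorded `p1_feasibility ≈ 1.00995`.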

+ ### Root Cause

+ `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` does not meaningfully expose the Fourier mode that controls triangularity.

+ The historical `rotational_transform` range was also too low to reliably reach the `edge_iota_over_nfp >= 0.3` requirement.

+ ## 2. Original Winning Session

+ The original successful `P1` path in `ai-sci-feasible-designs` did not rely on the raw 3-knob family alone.

+ The winning session:

+ 1. built a low-dimensional sweep with a fourth knob
+ 2. found feasible seeds quickly
+ 3. refined around those seeds with stronger optimizers
+ 4. used leaderboard-quality anchors later in the pipeline

+ ### Missing Fourth Knob

+ The historical script added `tri_scale` by injecting the `m=2, n=0` Fourier mode after generating the base rotating-ellipse shape.

+ That missing triangularity control is the key reason the raw 3-knob family was structurally blocked.
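A minimal sketch of that injection on raw coefficient arrays. The array layout (n = 0 modes in the center column) and the auxiliary-mode cleanup follow the recovered historical script; the `SurfaceRZFourier` round-trip through `constellaration` is deliberately omitted here:

```python
import numpy as np

def inject_triangularity(r_cos, z_sin, tri_scale):
    """Set the m=2, n=0 boundary mode to -tri_scale * minor_radius.

    r_cos / z_sin: (mpol+1, 2*ntor+1) Fourier coefficient arrays with the
    n = 0 modes in the center column, as in the recovered historical script.
    """
    r_cos = np.asarray(r_cos, dtype=float).copy()
    z_sin = np.asarray(z_sin, dtype=float).copy()
    center = r_cos.shape[1] // 2             # n = 0 column
    minor_radius = float(r_cos[1, center])   # m=1, n=0 coefficient
    # This is the knob the 3-knob family never exposed:
    r_cos[2, center] = -tri_scale * minor_radius
    # Clean auxiliary m=0 modes so the boundary stays well-conditioned for VMEC
    r_cos[0, :center] = 0.0
    z_sin[0, :center + 1] = 0.0
    return r_cos, z_sin
```

Negative `r_cos[2, center]` drives `average_triangularity` negative, which is what the `average_triangularity <= -0.5` constraint requires.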
+ ### Recovered Useful Ranges

+ The original script used substantially different useful ranges than the blocked runtime:

+ ```text
+ aspect_ratio: [3.0, 3.6]
+ elongation: [1.4, 2.2]
+ rotational_transform: [1.5, 2.2]
+ tri_scale: [0.55, 0.8]
  ```
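For reference, drawing a random candidate inside those recovered ranges is one uniform draw per knob (the dict layout here is illustrative, not the repo's schema):

```python
import random

# Recovered historical ranges (see above)
HISTORICAL_RANGES = {
    "aspect_ratio": (3.0, 3.6),
    "elongation": (1.4, 2.2),
    "rotational_transform": (1.5, 2.2),
    "tri_scale": (0.55, 0.8),
}

def sample_knobs(rng: random.Random) -> dict:
    """Uniform draw inside the recovered ranges."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in HISTORICAL_RANGES.items()}

knobs = sample_knobs(random.Random(0))
```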

+ ## 3. Harness Campaign Comparison

+ Recorded `P1` campaign runs in `ai-sci-feasible-designs` also found zero feasible candidates.

+ That failure does not disprove the repaired low-dimensional path. It mostly shows that the campaign guidance and search style diverged from the winning approach:

+ - the campaigns pushed the agent away from broad low-dimensional exploration
+ - the winning session did broad sweeps and large early moves
+ - the campaign path used richer Fourier candidates, but not the same successful cold-start behavior

+ ## 4. Recorded 4-Knob Sweep

+ A recorded 4-knob sweep using explicit triangularity injection showed that the repaired family can reach `P1` feasibility.

+ Recorded sweep family:

+ ```text
+ aspect_ratio: [3.2, 3.8]
+ elongation: [1.2, 1.8]
+ rotational_transform: [1.2, 1.8]
+ tri_scale: [0.4, 0.7]
+ n_field_periods: 3
+ mpol / ntor: 3 / 3
  ```
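The long-form notes recorded this as a 256-point sweep, which implies four evenly spaced values per knob (4^4 = 256). A sketch of that enumeration, under that assumption:

```python
from itertools import product

def grid(lo, hi, n=4):
    """n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 10) for i in range(n)]

# Four values per knob, matching the recorded 256-point sweep
configs = list(product(
    grid(3.2, 3.8),  # aspect_ratio
    grid(1.2, 1.8),  # elongation
    grid(1.2, 1.8),  # rotational_transform
    grid(0.4, 0.7),  # tri_scale
))
```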

+ What that sweep established:

+ - explicit triangularity control fixes the structural blocker
+ - repaired-family feasibility is reachable in principle
+ - repaired-family defaults still need measured calibration before they should be narrated as stable

+ ## 5. Verifier Alignment Evidence

+ The current runtime's verifier alignment is sound:

+ - the official `GeometricalProblem` API is used for feasibility and objective semantics
+ - score conversion matches the official `P1` objective direction
+ - the runtime split is boundary-based: build the boundary first, then evaluate it
+ - low-fidelity `run` and high-fidelity `submit` are treated as separate truth surfaces

+ This matters because the repair belongs in the boundary family, not in redefined verifier semantics.
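A compact restatement of the two official scalars the environment consumes, using the score conversion recorded in this repo's notes (`score = 1 - clip((max_elongation - 1) / 9, 0, 1)` and the 1% feasibility tolerance):

```python
def p1_score(max_elongation: float) -> float:
    """Score from the P1 objective; lower max_elongation is better."""
    clipped = min(max((max_elongation - 1.0) / 9.0, 0.0), 1.0)
    return 1.0 - clipped

def is_feasible(feasibility: float, tolerance: float = 0.01) -> bool:
    """feasibility = infinity-norm of normalized constraint violations."""
    return feasibility <= tolerance

# Recorded best high-fidelity submission: max_elongation = 1.266744
best = p1_score(1.266744)  # ~0.970362, feasible at feasibility = 0.009529
```

Everything else in the reward and observation design can be expressed in terms of these two numbers, which is what keeps the environment problem-agnostic.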

+ ## 6. Reward Implications

+ The repaired family changes what is possible, but it does not justify a complicated reward.

+ The main reward conclusions remain:

+ - keep the reward tied to official verifier scalars
+ - keep feasibility-first behavior
+ - do not add per-constraint or knob-specific shaping
+ - tune from playtest evidence, not from theory alone
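A sketch of the feasibility-first step reward those bullets imply. The constants mirror the recorded Reward V0 notes and should be read as tunable assumptions, not as the live `server/environment.py` implementation:

```python
def reward_v0_step(prev_feasible: bool, curr_feasible: bool,
                   prev_elongation: float, curr_elongation: float,
                   prev_feasibility: float, curr_feasibility: float) -> float:
    reward = -0.1  # per-step cost for non-submit actions
    # Feasibility transition bonus / penalty
    if curr_feasible and not prev_feasible:
        reward += 3.0
    elif prev_feasible and not curr_feasible:
        reward -= 3.0
    # Dual-track shaping: objective progress when feasible, else feasibility progress
    if prev_feasible and curr_feasible:
        reward += (prev_elongation - curr_elongation) * 10.0
    else:
        reward += (prev_feasibility - curr_feasibility) * 5.0
    return reward
```

For example, an infeasible-to-feasible transition that also reduced the violation from 0.02 to 0.0 scores -0.1 + 3.0 + 0.1 = 3.0. Note that only the two official scalars appear; no individual constraint is named.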

+ ## 7. Why The Environment Is Still Valid

+ The repaired 4-knob family is not a leaderboard-matching optimizer. That is acceptable for this repo.

+ The purpose of the environment is to:

+ - teach and evaluate constrained design behavior
+ - keep the observation/action/reward loop legible
+ - preserve an explainable path from action to verifier feedback

+ The winning high-fidelity score chase used a much richer downstream optimization pipeline. This repo does not need to reproduce that full pipeline to be a valid hackathon environment artifact.

+ ## 8. Design Implications Kept From This Analysis

+ - keep multiple frozen reset seeds rather than one memorized starting state
+ - keep the reward based on official scalars rather than hand-coded constraint bonuses
+ - keep known winners as calibration fixtures, not direct reward targets
+ - keep domain knowledge in seeds and fixtures, not in opaque reward tricks

+ ## 9. Primary References

+ Fusion Design Lab:

+ - [`server/physics.py`](../server/physics.py)
+ - [`server/environment.py`](../server/environment.py)
+ - [`fusion_lab/models.py`](../fusion_lab/models.py)
+ - [`docs/P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md)

+ Reference repo:

+ - `ai-sci-feasible-designs/docs/harness/raw-session.md`
+ - historical `scripts/search_p1_lowdim.py`
+ - `ai-sci-feasible-designs/docs/P1_SCORE_CHASE_NOTES.md`
+ - `ai-sci-feasible-designs/docs/harness/P1_CAMPAIGN_POSTMORTEM.md`
 
docs/PIVOT_P1_ROTATING_ELLIPSE.md DELETED
@@ -1,304 +0,0 @@
- # Pivot: P1 Rotating-Ellipse Environment
-
- **Date:** 2026-03-07
- **Status:** Supporting decision record, superseded as planning SSOT by `FUSION_DESIGN_LAB_PLAN_V2.md`
- **Supersedes:** Synthetic physics model in current `server/physics.py`
-
- Use this file as rationale for the pivot, not as a fresh planning queue. Once the pivot is accepted, implementation should follow the SSOT plan docs.
-
- ## Current Branch Status
-
- - [x] pivot accepted
- - [x] historical upstream 3-knob rotating-ellipse `P1` contract was implemented and evaluated
- - [x] `constellaration` verifier path is wired
- - [x] historical upstream 3-knob family is verified as blocked on P1 triangularity
- - [x] repaired low-dimensional family with explicit triangularity control is implemented
- - [ ] tracked fixtures are added
- - [ ] manual playtest evidence is recorded
- - [ ] heuristic baseline is refreshed for the real verifier path
-
- Current caution:
-
- - the upstream rotating-ellipse family remains useful as a seed generator, but the live environment action family is the repaired rotating-ellipse-derived 4-knob contract
-
- ## Decision
-
- Pivot the OpenEnv environment to use the official ConStellaration P1 benchmark with real VMEC physics, scoped to the rotating-ellipse low-dimensional parameter space.
-
- This borrows the strongest low-dimensional entry point from the proven winning approach documented in `raw-session.md`, not the full approach.
-
- ## What Was Validated
-
- | Claim | Status | Source |
- |---|---|---|
- | P1 is the cleanest benchmark task | Verified | `problems.py:113` — minimize max_elongation, 3 constraints, no QI |
- | P1 skips QI | Verified | `problems.py:145` — `_does_it_require_qi = False` |
- | Low-fidelity eval is fast enough | Measured | 0.63s per eval on local machine; postmortem says ~1s/eval |
- | High-fidelity eval is expensive | Measured | 24s per eval; only viable for final validation |
- | Rotating-ellipse can find P1-feasible designs | Verified | `raw-session.md`: sweeps found 3 feasible designs in ~20 min |
- | vmecpp installs from wheels | Verified | `uv pip install vmecpp==0.4.7` resolves cleanly, no compilation |
- | constellaration Dockerfile is not bloated | Verified | `python:3.10-slim` + `pip install constellaration` |
- | Current seed logic is too loose for P1 | Verified | `seeds.py:42`: triangularity override 0.05 vs constraint -0.5 |
- | Full harness should not be ported | Verified | Postmortem: prescriptive harness produced 0 feasible candidates |
-
- ## What Is Hypothesis (Not Yet Validated)
-
- 1. **6 actions is enough** to reach or improve P1 feasibility from a rotating-ellipse starting point. Must be validated by manual playtest immediately.
- 2. **Discretized rotating-ellipse perturbations** create non-trivial decision pressure (not too easy, not impossible).
- 3. **Low-fidelity metrics** are close enough to high-fidelity P1 scoring that the low-fi reward signal is meaningful.
- 4. **The Docker image** builds and deploys on HF Spaces within reasonable time/size limits.
-
- ## Environment Design
-
- ### Single Task
-
- Improve a stellarator boundary's P1 score using a rotating-ellipse-derived low-dimensional parameterization under the official ConStellaration P1 constraints.
-
- ### P1 Constraints (from `GeometricalProblem`)
-
- - aspect_ratio <= 4.0
- - average_triangularity <= -0.5
- - edge_rotational_transform / n_field_periods >= 0.3
-
- ### P1 Objective
-
- Minimize `max_elongation`. Score = `1 - clip((max_elongation - 1) / 9, 0, 1)`.
-
- Feasibility tolerance: normalized constraint violations <= 1% (0.01).
-
- ### Parameter Space
-
- Historical upstream seed generator:
-
- | Parameter | Role | Typical range |
- |---|---|---|
- | `aspect_ratio` | Width-to-height ratio of the boundary | 2.0 - 8.0 |
- | `elongation` | Vertical stretching of cross-section | 1.0 - 5.0 |
- | `rotational_transform` | Magnetic field line winding | 0.1 - 1.0 |
- | `n_field_periods` | Fixed at 3 (not an action) | 3 |
-
- These map to `constellaration.initial_guess.generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)`, which returns a `SurfaceRZFourier` boundary in ~4ms.
-
- Historical blocker:
-
- - on the real low-fidelity verifier path, sampled 3-knob points kept `average_triangularity` at roughly `+0.004975`
- - sampled `p1_feasibility` stayed at roughly `1.00995`
- - no sampled point was feasible
-
- Current live environment family:
-
- | Parameter | Role | Current implementation range |
- |---|---|---|
- | `aspect_ratio` | Width-to-height ratio of the repaired boundary | 3.2 - 3.8 |
- | `elongation` | Vertical stretching of cross-section | 1.2 - 1.8 |
- | `rotational_transform` | Magnetic field line winding | 1.2 - 1.9 |
- | `triangularity_scale` | Explicit triangularity control | 0.4 - 0.7 |
- | `n_field_periods` | Fixed at 3 (not an action) | 3 |
-
- These ranges describe the live implementation in `server/environment.py`. They are still subject to measured sweep and playtest refinement.
-
- ### Action Space
-
- Current action space:
-
- ```
- intent: "run" | "submit" | "restore_best"
- parameter: "aspect_ratio" | "elongation" | "rotational_transform" | "triangularity_scale"
- direction: "increase" | "decrease"
- magnitude: "small" | "medium" | "large"
- ```
-
- Current implementation deltas:
-
- | Parameter | small | medium | large |
- |---|---|---|---|
- | aspect_ratio | 0.05 | 0.10 | 0.20 |
- | elongation | 0.05 | 0.10 | 0.20 |
- | rotational_transform | 0.05 | 0.10 | 0.20 |
- | triangularity_scale | 0.02 | 0.05 | 0.10 |
-
- ### Episode Flow
-
- 1. Reset: generate the initial boundary from baseline rotating-ellipse parameters (+ optional seed perturbation). Run the low-fi forward_model. Return the initial observation.
- 2. Agent chooses an action.
- 3. If `run`: modify the parameter, regenerate the boundary, run the low-fi forward_model (~0.6s). Return diagnostics + reward.
- 4. If `restore_best`: revert to the best-known parameters, re-evaluate low-fidelity metrics, and charge a budget step.
- 5. If `submit`: end the episode. Optionally run high-fi for the final score.
- 6. Episode ends on `submit` or budget exhaustion.
-
- ### Budget
-
- 6 evaluations per episode. All non-submit actions cost 1 budget.
-
- ### Observation
-
- ```
- diagnostics_text: str           # human-readable summary
- max_elongation: float           # P1 objective (minimize)
- aspect_ratio: float             # constraint: <= 4.0
- average_triangularity: float    # constraint: <= -0.5
- edge_iota_over_nfp: float       # constraint: >= 0.3
- p1_score: float                 # current step-time score
- p1_feasibility: float           # current step-time max normalized constraint violation
- constraints_satisfied: bool     # feasibility <= 0.01
- vacuum_well: float              # stability indicator
- evaluation_fidelity: "low" | "high"
- evaluation_failed: bool
- failure_reason: str
- step_number: int
- budget_remaining: int
- best_low_fidelity_score: float
- best_low_fidelity_feasibility: float
- best_high_fidelity_score: float | None
- best_high_fidelity_feasibility: float | None
- target_spec: str
- ```
-
- Current requirement:
-
- - the observation and diagnostics text should make the low-fi vs high-fi distinction explicit
- - best-state reporting should be split explicitly between low-fidelity rollout state and high-fidelity submit state
- - do not narrate low-fi and high-fi best-state fields as one combined metric
-
- ### Reward V0
-
- Feasibility-first, then objective improvement:
-
- ```
- if constraints newly satisfied:
-     +3.0
- if constraints newly violated:
-     -3.0
-
- if feasible:
-     reward += (prev_elongation - curr_elongation) * 10.0  # improvement in objective
- else:
-     reward += (prev_feasibility - curr_feasibility) * 5.0  # progress toward feasibility
-
- per-step cost: -0.1
-
- submit bonus (if feasible and improved):
-     +5.0 * improvement_ratio + 1.0 * budget_efficiency
- submit penalty (if infeasible or no improvement):
-     -1.0
- ```
-
- This puts feasibility first. An agent that achieves feasibility and then minimizes elongation gets rewarded. An agent that never reaches feasibility gets penalized.
-
- Current execution note:
-
- - keep the reward mostly scalar and verifier-driven
- - keep parameterization repair and reward semantics separate
- - do not add mode- or constraint-specific reward hacks to compensate for a blocked action family
-
- ### State
-
- ```
- step_count: int
- current_params: {aspect_ratio, elongation, rotational_transform, triangularity_scale}
- best_params: {aspect_ratio, elongation, rotational_transform, triangularity_scale}
- initial_low_fidelity_score: float
- initial_high_fidelity_score: float | None
- best_low_fidelity_score: float
- best_low_fidelity_feasibility: float
- best_high_fidelity_score: float | None
- best_high_fidelity_feasibility: float | None
- history: list[str]
- ```
-
- ## Two Designs That Were Considered
-
- | | Rotating-ellipse env | Curated-seed Fourier-repair env |
- |---|---|---|
- | Action space | 4 parameters (AR, elongation, rotational transform, triangularity scale) | N Fourier modes |
- | Starting point | Generated from parameters | Frozen from HF dataset |
- | Interpretability | High — parameters map to physical shape | Lower — mode perturbations are abstract |
- | Dataset dependency | None at runtime | Requires offline curation |
- | Search space coverage | Low-dimensional subfamily | Full boundary space |
- | Hackathon viability | High | Medium (needs pre-work) |
-
- **Decision:** Rotating-ellipse for the hackathon. It is self-contained, human-playable, and proven as a viable entry point for P1.
-
- **What it does NOT claim:** Full coverage of the P1 boundary design space. This is a tradeoff accepted for hackathon scope.
-
- ## Implementation Order
-
- ### Phase 1: Physics Backend (~1 hour)
-
- Status: done.
-
- Rewrite `server/physics.py` to wrap:
- - `constellaration.initial_guess.generate_rotating_ellipse` for boundary generation
- - `constellaration.forward_model.forward_model` with low-fi settings for evaluation
- - `constellaration.problems.GeometricalProblem` for official P1 scoring on every evaluation
-
- ### Phase 2: Environment Contract (~1 hour)
-
- Status: done.
-
- Update `server/environment.py`:
- - New observation schema with P1 metrics
- - New action schema for rotating-ellipse perturbations
- - Reward V0 with feasibility-first logic
- - Terminal conditions
-
- Update `fusion_lab/models.py` for new schemas.
-
- ### Phase 3: Manual Playtest (~30 min)
-
- Status: open.
-
- Validate hypothesis: "6 actions is enough" on the repaired low-dimensional family.
- - Play 5-10 episodes manually
- - Log: can a human reach feasibility? Improve elongation?
- - Tune magnitude deltas if needed
255
- - Document at least one pathology or adjustment
256
-
257
- ### Phase 4: Baselines (~30 min)
258
-
259
- Status: partial. Baselines exist, but the heuristic needs refresh on the real verifier path.
260
-
261
- - Random agent
262
- - Heuristic agent (greedy toward known-good parameter region)
263
- - Comparison table
264
-
265
- ### Phase 5: Deploy + Evidence (~2 hours)
266
-
267
- Status: open.
268
-
269
- - Update Dockerfile/deps for constellaration
270
- - `openenv validate` + `openenv push`
271
- - Colab notebook connecting to live environment
272
- - 1-minute demo video
273
-
274
- This section exists to justify the pivot with an implementation path. It should not trigger another strategy pass when the same work is already covered by the SSOT plan and checklist.
275
-
276
- ## Fallback
277
-
278
- If full high-fidelity `constellaration` deployment fails (Docker build, HF Spaces issues):
279
- - Keep the low-fidelity `constellaration` run path working
280
- - Fall back to a low-fidelity-only hosted environment and document the limitation clearly
281
- - Do not spend more than 1 hour debugging deployment before falling back
282
-
283
- ## Known-Good Fixtures
284
-
285
- Start with the frozen repaired-family reset seeds in `server/contract.py` and expand only if the implementation needs more coverage:
286
-
287
- 1. **Reset seed:** aspect_ratio=3.6, elongation=1.4, rotational_transform=1.5, triangularity_scale=0.55
288
- 2. **Reset seed:** aspect_ratio=3.4, elongation=1.4, rotational_transform=1.6, triangularity_scale=0.55
289
- 3. **Reset seed:** aspect_ratio=3.8, elongation=1.4, rotational_transform=1.5, triangularity_scale=0.55
290
- 4. **Deliberately bad reference:** keep a clearly infeasible boundary only as a negative verifier/reward sanity check
291
-
292
- These are for verifier/reward sanity, not a prerequisite seed-mining project.
293
-
294
- ## What Not To Do
295
-
296
- - Do not port the full ai-sci-feasible-designs harness or governor stack.
297
- - Do not make the task "agent writes arbitrary optimization scripts."
298
- - Do not stream the full HF dataset at runtime.
299
- - Do not mix rotating-ellipse and Fourier-repair action spaces.
300
- - Do not pretend the upstream 3-knob family is enough for P1 after the verified triangularity blocker.
301
- - Do not use high-fidelity eval for interactive steps (24s is too slow).
302
- - Do not narrate "6 actions is enough" as validated until manually playtested.
303
- - Do not claim full P1 boundary space coverage. The env uses a low-dim subfamily.
304
- - Do not reopen the task-selection debate after the pivot is already accepted unless a blocker forces it.
docs/archive/FUSION_DELIVERABLES_MAP.md ADDED
@@ -0,0 +1,20 @@
+ # Fusion Design Lab Deliverables Map
+
+ **Role:** Compatibility doc kept for link stability
+ **Status:** Not an active SSOT surface
+
+ The deliverables mapping now lives in [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md):
+
+ - artifact plan
+ - compute surface roles
+ - evidence order
+ - success gates
+ - fallback rules
+
+ Use these docs instead:
+
+ - planning and execution: [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md)
+ - live technical contract: [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md)
+ - blocker and sweep evidence: [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md)
+
+ This file no longer carries branch status, execution order, or a separate planning queue.
docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md ADDED
@@ -0,0 +1,21 @@
+ # Fusion Design Lab Next 12 Hours Checklist
+
+ **Role:** Compatibility doc kept for link stability
+ **Status:** Not an active SSOT surface
+
+ The old dated checklist created a second execution-order source and went stale immediately.
+
+ Use these docs instead:
+
+ - planning and live execution order: [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md)
+ - technical task contract: [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md)
+ - evidence behind the repaired parameterization: [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md)
+
+ Current execution priority remains:
+
+ 1. measured sweep
+ 2. tracked fixtures
+ 3. manual playtest
+ 4. heuristic baseline refresh
+ 5. HF Space proof
+ 6. notebook, demo, and repo polish
docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md ADDED
@@ -0,0 +1,27 @@
+ # Pivot: P1 Rotating-Ellipse Environment
+
+ **Date:** March 7, 2026
+ **Role:** Short decision record
+ **Status:** Not an active SSOT surface
+
+ ## Decision
+
+ Pivot the environment to the official `P1` benchmark with real `constellaration` physics and a repaired low-dimensional boundary family derived from a rotating-ellipse seed.
+
+ ## Why
+
+ - the historical upstream 3-knob family could not satisfy the `P1` triangularity requirement
+ - the repaired 4-knob family restored a meaningful low-dimensional control surface
+ - the hackathon artifact needs a narrow, legible, human-playable environment rather than a transplanted optimization harness
+
+ ## What This Does Not Mean
+
+ - it does not make the low-dimensional family the full `P1` design space
+ - it does not justify porting the old `ai-sci-feasible-designs` harness
+ - it does not settle repaired-family ranges, deltas, or budget choices without measurement
+
+ ## Where The Live Truth Now Lives
+
+ - planning and execution: [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md)
+ - technical contract: [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md)
+ - blocker and sweep evidence: [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md)
docs/archive/README.md ADDED
@@ -0,0 +1,15 @@
+ # Archived Docs
+
+ This directory keeps legacy planning docs that are no longer active SSOT surfaces.
+
+ Current live docs:
+
+ - [`../FUSION_DESIGN_LAB_PLAN_V2.md`](../FUSION_DESIGN_LAB_PLAN_V2.md) for planning and execution order
+ - [`../P1_ENV_CONTRACT_V1.md`](../P1_ENV_CONTRACT_V1.md) for the live technical contract
+ - [`../P1_PARAMETERIZATION_DEEPDIVE.md`](../P1_PARAMETERIZATION_DEEPDIVE.md) for blocker evidence and supporting rationale
+
+ Archived here:
+
+ - `FUSION_DELIVERABLES_MAP.md`
+ - `FUSION_NEXT_12_HOURS_CHECKLIST.md`
+ - `PIVOT_P1_ROTATING_ELLIPSE.md`
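
Review note: the Reward V0 pseudocode in the removed planning section can be read as the per-step rule below. This is a hedged sketch only — the function name `step_reward`, its signature, and the interpretation of `feasibility` as a distance that should shrink are assumptions; the authoritative reward logic lives in `server/environment.py`, and the submit bonus/penalty is handled separately there.

```python
def step_reward(
    prev_feasible: bool,
    curr_feasible: bool,
    prev_elongation: float,
    curr_elongation: float,
    prev_feasibility: float,
    curr_feasibility: float,
) -> float:
    """Hypothetical per-step Reward V0 as described in the plan's pseudocode."""
    reward = -0.1  # per-step cost
    if curr_feasible and not prev_feasible:
        reward += 3.0  # constraints newly satisfied
    elif prev_feasible and not curr_feasible:
        reward -= 3.0  # constraints newly violated
    if curr_feasible:
        # improvement in the P1 objective (lower elongation is better)
        reward += (prev_elongation - curr_elongation) * 10.0
    else:
        # progress toward feasibility (shrinking feasibility distance is better)
        reward += (prev_feasibility - curr_feasibility) * 5.0
    return reward
```

Under this reading, a step that newly reaches feasibility while shaving elongation from 2.0 to 1.9 scores roughly +3.9, while a step that breaks feasibility is penalized even if elongation looked better — matching the plan's "feasibility first" intent.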