CreativeEngineer committed
Commit 2d47f4f
1 Parent(s): acb992c

docs: refine p1 parameterization plan

AGENTS.md CHANGED
@@ -26,6 +26,8 @@ Use these docs as the planning SSOT:
 
 `docs/PIVOT_P1_ROTATING_ELLIPSE.md` is a supporting decision record, not a planning SSOT. If it disagrees with the three docs above, the three SSOT docs win.
 
+`docs/P1_ENV_CONTRACT_V1.md` is a supporting technical spec for the current implementation phase. It should refine the SSOT docs, not silently diverge from them.
+
 If code and docs disagree, either:
 
 1. update code to match the docs, or
README.md CHANGED
@@ -41,6 +41,8 @@ Implementation status:
 - [ ] Add a custom low-dimensional boundary builder with an explicit triangularity control knob
 - [ ] Split boundary construction from boundary evaluation in `server/physics.py`
 - [ ] Update the action contract from 3 knobs to the repaired low-dimensional family
+- [ ] Add explicit VMEC failure semantics to the environment contract
+- [ ] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
 - [ ] Add tracked `P1` fixtures under `server/data/p1/`
 - [ ] Run manual playtesting and record the first reward pathology
 - [ ] Refresh the heuristic baseline for the real verifier path
@@ -50,7 +52,9 @@ Implementation status:
 
 - The current 3-knob family is structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That means reward tuning is secondary until the parameterization is repaired.
 - `BASELINE_PARAMS` is not a near-feasible anchor on the real verifier path. The current low-fidelity measurement is roughly `p1_feasibility=1.01`, `average_triangularity=+0.005`, and `edge_iota_over_nfp=0.059`, so fixture discovery has to happen after parameterization repair, not before.
+- The repaired low-dimensional family still needs measured ranges and deltas. Do not narrate guessed `rotational_transform` bounds, `triangularity_scale` deltas, or a larger budget as validated facts until they are measured on the repaired environment.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
+- The environment still needs explicit VMEC failure semantics. Failed evaluations should cost budget, produce a visible failure observation, and apply a documented penalty; they should not be silently swallowed.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
 
@@ -112,13 +116,15 @@ uv sync --extra notebooks
 
 1. Repair the low-dimensional boundary parameterization so it can actually move P1 triangularity.
 2. Split boundary construction from boundary evaluation in `server/physics.py`.
-3. Update the environment contract to the repaired low-dimensional family and label low-fi vs high-fi truth clearly in observations.
-4. Add tracked `P1` fixtures under `server/data/p1`.
-5. Run manual playtest episodes and record the first real reward pathology, if any.
-6. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
-7. Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
-8. Deploy the environment to HF Space.
-9. Add the Colab notebook under `training/notebooks`.
+3. Add explicit VMEC failure semantics to the environment loop.
+4. Update the environment contract to the repaired low-dimensional family and label low-fi vs high-fi truth clearly in observations.
+5. Run a small measured sweep on the repaired family to choose useful ranges, deltas, and reset seeds.
+6. Add tracked `P1` fixtures under `server/data/p1`.
+7. Run manual playtest episodes and record the first real reward pathology, if any.
+8. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
+9. Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
+10. Deploy the environment to HF Space.
+11. Add the Colab notebook under `training/notebooks`.
 
 These are implementation steps, not another planning phase.
 
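The low-fi `run` vs high-fi `submit` distinction above can be enforced structurally instead of by prose alone. A minimal sketch of that idea, assuming hypothetical names (`MetricFidelity`, `LabeledMetrics` are illustrations, not the repo's actual models):

```python
from dataclasses import dataclass
from enum import Enum


class MetricFidelity(str, Enum):
    """Provenance tag attached to every metric bundle the environment reports."""
    LOW_FI_RUN = "low_fi_run"          # cheap step-time estimate from `run`
    HIGH_FI_SUBMIT = "high_fi_submit"  # re-evaluated truth from `submit`


@dataclass(frozen=True)
class LabeledMetrics:
    """Metrics that cannot be read without their fidelity label."""
    fidelity: MetricFidelity
    p1_feasibility: float
    average_triangularity: float

    def is_final(self) -> bool:
        # Only high-fidelity submit-time metrics may be presented as final.
        return self.fidelity is MetricFidelity.HIGH_FI_SUBMIT
```

With a tag like this, "do not present step-time metrics as final submission metrics" becomes a check rather than a convention.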
TODO.md CHANGED
@@ -7,6 +7,7 @@ Use this file for day-of build progress. Use the linked docs for rationale, sequ
 - [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
 - [Deliverables Map](docs/FUSION_DELIVERABLES_MAP.md)
 - [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
+- [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
 - [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
 - [Repo Guardrails](AGENTS.md)
 
@@ -26,6 +27,12 @@ Priority source:
 - [x] repo docs call out the low-fi/high-fi `constellaration` split honestly
 - [x] post-terminal guard in `step()`
 - [x] `constellaration` verifier wiring
+- [x] verify the current 3-knob family against the real low-fidelity verifier
+- [ ] repair the low-dimensional parameterization so triangularity is controllable
+- [ ] split boundary building from boundary evaluation
+- [ ] update the action schema from 3 knobs to the repaired low-dimensional family
+- [ ] add explicit VMEC failure semantics
+- [ ] label low-fi vs high-fi truth in the observation/task surface
 - [ ] tracked `P1` fixtures
 - [ ] manual playtest log
 - [x] settle the non-submit terminal reward policy
@@ -39,7 +46,8 @@ flowchart TD
 A["Northflank Smoke Test"] --> E["Fixture Checks"]
 B["P1 Contract Lock"] --> D["P1 Models + Environment"]
 C["constellaration Physics Wiring"] --> D
-D --> E["Fixture Checks"]
+D --> P["Parameterization Repair"]
+P --> E["Fixture Checks"]
 E --> F["Manual Playtest"]
 F --> G["Reward V1"]
 G --> H["Baselines"]
@@ -57,12 +65,19 @@ flowchart TD
 [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
 [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md)
 
-- [ ] Pass the Northflank smoke test
+- [x] Pass the Northflank smoke test
 Related:
 [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
 [Next 12 Hours Checklist](docs/FUSION_NEXT_12_HOURS_CHECKLIST.md),
 [training/notebooks/README.md](training/notebooks/README.md)
 
+- [x] Verify that the current 3-knob family can or cannot approach P1 feasibility
+Goal:
+decide whether parameterization repair is a blocker before more reward work
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md),
+[P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
+
 ## Fresh Wiring
 
 - [x] Rewrite the shared models to the locked `P1` contract
@@ -93,14 +108,58 @@ flowchart TD
 [server/app.py](server/app.py),
 [README.md](README.md)
 
+- [ ] Repair the low-dimensional boundary family
+Goal:
+add an explicit triangularity control knob or equivalent low-dimensional control so the environment can actually approach P1 feasibility
+Files:
+[server/physics.py](server/physics.py),
+[fusion_lab/models.py](fusion_lab/models.py),
+[server/environment.py](server/environment.py),
+[server/app.py](server/app.py)
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
+- [ ] Split boundary construction from boundary evaluation
+Goal:
+make the verifier boundary-based and keep parameterization-specific logic in the environment adapter layer
+Files:
+[server/physics.py](server/physics.py)
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
+- [ ] Add explicit VMEC failure semantics
+Goal:
+failed evaluations must cost budget, return a visible failure observation, and apply a documented penalty without silent fallback
+Files:
+[server/physics.py](server/physics.py),
+[server/environment.py](server/environment.py)
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
+- [ ] Label low-fi vs high-fi truth in the observation/task surface
+Goal:
+make it obvious whether a metric came from a low-fidelity `run` step or a high-fidelity `submit`
+Files:
+[fusion_lab/models.py](fusion_lab/models.py),
+[server/environment.py](server/environment.py),
+[server/app.py](server/app.py)
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
 ## Validation and Reward
 
+- [ ] Run a small measured sweep on the repaired low-dimensional family
+Goal:
+choose useful parameter ranges, step deltas, and reset seeds from the repaired action family instead of guessing them from prose
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
 - [ ] Add 1-2 tracked `P1` fixtures
 Files:
 [server/data/p1/README.md](server/data/p1/README.md),
 [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
 Note:
-the default baseline params are not near-feasible on the real verifier path, so they are not enough for the fixture set by themselves
+add fixtures only after the parameterization repair produces a meaningful near-boundary region
 
 - [ ] Run fixture sanity checks
 Goal:
@@ -188,6 +247,8 @@ flowchart TD
 ## Guardrails
 
 - [ ] Do not reopen `P1 + rotating-ellipse` strategy without a real blocker
+- [ ] Do not pretend the current 3-knob family is sufficient for P1 after the verified triangularity blocker
+- [ ] Do not guess repaired-family ranges, deltas, or budget changes without measurement
 - [ ] Do not port the old `ai-sci-feasible-designs` harness
 - [ ] Do not let notebook or demo work outrun environment evidence
 - [ ] Do not add training-first complexity before manual playtesting
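The VMEC failure-semantics task above names three requirements: failures cost budget, stay visible, and carry a documented penalty. A minimal sketch of a step handler satisfying them, assuming hypothetical names and an illustrative penalty value (nothing here is the repo's actual API):

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Documented failure penalty; the value is an assumption for illustration.
FAILURE_PENALTY = -0.1


@dataclass
class StepResult:
    metrics: Optional[dict]      # None when the evaluation failed
    evaluation_failed: bool      # visible failure flag in the observation
    reward: float
    budget_remaining: int


def run_step(evaluate: Callable[[dict], dict],
             params: dict,
             budget_remaining: int) -> StepResult:
    """One `run` action: a failed evaluation still consumes budget, is
    surfaced in the observation, and is penalized rather than being
    silently replaced with a fake success path."""
    budget_remaining -= 1  # failure or not, the attempt costs budget
    try:
        metrics = evaluate(params)
    except RuntimeError:  # stand-in for a VMEC / forward-model failure
        return StepResult(None, True, FAILURE_PENALTY, budget_remaining)
    return StepResult(metrics, False, 0.0, budget_remaining)
```

The key design point is that the failure branch returns through the same `StepResult` type as success, so the agent always sees the budget decrement and the failure flag.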
baselines/README.md CHANGED
@@ -6,6 +6,9 @@ Random and heuristic baselines will live here.
 - [x] heuristic baseline exists
 - [x] baseline comparison script exists
 - [x] baseline comparison rerun completed on the real verifier path
+- [x] verified that the current 3-knob family is blocked on P1 triangularity under the real verifier path
+- [ ] repair the low-dimensional parameterization before further heuristic work
+- [ ] wait for measured repaired-family ranges and reset seeds before retuning the heuristic
 - [ ] heuristic refreshed after the real-verifier rerun
 - [ ] near-boundary fixture-backed baseline start chosen for manual playtesting
 - [ ] presentation-ready comparison trace exported
docs/FUSION_DELIVERABLES_MAP.md CHANGED
@@ -13,6 +13,10 @@ Use this map to sequence execution, not to reopen already-locked task choices.
 - [x] baseline comparison has been rerun on the real verifier path
 - [x] Northflank smoke workflow and note are committed
 - [x] Northflank smoke test has passed on the team H100
+- [x] current 3-knob family has been verified as blocked on P1 triangularity
+- [ ] repaired low-dimensional boundary builder is implemented
+- [ ] explicit VMEC failure semantics are implemented
+- [ ] low-fi `run` truth vs high-fi `submit` truth is labeled clearly
 - [ ] tracked fixtures are checked in
 - [ ] manual playtest evidence exists
 - [ ] heuristic baseline has been refreshed for the real verifier path
@@ -53,6 +57,8 @@ flowchart TD
 
 B0 --> F["Observation + action schema frozen"]
 B3 --> G["Fresh P1 verifier loop proven"]
+G --> G1["Parameterization can actually reach P1 feasibility"]
+G --> G2["VMEC failures are explicit and penalized"]
 B2 --> H["Exploit observed -> penalty added"]
 B4 --> I0["Deterministic action schema"]
 D2 --> I["Human can act coherently in env"]
@@ -104,10 +110,13 @@ flowchart LR
 
 Northflank compute bring-up and smoke validation are complete.
 
-1. Add tracked fixtures and run fixture sanity checks.
-2. Manual-playtest the environment and record the first real pathology, if any.
-3. Refresh the heuristic baseline from that evidence.
-4. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
-5. Use the notebook to show traces and comparisons; include training only if it adds signal.
-6. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
-7. Polish the repo only after the artifacts are real.
+1. Repair the low-dimensional parameterization so triangularity is controllable under the official verifier.
+2. Add explicit VMEC failure semantics and clear low-fi vs high-fi observation labeling.
+3. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
+4. Add tracked fixtures and run fixture sanity checks.
+5. Manual-playtest the environment and record the first real pathology, if any.
+6. Refresh the heuristic baseline from that evidence.
+7. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
+8. Use the notebook to show traces and comparisons; include training only if it adds signal.
+9. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
+10. Polish the repo only after the artifacts are real.
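The "small measured sweep" step above can be sketched concretely: grid-sample the repaired family, evaluate each point, and summarize so ranges and deltas come from data rather than prose. All names are hypothetical, and the sketch assumes feasibility is a residual where values at or below zero count as feasible:

```python
import itertools
import statistics
from typing import Callable


def measured_sweep(evaluate: Callable[[dict], dict],
                   grid: dict) -> tuple:
    """Grid-sweep candidate parameters and summarize feasibility.

    `grid` maps parameter names to candidate values; `evaluate` is a
    stand-in for the low-fidelity verifier call.
    """
    names = list(grid)
    rows = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        rows.append((params, evaluate(params)))
    feas = [metrics["p1_feasibility"] for _, metrics in rows]
    summary = {
        "n_samples": len(rows),
        "n_feasible": sum(f <= 0.0 for f in feas),  # assumed threshold
        "feasibility_min": min(feas),
        "feasibility_median": statistics.median(feas),
    }
    return rows, summary
```

A summary like this is what should back the choice of ranges, deltas, and reset seeds, not narrated guesses.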
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -16,6 +16,8 @@
 - [x] Northflank smoke test has passed on the team H100
 - [x] current 3-knob family has been checked against the real low-fidelity verifier
 - [ ] parameterization repair is implemented so triangularity is controllable
+- [ ] explicit VMEC failure semantics are implemented
+- [ ] low-fi `run` truth vs high-fi `submit` truth is labeled clearly in the environment surface
 - [ ] tracked `P1` fixtures are added
 - [ ] manual playtest evidence is recorded
 - [ ] heuristic baseline is refreshed for the real verifier path
@@ -269,6 +271,13 @@ This is not trying to expose the full Fourier-boundary space. The goal is a legi
 - `submit`
 - exhausted budget
 
+Failure semantics must also be explicit:
+
+- if VMEC or the forward model fails, the run still consumes budget
+- the observation must expose that the step failed
+- the reward must apply a documented penalty
+- the environment must not silently replace the failed result with a fake success path
+
 ### Terminal Contract
 
 The episode should end cleanly and deterministically.
@@ -305,6 +314,10 @@ Implementation split:
 
 The verifier should be boundary-based. Parameterization-specific logic should not be treated as verifier truth.
 
+Current execution rule:
+
+- do not narrate guessed repaired-family ranges, deltas, or a larger budget as settled defaults until they are measured on the repaired family
+
 ## 11. Reward V0
 
 The reward in this document is not the final reward. It is `Reward V0`.
@@ -330,6 +343,7 @@ Current execution note:
 - do not tune reward further until the repaired low-dimensional family can actually approach P1 feasibility
 - once parameterization is repaired, keep `Reward V0` scalar and feasibility-first
 - clearly distinguish low-fidelity step-time metrics from high-fidelity submit-time truth in the observation contract and docs
+- do not use reward complexity to compensate for missing action expressivity or missing VMEC failure semantics
 
 ### Reward V0 Failure Modes To Test
 
@@ -377,6 +391,8 @@ These are still hypotheses until manually or empirically checked:
 - `restore_best` is useful without becoming an exploit
 - heuristic should beat random on mean episode reward
 - low-fidelity interaction is predictive enough for useful policy learning
+- useful repaired-family parameter ranges and deltas
+- whether the current budget should stay at `6` or change after playtesting
 
 These should not be narrated as facts in the final demo until validated.
 
@@ -517,6 +533,8 @@ The repo should make the environment easy to understand:
 - observation schema frozen
 - action schema frozen
 - terminal conditions frozen
+- explicit VMEC failure semantics defined
+- low-fi vs high-fi metric labeling defined
 
 ### Gate 2: Verifier Wiring Pass
 
@@ -575,21 +593,23 @@ Deliverables:
 
 ### Phase 1
 
-Wire the official verifier and run fixture checks.
+Repair the low-dimensional parameterization, wire the verifier split cleanly, and run a small measured sweep before fixture checks.
 
 Deliverables:
 
-- one good fixture
-- near-boundary fixtures
-- bad fixtures
-- confidence that reward/verifier ordering is sane
+- repaired low-dimensional boundary builder
+- boundary-based verifier split
+- explicit VMEC failure semantics
+- measured parameter ranges, deltas, and candidate reset seeds
 
 ### Phase 2
 
-Manual-playtest the environment.
+Freeze initial fixtures and manual-playtest the environment.
 
 Deliverables:
 
+- one good or near-boundary fixture
+- bad fixtures
 - 5 to 10 episode logs
 - notes on leverage, ambiguity, and pathologies
 
@@ -688,7 +708,7 @@ Instead:
 - simplify the initial states
 - tighten the action set
 - reduce magnitude choices
-- keep the environment more learnable within the fixed budget
+- keep the environment more learnable before changing the budget
 
 ### If the task is too easy
 
@@ -696,6 +716,7 @@ Do not add more domains.
 
 Instead:
 
+- first verify that parameterization repair and reset seeds did not make the task trivial
 - adjust budget
 - adjust magnitudes
 - adjust reward to discourage trivial submission
@@ -728,10 +749,11 @@ That last line is intentionally conservative. It is strong enough without claimi
 
 ## 21. Immediate Next Actions
 
-1. Freeze the `P1` environment contract in code and docs.
-2. Implement fresh verifier wiring in this repo.
-3. Run fixture checks before heavy training work.
-4. Run manual playtests before heavy training work.
-5. Mark the current reward as `V0`.
-6. Log the first real pathology and reward revision.
-7. Do not let notebook or video work outrun the environment evidence.
+1. Repair the low-dimensional boundary parameterization so triangularity is controllable.
+2. Split boundary construction from official boundary evaluation.
+3. Add explicit VMEC failure semantics and clear low-fi vs high-fi labeling.
+4. Run a small measured sweep before locking ranges, deltas, or budget changes.
+5. Freeze fixtures and run manual playtests before heavy training work.
+6. Mark the current reward as `V0`.
+7. Log the first real pathology and reward revision.
+8. Do not let notebook or video work outrun the environment evidence.
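The rule above that the verifier should be boundary-based, with parameterization-specific logic kept in the adapter layer, can be sketched as two functions with a single composition point. Everything here is illustrative: the knob names (`ellipticity`, `triangularity_scale`) and the toy boundary mapping are assumptions, not the repo's actual physics:

```python
def build_boundary(knobs: dict) -> dict:
    """Adapter layer: map low-dimensional knobs to a boundary description.
    Parameterization-specific logic lives here, never in the verifier."""
    return {
        "r_major": 1.0,  # fixed for this toy family
        "ellipticity": knobs["ellipticity"],
        # Toy coupling: triangularity grows with both knobs.
        "triangularity": knobs["triangularity_scale"] * knobs["ellipticity"],
    }


def evaluate_boundary(boundary: dict) -> dict:
    """Verifier side: consumes only the boundary, never the knobs."""
    return {"average_triangularity": boundary["triangularity"]}


def evaluate_knobs(knobs: dict) -> dict:
    # The only composition point: build the boundary, then evaluate it.
    return evaluate_boundary(build_boundary(knobs))
```

Because `evaluate_boundary` never sees the knobs, swapping in a repaired parameterization only changes `build_boundary`, and verifier truth stays boundary-based.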
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED
@@ -9,19 +9,23 @@ Do not expand scope beyond one stable task. Training is supporting evidence, not
9
  ## Current Branch Status
10
 
11
  - [x] `P1` task is locked
12
- - [x] rotating-ellipse `P1` contract is implemented in the working tree
13
  - [x] baselines and API surface have been moved to the `P1` contract
14
  - [x] add a post-terminal guard in `step()`
15
  - [x] replace the synthetic evaluator with `constellaration`
16
  - [x] re-run baselines on the real verifier path
17
  - [x] commit the Northflank smoke workflow and note
18
  - [x] pass the Northflank smoke test on the team H100
 
 
 
 
19
  - [ ] add tracked fixtures and manual playtest evidence
20
  - [ ] refresh the heuristic baseline after the real-verifier rerun
21
 
22
  Current caution:
23
 
24
- - do not assume the default baseline params are a near-feasible playtest start; on the current real verifier path they are still deeply infeasible, so fixture discovery comes first
25
 
26
  ## Plan V2 Inheritance
27
 
@@ -73,9 +77,10 @@ Artifacts:
73
  - reward shaping
74
  - manual playtesting
75
  5. Mark open assumptions explicitly:
76
- - whether the rotating-ellipse action set is expressive enough
77
  - whether the fixed step budget is enough
78
  - whether `restore_best` is useful without becoming an exploit
 
79
 
80
  Exit condition: a human can read the spec and understand how to act in the environment.
81
 
@@ -90,34 +95,41 @@ Transition rule:
90
 
91
  ## Hour 2-4: Verify Wiring, Then Manual Playtest
92
 
93
- 1. Run fixture checks:
 
 
 
 
94
  - known-good or near-winning design
95
  - near-boundary designs
96
  - clearly bad designs
97
- - do not rely on the default baseline params as the only starting point
98
- 2. Confirm:
99
  - verifier outputs are sane
100
  - reward ordering is sane
101
  - objective direction is correct
102
- 3. Manually play 5 to 10 episodes.
103
- 4. Log for each step:
104
  - observation
105
  - chosen action
106
  - expected effect
107
  - returned reward
108
  - confusion or exploit if observed
109
- 5. Identify at least one bad incentive or exploit.
110
- 6. Patch reward or penalty logic immediately.
111
- 7. Write the reward shaping story:
112
  - initial reward V0
113
  - bad behavior
114
  - refinement to reward V1
115
  - improved behavior
116
- 8. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
117
 
118
  Exit condition: you can explain why the environment now rewards the intended behavior.
119
 
120
  Artifacts:
 
 
 
121
  - fixture check note
122
  - manual playtest log
123
  - reward shaping note
@@ -205,16 +217,17 @@ Artifacts:
205
  ## Artifact Order
206
 
207
  1. Environment spec
208
- 2. Fixture check note
209
- 3. Manual playtest log
210
- 4. Reward revision note
211
- 5. Stable task run
212
- 6. Random baseline
213
- 7. Heuristic baseline
214
- 8. Northflank traces or training evidence
215
- 9. Colab training or eval evidence
216
- 10. Demo recording
217
- 11. Repo polish
 
218
 
219
  ## Non-Negotiables
220
 
@@ -223,6 +236,7 @@ Artifacts:
223
  - Do not optimize training before manual playtesting.
224
  - Do not rely on reward curves alone; keep trajectory evidence.
225
  - Do not narrate hypotheses as facts before they are checked.
 
226
  - Do not polish the repo or video before the environment and baselines are real.
227
  - Treat judge comments as pressure toward clarity and reproducibility, not broader unsupported claims.
228
  - Do not force a training-centric story if the strongest evidence is environment quality plus baselines.
 
## Current Branch Status

- [x] `P1` task is locked
+ - [x] 3-knob rotating-ellipse `P1` contract is implemented in the working tree
- [x] baselines and API surface have been moved to the `P1` contract
- [x] add a post-terminal guard in `step()`
- [x] replace the synthetic evaluator with `constellaration`
- [x] re-run baselines on the real verifier path
- [x] commit the Northflank smoke workflow and note
- [x] pass the Northflank smoke test on the team H100
+ - [x] verify that the current 3-knob family is blocked on P1 triangularity under the real verifier path
+ - [ ] repair the low-dimensional parameterization
+ - [ ] add explicit VMEC failure semantics
+ - [ ] label low-fi `run` truth vs high-fi `submit` truth in the task surface
- [ ] add tracked fixtures and manual playtest evidence
- [ ] refresh the heuristic baseline after the real-verifier rerun

Current caution:

+ - do not assume the current 3-knob family is a viable playtest start; parameterization repair comes before fixture discovery, manual playtesting, and heuristic refresh
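The "label low-fi `run` truth vs high-fi `submit` truth" item above could be made concrete with a small tagged-metrics type, so the two fidelities can never be confused downstream. This is a sketch only, under assumptions: `ScoredMetrics`, `label_metrics`, and the fidelity labels are hypothetical names, not the actual schema in `fusion_lab/models.py`.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical labels; the repo may choose different names.
Fidelity = Literal["low_fi_run", "high_fi_submit"]

@dataclass(frozen=True)
class ScoredMetrics:
    fidelity: Fidelity              # which truth produced these numbers
    p1_feasibility: float
    average_triangularity: float

def label_metrics(raw: dict, *, high_fidelity: bool) -> ScoredMetrics:
    """Tag every metrics payload with its fidelity so low-fi `run` numbers
    can never be silently presented as high-fi `submit` truth."""
    return ScoredMetrics(
        fidelity="high_fi_submit" if high_fidelity else "low_fi_run",
        p1_feasibility=raw["p1_feasibility"],
        average_triangularity=raw["average_triangularity"],
    )
```

Making the label part of the value, rather than a side note in docs, is what keeps the task surface honest when both paths report the same metric names.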
  ## Plan V2 Inheritance

- reward shaping
- manual playtesting
5. Mark open assumptions explicitly:
+ - whether the repaired low-dimensional action set is expressive enough
- whether the fixed step budget is enough
- whether `restore_best` is useful without becoming an exploit
+ - whether repaired-family ranges and deltas need adjustment after measurement

Exit condition: a human can read the spec and understand how to act in the environment.
 

## Hour 2-4: Verify Wiring, Then Manual Playtest

+ 1. Repair the low-dimensional parameterization so triangularity is controllable.
+ 2. Add explicit VMEC failure semantics and visible failure observations.
+ 3. Label low-fi `run` truth vs high-fi `submit` truth clearly.
+ 4. Run a small measured sweep on the repaired family before freezing defaults.
+ 5. Run fixture checks:
- known-good or near-winning design
- near-boundary designs
- clearly bad designs
+ - do not rely on the current default baseline params as the only starting point
+ 6. Confirm:
- verifier outputs are sane
- reward ordering is sane
- objective direction is correct
+ 7. Manually play 5 to 10 episodes.
+ 8. Log for each step:
- observation
- chosen action
- expected effect
- returned reward
- confusion or exploit if observed
+ 9. Identify at least one bad incentive or exploit.
+ 10. Patch reward or penalty logic immediately.
+ 11. Write the reward shaping story:
- initial reward V0
- bad behavior
- refinement to reward V1
- improved behavior
+ 12. If no real pathology appears, record that `Reward V0` survived playtesting and move on.

Exit condition: you can explain why the environment now rewards the intended behavior.

Artifacts:
+ - repaired low-dimensional boundary plan
+ - explicit failure semantics note
+ - measured range and delta note
- fixture check note
- manual playtest log
- reward shaping note
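Step 2 above ("add explicit VMEC failure semantics and visible failure observations") could be sketched as a wrapper that converts evaluator crashes and silent NaNs into a structured, finite-scored result. All names here (`EvalResult`, `evaluate_boundary`, the `-1.0` floor) are illustrative assumptions, not the repo's actual contract in `server/environment.py`.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FAILURE_SCORE = -1.0  # explicit finite floor instead of NaN or an unhandled crash

@dataclass
class EvalResult:
    ok: bool                          # did the forward model produce a usable score?
    score: float                      # objective value, or the penalty floor on failure
    failure_reason: Optional[str] = None

def evaluate_boundary(params: dict,
                      run_forward_model: Callable[[dict], float]) -> EvalResult:
    """Make VMEC/forward-model failure a first-class observation, not a crash."""
    try:
        score = run_forward_model(params)
    except Exception as exc:          # non-convergence, bad geometry, solver errors
        return EvalResult(ok=False, score=FAILURE_SCORE,
                          failure_reason=type(exc).__name__)
    if score != score:                # NaN guard: a silent NaN is also a failure
        return EvalResult(ok=False, score=FAILURE_SCORE, failure_reason="nan_score")
    return EvalResult(ok=True, score=score)
```

The agent then sees `ok` and `failure_reason` in its observation, so a crashed evaluation is a legible event rather than a stalled episode.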
 
## Artifact Order

1. Environment spec
+ 2. Repaired parameterization note
+ 3. Fixture check note
+ 4. Manual playtest log
+ 5. Reward revision note
+ 6. Stable task run
+ 7. Random baseline
+ 8. Heuristic baseline
+ 9. Northflank traces or training evidence
+ 10. Colab training or eval evidence
+ 11. Demo recording
+ 12. Repo polish

## Non-Negotiables

- Do not optimize training before manual playtesting.
- Do not rely on reward curves alone; keep trajectory evidence.
- Do not narrate hypotheses as facts before they are checked.
+ - Do not guess repaired-family ranges, deltas, or budget changes without a measured sweep.
- Do not polish the repo or video before the environment and baselines are real.
- Treat judge comments as pressure toward clarity and reproducibility, not broader unsupported claims.
- Do not force a training-centric story if the strongest evidence is environment quality plus baselines.
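The fixture step in Hour 2-4 ("reward ordering is sane") can be pinned down with one assertion across the three fixture classes. `reward_fn` and the fixture labels are hypothetical placeholders for whatever the tracked `P1` fixtures end up being.

```python
def check_reward_ordering(reward_fn, fixtures):
    """Assert the intended ordering over fixtures:
    known-good > near-boundary > clearly-bad.
    `fixtures` maps a label to boundary params; `reward_fn` scores them."""
    scores = {name: reward_fn(params) for name, params in fixtures.items()}
    assert scores["known_good"] > scores["near_boundary"] > scores["clearly_bad"], scores
    return scores
```

Run against real fixtures, a failed assertion is direct evidence that the reward or objective direction is wired backwards, before any training happens.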
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -72,6 +72,7 @@ The verifier layer should own:
- official `P1` feasibility semantics
- official `P1` objective direction
- score ordering

The verifier layer should not own:

@@ -91,6 +92,12 @@ Target controllable knobs:
- `rotational_transform`
- `triangularity_scale`

Important naming rule:

- once triangularity is injected explicitly, stop describing the family as plain upstream “rotating ellipse”

@@ -168,6 +175,8 @@ Do not add:
- bonuses for matching a known winner
- hand-coded constraint tricks to hide a blocked action family

## Reset Strategy

Start with frozen exact seeds, not jitter.

@@ -209,9 +218,11 @@ before tuning reward further
3. Update the action and state schema in [fusion_lab/models.py](../fusion_lab/models.py).
4. Update the episode loop and observation labeling in [server/environment.py](../server/environment.py).
5. Update the task summary in [server/app.py](../server/app.py).
- 6. Freeze 1-2 repaired low-dimensional fixtures.
- 7. Run manual playtesting.
- 8. Refresh the heuristic baseline only after that evidence exists.

## Out of Scope
- official `P1` feasibility semantics
- official `P1` objective direction
- score ordering
+ - explicit failure results when VMEC or forward-model evaluation fails

The verifier layer should not own:

- `rotational_transform`
- `triangularity_scale`

+ Current measurement rule:
+
+ - do not lock exact repaired-family ranges or deltas from prose alone
+ - measure them on the repaired boundary family before presenting them as defaults
+ - especially treat `rotational_transform` bounds, `triangularity_scale` deltas, and budget changes as open until measured
+
Important naming rule:

- once triangularity is injected explicitly, stop describing the family as plain upstream “rotating ellipse”
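To make an explicit `triangularity_scale` knob concrete: a Miller-style phase shift is one standard way to inject triangularity into an elliptical cross-section. This is a 2-D cross-section sketch only, under assumptions; the function names are illustrative, and the repo's actual boundary family (and `constellaration`'s Fourier-mode convention) may parameterize triangularity differently.

```python
import numpy as np

def shaped_cross_section(theta, r0=1.0, a=0.3, kappa=1.4, triangularity_scale=0.0):
    """Elliptical cross-section with an explicit triangularity knob: the
    sin(theta) phase shift skews the outline into a D-shape."""
    r = r0 + a * np.cos(theta + triangularity_scale * np.sin(theta))
    z = kappa * a * np.sin(theta)
    return r, z

def shape_triangularity(r, z):
    """Geometric triangularity: horizontal offset of the top of the section
    from the midpoint, normalized by the minor radius."""
    r_mid = 0.5 * (r.max() + r.min())
    a_eff = 0.5 * (r.max() - r.min())
    return (r_mid - r[np.argmax(z)]) / a_eff
```

With `triangularity_scale=0` the measured shape triangularity stays at zero, and it rises as the knob opens; that direct response is exactly the controllability the blocked 3-knob family lacked.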
 
- bonuses for matching a known winner
- hand-coded constraint tricks to hide a blocked action family

+ Do not use reward complexity to compensate for missing action expressivity or missing crash semantics.
+
## Reset Strategy

Start with frozen exact seeds, not jitter.

3. Update the action and state schema in [fusion_lab/models.py](../fusion_lab/models.py).
4. Update the episode loop and observation labeling in [server/environment.py](../server/environment.py).
5. Update the task summary in [server/app.py](../server/app.py).
+ 6. Add explicit VMEC failure semantics in [server/environment.py](../server/environment.py).
+ 7. Run a small measured sweep to choose ranges, deltas, and reset seeds.
+ 8. Freeze 1-2 repaired low-dimensional fixtures.
+ 9. Run manual playtesting.
+ 10. Refresh the heuristic baseline only after that evidence exists.

## Out of Scope
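The "run a small measured sweep to choose ranges, deltas, and reset seeds" step above can be as small as an exhaustive grid plus a per-knob responsiveness check. `sweep`, `responsive`, and the metric names are illustrative assumptions; in practice `evaluate` would be the low-fi verifier path.

```python
import itertools
import statistics

def sweep(evaluate, grids):
    """Evaluate every knob combination and keep the raw records, so ranges
    and deltas are frozen from measurement rather than prose."""
    records = []
    for combo in itertools.product(*grids.values()):
        params = dict(zip(grids.keys(), combo))
        records.append({"params": params, **evaluate(params)})
    return records

def responsive(records, knob, metric):
    """How far the mean metric moves across a knob's candidate values.
    A knob whose whole range barely moves the metric is not worth freezing."""
    by_value = {}
    for rec in records:
        by_value.setdefault(rec["params"][knob], []).append(rec[metric])
    means = [statistics.fmean(vals) for vals in by_value.values()]
    return max(means) - min(means)
```

A sweep where `average_triangularity` stays pinned near `+0.005` across the whole grid is precisely the kind of measurement that showed the current 3-knob family is blocked.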