Commit 88d9b78
Parent(s): fe3a41d

fix: align submit scoring with fidelity

Changed files:
- README.md +2 -0
- TODO.md +11 -0
- docs/FUSION_DELIVERABLES_MAP.md +9 -10
- docs/FUSION_DESIGN_LAB_PLAN_V2.md +9 -8
- docs/FUSION_NEXT_12_HOURS_CHECKLIST.md +11 -14
- docs/P1_ENV_CONTRACT_V1.md +6 -0
- fusion_lab/models.py +11 -0
- server/environment.py +68 -15
README.md
CHANGED

@@ -45,6 +45,7 @@ Implementation status:
 - [x] Update the action contract from 3 knobs to the repaired low-dimensional family
 - [x] Add explicit VMEC failure semantics to the environment contract
 - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
+- [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
 - [ ] Add tracked `P1` fixtures under `server/data/p1/`
 - [ ] Run manual playtesting and record the first reward pathology
 - [ ] Refresh the heuristic baseline for the real verifier path

@@ -57,6 +58,7 @@ Implementation status:
 - The repaired low-dimensional family still needs measured ranges and deltas. Do not narrate guessed `rotational_transform` bounds, `triangularity_scale` deltas, or a larger budget as validated facts until they are measured on the repaired environment.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
+- Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
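The budget-exhaustion asymmetry noted in the README can be sketched as follows. This is a minimal illustration, not the repository's code: `terminal_reward` and the `0.5` exhaustion scale are assumptions chosen only to show the shape of the asymmetry.

```python
def terminal_reward(improvement_ratio: float, explicit_submit: bool) -> float:
    """Terminal reward that favors deliberate submission over budget exhaustion.

    Illustrative sketch: the 5.0 bonus weight matches the diff below, but the
    0.5 exhaustion scale is an assumed value, not the repo's actual constant.
    """
    base = 5.0 * improvement_ratio
    if explicit_submit:
        # Explicit `submit` keeps the full improvement bonus.
        return base
    # Budget exhaustion gets a strictly smaller bonus, so agents that can
    # submit deliberately are never better off letting the budget run out.
    return 0.5 * base
```

With this shape, an agent holding the same improvement always prefers submitting over timing out.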
TODO.md
CHANGED

@@ -33,6 +33,7 @@ Priority source:
 - [x] update the action schema from 3 knobs to the repaired low-dimensional family
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi vs high-fi truth in the observation/task surface
+- [x] separate high-fi submit scoring/reporting from low-fi rollout score state
 - [ ] tracked `P1` fixtures
 - [ ] manual playtest log
 - [x] settle the non-submit terminal reward policy

@@ -146,6 +147,15 @@ flowchart TD
 Related:
 [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
 
+- [x] Separate high-fi submit scoring/reporting from low-fi rollout score state
+Completed:
+submit-time reward now uses a high-fidelity initial reference, and submit summaries / displayed best score use high-fidelity state instead of low-fidelity rollout state
+Files:
+[server/environment.py](server/environment.py)
+[fusion_lab/models.py](fusion_lab/models.py)
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
 ## Validation and Reward
 
 - [ ] Run a small measured sweep on the repaired low-dimensional family

@@ -253,5 +263,6 @@ flowchart TD
 - [ ] Do not let notebook or demo work outrun environment evidence
 - [ ] Do not add training-first complexity before manual playtesting
 - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
+- [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
 - [ ] Do not describe the current baseline reset state as feasible or near-feasible
 - [ ] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
docs/FUSION_DELIVERABLES_MAP.md
CHANGED

@@ -17,6 +17,7 @@ Use this map to sequence execution, not to reopen already-locked task choices.
 - [x] repaired low-dimensional boundary builder is implemented
 - [x] explicit VMEC failure semantics are implemented
 - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly
+- [x] terminal submit scoring/reporting is fidelity-consistent
 - [ ] tracked fixtures are checked in
 - [ ] manual playtest evidence exists
 - [ ] heuristic baseline has been refreshed for the real verifier path

@@ -110,13 +111,11 @@ flowchart LR
 
 Northflank compute bring-up and smoke validation are complete.
 
-1.
-2. Add
-3.
-4.
-5.
-6.
-7.
-8.
-9. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
-10. Polish the repo only after the artifacts are real.
+1. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
+2. Add tracked fixtures and run fixture sanity checks.
+3. Manual-playtest the environment and record the first real pathology, if any.
+4. Refresh the heuristic baseline from that evidence.
+5. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
+6. Use the notebook to show traces and comparisons; include training only if it adds signal.
+7. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
+8. Polish the repo only after the artifacts are real.
docs/FUSION_DESIGN_LAB_PLAN_V2.md
CHANGED

@@ -18,6 +18,7 @@
 - [x] parameterization repair is implemented so triangularity is controllable
 - [x] explicit VMEC failure semantics are implemented
 - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly in the environment surface
+- [x] terminal scoring/reporting is fidelity-consistent between low-fi rollout state and high-fi submit truth
 - [ ] tracked `P1` fixtures are added
 - [ ] manual playtest evidence is recorded
 - [ ] heuristic baseline is refreshed for the real verifier path

@@ -26,6 +27,7 @@
 Current caution:
 
 - the repaired family is now live, but the exact ranges, deltas, and reset seeds still need a measured sweep before they should be treated as stable defaults
+- terminal scoring/reporting now uses a fidelity-consistent basis at episode end: high-fi `submit` comparisons are no longer anchored to low-fi rollout score state
 
 ## 1. Submission Thesis
 

@@ -347,6 +349,7 @@ Current execution note:
 - once parameterization is repaired, keep `Reward V0` scalar and feasibility-first
 - clearly distinguish low-fidelity step-time metrics from high-fidelity submit-time truth in the observation contract and docs
 - do not use reward complexity to compensate for missing action expressivity or missing VMEC failure semantics
+- keep terminal reward and reporting fidelity-consistent; do not compare high-fi submit scores against low-fi best/initial score state
 
 ### Reward V0 Failure Modes To Test
 

@@ -555,6 +558,7 @@ The repo should make the environment easy to understand:
 
 - local modify -> verify -> observe loop works
 - at least one end-to-end episode is stable
+- submit-time reward/reporting does not mix low-fi and high-fi score state
 
 ### Gate 5: Reward V1
 

@@ -752,11 +756,8 @@ That last line is intentionally conservative. It is strong enough without claimi
 
 ## 21. Immediate Next Actions
 
-1.
-2.
-3.
-4.
-5.
-6. Mark the current reward as `V0`.
-7. Log the first real pathology and reward revision.
-8. Do not let notebook or video work outrun the environment evidence.
+1. Run a small measured sweep before locking ranges, deltas, or budget changes.
+2. Freeze fixtures and run manual playtests before heavy training work.
+3. Mark the current reward as `V0`.
+4. Log the first real pathology and reward revision.
+5. Do not let notebook or video work outrun the environment evidence.
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md
CHANGED

@@ -20,12 +20,14 @@ Do not expand scope beyond one stable task. Training is supporting evidence, not
 - [x] repair the low-dimensional parameterization
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi `run` truth vs high-fi `submit` truth in the task surface
+- [x] separate high-fi submit scoring/reporting from low-fi rollout score state
 - [ ] add tracked fixtures and manual playtest evidence
 - [ ] refresh the heuristic baseline after the real-verifier rerun
 
 Current caution:
 
 - do not assume the first repaired defaults are final; run a measured sweep before treating ranges, deltas, or reset seeds as stable
+- do not present submit-time score comparisons as clean unless they are grounded in the now-separated high-fi submit state
 
 ## Plan V2 Inheritance
 

@@ -95,40 +97,35 @@ Transition rule:
 
 ## Hour 2-4: Verify Wiring, Then Manual Playtest
 
-1.
-2.
-3. Label low-fi `run` truth vs high-fi `submit` truth clearly.
-4. Run a small measured sweep on the repaired family before freezing defaults.
-5. Run fixture checks:
+1. Run a small measured sweep on the repaired family before freezing defaults.
+2. Run fixture checks:
 - known-good or near-winning design
 - near-boundary designs
 - clearly bad designs
 - do not rely on the current default baseline params as the only starting point
-
+3. Confirm:
 - verifier outputs are sane
 - reward ordering is sane
 - objective direction is correct
-
-
+4. Manually play 5 to 10 episodes.
+5. Log for each step:
 - observation
 - chosen action
 - expected effect
 - returned reward
 - confusion or exploit if observed
-
-
-
+6. Identify at least one bad incentive or exploit.
+7. Patch reward or penalty logic immediately.
+8. Write the reward shaping story:
 - initial reward V0
 - bad behavior
 - refinement to reward V1
 - improved behavior
-
+9. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
 
 Exit condition: you can explain why the environment now rewards the intended behavior.
 
 Artifacts:
-- repaired low-dimensional boundary plan
-- explicit failure semantics note
 - measured range and delta note
 - fixture check note
 - manual playtest log
docs/P1_ENV_CONTRACT_V1.md
CHANGED

@@ -169,6 +169,7 @@ Current repo state:
 
 - the live observation surface now exposes evaluation fidelity and failure state
 - the exact naming can still be refined after playtesting, but low-fi vs high-fi is no longer implicit
+- terminal reward/reporting is now fidelity-consistent: `submit` compares against high-fi reference state instead of low-fi rollout score state
 
 ## Reward V0
 

@@ -197,6 +198,11 @@ Do not add:
 
 Do not use reward complexity to compensate for missing action expressivity or missing crash semantics.
 
+Additional fidelity rule:
+
+- do not compare a high-fidelity submit score against low-fidelity `initial_score` or `best_score` state
+- terminal reward and submit summaries should use a fidelity-consistent basis
+
 ## Reset Strategy
 
 Start with frozen exact seeds, not jitter.
fusion_lab/models.py
CHANGED

@@ -53,6 +53,14 @@ class StellaratorObservation(Observation):
 
 
 class StellaratorState(State):
+    initial_params: LowDimBoundaryParams = Field(
+        default_factory=lambda: LowDimBoundaryParams(
+            aspect_ratio=3.6,
+            elongation=1.4,
+            rotational_transform=1.6,
+            triangularity_scale=0.55,
+        )
+    )
     current_params: LowDimBoundaryParams = Field(
         default_factory=lambda: LowDimBoundaryParams(
             aspect_ratio=3.6,

@@ -70,8 +78,11 @@ class StellaratorState(State):
         )
     )
     initial_score: float = 0.0
+    initial_high_fidelity_score: float | None = None
     best_score: float = 0.0
     best_feasibility: float = float("inf")
+    best_high_fidelity_score: float | None = None
+    best_high_fidelity_feasibility: float = float("inf")
     budget_total: int = 6
     budget_remaining: int = 6
     episode_done: bool = False
server/environment.py
CHANGED

@@ -89,11 +89,13 @@ class StellaratorEnvironment(
         self._state = StellaratorState(
             episode_id=episode_id,
             step_count=0,
+            initial_params=params,
             current_params=params,
             best_params=params,
             initial_score=metrics.p1_score,
             best_score=metrics.p1_score,
             best_feasibility=metrics.p1_feasibility,
+            best_high_fidelity_feasibility=float("inf"),
             budget_total=BUDGET,
             budget_remaining=BUDGET,
             episode_done=False,

@@ -170,8 +172,15 @@ class StellaratorEnvironment(
 
     def _handle_submit(self) -> StellaratorObservation:
         metrics = self._evaluate_params(self._state.current_params, fidelity="high")
-
-
+        initial_submit_metrics = self._initial_high_fidelity_metrics()
+        best_submit_metrics = self._refresh_best_high_fidelity_metrics(metrics)
+        reward = self._compute_reward(
+            metrics,
+            "submit",
+            done=True,
+            initial_reference_score=initial_submit_metrics.p1_score,
+        )
+        summary = self._summary_submit(metrics, best_submit_metrics)
         self._state.history.append(summary)
         self._state.episode_done = True
         self._last_metrics = metrics

@@ -229,6 +238,7 @@ class StellaratorEnvironment(
         metrics: EvaluationMetrics,
         intent: str,
         done: bool,
+        initial_reference_score: float | None = None,
     ) -> float:
         previous_metrics = self._reference_metrics(metrics)
         if metrics.evaluation_failed:

@@ -257,13 +267,14 @@ class StellaratorEnvironment(
             reward -= 0.1
 
         if intent == "submit" or done:
-
-
+            base_score = (
+                initial_reference_score
+                if initial_reference_score is not None
+                else self._state.initial_score
             )
+            improved = metrics.constraints_satisfied and metrics.p1_score > base_score
             if improved:
-                ratio = (metrics.p1_score -
-                1.0 - self._state.initial_score, 1e-6
-                )
+                ratio = (metrics.p1_score - base_score) / max(1.0 - base_score, 1e-6)
                 if intent == "submit":
                     reward += 5.0 * ratio + self._state.budget_remaining / self._state.budget_total
                 else:

@@ -290,11 +301,14 @@ class StellaratorEnvironment(
             text_lines.append(f"failure_reason={metrics.failure_reason}")
         text_lines.extend(
             [
-                f"max_elongation={metrics.max_elongation:.4f} | best_score={self.
+                f"max_elongation={metrics.max_elongation:.4f} | best_score={self._display_best_score(metrics):.6f}",
                 f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
                 f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
                 f"edge_iota_over_nfp={metrics.edge_iota_over_nfp:.4f} (>= {EDGE_IOTA_OVER_NFP_MIN:.1f})",
-
+                (
+                    f"feasibility={metrics.p1_feasibility:.6f} | "
+                    f"best_feasibility={self._display_best_feasibility(metrics):.6f}"
+                ),
                 f"vacuum_well={metrics.vacuum_well:.4f}",
                 f"constraints={'SATISFIED' if metrics.constraints_satisfied else 'VIOLATED'}",
                 f"step={self._state.step_count} | budget={self._state.budget_remaining}/{self._state.budget_total}",

@@ -315,8 +329,8 @@ class StellaratorEnvironment(
             failure_reason=metrics.failure_reason,
             step_number=self._state.step_count,
             budget_remaining=self._state.budget_remaining,
-            best_score=self.
-            best_feasibility=self.
+            best_score=self._display_best_score(metrics),
+            best_feasibility=self._display_best_feasibility(metrics),
             constraints_satisfied=metrics.constraints_satisfied,
             target_spec=TARGET_SPEC,
             reward=reward,

@@ -349,13 +363,17 @@ class StellaratorEnvironment(
             f"Low-fidelity evaluation. {objective_summary}"
         )
 
-    def _summary_submit(
+    def _summary_submit(
+        self,
+        metrics: EvaluationMetrics,
+        best_submit_metrics: EvaluationMetrics,
+    ) -> str:
         if metrics.evaluation_failed:
             return f"Submit failed during high-fidelity evaluation: {metrics.failure_reason}"
         return (
-            f"Submitted
-            f"
-            f"
+            f"Submitted current_high_fidelity_score={metrics.p1_score:.6f}, "
+            f"best_high_fidelity_score={best_submit_metrics.p1_score:.6f}, "
+            f"best_high_fidelity_feasibility={best_submit_metrics.p1_feasibility:.6f}, "
            f"constraints={'SATISFIED' if metrics.constraints_satisfied else 'VIOLATED'}."
         )
 

@@ -412,6 +430,41 @@ class StellaratorEnvironment(
             return self._last_successful_metrics
         return fallback
 
+    def _initial_high_fidelity_metrics(self) -> EvaluationMetrics:
+        if self._state.initial_high_fidelity_score is not None:
+            return self._evaluate_params(self._state.initial_params, fidelity="high")
+        metrics = self._evaluate_params(self._state.initial_params, fidelity="high")
+        self._state.initial_high_fidelity_score = metrics.p1_score
+        return metrics
+
+    def _refresh_best_high_fidelity_metrics(
+        self,
+        current_submit_metrics: EvaluationMetrics,
+    ) -> EvaluationMetrics:
+        best_metrics = current_submit_metrics
+        if self._state.best_params != self._state.current_params:
+            best_metrics = self._evaluate_params(self._state.best_params, fidelity="high")
+
+        self._state.best_high_fidelity_score = best_metrics.p1_score
+        self._state.best_high_fidelity_feasibility = best_metrics.p1_feasibility
+        return best_metrics
+
+    def _display_best_score(self, metrics: EvaluationMetrics) -> float:
+        if (
+            metrics.evaluation_fidelity == "high"
+            and self._state.best_high_fidelity_score is not None
+        ):
+            return self._state.best_high_fidelity_score
+        return self._state.best_score
+
+    def _display_best_feasibility(self, metrics: EvaluationMetrics) -> float:
+        if (
+            metrics.evaluation_fidelity == "high"
+            and self._state.best_high_fidelity_score is not None
+        ):
+            return self._state.best_high_fidelity_feasibility
+        return self._state.best_feasibility
+
     def _update_best(self, params: LowDimBoundaryParams, metrics: EvaluationMetrics) -> None:
         if metrics.evaluation_failed:
             return