CreativeEngineer committed on
Commit 88d9b78 · 1 Parent(s): fe3a41d

fix: align submit scoring with fidelity
README.md CHANGED
@@ -45,6 +45,7 @@ Implementation status:
 - [x] Update the action contract from 3 knobs to the repaired low-dimensional family
 - [x] Add explicit VMEC failure semantics to the environment contract
 - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
+- [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
 - [ ] Add tracked `P1` fixtures under `server/data/p1/`
 - [ ] Run manual playtesting and record the first reward pathology
 - [ ] Refresh the heuristic baseline for the real verifier path
@@ -57,6 +58,7 @@ Implementation status:
 - The repaired low-dimensional family still needs measured ranges and deltas. Do not narrate guessed `rotational_transform` bounds, `triangularity_scale` deltas, or a larger budget as validated facts until they are measured on the repaired environment.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
+- Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
 
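The README's reward notes above (a ratio-based submit bonus plus the exhaustion asymmetry) can be sketched as a standalone function. This is a minimal sketch based on the `_compute_reward` hunk in this commit: the `5.0` submit multiplier and the budget bonus match the diff, while the `2.0` non-submit multiplier is an illustrative assumption, since the diff does not show that branch's constant.

```python
def terminal_reward(score: float, base_score: float, constraints_ok: bool,
                    budget_remaining: int, budget_total: int,
                    explicit_submit: bool) -> float:
    """Terminal reward sketch: explicit submit outranks budget exhaustion."""
    reward = 0.0
    improved = constraints_ok and score > base_score
    if improved:
        # Improvement normalized by remaining headroom to a perfect 1.0 score,
        # guarded against division by zero (mirrors the diff's ratio).
        ratio = (score - base_score) / max(1.0 - base_score, 1e-6)
        if explicit_submit:
            reward += 5.0 * ratio + budget_remaining / budget_total
        else:
            # Assumed smaller multiplier: exhaustion earns strictly less.
            reward += 2.0 * ratio
    return reward
```

With any positive improvement, the explicit-submit branch dominates, which preserves the asymmetry the README asks tuners to keep.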
TODO.md CHANGED
@@ -33,6 +33,7 @@ Priority source:
 - [x] update the action schema from 3 knobs to the repaired low-dimensional family
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi vs high-fi truth in the observation/task surface
+- [x] separate high-fi submit scoring/reporting from low-fi rollout score state
 - [ ] tracked `P1` fixtures
 - [ ] manual playtest log
 - [x] settle the non-submit terminal reward policy
@@ -146,6 +147,15 @@ flowchart TD
 Related:
 [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)

+- [x] Separate high-fi submit scoring/reporting from low-fi rollout score state
+Completed:
+submit-time reward now uses a high-fidelity initial reference, and submit summaries / displayed best score use high-fidelity state instead of low-fidelity rollout state
+Files:
+[server/environment.py](server/environment.py)
+[fusion_lab/models.py](fusion_lab/models.py)
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
 ## Validation and Reward

 - [ ] Run a small measured sweep on the repaired low-dimensional family
@@ -253,5 +263,6 @@ flowchart TD
 - [ ] Do not let notebook or demo work outrun environment evidence
 - [ ] Do not add training-first complexity before manual playtesting
 - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
+- [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
 - [ ] Do not describe the current baseline reset state as feasible or near-feasible
 - [ ] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
docs/FUSION_DELIVERABLES_MAP.md CHANGED
@@ -17,6 +17,7 @@ Use this map to sequence execution, not to reopen already-locked task choices.
 - [x] repaired low-dimensional boundary builder is implemented
 - [x] explicit VMEC failure semantics are implemented
 - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly
+- [x] terminal submit scoring/reporting is fidelity-consistent
 - [ ] tracked fixtures are checked in
 - [ ] manual playtest evidence exists
 - [ ] heuristic baseline has been refreshed for the real verifier path
@@ -110,13 +111,11 @@ flowchart LR

 Northflank compute bring-up and smoke validation are complete.

-1. Repair the low-dimensional parameterization so triangularity is controllable under the official verifier.
-2. Add explicit VMEC failure semantics and clear low-fi vs high-fi observation labeling.
-3. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
-4. Add tracked fixtures and run fixture sanity checks.
-5. Manual-playtest the environment and record the first real pathology, if any.
-6. Refresh the heuristic baseline from that evidence.
-7. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
-8. Use the notebook to show traces and comparisons; include training only if it adds signal.
-9. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
-10. Polish the repo only after the artifacts are real.
+1. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
+2. Add tracked fixtures and run fixture sanity checks.
+3. Manual-playtest the environment and record the first real pathology, if any.
+4. Refresh the heuristic baseline from that evidence.
+5. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
+6. Use the notebook to show traces and comparisons; include training only if it adds signal.
+7. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
+8. Polish the repo only after the artifacts are real.
 
 
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -18,6 +18,7 @@
 - [x] parameterization repair is implemented so triangularity is controllable
 - [x] explicit VMEC failure semantics are implemented
 - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly in the environment surface
+- [x] terminal scoring/reporting is fidelity-consistent between low-fi rollout state and high-fi submit truth
 - [ ] tracked `P1` fixtures are added
 - [ ] manual playtest evidence is recorded
 - [ ] heuristic baseline is refreshed for the real verifier path
@@ -26,6 +27,7 @@
 Current caution:

 - the repaired family is now live, but the exact ranges, deltas, and reset seeds still need a measured sweep before they should be treated as stable defaults
+- terminal scoring/reporting now uses a fidelity-consistent basis at episode end: high-fi `submit` comparisons are no longer anchored to low-fi rollout score state

 ## 1. Submission Thesis

@@ -347,6 +349,7 @@ Current execution note:
 - once parameterization is repaired, keep `Reward V0` scalar and feasibility-first
 - clearly distinguish low-fidelity step-time metrics from high-fidelity submit-time truth in the observation contract and docs
 - do not use reward complexity to compensate for missing action expressivity or missing VMEC failure semantics
+- keep terminal reward and reporting fidelity-consistent; do not compare high-fi submit scores against low-fi best/initial score state

 ### Reward V0 Failure Modes To Test

@@ -555,6 +558,7 @@ The repo should make the environment easy to understand:

 - local modify -> verify -> observe loop works
 - at least one end-to-end episode is stable
+- submit-time reward/reporting does not mix low-fi and high-fi score state

 ### Gate 5: Reward V1

@@ -752,11 +756,8 @@ That last line is intentionally conservative. It is strong enough without claimi

 ## 21. Immediate Next Actions

-1. Repair the low-dimensional boundary parameterization so triangularity is controllable.
-2. Split boundary construction from official boundary evaluation.
-3. Add explicit VMEC failure semantics and clear low-fi vs high-fi labeling.
-4. Run a small measured sweep before locking ranges, deltas, or budget changes.
-5. Freeze fixtures and run manual playtests before heavy training work.
-6. Mark the current reward as `V0`.
-7. Log the first real pathology and reward revision.
-8. Do not let notebook or video work outrun the environment evidence.
+1. Run a small measured sweep before locking ranges, deltas, or budget changes.
+2. Freeze fixtures and run manual playtests before heavy training work.
+3. Mark the current reward as `V0`.
+4. Log the first real pathology and reward revision.
+5. Do not let notebook or video work outrun the environment evidence.
 
 
 
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED
@@ -20,12 +20,14 @@ Do not expand scope beyond one stable task. Training is supporting evidence, not
 - [x] repair the low-dimensional parameterization
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi `run` truth vs high-fi `submit` truth in the task surface
+- [x] separate high-fi submit scoring/reporting from low-fi rollout score state
 - [ ] add tracked fixtures and manual playtest evidence
 - [ ] refresh the heuristic baseline after the real-verifier rerun

 Current caution:

 - do not assume the first repaired defaults are final; run a measured sweep before treating ranges, deltas, or reset seeds as stable
+- do not present submit-time score comparisons as clean unless they are grounded in the now-separated high-fi submit state

 ## Plan V2 Inheritance

@@ -95,40 +97,35 @@ Transition rule:

 ## Hour 2-4: Verify Wiring, Then Manual Playtest

-1. Repair the low-dimensional parameterization so triangularity is controllable.
-2. Add explicit VMEC failure semantics and visible failure observations.
-3. Label low-fi `run` truth vs high-fi `submit` truth clearly.
-4. Run a small measured sweep on the repaired family before freezing defaults.
-5. Run fixture checks:
+1. Run a small measured sweep on the repaired family before freezing defaults.
+2. Run fixture checks:
    - known-good or near-winning design
    - near-boundary designs
    - clearly bad designs
    - do not rely on the current default baseline params as the only starting point
-6. Confirm:
+3. Confirm:
    - verifier outputs are sane
    - reward ordering is sane
    - objective direction is correct
-7. Manually play 5 to 10 episodes.
-8. Log for each step:
+4. Manually play 5 to 10 episodes.
+5. Log for each step:
    - observation
    - chosen action
    - expected effect
    - returned reward
    - confusion or exploit if observed
-9. Identify at least one bad incentive or exploit.
-10. Patch reward or penalty logic immediately.
-11. Write the reward shaping story:
+6. Identify at least one bad incentive or exploit.
+7. Patch reward or penalty logic immediately.
+8. Write the reward shaping story:
    - initial reward V0
    - bad behavior
    - refinement to reward V1
    - improved behavior
-12. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
+9. If no real pathology appears, record that `Reward V0` survived playtesting and move on.

 Exit condition: you can explain why the environment now rewards the intended behavior.

 Artifacts:
-- repaired low-dimensional boundary plan
-- explicit failure semantics note
 - measured range and delta note
 - fixture check note
 - manual playtest log
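The checklist's "Confirm" step (sane verifier outputs, sane reward ordering, correct objective direction) can be run as executable sanity checks over the three fixture classes. This is a sketch, not repo code: `score_of` is a stand-in for whatever callable wraps the real low-fidelity verifier.

```python
def check_reward_ordering(score_of, good_design, boundary_design, bad_design):
    """Sanity check: a known-good fixture must outscore a near-boundary one,
    which must outscore a clearly bad one. Raises AssertionError otherwise."""
    good, boundary, bad = (score_of(d) for d in (good_design, boundary_design, bad_design))
    # Strict ordering also catches a flipped objective direction, since an
    # inverted verifier reverses all three comparisons at once.
    assert good > boundary > bad, (good, boundary, bad)
    return good, boundary, bad
```

Running this against the tracked fixtures once they exist would turn the "Confirm" bullet into a repeatable check rather than a one-off manual inspection.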
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -169,6 +169,7 @@ Current repo state:

 - the live observation surface now exposes evaluation fidelity and failure state
 - the exact naming can still be refined after playtesting, but low-fi vs high-fi is no longer implicit
+- terminal reward/reporting is now fidelity-consistent: `submit` compares against high-fi reference state instead of low-fi rollout score state

 ## Reward V0

@@ -197,6 +198,11 @@ Do not add:

 Do not use reward complexity to compensate for missing action expressivity or missing crash semantics.

+Additional fidelity rule:
+
+- do not compare a high-fidelity submit score against low-fidelity `initial_score` or `best_score` state
+- terminal reward and submit summaries should use a fidelity-consistent basis
+
 ## Reset Strategy

 Start with frozen exact seeds, not jitter.
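The contract's fidelity rule above can be enforced mechanically by tagging every score with its fidelity and refusing mixed comparisons. This is a hypothetical sketch of that guard, not the repo's actual API; `Score` and `improvement` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    value: float
    fidelity: str  # "low" or "high"


def improvement(current: Score, reference: Score) -> float:
    """Return the score delta, but only within a single fidelity level."""
    if current.fidelity != reference.fidelity:
        # A high-fi submit score compared against low-fi rollout state is
        # exactly the bug this commit removes, so fail loudly instead.
        raise ValueError(
            f"fidelity mismatch: {current.fidelity} vs {reference.fidelity}"
        )
    return current.value - reference.value
```

A type-level tag like this makes the "fidelity-consistent basis" rule a runtime invariant rather than a documentation convention.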
fusion_lab/models.py CHANGED
@@ -53,6 +53,14 @@ class StellaratorObservation(Observation):


 class StellaratorState(State):
+    initial_params: LowDimBoundaryParams = Field(
+        default_factory=lambda: LowDimBoundaryParams(
+            aspect_ratio=3.6,
+            elongation=1.4,
+            rotational_transform=1.6,
+            triangularity_scale=0.55,
+        )
+    )
     current_params: LowDimBoundaryParams = Field(
         default_factory=lambda: LowDimBoundaryParams(
             aspect_ratio=3.6,
@@ -70,8 +78,11 @@ class StellaratorState(State):
         )
     )
     initial_score: float = 0.0
+    initial_high_fidelity_score: float | None = None
     best_score: float = 0.0
     best_feasibility: float = float("inf")
+    best_high_fidelity_score: float | None = None
+    best_high_fidelity_feasibility: float = float("inf")
     budget_total: int = 6
     budget_remaining: int = 6
     episode_done: bool = False
server/environment.py CHANGED
@@ -89,11 +89,13 @@ class StellaratorEnvironment(
         self._state = StellaratorState(
             episode_id=episode_id,
             step_count=0,
+            initial_params=params,
             current_params=params,
             best_params=params,
             initial_score=metrics.p1_score,
             best_score=metrics.p1_score,
             best_feasibility=metrics.p1_feasibility,
+            best_high_fidelity_feasibility=float("inf"),
             budget_total=BUDGET,
             budget_remaining=BUDGET,
             episode_done=False,
@@ -170,8 +172,15 @@ class StellaratorEnvironment(

     def _handle_submit(self) -> StellaratorObservation:
         metrics = self._evaluate_params(self._state.current_params, fidelity="high")
-        reward = self._compute_reward(metrics, "submit", done=True)
-        summary = self._summary_submit(metrics)
+        initial_submit_metrics = self._initial_high_fidelity_metrics()
+        best_submit_metrics = self._refresh_best_high_fidelity_metrics(metrics)
+        reward = self._compute_reward(
+            metrics,
+            "submit",
+            done=True,
+            initial_reference_score=initial_submit_metrics.p1_score,
+        )
+        summary = self._summary_submit(metrics, best_submit_metrics)
         self._state.history.append(summary)
         self._state.episode_done = True
         self._last_metrics = metrics
@@ -229,6 +238,7 @@ class StellaratorEnvironment(
         metrics: EvaluationMetrics,
         intent: str,
         done: bool,
+        initial_reference_score: float | None = None,
     ) -> float:
         previous_metrics = self._reference_metrics(metrics)
         if metrics.evaluation_failed:
@@ -257,13 +267,14 @@ class StellaratorEnvironment(
             reward -= 0.1

         if intent == "submit" or done:
-            improved = (
-                metrics.constraints_satisfied and metrics.p1_score > self._state.initial_score
+            base_score = (
+                initial_reference_score
+                if initial_reference_score is not None
+                else self._state.initial_score
             )
+            improved = metrics.constraints_satisfied and metrics.p1_score > base_score
             if improved:
-                ratio = (metrics.p1_score - self._state.initial_score) / max(
-                    1.0 - self._state.initial_score, 1e-6
-                )
+                ratio = (metrics.p1_score - base_score) / max(1.0 - base_score, 1e-6)
                 if intent == "submit":
                     reward += 5.0 * ratio + self._state.budget_remaining / self._state.budget_total
                 else:
@@ -290,11 +301,14 @@ class StellaratorEnvironment(
             text_lines.append(f"failure_reason={metrics.failure_reason}")
         text_lines.extend(
             [
-                f"max_elongation={metrics.max_elongation:.4f} | best_score={self._state.best_score:.6f}",
+                f"max_elongation={metrics.max_elongation:.4f} | best_score={self._display_best_score(metrics):.6f}",
                 f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
                 f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
                 f"edge_iota_over_nfp={metrics.edge_iota_over_nfp:.4f} (>= {EDGE_IOTA_OVER_NFP_MIN:.1f})",
-                f"feasibility={metrics.p1_feasibility:.6f} | best_feasibility={self._state.best_feasibility:.6f}",
+                (
+                    f"feasibility={metrics.p1_feasibility:.6f} | "
+                    f"best_feasibility={self._display_best_feasibility(metrics):.6f}"
+                ),
                 f"vacuum_well={metrics.vacuum_well:.4f}",
                 f"constraints={'SATISFIED' if metrics.constraints_satisfied else 'VIOLATED'}",
                 f"step={self._state.step_count} | budget={self._state.budget_remaining}/{self._state.budget_total}",
@@ -315,8 +329,8 @@ class StellaratorEnvironment(
             failure_reason=metrics.failure_reason,
             step_number=self._state.step_count,
             budget_remaining=self._state.budget_remaining,
-            best_score=self._state.best_score,
-            best_feasibility=self._state.best_feasibility,
+            best_score=self._display_best_score(metrics),
+            best_feasibility=self._display_best_feasibility(metrics),
             constraints_satisfied=metrics.constraints_satisfied,
             target_spec=TARGET_SPEC,
             reward=reward,
@@ -349,13 +363,17 @@ class StellaratorEnvironment(
             f"Low-fidelity evaluation. {objective_summary}"
         )

-    def _summary_submit(self, metrics: EvaluationMetrics) -> str:
+    def _summary_submit(
+        self,
+        metrics: EvaluationMetrics,
+        best_submit_metrics: EvaluationMetrics,
+    ) -> str:
         if metrics.evaluation_failed:
             return f"Submit failed during high-fidelity evaluation: {metrics.failure_reason}"
         return (
-            f"Submitted current_score={metrics.p1_score:.6f}, "
-            f"best_seen_score={self._state.best_score:.6f}, "
-            f"best_feasibility={self._state.best_feasibility:.6f}, "
+            f"Submitted current_high_fidelity_score={metrics.p1_score:.6f}, "
+            f"best_high_fidelity_score={best_submit_metrics.p1_score:.6f}, "
+            f"best_high_fidelity_feasibility={best_submit_metrics.p1_feasibility:.6f}, "
            f"constraints={'SATISFIED' if metrics.constraints_satisfied else 'VIOLATED'}."
         )

@@ -412,6 +430,41 @@ class StellaratorEnvironment(
             return self._last_successful_metrics
         return fallback

+    def _initial_high_fidelity_metrics(self) -> EvaluationMetrics:
+        if self._state.initial_high_fidelity_score is not None:
+            return self._evaluate_params(self._state.initial_params, fidelity="high")
+        metrics = self._evaluate_params(self._state.initial_params, fidelity="high")
+        self._state.initial_high_fidelity_score = metrics.p1_score
+        return metrics
+
+    def _refresh_best_high_fidelity_metrics(
+        self,
+        current_submit_metrics: EvaluationMetrics,
+    ) -> EvaluationMetrics:
+        best_metrics = current_submit_metrics
+        if self._state.best_params != self._state.current_params:
+            best_metrics = self._evaluate_params(self._state.best_params, fidelity="high")
+
+        self._state.best_high_fidelity_score = best_metrics.p1_score
+        self._state.best_high_fidelity_feasibility = best_metrics.p1_feasibility
+        return best_metrics
+
+    def _display_best_score(self, metrics: EvaluationMetrics) -> float:
+        if (
+            metrics.evaluation_fidelity == "high"
+            and self._state.best_high_fidelity_score is not None
+        ):
+            return self._state.best_high_fidelity_score
+        return self._state.best_score
+
+    def _display_best_feasibility(self, metrics: EvaluationMetrics) -> float:
+        if (
+            metrics.evaluation_fidelity == "high"
+            and self._state.best_high_fidelity_score is not None
+        ):
+            return self._state.best_high_fidelity_feasibility
+        return self._state.best_feasibility
+
     def _update_best(self, params: LowDimBoundaryParams, metrics: EvaluationMetrics) -> None:
         if metrics.evaluation_failed:
             return
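One property of `_refresh_best_high_fidelity_metrics` above is worth making explicit: when the submitted params are already the best params, it reuses the submit-time high-fidelity evaluation instead of running a second one. A standalone sketch with a counting fake evaluator (both `FakeVerifier` and `refresh_best` are illustrative stand-ins, not repo code):

```python
class FakeVerifier:
    """Counts high-fidelity calls; stands in for the real (expensive) evaluator."""

    def __init__(self):
        self.high_fi_calls = 0

    def evaluate(self, params, fidelity):
        if fidelity == "high":
            self.high_fi_calls += 1
        return sum(params)  # placeholder score


def refresh_best(verifier, best_params, current_params, current_score):
    # Only pay for a second high-fi evaluation when the rollout's best
    # design differs from the one being submitted right now.
    if best_params != current_params:
        return verifier.evaluate(best_params, "high")
    return current_score  # reuse the submit evaluation
```

This matters because high-fidelity evaluations are the budgeted, expensive path; a submit of the already-best design should cost exactly one of them.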