CreativeEngineer committed on
Commit 88d9b78 · 1 Parent(s): fe3a41d

fix: align submit scoring with fidelity
README.md CHANGED
@@ -45,6 +45,7 @@ Implementation status:
 - [x] Update the action contract from 3 knobs to the repaired low-dimensional family
 - [x] Add explicit VMEC failure semantics to the environment contract
 - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
+- [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
 - [ ] Add tracked `P1` fixtures under `server/data/p1/`
 - [ ] Run manual playtesting and record the first reward pathology
 - [ ] Refresh the heuristic baseline for the real verifier path
@@ -57,6 +58,7 @@ Implementation status:
 - The repaired low-dimensional family still needs measured ranges and deltas. Do not narrate guessed `rotational_transform` bounds, `triangularity_scale` deltas, or a larger budget as validated facts until they are measured on the repaired environment.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
+- Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
 
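The README's reward notes above (a ratio-based submit bonus plus the exhaustion asymmetry) can be sketched as a standalone function. This is a minimal sketch based on the `_compute_reward` hunk in this commit: the `5.0` submit multiplier and the budget bonus match the diff, while the `2.0` non-submit multiplier is an illustrative assumption, since the diff does not show that branch's constant.

```python
def terminal_reward(score: float, base_score: float, constraints_ok: bool,
                    budget_remaining: int, budget_total: int,
                    explicit_submit: bool) -> float:
    """Terminal reward sketch: explicit submit outranks budget exhaustion."""
    reward = 0.0
    improved = constraints_ok and score > base_score
    if improved:
        # Improvement normalized by remaining headroom to a perfect 1.0 score,
        # guarded against division by zero (mirrors the diff's ratio).
        ratio = (score - base_score) / max(1.0 - base_score, 1e-6)
        if explicit_submit:
            reward += 5.0 * ratio + budget_remaining / budget_total
        else:
            # Assumed smaller multiplier: exhaustion earns strictly less.
            reward += 2.0 * ratio
    return reward
```

With any positive improvement, the explicit-submit branch dominates, which preserves the asymmetry the README asks tuners to keep.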
TODO.md CHANGED
@@ -33,6 +33,7 @@ Priority source:
 - [x] update the action schema from 3 knobs to the repaired low-dimensional family
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi vs high-fi truth in the observation/task surface
+- [x] separate high-fi submit scoring/reporting from low-fi rollout score state
 - [ ] tracked `P1` fixtures
 - [ ] manual playtest log
 - [x] settle the non-submit terminal reward policy
@@ -146,6 +147,15 @@ flowchart TD
 Related:
 [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)

+- [x] Separate high-fi submit scoring/reporting from low-fi rollout score state
+Completed:
+submit-time reward now uses a high-fidelity initial reference, and submit summaries / displayed best score use high-fidelity state instead of low-fidelity rollout state
+Files:
+[server/environment.py](server/environment.py)
+[fusion_lab/models.py](fusion_lab/models.py)
+Related:
+[P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
 ## Validation and Reward

 - [ ] Run a small measured sweep on the repaired low-dimensional family
@@ -253,5 +263,6 @@ flowchart TD
 - [ ] Do not let notebook or demo work outrun environment evidence
 - [ ] Do not add training-first complexity before manual playtesting
 - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
+- [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
 - [ ] Do not describe the current baseline reset state as feasible or near-feasible
 - [ ] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
docs/FUSION_DELIVERABLES_MAP.md CHANGED
@@ -17,6 +17,7 @@ Use this map to sequence execution, not to reopen already-locked task choices.
 - [x] repaired low-dimensional boundary builder is implemented
 - [x] explicit VMEC failure semantics are implemented
 - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly
+- [x] terminal submit scoring/reporting is fidelity-consistent
 - [ ] tracked fixtures are checked in
 - [ ] manual playtest evidence exists
 - [ ] heuristic baseline has been refreshed for the real verifier path
@@ -110,13 +111,11 @@ flowchart LR

 Northflank compute bring-up and smoke validation are complete.

-1. Repair the low-dimensional parameterization so triangularity is controllable under the official verifier.
-2. Add explicit VMEC failure semantics and clear low-fi vs high-fi observation labeling.
-3. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
-4. Add tracked fixtures and run fixture sanity checks.
-5. Manual-playtest the environment and record the first real pathology, if any.
-6. Refresh the heuristic baseline from that evidence.
-7. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
-8. Use the notebook to show traces and comparisons; include training only if it adds signal.
-9. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
-10. Polish the repo only after the artifacts are real.
+1. Run a small measured sweep before locking ranges, deltas, reset seeds, or budget changes.
+2. Add tracked fixtures and run fixture sanity checks.
+3. Manual-playtest the environment and record the first real pathology, if any.
+4. Refresh the heuristic baseline from that evidence.
+5. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
+6. Use the notebook to show traces and comparisons; include training only if it adds signal.
+7. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
+8. Polish the repo only after the artifacts are real.
 
 
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -18,6 +18,7 @@
 - [x] parameterization repair is implemented so triangularity is controllable
 - [x] explicit VMEC failure semantics are implemented
 - [x] low-fi `run` truth vs high-fi `submit` truth is labeled clearly in the environment surface
+- [x] terminal scoring/reporting is fidelity-consistent between low-fi rollout state and high-fi submit truth
 - [ ] tracked `P1` fixtures are added
 - [ ] manual playtest evidence is recorded
 - [ ] heuristic baseline is refreshed for the real verifier path
@@ -26,6 +27,7 @@
 Current caution:

 - the repaired family is now live, but the exact ranges, deltas, and reset seeds still need a measured sweep before they should be treated as stable defaults
+- terminal scoring/reporting now uses a fidelity-consistent basis at episode end: high-fi `submit` comparisons are no longer anchored to low-fi rollout score state

 ## 1. Submission Thesis

@@ -347,6 +349,7 @@ Current execution note:
 - once parameterization is repaired, keep `Reward V0` scalar and feasibility-first
 - clearly distinguish low-fidelity step-time metrics from high-fidelity submit-time truth in the observation contract and docs
 - do not use reward complexity to compensate for missing action expressivity or missing VMEC failure semantics
+- keep terminal reward and reporting fidelity-consistent; do not compare high-fi submit scores against low-fi best/initial score state

 ### Reward V0 Failure Modes To Test

@@ -555,6 +558,7 @@ The repo should make the environment easy to understand:

 - local modify -> verify -> observe loop works
 - at least one end-to-end episode is stable
+- submit-time reward/reporting does not mix low-fi and high-fi score state

 ### Gate 5: Reward V1

@@ -752,11 +756,8 @@ That last line is intentionally conservative. It is strong enough without claimi

 ## 21. Immediate Next Actions

-1. Repair the low-dimensional boundary parameterization so triangularity is controllable.
-2. Split boundary construction from official boundary evaluation.
-3. Add explicit VMEC failure semantics and clear low-fi vs high-fi labeling.
-4. Run a small measured sweep before locking ranges, deltas, or budget changes.
-5. Freeze fixtures and run manual playtests before heavy training work.
-6. Mark the current reward as `V0`.
-7. Log the first real pathology and reward revision.
-8. Do not let notebook or video work outrun the environment evidence.
+1. Run a small measured sweep before locking ranges, deltas, or budget changes.
+2. Freeze fixtures and run manual playtests before heavy training work.
+3. Mark the current reward as `V0`.
+4. Log the first real pathology and reward revision.
+5. Do not let notebook or video work outrun the environment evidence.
 
 
 
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED
@@ -20,12 +20,14 @@ Do not expand scope beyond one stable task. Training is supporting evidence, not
 - [x] repair the low-dimensional parameterization
 - [x] add explicit VMEC failure semantics
 - [x] label low-fi `run` truth vs high-fi `submit` truth in the task surface
+- [x] separate high-fi submit scoring/reporting from low-fi rollout score state
 - [ ] add tracked fixtures and manual playtest evidence
 - [ ] refresh the heuristic baseline after the real-verifier rerun

 Current caution:

 - do not assume the first repaired defaults are final; run a measured sweep before treating ranges, deltas, or reset seeds as stable
+- do not present submit-time score comparisons as clean unless they are grounded in the now-separated high-fi submit state

 ## Plan V2 Inheritance

@@ -95,40 +97,35 @@ Transition rule:

 ## Hour 2-4: Verify Wiring, Then Manual Playtest

-1. Repair the low-dimensional parameterization so triangularity is controllable.
-2. Add explicit VMEC failure semantics and visible failure observations.
-3. Label low-fi `run` truth vs high-fi `submit` truth clearly.
-4. Run a small measured sweep on the repaired family before freezing defaults.
-5. Run fixture checks:
+1. Run a small measured sweep on the repaired family before freezing defaults.
+2. Run fixture checks:
    - known-good or near-winning design
    - near-boundary designs
    - clearly bad designs
    - do not rely on the current default baseline params as the only starting point
-6. Confirm:
+3. Confirm:
    - verifier outputs are sane
    - reward ordering is sane
    - objective direction is correct
-7. Manually play 5 to 10 episodes.
-8. Log for each step:
+4. Manually play 5 to 10 episodes.
+5. Log for each step:
    - observation
    - chosen action
    - expected effect
    - returned reward
    - confusion or exploit if observed
-9. Identify at least one bad incentive or exploit.
-10. Patch reward or penalty logic immediately.
-11. Write the reward shaping story:
+6. Identify at least one bad incentive or exploit.
+7. Patch reward or penalty logic immediately.
+8. Write the reward shaping story:
    - initial reward V0
    - bad behavior
    - refinement to reward V1
    - improved behavior
-12. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
+9. If no real pathology appears, record that `Reward V0` survived playtesting and move on.

 Exit condition: you can explain why the environment now rewards the intended behavior.

 Artifacts:
-- repaired low-dimensional boundary plan
-- explicit failure semantics note
 - measured range and delta note
 - fixture check note
 - manual playtest log
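The checklist's "Confirm" step (sane verifier outputs, sane reward ordering, correct objective direction) can be run as executable sanity checks over the three fixture classes. This is a sketch, not repo code: `score_of` is a stand-in for whatever callable wraps the real low-fidelity verifier.

```python
def check_reward_ordering(score_of, good_design, boundary_design, bad_design):
    """Sanity check: a known-good fixture must outscore a near-boundary one,
    which must outscore a clearly bad one. Raises AssertionError otherwise."""
    good, boundary, bad = (score_of(d) for d in (good_design, boundary_design, bad_design))
    # Strict ordering also catches a flipped objective direction, since an
    # inverted verifier reverses all three comparisons at once.
    assert good > boundary > bad, (good, boundary, bad)
    return good, boundary, bad
```

Running this against the tracked fixtures once they exist would turn the "Confirm" bullet into a repeatable check rather than a one-off manual inspection.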
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -169,6 +169,7 @@ Current repo state:

 - the live observation surface now exposes evaluation fidelity and failure state
 - the exact naming can still be refined after playtesting, but low-fi vs high-fi is no longer implicit
+- terminal reward/reporting is now fidelity-consistent: `submit` compares against high-fi reference state instead of low-fi rollout score state

 ## Reward V0

@@ -197,6 +198,11 @@ Do not add:

 Do not use reward complexity to compensate for missing action expressivity or missing crash semantics.

+Additional fidelity rule:
+
+- do not compare a high-fidelity submit score against low-fidelity `initial_score` or `best_score` state
+- terminal reward and submit summaries should use a fidelity-consistent basis
+
 ## Reset Strategy

 Start with frozen exact seeds, not jitter.
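The contract's fidelity rule above can be enforced mechanically by tagging every score with its fidelity and refusing mixed comparisons. This is a hypothetical sketch of that guard, not the repo's actual API; `Score` and `improvement` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    value: float
    fidelity: str  # "low" or "high"


def improvement(current: Score, reference: Score) -> float:
    """Return the score delta, but only within a single fidelity level."""
    if current.fidelity != reference.fidelity:
        # A high-fi submit score compared against low-fi rollout state is
        # exactly the bug this commit removes, so fail loudly instead.
        raise ValueError(
            f"fidelity mismatch: {current.fidelity} vs {reference.fidelity}"
        )
    return current.value - reference.value
```

A type-level tag like this makes the "fidelity-consistent basis" rule a runtime invariant rather than a documentation convention.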
fusion_lab/models.py CHANGED
@@ -53,6 +53,14 @@ class StellaratorObservation(Observation):


 class StellaratorState(State):
+    initial_params: LowDimBoundaryParams = Field(
+        default_factory=lambda: LowDimBoundaryParams(
+            aspect_ratio=3.6,
+            elongation=1.4,
+            rotational_transform=1.6,
+            triangularity_scale=0.55,
+        )
+    )
     current_params: LowDimBoundaryParams = Field(
         default_factory=lambda: LowDimBoundaryParams(
             aspect_ratio=3.6,
@@ -70,8 +78,11 @@ class StellaratorState(State):
         )
     )
     initial_score: float = 0.0
+    initial_high_fidelity_score: float | None = None
     best_score: float = 0.0
     best_feasibility: float = float("inf")
+    best_high_fidelity_score: float | None = None
+    best_high_fidelity_feasibility: float = float("inf")
     budget_total: int = 6
     budget_remaining: int = 6
     episode_done: bool = False
server/environment.py CHANGED
@@ -89,11 +89,13 @@ class StellaratorEnvironment(
         self._state = StellaratorState(
             episode_id=episode_id,
             step_count=0,
+            initial_params=params,
             current_params=params,
             best_params=params,
             initial_score=metrics.p1_score,
             best_score=metrics.p1_score,
             best_feasibility=metrics.p1_feasibility,
+            best_high_fidelity_feasibility=float("inf"),
             budget_total=BUDGET,
             budget_remaining=BUDGET,
             episode_done=False,
@@ -170,8 +172,15 @@ class StellaratorEnvironment(

     def _handle_submit(self) -> StellaratorObservation:
         metrics = self._evaluate_params(self._state.current_params, fidelity="high")
-        reward = self._compute_reward(metrics, "submit", done=True)
-        summary = self._summary_submit(metrics)
+        initial_submit_metrics = self._initial_high_fidelity_metrics()
+        best_submit_metrics = self._refresh_best_high_fidelity_metrics(metrics)
+        reward = self._compute_reward(
+            metrics,
+            "submit",
+            done=True,
+            initial_reference_score=initial_submit_metrics.p1_score,
+        )
+        summary = self._summary_submit(metrics, best_submit_metrics)
         self._state.history.append(summary)
         self._state.episode_done = True
         self._last_metrics = metrics
@@ -229,6 +238,7 @@ class StellaratorEnvironment(
         metrics: EvaluationMetrics,
         intent: str,
         done: bool,
+        initial_reference_score: float | None = None,
     ) -> float:
         previous_metrics = self._reference_metrics(metrics)
         if metrics.evaluation_failed:
@@ -257,13 +267,14 @@ class StellaratorEnvironment(
             reward -= 0.1

         if intent == "submit" or done:
-            improved = (
-                metrics.constraints_satisfied and metrics.p1_score > self._state.initial_score
+            base_score = (
+                initial_reference_score
+                if initial_reference_score is not None
+                else self._state.initial_score
             )
+            improved = metrics.constraints_satisfied and metrics.p1_score > base_score
             if improved:
-                ratio = (metrics.p1_score - self._state.initial_score) / max(
-                    1.0 - self._state.initial_score, 1e-6
-                )
+                ratio = (metrics.p1_score - base_score) / max(1.0 - base_score, 1e-6)
                 if intent == "submit":
                     reward += 5.0 * ratio + self._state.budget_remaining / self._state.budget_total
                 else:
@@ -290,11 +301,14 @@ class StellaratorEnvironment(
             text_lines.append(f"failure_reason={metrics.failure_reason}")
         text_lines.extend(
             [
-                f"max_elongation={metrics.max_elongation:.4f} | best_score={self._state.best_score:.6f}",
+                f"max_elongation={metrics.max_elongation:.4f} | best_score={self._display_best_score(metrics):.6f}",
                 f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
                 f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
                 f"edge_iota_over_nfp={metrics.edge_iota_over_nfp:.4f} (>= {EDGE_IOTA_OVER_NFP_MIN:.1f})",
-                f"feasibility={metrics.p1_feasibility:.6f} | best_feasibility={self._state.best_feasibility:.6f}",
+                (
+                    f"feasibility={metrics.p1_feasibility:.6f} | "
+                    f"best_feasibility={self._display_best_feasibility(metrics):.6f}"
+                ),
                 f"vacuum_well={metrics.vacuum_well:.4f}",
                 f"constraints={'SATISFIED' if metrics.constraints_satisfied else 'VIOLATED'}",
                 f"step={self._state.step_count} | budget={self._state.budget_remaining}/{self._state.budget_total}",
@@ -315,8 +329,8 @@ class StellaratorEnvironment(
             failure_reason=metrics.failure_reason,
             step_number=self._state.step_count,
             budget_remaining=self._state.budget_remaining,
-            best_score=self._state.best_score,
-            best_feasibility=self._state.best_feasibility,
+            best_score=self._display_best_score(metrics),
+            best_feasibility=self._display_best_feasibility(metrics),
             constraints_satisfied=metrics.constraints_satisfied,
             target_spec=TARGET_SPEC,
             reward=reward,
@@ -349,13 +363,17 @@ class StellaratorEnvironment(
             f"Low-fidelity evaluation. {objective_summary}"
         )

-    def _summary_submit(self, metrics: EvaluationMetrics) -> str:
+    def _summary_submit(
+        self,
+        metrics: EvaluationMetrics,
+        best_submit_metrics: EvaluationMetrics,
+    ) -> str:
         if metrics.evaluation_failed:
             return f"Submit failed during high-fidelity evaluation: {metrics.failure_reason}"
         return (
-            f"Submitted current_score={metrics.p1_score:.6f}, "
-            f"best_seen_score={self._state.best_score:.6f}, "
-            f"best_feasibility={self._state.best_feasibility:.6f}, "
+            f"Submitted current_high_fidelity_score={metrics.p1_score:.6f}, "
+            f"best_high_fidelity_score={best_submit_metrics.p1_score:.6f}, "
+            f"best_high_fidelity_feasibility={best_submit_metrics.p1_feasibility:.6f}, "
            f"constraints={'SATISFIED' if metrics.constraints_satisfied else 'VIOLATED'}."
         )

@@ -412,6 +430,41 @@ class StellaratorEnvironment(
             return self._last_successful_metrics
         return fallback

+    def _initial_high_fidelity_metrics(self) -> EvaluationMetrics:
+        if self._state.initial_high_fidelity_score is not None:
+            return self._evaluate_params(self._state.initial_params, fidelity="high")
+        metrics = self._evaluate_params(self._state.initial_params, fidelity="high")
+        self._state.initial_high_fidelity_score = metrics.p1_score
+        return metrics
+
+    def _refresh_best_high_fidelity_metrics(
+        self,
+        current_submit_metrics: EvaluationMetrics,
+    ) -> EvaluationMetrics:
+        best_metrics = current_submit_metrics
+        if self._state.best_params != self._state.current_params:
+            best_metrics = self._evaluate_params(self._state.best_params, fidelity="high")
+
+        self._state.best_high_fidelity_score = best_metrics.p1_score
+        self._state.best_high_fidelity_feasibility = best_metrics.p1_feasibility
+        return best_metrics
+
+    def _display_best_score(self, metrics: EvaluationMetrics) -> float:
+        if (
+            metrics.evaluation_fidelity == "high"
+            and self._state.best_high_fidelity_score is not None
+        ):
+            return self._state.best_high_fidelity_score
+        return self._state.best_score
+
+    def _display_best_feasibility(self, metrics: EvaluationMetrics) -> float:
+        if (
+            metrics.evaluation_fidelity == "high"
+            and self._state.best_high_fidelity_score is not None
+        ):
+            return self._state.best_high_fidelity_feasibility
+        return self._state.best_feasibility
+
     def _update_best(self, params: LowDimBoundaryParams, metrics: EvaluationMetrics) -> None:
         if metrics.evaluation_failed:
             return
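One property of `_refresh_best_high_fidelity_metrics` above is worth making explicit: when the submitted params are already the best params, it reuses the submit-time high-fidelity evaluation instead of running a second one. A standalone sketch with a counting fake evaluator (both `FakeVerifier` and `refresh_best` are illustrative stand-ins, not repo code):

```python
class FakeVerifier:
    """Counts high-fidelity calls; stands in for the real (expensive) evaluator."""

    def __init__(self):
        self.high_fi_calls = 0

    def evaluate(self, params, fidelity):
        if fidelity == "high":
            self.high_fi_calls += 1
        return sum(params)  # placeholder score


def refresh_best(verifier, best_params, current_params, current_score):
    # Only pay for a second high-fi evaluation when the rollout's best
    # design differs from the one being submitted right now.
    if best_params != current_params:
        return verifier.evaluate(best_params, "high")
    return current_score  # reuse the submit evaluation
```

This matters because high-fidelity evaluations are the budgeted, expensive path; a submit of the already-best design should cost exactly one of them.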