CreativeEngineer committed on
Commit
2fccde8
1 Parent(s): 2cb6617

feat: add verifier-native reward v1

README.md CHANGED
@@ -13,7 +13,7 @@ An RL environment where agents optimize stellarator fusion reactor designs by ad
13
  |---|---|
14
  | `aspect_ratio` | ≤ 4.0 |
15
  | `average_triangularity` | ≤ -0.5 |
16
- | `edge_iota_over_nfp` | ≥ 0.3 |
17
 
18
  The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. The live environment still exposes **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit), but the standard GRPO notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on the low-fidelity `run` surface and ignore `submit` by default.
19
 
@@ -66,7 +66,7 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
66
  - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
67
  - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
68
  - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
69
- - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
70
  - The standard LLM training and evaluation workflow is now low-fidelity-only: the repo notebook and `training/llm_rollout.py` `monitor` / `evaluate` ignore `submit` by default. Reserve `submit` for explicit replay/debug work, paired fixture checks, submit-side traces, and final evidence.
71
  - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
72
  - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 
13
  |---|---|
14
  | `aspect_ratio` | ≤ 4.0 |
15
  | `average_triangularity` | ≤ -0.5 |
16
+ | `abs(edge_iota_over_nfp)` | ≥ 0.3 |
17
 
18
  The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. The live environment still exposes **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit), but the standard GRPO notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on the low-fidelity `run` surface and ignore `submit` by default.
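The 26-action count stated above can be sanity-checked with a short sketch (the parameter names come from this repo's docs; the direction names are illustrative placeholders, not necessarily the environment's actual identifiers):

```python
# Rebuild the advertised discrete action space:
# 4 parameters x 2 directions x 3 magnitudes = 24 run actions,
# plus restore_best and submit. Direction names are illustrative only.
from itertools import product

parameters = ["aspect_ratio", "elongation", "rotational_transform", "triangularity_scale"]
directions = ["increase", "decrease"]
magnitudes = ["small", "medium", "large"]

run_actions = [("run", p, d, m) for p, d, m in product(parameters, directions, magnitudes)]
all_actions = run_actions + [("restore_best",), ("submit",)]
print(len(all_actions))  # 26
```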
19
 
 
66
  - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
67
  - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
68
  - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
69
+ - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `from_boundary_resolution`; do not present step-time metrics as final submission metrics.
70
  - The standard LLM training and evaluation workflow is now low-fidelity-only: the repo notebook and `training/llm_rollout.py` `monitor` / `evaluate` ignore `submit` by default. Reserve `submit` for explicit replay/debug work, paired fixture checks, submit-side traces, and final evidence.
71
  - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
72
  - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
TODO.md CHANGED
@@ -72,7 +72,7 @@ flowchart TD
72
 
73
  - [x] Lock the exact `P1` environment contract
74
  Goal:
75
- freeze observation schema, action schema, episode loop, terminal conditions, and `Reward V0`
76
  Related:
77
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
78
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
@@ -213,16 +213,16 @@ flowchart TD
213
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
214
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
215
 
216
- - [ ] Update reward from `V0` to `V1` if playtesting reveals a real pathology
217
  Goal:
218
  keep a short exploit -> fix -> behavior improvement story
219
  Related:
220
  [AGENTS.md](AGENTS.md),
221
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
222
 
223
- - [ ] Write down whether `Reward V0` survives unchanged
224
  Goal:
225
- if playtesting does not reveal a real pathology, record that outcome explicitly instead of forcing a `V1`
226
  Related:
227
  [README.md](README.md),
228
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
@@ -294,4 +294,6 @@ flowchart TD
294
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
295
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
296
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
297
- - [ ] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
 
 
 
72
 
73
  - [x] Lock the exact `P1` environment contract
74
  Goal:
75
+ freeze observation schema, action schema, episode loop, terminal conditions, and the live reward contract
76
  Related:
77
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
78
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
 
213
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
214
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
215
 
216
+ - [x] Update reward from `V0` to `V1` after playtesting exposed a real repair-path pathology
217
  Goal:
218
  keep a short exploit -> fix -> behavior improvement story
219
  Related:
220
  [AGENTS.md](AGENTS.md),
221
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
222
 
223
+ - [x] Write down why `Reward V0` did not survive unchanged
224
  Goal:
225
+ document the concrete pathology: pure `official_feasibility` shaping hid useful non-dominant repairs because official feasibility is a max over normalized constraint violations
226
  Related:
227
  [README.md](README.md),
228
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
 
294
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
295
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
296
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
297
+ - [x] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
298
+ Note:
299
+ completed by recording the concrete `Reward V0` pathology and only then moving to `Reward V1`
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -138,7 +138,7 @@ The environment contract must stay narrow and legible:
138
  - low-fidelity verifier for normal steps
139
  - high-fidelity verifier for `submit`
140
  - readable observation surface with explicit fidelity labeling
141
- - `Reward V0` kept simple and feasibility-first until playtesting proves a real pathology
142
 
143
  The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.
144
 
 
138
  - low-fidelity verifier for normal steps
139
  - high-fidelity verifier for `submit`
140
  - readable observation surface with explicit fidelity labeling
141
+ - `Reward V1` kept verifier-native and repair-first, with official normalized violation telemetry
142
 
143
  The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.
144
 
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -92,6 +92,10 @@ Required fields:
92
  - `aspect_ratio`
93
  - `average_triangularity`
94
  - `edge_iota_over_nfp`
 
 
 
 
95
  - `p1_feasibility`
96
  - `p1_score`
97
  - `constraints_satisfied`
@@ -118,6 +122,8 @@ Interpretation rules:
118
  - high-fidelity `submit` metrics must be labeled as high-fidelity
119
  - low-fidelity and high-fidelity best-state reporting must stay separate
120
  - the observation must be understandable without hidden state
 
 
121
  - reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
122
  - action telemetry must expose parameter values before and after the action, including clamped and no-op moves
123
 
@@ -183,24 +189,31 @@ Training and evaluation rule:
183
  - keep higher-fidelity `submit` for terminal truth, explicit replay/debug work, paired fixture checks, and submit-side manual traces
184
  - do not move VMEC-backed submit evaluation into every training step unless the contract is deliberately redefined
185
 
186
- ## 9. Reward V0
187
 
188
- `Reward V0` is the live reward contract until playtesting proves a concrete pathology.
 
 
 
189
 
190
  Target behavior:
191
 
192
  - infeasible to feasible crossing gets a clear positive bonus
193
  - feasible to infeasible regression gets a clear penalty
194
- - when both states are infeasible, reduced official feasibility violation should help
 
 
195
  - when both states are feasible, lower `max_elongation` should help
196
- - non-submit actions pay a small step cost
 
197
  - `submit` should be better than passive exhaustion when the design is genuinely improved
198
  - recovery after a failed evaluation may receive a modest bounded bonus
199
 
200
  Rules:
201
 
202
  - keep reward scalar and verifier-driven
203
- - do not add mode-specific or parameter-specific reward shaping
 
204
  - do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
205
 
206
  ## 10. Reset and Fixture Policy
 
92
  - `aspect_ratio`
93
  - `average_triangularity`
94
  - `edge_iota_over_nfp`
95
+ - `aspect_ratio_violation`
96
+ - `triangularity_violation`
97
+ - `iota_violation`
98
+ - `dominant_constraint`
99
  - `p1_feasibility`
100
  - `p1_score`
101
  - `constraints_satisfied`
 
122
  - high-fidelity `submit` metrics must be labeled as high-fidelity
123
  - low-fidelity and high-fidelity best-state reporting must stay separate
124
  - the observation must be understandable without hidden state
125
+ - normalized constraint-violation telemetry must follow the official `P1` constraint scales
126
+ - the dominant active constraint must be visible so a human can explain repair-phase rewards
127
  - reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
128
  - action telemetry must expose parameter values before and after the action, including clamped and no-op moves
129
 
 
189
  - keep higher-fidelity `submit` for terminal truth, explicit replay/debug work, paired fixture checks, and submit-side manual traces
190
  - do not move VMEC-backed submit evaluation into every training step unless the contract is deliberately redefined
191
 
192
+ ## 9. Reward V1
193
 
194
+ `Reward V1` replaces `Reward V0` because the old infeasible shaping only used `Δ official_feasibility`.
195
+ That was too coarse once the transferred P1 findings made the main pathology clear: official
196
+ feasibility is a max over normalized constraint violations, so useful repair steps on
197
+ non-dominant constraints could be nearly invisible to the reward.
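A minimal sketch of the pathology described above (constraint names are from the contract; the violation values are invented for illustration):

```python
# Official P1 feasibility is the max over normalized constraint violations.
# When triangularity dominates, repairing the aspect-ratio violation does not
# move the feasibility scalar, so a pure delta-feasibility reward is blind to it.
def official_feasibility(violations: dict[str, float]) -> float:
    return max(violations.values())

before = {"aspect_ratio": 0.20, "average_triangularity": 1.00, "edge_iota_over_nfp": 0.40}
after = dict(before, aspect_ratio=0.05)  # a genuinely useful non-dominant repair

delta = official_feasibility(before) - official_feasibility(after)
print(delta)  # 0.0 -- Reward V0's feasibility-delta term pays nothing for this step
```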
198
 
199
  Target behavior:
200
 
201
  - infeasible to feasible crossing gets a clear positive bonus
202
  - feasible to infeasible regression gets a clear penalty
203
+ - when both states are infeasible, reduced official feasibility violation should still help
204
+ - when both states are infeasible, reduced normalized triangularity violation should help the most
205
+ - when both states are infeasible, reduced normalized aspect-ratio and edge-iota violations should also help
206
  - when both states are feasible, lower `max_elongation` should help
207
+ - larger `run` actions should pay a larger step cost than smaller `run` actions
208
+ - `restore_best` should keep a flat non-submit step cost
209
  - `submit` should be better than passive exhaustion when the design is genuinely improved
210
  - recovery after a failed evaluation may receive a modest bounded bonus
211
 
212
  Rules:
213
 
214
  - keep reward scalar and verifier-driven
215
+ - keep the infeasible shaping tied to official normalized constraint violations, not family-name priors
216
+ - do not add family-specific reward shaping from `scadena`, `CreativeEngineer`, `Samet`, or `egodos`
217
  - do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
218
 
219
  ## 10. Reset and Fixture Policy
docs/P1_MANUAL_PLAYTEST_LOG.md CHANGED
@@ -49,7 +49,7 @@ Step 1:
49
 
50
  Current conclusion:
51
 
52
- - Reward V0 is legible on the low-fidelity repair path around the default reset seed
53
  - a real `submit` trace is now recorded; next manual validation is to extend beyond the initial 5-10 episode path and look for one clear exploit or ambiguity
54
 
55
  Episode C: submit-side manual trace
@@ -80,3 +80,5 @@ Step sequence:
80
  Artifact:
81
 
82
  - [manual submit trace JSON](../baselines/submit_side_trace.json)
 
 
 
49
 
50
  Current conclusion:
51
 
52
+ - At the time of this initial playtest, Reward V0 was legible on the low-fidelity repair path around the default reset seed
53
  - a real `submit` trace is now recorded; next manual validation is to extend beyond the initial 5-10 episode path and look for one clear exploit or ambiguity
54
 
55
  Episode C: submit-side manual trace
 
80
  Artifact:
81
 
82
  - [manual submit trace JSON](../baselines/submit_side_trace.json)
83
+ Note:
84
+ this is a historical submit-side artifact from the earlier Reward V0 / pre-telemetry contract surface. Keep it as supporting evidence for the old submit path, not as the current Reward V1 observation-format example.
docs/P1_PARAMETERIZATION_DEEPDIVE.md CHANGED
@@ -37,7 +37,7 @@ Observed behavior:
37
 
38
  `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` does not meaningfully expose the Fourier mode that controls triangularity.
39
 
40
- The historical `rotational_transform` range was also too low to reach the `edge_iota_over_nfp >= 0.3` requirement reliably.
41
 
42
  ## 2. Original Winning Session
43
 
 
37
 
38
  `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` does not meaningfully expose the Fourier mode that controls triangularity.
39
 
40
+ The historical `rotational_transform` range was also too low to reach the `abs(edge_iota_over_nfp) >= 0.3` requirement reliably.
41
 
42
  ## 2. Original Winning Session
43
 
fusion_lab/llm_agent.py CHANGED
@@ -45,7 +45,7 @@ Action rules:
45
  Constraint directions:
46
  - aspect_ratio <= 4.0
47
  - average_triangularity <= -0.5
48
- - edge_iota_over_nfp >= 0.3"""
49
 
50
 
51
  def _extract_json_array(text: str) -> str | None:
@@ -143,7 +143,11 @@ def format_observation(observation: StellaratorObservation) -> str:
143
  f"- average_triangularity: {observation.average_triangularity:.6f} "
144
  "(must stay <= -0.5)\n"
145
  f"- edge_iota_over_nfp: {observation.edge_iota_over_nfp:.4f} "
146
- "(must stay >= 0.3)\n"
 
 
 
 
147
  f"- p1_score: {observation.p1_score:.4f}\n"
148
  f"- p1_feasibility: {observation.p1_feasibility:.6f}\n"
149
  f"- constraints_satisfied: {observation.constraints_satisfied}\n"
 
45
  Constraint directions:
46
  - aspect_ratio <= 4.0
47
  - average_triangularity <= -0.5
48
+ - abs(edge_iota_over_nfp) >= 0.3"""
49
 
50
 
51
  def _extract_json_array(text: str) -> str | None:
 
143
  f"- average_triangularity: {observation.average_triangularity:.6f} "
144
  "(must stay <= -0.5)\n"
145
  f"- edge_iota_over_nfp: {observation.edge_iota_over_nfp:.4f} "
146
+ "(must satisfy abs(.) >= 0.3)\n"
147
+ f"- aspect_ratio_violation: {observation.aspect_ratio_violation:.6f}\n"
148
+ f"- triangularity_violation: {observation.triangularity_violation:.6f}\n"
149
+ f"- iota_violation: {observation.iota_violation:.6f}\n"
150
+ f"- dominant_constraint: {observation.dominant_constraint}\n"
151
  f"- p1_score: {observation.p1_score:.4f}\n"
152
  f"- p1_feasibility: {observation.p1_feasibility:.6f}\n"
153
  f"- constraints_satisfied: {observation.constraints_satisfied}\n"
fusion_lab/models.py CHANGED
@@ -6,6 +6,12 @@ from openenv.core import Action, Observation, State
6
  from pydantic import BaseModel, Field
7
 
8
  ActionIntent = Literal["run", "submit", "restore_best"]
 
 
 
 
 
 
9
  ParameterName = Literal[
10
  "aspect_ratio",
11
  "elongation",
@@ -51,6 +57,9 @@ class RewardBreakdown(BaseModel):
51
  feasibility_crossing_bonus: float = 0.0
52
  feasibility_regression_penalty: float = 0.0
53
  feasibility_delta_reward: float = 0.0
 
 
 
54
  objective_delta_reward: float = 0.0
55
  step_cost: float = 0.0
56
  recovery_bonus: float = 0.0
@@ -94,6 +103,10 @@ class StellaratorObservation(Observation):
94
  aspect_ratio: float = 0.0
95
  average_triangularity: float = 0.0
96
  edge_iota_over_nfp: float = 0.0
 
 
 
 
97
  p1_score: float = 0.0
98
  p1_feasibility: float = 0.0
99
  vacuum_well: float = 0.0
 
6
  from pydantic import BaseModel, Field
7
 
8
  ActionIntent = Literal["run", "submit", "restore_best"]
9
+ ConstraintName = Literal[
10
+ "none",
11
+ "aspect_ratio",
12
+ "average_triangularity",
13
+ "edge_iota_over_nfp",
14
+ ]
15
  ParameterName = Literal[
16
  "aspect_ratio",
17
  "elongation",
 
57
  feasibility_crossing_bonus: float = 0.0
58
  feasibility_regression_penalty: float = 0.0
59
  feasibility_delta_reward: float = 0.0
60
+ aspect_ratio_repair_reward: float = 0.0
61
+ triangularity_repair_reward: float = 0.0
62
+ iota_repair_reward: float = 0.0
63
  objective_delta_reward: float = 0.0
64
  step_cost: float = 0.0
65
  recovery_bonus: float = 0.0
 
103
  aspect_ratio: float = 0.0
104
  average_triangularity: float = 0.0
105
  edge_iota_over_nfp: float = 0.0
106
+ aspect_ratio_violation: float = 0.0
107
+ triangularity_violation: float = 0.0
108
+ iota_violation: float = 0.0
109
+ dominant_constraint: ConstraintName = "none"
110
  p1_score: float = 0.0
111
  p1_feasibility: float = 0.0
112
  vacuum_well: float = 0.0
models.py CHANGED
@@ -3,6 +3,7 @@
3
  from fusion_lab.models import (
4
  ActionMonitor,
5
  ActionIntent,
 
6
  DirectionName,
7
  EvaluationFidelityName,
8
  LowDimBoundaryParams,
@@ -20,6 +21,7 @@ from fusion_lab.models import (
20
  __all__ = [
21
  "ActionIntent",
22
  "ActionMonitor",
 
23
  "DirectionName",
24
  "EvaluationFidelityName",
25
  "LowDimBoundaryParams",
 
3
  from fusion_lab.models import (
4
  ActionMonitor,
5
  ActionIntent,
6
+ ConstraintName,
7
  DirectionName,
8
  EvaluationFidelityName,
9
  LowDimBoundaryParams,
 
21
  __all__ = [
22
  "ActionIntent",
23
  "ActionMonitor",
24
+ "ConstraintName",
25
  "DirectionName",
26
  "EvaluationFidelityName",
27
  "LowDimBoundaryParams",
server/data/p1/lowfi_feasible_local.json CHANGED
@@ -3,7 +3,7 @@
3
  "status": "low_fidelity_calibrated",
4
  "notes": [
5
  "Local repair target reached from the default reset band by increasing rotational_transform and triangularity_scale.",
6
- "Useful as a low-fidelity feasibility reference for Reward V0 sanity checks.",
7
  "High-fidelity submit spot check is complete."
8
  ],
9
  "params": {
 
3
  "status": "low_fidelity_calibrated",
4
  "notes": [
5
  "Local repair target reached from the default reset band by increasing rotational_transform and triangularity_scale.",
6
+ "Useful as a low-fidelity feasibility reference for reward sanity checks.",
7
  "High-fidelity submit spot check is complete."
8
  ],
9
  "params": {
server/environment.py CHANGED
@@ -8,6 +8,7 @@ from fusion_lab.models import (
8
  ActionMonitor,
9
  ActionIntent,
10
  LowDimBoundaryParams,
 
11
  RewardBreakdown,
12
  StellaratorAction,
13
  StellaratorObservation,
@@ -43,12 +44,22 @@ PARAMETER_DELTAS: Final[dict[str, dict[str, float]]] = {
43
  TARGET_SPEC: Final[str] = (
44
  "Optimize the P1 benchmark using a custom low-dimensional boundary family derived "
45
  "from a rotating-ellipse seed. Constraints: aspect ratio <= 4.0, average "
46
- "triangularity <= -0.5, edge rotational transform / n_field_periods >= 0.3. "
47
  "Run actions use low-fidelity verification. Submit uses high-fidelity verification. "
48
  "Budget: 6 evaluations."
49
  )
50
 
51
  FAILURE_PENALTY: Final[float] = -2.0
 
52
 
53
 
54
  class StellaratorEnvironment(
@@ -150,7 +161,12 @@ class StellaratorEnvironment(
150
  self._update_best(params, metrics)
151
 
152
  done = self._state.budget_remaining <= 0
153
- reward_breakdown = self._compute_reward_breakdown(metrics, action.intent, done)
 
 
 
 
 
154
  reward = reward_breakdown.total
155
  summary = self._summary_run(action, metrics, action_monitor)
156
  self._state.history.append(summary)
@@ -265,6 +281,7 @@ class StellaratorEnvironment(
265
  metrics: EvaluationMetrics,
266
  intent: ActionIntent,
267
  done: bool,
 
268
  initial_reference_score: float | None = None,
269
  ) -> RewardBreakdown:
270
  recovered_from_failure = self._recovered_from_failed_evaluation(metrics)
@@ -282,7 +299,7 @@ class StellaratorEnvironment(
282
  if metrics.evaluation_failed:
283
  breakdown.failure_penalty = FAILURE_PENALTY
284
  if intent != "submit":
285
- breakdown.step_cost = -0.1
286
  if intent == "submit":
287
  breakdown.failure_submit_penalty = -1.0
288
  elif done:
@@ -302,10 +319,19 @@ class StellaratorEnvironment(
302
  else:
303
  breakdown.feasibility_delta_reward = (
304
  previous_metrics.p1_feasibility - metrics.p1_feasibility
305
- ) * 5.0
 
306
 
307
  if intent != "submit":
308
- breakdown.step_cost = -0.1
309
 
310
  if recovered_from_failure:
311
  breakdown.recovery_bonus = 1.0
@@ -365,8 +391,15 @@ class StellaratorEnvironment(
365
  f"max_elongation={metrics.max_elongation:.4f}",
366
  f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
367
  f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
368
- f"edge_iota_over_nfp={metrics.edge_iota_over_nfp:.4f} (>= {EDGE_IOTA_OVER_NFP_MIN:.1f})",
 
 
 
369
  f"feasibility={metrics.p1_feasibility:.6f}",
 
 
 
 
370
  f"best_low_fidelity_score={best_low_fidelity_score:.6f}",
371
  f"best_low_fidelity_feasibility={best_low_fidelity_feasibility:.6f}",
372
  (
@@ -394,6 +427,10 @@ class StellaratorEnvironment(
394
  aspect_ratio=metrics.aspect_ratio,
395
  average_triangularity=metrics.average_triangularity,
396
  edge_iota_over_nfp=metrics.edge_iota_over_nfp,
 
 
 
 
397
  p1_score=metrics.p1_score,
398
  p1_feasibility=metrics.p1_feasibility,
399
  vacuum_well=metrics.vacuum_well,
@@ -450,7 +487,8 @@ class StellaratorEnvironment(
450
  else:
451
  delta = previous_metrics.p1_feasibility - metrics.p1_feasibility
452
  objective_summary = (
453
- f"feasibility changed by {delta:+.6f} to {metrics.p1_feasibility:.6f}."
 
454
  )
455
  return (
456
  f"Applied {action.parameter} {action.direction} {action.magnitude}. "
@@ -606,6 +644,13 @@ class StellaratorEnvironment(
606
  return "The requested move was clipped to stay inside the allowed parameter range. "
607
  return ""
608
 
 
 
 
 
 
 
 
609
  def _reward_total(self, breakdown: RewardBreakdown) -> float:
610
  total = (
611
  breakdown.invalid_action_penalty
@@ -615,6 +660,9 @@ class StellaratorEnvironment(
615
  + breakdown.feasibility_crossing_bonus
616
  + breakdown.feasibility_regression_penalty
617
  + breakdown.feasibility_delta_reward
 
 
 
618
  + breakdown.objective_delta_reward
619
  + breakdown.step_cost
620
  + breakdown.recovery_bonus
@@ -633,6 +681,9 @@ class StellaratorEnvironment(
633
  ("feasibility_crossing_bonus", breakdown.feasibility_crossing_bonus),
634
  ("feasibility_regression_penalty", breakdown.feasibility_regression_penalty),
635
  ("feasibility_delta_reward", breakdown.feasibility_delta_reward),
 
 
 
636
  ("objective_delta_reward", breakdown.objective_delta_reward),
637
  ("step_cost", breakdown.step_cost),
638
  ("recovery_bonus", breakdown.recovery_bonus),
 
8
  ActionMonitor,
9
  ActionIntent,
10
  LowDimBoundaryParams,
11
+ MagnitudeName,
12
  RewardBreakdown,
13
  StellaratorAction,
14
  StellaratorObservation,
 
44
  TARGET_SPEC: Final[str] = (
45
  "Optimize the P1 benchmark using a custom low-dimensional boundary family derived "
46
  "from a rotating-ellipse seed. Constraints: aspect ratio <= 4.0, average "
47
+ "triangularity <= -0.5, abs(edge rotational transform / n_field_periods) >= 0.3. "
48
  "Run actions use low-fidelity verification. Submit uses high-fidelity verification. "
49
  "Budget: 6 evaluations."
50
  )
51
 
52
  FAILURE_PENALTY: Final[float] = -2.0
53
+ FEASIBILITY_DELTA_WEIGHT: Final[float] = 2.0
54
+ TRIANGULARITY_REPAIR_WEIGHT: Final[float] = 2.0
55
+ ASPECT_RATIO_REPAIR_WEIGHT: Final[float] = 1.0
56
+ IOTA_REPAIR_WEIGHT: Final[float] = 1.0
57
+ STEP_COST_BY_MAGNITUDE: Final[dict[MagnitudeName, float]] = {
58
+ "small": -0.05,
59
+ "medium": -0.1,
60
+ "large": -0.2,
61
+ }
62
+ RESTORE_STEP_COST: Final[float] = -0.1
63
 
64
 
65
  class StellaratorEnvironment(
 
161
  self._update_best(params, metrics)
162
 
163
  done = self._state.budget_remaining <= 0
164
+ reward_breakdown = self._compute_reward_breakdown(
165
+ metrics,
166
+ action.intent,
167
+ done,
168
+ magnitude=action.magnitude,
169
+ )
170
  reward = reward_breakdown.total
171
  summary = self._summary_run(action, metrics, action_monitor)
172
  self._state.history.append(summary)
 
281
  metrics: EvaluationMetrics,
282
  intent: ActionIntent,
283
  done: bool,
284
+ magnitude: MagnitudeName | None = None,
285
  initial_reference_score: float | None = None,
286
  ) -> RewardBreakdown:
287
  recovered_from_failure = self._recovered_from_failed_evaluation(metrics)
 
299
  if metrics.evaluation_failed:
300
  breakdown.failure_penalty = FAILURE_PENALTY
301
  if intent != "submit":
302
+ breakdown.step_cost = self._step_cost(intent=intent, magnitude=magnitude)
303
  if intent == "submit":
304
  breakdown.failure_submit_penalty = -1.0
305
  elif done:
 
319
  else:
320
  breakdown.feasibility_delta_reward = (
321
  previous_metrics.p1_feasibility - metrics.p1_feasibility
322
+ ) * FEASIBILITY_DELTA_WEIGHT
323
+ breakdown.triangularity_repair_reward = (
324
+ previous_metrics.triangularity_violation - metrics.triangularity_violation
325
+ ) * TRIANGULARITY_REPAIR_WEIGHT
326
+ breakdown.aspect_ratio_repair_reward = (
327
+ previous_metrics.aspect_ratio_violation - metrics.aspect_ratio_violation
328
+ ) * ASPECT_RATIO_REPAIR_WEIGHT
329
+ breakdown.iota_repair_reward = (
330
+ previous_metrics.iota_violation - metrics.iota_violation
331
+ ) * IOTA_REPAIR_WEIGHT
332
 
333
  if intent != "submit":
334
+ breakdown.step_cost = self._step_cost(intent=intent, magnitude=magnitude)
335
 
336
  if recovered_from_failure:
337
  breakdown.recovery_bonus = 1.0
 
391
  f"max_elongation={metrics.max_elongation:.4f}",
392
  f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
393
  f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
394
+ (
395
+ "edge_iota_over_nfp="
396
+ f"{metrics.edge_iota_over_nfp:.4f} (abs(.) >= {EDGE_IOTA_OVER_NFP_MIN:.1f})"
397
+ ),
398
  f"feasibility={metrics.p1_feasibility:.6f}",
399
+ f"aspect_ratio_violation={metrics.aspect_ratio_violation:.6f}",
400
+ f"triangularity_violation={metrics.triangularity_violation:.6f}",
401
+ f"iota_violation={metrics.iota_violation:.6f}",
402
+ f"dominant_constraint={metrics.dominant_constraint}",
403
  f"best_low_fidelity_score={best_low_fidelity_score:.6f}",
404
  f"best_low_fidelity_feasibility={best_low_fidelity_feasibility:.6f}",
405
  (
 
427
  aspect_ratio=metrics.aspect_ratio,
428
  average_triangularity=metrics.average_triangularity,
429
  edge_iota_over_nfp=metrics.edge_iota_over_nfp,
430
+ aspect_ratio_violation=metrics.aspect_ratio_violation,
431
+ triangularity_violation=metrics.triangularity_violation,
432
+ iota_violation=metrics.iota_violation,
433
+ dominant_constraint=metrics.dominant_constraint,
434
  p1_score=metrics.p1_score,
435
  p1_feasibility=metrics.p1_feasibility,
436
  vacuum_well=metrics.vacuum_well,
 
487
  else:
488
  delta = previous_metrics.p1_feasibility - metrics.p1_feasibility
489
  objective_summary = (
490
+ f"feasibility changed by {delta:+.6f} to {metrics.p1_feasibility:.6f}; "
491
+ f"dominant_constraint={metrics.dominant_constraint}."
492
  )
493
  return (
494
  f"Applied {action.parameter} {action.direction} {action.magnitude}. "
 
644
  return "The requested move was clipped to stay inside the allowed parameter range. "
645
  return ""
646
 
647
+ def _step_cost(self, *, intent: ActionIntent, magnitude: MagnitudeName | None) -> float:
648
+ if intent == "restore_best":
649
+ return RESTORE_STEP_COST
650
+ if magnitude is None:
651
+ return STEP_COST_BY_MAGNITUDE["medium"]
652
+ return STEP_COST_BY_MAGNITUDE[magnitude]
653
+
654
  def _reward_total(self, breakdown: RewardBreakdown) -> float:
655
  total = (
656
  breakdown.invalid_action_penalty
 
660
  + breakdown.feasibility_crossing_bonus
661
  + breakdown.feasibility_regression_penalty
662
  + breakdown.feasibility_delta_reward
663
+ + breakdown.aspect_ratio_repair_reward
664
+ + breakdown.triangularity_repair_reward
665
+ + breakdown.iota_repair_reward
666
  + breakdown.objective_delta_reward
667
  + breakdown.step_cost
668
  + breakdown.recovery_bonus
 
681
  ("feasibility_crossing_bonus", breakdown.feasibility_crossing_bonus),
682
  ("feasibility_regression_penalty", breakdown.feasibility_regression_penalty),
683
  ("feasibility_delta_reward", breakdown.feasibility_delta_reward),
684
+ ("aspect_ratio_repair_reward", breakdown.aspect_ratio_repair_reward),
685
+ ("triangularity_repair_reward", breakdown.triangularity_repair_reward),
686
+ ("iota_repair_reward", breakdown.iota_repair_reward),
687
  ("objective_delta_reward", breakdown.objective_delta_reward),
688
  ("step_cost", breakdown.step_cost),
689
  ("recovery_bonus", breakdown.recovery_bonus),
server/physics.py CHANGED
@@ -15,7 +15,7 @@ from constellaration.geometry.surface_rz_fourier import SurfaceRZFourier
 from constellaration.initial_guess import generate_rotating_ellipse
 from constellaration.problems import GeometricalProblem
 
-from fusion_lab.models import LowDimBoundaryParams
+from fusion_lab.models import ConstraintName, LowDimBoundaryParams
 
 ASPECT_RATIO_MAX: Final[float] = 4.0
 AVERAGE_TRIANGULARITY_MAX: Final[float] = -0.5
@@ -35,6 +35,10 @@ class EvaluationMetrics:
     aspect_ratio: float
     average_triangularity: float
     edge_iota_over_nfp: float
+    aspect_ratio_violation: float
+    triangularity_violation: float
+    iota_violation: float
+    dominant_constraint: ConstraintName
     p1_score: float
     p1_feasibility: float
     constraints_satisfied: bool
@@ -119,12 +123,19 @@ def _to_evaluation_metrics(
     if not minimize_objective:
         raise ValueError("P1 objective is expected to be minimize-only.")
     p1_score = _score_from_objective(float(objective)) if constraints_satisfied else 0.0
+    aspect_ratio_violation, triangularity_violation, iota_violation, dominant_constraint = (
+        _constraint_violation_metrics(metrics)
+    )
 
     return EvaluationMetrics(
         max_elongation=float(objective),
         aspect_ratio=float(metrics.aspect_ratio),
         average_triangularity=float(metrics.average_triangularity),
         edge_iota_over_nfp=float(metrics.edge_rotational_transform_over_n_field_periods),
+        aspect_ratio_violation=aspect_ratio_violation,
+        triangularity_violation=triangularity_violation,
+        iota_violation=iota_violation,
+        dominant_constraint=dominant_constraint,
         p1_score=p1_score,
         p1_feasibility=p1_feasibility,
         constraints_satisfied=constraints_satisfied,
@@ -145,6 +156,10 @@ def _failure_metrics(
         aspect_ratio=0.0,
         average_triangularity=0.0,
         edge_iota_over_nfp=0.0,
+        aspect_ratio_violation=0.0,
+        triangularity_violation=0.0,
+        iota_violation=0.0,
+        dominant_constraint="none",
         p1_score=0.0,
         p1_feasibility=FAILED_FEASIBILITY,
         constraints_satisfied=False,
@@ -158,3 +173,42 @@ def _failure_metrics(
 def _score_from_objective(objective: float) -> float:
     normalized = min(max((objective - 1.0) / 9.0, 0.0), 1.0)
     return 1.0 - normalized
+
+
+def _constraint_violation_metrics(
+    metrics: ConstellarationMetrics,
+) -> tuple[float, float, float, ConstraintName]:
+    aspect_ratio_violation = max(float(metrics.aspect_ratio) - ASPECT_RATIO_MAX, 0.0) / (
+        ASPECT_RATIO_MAX
+    )
+    triangularity_violation = max(
+        float(metrics.average_triangularity) - AVERAGE_TRIANGULARITY_MAX,
+        0.0,
+    ) / abs(AVERAGE_TRIANGULARITY_MAX)
+    iota_violation = (
+        max(
+            EDGE_IOTA_OVER_NFP_MIN
+            - abs(float(metrics.edge_rotational_transform_over_n_field_periods)),
+            0.0,
+        )
+        / EDGE_IOTA_OVER_NFP_MIN
+    )
+
+    dominant_constraint: ConstraintName = "none"
+    dominant_violation = 0.0
+    constraint_violations: tuple[tuple[ConstraintName, float], ...] = (
+        ("aspect_ratio", aspect_ratio_violation),
+        ("average_triangularity", triangularity_violation),
+        ("edge_iota_over_nfp", iota_violation),
+    )
+    for constraint_name, violation in constraint_violations:
+        if violation > dominant_violation:
+            dominant_constraint = constraint_name
+            dominant_violation = violation
+
+    return (
+        aspect_ratio_violation,
+        triangularity_violation,
+        iota_violation,
+        dominant_constraint,
+    )
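Each violation in `_constraint_violation_metrics` is normalized by its own constraint bound, so the three values are comparable fractions rather than raw distances. The snippet below re-derives that normalization with the repo's constants on sample numbers; the free-function signature and dict return are illustrative stand-ins for the repo's metrics object, and its `max`-key tie-breaking is a simplification of the diff's strict `>` scan (which returns `"none"` when nothing is violated).

```python
# Constants match server/physics.py; the function shape is illustrative.
ASPECT_RATIO_MAX = 4.0
AVERAGE_TRIANGULARITY_MAX = -0.5
EDGE_IOTA_OVER_NFP_MIN = 0.3

def violations(aspect_ratio: float, average_triangularity: float,
               edge_iota_over_nfp: float) -> dict[str, float]:
    return {
        # Fractional overshoot above the aspect-ratio cap.
        "aspect_ratio": max(aspect_ratio - ASPECT_RATIO_MAX, 0.0) / ASPECT_RATIO_MAX,
        # Fractional overshoot above the (negative) triangularity cap.
        "average_triangularity": max(
            average_triangularity - AVERAGE_TRIANGULARITY_MAX, 0.0
        ) / abs(AVERAGE_TRIANGULARITY_MAX),
        # Fractional shortfall below the iota floor (sign-insensitive).
        "edge_iota_over_nfp": max(
            EDGE_IOTA_OVER_NFP_MIN - abs(edge_iota_over_nfp), 0.0
        ) / EDGE_IOTA_OVER_NFP_MIN,
    }

v = violations(aspect_ratio=4.4, average_triangularity=-0.2, edge_iota_over_nfp=0.24)
dominant = max(v, key=v.get)  # triangularity dominates for these sample inputs
```

With these samples the normalized violations are roughly 0.1, 0.6, and 0.2, so `dominant_constraint` would flag triangularity, matching the historical P1 blocker described in the README.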
training/notebooks/fusion_design_lab_training.ipynb CHANGED
@@ -12,7 +12,7 @@
     "The agent interacts with a constrained optimization environment where it adjusts 4 geometric knobs of a stellarator boundary, aiming to **minimize max elongation** while satisfying 3 hard physics constraints:\n",
     "- `aspect_ratio ≤ 4.0`\n",
     "- `average_triangularity ≤ -0.5`\n",
-    "- `edge_iota_over_nfp ≥ 0.3`\n",
+    "- `abs(edge_iota_over_nfp) ≥ 0.3`\n",
     "\n",
     "Each episode has **6 evaluations** budgeted. The agent produces a plan of actions and the environment scores it via the `constellaration` physics verifier.\n",
     "\n",
@@ -198,7 +198,7 @@
    "source": [
     "## 6. Reward Function\n",
     "\n",
-    "The environment reward executes each generated action plan in the stellarator environment and returns the cumulative low-fidelity Reward V0 from the live environment. The environment's built-in reward decomposes feasibility (+3/-3 crossing bonuses, feasibility progress), objective (max elongation improvement), step costs, and failure penalties — see `server/environment.py:_compute_reward_breakdown(...)`.\n",
+    "The environment reward executes each generated action plan in the stellarator environment and returns the cumulative low-fidelity Reward V1 from the live environment. The environment's built-in reward decomposes feasibility (+3/-3 crossing bonuses, official feasibility progress, weighted triangularity/aspect/iota repair terms), objective (max elongation improvement), step costs, and failure penalties — see `server/environment.py:_compute_reward_breakdown(...)`.\n",
     "\n",
     "For the current training workflow, the notebook ignores `submit` and does not auto-submit. GRPO therefore optimizes the low-fidelity `run` path only. The live observation telemetry still exposes `reward_breakdown` and `action_monitor` for debugging reward behavior.\n"
    ]
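The "cumulative low-fidelity reward, ignoring `submit`" workflow the notebook cell describes can be sketched as a plain accumulation loop. Everything here is hypothetical scaffolding: `env.step`, its `(obs, reward, done)` return shape, and the `FakeEnv` stub are illustrative stand-ins, not the repo's actual rollout API.

```python
# Hypothetical sketch of the low-fidelity-only plan scoring the notebook
# describes: sum per-step rewards over a plan, skipping any `submit` action.
def score_plan(env, actions: list[str]) -> float:
    total = 0.0
    for action in actions:
        if action == "submit":  # training workflow ignores submit by default
            continue
        _obs, reward, done = env.step(action)
        total += reward
        if done:  # stop when the evaluation budget is exhausted
            break
    return total

class FakeEnv:
    """Stub environment for demonstration; every step returns reward 1.0."""
    def step(self, action):
        return None, 1.0, False

total = score_plan(FakeEnv(), ["increase_knob", "submit", "decrease_knob"])
```

The `submit` filter is what keeps GRPO on the low-fidelity `run` surface; replay and debug tooling can drop the filter to exercise the high-fidelity path.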