CreativeEngineer committed on
Commit
2fccde8
1 Parent(s): 2cb6617

feat: add verifier-native reward v1

README.md CHANGED
@@ -13,7 +13,7 @@ An RL environment where agents optimize stellarator fusion reactor designs by ad
13
  |---|---|
14
  | `aspect_ratio` | ≤ 4.0 |
15
  | `average_triangularity` | ≤ -0.5 |
16
- | `edge_iota_over_nfp` | ≥ 0.3 |
17
 
18
  The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. The live environment still exposes **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit), but the standard GRPO notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on the low-fidelity `run` surface and ignore `submit` by default.
19
 
@@ -66,7 +66,7 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
66
  - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
67
  - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
68
  - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
69
- - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
70
  - The standard LLM training and evaluation workflow is now low-fidelity-only: the repo notebook and `training/llm_rollout.py` `monitor` / `evaluate` ignore `submit` by default. Reserve `submit` for explicit replay/debug work, paired fixture checks, submit-side traces, and final evidence.
71
  - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
72
  - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 
13
  |---|---|
14
  | `aspect_ratio` | ≤ 4.0 |
15
  | `average_triangularity` | ≤ -0.5 |
16
+ | `abs(edge_iota_over_nfp)` | ≥ 0.3 |
17
 
18
  The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. The live environment still exposes **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit), but the standard GRPO notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on the low-fidelity `run` surface and ignore `submit` by default.
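The 26-action count stated above can be sanity-checked with a short sketch (the parameter names come from this repo's docs; the direction names are illustrative placeholders, not necessarily the environment's actual identifiers):

```python
# Rebuild the advertised discrete action space:
# 4 parameters x 2 directions x 3 magnitudes = 24 run actions,
# plus restore_best and submit. Direction names are illustrative only.
from itertools import product

parameters = ["aspect_ratio", "elongation", "rotational_transform", "triangularity_scale"]
directions = ["increase", "decrease"]
magnitudes = ["small", "medium", "large"]

run_actions = [("run", p, d, m) for p, d, m in product(parameters, directions, magnitudes)]
all_actions = run_actions + [("restore_best",), ("submit",)]
print(len(all_actions))  # 26
```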
19
 
 
66
  - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
67
  - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
68
  - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
69
+ - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `from_boundary_resolution`; do not present step-time metrics as final submission metrics.
70
  - The standard LLM training and evaluation workflow is now low-fidelity-only: the repo notebook and `training/llm_rollout.py` `monitor` / `evaluate` ignore `submit` by default. Reserve `submit` for explicit replay/debug work, paired fixture checks, submit-side traces, and final evidence.
71
  - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
72
  - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
TODO.md CHANGED
@@ -72,7 +72,7 @@ flowchart TD
72
 
73
  - [x] Lock the exact `P1` environment contract
74
  Goal:
75
- freeze observation schema, action schema, episode loop, terminal conditions, and `Reward V0`
76
  Related:
77
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
78
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
@@ -213,16 +213,16 @@ flowchart TD
213
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
214
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
215
 
216
- - [ ] Update reward from `V0` to `V1` if playtesting reveals a real pathology
217
  Goal:
218
  keep a short exploit -> fix -> behavior improvement story
219
  Related:
220
  [AGENTS.md](AGENTS.md),
221
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
222
 
223
- - [ ] Write down whether `Reward V0` survives unchanged
224
  Goal:
225
- if playtesting does not reveal a real pathology, record that outcome explicitly instead of forcing a `V1`
226
  Related:
227
  [README.md](README.md),
228
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
@@ -294,4 +294,6 @@ flowchart TD
294
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
295
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
296
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
297
- - [ ] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
 
 
 
72
 
73
  - [x] Lock the exact `P1` environment contract
74
  Goal:
75
+ freeze observation schema, action schema, episode loop, terminal conditions, and the live reward contract
76
  Related:
77
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
78
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
 
213
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
214
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
215
 
216
+ - [x] Update reward from `V0` to `V1` after playtesting exposed a real repair-path pathology
217
  Goal:
218
  keep a short exploit -> fix -> behavior improvement story
219
  Related:
220
  [AGENTS.md](AGENTS.md),
221
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
222
 
223
+ - [x] Write down why `Reward V0` did not survive unchanged
224
  Goal:
225
+ document the concrete pathology: pure `official_feasibility` shaping hid useful non-dominant repairs because official feasibility is a max over normalized constraint violations
226
  Related:
227
  [README.md](README.md),
228
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
 
294
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
295
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
296
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
297
+ - [x] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
298
+ Note:
299
+ completed by recording the concrete `Reward V0` pathology and only then moving to `Reward V1`
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -138,7 +138,7 @@ The environment contract must stay narrow and legible:
138
  - low-fidelity verifier for normal steps
139
  - high-fidelity verifier for `submit`
140
  - readable observation surface with explicit fidelity labeling
141
- - `Reward V0` kept simple and feasibility-first until playtesting proves a real pathology
142
 
143
  The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.
144
 
 
138
  - low-fidelity verifier for normal steps
139
  - high-fidelity verifier for `submit`
140
  - readable observation surface with explicit fidelity labeling
141
+ - `Reward V1` kept verifier-native and repair-first, with official normalized violation telemetry
142
 
143
  The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.
144
 
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -92,6 +92,10 @@ Required fields:
92
  - `aspect_ratio`
93
  - `average_triangularity`
94
  - `edge_iota_over_nfp`
 
 
 
 
95
  - `p1_feasibility`
96
  - `p1_score`
97
  - `constraints_satisfied`
@@ -118,6 +122,8 @@ Interpretation rules:
118
  - high-fidelity `submit` metrics must be labeled as high-fidelity
119
  - low-fidelity and high-fidelity best-state reporting must stay separate
120
  - the observation must be understandable without hidden state
 
 
121
  - reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
122
  - action telemetry must expose parameter values before and after the action, including clamped and no-op moves
123
 
@@ -183,24 +189,31 @@ Training and evaluation rule:
183
  - keep higher-fidelity `submit` for terminal truth, explicit replay/debug work, paired fixture checks, and submit-side manual traces
184
  - do not move VMEC-backed submit evaluation into every training step unless the contract is deliberately redefined
185
 
186
- ## 9. Reward V0
187
 
188
- `Reward V0` is the live reward contract until playtesting proves a concrete pathology.
 
 
 
189
 
190
  Target behavior:
191
 
192
  - infeasible to feasible crossing gets a clear positive bonus
193
  - feasible to infeasible regression gets a clear penalty
194
- - when both states are infeasible, reduced official feasibility violation should help
 
 
195
  - when both states are feasible, lower `max_elongation` should help
196
- - non-submit actions pay a small step cost
 
197
  - `submit` should be better than passive exhaustion when the design is genuinely improved
198
  - recovery after a failed evaluation may receive a modest bounded bonus
199
 
200
  Rules:
201
 
202
  - keep reward scalar and verifier-driven
203
- - do not add mode-specific or parameter-specific reward shaping
 
204
  - do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
205
 
206
  ## 10. Reset and Fixture Policy
 
92
  - `aspect_ratio`
93
  - `average_triangularity`
94
  - `edge_iota_over_nfp`
95
+ - `aspect_ratio_violation`
96
+ - `triangularity_violation`
97
+ - `iota_violation`
98
+ - `dominant_constraint`
99
  - `p1_feasibility`
100
  - `p1_score`
101
  - `constraints_satisfied`
 
122
  - high-fidelity `submit` metrics must be labeled as high-fidelity
123
  - low-fidelity and high-fidelity best-state reporting must stay separate
124
  - the observation must be understandable without hidden state
125
+ - normalized constraint-violation telemetry must follow the official `P1` constraint scales
126
+ - the dominant active constraint must be visible so a human can explain repair-phase rewards
127
  - reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
128
  - action telemetry must expose parameter values before and after the action, including clamped and no-op moves
129
 
 
189
  - keep higher-fidelity `submit` for terminal truth, explicit replay/debug work, paired fixture checks, and submit-side manual traces
190
  - do not move VMEC-backed submit evaluation into every training step unless the contract is deliberately redefined
191
 
192
+ ## 9. Reward V1
193
 
194
+ `Reward V1` replaces `Reward V0` because the old infeasible shaping only used `Δ official_feasibility`.
195
+ That was too coarse once the transferred P1 findings made the main pathology clear: official
196
+ feasibility is a max over normalized constraint violations, so useful repair steps on
197
+ non-dominant constraints could be nearly invisible to the reward.
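A minimal sketch of the pathology described above (constraint names are from the contract; the violation values are invented for illustration):

```python
# Official P1 feasibility is the max over normalized constraint violations.
# When triangularity dominates, repairing the aspect-ratio violation does not
# move the feasibility scalar, so a pure delta-feasibility reward is blind to it.
def official_feasibility(violations: dict[str, float]) -> float:
    return max(violations.values())

before = {"aspect_ratio": 0.20, "average_triangularity": 1.00, "edge_iota_over_nfp": 0.40}
after = dict(before, aspect_ratio=0.05)  # a genuinely useful non-dominant repair

delta = official_feasibility(before) - official_feasibility(after)
print(delta)  # 0.0 -- Reward V0's feasibility-delta term pays nothing for this step
```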
198
 
199
  Target behavior:
200
 
201
  - infeasible to feasible crossing gets a clear positive bonus
202
  - feasible to infeasible regression gets a clear penalty
203
+ - when both states are infeasible, reduced official feasibility violation should still help
204
+ - when both states are infeasible, reduced normalized triangularity violation should help the most
205
+ - when both states are infeasible, reduced normalized aspect-ratio and edge-iota violations should also help
206
  - when both states are feasible, lower `max_elongation` should help
207
+ - larger `run` actions should pay a larger step cost than smaller `run` actions
208
+ - `restore_best` should keep a flat non-submit step cost
209
  - `submit` should be better than passive exhaustion when the design is genuinely improved
210
  - recovery after a failed evaluation may receive a modest bounded bonus
211
 
212
  Rules:
213
 
214
  - keep reward scalar and verifier-driven
215
+ - keep the infeasible shaping tied to official normalized constraint violations, not family-name priors
216
+ - do not add family-specific reward shaping from `scadena`, `CreativeEngineer`, `Samet`, or `egodos`
217
  - do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
218
 
219
  ## 10. Reset and Fixture Policy
docs/P1_MANUAL_PLAYTEST_LOG.md CHANGED
@@ -49,7 +49,7 @@ Step 1:
49
 
50
  Current conclusion:
51
 
52
- - Reward V0 is legible on the low-fidelity repair path around the default reset seed
53
  - a real `submit` trace is now recorded; next manual validation is to extend beyond the initial 5-10 episode path and look for one clear exploit or ambiguity
54
 
55
  Episode C: submit-side manual trace
@@ -80,3 +80,5 @@ Step sequence:
80
  Artifact:
81
 
82
  - [manual submit trace JSON](../baselines/submit_side_trace.json)
 
 
 
49
 
50
  Current conclusion:
51
 
52
+ - At the time of this initial playtest, Reward V0 was legible on the low-fidelity repair path around the default reset seed
53
  - a real `submit` trace is now recorded; next manual validation is to extend beyond the initial 5-10 episode path and look for one clear exploit or ambiguity
54
 
55
  Episode C: submit-side manual trace
 
80
  Artifact:
81
 
82
  - [manual submit trace JSON](../baselines/submit_side_trace.json)
83
+ Note:
84
+ this is a historical submit-side artifact from the earlier Reward V0 / pre-telemetry contract surface. Keep it as supporting evidence for the old submit path, not as the current Reward V1 observation-format example.
docs/P1_PARAMETERIZATION_DEEPDIVE.md CHANGED
@@ -37,7 +37,7 @@ Observed behavior:
37
 
38
  `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` does not meaningfully expose the Fourier mode that controls triangularity.
39
 
40
- The historical `rotational_transform` range was also too low to reach the `edge_iota_over_nfp >= 0.3` requirement reliably.
41
 
42
  ## 2. Original Winning Session
43
 
 
37
 
38
  `generate_rotating_ellipse(aspect_ratio, elongation, rotational_transform, n_field_periods)` does not meaningfully expose the Fourier mode that controls triangularity.
39
 
40
+ The historical `rotational_transform` range was also too low to reach the `abs(edge_iota_over_nfp) >= 0.3` requirement reliably.
41
 
42
  ## 2. Original Winning Session
43
 
fusion_lab/llm_agent.py CHANGED
@@ -45,7 +45,7 @@ Action rules:
45
  Constraint directions:
46
  - aspect_ratio <= 4.0
47
  - average_triangularity <= -0.5
48
- - edge_iota_over_nfp >= 0.3"""
49
 
50
 
51
  def _extract_json_array(text: str) -> str | None:
@@ -143,7 +143,11 @@ def format_observation(observation: StellaratorObservation) -> str:
143
  f"- average_triangularity: {observation.average_triangularity:.6f} "
144
  "(must stay <= -0.5)\n"
145
  f"- edge_iota_over_nfp: {observation.edge_iota_over_nfp:.4f} "
146
- "(must stay >= 0.3)\n"
 
 
 
 
147
  f"- p1_score: {observation.p1_score:.4f}\n"
148
  f"- p1_feasibility: {observation.p1_feasibility:.6f}\n"
149
  f"- constraints_satisfied: {observation.constraints_satisfied}\n"
 
45
  Constraint directions:
46
  - aspect_ratio <= 4.0
47
  - average_triangularity <= -0.5
48
+ - abs(edge_iota_over_nfp) >= 0.3"""
49
 
50
 
51
  def _extract_json_array(text: str) -> str | None:
 
143
  f"- average_triangularity: {observation.average_triangularity:.6f} "
144
  "(must stay <= -0.5)\n"
145
  f"- edge_iota_over_nfp: {observation.edge_iota_over_nfp:.4f} "
146
+ "(must satisfy abs(.) >= 0.3)\n"
147
+ f"- aspect_ratio_violation: {observation.aspect_ratio_violation:.6f}\n"
148
+ f"- triangularity_violation: {observation.triangularity_violation:.6f}\n"
149
+ f"- iota_violation: {observation.iota_violation:.6f}\n"
150
+ f"- dominant_constraint: {observation.dominant_constraint}\n"
151
  f"- p1_score: {observation.p1_score:.4f}\n"
152
  f"- p1_feasibility: {observation.p1_feasibility:.6f}\n"
153
  f"- constraints_satisfied: {observation.constraints_satisfied}\n"
fusion_lab/models.py CHANGED
@@ -6,6 +6,12 @@ from openenv.core import Action, Observation, State
6
  from pydantic import BaseModel, Field
7
 
8
  ActionIntent = Literal["run", "submit", "restore_best"]
 
 
 
 
 
 
9
  ParameterName = Literal[
10
  "aspect_ratio",
11
  "elongation",
@@ -51,6 +57,9 @@ class RewardBreakdown(BaseModel):
51
  feasibility_crossing_bonus: float = 0.0
52
  feasibility_regression_penalty: float = 0.0
53
  feasibility_delta_reward: float = 0.0
 
 
 
54
  objective_delta_reward: float = 0.0
55
  step_cost: float = 0.0
56
  recovery_bonus: float = 0.0
@@ -94,6 +103,10 @@ class StellaratorObservation(Observation):
94
  aspect_ratio: float = 0.0
95
  average_triangularity: float = 0.0
96
  edge_iota_over_nfp: float = 0.0
 
 
 
 
97
  p1_score: float = 0.0
98
  p1_feasibility: float = 0.0
99
  vacuum_well: float = 0.0
 
6
  from pydantic import BaseModel, Field
7
 
8
  ActionIntent = Literal["run", "submit", "restore_best"]
9
+ ConstraintName = Literal[
10
+ "none",
11
+ "aspect_ratio",
12
+ "average_triangularity",
13
+ "edge_iota_over_nfp",
14
+ ]
15
  ParameterName = Literal[
16
  "aspect_ratio",
17
  "elongation",
 
57
  feasibility_crossing_bonus: float = 0.0
58
  feasibility_regression_penalty: float = 0.0
59
  feasibility_delta_reward: float = 0.0
60
+ aspect_ratio_repair_reward: float = 0.0
61
+ triangularity_repair_reward: float = 0.0
62
+ iota_repair_reward: float = 0.0
63
  objective_delta_reward: float = 0.0
64
  step_cost: float = 0.0
65
  recovery_bonus: float = 0.0
 
103
  aspect_ratio: float = 0.0
104
  average_triangularity: float = 0.0
105
  edge_iota_over_nfp: float = 0.0
106
+ aspect_ratio_violation: float = 0.0
107
+ triangularity_violation: float = 0.0
108
+ iota_violation: float = 0.0
109
+ dominant_constraint: ConstraintName = "none"
110
  p1_score: float = 0.0
111
  p1_feasibility: float = 0.0
112
  vacuum_well: float = 0.0
models.py CHANGED
@@ -3,6 +3,7 @@
3
  from fusion_lab.models import (
4
  ActionMonitor,
5
  ActionIntent,
 
6
  DirectionName,
7
  EvaluationFidelityName,
8
  LowDimBoundaryParams,
@@ -20,6 +21,7 @@ from fusion_lab.models import (
20
  __all__ = [
21
  "ActionIntent",
22
  "ActionMonitor",
 
23
  "DirectionName",
24
  "EvaluationFidelityName",
25
  "LowDimBoundaryParams",
 
3
  from fusion_lab.models import (
4
  ActionMonitor,
5
  ActionIntent,
6
+ ConstraintName,
7
  DirectionName,
8
  EvaluationFidelityName,
9
  LowDimBoundaryParams,
 
21
  __all__ = [
22
  "ActionIntent",
23
  "ActionMonitor",
24
+ "ConstraintName",
25
  "DirectionName",
26
  "EvaluationFidelityName",
27
  "LowDimBoundaryParams",
server/data/p1/lowfi_feasible_local.json CHANGED
@@ -3,7 +3,7 @@
3
  "status": "low_fidelity_calibrated",
4
  "notes": [
5
  "Local repair target reached from the default reset band by increasing rotational_transform and triangularity_scale.",
6
- "Useful as a low-fidelity feasibility reference for Reward V0 sanity checks.",
7
  "High-fidelity submit spot check is complete."
8
  ],
9
  "params": {
 
3
  "status": "low_fidelity_calibrated",
4
  "notes": [
5
  "Local repair target reached from the default reset band by increasing rotational_transform and triangularity_scale.",
6
+ "Useful as a low-fidelity feasibility reference for reward sanity checks.",
7
  "High-fidelity submit spot check is complete."
8
  ],
9
  "params": {
server/environment.py CHANGED
@@ -8,6 +8,7 @@ from fusion_lab.models import (
8
  ActionMonitor,
9
  ActionIntent,
10
  LowDimBoundaryParams,
 
11
  RewardBreakdown,
12
  StellaratorAction,
13
  StellaratorObservation,
@@ -43,12 +44,22 @@ PARAMETER_DELTAS: Final[dict[str, dict[str, float]]] = {
43
  TARGET_SPEC: Final[str] = (
44
  "Optimize the P1 benchmark using a custom low-dimensional boundary family derived "
45
  "from a rotating-ellipse seed. Constraints: aspect ratio <= 4.0, average "
46
- "triangularity <= -0.5, edge rotational transform / n_field_periods >= 0.3. "
47
  "Run actions use low-fidelity verification. Submit uses high-fidelity verification. "
48
  "Budget: 6 evaluations."
49
  )
50
 
51
  FAILURE_PENALTY: Final[float] = -2.0
 
52
 
53
 
54
  class StellaratorEnvironment(
@@ -150,7 +161,12 @@ class StellaratorEnvironment(
150
  self._update_best(params, metrics)
151
 
152
  done = self._state.budget_remaining <= 0
153
- reward_breakdown = self._compute_reward_breakdown(metrics, action.intent, done)
 
 
 
 
 
154
  reward = reward_breakdown.total
155
  summary = self._summary_run(action, metrics, action_monitor)
156
  self._state.history.append(summary)
@@ -265,6 +281,7 @@ class StellaratorEnvironment(
265
  metrics: EvaluationMetrics,
266
  intent: ActionIntent,
267
  done: bool,
 
268
  initial_reference_score: float | None = None,
269
  ) -> RewardBreakdown:
270
  recovered_from_failure = self._recovered_from_failed_evaluation(metrics)
@@ -282,7 +299,7 @@ class StellaratorEnvironment(
282
  if metrics.evaluation_failed:
283
  breakdown.failure_penalty = FAILURE_PENALTY
284
  if intent != "submit":
285
- breakdown.step_cost = -0.1
286
  if intent == "submit":
287
  breakdown.failure_submit_penalty = -1.0
288
  elif done:
@@ -302,10 +319,19 @@ class StellaratorEnvironment(
302
  else:
303
  breakdown.feasibility_delta_reward = (
304
  previous_metrics.p1_feasibility - metrics.p1_feasibility
305
- ) * 5.0
 
306
 
307
  if intent != "submit":
308
- breakdown.step_cost = -0.1
309
 
310
  if recovered_from_failure:
311
  breakdown.recovery_bonus = 1.0
@@ -365,8 +391,15 @@ class StellaratorEnvironment(
365
  f"max_elongation={metrics.max_elongation:.4f}",
366
  f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
367
  f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
368
- f"edge_iota_over_nfp={metrics.edge_iota_over_nfp:.4f} (>= {EDGE_IOTA_OVER_NFP_MIN:.1f})",
 
 
 
369
  f"feasibility={metrics.p1_feasibility:.6f}",
 
 
 
 
370
  f"best_low_fidelity_score={best_low_fidelity_score:.6f}",
371
  f"best_low_fidelity_feasibility={best_low_fidelity_feasibility:.6f}",
372
  (
@@ -394,6 +427,10 @@ class StellaratorEnvironment(
394
  aspect_ratio=metrics.aspect_ratio,
395
  average_triangularity=metrics.average_triangularity,
396
  edge_iota_over_nfp=metrics.edge_iota_over_nfp,
 
 
 
 
397
  p1_score=metrics.p1_score,
398
  p1_feasibility=metrics.p1_feasibility,
399
  vacuum_well=metrics.vacuum_well,
@@ -450,7 +487,8 @@ class StellaratorEnvironment(
450
  else:
451
  delta = previous_metrics.p1_feasibility - metrics.p1_feasibility
452
  objective_summary = (
453
- f"feasibility changed by {delta:+.6f} to {metrics.p1_feasibility:.6f}."
 
454
  )
455
  return (
456
  f"Applied {action.parameter} {action.direction} {action.magnitude}. "
@@ -606,6 +644,13 @@ class StellaratorEnvironment(
606
  return "The requested move was clipped to stay inside the allowed parameter range. "
607
  return ""
608
 
 
 
 
 
 
 
 
609
  def _reward_total(self, breakdown: RewardBreakdown) -> float:
610
  total = (
611
  breakdown.invalid_action_penalty
@@ -615,6 +660,9 @@ class StellaratorEnvironment(
615
  + breakdown.feasibility_crossing_bonus
616
  + breakdown.feasibility_regression_penalty
617
  + breakdown.feasibility_delta_reward
 
 
 
618
  + breakdown.objective_delta_reward
619
  + breakdown.step_cost
620
  + breakdown.recovery_bonus
@@ -633,6 +681,9 @@ class StellaratorEnvironment(
633
  ("feasibility_crossing_bonus", breakdown.feasibility_crossing_bonus),
634
  ("feasibility_regression_penalty", breakdown.feasibility_regression_penalty),
635
  ("feasibility_delta_reward", breakdown.feasibility_delta_reward),
 
 
 
636
  ("objective_delta_reward", breakdown.objective_delta_reward),
637
  ("step_cost", breakdown.step_cost),
638
  ("recovery_bonus", breakdown.recovery_bonus),
 
8
  ActionMonitor,
9
  ActionIntent,
10
  LowDimBoundaryParams,
11
+ MagnitudeName,
12
  RewardBreakdown,
13
  StellaratorAction,
14
  StellaratorObservation,
 
44
  TARGET_SPEC: Final[str] = (
45
  "Optimize the P1 benchmark using a custom low-dimensional boundary family derived "
46
  "from a rotating-ellipse seed. Constraints: aspect ratio <= 4.0, average "
47
+ "triangularity <= -0.5, abs(edge rotational transform / n_field_periods) >= 0.3. "
48
  "Run actions use low-fidelity verification. Submit uses high-fidelity verification. "
49
  "Budget: 6 evaluations."
50
  )
51
 
52
  FAILURE_PENALTY: Final[float] = -2.0
53
+ FEASIBILITY_DELTA_WEIGHT: Final[float] = 2.0
54
+ TRIANGULARITY_REPAIR_WEIGHT: Final[float] = 2.0
55
+ ASPECT_RATIO_REPAIR_WEIGHT: Final[float] = 1.0
56
+ IOTA_REPAIR_WEIGHT: Final[float] = 1.0
57
+ STEP_COST_BY_MAGNITUDE: Final[dict[MagnitudeName, float]] = {
58
+ "small": -0.05,
59
+ "medium": -0.1,
60
+ "large": -0.2,
61
+ }
62
+ RESTORE_STEP_COST: Final[float] = -0.1
63
 
64
 
65
  class StellaratorEnvironment(
 
161
  self._update_best(params, metrics)
162
 
163
  done = self._state.budget_remaining <= 0
164
+ reward_breakdown = self._compute_reward_breakdown(
165
+ metrics,
166
+ action.intent,
167
+ done,
168
+ magnitude=action.magnitude,
169
+ )
170
  reward = reward_breakdown.total
171
  summary = self._summary_run(action, metrics, action_monitor)
172
  self._state.history.append(summary)
 
281
  metrics: EvaluationMetrics,
282
  intent: ActionIntent,
283
  done: bool,
284
+ magnitude: MagnitudeName | None = None,
285
  initial_reference_score: float | None = None,
286
  ) -> RewardBreakdown:
287
  recovered_from_failure = self._recovered_from_failed_evaluation(metrics)
 
299
  if metrics.evaluation_failed:
300
  breakdown.failure_penalty = FAILURE_PENALTY
301
  if intent != "submit":
302
+ breakdown.step_cost = self._step_cost(intent=intent, magnitude=magnitude)
303
  if intent == "submit":
304
  breakdown.failure_submit_penalty = -1.0
305
  elif done:
 
319
  else:
320
  breakdown.feasibility_delta_reward = (
321
  previous_metrics.p1_feasibility - metrics.p1_feasibility
322
+ ) * FEASIBILITY_DELTA_WEIGHT
323
+ breakdown.triangularity_repair_reward = (
324
+ previous_metrics.triangularity_violation - metrics.triangularity_violation
325
+ ) * TRIANGULARITY_REPAIR_WEIGHT
326
+ breakdown.aspect_ratio_repair_reward = (
327
+ previous_metrics.aspect_ratio_violation - metrics.aspect_ratio_violation
328
+ ) * ASPECT_RATIO_REPAIR_WEIGHT
329
+ breakdown.iota_repair_reward = (
330
+ previous_metrics.iota_violation - metrics.iota_violation
331
+ ) * IOTA_REPAIR_WEIGHT
332
 
333
  if intent != "submit":
334
+ breakdown.step_cost = self._step_cost(intent=intent, magnitude=magnitude)
335
 
336
  if recovered_from_failure:
337
  breakdown.recovery_bonus = 1.0
 
391
  f"max_elongation={metrics.max_elongation:.4f}",
392
  f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
393
  f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
394
+ (
395
+ "edge_iota_over_nfp="
396
+ f"{metrics.edge_iota_over_nfp:.4f} (abs(.) >= {EDGE_IOTA_OVER_NFP_MIN:.1f})"
397
+ ),
398
  f"feasibility={metrics.p1_feasibility:.6f}",
399
+ f"aspect_ratio_violation={metrics.aspect_ratio_violation:.6f}",
400
+ f"triangularity_violation={metrics.triangularity_violation:.6f}",
401
+ f"iota_violation={metrics.iota_violation:.6f}",
402
+ f"dominant_constraint={metrics.dominant_constraint}",
403
  f"best_low_fidelity_score={best_low_fidelity_score:.6f}",
404
  f"best_low_fidelity_feasibility={best_low_fidelity_feasibility:.6f}",
405
  (
 
427
  aspect_ratio=metrics.aspect_ratio,
428
  average_triangularity=metrics.average_triangularity,
429
  edge_iota_over_nfp=metrics.edge_iota_over_nfp,
430
+ aspect_ratio_violation=metrics.aspect_ratio_violation,
431
+ triangularity_violation=metrics.triangularity_violation,
432
+ iota_violation=metrics.iota_violation,
433
+ dominant_constraint=metrics.dominant_constraint,
434
  p1_score=metrics.p1_score,
435
  p1_feasibility=metrics.p1_feasibility,
436
  vacuum_well=metrics.vacuum_well,
 
487
  else:
488
  delta = previous_metrics.p1_feasibility - metrics.p1_feasibility
489
  objective_summary = (
490
+ f"feasibility changed by {delta:+.6f} to {metrics.p1_feasibility:.6f}; "
491
+ f"dominant_constraint={metrics.dominant_constraint}."
492
  )
493
  return (
494
  f"Applied {action.parameter} {action.direction} {action.magnitude}. "
 
644
  return "The requested move was clipped to stay inside the allowed parameter range. "
645
  return ""
646
 
647
+ def _step_cost(self, *, intent: ActionIntent, magnitude: MagnitudeName | None) -> float:
648
+ if intent == "restore_best":
649
+ return RESTORE_STEP_COST
650
+ if magnitude is None:
651
+ return STEP_COST_BY_MAGNITUDE["medium"]
652
+ return STEP_COST_BY_MAGNITUDE[magnitude]
653
+
654
  def _reward_total(self, breakdown: RewardBreakdown) -> float:
655
  total = (
656
  breakdown.invalid_action_penalty
 
660
  + breakdown.feasibility_crossing_bonus
661
  + breakdown.feasibility_regression_penalty
662
  + breakdown.feasibility_delta_reward
663
+ + breakdown.aspect_ratio_repair_reward
664
+ + breakdown.triangularity_repair_reward
665
+ + breakdown.iota_repair_reward
666
  + breakdown.objective_delta_reward
667
  + breakdown.step_cost
668
  + breakdown.recovery_bonus
 
681
  ("feasibility_crossing_bonus", breakdown.feasibility_crossing_bonus),
682
  ("feasibility_regression_penalty", breakdown.feasibility_regression_penalty),
683
  ("feasibility_delta_reward", breakdown.feasibility_delta_reward),
684
+ ("aspect_ratio_repair_reward", breakdown.aspect_ratio_repair_reward),
685
+ ("triangularity_repair_reward", breakdown.triangularity_repair_reward),
686
+ ("iota_repair_reward", breakdown.iota_repair_reward),
687
  ("objective_delta_reward", breakdown.objective_delta_reward),
688
  ("step_cost", breakdown.step_cost),
689
  ("recovery_bonus", breakdown.recovery_bonus),
server/physics.py CHANGED
@@ -15,7 +15,7 @@ from constellaration.geometry.surface_rz_fourier import SurfaceRZFourier
 from constellaration.initial_guess import generate_rotating_ellipse
 from constellaration.problems import GeometricalProblem
 
-from fusion_lab.models import LowDimBoundaryParams
+from fusion_lab.models import ConstraintName, LowDimBoundaryParams
 
 ASPECT_RATIO_MAX: Final[float] = 4.0
 AVERAGE_TRIANGULARITY_MAX: Final[float] = -0.5
@@ -35,6 +35,10 @@ class EvaluationMetrics:
     aspect_ratio: float
     average_triangularity: float
     edge_iota_over_nfp: float
+    aspect_ratio_violation: float
+    triangularity_violation: float
+    iota_violation: float
+    dominant_constraint: ConstraintName
     p1_score: float
     p1_feasibility: float
     constraints_satisfied: bool
@@ -119,12 +123,19 @@ def _to_evaluation_metrics(
     if not minimize_objective:
         raise ValueError("P1 objective is expected to be minimize-only.")
     p1_score = _score_from_objective(float(objective)) if constraints_satisfied else 0.0
+    aspect_ratio_violation, triangularity_violation, iota_violation, dominant_constraint = (
+        _constraint_violation_metrics(metrics)
+    )
 
     return EvaluationMetrics(
         max_elongation=float(objective),
         aspect_ratio=float(metrics.aspect_ratio),
         average_triangularity=float(metrics.average_triangularity),
         edge_iota_over_nfp=float(metrics.edge_rotational_transform_over_n_field_periods),
+        aspect_ratio_violation=aspect_ratio_violation,
+        triangularity_violation=triangularity_violation,
+        iota_violation=iota_violation,
+        dominant_constraint=dominant_constraint,
         p1_score=p1_score,
         p1_feasibility=p1_feasibility,
         constraints_satisfied=constraints_satisfied,
@@ -145,6 +156,10 @@ def _failure_metrics(
         aspect_ratio=0.0,
         average_triangularity=0.0,
         edge_iota_over_nfp=0.0,
+        aspect_ratio_violation=0.0,
+        triangularity_violation=0.0,
+        iota_violation=0.0,
+        dominant_constraint="none",
         p1_score=0.0,
         p1_feasibility=FAILED_FEASIBILITY,
         constraints_satisfied=False,
@@ -158,3 +173,42 @@ def _failure_metrics(
 def _score_from_objective(objective: float) -> float:
     normalized = min(max((objective - 1.0) / 9.0, 0.0), 1.0)
     return 1.0 - normalized
+
+
+def _constraint_violation_metrics(
+    metrics: ConstellarationMetrics,
+) -> tuple[float, float, float, ConstraintName]:
+    aspect_ratio_violation = max(float(metrics.aspect_ratio) - ASPECT_RATIO_MAX, 0.0) / (
+        ASPECT_RATIO_MAX
+    )
+    triangularity_violation = max(
+        float(metrics.average_triangularity) - AVERAGE_TRIANGULARITY_MAX,
+        0.0,
+    ) / abs(AVERAGE_TRIANGULARITY_MAX)
+    iota_violation = (
+        max(
+            EDGE_IOTA_OVER_NFP_MIN
+            - abs(float(metrics.edge_rotational_transform_over_n_field_periods)),
+            0.0,
+        )
+        / EDGE_IOTA_OVER_NFP_MIN
+    )
+
+    dominant_constraint: ConstraintName = "none"
+    dominant_violation = 0.0
+    constraint_violations: tuple[tuple[ConstraintName, float], ...] = (
+        ("aspect_ratio", aspect_ratio_violation),
+        ("average_triangularity", triangularity_violation),
+        ("edge_iota_over_nfp", iota_violation),
+    )
+    for constraint_name, violation in constraint_violations:
+        if violation > dominant_violation:
+            dominant_constraint = constraint_name
+            dominant_violation = violation
+
+    return (
+        aspect_ratio_violation,
+        triangularity_violation,
+        iota_violation,
+        dominant_constraint,
+    )
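Each violation in `_constraint_violation_metrics` is normalized by its own constraint bound, so the three values are comparable fractions rather than raw distances. The snippet below re-derives that normalization with the repo's constants on sample numbers; the free-function signature and dict return are illustrative stand-ins for the repo's metrics object, and its `max`-key tie-breaking is a simplification of the diff's strict `>` scan (which returns `"none"` when nothing is violated).

```python
# Constants match server/physics.py; the function shape is illustrative.
ASPECT_RATIO_MAX = 4.0
AVERAGE_TRIANGULARITY_MAX = -0.5
EDGE_IOTA_OVER_NFP_MIN = 0.3

def violations(aspect_ratio: float, average_triangularity: float,
               edge_iota_over_nfp: float) -> dict[str, float]:
    return {
        # Fractional overshoot above the aspect-ratio cap.
        "aspect_ratio": max(aspect_ratio - ASPECT_RATIO_MAX, 0.0) / ASPECT_RATIO_MAX,
        # Fractional overshoot above the (negative) triangularity cap.
        "average_triangularity": max(
            average_triangularity - AVERAGE_TRIANGULARITY_MAX, 0.0
        ) / abs(AVERAGE_TRIANGULARITY_MAX),
        # Fractional shortfall below the iota floor (sign-insensitive).
        "edge_iota_over_nfp": max(
            EDGE_IOTA_OVER_NFP_MIN - abs(edge_iota_over_nfp), 0.0
        ) / EDGE_IOTA_OVER_NFP_MIN,
    }

v = violations(aspect_ratio=4.4, average_triangularity=-0.2, edge_iota_over_nfp=0.24)
dominant = max(v, key=v.get)  # triangularity dominates for these sample inputs
```

With these samples the normalized violations are roughly 0.1, 0.6, and 0.2, so `dominant_constraint` would flag triangularity, matching the historical P1 blocker described in the README.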
training/notebooks/fusion_design_lab_training.ipynb CHANGED
@@ -12,7 +12,7 @@
     "The agent interacts with a constrained optimization environment where it adjusts 4 geometric knobs of a stellarator boundary, aiming to **minimize max elongation** while satisfying 3 hard physics constraints:\n",
     "- `aspect_ratio ≤ 4.0`\n",
     "- `average_triangularity ≤ -0.5`\n",
-    "- `edge_iota_over_nfp ≥ 0.3`\n",
+    "- `abs(edge_iota_over_nfp) ≥ 0.3`\n",
     "\n",
     "Each episode has **6 evaluations** budgeted. The agent produces a plan of actions and the environment scores it via the `constellaration` physics verifier.\n",
     "\n",
@@ -198,7 +198,7 @@
    "source": [
     "## 6. Reward Function\n",
     "\n",
-    "The environment reward executes each generated action plan in the stellarator environment and returns the cumulative low-fidelity Reward V0 from the live environment. The environment's built-in reward decomposes feasibility (+3/-3 crossing bonuses, feasibility progress), objective (max elongation improvement), step costs, and failure penalties — see `server/environment.py:_compute_reward_breakdown(...)`.\n",
+    "The environment reward executes each generated action plan in the stellarator environment and returns the cumulative low-fidelity Reward V1 from the live environment. The environment's built-in reward decomposes feasibility (+3/-3 crossing bonuses, official feasibility progress, weighted triangularity/aspect/iota repair terms), objective (max elongation improvement), step costs, and failure penalties — see `server/environment.py:_compute_reward_breakdown(...)`.\n",
     "\n",
     "For the current training workflow, the notebook ignores `submit` and does not auto-submit. GRPO therefore optimizes the low-fidelity `run` path only. The live observation telemetry still exposes `reward_breakdown` and `action_monitor` for debugging reward behavior.\n"
    ]
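The "cumulative low-fidelity reward, ignoring `submit`" workflow the notebook cell describes can be sketched as a plain accumulation loop. Everything here is hypothetical scaffolding: `env.step`, its `(obs, reward, done)` return shape, and the `FakeEnv` stub are illustrative stand-ins, not the repo's actual rollout API.

```python
# Hypothetical sketch of the low-fidelity-only plan scoring the notebook
# describes: sum per-step rewards over a plan, skipping any `submit` action.
def score_plan(env, actions: list[str]) -> float:
    total = 0.0
    for action in actions:
        if action == "submit":  # training workflow ignores submit by default
            continue
        _obs, reward, done = env.step(action)
        total += reward
        if done:  # stop when the evaluation budget is exhausted
            break
    return total

class FakeEnv:
    """Stub environment for demonstration; every step returns reward 1.0."""
    def step(self, action):
        return None, 1.0, False

total = score_plan(FakeEnv(), ["increase_knob", "submit", "decrease_knob"])
```

The `submit` filter is what keeps GRPO on the low-fidelity `run` surface; replay and debug tooling can drop the filter to exercise the high-fidelity path.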