CreativeEngineer committed on
Commit 6deaccc · 1 Parent(s): eb446cf

refactor: align p1 runtime contract and baseline reporting

README.md CHANGED
@@ -30,8 +30,8 @@ Implementation status:
 ## Execution Status
 
 - [x] Lock the `P1` contract in code
-- [x] Rewrite shared models to the rotating-ellipse `P1` schema
-- [x] Rewrite the environment loop to the rotating-ellipse `P1` schema
+- [x] Rewrite shared models to the repaired low-dimensional `P1` schema
+- [x] Rewrite the environment loop to the repaired low-dimensional `P1` schema
 - [x] Update the API/task surface to match `P1`
 - [x] Update baseline agents to the `P1` contract
 - [x] Add a post-terminal guard so `step()` is a no-op after `done=True`
@@ -59,7 +59,7 @@ Implementation status:
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
 - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
-- `best_score` and `best_feasibility` are currently context-dependent in observations: run-time views reflect low-fidelity rollout state, while submit-time views can reflect high-fidelity best state. Keep that distinction explicit in docs, traces, and baseline interpretation until the contract is simplified further.
+- Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
 
@@ -121,12 +121,13 @@ uv sync --extra notebooks
 ## Immediate Next Steps
 
 1. Run a small measured sweep on the repaired family to choose useful ranges, deltas, and reset seeds.
-2. Add tracked `P1` fixtures under `server/data/p1`.
-3. Run manual playtest episodes and record the first real reward pathology, if any.
-4. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
-5. Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
-6. Deploy the environment to HF Space.
-7. Add the Colab notebook under `training/notebooks`.
+2. Verify that observation semantics are human-readable and that low-fi `run` versus high-fi `submit` best-state reporting is not ambiguous.
+3. Add tracked `P1` fixtures under `server/data/p1`.
+4. Run manual playtest episodes and record the first real reward pathology, if any.
+5. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
+6. Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
+7. Deploy the environment to HF Space.
+8. Add the Colab notebook under `training/notebooks`.
 
 These are implementation steps, not another planning phase.
 
 
TODO.md CHANGED
@@ -20,8 +20,8 @@ Priority source:
 ## Current State
 
 - [x] `P1` strategy is locked
-- [x] shared models reflect the rotating-ellipse `P1` contract
-- [x] environment loop reflects the rotating-ellipse `P1` contract
+- [x] shared models reflect the repaired low-dimensional `P1` contract
+- [x] environment loop reflects the repaired low-dimensional `P1` contract
 - [x] API/task surface reflects `P1`
 - [x] baselines reflect the `P1` contract
 - [x] repo docs call out the low-fi/high-fi `constellaration` split honestly
@@ -74,7 +74,7 @@ flowchart TD
 
 - [x] Verify that the current 3-knob family can or cannot approach P1 feasibility
   Goal:
-  decide whether parameterization repair is a blocker before more reward work
+  resolve the historical gating question about whether parameterization repair was required before more reward work
   Related:
   [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md),
   [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
@@ -164,12 +164,18 @@ flowchart TD
   Related:
   [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
 
+- [x] Clarify or split fidelity-dependent best-state observation fields
+  Goal:
+  replace ambiguous mixed best-state reporting with explicit low-fidelity and high-fidelity best-state fields before fixture evidence or baseline comparisons
+  Related:
+  [P1 Environment Contract](docs/P1_ENV_CONTRACT_V1.md)
+
 - [ ] Add 1-2 tracked `P1` fixtures
   Files:
   [server/data/p1/README.md](server/data/p1/README.md),
   [P1 Pivot Record](docs/PIVOT_P1_ROTATING_ELLIPSE.md)
   Note:
-  add fixtures only after the parameterization repair produces a meaningful near-boundary region
+  add fixtures only after the repaired family is calibrated into a meaningful near-boundary region
 
 - [ ] Run fixture sanity checks
   Goal:
baselines/compare.py CHANGED
@@ -14,27 +14,36 @@ def main(n_episodes: int = 20) -> None:
 
     random_rewards: list[float] = []
     heuristic_rewards: list[float] = []
-    random_best_scores: list[float] = []
-    heuristic_best_scores: list[float] = []
+    random_final_scores: list[float] = []
+    heuristic_final_scores: list[float] = []
+    random_feasible: list[int] = []
+    heuristic_feasible: list[int] = []
 
     for i in range(n_episodes):
         rr, rt = random_episode(env, seed=i)
+        _require_submit_fidelity(rt[-1], baseline_name="random")
         random_rewards.append(rr)
-        random_best_scores.append(rt[-1]["best_score"])
+        random_final_scores.append(rt[-1]["score"])
+        random_feasible.append(1 if rt[-1]["constraints_satisfied"] else 0)
 
         hr, ht = heuristic_episode(env, seed=i)
+        _require_submit_fidelity(ht[-1], baseline_name="heuristic")
         heuristic_rewards.append(hr)
-        heuristic_best_scores.append(ht[-1]["best_score"])
+        heuristic_final_scores.append(ht[-1]["score"])
+        heuristic_feasible.append(1 if ht[-1]["constraints_satisfied"] else 0)
 
     r_mean = sum(random_rewards) / len(random_rewards)
     h_mean = sum(heuristic_rewards) / len(heuristic_rewards)
-    r_score = sum(random_best_scores) / len(random_best_scores)
-    h_score = sum(heuristic_best_scores) / len(heuristic_best_scores)
+    r_score = sum(random_final_scores) / len(random_final_scores)
+    h_score = sum(heuristic_final_scores) / len(heuristic_final_scores)
+    r_feasible = sum(random_feasible)
+    h_feasible = sum(heuristic_feasible)
 
     print(f"{'Metric':<25} {'Random':>12} {'Heuristic':>12}")
     print("-" * 51)
     print(f"{'Mean reward':<25} {r_mean:>+12.4f} {h_mean:>+12.4f}")
-    print(f"{'Mean best P1 score':<25} {r_score:>12.6f} {h_score:>12.6f}")
+    print(f"{'Mean final P1 score':<25} {r_score:>12.6f} {h_score:>12.6f}")
+    print(f"{'Feasible finals':<25} {r_feasible:>12d} {h_feasible:>12d}")
     print(f"{'Episodes':<25} {n_episodes:>12d} {n_episodes:>12d}")
     print()
 
@@ -42,6 +51,14 @@ def main(n_episodes: int = 20) -> None:
     print(f"Heuristic wins: {wins}/{n_episodes} episodes ({100 * wins / n_episodes:.0f}%)")
 
 
+def _require_submit_fidelity(final_step: dict[str, object], *, baseline_name: str) -> None:
+    fidelity = final_step["evaluation_fidelity"]
+    if fidelity != "high":
+        raise ValueError(
+            f"{baseline_name} baseline ended on {fidelity!r} instead of high-fidelity submit."
+        )
+
+
 if __name__ == "__main__":
     n = int(sys.argv[1]) if len(sys.argv) > 1 else 20
     main(n)
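The fidelity guard added above can be exercised in isolation. A minimal sketch, assuming final trace entries are dicts carrying an `evaluation_fidelity` key; the trace dicts below are hypothetical stand-ins for real episode traces:

```python
# Standalone sketch of the submit-fidelity guard pattern from compare.py.
def require_submit_fidelity(final_step: dict[str, object], *, baseline_name: str) -> None:
    """Reject a trace whose final step was not a high-fidelity submit."""
    fidelity = final_step.get("evaluation_fidelity")
    if fidelity != "high":
        raise ValueError(
            f"{baseline_name} baseline ended on {fidelity!r} instead of high-fidelity submit."
        )

# Hypothetical traces: one ends with a high-fi submit, one ends low-fi.
good_trace = [{"step": 0, "score": 0.0}, {"step": 1, "score": 0.42, "evaluation_fidelity": "high"}]
bad_trace = [{"step": 0, "score": 0.0}, {"step": 1, "score": 0.10, "evaluation_fidelity": "low"}]

require_submit_fidelity(good_trace[-1], baseline_name="random")  # passes silently
try:
    require_submit_fidelity(bad_trace[-1], baseline_name="heuristic")
except ValueError as err:
    print(err)
```

Failing loudly here keeps a low-fidelity rollout score from silently entering the "Mean final P1 score" row.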
baselines/heuristic_agent.py CHANGED
@@ -13,10 +13,19 @@ def heuristic_episode(
 ) -> tuple[float, list[dict[str, object]]]:
     obs = env.reset(seed=seed)
     total_reward = 0.0
-    trace: list[dict[str, object]] = [{"step": 0, "score": obs.p1_score}]
+    trace: list[dict[str, object]] = [
+        {
+            "step": 0,
+            "score": obs.p1_score,
+            "evaluation_fidelity": obs.evaluation_fidelity,
+            "constraints_satisfied": obs.constraints_satisfied,
+        }
+    ]
 
     while not obs.done:
-        action = _choose_action(obs)
+        action = (
+            StellaratorAction(intent="submit") if obs.budget_remaining <= 1 else _choose_action(obs)
+        )
         obs = env.step(action)
         total_reward += obs.reward or 0.0
         trace.append(
@@ -24,7 +33,8 @@ def heuristic_episode(
                 "step": len(trace),
                 "action": _action_label(action),
                 "score": obs.p1_score,
-                "best_score": obs.best_score,
+                "evaluation_fidelity": obs.evaluation_fidelity,
+                "constraints_satisfied": obs.constraints_satisfied,
                 "reward": obs.reward,
                 "failure": obs.evaluation_failed,
             }
@@ -95,7 +105,8 @@ def main(n_episodes: int = 20) -> None:
        rewards.append(total_reward)
        print(
            f"Episode {i:3d}: steps={len(trace) - 1} "
-            f"final_score={final['score']:.6f} best_score={final['best_score']:.6f} "
+            f"final_score={final['score']:.6f} fidelity={final['evaluation_fidelity']} "
+            f"constraints={'yes' if final['constraints_satisfied'] else 'no'} "
            f"reward={total_reward:+.4f}"
        )
112
 
baselines/random_agent.py CHANGED
@@ -24,15 +24,25 @@ def random_episode(
     rng = random.Random(seed)
     obs = env.reset(seed=seed)
     total_reward = 0.0
-    trace: list[dict[str, object]] = [{"step": 0, "score": obs.p1_score}]
+    trace: list[dict[str, object]] = [
+        {
+            "step": 0,
+            "score": obs.p1_score,
+            "evaluation_fidelity": obs.evaluation_fidelity,
+            "constraints_satisfied": obs.constraints_satisfied,
+        }
+    ]
 
     while not obs.done:
-        action = StellaratorAction(
-            intent="run",
-            parameter=rng.choice(PARAMETERS),
-            direction=rng.choice(DIRECTIONS),
-            magnitude=rng.choice(MAGNITUDES),
-        )
+        if obs.budget_remaining <= 1:
+            action = StellaratorAction(intent="submit")
+        else:
+            action = StellaratorAction(
+                intent="run",
+                parameter=rng.choice(PARAMETERS),
+                direction=rng.choice(DIRECTIONS),
+                magnitude=rng.choice(MAGNITUDES),
+            )
         obs = env.step(action)
         total_reward += obs.reward or 0.0
         trace.append(
@@ -40,7 +50,9 @@ def random_episode(
                 "step": len(trace),
                 "action": action.intent,
                 "score": obs.p1_score,
-                "best_score": obs.best_score,
+                "evaluation_fidelity": obs.evaluation_fidelity,
+                "constraints_satisfied": obs.constraints_satisfied,
+                "evaluation_failed": obs.evaluation_failed,
                 "reward": obs.reward,
             }
         )
@@ -58,7 +70,8 @@ def main(n_episodes: int = 20) -> None:
        rewards.append(total_reward)
        print(
            f"Episode {i:3d}: steps={len(trace) - 1} "
-            f"final_score={final['score']:.6f} best_score={final['best_score']:.6f} "
+            f"final_score={final['score']:.6f} fidelity={final['evaluation_fidelity']} "
+            f"constraints={'yes' if final['constraints_satisfied'] else 'no'} "
            f"reward={total_reward:+.4f}"
        )
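Both baselines now spend the last budget unit on an explicit `submit`. The rule can be sketched against a hypothetical stub observation (the real `StellaratorAction` and observation types live in the repo; `Obs` below is a stand-in):

```python
from dataclasses import dataclass

@dataclass
class Obs:
    # Hypothetical stand-in for the real observation; only the field
    # the policy rule needs is modeled here.
    budget_remaining: int

def choose_intent(obs: Obs) -> str:
    # Spend the last budget unit on a deliberate high-fidelity submit,
    # since budget exhaustion pays a smaller terminal reward than submit.
    return "submit" if obs.budget_remaining <= 1 else "run"

intents = [choose_intent(Obs(budget_remaining=b)) for b in (6, 3, 2, 1)]
print(intents)  # → ['run', 'run', 'run', 'submit']
```

This keeps every baseline episode ending on the high-fidelity path, which is what the comparison script's fidelity guard asserts.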
 
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -244,8 +244,10 @@ The observation should expose:
 - `failure_reason`
 - `step_number`
 - `budget_remaining`
-- `best_score`
-- `best_feasibility`
+- `best_low_fidelity_score`
+- `best_low_fidelity_feasibility`
+- `best_high_fidelity_score`
+- `best_high_fidelity_feasibility`
 - `target_spec`
 - concise textual summary of the last action outcome in `diagnostics_text`
 
@@ -253,10 +255,9 @@ The observation must be interpretable by a human without additional hidden state
 
 Current runtime note:
 
-- `best_score` and `best_feasibility` are not yet fully split by fidelity in the observation schema
-- low-fidelity run observations display rollout best state
-- high-fidelity submit observations may display high-fidelity best state instead
-- keep that distinction explicit in docs and traces until the contract is simplified further
+- the live observation surface now exposes explicit low-fidelity and high-fidelity best-state fields
+- low-fi run steps and high-fi submit steps no longer overload one generic `best_score` field
+- traces and baselines should use the explicit fields instead of reconstructing a mixed best-state story
 
 ### Action Space
 
@@ -625,10 +626,11 @@ Deliverables:
 
 ### Phase 2
 
-Freeze initial fixtures and manual-playtest the environment.
+Audit observation clarity, then freeze initial fixtures and manual-playtest the environment.
 
 Deliverables:
 
+- observation semantics note covering low-fi vs high-fi reporting and best-state fields
 - one good or near-boundary fixture
 - bad fixtures
 - 5 to 10 episode logs
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED
@@ -98,35 +98,40 @@ Transition rule:
 ## Hour 2-4: Verify Wiring, Then Manual Playtest
 
 1. Run a small measured sweep on the repaired family before freezing defaults.
-2. Run fixture checks:
+2. Audit observation clarity:
+   - low-fi `run` metrics are clearly labeled
+   - high-fi `submit` metrics are clearly labeled
+   - low-fidelity and high-fidelity best-state fields are explicit and human-readable
+3. Run fixture checks:
    - known-good or near-winning design
    - near-boundary designs
    - clearly bad designs
    - do not rely on the current default baseline params as the only starting point
-3. Confirm:
+4. Confirm:
    - verifier outputs are sane
    - reward ordering is sane
    - objective direction is correct
-4. Manually play 5 to 10 episodes.
-5. Log for each step:
+5. Manually play 5 to 10 episodes.
+6. Log for each step:
    - observation
    - chosen action
   - expected effect
   - returned reward
   - confusion or exploit if observed
-6. Identify at least one bad incentive or exploit.
-7. Patch reward or penalty logic immediately.
-8. Write the reward shaping story:
+7. Identify at least one bad incentive or exploit.
+8. Patch reward or penalty logic immediately.
+9. Write the reward shaping story:
   - initial reward V0
   - bad behavior
   - refinement to reward V1
   - improved behavior
-9. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
+10. If no real pathology appears, record that `Reward V0` survived playtesting and move on.
 
 Exit condition: you can explain why the environment now rewards the intended behavior.
 
 Artifacts:
 - measured range and delta note
+- observation semantics note
 - fixture check note
 - manual playtest log
 - reward shaping note
docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -160,8 +160,10 @@ Keep:
 - `failure_reason`
 - `step_number`
 - `budget_remaining`
-- `best_score`
-- `best_feasibility`
+- `best_low_fidelity_score`
+- `best_low_fidelity_feasibility`
+- `best_high_fidelity_score`
+- `best_high_fidelity_feasibility`
 - `target_spec`
 - `diagnostics_text`
 
@@ -170,7 +172,7 @@ Add clarity about fidelity:
 - low-fidelity step-time metrics should be labeled as such
 - high-fidelity submit-time metrics should be labeled as such
 - do not expose them as if they are the same truth surface
-- in the current runtime, `best_score` and `best_feasibility` can switch meaning with fidelity context, so traces and baselines should not treat them as one invariant metric yet
+- the live runtime should expose separate low-fidelity and high-fidelity best-state fields instead of overloading one generic best-state metric
 
 This can be done either by:
 
@@ -182,7 +184,7 @@ The minimum requirement is that a reader can tell whether a metric came from low
 Current repo state:
 
 - the live observation surface now exposes evaluation fidelity and failure state
-- the exact naming can still be refined after playtesting, but low-fi vs high-fi is no longer implicit
+- the live observation surface now exposes separate low-fidelity and high-fidelity best-state fields
 - terminal reward/reporting is now fidelity-consistent: `submit` compares against high-fi reference state instead of low-fi rollout score state
 
 ## Reward V0
@@ -217,7 +219,7 @@ Do not use reward complexity to compensate for missing action expressivity or mi
 
 Additional fidelity rule:
 
-- do not compare a high-fidelity submit score against low-fidelity `initial_score` or `best_score` state
+- do not compare a high-fidelity submit score against low-fidelity baseline state
 - terminal reward and submit summaries should use a fidelity-consistent basis
 
 ## Reset Strategy
docs/PIVOT_P1_ROTATING_ELLIPSE.md CHANGED
@@ -147,16 +147,18 @@ evaluation_failed: bool
 failure_reason: str
 step_number: int
 budget_remaining: int
-best_score: float
-best_feasibility: float
+best_low_fidelity_score: float
+best_low_fidelity_feasibility: float
+best_high_fidelity_score: float | None
+best_high_fidelity_feasibility: float | None
 target_spec: str
 ```
 
 Current requirement:
 
 - the observation and diagnostics text should make the low-fi vs high-fi distinction explicit
-- in the current runtime, `best_score` and `best_feasibility` may reflect low-fidelity rollout state during `run` and high-fidelity best state during `submit`
-- do not narrate those fields as one fidelity-independent quantity until the contract is simplified further
+- best-state reporting should be split explicitly between low-fidelity rollout state and high-fidelity submit state
+- do not narrate low-fi and high-fi best-state fields as one combined metric
 
 ### Reward V0
 
@@ -195,9 +197,12 @@ Current execution note:
 step_count: int
 current_params: {aspect_ratio, elongation, rotational_transform, triangularity_scale}
 best_params: {aspect_ratio, elongation, rotational_transform, triangularity_scale}
-initial_score: float
-best_score: float
-best_feasibility: float
+initial_low_fidelity_score: float
+initial_high_fidelity_score: float | None
+best_low_fidelity_score: float
+best_low_fidelity_feasibility: float
+best_high_fidelity_score: float | None
+best_high_fidelity_feasibility: float | None
 history: list[str]
 ```
 
fusion_lab/models.py CHANGED
@@ -24,6 +24,15 @@ class LowDimBoundaryParams(BaseModel):
     triangularity_scale: float
 
 
+def default_low_dim_boundary_params() -> LowDimBoundaryParams:
+    return LowDimBoundaryParams(
+        aspect_ratio=3.6,
+        elongation=1.4,
+        rotational_transform=1.5,
+        triangularity_scale=0.55,
+    )
+
+
 class StellaratorAction(Action):
     intent: ActionIntent
     parameter: ParameterName | None = None
@@ -46,43 +55,24 @@ class StellaratorObservation(Observation):
     failure_reason: str = ""
     step_number: int = 0
     budget_remaining: int = 6
-    best_score: float = 0.0
-    best_feasibility: float = float("inf")
+    best_low_fidelity_score: float = 0.0
+    best_low_fidelity_feasibility: float = float("inf")
+    best_high_fidelity_score: float | None = None
+    best_high_fidelity_feasibility: float | None = None
     constraints_satisfied: bool = True
     target_spec: str = ""
 
 
 class StellaratorState(State):
-    initial_params: LowDimBoundaryParams = Field(
-        default_factory=lambda: LowDimBoundaryParams(
-            aspect_ratio=3.6,
-            elongation=1.4,
-            rotational_transform=1.6,
-            triangularity_scale=0.55,
-        )
-    )
-    current_params: LowDimBoundaryParams = Field(
-        default_factory=lambda: LowDimBoundaryParams(
-            aspect_ratio=3.6,
-            elongation=1.4,
-            rotational_transform=1.6,
-            triangularity_scale=0.55,
-        )
-    )
-    best_params: LowDimBoundaryParams = Field(
-        default_factory=lambda: LowDimBoundaryParams(
-            aspect_ratio=3.6,
-            elongation=1.4,
-            rotational_transform=1.6,
-            triangularity_scale=0.55,
-        )
-    )
-    initial_score: float = 0.0
+    initial_params: LowDimBoundaryParams = Field(default_factory=default_low_dim_boundary_params)
+    current_params: LowDimBoundaryParams = Field(default_factory=default_low_dim_boundary_params)
+    best_params: LowDimBoundaryParams = Field(default_factory=default_low_dim_boundary_params)
+    initial_low_fidelity_score: float = 0.0
     initial_high_fidelity_score: float | None = None
-    best_score: float = 0.0
-    best_feasibility: float = float("inf")
+    best_low_fidelity_score: float = 0.0
+    best_low_fidelity_feasibility: float = float("inf")
     best_high_fidelity_score: float | None = None
-    best_high_fidelity_feasibility: float = float("inf")
+    best_high_fidelity_feasibility: float | None = None
     budget_total: int = 6
     budget_remaining: int = 6
     episode_done: bool = False
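The refactor above collapses three duplicated `lambda` defaults into one shared factory. The same pattern, sketched with stdlib dataclasses so it runs without pydantic (pydantic's `Field(default_factory=...)` behaves analogously):

```python
from dataclasses import dataclass, field

@dataclass
class LowDimBoundaryParams:
    aspect_ratio: float
    elongation: float
    rotational_transform: float
    triangularity_scale: float

def default_low_dim_boundary_params() -> LowDimBoundaryParams:
    # One shared factory instead of three copy-pasted lambdas keeps the
    # default design in a single place.
    return LowDimBoundaryParams(
        aspect_ratio=3.6,
        elongation=1.4,
        rotational_transform=1.5,
        triangularity_scale=0.55,
    )

@dataclass
class StellaratorState:
    initial_params: LowDimBoundaryParams = field(default_factory=default_low_dim_boundary_params)
    current_params: LowDimBoundaryParams = field(default_factory=default_low_dim_boundary_params)
    best_params: LowDimBoundaryParams = field(default_factory=default_low_dim_boundary_params)

state = StellaratorState()
# Each field gets its own instance, so mutating one does not alias the others.
state.current_params.elongation = 2.0
print(state.initial_params.elongation)  # → 1.4
```

Because the factory is called once per field, the three param sets start equal but never share identity, which the in-place `best_params` updates in the environment loop rely on.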
hackathan_raw_guidance.md DELETED
@@ -1,239 +0,0 @@
1
- ## **OpenEnv Hackathon Participant Guide**
2
-
3
- Welcome to the [OpenEnv Hackathon](https://cerebralvalley.ai/e/open-env-hackathon), hacker! 👋 We’re thrilled to have you on board.
4
-
5
- This guide is your all-in-one resource for the event, including schedule, rules, technical resources, problem statements, judging information, and more. Please read this carefully; most answers can be found here.
6
-
7
- ## **1. Join the [PyTorch Discord Server](https://discord.gg/VBcf6VtfY6)**
8
-
9
- - You’ll be given a Hackathon Participant role by an admin, which will give you access to the hackathon-specific channels.
10
-
11
- - Here, you’ll be able to interact with hackers and sponsors, introduce yourselves, and form teams (for a maximum team size of **3**).
12
-
13
- - If you don't receive your role within **24 hours of joining,** please ping @CV.
14
-
15
- - Please submit your Discord username below so we can grant you the role
16
-
17
- [linkEmbed]
18
-
19
- ## **2. Location**
20
-
21
- **|** Shack15 (1 Ferry Building, Suite 201, San Francisco CA. 94111)
22
-
23
- - **Venue Access:** Shack15 is on the 2nd floor of the Ferry Building. Go up the Ferry Building elevator to the second floor, and turn left. Here you will see the main entrance to Shack15. 
24
-
25
- - **Parking:** Parking near the Ferry Building is extremely limited. Consider parking farther out and taking Uber, Lyft, or Public Transportation. 
26
-
27
- [youtube]
28
-
29
- ## **3. WiFi Information**
30
-
31
- - **Username:** SHACK15_Members
32
-
33
- - **Password:** M3mb3r$4L!f3
34
-
35
- ## **4. Hackathon Schedule**
36
-
37
- **Saturday, March 7 (Outline)**
38
-
39
- - **9:00 AM:** Doors Open •󠁏 Breakfast Served •󠁏 Team Formation
40
-
41
- - **10:00 AM – 11:30AM**: Kick-off presentations with Meta, Hugging Face, UC Berkeley, CoreWeave, OpenPipe, Unsloth AI, Fleet AI, Mercor, Scaler AI Labs, Snorkel AI, Patronus AI, Halluminate and Scale AI
42
-
43
- - **11:30 AM:** Hacking Begins
44
-
45
- - **1:00 PM:** Lunch Served
46
-
47
- - **6:00 PM:** Dinner Served
48
-
49
- - **10:00 PM:** Doors Close •󠁏 Re-entry not permitted
50
-
51
- **Sunday, March 8 (Outline)**
52
-
53
- - **9:00AM:** Doors Open •󠁏 Breakfast Served
54
-
55
- - **1:00PM:** Hacking stops •󠁏 Submissions Due
56
-
57
- - **1:15PM:** First Round Judging Begins
58
-
59
- - **2:00PM:** Lunch Served
60
-
61
- - **3:00PM:** Final Round Judging Begins
62
-
63
- - **4:00PM:** Winners Announced and Closing
64
-
65
- - **5:00PM:** Doors Close
66
-
67
- All presentation slides can be found here
68
-
69
- [linkEmbed]
70
-
71
- ## **5. Hackathon and Submission Rules**
-
- To keep things fair and aligned with our goals, all teams must follow these rules:
-
- - **Open Source:** Please ensure your repository is public.
-
- - **New Work Only:** All projects must be started from scratch during the hackathon; no previous work is allowed.
-
- - **Team Size:** Teams may have up to **3** members.
-
- - **Banned Projects:** Projects will be disqualified if they violate legal, ethical, or platform policies, or use code, data, or assets you do not have the rights to.
-
- - Your project **must** use OpenEnv (stable release 0.2.1) deployed on HF Spaces.
-
- - You must show a minimal training script for your environment using Unsloth or HF TRL in Colab.
-
- - You must upload a **one-minute** demo video to YouTube talking about your submission.
-
- ## **6. Hackathon Problem Statements**
-
- Your project must address at least **one of the five** required problem statements.
-
- - Some problem statements include **optional partner-sponsored sub-problem statements**, which are additional focus areas related to the main theme.
-
- - Your project may align with **multiple partner sub-problem statements**, but you can only be **judged for a maximum of two**. Please **select up to two** when submitting.
-
- - Projects that match these partner sub-problem statements are eligible for **extra partner prizes**, judged separately from the main track winners.
-
- - Each partner sub-problem statement carries a prize of **$10,000 USD**.
-
- **Statement 1: Multi-Agent Interactions**
-
- Environments for this theme involve cooperation, competition, negotiation, and coalition formation. Learning from these environments will enable agents to model the beliefs and incentives of others in partially observable settings, driving theory-of-mind reasoning and emergent strategic behavior.
-
- - **Expected Outcome:** an environment that can be used to train multi-agent task handling in an LLM
-
- - **Example Environments:** Market simulations, compute-allocation negotiations, collaborative puzzle worlds, mixed cooperative/competitive strategy games.
-
- - **Partner Sub-Themes:**
-   - **Fleet AI:** Scalable Oversight: Environments that train oversight agents to monitor, analyze, and explain the behavior of other AI agents operating in complex, multi-agent settings.
-   - **Halluminate:** Multi-Actor Environments: Build a realistic environment where an agent interacts with and manages multiple actors (agents) to discover and achieve the task.
-
- **Statement 2: (Super) Long-Horizon Planning & Instruction Following**
-
- You will build environments that require deep, multi-step reasoning with sparse or delayed rewards. The goal is to enable agents to decompose goals, track state over extended trajectories, and recover from early mistakes, pushing beyond shallow next-token reasoning toward structured planning and durable internal representations.
-
- - **Expected Outcome:** an environment that can capture and improve LLM behaviour on challenging long-horizon tasks that need long-running sessions beyond context-memory limits.
-
- - **Example Environments:** Research-planning simulators, large-scale codebase refactoring tasks, strategic resource management worlds, long-horizon logistics optimization, extremely complicated long-horizon instruction following (e.g., 300 instructions scattered around).
-
- - **Partner Sub-Themes:**
-   - **Mercor:** Make an environment with capped/uncapped rewards where frontier model rewards scale with token output.
-   - **Scale AI:** Environments for long-horizon workflows for non-code use cases within a business setting, focusing on either Sales, Project Management, or HR & IT.
-
- **Statement 3: World Modeling**
-
- - **Statement 3.1: Professional Tasks:** Here you will develop environments that require real interaction with tools, APIs, or dynamic systems, where the model is expected to do real, hard work instead of exploiting shortcuts to arrive at the desired outcome. Learning from these environments will enable agents to maintain consistent internal state, update beliefs based on outcomes, and orchestrate multi-step workflows. The goal is to strengthen causal reasoning and persistent world models.
-   - **Expected Outcome:** an environment capturing the nuances of a defined partially observable world and improving LLM interaction with it
-
-   - **Example Environments:** Dynamic browser/API ecosystems, enterprise applications, scientific workflow loops (papers → code → experiments), economic simulations with feedback, tool-discovery benchmarks.
-
-   - **Partner Sub-Theme:**
-     - **Scaler AI Labs:** Multi-App RL Environment for Enterprise Workflows: Create RL environments that demonstrate complex workflows, business-rule nuances, etc. in a large enterprise.
-
- - **Statement 3.2: Personalized Tasks:** Here you will develop an environment that offers real personalized task handling: imagine replying to personal messages, resolving dinner conflicts caused by work conflicts, or replying to tough emails. Think of any personal-assistant task.
-   - **Expected Outcome:** an environment that gives the model a realistic simulation of handling personal tasks and conflicts, and managing them as delegations
-
-   - **Example Environments:** Executive-assistant meeting planner, dinner and drive planning, email and message replying, etc.
-
-   - **Partner Sub-Theme:**
-     - **Patronus AI:** Consumer Workflows with Schema Drift: Multi-step consumer workflow environments where the underlying data schemas, API contracts, and T&Cs/policies/rules change.
-
- **Statement 4: Self-Improvement**
-
- The focus here is to create environments where agents can learn to generate new challenges, escalate difficulty, and improve through self-play or adaptive curricula. Rather than optimizing fixed tasks, the goal is for agents to learn to drive their own capability growth. The objective is recursive skill amplification.
-
- - **Expected Outcome:** an environment for improving self-play of an LLM over a defined set of tasks
-
- - **Example Environments:** Self-play negotiation arenas, auto-generated math/proof tasks, evolving coding competitions, adaptive RL curricula.
-
- - **Partner Sub-Theme:**
-   - **Snorkel AI:** Simulated Experts-in-the-Loop: An environment that simulates interactions with real subject-matter experts, with changing requirements and preferences.
-
- **Statement 5: Wild Card - Impress Us!**
-
- We do not want to limit your focus if your idea doesn’t fit the boxes above; we want, and WILL reward, out-of-the-box tasks. Please be creative, but remember to submit work that meaningfully adds value to LLM training on a concrete task.
-
- More details about each theme can be found here:
-
- [linkEmbed]
-
- ## **7. CV Hackathon Winners**
-
- [linkEmbed]
-
- ## **8. OpenEnv Provided Resources**
-
- **Please read through the entire slideshow here. It includes:**
-
- - OpenEnv Fundamentals and Architecture
- - Local Dev, Docker, and HF Spaces Deployment
- - OpenEnv in Practice
- - Training (TRL & Unsloth)
- - How to Access Infrastructure (including the GPU Request Form)
-
- [linkEmbed]
-
- ## **9. Partner Provided Resources**
-
- - **Unsloth AI Resources**
-   - <https://unsloth.ai/docs/get-started/unsloth-notebooks#grpo-reasoning-rl-notebooks>
- - **Mercor Resources**
-   - Dataset: <https://huggingface.co/datasets/mercor/apex-agents>
-   - Archipelago repo to run the eval: <https://github.com/Mercor-Intelligence/archipelago>
-   - APEX-Agents paper: <https://arxiv.org/abs/2601.14242>
- - **Hugging Face Resources**
-   - **$30** in Compute and Inference Credits
-   - To claim your credits, set up an HF account here: <https://huggingface.co/join>
-   - Then, follow this link: <https://huggingface.co/openenv-community>
-   - You will be granted **$30** of compute and inference credits!
- - **Northflank Resources**
-   - Each team gets an H100
-   - Northflank instructions:
-
- [linkEmbed]
-
-   - Join the Northflank Discord channel for any questions
-   - Please fill out this form:
-
- [linkEmbed]
-
- - **Cursor Resources**
-   - **$50** in Cursor Credits; **apply below**
-
- [linkEmbed]
-
- ## **10. Judging & Submissions**
-
- Judging will take place on **Sunday, March 8**. The judges will evaluate your **technical demos** in the categories below. _Show us what you have built_ to solve our problem statements; please **do not** show us a presentation. We'll be checking to ensure your project was built **entirely during the event**; no previous work is allowed.
-
- **|** **Teams should submit [here](https://cerebralvalley.ai/e/openenv-hackathon-sf/hackathon/submit) when they have completed hacking.** In the submission form, you will have to upload a **one-minute** demo video on YouTube talking about your submission. You must also show a minimal training script for your environment using Unsloth or HF TRL in Colab.
-
- **Please ensure your project uses** OpenEnv (stable release 0.2.1) deployed on HF Spaces.
-
- [linkEmbed]
-
- **Judging Criteria**
-
- - **Environment Innovation (40%) -** Is the environment novel, creative, or challenging? Does it meaningfully test the agent’s behavior?
- - **Storytelling (30%) -** Does the team clearly explain the problem, environment, and agent behavior? Is the demo engaging and easy to follow?
- - **Training Script Showing Improvement in Rewards (20%) -** Does the demo provide observable evidence of training progress (reward curves, metrics, or before/after behavior)?
- - **Reward and Training Pipeline Setup (10%) -** Is the reward logic coherent, and does the pipeline produce meaningful improvement in the agent’s inference (how it acts in the environment)?
-
- **Judging Process**
-
- **|** Judging proceeds in two rounds:
-
- - Hackers will be assigned groups of judges; ~3 minutes to pitch followed by 1-2 minutes of Q/A.
-
- - The top **six** teams in the ranking will demo on stage to a panel of judges; ~3 minutes to pitch followed by 2-3 minutes of Q/A.
-
- ## **11. Prizes**
-
- - **1st Place:** $15,000 USD Cash
-
- - **2nd Place:** $9,000 USD Cash
-
- - **3rd Place:** $6,000 USD Cash
server/contract.py ADDED
@@ -0,0 +1,26 @@
+ from __future__ import annotations
+
+ from typing import Final
+
+ from fusion_lab.models import LowDimBoundaryParams, default_low_dim_boundary_params
+
+ N_FIELD_PERIODS: Final[int] = 3
+ DEFAULT_RESET_SEED: Final[LowDimBoundaryParams] = default_low_dim_boundary_params()
+
+ RESET_SEEDS: Final[tuple[LowDimBoundaryParams, ...]] = (
+     DEFAULT_RESET_SEED,
+     LowDimBoundaryParams(
+         aspect_ratio=3.4,
+         elongation=1.4,
+         rotational_transform=1.6,
+         triangularity_scale=0.55,
+     ),
+     LowDimBoundaryParams(
+         aspect_ratio=3.8,
+         elongation=1.4,
+         rotational_transform=1.5,
+         triangularity_scale=0.55,
+     ),
+ )
+
+ SMOKE_TEST_PARAMS: Final[LowDimBoundaryParams] = DEFAULT_RESET_SEED
server/environment.py CHANGED
@@ -71,10 +71,9 @@ class StellaratorEnvironment(
             initial_params=params,
             current_params=params,
             best_params=params,
-            initial_score=metrics.p1_score,
-            best_score=metrics.p1_score,
-            best_feasibility=metrics.p1_feasibility,
-            best_high_fidelity_feasibility=float("inf"),
+            initial_low_fidelity_score=metrics.p1_score,
+            best_low_fidelity_score=metrics.p1_score,
+            best_low_fidelity_feasibility=metrics.p1_feasibility,
             budget_total=BUDGET,
             budget_remaining=BUDGET,
             episode_done=False,
@@ -151,13 +150,13 @@
 
     def _handle_submit(self) -> StellaratorObservation:
         metrics = self._evaluate_params(self._state.current_params, fidelity="high")
-        initial_submit_metrics = self._initial_high_fidelity_metrics()
+        initial_submit_score = self._initial_high_fidelity_score()
         best_submit_metrics = self._refresh_best_high_fidelity_metrics(metrics)
         reward = self._compute_reward(
             metrics,
             "submit",
             done=True,
-            initial_reference_score=initial_submit_metrics.p1_score,
+            initial_reference_score=initial_submit_score,
         )
         summary = self._summary_submit(metrics, best_submit_metrics)
         self._state.history.append(summary)
@@ -253,7 +252,7 @@
         base_score = (
             initial_reference_score
             if initial_reference_score is not None
-            else self._state.initial_score
+            else self._state.initial_low_fidelity_score
         )
         improved = metrics.constraints_satisfied and metrics.p1_score > base_score
         if improved:
@@ -274,6 +273,10 @@
         reward: float | None = None,
         done: bool = False,
     ) -> StellaratorObservation:
+        best_low_fidelity_score = self._state.best_low_fidelity_score
+        best_low_fidelity_feasibility = self._state.best_low_fidelity_feasibility
+        best_high_fidelity_score = self._state.best_high_fidelity_score
+        best_high_fidelity_feasibility = self._state.best_high_fidelity_feasibility
         text_lines = [
             action_summary,
             "",
@@ -284,13 +287,20 @@
             text_lines.append(f"failure_reason={metrics.failure_reason}")
         text_lines.extend(
             [
-                f"max_elongation={metrics.max_elongation:.4f} | best_score={self._display_best_score(metrics):.6f}",
+                f"max_elongation={metrics.max_elongation:.4f}",
                 f"aspect_ratio={metrics.aspect_ratio:.4f} (<= {ASPECT_RATIO_MAX:.1f})",
                 f"average_triangularity={metrics.average_triangularity:.4f} (<= {AVERAGE_TRIANGULARITY_MAX:.1f})",
                 f"edge_iota_over_nfp={metrics.edge_iota_over_nfp:.4f} (>= {EDGE_IOTA_OVER_NFP_MIN:.1f})",
+                f"feasibility={metrics.p1_feasibility:.6f}",
+                f"best_low_fidelity_score={best_low_fidelity_score:.6f}",
+                f"best_low_fidelity_feasibility={best_low_fidelity_feasibility:.6f}",
                 (
-                    f"feasibility={metrics.p1_feasibility:.6f} | "
-                    f"best_feasibility={self._display_best_feasibility(metrics):.6f}"
+                    "best_high_fidelity_score="
+                    f"{self._format_optional_metric(best_high_fidelity_score)}"
+                ),
+                (
+                    "best_high_fidelity_feasibility="
+                    f"{self._format_optional_metric(best_high_fidelity_feasibility)}"
                 ),
                 f"vacuum_well={metrics.vacuum_well:.4f}",
                 f"constraints={'SATISFIED' if metrics.constraints_satisfied else 'VIOLATED'}",
@@ -312,8 +322,10 @@
             failure_reason=metrics.failure_reason,
             step_number=self._state.step_count,
             budget_remaining=self._state.budget_remaining,
-            best_score=self._display_best_score(metrics),
-            best_feasibility=self._display_best_feasibility(metrics),
+            best_low_fidelity_score=best_low_fidelity_score,
+            best_low_fidelity_feasibility=best_low_fidelity_feasibility,
+            best_high_fidelity_score=best_high_fidelity_score,
+            best_high_fidelity_feasibility=best_high_fidelity_feasibility,
             constraints_satisfied=metrics.constraints_satisfied,
             target_spec=TARGET_SPEC,
             reward=reward,
@@ -372,7 +384,7 @@
             return f"Restore-best failed during low-fidelity evaluation: {metrics.failure_reason}"
         return (
             "Restored the best-known design. "
-            f"Score={metrics.p1_score:.6f}, feasibility={metrics.p1_feasibility:.6f}."
+            f"Low-fidelity score={metrics.p1_score:.6f}, feasibility={metrics.p1_feasibility:.6f}."
         )
 
     def _initial_params(self, seed: int | None) -> LowDimBoundaryParams:
@@ -427,12 +439,12 @@
             and self._last_metrics.evaluation_failed
         )
 
-    def _initial_high_fidelity_metrics(self) -> EvaluationMetrics:
+    def _initial_high_fidelity_score(self) -> float:
         if self._state.initial_high_fidelity_score is not None:
-            return self._evaluate_params(self._state.initial_params, fidelity="high")
+            return self._state.initial_high_fidelity_score
         metrics = self._evaluate_params(self._state.initial_params, fidelity="high")
         self._state.initial_high_fidelity_score = metrics.p1_score
-        return metrics
+        return metrics.p1_score
 
     def _refresh_best_high_fidelity_metrics(
         self,
@@ -446,21 +458,10 @@
         self._state.best_high_fidelity_feasibility = best_metrics.p1_feasibility
         return best_metrics
 
-    def _display_best_score(self, metrics: EvaluationMetrics) -> float:
-        if (
-            metrics.evaluation_fidelity == "high"
-            and self._state.best_high_fidelity_score is not None
-        ):
-            return self._state.best_high_fidelity_score
-        return self._state.best_score
-
-    def _display_best_feasibility(self, metrics: EvaluationMetrics) -> float:
-        if (
-            metrics.evaluation_fidelity == "high"
-            and self._state.best_high_fidelity_score is not None
-        ):
-            return self._state.best_high_fidelity_feasibility
-        return self._state.best_feasibility
+    def _format_optional_metric(self, value: float | None) -> str:
+        if value is None:
+            return "n/a"
+        return f"{value:.6f}"
 
     def _update_best(self, params: LowDimBoundaryParams, metrics: EvaluationMetrics) -> None:
         if metrics.evaluation_failed:
@@ -470,11 +471,11 @@
             (1, metrics.p1_score) if metrics.constraints_satisfied else (0, -metrics.p1_feasibility)
         )
         best = (
-            (1, self._state.best_score)
-            if self._state.best_feasibility <= FEASIBILITY_TOLERANCE
-            else (0, -self._state.best_feasibility)
+            (1, self._state.best_low_fidelity_score)
+            if self._state.best_low_fidelity_feasibility <= FEASIBILITY_TOLERANCE
+            else (0, -self._state.best_low_fidelity_feasibility)
         )
         if current > best:
             self._state.best_params = params
-            self._state.best_score = metrics.p1_score
-            self._state.best_feasibility = metrics.p1_feasibility
+            self._state.best_low_fidelity_score = metrics.p1_score
+            self._state.best_low_fidelity_feasibility = metrics.p1_feasibility
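The `_update_best` hunk above keeps the lexicographic (feasible-flag, score) comparison intact while renaming the state fields to their low-fidelity names. A standalone sketch of that ordering, with the `FEASIBILITY_TOLERANCE` value assumed here rather than taken from the repo:

```python
# Sketch of the best-design ranking used in _update_best: a feasible design
# (flag 1) always outranks an infeasible one (flag 0); ties on the flag fall
# back to the second tuple element, which is the raw score for feasible
# designs and the negated constraint violation for infeasible ones.
FEASIBILITY_TOLERANCE = 1e-6  # assumed value for illustration


def rank_key(score: float, feasibility: float) -> tuple[int, float]:
    feasible = feasibility <= FEASIBILITY_TOLERANCE
    return (1, score) if feasible else (0, -feasibility)


# A feasible design beats any infeasible one regardless of raw score.
assert rank_key(0.1, 0.0) > rank_key(0.9, 0.5)
# Among infeasible designs, the smaller constraint violation wins.
assert rank_key(0.0, 0.2) > rank_key(0.0, 0.4)
```

Negating the feasibility gap on the infeasible branch is what lets a plain tuple `>` comparison prefer "closer to feasible" without any custom comparator.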
tests/test_repo_scaffold.py DELETED
@@ -1,9 +0,0 @@
- from server.environment import TASK, environment_status
-
-
- def test_environment_scaffold_status() -> None:
-     assert environment_status() == "scaffolded"
-
-
- def test_task_budget_is_fixed() -> None:
-     assert TASK["budget"] == 6