CreativeEngineer committed on
Commit
8bf0155
1 Parent(s): 567ff67

feat: add replay playtest and tighten fail-fast validation

README.md CHANGED
@@ -30,7 +30,7 @@ Implementation status:
30
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
31
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
32
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
33
- - the next runtime work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, and deployment evidence
34
 
35
  ## Execution Status
36
 
@@ -52,7 +52,7 @@ Implementation status:
52
  - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
53
  - [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
54
  - [x] Add tracked `P1` fixtures under `server/data/p1/`
55
- - [ ] Run a tiny low-fi PPO smoke run, then record at least one submit-side manual trace and the first real reward pathology
56
  - [ ] Refresh the heuristic baseline for the real verifier path
57
  - [ ] Deploy the real environment to HF Space
58
 
@@ -67,12 +67,12 @@ Implementation status:
67
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
68
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
69
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
70
- - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run alongside high-fidelity fixture pairing, then a real `submit` trace.
71
 
72
  Current mode:
73
 
74
  - strategic task choice is already locked
75
- - the next work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, smoke validation, and deployment
76
  - new planning text should only appear when a real blocker forces a decision change
77
 
78
  ## Planned Repository Layout
@@ -117,7 +117,7 @@ uv sync --extra notebooks
117
  - Recommended compute workspace: Northflank Jupyter Notebook with PyTorch on the team H100
118
  - OpenEnv deployment target: Hugging Face Spaces
119
  - Minimal submission notebook target: Colab
120
- - Required notebook artifact: one public Colab notebook, even if it only demonstrates evaluation traces rather than a trained policy
121
  - Verifier of record: `constellaration.problems.GeometricalProblem`
122
  - Environment style: fresh wiring in this repo, not a port of the old `ai-sci-feasible-designs` harness
123
  - Northflank containers are ephemeral, so persistent storage should be attached before relying on saved models, caches, or fixture data
@@ -126,10 +126,10 @@ uv sync --extra notebooks
126
 
127
  ## Immediate Next Steps
128
 
129
- - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
130
- - [ ] Run a tiny low-fidelity PPO smoke run and save a few trajectories.
131
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
132
- - [ ] Run at least one submit-side manual trace and record the first real reward pathology, if any.
133
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
134
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
135
  - [ ] Deploy the environment to HF Space.
 
30
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
31
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
32
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
33
+ - the next runtime work is a tiny low-fi PPO smoke run as a diagnostic-only check, followed immediately by paired high-fidelity fixture checks and one real submit-side manual trace
34
 
35
  ## Execution Status
36
 
 
52
  - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
53
  - [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
54
  - [x] Add tracked `P1` fixtures under `server/data/p1/`
55
+ - [ ] Run a tiny low-fi PPO smoke run as a diagnostic-only check, then complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
56
  - [ ] Refresh the heuristic baseline for the real verifier path
57
  - [ ] Deploy the real environment to HF Space
58
 
 
67
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
68
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
69
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
70
+ - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run used only to surface obvious learnability bugs, followed immediately by high-fidelity fixture pairing and one real `submit` trace.
71
 
72
  Current mode:
73
 
74
  - strategic task choice is already locked
75
+ - the next work is a tiny low-fi PPO smoke run as a diagnostic-only check, then paired high-fidelity fixture checks, one submit-side manual trace, heuristic refresh, smoke validation, and deployment
76
  - new planning text should only appear when a real blocker forces a decision change
77
 
78
  ## Planned Repository Layout
 
117
  - Recommended compute workspace: Northflank Jupyter Notebook with PyTorch on the team H100
118
  - OpenEnv deployment target: Hugging Face Spaces
119
  - Minimal submission notebook target: Colab
120
+ - Required notebook artifact: one public Colab notebook that demonstrates trained-policy behavior against the environment
121
  - Verifier of record: `constellaration.problems.GeometricalProblem`
122
  - Environment style: fresh wiring in this repo, not a port of the old `ai-sci-feasible-designs` harness
123
  - Northflank containers are ephemeral, so persistent storage should be attached before relying on saved models, caches, or fixture data
 
126
 
127
  ## Immediate Next Steps
128
 
129
+ - [ ] Run a tiny low-fidelity PPO smoke run and stop after a few readable trajectories or one clear failure mode.
130
+ - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
131
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
132
+ - [ ] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
133
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
134
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
135
  - [ ] Deploy the environment to HF Space.
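The README note above, that budget exhaustion returns a smaller terminal reward than an explicit `submit`, can be sketched as a tiny function. The name `terminal_reward` and the discount factor are illustrative assumptions, not the environment's actual constants:

```python
# Illustrative sketch of the reward asymmetry described above: an agent
# that lets the budget run out receives a discounted terminal reward
# compared with one that submits deliberately. The discount factor is
# an assumption for illustration only.
def terminal_reward(best_score: float, via_submit: bool,
                    exhaustion_discount: float = 0.5) -> float:
    if via_submit:
        return best_score
    return best_score * exhaustion_discount


print(terminal_reward(0.8, via_submit=True))   # 0.8
print(terminal_reward(0.8, via_submit=False))  # 0.4
```

Keeping the deliberate-submit branch strictly larger is what preserves the "agents still prefer deliberate submission" incentive when tuning.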
TODO.md CHANGED
@@ -54,9 +54,9 @@ flowchart TD
54
  B["P1 Contract Lock"] --> D["P1 Models + Environment"]
55
  C["constellaration Physics Wiring"] --> D
56
  D --> P["Parameterization Repair"]
57
- P --> E["Fixture Checks"]
58
- E --> F["Tiny PPO Smoke"]
59
- F --> G["Submit-side Manual Playtest"]
60
  G --> H["Reward V1"]
61
  H --> I["Baselines"]
62
  I --> J["HF Space Deploy"]
@@ -196,6 +196,8 @@ flowchart TD
196
  fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
197
  Note:
198
  treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
 
 
199
 
200
  - [ ] Manual-playtest 5-10 episodes
201
  Goal:
 
54
  B["P1 Contract Lock"] --> D["P1 Models + Environment"]
55
  C["constellaration Physics Wiring"] --> D
56
  D --> P["Parameterization Repair"]
57
+ P --> F["Tiny PPO Smoke"]
58
+ F --> E["Fixture Checks"]
59
+ E --> G["Submit-side Manual Playtest"]
60
  G --> H["Reward V1"]
61
  H --> I["Baselines"]
62
  I --> J["HF Space Deploy"]
 
196
  fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
197
  Note:
198
  treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
199
+ stop after a few readable trajectories or one clear failure mode
200
+ paired high-fidelity fixture checks must happen immediately after this smoke pass
201
 
202
  - [ ] Manual-playtest 5-10 episodes
203
  Goal:
baselines/measured_sweep.py CHANGED
@@ -4,7 +4,11 @@ Validates ranges, crash zones, feasibility regions, and identifies
4
  candidate reset seeds for the repaired low-dimensional family.
5
 
6
  Usage:
7
- uv run python baselines/measured_sweep.py
8
  """
9
 
10
  from __future__ import annotations
@@ -29,15 +33,29 @@ def linspace_inclusive(low: float, high: float, n: int) -> list[float]:
29
  return [round(float(v), 4) for v in np.linspace(low, high, n)]
30
 
31
 
32
  def parse_args() -> argparse.Namespace:
33
  parser = argparse.ArgumentParser(
34
  description="Run a measured low-fidelity sweep over the repaired 4-knob family."
35
  )
36
- parser.add_argument(
 
37
  "--grid-points",
38
  type=int,
39
  default=3,
40
- help="Number of evenly spaced points per parameter range.",
41
  )
42
  parser.add_argument(
43
  "--output-dir",
@@ -48,12 +66,15 @@ def parse_args() -> argparse.Namespace:
48
  return parser.parse_args()
49
 
50
 
51
- def run_sweep(*, grid_points: int) -> list[dict]:
52
- if grid_points < 2:
53
- raise ValueError("--grid-points must be at least 2.")
54
- grids = {
55
- name: linspace_inclusive(lo, hi, grid_points) for name, (lo, hi) in SWEEP_RANGES.items()
56
- }
57
 
58
  configs = list(
59
  product(
@@ -108,7 +129,8 @@ def run_sweep(*, grid_points: int) -> list[dict]:
108
  f"{rate:.1f} eval/s"
109
  )
110
 
111
- return results
 
112
 
113
 
114
  def analyze(results: list[dict]) -> dict:
@@ -196,17 +218,28 @@ def analyze(results: list[dict]) -> dict:
196
 
197
  def main() -> None:
198
  args = parse_args()
199
- results = run_sweep(grid_points=args.grid_points)
200
 
201
  out_dir = args.output_dir
202
  out_dir.mkdir(exist_ok=True)
 
203
  timestamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
204
  out_path = out_dir / f"measured_sweep_{timestamp}.json"
205
 
206
  analysis = analyze(results)
207
 
208
  with open(out_path, "w") as f:
209
- json.dump({"analysis": analysis, "results": results}, f, indent=2)
210
  print(f"\nResults saved to {out_path}")
211
 
212
 
 
4
  candidate reset seeds for the repaired low-dimensional family.
5
 
6
  Usage:
7
+ # Broad evenly spaced grid (default 3 points per parameter; 5 shown here)
8
+ uv run python baselines/measured_sweep.py --grid-points 5
9
+
10
+ # Targeted sweep around the known feasible zone
11
+ uv run python baselines/measured_sweep.py --targeted
12
  """
13
 
14
  from __future__ import annotations
 
33
  return [round(float(v), 4) for v in np.linspace(low, high, n)]
34
 
35
 
36
+ TARGETED_VALUES: dict[str, list[float]] = {
37
+ "aspect_ratio": [3.4, 3.6, 3.8],
38
+ "elongation": [1.2, 1.4, 1.6],
39
+ "rotational_transform": [1.50, 1.55, 1.60, 1.65, 1.70, 1.75, 1.80],
40
+ "triangularity_scale": [0.55, 0.58, 0.60, 0.62, 0.65],
41
+ }
42
+
43
+
44
  def parse_args() -> argparse.Namespace:
45
  parser = argparse.ArgumentParser(
46
  description="Run a measured low-fidelity sweep over the repaired 4-knob family."
47
  )
48
+ mode = parser.add_mutually_exclusive_group()
49
+ mode.add_argument(
50
  "--grid-points",
51
  type=int,
52
  default=3,
53
+ help="Number of evenly spaced points per parameter range (default: 3).",
54
+ )
55
+ mode.add_argument(
56
+ "--targeted",
57
+ action="store_true",
58
+ help="Use the pre-defined targeted value set around the known feasible zone.",
59
  )
60
  parser.add_argument(
61
  "--output-dir",
 
66
  return parser.parse_args()
67
 
68
 
69
+ def run_sweep(*, grid_points: int, targeted: bool = False) -> tuple[list[dict], float]:
70
+ if targeted:
71
+ grids = TARGETED_VALUES
72
+ else:
73
+ if grid_points < 2:
74
+ raise ValueError("--grid-points must be at least 2.")
75
+ grids = {
76
+ name: linspace_inclusive(lo, hi, grid_points) for name, (lo, hi) in SWEEP_RANGES.items()
77
+ }
78
 
79
  configs = list(
80
  product(
 
129
  f"{rate:.1f} eval/s"
130
  )
131
 
132
+ total_elapsed = time.monotonic() - t0
133
+ return results, total_elapsed
134
 
135
 
136
  def analyze(results: list[dict]) -> dict:
 
218
 
219
  def main() -> None:
220
  args = parse_args()
221
+ results, elapsed_s = run_sweep(
222
+ grid_points=args.grid_points,
223
+ targeted=args.targeted,
224
+ )
225
 
226
  out_dir = args.output_dir
227
  out_dir.mkdir(exist_ok=True)
228
+ mode_label = "targeted" if args.targeted else f"grid{args.grid_points}"
229
  timestamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
230
  out_path = out_dir / f"measured_sweep_{timestamp}.json"
231
 
232
  analysis = analyze(results)
233
 
234
+ metadata = {
235
+ "mode": mode_label,
236
+ "timestamp": timestamp,
237
+ "elapsed_seconds": round(elapsed_s, 1),
238
+ "seconds_per_eval": round(elapsed_s / max(len(results), 1), 2),
239
+ }
240
+
241
  with open(out_path, "w") as f:
242
+ json.dump({"metadata": metadata, "analysis": analysis, "results": results}, f, indent=2)
243
  print(f"\nResults saved to {out_path}")
244
 
245
 
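The `--grid-points` / `--targeted` switch added above relies on argparse's mutually exclusive groups; a minimal standalone sketch of that behavior (toy parser, flag names mirroring the diff):

```python
import argparse

# Toy parser mirroring the sweep script's mode selection: both flags
# live in one mutually exclusive group, so passing both is rejected.
parser = argparse.ArgumentParser()
mode = parser.add_mutually_exclusive_group()
mode.add_argument("--grid-points", type=int, default=3)
mode.add_argument("--targeted", action="store_true")

args = parser.parse_args(["--targeted"])
print(args.targeted, args.grid_points)  # True 3 (the unused flag keeps its default)

try:
    parser.parse_args(["--targeted", "--grid-points", "5"])
except SystemExit:
    # argparse reports "not allowed with argument ..." and exits with code 2
    print("combined flags rejected")
```

Note that a default value never triggers the exclusivity check; only flags actually present on the command line conflict, which is why `--targeted` alone still leaves `grid_points == 3`.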
baselines/replay_playtest.py ADDED
@@ -0,0 +1,207 @@
1
+ """Fixed-action replay playtest for reward branch coverage.
2
+
3
+ Runs 5 scripted episodes against StellaratorEnvironment directly.
4
+ Each episode targets specific untested reward branches.
5
+
6
+ Episodes:
7
+ 1. Seed 0 — repair + feasible-side objective shaping + budget exhaustion
8
+ 2. Seed 1 — repair from different seed (ar=3.4, rt=1.6)
9
+ 3. Seed 2 — boundary clamping (ar=3.8 = upper bound)
10
+ 4. Seed 0 — push rt into crash zone + restore_best
11
+ 5. Seed 0 — repair + objective move + explicit submit
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ import json
17
+ import sys
18
+ from dataclasses import asdict, dataclass
19
+ from typing import Sequence
20
+
21
+ from fusion_lab.models import StellaratorAction, StellaratorObservation
22
+ from server.environment import StellaratorEnvironment
23
+
24
+
25
+ @dataclass(frozen=True)
26
+ class StepRecord:
27
+ step: int
28
+ intent: str
29
+ action_label: str
30
+ score: float
31
+ feasibility: float
32
+ constraints_satisfied: bool
33
+ evaluation_fidelity: str
34
+ evaluation_failed: bool
35
+ max_elongation: float
36
+ reward: float
37
+ budget_remaining: int
38
+ done: bool
39
+
40
+
41
+ def _action_label(action: StellaratorAction) -> str:
42
+ if action.intent != "run":
43
+ return action.intent
44
+ return f"{action.parameter} {action.direction} {action.magnitude}"
45
+
46
+
47
+ def _record(obs: StellaratorObservation, step: int, action: StellaratorAction) -> StepRecord:
48
+ return StepRecord(
49
+ step=step,
50
+ intent=action.intent,
51
+ action_label=_action_label(action),
52
+ score=obs.p1_score,
53
+ feasibility=obs.p1_feasibility,
54
+ constraints_satisfied=obs.constraints_satisfied,
55
+ evaluation_fidelity=obs.evaluation_fidelity,
56
+ evaluation_failed=obs.evaluation_failed,
57
+ max_elongation=obs.max_elongation,
58
+ reward=obs.reward or 0.0,
59
+ budget_remaining=obs.budget_remaining,
60
+ done=obs.done,
61
+ )
62
+
63
+
64
+ def _run_episode(
65
+ env: StellaratorEnvironment,
66
+ seed: int,
67
+ actions: Sequence[StellaratorAction],
68
+ label: str,
69
+ ) -> list[StepRecord]:
70
+ obs = env.reset(seed=seed)
71
+ print(f"\n{'=' * 72}")
72
+ print(f"Episode: {label}")
73
+ print(f"Seed: {seed}")
74
+ print(
75
+ f" reset score={obs.p1_score:.6f} feasibility={obs.p1_feasibility:.6f} "
76
+ f"constraints={'yes' if obs.constraints_satisfied else 'no'} "
77
+ f"elongation={obs.max_elongation:.4f} budget={obs.budget_remaining}"
78
+ )
79
+
80
+ records: list[StepRecord] = []
81
+ for i, action in enumerate(actions, start=1):
82
+ if obs.done:
83
+ print(f" (episode ended before step {i})")
84
+ break
85
+ obs = env.step(action)
86
+ rec = _record(obs, i, action)
87
+ records.append(rec)
88
+
89
+ status = (
90
+ "FAIL" if rec.evaluation_failed else ("OK" if rec.constraints_satisfied else "viol")
91
+ )
92
+ print(
93
+ f" step {i:2d} {rec.action_label:<42s} "
94
+ f"reward={rec.reward:+8.4f} score={rec.score:.6f} "
95
+ f"feas={rec.feasibility:.6f} elong={rec.max_elongation:.4f} "
96
+ f"status={status} budget={rec.budget_remaining} "
97
+ f"{'DONE' if rec.done else ''}"
98
+ )
99
+
100
+ total_reward = sum(r.reward for r in records)
101
+ print(f" total_reward={total_reward:+.4f}")
102
+ return records
103
+
104
+
105
+ def _run(action: str, param: str, direction: str, magnitude: str) -> StellaratorAction:
106
+ return StellaratorAction(
107
+ intent=action,
108
+ parameter=param,
109
+ direction=direction,
110
+ magnitude=magnitude,
111
+ )
112
+
113
+
114
+ def _submit() -> StellaratorAction:
115
+ return StellaratorAction(intent="submit")
116
+
117
+
118
+ def _restore() -> StellaratorAction:
119
+ return StellaratorAction(intent="restore_best")
120
+
121
+
122
+ # ── Episode definitions ──────────────────────────────────────────────────
123
+
124
+ EPISODE_1 = (
125
+ "seed0_repair_objective_exhaustion",
126
+ 0,
127
+ [
128
+ _run("run", "triangularity_scale", "increase", "medium"), # cross feasibility
129
+ _run("run", "elongation", "decrease", "small"), # feasible-side shaping
130
+ _run("run", "elongation", "decrease", "small"), # more shaping
131
+ _run("run", "elongation", "decrease", "small"), # more shaping
132
+ _run("run", "elongation", "decrease", "small"), # more shaping
133
+ _run("run", "elongation", "decrease", "small"), # budget=0 → done bonus
134
+ ],
135
+ )
136
+
137
+ EPISODE_2 = (
138
+ "seed1_repair_different_seed",
139
+ 1,
140
+ [
141
+ _run(
142
+ "run", "triangularity_scale", "increase", "medium"
143
+ ), # cross feasibility from ar=3.4,rt=1.6
144
+ _run("run", "elongation", "decrease", "small"), # feasible-side shaping
145
+ _run("run", "elongation", "decrease", "small"), # more shaping
146
+ _run("run", "triangularity_scale", "increase", "small"), # push tri further
147
+ _run("run", "elongation", "decrease", "small"), # more shaping
148
+ _run("run", "elongation", "decrease", "small"), # budget exhaustion
149
+ ],
150
+ )
151
+
152
+ EPISODE_3 = (
153
+ "seed2_boundary_clamping",
154
+ 2,
155
+ [
156
+ _run("run", "aspect_ratio", "increase", "large"), # ar=3.8 + 0.2 → clamped at 3.8
157
+ _run("run", "triangularity_scale", "increase", "medium"), # repair toward feasibility
158
+ _run("run", "triangularity_scale", "increase", "medium"), # push further
159
+ _run("run", "elongation", "decrease", "small"), # shaping if feasible
160
+ _run("run", "aspect_ratio", "decrease", "large"), # move ar down
161
+ _run("run", "elongation", "decrease", "small"), # budget exhaustion
162
+ ],
163
+ )
164
+
165
+ EPISODE_4 = (
166
+ "seed0_crash_recovery_restore",
167
+ 0,
168
+ [
169
+ _run("run", "triangularity_scale", "increase", "medium"), # cross feasibility first
170
+ _run("run", "rotational_transform", "increase", "large"), # rt 1.5→1.7
171
+ _run("run", "rotational_transform", "increase", "large"), # rt 1.7→1.9 (crash zone)
172
+ _restore(), # recover best state
173
+ _run("run", "elongation", "decrease", "small"), # continue from best
174
+ _run("run", "elongation", "decrease", "small"), # budget exhaustion
175
+ ],
176
+ )
177
+
178
+ EPISODE_5 = (
179
+ "seed0_repair_objective_submit",
180
+ 0,
181
+ [
182
+ _run("run", "triangularity_scale", "increase", "medium"), # cross feasibility
183
+ _run("run", "elongation", "decrease", "small"), # feasible-side objective
184
+ _submit(), # explicit high-fidelity submit
185
+ ],
186
+ )
187
+
188
+ ALL_EPISODES = [EPISODE_1, EPISODE_2, EPISODE_3, EPISODE_4, EPISODE_5]
189
+
190
+
191
+ def main(output_json: str | None = None) -> None:
192
+ env = StellaratorEnvironment()
193
+ all_results: dict[str, list[dict[str, object]]] = {}
194
+
195
+ for label, seed, actions in ALL_EPISODES:
196
+ records = _run_episode(env, seed, actions, label)
197
+ all_results[label] = [asdict(r) for r in records]
198
+
199
+ if output_json:
200
+ with open(output_json, "w") as f:
201
+ json.dump(all_results, f, indent=2)
202
+ print(f"\nResults written to {output_json}")
203
+
204
+
205
+ if __name__ == "__main__":
206
+ out = sys.argv[1] if len(sys.argv) > 1 else None
207
+ main(output_json=out)
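The new playtest script's record-keeping pattern, a frozen `StepRecord` dataclass serialized with `dataclasses.asdict`, can be sketched in isolation; the fields below are a trimmed subset of the script's and the values are illustrative:

```python
import json
from dataclasses import asdict, dataclass


# Trimmed-down version of the script's per-step record: frozen, so a
# recorded step cannot be mutated after the fact.
@dataclass(frozen=True)
class StepRecord:
    step: int
    action_label: str
    reward: float
    done: bool


records = [
    StepRecord(1, "triangularity_scale increase medium", 0.25, False),
    StepRecord(2, "submit", -0.10, True),
]

# Same summary the script prints at the end of each episode
total_reward = sum(r.reward for r in records)
print(f"total_reward={total_reward:+.4f}")  # total_reward=+0.1500

# asdict() turns each record into a plain dict, ready for json.dump
payload = {"episode": [asdict(r) for r in records]}
print(json.dumps(payload, indent=2)[:30])
```

Frozen dataclasses plus `asdict` keep the trace immutable in memory while staying trivially JSON-serializable, which is what makes the per-episode logs replayable evidence.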
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -80,6 +80,8 @@ Practical fail-fast rule:
80
 
81
  - allow a tiny low-fidelity PPO smoke run before full submit-side validation
82
  - use it only to surface obvious learnability bugs, reward exploits, or action-space problems
83
  - do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
84
 
85
  ## 5. Document Roles
@@ -133,8 +135,8 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
133
 
134
  ## 8. Execution Order
135
 
136
- - [ ] Run a tiny low-fidelity PPO smoke pass and inspect a few trajectories for obvious learnability failures or reward exploits.
137
- - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
138
  - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
139
  - [ ] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
140
  - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
@@ -155,10 +157,12 @@ Gate 2: tiny PPO smoke is sane
155
 
156
  - a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
157
  - trajectories are readable enough to debug
 
158
 
159
  Gate 3: fixture checks pass
160
 
161
  - good, boundary, and bad references behave as expected
 
162
 
163
  Gate 4: manual playtest passes
164
 
 
80
 
81
  - allow a tiny low-fidelity PPO smoke run before full submit-side validation
82
  - use it only to surface obvious learnability bugs, reward exploits, or action-space problems
83
+ - stop after a few readable trajectories or one clear failure mode
84
+ - run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
85
  - do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
86
 
87
  ## 5. Document Roles
 
135
 
136
  ## 8. Execution Order
137
 
138
+ - [ ] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
139
+ - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
140
  - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
141
  - [ ] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
142
  - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
 
157
 
158
  - a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
159
  - trajectories are readable enough to debug
160
+ - the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
161
 
162
  Gate 3: fixture checks pass
163
 
164
  - good, boundary, and bad references behave as expected
165
+ - the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work
166
 
167
  Gate 4: manual playtest passes
168
 
docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED
@@ -14,8 +14,9 @@ Use these docs instead:
14
  Current execution priority remains:
15
 
16
  1. measured sweep
17
- 2. tracked fixtures
18
- 3. manual playtest
19
- 4. heuristic baseline refresh
20
- 5. HF Space proof
21
- 6. notebook, demo, and repo polish
 
 
14
  Current execution priority remains:
15
 
16
  1. measured sweep
17
+ 2. tiny PPO smoke pass as a diagnostic-only check
18
+ 3. tracked fixtures with paired high-fidelity submit checks
19
+ 4. one submit-side manual trace, then broader manual playtest
20
+ 5. heuristic baseline refresh
21
+ 6. HF Space proof
22
+ 7. notebook, demo, and repo polish
training/notebooks/northflank_smoke.py CHANGED
@@ -7,10 +7,8 @@ from datetime import UTC, datetime
7
  from importlib.metadata import version
8
  from pathlib import Path
9
 
10
- from constellaration.initial_guess import generate_rotating_ellipse
11
-
12
- from server.environment import BASELINE_PARAMS, N_FIELD_PERIODS
13
- from server.physics import EvaluationMetrics, evaluate_params
14
 
15
 
16
  DEFAULT_OUTPUT_DIR = Path("training/notebooks/artifacts")
@@ -23,15 +21,15 @@ class SmokeArtifact:
23
  boundary_type: str
24
  n_field_periods: int
25
  params: dict[str, float]
26
- metrics: dict[str, float | bool]
27
 
28
 
29
  def parse_args() -> argparse.Namespace:
30
  parser = argparse.ArgumentParser(
31
  description=(
32
  "Run the Fusion Design Lab Northflank smoke check: generate one "
33
- "rotating-ellipse boundary, run one low-fidelity verifier call, "
34
- "and write a JSON artifact."
35
  )
36
  )
37
  parser.add_argument(
@@ -47,23 +45,17 @@ def parse_args() -> argparse.Namespace:
47
 
48
 
49
  def build_artifact() -> SmokeArtifact:
50
- boundary = generate_rotating_ellipse(
51
- aspect_ratio=BASELINE_PARAMS.aspect_ratio,
52
- elongation=BASELINE_PARAMS.elongation,
53
- rotational_transform=BASELINE_PARAMS.rotational_transform,
54
- n_field_periods=N_FIELD_PERIODS,
55
- )
56
- metrics = evaluate_params(
57
- BASELINE_PARAMS,
58
  n_field_periods=N_FIELD_PERIODS,
59
- fidelity="low",
60
  )
 
61
  return SmokeArtifact(
62
  created_at_utc=datetime.now(UTC).isoformat(),
63
  constellaration_version=version("constellaration"),
64
  boundary_type=type(boundary).__name__,
65
  n_field_periods=N_FIELD_PERIODS,
66
- params=BASELINE_PARAMS.model_dump(),
67
  metrics=_metrics_payload(metrics),
68
  )
69
 
@@ -76,8 +68,11 @@ def write_artifact(output_dir: Path, artifact: SmokeArtifact) -> Path:
76
  return output_path
77
 
78
 
79
- def _metrics_payload(metrics: EvaluationMetrics) -> dict[str, float | bool]:
80
  return {
81
  "max_elongation": metrics.max_elongation,
82
  "aspect_ratio": metrics.aspect_ratio,
83
  "average_triangularity": metrics.average_triangularity,
 
7
  from importlib.metadata import version
8
  from pathlib import Path
9
 
10
+ from server.contract import N_FIELD_PERIODS, SMOKE_TEST_PARAMS
11
+ from server.physics import EvaluationMetrics, build_boundary_from_params, evaluate_boundary
 
 
12
 
13
 
14
  DEFAULT_OUTPUT_DIR = Path("training/notebooks/artifacts")
 
21
  boundary_type: str
22
  n_field_periods: int
23
  params: dict[str, float]
24
+ metrics: dict[str, str | float | bool]
25
 
26
 
27
  def parse_args() -> argparse.Namespace:
28
  parser = argparse.ArgumentParser(
29
  description=(
30
  "Run the Fusion Design Lab Northflank smoke check: generate one "
31
+ "rotating-ellipse-derived low-dimensional boundary, run one "
32
+ "low-fidelity verifier call, and write a JSON artifact."
33
  )
34
  )
35
  parser.add_argument(
 
45
 
46
 
47
  def build_artifact() -> SmokeArtifact:
48
+ boundary = build_boundary_from_params(
49
+ SMOKE_TEST_PARAMS,
50
  n_field_periods=N_FIELD_PERIODS,
 
51
  )
52
+ metrics = evaluate_boundary(boundary, fidelity="low")
53
  return SmokeArtifact(
54
  created_at_utc=datetime.now(UTC).isoformat(),
55
  constellaration_version=version("constellaration"),
56
  boundary_type=type(boundary).__name__,
57
  n_field_periods=N_FIELD_PERIODS,
58
+ params=SMOKE_TEST_PARAMS.model_dump(),
59
  metrics=_metrics_payload(metrics),
60
  )
61
 
 
68
  return output_path
69
 
70
 
71
+ def _metrics_payload(metrics: EvaluationMetrics) -> dict[str, str | float | bool]:
72
  return {
73
+ "evaluation_fidelity": metrics.evaluation_fidelity,
74
+ "evaluation_failed": metrics.evaluation_failed,
75
+ "failure_reason": metrics.failure_reason,
76
  "max_elongation": metrics.max_elongation,
77
  "aspect_ratio": metrics.aspect_ratio,
78
  "average_triangularity": metrics.average_triangularity,