CreativeEngineer committed on
Commit
c3a24db
·
1 Parent(s): a02ffad

feat: add high-fidelity validation evidence
README.md CHANGED
@@ -30,7 +30,8 @@ Implementation status:
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
- - the next runtime work is a tiny low-fi PPO smoke run as a diagnostic-only check, followed immediately by paired high-fidelity fixture checks and one real submit-side manual trace

  ## Execution Status

@@ -52,7 +53,8 @@ Implementation status:
  - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
  - [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
  - [x] Add tracked `P1` fixtures under `server/data/p1/`
- - [ ] Run a tiny low-fi PPO smoke run as a diagnostic-only check, then complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
  - [ ] Refresh the heuristic baseline for the real verifier path
  - [ ] Deploy the real environment to HF Space

@@ -60,7 +62,7 @@ Implementation status:

  - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
  - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
- - The tracked fixtures in `server/data/p1/` are currently low-fidelity-calibrated. Do not narrate them as fully paired low-fi/high-fi references until the submit-side spot checks land.
  - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
  - High-fidelity VMEC-backed `submit` is too expensive to serve as the normal RL inner loop. Keep training rollouts on low-fidelity `run`, then use high-fidelity calls for paired fixtures, submit-side traces, sparse checkpoint evaluation, and final evidence.
  - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
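The failure semantics described in that last note (a failed evaluation still consumes budget, surfaces a failure flag, and applies a reward penalty) can be sketched roughly as below. `FAILURE_PENALTY`, `StepOutcome`, and `apply_evaluation` are illustrative names, not the runtime's API, and the penalty magnitude is assumed.

```python
from dataclasses import dataclass

FAILURE_PENALTY = -0.1  # assumed magnitude, not taken from the runtime


@dataclass
class StepOutcome:
    reward: float
    budget_remaining: int
    evaluation_failed: bool


def apply_evaluation(budget_remaining: int, evaluation_failed: bool, raw_reward: float) -> StepOutcome:
    """Charge budget on every evaluation; penalize failures instead of hiding them."""
    budget_remaining -= 1  # failed evaluations cost budget too
    if evaluation_failed:
        # failure stays visible in the observation and carries a penalty
        return StepOutcome(FAILURE_PENALTY, budget_remaining, True)
    return StepOutcome(raw_reward, budget_remaining, False)
```

The key point of this shape is that failures are never free or silent, so an agent cannot farm failed VMEC calls without paying for them.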
@@ -68,12 +70,13 @@ Implementation status:
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
- - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run used only to surface obvious learnability bugs, followed immediately by high-fidelity fixture pairing and one real `submit` trace.

  Current mode:

  - strategic task choice is already locked
- - the next work is a tiny low-fi PPO smoke run as a smoke test only, then paired high-fidelity fixture checks, one submit-side manual trace, heuristic refresh, smoke validation, and deployment
  - new planning text should only appear when a real blocker forces a decision change

  ## Planned Repository Layout
@@ -127,10 +130,10 @@ uv sync --extra notebooks

  ## Immediate Next Steps

- - [ ] Run a tiny low-fidelity PPO smoke run and stop after a few readable trajectories or one clear failure mode.
- - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
- - [ ] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
  - [ ] Keep any checkpoint high-fidelity evaluation sparse enough that the low-fidelity inner loop stays fast.
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
 
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
+ - the first tiny low-fi PPO smoke artifact and paired high-fidelity fixture checks now exist
+ - a one-trajectory submit-side manual trace has now been recorded

  ## Execution Status

 
  - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
  - [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
  - [x] Add tracked `P1` fixtures under `server/data/p1/`
+ - [x] Run a tiny low-fi PPO smoke run as a diagnostic-only check and save one trajectory artifact
+ - [x] Complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
  - [ ] Refresh the heuristic baseline for the real verifier path
  - [ ] Deploy the real environment to HF Space

 

  - Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
  - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
+ - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
  - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
  - High-fidelity VMEC-backed `submit` is too expensive to serve as the normal RL inner loop. Keep training rollouts on low-fidelity `run`, then use high-fidelity calls for paired fixtures, submit-side traces, sparse checkpoint evaluation, and final evidence.
  - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.

  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
+ - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is now baseline refresh and reset-seed confirmation backed by the paired high-fidelity evidence.
+ - The first tiny PPO smoke note is in [docs/P1_PPO_SMOKE_NOTE.md](docs/P1_PPO_SMOKE_NOTE.md). It produced a valid trajectory artifact and exposed a repeated-action local failure, which is the right outcome for a smoke run.

  Current mode:

  - strategic task choice is already locked
+ - the next work is heuristic refresh, reset-seed confirmation, and deployment
  - new planning text should only appear when a real blocker forces a decision change

  ## Planned Repository Layout
 

  ## Immediate Next Steps

+ - [x] Run a tiny low-fidelity PPO smoke run and stop after a few readable trajectories or one clear failure mode.
+ - [x] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
+ - [x] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
  - [ ] Keep any checkpoint high-fidelity evaluation sparse enough that the low-fidelity inner loop stays fast.
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
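The "keep checkpoint high-fidelity evaluation sparse" item above amounts to a simple cadence rule. A minimal sketch, with `HIGH_FI_EVERY_N` as an assumed knob rather than a value from the repository:

```python
# Evaluate with the expensive high-fidelity (VMEC-backed) verifier only on
# every N-th checkpoint, so the low-fidelity inner loop stays fast.
HIGH_FI_EVERY_N = 10  # assumed cadence, tune against real submit cost


def should_run_high_fidelity(checkpoint_index: int, every_n: int = HIGH_FI_EVERY_N) -> bool:
    """True only for the sparse subset of checkpoints that pay the VMEC cost."""
    return checkpoint_index % every_n == 0


# e.g. over 25 checkpoints, only indices 0, 10, and 20 trigger high-fidelity evaluation
sparse = [i for i in range(25) if should_run_high_fidelity(i)]
```

Any real schedule could also be wall-clock or budget based; the point is only that high-fidelity calls are a small, predictable fraction of checkpoints.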
TODO.md CHANGED
@@ -43,7 +43,9 @@ Priority source:
  - [x] manual playtest log
  - [x] settle the non-submit terminal reward policy
  - [x] baseline comparison has been re-run on the `constellaration` branch state
- - [ ] tiny low-fi PPO smoke run exists
  - [ ] refresh the heuristic baseline for the real verifier path

  ## Execution Graph
@@ -182,22 +184,24 @@ flowchart TD
  [server/data/p1/README.md](server/data/p1/README.md),
  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
  Note:
- first tracked fixtures are low-fidelity-calibrated; add paired high-fidelity submit checks next

- - [ ] Run fixture sanity checks
  Goal:
  confirm paired low-fi/high-fi verifier outputs, objective direction, and reward ordering
  Related:
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

- - [ ] Run a tiny low-fi PPO smoke pass
  Goal:
  fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
  Note:
  treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
  stop after a few readable trajectories or one clear failure mode
  paired high-fidelity fixture checks must happen immediately after this smoke pass
  high-fidelity VMEC-backed `submit` should stay out of the normal RL inner loop

  - [ ] Manual-playtest 5-10 episodes
 
  - [x] manual playtest log
  - [x] settle the non-submit terminal reward policy
  - [x] baseline comparison has been re-run on the `constellaration` branch state
+ - [x] tiny low-fi PPO smoke run exists
+ Note:
+ `training/ppo_smoke.py` now runs a diagnostic-only low-fidelity PPO smoke pass and the first artifact is summarized in `docs/P1_PPO_SMOKE_NOTE.md`
  - [ ] refresh the heuristic baseline for the real verifier path

  ## Execution Graph
 
  [server/data/p1/README.md](server/data/p1/README.md),
  [P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
  Note:
+ paired high-fidelity submit checks are now written into each tracked fixture and summarized in `baselines/fixture_high_fidelity_pairs.json`

+ - [x] Run fixture sanity checks
  Goal:
  confirm paired low-fi/high-fi verifier outputs, objective direction, and reward ordering
  Related:
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)

+ - [x] Run a tiny low-fi PPO smoke pass
  Goal:
  fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
  Note:
  treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
  stop after a few readable trajectories or one clear failure mode
  paired high-fidelity fixture checks must happen immediately after this smoke pass
+ Status:
+ first smoke artifact exists; next use of this step should only happen if a follow-up reward or observation change needs re-checking
  high-fidelity VMEC-backed `submit` should stay out of the normal RL inner loop

  - [ ] Manual-playtest 5-10 episodes
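The "reward ordering" part of the fixture sanity check above has a compact formulation: if low-fidelity scores are used to rank candidates during training, their ordering should agree with the high-fidelity ordering on the tracked fixtures. A sketch with illustrative names, not the repository's API:

```python
def rankings_agree(pairs: list[tuple[float, float]]) -> bool:
    """pairs is [(low_fi_score, high_fi_score), ...].

    Returns True when sorting the fixtures by low-fidelity score produces the
    same order as sorting by high-fidelity score, i.e. the cheap metric is a
    safe ranking proxy for the expensive one on this set.
    """
    by_low = sorted(range(len(pairs)), key=lambda i: pairs[i][0])
    by_high = sorted(range(len(pairs)), key=lambda i: pairs[i][1])
    return by_low == by_high
```

On the three tracked fixtures in this commit, the score pairs (0.0/0.0, 0.0/0.0, ~0.2917/~0.2920) would pass this check, which matches the `"ranking_compatibility": "match"` fields in the summary JSON.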
baselines/fixture_high_fidelity_pairs.json ADDED
@@ -0,0 +1,144 @@
+ {
+   "timestamp_utc": "2026-03-08T07:07:29.939307+00:00",
+   "n_field_periods": 3,
+   "fixture_count": 3,
+   "pass_count": 3,
+   "fail_count": 0,
+   "results": [
+     {
+       "name": "bad_low_iota",
+       "file": "server/data/p1/bad_low_iota.json",
+       "status": "pass",
+       "low_fidelity": {
+         "evaluation_failed": false,
+         "constraints_satisfied": false,
+         "p1_score": 0.0,
+         "p1_feasibility": 0.575134593927855,
+         "max_elongation": 5.983792904658967,
+         "aspect_ratio": 2.802311169335037,
+         "average_triangularity": -0.5512332332730122,
+         "edge_iota_over_nfp": 0.12745962182164347,
+         "vacuum_well": -1.0099648537211192,
+         "evaluation_fidelity": "low",
+         "failure_reason": ""
+       },
+       "high_fidelity": {
+         "evaluation_failed": false,
+         "constraints_satisfied": false,
+         "p1_score": 0.0,
+         "p1_feasibility": 0.5763570514697449,
+         "max_elongation": 5.9831792818066525,
+         "aspect_ratio": 2.802311169335037,
+         "average_triangularity": -0.5512332332730122,
+         "edge_iota_over_nfp": 0.12709288455907652,
+         "vacuum_well": -1.0111716777365585,
+         "evaluation_fidelity": "high",
+         "failure_reason": ""
+       },
+       "comparison": {
+         "low_high_feasibility_match": true,
+         "feasibility_delta": 0.0012224575418898764,
+         "score_delta": 0.0,
+         "ranking_compatibility": "match",
+         "low_fidelity_stored_p1_score": 0.0,
+         "low_fidelity_stored_p1_feasibility": 0.575134593927855,
+         "low_fidelity_snapshot": {
+           "missing_fields": [],
+           "drift_fields": {},
+           "mismatches": [],
+           "max_abs_drift": 0.0
+         }
+       }
+     },
+     {
+       "name": "boundary_default_reset",
+       "file": "server/data/p1/boundary_default_reset.json",
+       "status": "pass",
+       "low_fidelity": {
+         "evaluation_failed": false,
+         "constraints_satisfied": false,
+         "p1_score": 0.0,
+         "p1_feasibility": 0.0506528382250242,
+         "max_elongation": 6.13677115978351,
+         "aspect_ratio": 3.31313049868072,
+         "average_triangularity": -0.4746735808874879,
+         "edge_iota_over_nfp": 0.2906263991807532,
+         "vacuum_well": -0.7537878932672235,
+         "evaluation_fidelity": "low",
+         "failure_reason": ""
+       },
+       "high_fidelity": {
+         "evaluation_failed": false,
+         "constraints_satisfied": false,
+         "p1_score": 0.0,
+         "p1_feasibility": 0.0506528382250242,
+         "max_elongation": 6.134177903677296,
+         "aspect_ratio": 3.31313049868072,
+         "average_triangularity": -0.4746735808874879,
+         "edge_iota_over_nfp": 0.28971623977263294,
+         "vacuum_well": -0.7554909069955263,
+         "evaluation_fidelity": "high",
+         "failure_reason": ""
+       },
+       "comparison": {
+         "low_high_feasibility_match": true,
+         "feasibility_delta": 0.0,
+         "score_delta": 0.0,
+         "ranking_compatibility": "match",
+         "low_fidelity_stored_p1_score": 0.0,
+         "low_fidelity_stored_p1_feasibility": 0.0506528382250242,
+         "low_fidelity_snapshot": {
+           "missing_fields": [],
+           "drift_fields": {},
+           "mismatches": [],
+           "max_abs_drift": 0.0
+         }
+       }
+     },
+     {
+       "name": "lowfi_feasible_local",
+       "file": "server/data/p1/lowfi_feasible_local.json",
+       "status": "pass",
+       "low_fidelity": {
+         "evaluation_failed": false,
+         "constraints_satisfied": true,
+         "p1_score": 0.29165951078327634,
+         "p1_feasibility": 0.0,
+         "max_elongation": 7.375064402950513,
+         "aspect_ratio": 3.2870514531062405,
+         "average_triangularity": -0.5002923204919443,
+         "edge_iota_over_nfp": 0.30046082924426193,
+         "vacuum_well": -0.7949586699110935,
+         "evaluation_fidelity": "low",
+         "failure_reason": ""
+       },
+       "high_fidelity": {
+         "evaluation_failed": false,
+         "constraints_satisfied": true,
+         "p1_score": 0.2920325118884466,
+         "p1_feasibility": 0.0,
+         "max_elongation": 7.37170739300398,
+         "aspect_ratio": 3.2870514531062405,
+         "average_triangularity": -0.5002923204919443,
+         "edge_iota_over_nfp": 0.300057398146058,
+         "vacuum_well": -0.7963320784471227,
+         "evaluation_fidelity": "high",
+         "failure_reason": ""
+       },
+       "comparison": {
+         "low_high_feasibility_match": true,
+         "feasibility_delta": 0.0,
+         "score_delta": 0.0003730011051702453,
+         "ranking_compatibility": "match",
+         "low_fidelity_stored_p1_score": 0.29165951078327634,
+         "low_fidelity_stored_p1_feasibility": 0.0,
+         "low_fidelity_snapshot": {
+           "missing_fields": [],
+           "drift_fields": {},
+           "mismatches": [],
+           "max_abs_drift": 0.0
+         }
+       }
+     }
+   ]
+ }
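A consumer of a summary shaped like the JSON above can sanity-check it before trusting the counts. The helper below is illustrative (not part of the repository); the real file this commit writes lives at `baselines/fixture_high_fidelity_pairs.json`.

```python
def summary_is_consistent(summary: dict) -> bool:
    """Counts must add up and every result needs a pass/fail verdict."""
    results = summary.get("results", [])
    counts_ok = (
        summary.get("fixture_count") == len(results)
        and summary.get("pass_count", 0) + summary.get("fail_count", 0) == len(results)
    )
    statuses_ok = all(r.get("status") in {"pass", "fail"} for r in results)
    return counts_ok and statuses_ok


# minimal inline example mirroring the file's shape
example = {
    "fixture_count": 1,
    "pass_count": 1,
    "fail_count": 0,
    "results": [{"name": "bad_low_iota", "status": "pass"}],
}
```

In practice one would `json.load` the tracked file and run the same check before quoting its pass/fail numbers in a report.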
baselines/high_fidelity_validation.py ADDED
@@ -0,0 +1,488 @@
+ """Validation utilities for high-fidelity fixture pairing and submit-side traces."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ from dataclasses import asdict, dataclass
+ from datetime import UTC, datetime
+ from pathlib import Path
+ from pprint import pformat
+ from time import perf_counter
+ from typing import Any
+
+ from fusion_lab.models import LowDimBoundaryParams, StellaratorAction
+ from server.contract import N_FIELD_PERIODS
+ from server.environment import StellaratorEnvironment
+ from server.physics import EvaluationMetrics, build_boundary_from_params, evaluate_boundary
+
+
+ LOW_FIDELITY_TOLERANCE = 1.0e-6
+
+
+ def _float(value: Any) -> float | None:
+     if isinstance(value, bool):
+         return None
+     try:
+         return float(value)
+     except (TypeError, ValueError):
+         return None
+
+
+ @dataclass(frozen=True)
+ class FixturePairResult:
+     name: str
+     file: str
+     status: str
+     low_fidelity: dict[str, Any]
+     high_fidelity: dict[str, Any]
+     comparison: dict[str, Any]
+
+
+ @dataclass(frozen=True)
+ class TraceStep:
+     step: int
+     intent: str
+     action: str
+     reward: float
+     score: float
+     feasibility: float
+     constraints_satisfied: bool
+     feasibility_delta: float | None
+     score_delta: float | None
+     max_elongation: float
+     p1_feasibility: float
+     budget_remaining: int
+     evaluation_fidelity: str
+     done: bool
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(
+         description=(
+             "Run paired high-fidelity fixture checks and a submit-side manual trace "
+             "for the repaired P1 contract."
+         )
+     )
+     parser.add_argument(
+         "--fixture-dir",
+         type=Path,
+         default=Path("server/data/p1"),
+         help="Directory containing tracked P1 fixture JSON files.",
+     )
+     parser.add_argument(
+         "--fixture-output",
+         type=Path,
+         default=Path("baselines/fixture_high_fidelity_pairs.json"),
+         help="Output path for paired fixture summary JSON.",
+     )
+     parser.add_argument(
+         "--trace-output",
+         type=Path,
+         default=Path("baselines/submit_side_trace.json"),
+         help="Output path for one submit-side manual trace JSON.",
+     )
+     parser.add_argument(
+         "--no-write-fixture-updates",
+         action="store_true",
+         help="Do not write paired high-fidelity results back into fixture files.",
+     )
+     parser.add_argument(
+         "--skip-submit-trace",
+         action="store_true",
+         help="Only run paired fixture checks.",
+     )
+     parser.add_argument(
+         "--seed",
+         type=int,
+         default=0,
+         help="Seed for the submit-side manual trace reset state.",
+     )
+     parser.add_argument(
+         "--submit-action-sequence",
+         type=str,
+         default=(
+             "run:rotational_transform:increase:medium,"
+             "run:triangularity_scale:increase:medium,"
+             "run:elongation:decrease:small,"
+             "submit"
+         ),
+         help=(
+             "Comma-separated submit trace sequence. "
+             "Run actions are `run:parameter:direction:magnitude`; include `submit` as the last token."
+         ),
+     )
+     return parser.parse_args()
+
+
+ def _fixture_files(fixture_dir: Path) -> list[Path]:
+     return sorted(path for path in fixture_dir.glob("*.json") if path.is_file())
+
+
+ def _load_fixture(path: Path) -> dict[str, Any]:
+     with path.open("r") as file:
+         return json.load(file)
+
+
+ def _metrics_payload(metrics: EvaluationMetrics) -> dict[str, Any]:
+     return {
+         "evaluation_failed": metrics.evaluation_failed,
+         "constraints_satisfied": metrics.constraints_satisfied,
+         "p1_score": metrics.p1_score,
+         "p1_feasibility": metrics.p1_feasibility,
+         "max_elongation": metrics.max_elongation,
+         "aspect_ratio": metrics.aspect_ratio,
+         "average_triangularity": metrics.average_triangularity,
+         "edge_iota_over_nfp": metrics.edge_iota_over_nfp,
+         "vacuum_well": metrics.vacuum_well,
+         "evaluation_fidelity": metrics.evaluation_fidelity,
+         "failure_reason": metrics.failure_reason,
+     }
+
+
+ def _parse_submit_sequence(raw: str) -> list[StellaratorAction]:
+     actions: list[StellaratorAction] = []
+     for token in raw.split(","):
+         token = token.strip()
+         if not token:
+             continue
+
+         if token == "submit":
+             actions.append(StellaratorAction(intent="submit"))
+             continue
+
+         parts = token.split(":")
+         if len(parts) != 4 or parts[0] != "run":
+             raise ValueError(
+                 "Expected token format `run:parameter:direction:magnitude` or `submit`."
+             )
+         _, parameter, direction, magnitude = parts
+         actions.append(
+             StellaratorAction(
+                 intent="run",
+                 parameter=parameter,
+                 direction=direction,
+                 magnitude=magnitude,
+             )
+         )
+
+     if not actions:
+         raise ValueError("submit-action-sequence must include at least one action.")
+     if actions[-1].intent != "submit":
+         raise ValueError("submit-action-sequence must end with submit.")
+     return actions
+
+
+ def _compare_low_snapshot(
+     stored: dict[str, Any],
+     current: dict[str, Any],
+ ) -> tuple[bool, dict[str, Any]]:
+     numeric_keys = [
+         "p1_feasibility",
+         "p1_score",
+         "max_elongation",
+         "aspect_ratio",
+         "average_triangularity",
+         "edge_iota_over_nfp",
+         "vacuum_well",
+     ]
+     exact_keys = [
+         "constraints_satisfied",
+         "evaluation_fidelity",
+         "evaluation_failed",
+         "failure_reason",
+     ]
+     missing_fields: list[str] = []
+     drift_fields: dict[str, dict[str, float]] = {}
+     mismatches: list[dict[str, Any]] = []
+     max_abs_drift = 0.0
+
+     for key in numeric_keys:
+         if key not in stored:
+             missing_fields.append(key)
+             continue
+
+         expected = _float(stored.get(key))
+         actual = _float(current.get(key))
+         if expected is None or actual is None:
+             mismatches.append(
+                 {
+                     "field": key,
+                     "expected": stored.get(key),
+                     "actual": current.get(key),
+                     "reason": "non-numeric",
+                 }
+             )
+             continue
+
+         drift = abs(expected - actual)
+         max_abs_drift = max(max_abs_drift, drift)
+         if drift > LOW_FIDELITY_TOLERANCE:
+             drift_fields[key] = {
+                 "expected": expected,
+                 "actual": actual,
+                 "abs_drift": drift,
+             }
+             mismatches.append(
+                 {
+                     "field": key,
+                     "expected": expected,
+                     "actual": actual,
+                     "abs_drift": drift,
+                 }
+             )
+
+     for key in exact_keys:
+         if key not in stored:
+             missing_fields.append(key)
+             continue
+
+         expected = stored.get(key)
+         actual = current.get(key)
+         if expected != actual:
+             mismatches.append(
+                 {
+                     "field": key,
+                     "expected": expected,
+                     "actual": actual,
+                     "reason": "exact-mismatch",
+                 }
+             )
+
+     return (
+         not missing_fields and not mismatches,
+         {
+             "missing_fields": missing_fields,
+             "drift_fields": drift_fields,
+             "mismatches": mismatches,
+             "max_abs_drift": max_abs_drift,
+         },
+     )
+
+
+ def _pair_fixture(path: Path) -> FixturePairResult:
+     data = _load_fixture(path)
+     params = LowDimBoundaryParams.model_validate(data["params"])
+     boundary = build_boundary_from_params(params, n_field_periods=N_FIELD_PERIODS)
+
+     low = evaluate_boundary(boundary, fidelity="low")
+     high = evaluate_boundary(boundary, fidelity="high")
+
+     low_payload = _metrics_payload(low)
+     high_payload = _metrics_payload(high)
+     low_snapshot_ok, low_snapshot = _compare_low_snapshot(
+         data.get("low_fidelity", {}),
+         low_payload,
+     )
+     feasible_match = low.constraints_satisfied == high.constraints_satisfied
+     ranking_compat = (
+         "ambiguous"
+         if low.evaluation_failed or high.evaluation_failed
+         else "match"
+         if feasible_match
+         else "mismatch"
+     )
+
+     comparison: dict[str, Any] = {
+         "low_high_feasibility_match": feasible_match,
+         "feasibility_delta": high.p1_feasibility - low.p1_feasibility,
+         "score_delta": high.p1_score - low.p1_score,
+         "ranking_compatibility": ranking_compat,
+         "low_fidelity_stored_p1_score": data.get("low_fidelity", {}).get("p1_score"),
+         "low_fidelity_stored_p1_feasibility": data.get("low_fidelity", {}).get("p1_feasibility"),
+         "low_fidelity_snapshot": low_snapshot,
+     }
+
+     status = "pass"
+     if low.evaluation_failed or high.evaluation_failed or not feasible_match or not low_snapshot_ok:
+         status = "fail"
+         if not low_snapshot_ok:
+             print(f"  low-fidelity snapshot mismatch:\n{pformat(low_snapshot)}")
+
+     return FixturePairResult(
+         name=str(data.get("name", path.stem)),
+         file=str(path),
+         status=status,
+         low_fidelity=low_payload,
+         high_fidelity=high_payload,
+         comparison=comparison,
+     )
+
+
+ def _write_json(payload: dict[str, Any], path: Path) -> None:
+     path.parent.mkdir(parents=True, exist_ok=True)
+     with path.open("w") as file:
+         json.dump(payload, file, indent=2)
+
+
+ def _run_fixture_checks(
+     *,
+     fixture_dir: Path,
+     fixture_output: Path,
+     write_fixture_updates: bool,
+ ) -> tuple[list[FixturePairResult], int]:
+     results: list[FixturePairResult] = []
+     fail_count = 0
+
+     for path in _fixture_files(fixture_dir):
+         print(f"Evaluating fixture: {path.name}")
+         fixture_start = perf_counter()
+         result = _pair_fixture(path)
+         if result.status != "pass":
+             fail_count += 1
+         results.append(result)
+
+         if write_fixture_updates:
+             fixture = _load_fixture(path)
+             fixture["high_fidelity"] = result.high_fidelity
+             fixture["paired_high_fidelity_timestamp_utc"] = datetime.now(tz=UTC).isoformat()
+             with path.open("w") as file:
+                 json.dump(fixture, file, indent=2)
+
+         elapsed = perf_counter() - fixture_start
+         print(
+             "  done in "
+             f"{elapsed:0.1f}s | low_feasible={result.low_fidelity['constraints_satisfied']} "
+             f"| high_feasible={result.high_fidelity['constraints_satisfied']} "
+             f"| status={result.status}"
+         )
+
+     pass_count = len(results) - fail_count
+     payload = {
+         "timestamp_utc": datetime.now(tz=UTC).isoformat(),
+         "n_field_periods": N_FIELD_PERIODS,
+         "fixture_count": len(results),
+         "pass_count": pass_count,
+         "fail_count": fail_count,
+         "results": [asdict(result) for result in results],
+     }
+     _write_json(payload, fixture_output)
+     return results, fail_count
+
+
+ def _run_submit_trace(
+     trace_output: Path,
+     *,
+     seed: int,
+     action_sequence: str,
+ ) -> dict[str, Any]:
+     env = StellaratorEnvironment()
+     obs = env.reset(seed=seed)
+     initial_state = env.state
+     actions = _parse_submit_sequence(action_sequence)
+
+     trace: list[dict[str, Any]] = [
+         {
+             "step": 0,
+             "intent": "reset",
+             "action": f"reset(seed={seed})",
+             "reward": 0.0,
+             "score": obs.p1_score,
+             "feasibility": obs.p1_feasibility,
+             "feasibility_delta": None,
+             "score_delta": None,
+             "constraints_satisfied": obs.constraints_satisfied,
+             "max_elongation": obs.max_elongation,
+             "p1_feasibility": obs.p1_feasibility,
+             "budget_remaining": obs.budget_remaining,
+             "evaluation_fidelity": obs.evaluation_fidelity,
+             "done": obs.done,
+             "params": initial_state.current_params.model_dump(),
+         }
+     ]
+
+     previous_feasibility = obs.p1_feasibility
+     previous_score = obs.p1_score
+
+     for idx, action in enumerate(actions, start=1):
+         obs = env.step(action)
+         trace.append(
+             asdict(
+                 TraceStep(
+                     step=idx,
+                     intent=action.intent,
+                     action=(
+                         f"{action.parameter} {action.direction} {action.magnitude}"
+                         if action.intent == "run"
+                         else action.intent
+                     ),
+                     reward=float(obs.reward or 0.0),
+                     score=obs.p1_score,
+                     feasibility=obs.p1_feasibility,
+                     constraints_satisfied=obs.constraints_satisfied,
+                     feasibility_delta=obs.p1_feasibility - previous_feasibility,
+                     score_delta=obs.p1_score - previous_score,
+                     max_elongation=obs.max_elongation,
+                     p1_feasibility=obs.p1_feasibility,
+                     budget_remaining=obs.budget_remaining,
+                     evaluation_fidelity=obs.evaluation_fidelity,
+                     done=obs.done,
+                 )
+             )
+         )
+
+         previous_feasibility = obs.p1_feasibility
+         previous_score = obs.p1_score
+         if obs.done:
+             break
+
+     total_reward = sum(step["reward"] for step in trace)
+     payload = {
+         "trace_label": "submit_side_manual",
+         "trace_profile": action_sequence,
+         "timestamp_utc": datetime.now(tz=UTC).isoformat(),
+         "n_field_periods": N_FIELD_PERIODS,
+         "seed": seed,
+         "total_reward": total_reward,
+         "final_score": obs.p1_score,
+         "final_feasibility": obs.p1_feasibility,
+         "final_constraints_satisfied": obs.constraints_satisfied,
+         "final_evaluation_fidelity": obs.evaluation_fidelity,
+         "final_evaluation_failed": obs.evaluation_failed,
+         "steps": trace,
+         "final_best_low_fidelity_score": obs.best_low_fidelity_score,
+         "final_best_low_fidelity_feasibility": obs.best_low_fidelity_feasibility,
+         "final_best_high_fidelity_score": obs.best_high_fidelity_score,
+         "final_best_high_fidelity_feasibility": obs.best_high_fidelity_feasibility,
+         "final_diagnostics_text": obs.diagnostics_text,
+     }
+     _write_json(payload, trace_output)
+     return payload
+
+
+ def main() -> int:
+     args = parse_args()
+     results, fail_count = _run_fixture_checks(
+         fixture_dir=args.fixture_dir,
+         fixture_output=args.fixture_output,
+         write_fixture_updates=not args.no_write_fixture_updates,
+     )
+
+     print(
+         f"Paired fixtures: {len(results)} total, {len(results) - fail_count} pass, {fail_count} fail"
+     )
+     for result in results:
+         print(
+             f"  - {result.name}: {result.status} "
+             f"(low={result.low_fidelity['constraints_satisfied']} "
+             f"high={result.high_fidelity['constraints_satisfied']})"
+         )
+
+     if not args.skip_submit_trace:
+         trace = _run_submit_trace(
+             args.trace_output,
+             seed=args.seed,
+             action_sequence=args.submit_action_sequence,
+         )
+         print(
+ f"Manual submit trace written to {args.trace_output} | "
479
+ f"sequence='{trace['trace_profile']}' | "
480
+ f"final_feasibility={trace['final_feasibility']:.6f} | "
481
+ f"fidelity={trace['final_evaluation_fidelity']}"
482
+ )
483
+
484
+ return 1 if fail_count else 0
485
+
486
+
487
+ if __name__ == "__main__":
488
+ raise SystemExit(main())
baselines/submit_side_trace.json ADDED
@@ -0,0 +1,106 @@
+{
+  "trace_label": "submit_side_manual",
+  "trace_profile": "run:rotational_transform:increase:medium,run:triangularity_scale:increase:medium,run:elongation:decrease:small,submit",
+  "timestamp_utc": "2026-03-08T07:07:43.478814+00:00",
+  "n_field_periods": 3,
+  "seed": 0,
+  "total_reward": 5.3296,
+  "final_score": 0.29605869964467535,
+  "final_feasibility": 0.0008652388718514148,
+  "final_constraints_satisfied": true,
+  "final_evaluation_fidelity": "high",
+  "final_evaluation_failed": false,
+  "steps": [
+    {
+      "step": 0,
+      "intent": "reset",
+      "action": "reset(seed=0)",
+      "reward": 0.0,
+      "score": 0.0,
+      "feasibility": 0.0506528382250242,
+      "feasibility_delta": null,
+      "score_delta": null,
+      "constraints_satisfied": false,
+      "max_elongation": 6.13677115978351,
+      "p1_feasibility": 0.0506528382250242,
+      "budget_remaining": 6,
+      "evaluation_fidelity": "low",
+      "done": false,
+      "params": {
+        "aspect_ratio": 3.6,
+        "elongation": 1.4,
+        "rotational_transform": 1.5,
+        "triangularity_scale": 0.55
+      }
+    },
+    {
+      "step": 1,
+      "intent": "run",
+      "action": "rotational_transform increase medium",
+      "reward": -0.1,
+      "score": 0.0,
+      "feasibility": 0.05065283822502309,
+      "constraints_satisfied": false,
+      "feasibility_delta": -1.1102230246251565e-15,
+      "score_delta": 0.0,
+      "max_elongation": 6.729528139593349,
+      "p1_feasibility": 0.05065283822502309,
+      "budget_remaining": 5,
+      "evaluation_fidelity": "low",
+      "done": false
+    },
+    {
+      "step": 2,
+      "intent": "run",
+      "action": "triangularity_scale increase medium",
+      "reward": 3.1533,
+      "score": 0.29165951078326,
+      "feasibility": 0.0,
+      "constraints_satisfied": true,
+      "feasibility_delta": -0.05065283822502309,
+      "score_delta": 0.29165951078326,
+      "max_elongation": 7.37506440295066,
+      "p1_feasibility": 0.0,
+      "budget_remaining": 4,
+      "evaluation_fidelity": "low",
+      "done": false
+    },
+    {
+      "step": 3,
+      "intent": "run",
+      "action": "elongation decrease small",
+      "reward": 0.2665,
+      "score": 0.2957311862720885,
+      "feasibility": 0.0008652388718514148,
+      "constraints_satisfied": true,
+      "feasibility_delta": 0.0008652388718514148,
+      "score_delta": 0.0040716754888284745,
+      "max_elongation": 7.338419323551204,
+      "p1_feasibility": 0.0008652388718514148,
+      "budget_remaining": 3,
+      "evaluation_fidelity": "low",
+      "done": false
+    },
+    {
+      "step": 4,
+      "intent": "submit",
+      "action": "submit",
+      "reward": 2.0098,
+      "score": 0.29605869964467535,
+      "feasibility": 0.0008652388718514148,
+      "constraints_satisfied": true,
+      "feasibility_delta": 0.0,
+      "score_delta": 0.00032751337258685176,
+      "max_elongation": 7.335471703197922,
+      "p1_feasibility": 0.0008652388718514148,
+      "budget_remaining": 3,
+      "evaluation_fidelity": "high",
+      "done": true
+    }
+  ],
+  "final_best_low_fidelity_score": 0.2957311862720885,
+  "final_best_low_fidelity_feasibility": 0.0008652388718514148,
+  "final_best_high_fidelity_score": 0.29605869964467535,
+  "final_best_high_fidelity_feasibility": 0.0008652388718514148,
+  "final_diagnostics_text": "Submitted current_high_fidelity_score=0.296059, best_high_fidelity_score=0.296059, best_high_fidelity_feasibility=0.000865, constraints=SATISFIED.\n\nevaluation_fidelity=high\nevaluation_status=OK\nmax_elongation=7.3355\naspect_ratio=3.2897 (<= 4.0)\naverage_triangularity=-0.4996 (<= -0.5)\nedge_iota_over_nfp=0.3030 (>= 0.3)\nfeasibility=0.000865\nbest_low_fidelity_score=0.295731\nbest_low_fidelity_feasibility=0.000865\nbest_high_fidelity_score=0.296059\nbest_high_fidelity_feasibility=0.000865\nvacuum_well=-0.8079\nconstraints=SATISFIED\nstep=4 | budget=3/6"
+}
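One quick sanity check on a trace like this is that the recorded `total_reward` equals the sum of the per-step rewards, and that the per-step feasibility deltas telescope from the reset feasibility to the final one. A minimal sketch, with the numbers copied from the trace above:

```python
# Per-step rewards copied from the recorded submit-side trace
# (reset, three low-fidelity runs, one high-fidelity submit).
step_rewards = [0.0, -0.1, 3.1533, 0.2665, 2.0098]

total_reward = round(sum(step_rewards), 4)
assert total_reward == 5.3296  # matches the trace's "total_reward" field

# The feasibility deltas should telescope: reset feasibility plus all
# deltas equals the final feasibility reported at the submit step.
initial_feasibility = 0.0506528382250242
final_feasibility = 0.0008652388718514148
deltas = [-1.1102230246251565e-15, -0.05065283822502309,
          0.0008652388718514148, 0.0]
assert abs(initial_feasibility + sum(deltas) - final_feasibility) < 1e-12
```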
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -35,12 +35,13 @@ Completed:
 - a coarse measured sweep note now exists
 - the first tracked low-fidelity fixtures now exist
 - an initial low-fidelity manual playtest note now exists
+- paired high-fidelity fixture checks for those tracked fixtures now exist
+- one submit-side manual playtest trace exists
 
 Still open:
 
 - tiny low-fidelity PPO smoke evidence
-- paired high-fidelity checks for the tracked fixtures
-- submit-side manual playtest evidence
+- decision on whether reset-seed pool should change from paired checks
 - heuristic baseline refresh on the repaired real-verifier path
 - HF Space deployment evidence
 - Colab artifact wiring
@@ -114,9 +115,9 @@ Compute surfaces:
 Evidence order:
 
 - [x] measured sweep note
-- [ ] fixture checks
+- [x] fixture checks
 - [x] manual playtest log
-- [ ] tiny low-fi PPO smoke trace
+- [x] tiny low-fi PPO smoke trace
 - [ ] reward iteration note
 - [ ] stable local and remote episodes
 - [x] random and heuristic baselines
@@ -138,10 +139,10 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
 
 ## 8. Execution Order
 
-- [ ] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
-- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
+- [x] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
+- [x] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
 - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
-- [ ] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
+- [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
 - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
 - [ ] Refresh the heuristic baseline using the repaired-family evidence.
 - [ ] Prove a stable local episode path.
@@ -161,6 +162,7 @@ Gate 2: tiny PPO smoke is sane
 - a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
 - trajectories are readable enough to debug
 - the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
+- current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in [`P1_PPO_SMOKE_NOTE.md`](P1_PPO_SMOKE_NOTE.md)
 
 Gate 3: fixture checks pass
 
@@ -221,8 +223,8 @@ If the repaired family is too easy:
 - [x] Record the measured sweep and choose provisional defaults from evidence.
 - [x] Check in tracked fixtures.
 - [x] Record the first manual playtest log.
-- [ ] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
-- [ ] Pair the tracked fixtures with high-fidelity submit checks.
-- [ ] Record one submit-side manual trace.
+- [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
+- [x] Pair the tracked fixtures with high-fidelity submit checks.
+- [x] Record one submit-side manual trace.
 - [ ] Refresh the heuristic baseline from that playtest evidence.
 - [ ] Verify one clean HF Space episode with the same contract.
docs/P1_MANUAL_PLAYTEST_LOG.md CHANGED
@@ -50,4 +50,33 @@ Step 1:
 Current conclusion:
 
 - Reward V0 is legible on the low-fidelity repair path around the default reset seed
-- the most useful next manual check is still a real `submit` trace, but low-fidelity shaping is already understandable by a human
+- a real `submit` trace is now recorded; the next manual validation is to expand to the planned 5 to 10 episodes and record the first clear exploit or ambiguity
+
+Episode C: submit-side manual trace
+
+Scope:
+
+- same seed-0 start state as episode A
+- actions: `rotational_transform increase medium`, `triangularity_scale increase medium`, `elongation decrease small`, `submit`
+
+Step sequence:
+
+- Step 1: `rotational_transform increase medium`
+  - low-fidelity feasibility changed by `0.000000` (still infeasible)
+  - reward: `-0.1000`
+- Step 2: `triangularity_scale increase medium`
+  - crossed the feasibility boundary
+  - low-fidelity feasibility moved from `0.050653` to `0.000000`
+  - reward: `+3.1533`
+- Step 3: `elongation decrease small`
+  - low-fidelity feasibility moved to `0.000865`
+  - reward: `+0.2665`
+- Step 4: `submit` (high-fidelity)
+  - final feasibility: `0.000865`
+  - final high-fidelity score: `0.296059`
+  - final reward: `+2.0098`
+  - final diagnostics: `evaluation_fidelity=high`, `constraints=SATISFIED`, `best_high_fidelity_score=0.296059`
+
+Artifact:
+
+- [manual submit trace JSON](../baselines/submit_side_trace.json)
server/data/p1/FIXTURE_SANITY.md CHANGED
@@ -1,6 +1,6 @@
 # P1 Fixture Sanity
 
-This folder now contains three low-fidelity-calibrated `P1` fixtures:
+This folder now contains three paired low-fidelity/high-fidelity `P1` fixtures:
 
 - `boundary_default_reset.json`
 - `bad_low_iota.json`
@@ -23,8 +23,24 @@ Current interpretation:
 - low-fidelity feasible local target
 - reachable from the default reset band with two intuitive knob increases
 
-What is still pending:
+Observed from paired run:
 
-- paired high-fidelity submit measurements for each tracked fixture
-- low-fi vs high-fi ranking comparison note
+- low-fi vs high-fi feasibility alignment and metric deltas are documented in `baselines/fixture_high_fidelity_pairs.json`
 - decision on whether any reset seed should be changed from the current default
+
+Current paired summary (`baselines/fixture_high_fidelity_pairs.json`):
+
+- `bad_low_iota.json`:
+  - both fidelities infeasible
+  - low/high feasibility match: `true`
+  - low/high score match: both `0.0`
+
+- `boundary_default_reset.json`:
+  - both fidelities infeasible
+  - low/high feasibility match: `true`
+  - low/high score match: both `0.0`
+
+- `lowfi_feasible_local.json`:
+  - both fidelities feasible
+  - low/high feasibility match: `true`
+  - high-fidelity score improved slightly: `0.29165951078327634` → `0.2920325118884466`
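The pass criterion quoted in the summary above, "low/high feasibility match", can be read as: both fidelities land on the same side of the constraint boundary. A minimal sketch of that comparison, using the `lowfi_feasible_local.json` numbers quoted above; the function name is illustrative, and the exact pass rule lives in the paired-check script:

```python
def fidelities_agree(low: dict, high: dict) -> bool:
    """A paired fixture passes when low- and high-fidelity evaluation land on
    the same side of the feasibility boundary (constraints_satisfied matches)."""
    return low["constraints_satisfied"] == high["constraints_satisfied"]


# Values quoted in the paired summary for lowfi_feasible_local.json.
low = {"constraints_satisfied": True, "p1_score": 0.29165951078327634}
high = {"constraints_satisfied": True, "p1_score": 0.2920325118884466}

assert fidelities_agree(low, high)
# The high-fidelity score shifts only slightly relative to low fidelity.
assert abs(high["p1_score"] - low["p1_score"]) < 1e-3
```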
server/data/p1/README.md CHANGED
@@ -12,7 +12,7 @@ These fixtures are for verifier and reward sanity checks.
 
 ## Status
 
-- [ ] known-good or near-winning fixture added
+- [x] known-good or near-winning fixture added
 - [x] near-boundary fixture added
 - [x] clearly infeasible fixture added
 - [x] fixture sanity note written
server/data/p1/bad_low_iota.json CHANGED
@@ -4,7 +4,7 @@
   "notes": [
     "Clearly infeasible calibration case from the coarse measured sweep.",
     "The dominant failure mode is low edge_iota_over_nfp, not triangularity.",
-    "High-fidelity submit spot check is still pending."
+    "High-fidelity submit spot check is complete."
   ],
   "params": {
     "aspect_ratio": 3.2,
@@ -21,7 +21,22 @@
     "aspect_ratio": 2.802311169335037,
     "average_triangularity": -0.5512332332730122,
     "edge_iota_over_nfp": 0.12745962182164347,
-    "vacuum_well": -1.0099648537211192
+    "vacuum_well": -1.0099648537211192,
+    "evaluation_fidelity": "low",
+    "failure_reason": ""
   },
-  "high_fidelity": null
+  "high_fidelity": {
+    "evaluation_failed": false,
+    "constraints_satisfied": false,
+    "p1_score": 0.0,
+    "p1_feasibility": 0.5763570514697449,
+    "max_elongation": 5.9831792818066525,
+    "aspect_ratio": 2.802311169335037,
+    "average_triangularity": -0.5512332332730122,
+    "edge_iota_over_nfp": 0.12709288455907652,
+    "vacuum_well": -1.0111716777365585,
+    "evaluation_fidelity": "high",
+    "failure_reason": ""
+  },
+  "paired_high_fidelity_timestamp_utc": "2026-03-08T07:07:19.629771+00:00"
 }
server/data/p1/boundary_default_reset.json CHANGED
@@ -4,7 +4,7 @@
   "notes": [
     "Matches the current default reset seed.",
     "Useful as a near-boundary starting point for short repair episodes.",
-    "High-fidelity submit spot check is still pending."
+    "High-fidelity submit spot check is complete."
  ],
   "params": {
     "aspect_ratio": 3.6,
@@ -21,7 +21,22 @@
     "aspect_ratio": 3.31313049868072,
     "average_triangularity": -0.4746735808874879,
     "edge_iota_over_nfp": 0.2906263991807532,
-    "vacuum_well": -0.7537878932672235
+    "vacuum_well": -0.7537878932672235,
+    "evaluation_fidelity": "low",
+    "failure_reason": ""
   },
-  "high_fidelity": null
+  "high_fidelity": {
+    "evaluation_failed": false,
+    "constraints_satisfied": false,
+    "p1_score": 0.0,
+    "p1_feasibility": 0.0506528382250242,
+    "max_elongation": 6.134177903677296,
+    "aspect_ratio": 3.31313049868072,
+    "average_triangularity": -0.4746735808874879,
+    "edge_iota_over_nfp": 0.28971623977263294,
+    "vacuum_well": -0.7554909069955263,
+    "evaluation_fidelity": "high",
+    "failure_reason": ""
+  },
+  "paired_high_fidelity_timestamp_utc": "2026-03-08T07:07:24.745385+00:00"
 }
server/data/p1/lowfi_feasible_local.json CHANGED
@@ -4,7 +4,7 @@
   "notes": [
     "Local repair target reached from the default reset band by increasing rotational_transform and triangularity_scale.",
     "Useful as a low-fidelity feasibility reference for Reward V0 sanity checks.",
-    "High-fidelity submit spot check is still pending."
+    "High-fidelity submit spot check is complete."
   ],
   "params": {
     "aspect_ratio": 3.6,
@@ -21,7 +21,22 @@
     "aspect_ratio": 3.2870514531062405,
     "average_triangularity": -0.5002923204919443,
     "edge_iota_over_nfp": 0.30046082924426193,
-    "vacuum_well": -0.7949586699110935
+    "vacuum_well": -0.7949586699110935,
+    "evaluation_fidelity": "low",
+    "failure_reason": ""
   },
-  "high_fidelity": null
+  "high_fidelity": {
+    "evaluation_failed": false,
+    "constraints_satisfied": true,
+    "p1_score": 0.2920325118884466,
+    "p1_feasibility": 0.0,
+    "max_elongation": 7.37170739300398,
+    "aspect_ratio": 3.2870514531062405,
+    "average_triangularity": -0.5002923204919443,
+    "edge_iota_over_nfp": 0.300057398146058,
+    "vacuum_well": -0.7963320784471227,
+    "evaluation_fidelity": "high",
+    "failure_reason": ""
+  },
+  "paired_high_fidelity_timestamp_utc": "2026-03-08T07:07:29.939083+00:00"
 }