Commit · c3a24db
1 Parent(s): a02ffad
feat: add high-fidelity validation evidence
- README.md +11 -8
- TODO.md +8 -4
- baselines/fixture_high_fidelity_pairs.json +144 -0
- baselines/high_fidelity_validation.py +488 -0
- baselines/submit_side_trace.json +106 -0
- docs/FUSION_DESIGN_LAB_PLAN_V2.md +12 -10
- docs/P1_MANUAL_PLAYTEST_LOG.md +30 -1
- server/data/p1/FIXTURE_SANITY.md +20 -4
- server/data/p1/README.md +1 -1
- server/data/p1/bad_low_iota.json +18 -3
- server/data/p1/boundary_default_reset.json +18 -3
- server/data/p1/lowfi_feasible_local.json +18 -3
README.md
CHANGED
|
@@ -30,7 +30,8 @@ Implementation status:
|
|
| 30 |
- the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
|
| 31 |
- the repaired 4-knob low-dimensional family is now wired into the runtime path
|
| 32 |
- the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
|
| 33 |
-
- the
|
|
|
|
| 34 |
|
| 35 |
## Execution Status
|
| 36 |
|
|
@@ -52,7 +53,8 @@ Implementation status:
|
|
| 52 |
- [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
|
| 53 |
- [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
|
| 54 |
- [x] Add tracked `P1` fixtures under `server/data/p1/`
|
| 55 |
-
- [
|
|
|
|
| 56 |
- [ ] Refresh the heuristic baseline for the real verifier path
|
| 57 |
- [ ] Deploy the real environment to HF Space
|
| 58 |
|
|
@@ -60,7 +62,7 @@ Implementation status:
|
|
| 60 |
|
| 61 |
- Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
|
| 62 |
- The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
|
| 63 |
-
- The
|
| 64 |
- `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
|
| 65 |
- High-fidelity VMEC-backed `submit` is too expensive to serve as the normal RL inner loop. Keep training rollouts on low-fidelity `run`, then use high-fidelity calls for paired fixtures, submit-side traces, sparse checkpoint evaluation, and final evidence.
|
| 66 |
- VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
|
|
@@ -68,12 +70,13 @@ Implementation status:
|
|
| 68 |
- Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
|
| 69 |
- Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
|
| 70 |
- The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
|
| 71 |
-
- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is
|
|
|
|
| 72 |
|
| 73 |
Current mode:
|
| 74 |
|
| 75 |
- strategic task choice is already locked
|
| 76 |
-
- the next work is
|
| 77 |
- new planning text should only appear when a real blocker forces a decision change
|
| 78 |
|
| 79 |
## Planned Repository Layout
|
|
@@ -127,10 +130,10 @@ uv sync --extra notebooks
|
|
| 127 |
|
| 128 |
## Immediate Next Steps
|
| 129 |
|
| 130 |
-
- [
|
| 131 |
-
- [
|
|
|
|
| 132 |
- [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
|
| 133 |
-
- [ ] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
|
| 134 |
- [ ] Keep any checkpoint high-fidelity evaluation sparse enough that the low-fidelity inner loop stays fast.
|
| 135 |
- [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
|
| 136 |
- [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
|
|
|
|
| 30 |
- the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
|
| 31 |
- the repaired 4-knob low-dimensional family is now wired into the runtime path
|
| 32 |
- the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
|
| 33 |
+
- the first tiny low-fi PPO smoke artifact and paired high-fidelity fixture checks now exist
|
| 34 |
+
- a one-trajectory submit-side manual trace has now been recorded
|
| 35 |
|
| 36 |
## Execution Status
|
| 37 |
|
|
|
|
| 53 |
- [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
|
| 54 |
- [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
|
| 55 |
- [x] Add tracked `P1` fixtures under `server/data/p1/`
|
| 56 |
+
- [x] Run a tiny low-fi PPO smoke run as a diagnostic-only check and save one trajectory artifact
|
| 57 |
+
- [x] Complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
|
| 58 |
- [ ] Refresh the heuristic baseline for the real verifier path
|
| 59 |
- [ ] Deploy the real environment to HF Space
|
| 60 |
|
|
|
|
| 62 |
|
| 63 |
- Historical blocker note: the old 3-knob family was structurally blocked on P1 triangularity with the real verifier path. A sampled low-fidelity sweep kept `average_triangularity` at roughly `+0.004975` and `p1_feasibility` at roughly `1.00995`, with zero feasible samples. That blocker motivated the repaired 4-knob runtime that is now live.
|
| 64 |
- The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
|
| 65 |
+
- The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
|
| 66 |
- `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
|
| 67 |
- High-fidelity VMEC-backed `submit` is too expensive to serve as the normal RL inner loop. Keep training rollouts on low-fidelity `run`, then use high-fidelity calls for paired fixtures, submit-side traces, sparse checkpoint evaluation, and final evidence.
|
| 68 |
- VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
|
|
|
|
| 70 |
- Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
|
| 71 |
- Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
|
| 72 |
- The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
|
| 73 |
+
- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is now baseline refresh and reset-seed confirmation backed by the paired high-fidelity evidence.
|
| 74 |
+
- The first tiny PPO smoke note is in [docs/P1_PPO_SMOKE_NOTE.md](docs/P1_PPO_SMOKE_NOTE.md). It produced a valid trajectory artifact and exposed a repeated-action local failure, which is the right outcome for a smoke run.
|
| 75 |
|
| 76 |
Current mode:
|
| 77 |
|
| 78 |
- strategic task choice is already locked
|
| 79 |
+
- the next work is heuristic refresh, reset-seed confirmation, and deployment
|
| 80 |
- new planning text should only appear when a real blocker forces a decision change
|
| 81 |
|
| 82 |
## Planned Repository Layout
|
|
|
|
| 130 |
|
| 131 |
## Immediate Next Steps
|
| 132 |
|
| 133 |
+
- [x] Run a tiny low-fidelity PPO smoke run and stop after a few readable trajectories or one clear failure mode.
|
| 134 |
+
- [x] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
|
| 135 |
+
- [x] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
|
| 136 |
- [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
|
|
|
|
| 137 |
- [ ] Keep any checkpoint high-fidelity evaluation sparse enough that the low-fidelity inner loop stays fast.
|
| 138 |
- [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
|
| 139 |
- [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
|
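The README notes above say that best-state reporting is split between low-fidelity rollout state and high-fidelity submit state, and that step-time metrics should not be presented as final submission metrics. A minimal sketch of that separation follows. The class and member names (`BestState`, `best_low_fidelity_score`, `best_high_fidelity_score`, `record`) are hypothetical and not taken from the repository; only the `evaluation_fidelity` values `"low"` and `"high"` appear in the commit. The sketch also assumes that a higher score is better.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class BestState:
    """Hypothetical best-state tracker; not the repository's observation model."""

    best_low_fidelity_score: float | None = None   # touched by `run` steps only
    best_high_fidelity_score: float | None = None  # touched by `submit` only

    def record(self, score: float, evaluation_fidelity: str) -> None:
        """Update only the field that matches the evaluation fidelity."""
        if evaluation_fidelity == "low":
            field = "best_low_fidelity_score"
        elif evaluation_fidelity == "high":
            field = "best_high_fidelity_score"
        else:
            raise ValueError(f"unknown fidelity: {evaluation_fidelity!r}")
        current = getattr(self, field)
        # Assumption: higher score is better.
        if current is None or score > current:
            setattr(self, field, score)


state = BestState()
state.record(0.29165951078327634, "low")   # step-time metric from a `run` step
state.record(0.2920325118884466, "high")   # final metric from `submit`
```

Keeping two fields rather than one mixed "best score" means a trace or demo can always say which verifier produced a number.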
TODO.md
CHANGED
|
@@ -43,7 +43,9 @@ Priority source:
|
|
| 43 |
- [x] manual playtest log
|
| 44 |
- [x] settle the non-submit terminal reward policy
|
| 45 |
- [x] baseline comparison has been re-run on the `constellaration` branch state
|
| 46 |
-
- [
|
|
|
|
|
|
|
| 47 |
- [ ] refresh the heuristic baseline for the real verifier path
|
| 48 |
|
| 49 |
## Execution Graph
|
|
@@ -182,22 +184,24 @@ flowchart TD
|
|
| 182 |
[server/data/p1/README.md](server/data/p1/README.md),
|
| 183 |
[P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
|
| 184 |
Note:
|
| 185 |
-
|
| 186 |
|
| 187 |
-
- [
|
| 188 |
Goal:
|
| 189 |
confirm paired low-fi/high-fi verifier outputs, objective direction, and reward ordering
|
| 190 |
Related:
|
| 191 |
[Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
|
| 192 |
[Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
|
| 193 |
|
| 194 |
-
- [
|
| 195 |
Goal:
|
| 196 |
fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
|
| 197 |
Note:
|
| 198 |
treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
|
| 199 |
stop after a few readable trajectories or one clear failure mode
|
| 200 |
paired high-fidelity fixture checks must happen immediately after this smoke pass
|
|
|
|
|
|
|
| 201 |
high-fidelity VMEC-backed `submit` should stay out of the normal RL inner loop
|
| 202 |
|
| 203 |
- [ ] Manual-playtest 5-10 episodes
|
|
|
|
| 43 |
- [x] manual playtest log
|
| 44 |
- [x] settle the non-submit terminal reward policy
|
| 45 |
- [x] baseline comparison has been re-run on the `constellaration` branch state
|
| 46 |
+
- [x] tiny low-fi PPO smoke run exists
|
| 47 |
+
Note:
|
| 48 |
+
`training/ppo_smoke.py` now runs a diagnostic-only low-fidelity PPO smoke pass and the first artifact is summarized in `docs/P1_PPO_SMOKE_NOTE.md`
|
| 49 |
- [ ] refresh the heuristic baseline for the real verifier path
|
| 50 |
|
| 51 |
## Execution Graph
|
|
|
|
| 184 |
[server/data/p1/README.md](server/data/p1/README.md),
|
| 185 |
[P1 Pivot Record](docs/archive/PIVOT_P1_ROTATING_ELLIPSE.md)
|
| 186 |
Note:
|
| 187 |
+
paired high-fidelity submit checks are now written into each tracked fixture and summarized in `baselines/fixture_high_fidelity_pairs.json`
|
| 188 |
|
| 189 |
+
- [x] Run fixture sanity checks
|
| 190 |
Goal:
|
| 191 |
confirm paired low-fi/high-fi verifier outputs, objective direction, and reward ordering
|
| 192 |
Related:
|
| 193 |
[Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
|
| 194 |
[Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
|
| 195 |
|
| 196 |
+
- [x] Run a tiny low-fi PPO smoke pass
|
| 197 |
Goal:
|
| 198 |
fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
|
| 199 |
Note:
|
| 200 |
treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
|
| 201 |
stop after a few readable trajectories or one clear failure mode
|
| 202 |
paired high-fidelity fixture checks must happen immediately after this smoke pass
|
| 203 |
+
Status:
|
| 204 |
+
first smoke artifact exists; rerun this step only if a follow-up reward or observation change needs re-checking
|
| 205 |
high-fidelity VMEC-backed `submit` should stay out of the normal RL inner loop
|
| 206 |
|
| 207 |
- [ ] Manual-playtest 5-10 episodes
|
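The smoke-pass notes above say to stop after "one clear failure mode", and the README records a repeated-action local failure from the first PPO smoke run. One way such a failure could be flagged in a saved trajectory is sketched below. The function names and the `threshold` default are hypothetical, not taken from `training/ppo_smoke.py`, and the action strings only borrow the `run:parameter:direction:magnitude` token format used by the commit's validation script.

```python
from __future__ import annotations

from itertools import groupby


def longest_repeat_run(actions: list[str]) -> tuple[str | None, int]:
    """Return the action with the longest consecutive run and that run's length."""
    best_action, best_len = None, 0
    for action, group in groupby(actions):
        run_len = sum(1 for _ in group)
        if run_len > best_len:
            best_action, best_len = action, run_len
    return best_action, best_len


def is_repeated_action_failure(actions: list[str], threshold: int = 4) -> bool:
    """Flag a trajectory whose longest repeat run reaches the (assumed) threshold."""
    return longest_repeat_run(actions)[1] >= threshold


# Illustrative trajectory, not a recorded artifact.
trajectory = [
    "run:rotational_transform:increase:medium",
    "run:elongation:decrease:small",
    "run:elongation:decrease:small",
    "run:elongation:decrease:small",
    "run:elongation:decrease:small",
    "submit",
]
flagged = is_repeated_action_failure(trajectory)
```

A check like this is only a diagnostic aid for reading smoke trajectories; it says nothing about whether the terminal `submit` contract is valid.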
baselines/fixture_high_fidelity_pairs.json
ADDED
|
@@ -0,0 +1,144 @@
|
|
| 1 |
+
{
|
| 2 |
+
"timestamp_utc": "2026-03-08T07:07:29.939307+00:00",
|
| 3 |
+
"n_field_periods": 3,
|
| 4 |
+
"fixture_count": 3,
|
| 5 |
+
"pass_count": 3,
|
| 6 |
+
"fail_count": 0,
|
| 7 |
+
"results": [
|
| 8 |
+
{
|
| 9 |
+
"name": "bad_low_iota",
|
| 10 |
+
"file": "server/data/p1/bad_low_iota.json",
|
| 11 |
+
"status": "pass",
|
| 12 |
+
"low_fidelity": {
|
| 13 |
+
"evaluation_failed": false,
|
| 14 |
+
"constraints_satisfied": false,
|
| 15 |
+
"p1_score": 0.0,
|
| 16 |
+
"p1_feasibility": 0.575134593927855,
|
| 17 |
+
"max_elongation": 5.983792904658967,
|
| 18 |
+
"aspect_ratio": 2.802311169335037,
|
| 19 |
+
"average_triangularity": -0.5512332332730122,
|
| 20 |
+
"edge_iota_over_nfp": 0.12745962182164347,
|
| 21 |
+
"vacuum_well": -1.0099648537211192,
|
| 22 |
+
"evaluation_fidelity": "low",
|
| 23 |
+
"failure_reason": ""
|
| 24 |
+
},
|
| 25 |
+
"high_fidelity": {
|
| 26 |
+
"evaluation_failed": false,
|
| 27 |
+
"constraints_satisfied": false,
|
| 28 |
+
"p1_score": 0.0,
|
| 29 |
+
"p1_feasibility": 0.5763570514697449,
|
| 30 |
+
"max_elongation": 5.9831792818066525,
|
| 31 |
+
"aspect_ratio": 2.802311169335037,
|
| 32 |
+
"average_triangularity": -0.5512332332730122,
|
| 33 |
+
"edge_iota_over_nfp": 0.12709288455907652,
|
| 34 |
+
"vacuum_well": -1.0111716777365585,
|
| 35 |
+
"evaluation_fidelity": "high",
|
| 36 |
+
"failure_reason": ""
|
| 37 |
+
},
|
| 38 |
+
"comparison": {
|
| 39 |
+
"low_high_feasibility_match": true,
|
| 40 |
+
"feasibility_delta": 0.0012224575418898764,
|
| 41 |
+
"score_delta": 0.0,
|
| 42 |
+
"ranking_compatibility": "match",
|
| 43 |
+
"low_fidelity_stored_p1_score": 0.0,
|
| 44 |
+
"low_fidelity_stored_p1_feasibility": 0.575134593927855,
|
| 45 |
+
"low_fidelity_snapshot": {
|
| 46 |
+
"missing_fields": [],
|
| 47 |
+
"drift_fields": {},
|
| 48 |
+
"mismatches": [],
|
| 49 |
+
"max_abs_drift": 0.0
|
| 50 |
+
}
|
| 51 |
+
}
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"name": "boundary_default_reset",
|
| 55 |
+
"file": "server/data/p1/boundary_default_reset.json",
|
| 56 |
+
"status": "pass",
|
| 57 |
+
"low_fidelity": {
|
| 58 |
+
"evaluation_failed": false,
|
| 59 |
+
"constraints_satisfied": false,
|
| 60 |
+
"p1_score": 0.0,
|
| 61 |
+
"p1_feasibility": 0.0506528382250242,
|
| 62 |
+
"max_elongation": 6.13677115978351,
|
| 63 |
+
"aspect_ratio": 3.31313049868072,
|
| 64 |
+
"average_triangularity": -0.4746735808874879,
|
| 65 |
+
"edge_iota_over_nfp": 0.2906263991807532,
|
| 66 |
+
"vacuum_well": -0.7537878932672235,
|
| 67 |
+
"evaluation_fidelity": "low",
|
| 68 |
+
"failure_reason": ""
|
| 69 |
+
},
|
| 70 |
+
"high_fidelity": {
|
| 71 |
+
"evaluation_failed": false,
|
| 72 |
+
"constraints_satisfied": false,
|
| 73 |
+
"p1_score": 0.0,
|
| 74 |
+
"p1_feasibility": 0.0506528382250242,
|
| 75 |
+
"max_elongation": 6.134177903677296,
|
| 76 |
+
"aspect_ratio": 3.31313049868072,
|
| 77 |
+
"average_triangularity": -0.4746735808874879,
|
| 78 |
+
"edge_iota_over_nfp": 0.28971623977263294,
|
| 79 |
+
"vacuum_well": -0.7554909069955263,
|
| 80 |
+
"evaluation_fidelity": "high",
|
| 81 |
+
"failure_reason": ""
|
| 82 |
+
},
|
| 83 |
+
"comparison": {
|
| 84 |
+
"low_high_feasibility_match": true,
|
| 85 |
+
"feasibility_delta": 0.0,
|
| 86 |
+
"score_delta": 0.0,
|
| 87 |
+
"ranking_compatibility": "match",
|
| 88 |
+
"low_fidelity_stored_p1_score": 0.0,
|
| 89 |
+
"low_fidelity_stored_p1_feasibility": 0.0506528382250242,
|
| 90 |
+
"low_fidelity_snapshot": {
|
| 91 |
+
"missing_fields": [],
|
| 92 |
+
"drift_fields": {},
|
| 93 |
+
"mismatches": [],
|
| 94 |
+
"max_abs_drift": 0.0
|
| 95 |
+
}
|
| 96 |
+
}
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"name": "lowfi_feasible_local",
|
| 100 |
+
"file": "server/data/p1/lowfi_feasible_local.json",
|
| 101 |
+
"status": "pass",
|
| 102 |
+
"low_fidelity": {
|
| 103 |
+
"evaluation_failed": false,
|
| 104 |
+
"constraints_satisfied": true,
|
| 105 |
+
"p1_score": 0.29165951078327634,
|
| 106 |
+
"p1_feasibility": 0.0,
|
| 107 |
+
"max_elongation": 7.375064402950513,
|
| 108 |
+
"aspect_ratio": 3.2870514531062405,
|
| 109 |
+
"average_triangularity": -0.5002923204919443,
|
| 110 |
+
"edge_iota_over_nfp": 0.30046082924426193,
|
| 111 |
+
"vacuum_well": -0.7949586699110935,
|
| 112 |
+
"evaluation_fidelity": "low",
|
| 113 |
+
"failure_reason": ""
|
| 114 |
+
},
|
| 115 |
+
"high_fidelity": {
|
| 116 |
+
"evaluation_failed": false,
|
| 117 |
+
"constraints_satisfied": true,
|
| 118 |
+
"p1_score": 0.2920325118884466,
|
| 119 |
+
"p1_feasibility": 0.0,
|
| 120 |
+
"max_elongation": 7.37170739300398,
|
| 121 |
+
"aspect_ratio": 3.2870514531062405,
|
| 122 |
+
"average_triangularity": -0.5002923204919443,
|
| 123 |
+
"edge_iota_over_nfp": 0.300057398146058,
|
| 124 |
+
"vacuum_well": -0.7963320784471227,
|
| 125 |
+
"evaluation_fidelity": "high",
|
| 126 |
+
"failure_reason": ""
|
| 127 |
+
},
|
| 128 |
+
"comparison": {
|
| 129 |
+
"low_high_feasibility_match": true,
|
| 130 |
+
"feasibility_delta": 0.0,
|
| 131 |
+
"score_delta": 0.0003730011051702453,
|
| 132 |
+
"ranking_compatibility": "match",
|
| 133 |
+
"low_fidelity_stored_p1_score": 0.29165951078327634,
|
| 134 |
+
"low_fidelity_stored_p1_feasibility": 0.0,
|
| 135 |
+
"low_fidelity_snapshot": {
|
| 136 |
+
"missing_fields": [],
|
| 137 |
+
"drift_fields": {},
|
| 138 |
+
"mismatches": [],
|
| 139 |
+
"max_abs_drift": 0.0
|
| 140 |
+
}
|
| 141 |
+
}
|
| 142 |
+
}
|
| 143 |
+
]
|
| 144 |
+
}
|
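The summary above can be sanity-checked without rerunning any physics. The sketch below copies two of the three result entries from the JSON and verifies that the counts add up and that each stored delta equals the high-fidelity value minus the low-fidelity value. The helper name `check_summary` and the tolerance are assumptions for illustration; they are not part of the commit.

```python
import math

# Values copied from the summary above; only two of the three results are
# reproduced here, so the count check uses the stored counters, not len(results).
summary = {
    "fixture_count": 3,
    "pass_count": 3,
    "fail_count": 0,
    "results": [
        {
            "name": "bad_low_iota",
            "low_fidelity": {"p1_score": 0.0, "p1_feasibility": 0.575134593927855},
            "high_fidelity": {"p1_score": 0.0, "p1_feasibility": 0.5763570514697449},
            "comparison": {
                "feasibility_delta": 0.0012224575418898764,
                "score_delta": 0.0,
            },
        },
        {
            "name": "lowfi_feasible_local",
            "low_fidelity": {"p1_score": 0.29165951078327634, "p1_feasibility": 0.0},
            "high_fidelity": {"p1_score": 0.2920325118884466, "p1_feasibility": 0.0},
            "comparison": {
                "feasibility_delta": 0.0,
                "score_delta": 0.0003730011051702453,
            },
        },
    ],
}


def check_summary(data: dict, abs_tol: float = 1e-12) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if data["pass_count"] + data["fail_count"] != data["fixture_count"]:
        problems.append("pass_count + fail_count != fixture_count")
    pairs = (("p1_feasibility", "feasibility_delta"), ("p1_score", "score_delta"))
    for result in data["results"]:
        low, high = result["low_fidelity"], result["high_fidelity"]
        for metric, delta_key in pairs:
            expected = high[metric] - low[metric]  # delta is high minus low
            stored = result["comparison"][delta_key]
            if not math.isclose(expected, stored, abs_tol=abs_tol):
                problems.append(f"{result['name']}: {delta_key} is inconsistent")
    return problems


problems = check_summary(summary)
```

To run the same check against the real file, the dictionary literal could be replaced with `json.load` on `baselines/fixture_high_fidelity_pairs.json`.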
baselines/high_fidelity_validation.py
ADDED
|
@@ -0,0 +1,488 @@
|
|
| 1 |
+
"""Validation utilities for high-fidelity fixture pairing and submit-side traces."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import argparse
|
| 6 |
+
import json
|
| 7 |
+
from dataclasses import asdict, dataclass
|
| 8 |
+
from datetime import UTC, datetime
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
from pprint import pformat
|
| 11 |
+
from time import perf_counter
|
| 12 |
+
from typing import Any
|
| 13 |
+
|
| 14 |
+
from fusion_lab.models import LowDimBoundaryParams, StellaratorAction
|
| 15 |
+
from server.contract import N_FIELD_PERIODS
|
| 16 |
+
from server.environment import StellaratorEnvironment
|
| 17 |
+
from server.physics import EvaluationMetrics, build_boundary_from_params, evaluate_boundary
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
LOW_FIDELITY_TOLERANCE = 1.0e-6
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def _float(value: Any) -> float | None:
|
| 24 |
+
if isinstance(value, bool):
|
| 25 |
+
return None
|
| 26 |
+
try:
|
| 27 |
+
return float(value)
|
| 28 |
+
except (TypeError, ValueError):
|
| 29 |
+
return None
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
@dataclass(frozen=True)
|
| 33 |
+
class FixturePairResult:
|
| 34 |
+
name: str
|
| 35 |
+
file: str
|
| 36 |
+
status: str
|
| 37 |
+
low_fidelity: dict[str, Any]
|
| 38 |
+
high_fidelity: dict[str, Any]
|
| 39 |
+
comparison: dict[str, Any]
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
@dataclass(frozen=True)
|
| 43 |
+
class TraceStep:
|
| 44 |
+
step: int
|
| 45 |
+
intent: str
|
| 46 |
+
action: str
|
| 47 |
+
reward: float
|
| 48 |
+
score: float
|
| 49 |
+
feasibility: float
|
| 50 |
+
constraints_satisfied: bool
|
| 51 |
+
feasibility_delta: float | None
|
| 52 |
+
score_delta: float | None
|
| 53 |
+
max_elongation: float
|
| 54 |
+
p1_feasibility: float
|
| 55 |
+
budget_remaining: int
|
| 56 |
+
evaluation_fidelity: str
|
| 57 |
+
done: bool
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def parse_args() -> argparse.Namespace:
|
| 61 |
+
parser = argparse.ArgumentParser(
|
| 62 |
+
description=(
|
| 63 |
+
"Run paired high-fidelity fixture checks and a submit-side manual trace "
|
| 64 |
+
"for the repaired P1 contract."
|
| 65 |
+
)
|
| 66 |
+
)
|
| 67 |
+
parser.add_argument(
|
| 68 |
+
"--fixture-dir",
|
| 69 |
+
type=Path,
|
| 70 |
+
default=Path("server/data/p1"),
|
| 71 |
+
help="Directory containing tracked P1 fixture JSON files.",
|
| 72 |
+
)
|
| 73 |
+
parser.add_argument(
|
| 74 |
+
"--fixture-output",
|
| 75 |
+
type=Path,
|
| 76 |
+
default=Path("baselines/fixture_high_fidelity_pairs.json"),
|
| 77 |
+
help="Output path for paired fixture summary JSON.",
|
| 78 |
+
)
|
| 79 |
+
parser.add_argument(
|
| 80 |
+
"--trace-output",
|
| 81 |
+
type=Path,
|
| 82 |
+
default=Path("baselines/submit_side_trace.json"),
|
| 83 |
+
help="Output path for one submit-side manual trace JSON.",
|
| 84 |
+
)
|
| 85 |
+
parser.add_argument(
|
| 86 |
+
"--no-write-fixture-updates",
|
| 87 |
+
action="store_true",
|
| 88 |
+
help="Do not write paired high-fidelity results back into fixture files.",
|
| 89 |
+
)
|
| 90 |
+
parser.add_argument(
|
| 91 |
+
"--skip-submit-trace",
|
| 92 |
+
action="store_true",
|
| 93 |
+
help="Only run paired fixture checks.",
|
| 94 |
+
)
|
| 95 |
+
parser.add_argument(
|
| 96 |
+
"--seed",
|
| 97 |
+
type=int,
|
| 98 |
+
default=0,
|
| 99 |
+
help="Seed for the submit-side manual trace reset state.",
|
| 100 |
+
)
|
| 101 |
+
parser.add_argument(
|
| 102 |
+
"--submit-action-sequence",
|
| 103 |
+
type=str,
|
| 104 |
+
default=(
|
| 105 |
+
"run:rotational_transform:increase:medium,"
|
| 106 |
+
"run:triangularity_scale:increase:medium,"
|
| 107 |
+
"run:elongation:decrease:small,"
|
| 108 |
+
"submit"
|
| 109 |
+
),
|
| 110 |
+
help=(
|
| 111 |
+
"Comma-separated submit trace sequence. "
|
| 112 |
+
"Run actions are `run:parameter:direction:magnitude`; include `submit` as the last token."
|
| 113 |
+
),
|
| 114 |
+
)
|
| 115 |
+
return parser.parse_args()
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
def _fixture_files(fixture_dir: Path) -> list[Path]:
|
| 119 |
+
return sorted(path for path in fixture_dir.glob("*.json") if path.is_file())
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
def _load_fixture(path: Path) -> dict[str, Any]:
|
| 123 |
+
with path.open("r") as file:
|
| 124 |
+
return json.load(file)
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def _metrics_payload(metrics: EvaluationMetrics) -> dict[str, Any]:
|
| 128 |
+
return {
|
| 129 |
+
"evaluation_failed": metrics.evaluation_failed,
|
| 130 |
+
"constraints_satisfied": metrics.constraints_satisfied,
|
| 131 |
+
"p1_score": metrics.p1_score,
|
| 132 |
+
"p1_feasibility": metrics.p1_feasibility,
|
| 133 |
+
"max_elongation": metrics.max_elongation,
|
| 134 |
+
"aspect_ratio": metrics.aspect_ratio,
|
| 135 |
+
"average_triangularity": metrics.average_triangularity,
|
| 136 |
+
"edge_iota_over_nfp": metrics.edge_iota_over_nfp,
|
| 137 |
+
"vacuum_well": metrics.vacuum_well,
|
| 138 |
+
"evaluation_fidelity": metrics.evaluation_fidelity,
|
| 139 |
+
"failure_reason": metrics.failure_reason,
|
| 140 |
+
}
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
def _parse_submit_sequence(raw: str) -> list[StellaratorAction]:
|
| 144 |
+
actions: list[StellaratorAction] = []
|
| 145 |
+
for token in raw.split(","):
|
| 146 |
+
token = token.strip()
|
| 147 |
+
if not token:
|
| 148 |
+
continue
|
| 149 |
+
|
| 150 |
+
if token == "submit":
|
| 151 |
+
actions.append(StellaratorAction(intent="submit"))
|
| 152 |
+
continue
|
| 153 |
+
|
| 154 |
+
parts = token.split(":")
|
| 155 |
+
if len(parts) != 4 or parts[0] != "run":
|
| 156 |
+
raise ValueError(
|
| 157 |
+
"Expected token format `run:parameter:direction:magnitude` or `submit`."
|
| 158 |
+
)
|
| 159 |
+
_, parameter, direction, magnitude = parts
|
| 160 |
+
actions.append(
|
| 161 |
+
StellaratorAction(
|
| 162 |
+
intent="run",
|
| 163 |
+
parameter=parameter,
|
| 164 |
+
direction=direction,
|
| 165 |
+
magnitude=magnitude,
|
| 166 |
+
)
|
| 167 |
+
)
|
| 168 |
+
|
| 169 |
+
if not actions:
|
| 170 |
+
raise ValueError("submit-action-sequence must include at least one action.")
|
| 171 |
+
if actions[-1].intent != "submit":
|
| 172 |
+
raise ValueError("submit-action-sequence must end with submit.")
|
| 173 |
+
return actions
|
| 174 |
+
|
| 175 |
+
|
| 176 |
+
def _compare_low_snapshot(
|
| 177 |
+
stored: dict[str, Any],
|
| 178 |
+
current: dict[str, Any],
|
| 179 |
+
) -> tuple[bool, dict[str, Any]]:
|
| 180 |
+
numeric_keys = [
|
| 181 |
+
"p1_feasibility",
|
| 182 |
+
"p1_score",
|
| 183 |
+
"max_elongation",
|
| 184 |
+
"aspect_ratio",
|
| 185 |
+
"average_triangularity",
|
| 186 |
+
"edge_iota_over_nfp",
|
| 187 |
+
"vacuum_well",
|
| 188 |
+
]
|
| 189 |
+
exact_keys = [
|
| 190 |
+
"constraints_satisfied",
|
| 191 |
+
"evaluation_fidelity",
|
| 192 |
+
"evaluation_failed",
|
| 193 |
+
"failure_reason",
|
| 194 |
+
]
|
| 195 |
+
missing_fields: list[str] = []
|
| 196 |
+
drift_fields: dict[str, dict[str, float]] = {}
|
| 197 |
+
mismatches: list[dict[str, Any]] = []
|
| 198 |
+
max_abs_drift = 0.0
|
| 199 |
+
|
| 200 |
+
for key in numeric_keys:
|
| 201 |
+
if key not in stored:
|
| 202 |
+
missing_fields.append(key)
|
| 203 |
+
continue
|
| 204 |
+
|
| 205 |
+
expected = _float(stored.get(key))
|
| 206 |
+
actual = _float(current.get(key))
|
| 207 |
+
if expected is None or actual is None:
|
| 208 |
+
mismatches.append(
|
| 209 |
+
{
|
| 210 |
+
"field": key,
|
| 211 |
+
"expected": stored.get(key),
|
| 212 |
+
"actual": current.get(key),
|
| 213 |
+
"reason": "non-numeric",
|
| 214 |
+
}
|
| 215 |
+
)
|
| 216 |
+
continue
|
| 217 |
+
|
| 218 |
+
drift = abs(expected - actual)
|
| 219 |
+
max_abs_drift = max(max_abs_drift, drift)
|
| 220 |
+
if drift > LOW_FIDELITY_TOLERANCE:
|
| 221 |
+
drift_fields[key] = {
|
| 222 |
+
"expected": expected,
|
| 223 |
+
"actual": actual,
|
| 224 |
+
"abs_drift": drift,
|
| 225 |
+
}
|
| 226 |
+
mismatches.append(
|
| 227 |
+
{
|
| 228 |
+
"field": key,
|
| 229 |
+
"expected": expected,
|
| 230 |
+
"actual": actual,
|
| 231 |
+
"abs_drift": drift,
|
| 232 |
+
}
|
| 233 |
+
)
|
| 234 |
+
|
| 235 |
+
for key in exact_keys:
|
| 236 |
+
if key not in stored:
|
| 237 |
+
missing_fields.append(key)
|
| 238 |
+
continue
|
| 239 |
+
|
| 240 |
+
expected = stored.get(key)
|
| 241 |
+
actual = current.get(key)
|
| 242 |
+
if expected != actual:
|
| 243 |
+
mismatches.append(
|
| 244 |
+
{
|
| 245 |
+
"field": key,
|
| 246 |
+
"expected": expected,
|
| 247 |
+
"actual": actual,
|
| 248 |
+
"reason": "exact-mismatch",
|
| 249 |
+
}
|
| 250 |
+
)
|
| 251 |
+
|
| 252 |
+
return (
|
| 253 |
+
not missing_fields and not mismatches,
|
| 254 |
+
{
|
| 255 |
+
"missing_fields": missing_fields,
|
| 256 |
+
"drift_fields": drift_fields,
|
| 257 |
+
"mismatches": mismatches,
|
| 258 |
+
"max_abs_drift": max_abs_drift,
|
| 259 |
+
},
|
| 260 |
+
)
|
| 261 |
+
|
| 262 |
+
+def _pair_fixture(path: Path) -> FixturePairResult:
+    data = _load_fixture(path)
+    params = LowDimBoundaryParams.model_validate(data["params"])
+    boundary = build_boundary_from_params(params, n_field_periods=N_FIELD_PERIODS)
+
+    low = evaluate_boundary(boundary, fidelity="low")
+    high = evaluate_boundary(boundary, fidelity="high")
+
+    low_payload = _metrics_payload(low)
+    high_payload = _metrics_payload(high)
+    low_snapshot_ok, low_snapshot = _compare_low_snapshot(
+        data.get("low_fidelity", {}),
+        low_payload,
+    )
+    feasible_match = low.constraints_satisfied == high.constraints_satisfied
+    ranking_compat = (
+        "ambiguous"
+        if low.evaluation_failed or high.evaluation_failed
+        else "match"
+        if feasible_match
+        else "mismatch"
+    )
+
+    comparison: dict[str, Any] = {
+        "low_high_feasibility_match": feasible_match,
+        "feasibility_delta": high.p1_feasibility - low.p1_feasibility,
+        "score_delta": high.p1_score - low.p1_score,
+        "ranking_compatibility": ranking_compat,
+        "low_fidelity_stored_p1_score": data.get("low_fidelity", {}).get("p1_score"),
+        "low_fidelity_stored_p1_feasibility": data.get("low_fidelity", {}).get("p1_feasibility"),
+        "low_fidelity_snapshot": low_snapshot,
+    }
+
+    status = "pass"
+    if low.evaluation_failed or high.evaluation_failed or not feasible_match or not low_snapshot_ok:
+        status = "fail"
+    if not low_snapshot_ok:
+        print(f" low-fidelity snapshot mismatch:\n{pformat(low_snapshot)}")
+
+    return FixturePairResult(
+        name=str(data.get("name", path.stem)),
+        file=str(path),
+        status=status,
+        low_fidelity=low_payload,
+        high_fidelity=high_payload,
+        comparison=comparison,
+    )
+
+
+def _write_json(payload: dict[str, Any], path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with path.open("w") as file:
+        json.dump(payload, file, indent=2)
+
+
+def _run_fixture_checks(
+    *,
+    fixture_dir: Path,
+    fixture_output: Path,
+    write_fixture_updates: bool,
+) -> tuple[list[FixturePairResult], int]:
+    results: list[FixturePairResult] = []
+    fail_count = 0
+
+    for path in _fixture_files(fixture_dir):
+        print(f"Evaluating fixture: {path.name}")
+        fixture_start = perf_counter()
+        result = _pair_fixture(path)
+        if result.status != "pass":
+            fail_count += 1
+        results.append(result)
+
+        if write_fixture_updates:
+            fixture = _load_fixture(path)
+            fixture["high_fidelity"] = result.high_fidelity
+            fixture["paired_high_fidelity_timestamp_utc"] = datetime.now(tz=UTC).isoformat()
+            with path.open("w") as file:
+                json.dump(fixture, file, indent=2)
+
+        elapsed = perf_counter() - fixture_start
+        print(
+            " done in "
+            f"{elapsed:0.1f}s | low_feasible={result.low_fidelity['constraints_satisfied']} "
+            f"| high_feasible={result.high_fidelity['constraints_satisfied']} "
+            f"| status={result.status}"
+        )
+
+    pass_count = len(results) - fail_count
+    payload = {
+        "timestamp_utc": datetime.now(tz=UTC).isoformat(),
+        "n_field_periods": N_FIELD_PERIODS,
+        "fixture_count": len(results),
+        "pass_count": pass_count,
+        "fail_count": fail_count,
+        "results": [asdict(result) for result in results],
+    }
+    _write_json(payload, fixture_output)
+    return results, fail_count
+
+
+def _run_submit_trace(
+    trace_output: Path,
+    *,
+    seed: int,
+    action_sequence: str,
+) -> dict[str, Any]:
+    env = StellaratorEnvironment()
+    obs = env.reset(seed=seed)
+    initial_state = env.state
+    actions = _parse_submit_sequence(action_sequence)
+
+    trace: list[dict[str, Any]] = [
+        {
+            "step": 0,
+            "intent": "reset",
+            "action": f"reset(seed={seed})",
+            "reward": 0.0,
+            "score": obs.p1_score,
+            "feasibility": obs.p1_feasibility,
+            "feasibility_delta": None,
+            "score_delta": None,
+            "constraints_satisfied": obs.constraints_satisfied,
+            "max_elongation": obs.max_elongation,
+            "p1_feasibility": obs.p1_feasibility,
+            "budget_remaining": obs.budget_remaining,
+            "evaluation_fidelity": obs.evaluation_fidelity,
+            "done": obs.done,
+            "params": initial_state.current_params.model_dump(),
+        }
+    ]
+
+    previous_feasibility = obs.p1_feasibility
+    previous_score = obs.p1_score
+
+    for idx, action in enumerate(actions, start=1):
+        obs = env.step(action)
+        trace.append(
+            asdict(
+                TraceStep(
+                    step=idx,
+                    intent=action.intent,
+                    action=(
+                        f"{action.parameter} {action.direction} {action.magnitude}"
+                        if action.intent == "run"
+                        else action.intent
+                    ),
+                    reward=float(obs.reward or 0.0),
+                    score=obs.p1_score,
+                    feasibility=obs.p1_feasibility,
+                    constraints_satisfied=obs.constraints_satisfied,
+                    feasibility_delta=obs.p1_feasibility - previous_feasibility,
+                    score_delta=obs.p1_score - previous_score,
+                    max_elongation=obs.max_elongation,
+                    p1_feasibility=obs.p1_feasibility,
+                    budget_remaining=obs.budget_remaining,
+                    evaluation_fidelity=obs.evaluation_fidelity,
+                    done=obs.done,
+                )
+            )
+        )
+
+        previous_feasibility = obs.p1_feasibility
+        previous_score = obs.p1_score
+        if obs.done:
+            break
+
+    total_reward = sum(step["reward"] for step in trace)
+    payload = {
+        "trace_label": "submit_side_manual",
+        "trace_profile": action_sequence,
+        "timestamp_utc": datetime.now(tz=UTC).isoformat(),
+        "n_field_periods": N_FIELD_PERIODS,
+        "seed": seed,
+        "total_reward": total_reward,
+        "final_score": obs.p1_score,
+        "final_feasibility": obs.p1_feasibility,
+        "final_constraints_satisfied": obs.constraints_satisfied,
+        "final_evaluation_fidelity": obs.evaluation_fidelity,
+        "final_evaluation_failed": obs.evaluation_failed,
+        "steps": trace,
+        "final_best_low_fidelity_score": obs.best_low_fidelity_score,
+        "final_best_low_fidelity_feasibility": obs.best_low_fidelity_feasibility,
+        "final_best_high_fidelity_score": obs.best_high_fidelity_score,
+        "final_best_high_fidelity_feasibility": obs.best_high_fidelity_feasibility,
+        "final_diagnostics_text": obs.diagnostics_text,
+    }
+    _write_json(payload, trace_output)
+    return payload
+
+
+def main() -> int:
+    args = parse_args()
+    results, fail_count = _run_fixture_checks(
+        fixture_dir=args.fixture_dir,
+        fixture_output=args.fixture_output,
+        write_fixture_updates=not args.no_write_fixture_updates,
+    )
+
+    print(
+        f"Paired fixtures: {len(results)} total, {len(results) - fail_count} pass, {fail_count} fail"
+    )
+    for result in results:
+        print(
+            f" - {result.name}: {result.status} "
+            f"(low={result.low_fidelity['constraints_satisfied']} "
+            f"high={result.high_fidelity['constraints_satisfied']})"
+        )
+
+    if not args.skip_submit_trace:
+        trace = _run_submit_trace(
+            args.trace_output,
+            seed=args.seed,
+            action_sequence=args.submit_action_sequence,
+        )
+        print(
+            f"Manual submit trace written to {args.trace_output} | "
+            f"sequence='{trace['trace_profile']}' | "
+            f"final_feasibility={trace['final_feasibility']:.6f} | "
+            f"fidelity={trace['final_evaluation_fidelity']}"
+        )
+
+    return 1 if fail_count else 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
|
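The snapshot comparison in `baselines/high_fidelity_validation.py` treats numeric fields with an absolute tolerance and other fields with exact equality. The following is a minimal standalone sketch of that idea, not the script's actual helper: the function name `compare_snapshot`, the key tuples, and the tolerance value `1e-6` are assumptions for illustration (the real `LOW_FIDELITY_TOLERANCE` value is not shown in this diff).

```python
from typing import Any

# Assumed tolerance for illustration only; the script's LOW_FIDELITY_TOLERANCE
# constant is defined outside the lines shown in this diff.
ASSUMED_TOLERANCE = 1e-6


def compare_snapshot(
    stored: dict[str, Any],
    current: dict[str, Any],
    numeric_keys: tuple[str, ...],
    exact_keys: tuple[str, ...],
    tolerance: float = ASSUMED_TOLERANCE,
) -> tuple[bool, dict[str, Any]]:
    """Return (ok, report) for a stored snapshot versus a fresh evaluation."""
    mismatches: list[dict[str, Any]] = []
    max_abs_drift = 0.0

    # Numeric fields pass when the absolute drift stays within the tolerance.
    for key in numeric_keys:
        expected, actual = stored.get(key), current.get(key)
        if not isinstance(expected, (int, float)) or not isinstance(actual, (int, float)):
            mismatches.append({"field": key, "reason": "non-numeric"})
            continue
        drift = abs(expected - actual)
        max_abs_drift = max(max_abs_drift, drift)
        if drift > tolerance:
            mismatches.append({"field": key, "abs_drift": drift})

    # Non-numeric fields (labels, flags) must match exactly.
    for key in exact_keys:
        if stored.get(key) != current.get(key):
            mismatches.append({"field": key, "reason": "exact-mismatch"})

    return not mismatches, {"mismatches": mismatches, "max_abs_drift": max_abs_drift}


stored = {"p1_feasibility": 0.0506528382250242, "evaluation_fidelity": "low"}
rerun = {"p1_feasibility": 0.05065283822502309, "evaluation_fidelity": "low"}
drifted = {"p1_feasibility": 0.06, "evaluation_fidelity": "low"}

ok_rerun, report_rerun = compare_snapshot(
    stored, rerun, ("p1_feasibility",), ("evaluation_fidelity",)
)
ok_drifted, report_drifted = compare_snapshot(
    stored, drifted, ("p1_feasibility",), ("evaluation_fidelity",)
)
```

The two `p1_feasibility` values in `stored` and `rerun` are the step-0 and step-1 values from the submit trace, which differ only at the level of floating-point noise; a tolerance check accepts them where strict equality would not.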
baselines/submit_side_trace.json
ADDED
|
@@ -0,0 +1,106 @@
+{
+  "trace_label": "submit_side_manual",
+  "trace_profile": "run:rotational_transform:increase:medium,run:triangularity_scale:increase:medium,run:elongation:decrease:small,submit",
+  "timestamp_utc": "2026-03-08T07:07:43.478814+00:00",
+  "n_field_periods": 3,
+  "seed": 0,
+  "total_reward": 5.3296,
+  "final_score": 0.29605869964467535,
+  "final_feasibility": 0.0008652388718514148,
+  "final_constraints_satisfied": true,
+  "final_evaluation_fidelity": "high",
+  "final_evaluation_failed": false,
+  "steps": [
+    {
+      "step": 0,
+      "intent": "reset",
+      "action": "reset(seed=0)",
+      "reward": 0.0,
+      "score": 0.0,
+      "feasibility": 0.0506528382250242,
+      "feasibility_delta": null,
+      "score_delta": null,
+      "constraints_satisfied": false,
+      "max_elongation": 6.13677115978351,
+      "p1_feasibility": 0.0506528382250242,
+      "budget_remaining": 6,
+      "evaluation_fidelity": "low",
+      "done": false,
+      "params": {
+        "aspect_ratio": 3.6,
+        "elongation": 1.4,
+        "rotational_transform": 1.5,
+        "triangularity_scale": 0.55
+      }
+    },
+    {
+      "step": 1,
+      "intent": "run",
+      "action": "rotational_transform increase medium",
+      "reward": -0.1,
+      "score": 0.0,
+      "feasibility": 0.05065283822502309,
+      "constraints_satisfied": false,
+      "feasibility_delta": -1.1102230246251565e-15,
+      "score_delta": 0.0,
+      "max_elongation": 6.729528139593349,
+      "p1_feasibility": 0.05065283822502309,
+      "budget_remaining": 5,
+      "evaluation_fidelity": "low",
+      "done": false
+    },
+    {
+      "step": 2,
+      "intent": "run",
+      "action": "triangularity_scale increase medium",
+      "reward": 3.1533,
+      "score": 0.29165951078326,
+      "feasibility": 0.0,
+      "constraints_satisfied": true,
+      "feasibility_delta": -0.05065283822502309,
+      "score_delta": 0.29165951078326,
+      "max_elongation": 7.37506440295066,
+      "p1_feasibility": 0.0,
+      "budget_remaining": 4,
+      "evaluation_fidelity": "low",
+      "done": false
+    },
+    {
+      "step": 3,
+      "intent": "run",
+      "action": "elongation decrease small",
+      "reward": 0.2665,
+      "score": 0.2957311862720885,
+      "feasibility": 0.0008652388718514148,
+      "constraints_satisfied": true,
+      "feasibility_delta": 0.0008652388718514148,
+      "score_delta": 0.0040716754888284745,
+      "max_elongation": 7.338419323551204,
+      "p1_feasibility": 0.0008652388718514148,
+      "budget_remaining": 3,
+      "evaluation_fidelity": "low",
+      "done": false
+    },
+    {
+      "step": 4,
+      "intent": "submit",
+      "action": "submit",
+      "reward": 2.0098,
+      "score": 0.29605869964467535,
+      "feasibility": 0.0008652388718514148,
+      "constraints_satisfied": true,
+      "feasibility_delta": 0.0,
+      "score_delta": 0.00032751337258685176,
+      "max_elongation": 7.335471703197922,
+      "p1_feasibility": 0.0008652388718514148,
+      "budget_remaining": 3,
+      "evaluation_fidelity": "high",
+      "done": true
+    }
+  ],
+  "final_best_low_fidelity_score": 0.2957311862720885,
+  "final_best_low_fidelity_feasibility": 0.0008652388718514148,
+  "final_best_high_fidelity_score": 0.29605869964467535,
+  "final_best_high_fidelity_feasibility": 0.0008652388718514148,
+  "final_diagnostics_text": "Submitted current_high_fidelity_score=0.296059, best_high_fidelity_score=0.296059, best_high_fidelity_feasibility=0.000865, constraints=SATISFIED.\n\nevaluation_fidelity=high\nevaluation_status=OK\nmax_elongation=7.3355\naspect_ratio=3.2897 (<= 4.0)\naverage_triangularity=-0.4996 (<= -0.5)\nedge_iota_over_nfp=0.3030 (>= 0.3)\nfeasibility=0.000865\nbest_low_fidelity_score=0.295731\nbest_low_fidelity_feasibility=0.000865\nbest_high_fidelity_score=0.296059\nbest_high_fidelity_feasibility=0.000865\nvacuum_well=-0.8079\nconstraints=SATISFIED\nstep=4 | budget=3/6"
+}
|
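The recorded trace can be checked for internal consistency: `total_reward` should equal the sum of the per-step rewards, and each `score_delta` should equal the difference between consecutive `score` values. The sketch below does this on a hand-copied subset of `baselines/submit_side_trace.json`; the helper names `total_reward` and `score_deltas` are hypothetical and not part of the repository.

```python
import math

# Per-step rewards and scores copied from baselines/submit_side_trace.json.
steps = [
    {"step": 0, "reward": 0.0, "score": 0.0},
    {"step": 1, "reward": -0.1, "score": 0.0},
    {"step": 2, "reward": 3.1533, "score": 0.29165951078326},
    {"step": 3, "reward": 0.2665, "score": 0.2957311862720885},
    {"step": 4, "reward": 2.0098, "score": 0.29605869964467535},
]


def total_reward(trace: list[dict]) -> float:
    """Sum the per-step rewards, the same way the trace payload does."""
    return sum(step["reward"] for step in trace)


def score_deltas(trace: list[dict]) -> list[float]:
    """Score change between each step and the step before it."""
    return [after["score"] - before["score"] for before, after in zip(trace, trace[1:])]


deltas = score_deltas(steps)
# Compare against the values stored in the trace, allowing for float rounding.
reward_ok = math.isclose(total_reward(steps), 5.3296, abs_tol=1e-9)
submit_delta_ok = math.isclose(deltas[3], 0.00032751337258685176, abs_tol=1e-12)
```

Worked by hand, the reward sum is -0.1 + 3.1533 + 0.2665 + 2.0098 = 5.3296, which matches the stored `total_reward`.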
docs/FUSION_DESIGN_LAB_PLAN_V2.md
CHANGED
|
@@ -35,12 +35,13 @@ Completed:
|
|
| 35 |
- a coarse measured sweep note now exists
|
| 36 |
- the first tracked low-fidelity fixtures now exist
|
| 37 |
- an initial low-fidelity manual playtest note now exists
|
|
|
|
|
|
|
| 38 |
|
| 39 |
Still open:
|
| 40 |
|
| 41 |
- tiny low-fidelity PPO smoke evidence
|
| 42 |
-
-
|
| 43 |
-
- submit-side manual playtest evidence
|
| 44 |
- heuristic baseline refresh on the repaired real-verifier path
|
| 45 |
- HF Space deployment evidence
|
| 46 |
- Colab artifact wiring
|
|
@@ -114,9 +115,9 @@ Compute surfaces:
|
|
| 114 |
Evidence order:
|
| 115 |
|
| 116 |
- [x] measured sweep note
|
| 117 |
-
- [
|
| 118 |
- [x] manual playtest log
|
| 119 |
-
- [
|
| 120 |
- [ ] reward iteration note
|
| 121 |
- [ ] stable local and remote episodes
|
| 122 |
- [x] random and heuristic baselines
|
|
@@ -138,10 +139,10 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
|
|
| 138 |
|
| 139 |
## 8. Execution Order
|
| 140 |
|
| 141 |
-
- [
|
| 142 |
-
- [
|
| 143 |
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
|
| 144 |
-
- [
|
| 145 |
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
|
| 146 |
- [ ] Refresh the heuristic baseline using the repaired-family evidence.
|
| 147 |
- [ ] Prove a stable local episode path.
|
|
@@ -161,6 +162,7 @@ Gate 2: tiny PPO smoke is sane
|
|
| 161 |
- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
|
| 162 |
- trajectories are readable enough to debug
|
| 163 |
- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
|
|
|
|
| 164 |
|
| 165 |
Gate 3: fixture checks pass
|
| 166 |
|
|
@@ -221,8 +223,8 @@ If the repaired family is too easy:
|
|
| 221 |
- [x] Record the measured sweep and choose provisional defaults from evidence.
|
| 222 |
- [x] Check in tracked fixtures.
|
| 223 |
- [x] Record the first manual playtest log.
|
| 224 |
-
- [
|
| 225 |
-
- [
|
| 226 |
-
- [
|
| 227 |
- [ ] Refresh the heuristic baseline from that playtest evidence.
|
| 228 |
- [ ] Verify one clean HF Space episode with the same contract.
|
|
|
|
| 35 |
- a coarse measured sweep note now exists
|
| 36 |
- the first tracked low-fidelity fixtures now exist
|
| 37 |
- an initial low-fidelity manual playtest note now exists
|
| 38 |
+
- paired high-fidelity fixture checks for those tracked fixtures now exist
|
| 39 |
+
- one submit-side manual playtest trace exists
|
| 40 |
|
| 41 |
Still open:
|
| 42 |
|
| 43 |
- tiny low-fidelity PPO smoke evidence
|
| 44 |
+
- decision on whether the reset-seed pool should change based on the paired checks
|
|
|
|
| 45 |
- heuristic baseline refresh on the repaired real-verifier path
|
| 46 |
- HF Space deployment evidence
|
| 47 |
- Colab artifact wiring
|
|
|
|
| 115 |
Evidence order:
|
| 116 |
|
| 117 |
- [x] measured sweep note
|
| 118 |
+
- [x] fixture checks
|
| 119 |
- [x] manual playtest log
|
| 120 |
+
- [x] tiny low-fi PPO smoke trace
|
| 121 |
- [ ] reward iteration note
|
| 122 |
- [ ] stable local and remote episodes
|
| 123 |
- [x] random and heuristic baselines
|
|
|
|
| 139 |
|
| 140 |
## 8. Execution Order
|
| 141 |
|
| 142 |
+
- [x] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
|
| 143 |
+
- [x] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
|
| 144 |
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
|
| 145 |
+
- [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
|
| 146 |
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
|
| 147 |
- [ ] Refresh the heuristic baseline using the repaired-family evidence.
|
| 148 |
- [ ] Prove a stable local episode path.
|
|
|
|
| 162 |
- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
|
| 163 |
- trajectories are readable enough to debug
|
| 164 |
- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
|
| 165 |
+
- current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in [`P1_PPO_SMOKE_NOTE.md`](P1_PPO_SMOKE_NOTE.md)
|
| 166 |
|
| 167 |
Gate 3: fixture checks pass
|
| 168 |
|
|
|
|
| 223 |
- [x] Record the measured sweep and choose provisional defaults from evidence.
|
| 224 |
- [x] Check in tracked fixtures.
|
| 225 |
- [x] Record the first manual playtest log.
|
| 226 |
+
- [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
|
| 227 |
+
- [x] Pair the tracked fixtures with high-fidelity submit checks.
|
| 228 |
+
- [x] Record one submit-side manual trace.
|
| 229 |
- [ ] Refresh the heuristic baseline from that playtest evidence.
|
| 230 |
- [ ] Verify one clean HF Space episode with the same contract.
|
docs/P1_MANUAL_PLAYTEST_LOG.md
CHANGED
|
@@ -50,4 +50,33 @@ Step 1:
|
|
| 50 |
Current conclusion:
|
| 51 |
|
| 52 |
- Reward V0 is legible on the low-fidelity repair path around the default reset seed
|
| 53 |
-
-
|
 Current conclusion:

 - Reward V0 is legible on the low-fidelity repair path around the default reset seed
+- a real `submit` trace is now recorded; the next manual validation step is to extend beyond the initial 5-10 episode path and look for one clear exploit or ambiguity
+
+Episode C: submit-side manual trace
+
+Scope:
+
+- same seed-0 start state as episode A
+- actions: `rotational_transform increase medium`, `triangularity_scale increase medium`, `elongation decrease small`, `submit`
+
+Step sequence:
+
+- Step 1: `rotational_transform increase medium`
+  - low-fidelity feasibility changed by `0.000000` (still infeasible)
+  - reward: `-0.1000`
+- Step 2: `triangularity_scale increase medium`
+  - crossed feasibility boundary
+  - low-fidelity feasibility moved from `0.050653` to `0.000000`
+  - reward: `+3.1533`
+- Step 3: `elongation decrease small`
+  - low-fidelity feasibility moved to `0.000865`
+  - reward: `+0.2665`
+- Step 4: `submit` (high-fidelity)
+  - final feasibility: `0.000865`
+  - final high-fidelity score: `0.296059`
+  - final reward: `+2.0098`
+  - final diagnostics: `evaluation_fidelity`=`high`, `constraints`=`SATISFIED`, `best_high_fidelity_score`=`0.296059`
+
+Artifact:
+
+- [manual submit trace JSON](../baselines/submit_side_trace.json)
|
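The action list in Episode C corresponds to the `trace_profile` string stored in the trace JSON, where each `run` action is written as `run:<parameter>:<direction>:<magnitude>` and `submit` stands alone. The sketch below shows one way such a string could be parsed. It is an assumption-laden illustration: the `Action` dataclass and `parse_action_sequence` are hypothetical names, and the repository's own `_parse_submit_sequence` is not shown in this diff and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; the real environment action type may differ."""

    intent: str
    parameter: Optional[str] = None
    direction: Optional[str] = None
    magnitude: Optional[str] = None


def parse_action_sequence(sequence: str) -> list[Action]:
    """Parse a comma-separated profile such as 'run:elongation:decrease:small,submit'."""
    actions: list[Action] = []
    for token in sequence.split(","):
        parts = token.strip().split(":")
        if parts == ["submit"]:
            actions.append(Action(intent="submit"))
        elif len(parts) == 4 and parts[0] == "run":
            _, parameter, direction, magnitude = parts
            actions.append(Action("run", parameter, direction, magnitude))
        else:
            raise ValueError(f"unrecognized action token: {token!r}")
    return actions


def describe(action: Action) -> str:
    """Render an action the way the trace's 'action' field is written."""
    if action.intent == "run":
        return f"{action.parameter} {action.direction} {action.magnitude}"
    return action.intent


profile = (
    "run:rotational_transform:increase:medium,"
    "run:triangularity_scale:increase:medium,"
    "run:elongation:decrease:small,submit"
)
parsed = parse_action_sequence(profile)
labels = [describe(action) for action in parsed]
```

Rejecting unknown tokens with `ValueError` keeps a mistyped profile from silently producing a shorter episode than intended.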
server/data/p1/FIXTURE_SANITY.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# P1 Fixture Sanity
|
| 2 |
|
| 3 |
-
This folder now contains three low-fidelity-
|
| 4 |
|
| 5 |
- `boundary_default_reset.json`
|
| 6 |
- `bad_low_iota.json`
|
|
@@ -23,8 +23,24 @@ Current interpretation:
|
|
| 23 |
- low-fidelity feasible local target
|
| 24 |
- reachable from the default reset band with two intuitive knob increases
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
-
|
| 29 |
-
- low-fi vs high-fi ranking comparison note
|
| 30 |
- decision on whether any reset seed should be changed from the current default
|
|
| 1 |
# P1 Fixture Sanity
|
| 2 |
|
| 3 |
+
This folder now contains three paired low-fidelity/high-fidelity `P1` fixtures:
|
| 4 |
|
| 5 |
- `boundary_default_reset.json`
|
| 6 |
- `bad_low_iota.json`
|
|
|
|
 - low-fidelity feasible local target
 - reachable from the default reset band with two intuitive knob increases

+Observed from paired run:

+- low-fi vs high-fi feasibility alignment and metric deltas are documented in `baselines/fixture_high_fidelity_pairs.json`.
 - decision on whether any reset seed should be changed from the current default
+
+Current paired summary (`baselines/fixture_high_fidelity_pairs.json`):
+
+- `bad_low_iota.json`:
+  - both fidelities infeasible
+  - low/high feasibility match: `true`
+  - low/high score match: both `0.0`
+
+- `boundary_default_reset.json`:
+  - both fidelities infeasible
+  - low/high feasibility match: `true`
+  - low/high score match: both `0.0`
+
+- `lowfi_feasible_local.json`:
+  - both fidelities feasible
+  - low/high feasibility match: `true`
+  - high-fidelity score is slightly higher than the low-fidelity score: `0.29165951078327634` (low) → `0.2920325118884466` (high)
|
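The paired summary above reduces each fixture to a feasibility match plus a score delta. A minimal sketch of that classification follows, mirroring the `ranking_compatibility` logic in `_pair_fixture`. The `Metrics` dataclass is a hypothetical stand-in for the evaluator's result object, and the low-fidelity `evaluation_failed=False` values are assumed from the summary rather than read from a file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Hypothetical stand-in for one evaluation result (low or high fidelity)."""

    evaluation_failed: bool
    constraints_satisfied: bool
    p1_score: float


def ranking_compatibility(low: Metrics, high: Metrics) -> str:
    """Classify a low/high pair as 'ambiguous', 'match', or 'mismatch'."""
    if low.evaluation_failed or high.evaluation_failed:
        return "ambiguous"
    if low.constraints_satisfied == high.constraints_satisfied:
        return "match"
    return "mismatch"


# Values taken from the paired summary for lowfi_feasible_local.json.
low = Metrics(False, True, 0.29165951078327634)
high = Metrics(False, True, 0.2920325118884466)

compat = ranking_compatibility(low, high)
score_delta = high.p1_score - low.p1_score

# Illustrative counter-cases, not observed in the tracked fixtures.
flipped = ranking_compatibility(low, Metrics(False, False, 0.0))
failed = ranking_compatibility(low, Metrics(True, False, 0.0))
```

For the feasible fixture the score delta is about `0.000373`, small relative to the score itself, which is consistent with the summary's "slightly higher" wording.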
server/data/p1/README.md
CHANGED
|
@@ -12,7 +12,7 @@ These fixtures are for verifier and reward sanity checks.
|
|
| 12 |
|
| 13 |
## Status
|
| 14 |
|
| 15 |
-
- [
|
| 16 |
- [x] near-boundary fixture added
|
| 17 |
- [x] clearly infeasible fixture added
|
| 18 |
- [x] fixture sanity note written
|
|
|
|
| 12 |
|
| 13 |
## Status
|
| 14 |
|
| 15 |
+
- [x] known-good or near-winning fixture added
|
| 16 |
- [x] near-boundary fixture added
|
| 17 |
- [x] clearly infeasible fixture added
|
| 18 |
- [x] fixture sanity note written
|
server/data/p1/bad_low_iota.json
CHANGED
|
@@ -4,7 +4,7 @@
|
|
| 4 |
"notes": [
|
| 5 |
"Clearly infeasible calibration case from the coarse measured sweep.",
|
| 6 |
"The dominant failure mode is low edge_iota_over_nfp, not triangularity.",
|
| 7 |
-
"High-fidelity submit spot check is
|
| 8 |
],
|
| 9 |
"params": {
|
| 10 |
"aspect_ratio": 3.2,
|
|
@@ -21,7 +21,22 @@
|
|
| 21 |
"aspect_ratio": 2.802311169335037,
|
| 22 |
"average_triangularity": -0.5512332332730122,
|
| 23 |
"edge_iota_over_nfp": 0.12745962182164347,
|
| 24 |
-
"vacuum_well": -1.0099648537211192
|
|
|
|
|
|
|
| 25 |
},
|
| 26 |
-
"high_fidelity":
|
| 27 |
}
|
|
|
|
| 4 |
"notes": [
|
| 5 |
"Clearly infeasible calibration case from the coarse measured sweep.",
|
| 6 |
"The dominant failure mode is low edge_iota_over_nfp, not triangularity.",
|
| 7 |
+
"High-fidelity submit spot check is complete."
|
| 8 |
],
|
| 9 |
"params": {
|
| 10 |
"aspect_ratio": 3.2,
|
|
|
|
| 21 |
"aspect_ratio": 2.802311169335037,
|
| 22 |
"average_triangularity": -0.5512332332730122,
|
| 23 |
"edge_iota_over_nfp": 0.12745962182164347,
|
| 24 |
+
"vacuum_well": -1.0099648537211192,
|
| 25 |
+
"evaluation_fidelity": "low",
|
| 26 |
+
"failure_reason": ""
|
| 27 |
},
|
| 28 |
+
"high_fidelity": {
|
| 29 |
+
"evaluation_failed": false,
|
| 30 |
+
"constraints_satisfied": false,
|
| 31 |
+
"p1_score": 0.0,
|
| 32 |
+
"p1_feasibility": 0.5763570514697449,
|
| 33 |
+
"max_elongation": 5.9831792818066525,
|
| 34 |
+
"aspect_ratio": 2.802311169335037,
|
| 35 |
+
"average_triangularity": -0.5512332332730122,
|
| 36 |
+
"edge_iota_over_nfp": 0.12709288455907652,
|
| 37 |
+
"vacuum_well": -1.0111716777365585,
|
| 38 |
+
"evaluation_fidelity": "high",
|
| 39 |
+
"failure_reason": ""
|
| 40 |
+
},
|
| 41 |
+
"paired_high_fidelity_timestamp_utc": "2026-03-08T07:07:19.629771+00:00"
|
| 42 |
}
|
server/data/p1/boundary_default_reset.json
CHANGED
|
@@ -4,7 +4,7 @@
|
|
| 4 |
"notes": [
|
| 5 |
"Matches the current default reset seed.",
|
| 6 |
"Useful as a near-boundary starting point for short repair episodes.",
|
| 7 |
-
"High-fidelity submit spot check is
|
| 8 |
],
|
| 9 |
"params": {
|
| 10 |
"aspect_ratio": 3.6,
|
|
@@ -21,7 +21,22 @@
|
|
| 21 |
"aspect_ratio": 3.31313049868072,
|
| 22 |
"average_triangularity": -0.4746735808874879,
|
| 23 |
"edge_iota_over_nfp": 0.2906263991807532,
|
| 24 |
-
"vacuum_well": -0.7537878932672235
|
|
|
|
|
|
|
| 25 |
},
|
| 26 |
-
"high_fidelity":
|
| 27 |
}
|
|
|
|
| 4 |
"notes": [
|
| 5 |
"Matches the current default reset seed.",
|
| 6 |
"Useful as a near-boundary starting point for short repair episodes.",
|
| 7 |
+
"High-fidelity submit spot check is complete."
|
| 8 |
],
|
| 9 |
"params": {
|
| 10 |
"aspect_ratio": 3.6,
|
|
|
|
| 21 |
"aspect_ratio": 3.31313049868072,
|
| 22 |
"average_triangularity": -0.4746735808874879,
|
| 23 |
"edge_iota_over_nfp": 0.2906263991807532,
|
| 24 |
+
"vacuum_well": -0.7537878932672235,
|
| 25 |
+
"evaluation_fidelity": "low",
|
| 26 |
+
"failure_reason": ""
|
| 27 |
},
|
| 28 |
+
"high_fidelity": {
|
| 29 |
+
"evaluation_failed": false,
|
| 30 |
+
"constraints_satisfied": false,
|
| 31 |
+
"p1_score": 0.0,
|
| 32 |
+
"p1_feasibility": 0.0506528382250242,
|
| 33 |
+
"max_elongation": 6.134177903677296,
|
| 34 |
+
"aspect_ratio": 3.31313049868072,
|
| 35 |
+
"average_triangularity": -0.4746735808874879,
|
| 36 |
+
"edge_iota_over_nfp": 0.28971623977263294,
|
| 37 |
+
"vacuum_well": -0.7554909069955263,
|
| 38 |
+
"evaluation_fidelity": "high",
|
| 39 |
+
"failure_reason": ""
|
| 40 |
+
},
|
| 41 |
+
"paired_high_fidelity_timestamp_utc": "2026-03-08T07:07:24.745385+00:00"
|
| 42 |
}
|
server/data/p1/lowfi_feasible_local.json
CHANGED
|
@@ -4,7 +4,7 @@
|
|
| 4 |
"notes": [
|
| 5 |
"Local repair target reached from the default reset band by increasing rotational_transform and triangularity_scale.",
|
| 6 |
"Useful as a low-fidelity feasibility reference for Reward V0 sanity checks.",
|
| 7 |
-
"High-fidelity submit spot check is
|
| 8 |
],
|
| 9 |
"params": {
|
| 10 |
"aspect_ratio": 3.6,
|
|
@@ -21,7 +21,22 @@
|
|
| 21 |
"aspect_ratio": 3.2870514531062405,
|
| 22 |
"average_triangularity": -0.5002923204919443,
|
| 23 |
"edge_iota_over_nfp": 0.30046082924426193,
|
| 24 |
-
"vacuum_well": -0.7949586699110935
|
|
|
|
|
|
|
| 25 |
},
|
| 26 |
-
"high_fidelity":
|
| 27 |
}
|
|
|
|
| 4 |
"notes": [
|
| 5 |
"Local repair target reached from the default reset band by increasing rotational_transform and triangularity_scale.",
|
| 6 |
"Useful as a low-fidelity feasibility reference for Reward V0 sanity checks.",
|
| 7 |
+
"High-fidelity submit spot check is complete."
|
| 8 |
],
|
| 9 |
"params": {
|
| 10 |
"aspect_ratio": 3.6,
|
|
|
|
| 21 |
"aspect_ratio": 3.2870514531062405,
|
| 22 |
"average_triangularity": -0.5002923204919443,
|
| 23 |
"edge_iota_over_nfp": 0.30046082924426193,
|
| 24 |
+
"vacuum_well": -0.7949586699110935,
|
| 25 |
+
"evaluation_fidelity": "low",
|
| 26 |
+
"failure_reason": ""
|
| 27 |
},
|
| 28 |
+
"high_fidelity": {
|
| 29 |
+
"evaluation_failed": false,
|
| 30 |
+
"constraints_satisfied": true,
|
| 31 |
+
"p1_score": 0.2920325118884466,
|
| 32 |
+
"p1_feasibility": 0.0,
|
| 33 |
+
"max_elongation": 7.37170739300398,
|
| 34 |
+
"aspect_ratio": 3.2870514531062405,
|
| 35 |
+
"average_triangularity": -0.5002923204919443,
|
| 36 |
+
"edge_iota_over_nfp": 0.300057398146058,
|
| 37 |
+
"vacuum_well": -0.7963320784471227,
|
| 38 |
+
"evaluation_fidelity": "high",
|
| 39 |
+
"failure_reason": ""
|
| 40 |
+
},
|
| 41 |
+
"paired_high_fidelity_timestamp_utc": "2026-03-08T07:07:29.939083+00:00"
|
| 42 |
}
|
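The stored `p1_feasibility` values in the three fixtures can be reproduced from the constraint bounds printed in the submit diagnostics (`aspect_ratio <= 4.0`, `average_triangularity <= -0.5`, `edge_iota_over_nfp >= 0.3`) if feasibility is taken to be the largest normalized constraint violation. That formula is an inference from the numbers in this commit, not a confirmed description of how `constellaration` computes it, and the function names below are hypothetical.

```python
import math


def normalized_violations(metrics: dict[str, float]) -> dict[str, float]:
    """Per-constraint violation divided by the bound's magnitude (0.0 when satisfied)."""
    return {
        # Upper bound: aspect_ratio <= 4.0
        "aspect_ratio": max(0.0, (metrics["aspect_ratio"] - 4.0) / 4.0),
        # Upper bound: average_triangularity <= -0.5
        "average_triangularity": max(0.0, (metrics["average_triangularity"] + 0.5) / 0.5),
        # Lower bound: edge_iota_over_nfp >= 0.3
        "edge_iota_over_nfp": max(0.0, (0.3 - metrics["edge_iota_over_nfp"]) / 0.3),
    }


def inferred_feasibility(metrics: dict[str, float]) -> float:
    """Assumed definition: the worst normalized violation across constraints."""
    return max(normalized_violations(metrics).values())


# High-fidelity metrics copied from the three tracked fixtures.
bad_low_iota = {
    "aspect_ratio": 2.802311169335037,
    "average_triangularity": -0.5512332332730122,
    "edge_iota_over_nfp": 0.12709288455907652,
}
boundary_default_reset = {
    "aspect_ratio": 3.31313049868072,
    "average_triangularity": -0.4746735808874879,
    "edge_iota_over_nfp": 0.28971623977263294,
}
lowfi_feasible_local = {
    "aspect_ratio": 3.2870514531062405,
    "average_triangularity": -0.5002923204919443,
    "edge_iota_over_nfp": 0.300057398146058,
}

bad_ok = math.isclose(inferred_feasibility(bad_low_iota), 0.5763570514697449, rel_tol=1e-9)
boundary_ok = math.isclose(
    inferred_feasibility(boundary_default_reset), 0.0506528382250242, rel_tol=1e-9
)
feasible_value = inferred_feasibility(lowfi_feasible_local)
```

Under this reading, `bad_low_iota.json` is dominated by the iota constraint and `boundary_default_reset.json` by triangularity, which agrees with the fixture notes. The submit trace reports `constraints=SATISFIED` at a feasibility of `0.000865`, so the verifier evidently allows some small tolerance; its value is not stated in this commit and is not modeled here.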