Commit f238af4
Parent: 3bfd80a

feat: refresh heuristic baseline and sync docs

Changed files:

- README.md +5 -5
- TODO.md +3 -1
- baselines/README.md +25 -2
- baselines/heuristic_agent.py +33 -8
- docs/findings/FUSION_DESIGN_LAB_PLAN_V2.md +2 -4
- docs/findings/P1_REPLAY_PLAYTEST_REPORT.md +31 -18
README.md (CHANGED)

@@ -55,7 +55,7 @@ Implementation status:
 - [x] Add tracked `P1` fixtures under `server/data/p1/`
 - [x] Run a tiny low-fi PPO smoke run as a diagnostic-only check and save one trajectory artifact
 - [x] Complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
-- [
+- [x] Refresh the heuristic baseline for the real verifier path
 - [ ] Deploy the real environment to HF Space

 ## Known Gaps

@@ -69,14 +69,14 @@ Implementation status:
 - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
-- The real-verifier
-- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is now
+- The refreshed real-verifier heuristic now follows the measured feasible sequence instead of the older threshold-only policy: on a fresh `uv run python baselines/compare.py 5` rerun, it finished with `5/5` feasible high-fidelity finals, mean final `P1` score `0.291951`, and `5/5` wins over random.
+- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is now reset-seed confirmation and one presentation-ready comparison trace backed by the paired high-fidelity evidence.
 - The first tiny PPO smoke note is in [docs/P1_PPO_SMOKE_NOTE.md](docs/P1_PPO_SMOKE_NOTE.md). It produced a valid trajectory artifact and exposed a repeated-action local failure, which is the right outcome for a smoke run.

 Current mode:

 - strategic task choice is already locked
-- the next work is
+- the next work is reset-seed confirmation, trace export, and deployment
 - new planning text should only appear when a real blocker forces a decision change

 ## Planned Repository Layout

@@ -135,7 +135,7 @@ uv sync --extra notebooks
 - [x] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
 - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
 - [ ] Keep any checkpoint high-fidelity evaluation sparse enough that the low-fidelity inner loop stays fast.
-- [ ]
+- [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
 - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
 - [ ] Deploy the environment to HF Space.
 - [ ] Add the Colab notebook under `training/notebooks`.
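The submit-versus-exhaustion asymmetry noted in the Known Gaps above can be sketched as a tiny payoff function. The name `terminal_reward` and the scale factors here are illustrative stand-ins, not the actual `environment.py` implementation:

```python
def terminal_reward(final_score: float, submitted: bool) -> float:
    """Illustrative terminal payoff: explicit submit keeps the full
    score-based reward; budget exhaustion pays strictly less for the
    same design, so agents prefer deliberate submission."""
    base = 10.0 * final_score  # illustrative scale, not the repo's constant
    if submitted:
        return base
    return 0.5 * base  # exhaustion discount (illustrative)


# Same design, different terminal payoff: the asymmetry to preserve when tuning.
assert terminal_reward(0.3, submitted=True) > terminal_reward(0.3, submitted=False)
```

The point of the sketch is only the strict inequality: any reward retune should keep explicit `submit` dominant over running the budget out.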
TODO.md (CHANGED)

@@ -46,7 +46,9 @@ Priority source:
 - [x] tiny low-fi PPO smoke run exists
 Note:
 `training/ppo_smoke.py` now runs a diagnostic-only low-fidelity PPO smoke pass and the first artifact is summarized in `docs/P1_PPO_SMOKE_NOTE.md`
-- [
+- [x] refresh the heuristic baseline for the real verifier path
+Note:
+the refreshed heuristic now uses the measured `rotational_transform -> triangularity_scale -> elongation -> submit` path; a fresh `uv run python baselines/compare.py 5` rerun finished at `5/5` feasible high-fidelity finals and `5/5` wins over random

 ## Execution Graph
baselines/README.md (CHANGED)

@@ -8,11 +8,34 @@ Random and heuristic baselines will live here.
 - [x] baseline comparison rerun completed on the real verifier path
 - [x] verified that the current 3-knob family is blocked on P1 triangularity under the real verifier path
 - [x] repair the low-dimensional parameterization before further heuristic work
-- [
-- [
+- [x] use the measured repaired-family evidence and current frozen seed set before retuning the heuristic
+- [x] heuristic refreshed after the real-verifier rerun
 - [ ] near-boundary fixture-backed baseline start chosen for manual playtesting
 - [ ] presentation-ready comparison trace exported

+## Current Heuristic
+
+The refreshed heuristic now follows the measured repaired-family transition pattern:
+
+- if a low-fidelity evaluation fails, `restore_best`
+- if a reset starts with low `edge_iota_over_nfp`, push `rotational_transform increase medium` first
+- once `average_triangularity` is close enough, push `triangularity_scale increase medium`
+- once feasible, take at most a small amount of `elongation decrease small`
+- submit as soon as the design is feasible and the elongation is in the safe band
+
+This keeps the baseline on the real verifier path instead of relying on the older threshold-only policy that over-pushed triangularity and missed the feasible sequence.
+
+## Latest Rerun
+
+`uv run python baselines/compare.py 5`
+
+- random mean reward: `-2.2438`
+- heuristic mean reward: `+5.2825`
+- random mean final `P1` score: `0.000000`
+- heuristic mean final `P1` score: `0.291951`
+- feasible high-fidelity finals: `0/5` random vs `5/5` heuristic
+- heuristic wins: `5/5`
+
 The first baseline milestone is:

 - one random agent
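A comparison harness in the spirit of `baselines/compare.py` could aggregate per-episode traces into summary numbers like the rerun stats above. The episode dicts and stub data below are illustrative stand-ins (the real script's internals are not shown in this commit); only the field names mirror the trace entries in `heuristic_agent.py`:

```python
from statistics import mean


def summarize(episodes: list[dict]) -> dict:
    """Aggregate a list of final-episode records into summary stats."""
    return {
        "mean_reward": mean(e["reward"] for e in episodes),
        "mean_final_score": mean(e["score"] for e in episodes),
        "feasible_finals": sum(e["constraints_satisfied"] for e in episodes),
    }


def head_to_head(random_eps: list[dict], heuristic_eps: list[dict]):
    """Pair episodes by index and count heuristic wins on episode reward."""
    wins = sum(h["reward"] > r["reward"] for r, h in zip(random_eps, heuristic_eps))
    return summarize(random_eps), summarize(heuristic_eps), wins


# Stub data only, not the measured rerun numbers.
random_eps = [{"reward": -2.0, "score": 0.0, "constraints_satisfied": False}] * 5
heur_eps = [{"reward": 5.0, "score": 0.29, "constraints_satisfied": True}] * 5
_, heur_summary, wins = head_to_head(random_eps, heur_eps)
print(wins, heur_summary["feasible_finals"])  # -> 5 5
```

Pairing by index keeps the win count comparable across reruns as long as both agents see the same frozen seed set.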
baselines/heuristic_agent.py (CHANGED)

@@ -7,6 +7,12 @@ import sys
 from fusion_lab.models import StellaratorAction, StellaratorObservation
 from server.environment import StellaratorEnvironment

+FEASIBLE_SUBMIT_ELONGATION_MAX = 7.45
+TRIANGULARITY_TARGET_MAX = -0.5
+LOW_IOTA_RESET_THRESHOLD = 0.305
+IOTA_RECOVERY_THRESHOLD = 0.3
+ASPECT_RATIO_TARGET_MAX = 4.0
+

 def heuristic_episode(
     env: StellaratorEnvironment, seed: int | None = None

@@ -19,6 +25,10 @@ def heuristic_episode(
             "score": obs.p1_score,
             "evaluation_fidelity": obs.evaluation_fidelity,
             "constraints_satisfied": obs.constraints_satisfied,
+            "feasibility": obs.p1_feasibility,
+            "max_elongation": obs.max_elongation,
+            "average_triangularity": obs.average_triangularity,
+            "edge_iota_over_nfp": obs.edge_iota_over_nfp,
         }
     ]

@@ -35,6 +45,10 @@ def heuristic_episode(
             "score": obs.p1_score,
             "evaluation_fidelity": obs.evaluation_fidelity,
             "constraints_satisfied": obs.constraints_satisfied,
+            "feasibility": obs.p1_feasibility,
+            "max_elongation": obs.max_elongation,
+            "average_triangularity": obs.average_triangularity,
+            "edge_iota_over_nfp": obs.edge_iota_over_nfp,
             "reward": obs.reward,
             "failure": obs.evaluation_failed,
         }

@@ -44,8 +58,15 @@ def heuristic_episode(


 def _choose_action(obs: StellaratorObservation) -> StellaratorAction:
+    if obs.evaluation_failed:
+        return StellaratorAction(intent="restore_best")
+
     if obs.constraints_satisfied:
-        if
+        if (
+            obs.max_elongation <= FEASIBLE_SUBMIT_ELONGATION_MAX
+            or obs.budget_remaining <= 2
+            or obs.step_number >= 3
+        ):
             return StellaratorAction(intent="submit")
         return StellaratorAction(
             intent="run",

@@ -54,18 +75,22 @@ def _choose_action(obs: StellaratorObservation) -> StellaratorAction:
             magnitude="small",
         )

-    if obs.
+    if obs.average_triangularity > TRIANGULARITY_TARGET_MAX:
+        if obs.step_number == 0 and obs.edge_iota_over_nfp < LOW_IOTA_RESET_THRESHOLD:
+            return StellaratorAction(
+                intent="run",
+                parameter="rotational_transform",
+                direction="increase",
+                magnitude="medium",
+            )
         return StellaratorAction(
             intent="run",
             parameter="triangularity_scale",
             direction="increase",
-            magnitude="
+            magnitude="medium",
         )

-    if obs.edge_iota_over_nfp <
+    if obs.edge_iota_over_nfp < IOTA_RECOVERY_THRESHOLD:
         return StellaratorAction(
             intent="run",
             parameter="rotational_transform",

@@ -73,7 +98,7 @@ def _choose_action(obs: StellaratorObservation) -> StellaratorAction:
             magnitude="small",
         )

-    if obs.aspect_ratio >
+    if obs.aspect_ratio > ASPECT_RATIO_TARGET_MAX:
         return StellaratorAction(
             intent="run",
             parameter="aspect_ratio",
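The refreshed branch order above can be exercised without the server stack via a stand-in observation. `Obs` and `choose` below are self-contained sketches that mirror, but are not, the real `StellaratorObservation` and `_choose_action`; the feasible-side submit guards and the final aspect-ratio action details are elided:

```python
from dataclasses import dataclass


@dataclass
class Obs:
    """Stand-in for StellaratorObservation with only the fields the policy reads."""
    evaluation_failed: bool = False
    constraints_satisfied: bool = False
    average_triangularity: float = 0.6
    edge_iota_over_nfp: float = 0.25
    aspect_ratio: float = 3.6
    step_number: int = 0


def choose(obs: Obs) -> str:
    # Mirrors the branch order of the refreshed _choose_action.
    if obs.evaluation_failed:
        return "restore_best"
    if obs.constraints_satisfied:
        return "submit"  # submit guards elided in this sketch
    if obs.average_triangularity > -0.5:  # TRIANGULARITY_TARGET_MAX
        if obs.step_number == 0 and obs.edge_iota_over_nfp < 0.305:
            return "rotational_transform increase medium"
        return "triangularity_scale increase medium"
    if obs.edge_iota_over_nfp < 0.3:  # IOTA_RECOVERY_THRESHOLD
        return "rotational_transform increase small"
    return "aspect_ratio step"  # direction/magnitude elided in this sketch


# A fresh low-iota reset walks the measured feasible sequence first.
assert choose(Obs()) == "rotational_transform increase medium"
assert choose(Obs(step_number=1)) == "triangularity_scale increase medium"
```

Checks like these keep the decision order pinned down even when the real environment is too slow to run in a unit test.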
docs/findings/FUSION_DESIGN_LAB_PLAN_V2.md (CHANGED)

@@ -40,9 +40,7 @@ Completed:

 Still open:

-- tiny low-fidelity PPO smoke evidence
 - decision on whether reset-seed pool should change from paired checks
-- heuristic baseline refresh on the repaired real-verifier path
 - HF Space deployment evidence
 - Colab artifact wiring
 - demo and README polish after the artifacts are real

@@ -144,7 +142,7 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md)
 - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
 - [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
 - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
-- [
+- [x] Refresh the heuristic baseline using the repaired-family evidence.
 - [ ] Prove a stable local episode path.
 - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
 - [ ] Wire the Colab artifact to the live environment.

@@ -226,5 +224,5 @@ If the repaired family is too easy:
 - [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
 - [x] Pair the tracked fixtures with high-fidelity submit checks.
 - [x] Record one submit-side manual trace.
-- [
+- [x] Refresh the heuristic baseline from that playtest evidence.
 - [ ] Verify one clean HF Space episode with the same contract.
docs/findings/P1_REPLAY_PLAYTEST_REPORT.md (CHANGED)

@@ -2,6 +2,16 @@

 Date: 2026-03-07

+Update: 2026-03-08
+
+This report is still useful for reward-branch coverage and low-fidelity failure
+pathologies, but its Episode 5 submit result is now historical only. The newer
+manual submit trace in `../P1_MANUAL_PLAYTEST_LOG.md` records the same
+`rotational_transform increase medium -> triangularity_scale increase medium ->
+elongation decrease small -> submit` path succeeding at high fidelity with
+score `0.296059`. Do not use this replay report as the current source of truth
+for submit viability.
+
 ## Purpose

 Expand reward branch coverage beyond the initial manual playtest (Episodes A-B in

@@ -148,11 +158,13 @@ Branches exercised:
 - **submit high-fidelity evaluation** (step 4)
 - **submit failure penalty** (-3.0, step 4: VMEC crash at high fidelity)

+Historical finding: the state at
 `(ar=3.6, elong=1.35, rt=1.6, tri=0.60)` passes low-fidelity evaluation
 (step 3: score=0.296, constraints satisfied) but **crashes at high-fidelity
-evaluation** (step 4: VMEC failure).
-the
+evaluation** in this replay run (step 4: VMEC failure). A newer manual submit
+trace now records the same action sequence succeeding at high fidelity, so this
+episode should be treated as a historical discrepancy rather than live evidence
+of a persistent cross-fidelity gap.

 ## Reward branch coverage summary

@@ -168,26 +180,27 @@ the real final check for this particular path.
 | Budget exhaustion done-penalty | `environment.py:264-265` | not tested | Ep 3 step 6 |
 | Recovery bonus (+1.0) | `environment.py:248-249` | not tested | Ep 1 step 6, Ep 4 step 4 |
 | Budget exhaustion done-bonus | `environment.py:258-263` | not tested | Ep 1 step 6, Ep 2 step 6, Ep 4 step 6 |
-| Submit improvement bonus | `environment.py:260-261` | not tested |
+| Submit improvement bonus | `environment.py:260-261` | not tested | historical replay did not trigger it |
 | Clamping (no physics change) | `environment.py:412-414` | not tested | Ep 3 step 1 |
 | restore_best | `environment.py:175-195` | not tested | Ep 4 step 4 |

-Coverage: 12 of 13 branches exercised. The only untested branch
-**submit improvement bonus**
+Coverage: 12 of 13 branches exercised in this replay. The only untested branch
+here is the **submit improvement bonus**. A newer manual submit trace now
+provides positive high-fidelity submit evidence, but that branch was not
+exercised in this historical replay artifact.

 ## Critical findings

-### 1.
+### 1. Historical submit discrepancy (Episode 5)

 The canonical repair path from seed 0 (increase rt medium, increase tri medium,
-decrease elong small)
-fidelity
-line 53 and `FUSION_DESIGN_LAB_PLAN_V2.md` open items.
+decrease elong small) produced a low-fi feasible state that crashed at high
+fidelity in this replay run.

+Update: this is no longer the live repo conclusion. The newer manual submit
+trace in `../P1_MANUAL_PLAYTEST_LOG.md` records the same path succeeding at
+high fidelity. Treat Episode 5 as evidence that submit behavior needed repeated
+checking, not as proof that seed 0 lacks a viable submit path.

 ### 2. Elongation crash pocket (Episode 1)

@@ -224,13 +237,13 @@ monotonically improve the design.
 | Feasible-side shaping | not tested | confirmed legible |
 | VMEC crash handling | not tested | confirmed legible |
 | restore_best | not tested | confirmed working |
-| Submit tested | no | yes (
-| Cross-fidelity evidence | none |
+| Submit tested | no | yes (historical replay crash) |
+| Cross-fidelity evidence | none | mixed; superseded by newer successful manual submit trace |

 ## Open items

-1. **
+1. **Export the newer high-fidelity-safe submit trace** alongside this replay so
+   the historical Episode 5 crash is not read as the live repo conclusion.
 2. **Map the elongation crash pocket** with a targeted sweep over the elongation
    dimension at feasible parameter combinations.
 3. **Update the measured sweep report** to document the elongation crash zone.