CreativeEngineer committed
Commit 513a2e2 · 1 Parent(s): 8bf0155

docs: codify multifidelity training policy
README.md CHANGED
@@ -62,6 +62,7 @@ Implementation status:
  - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
  - The tracked fixtures in `server/data/p1/` are currently low-fidelity-calibrated. Do not narrate them as fully paired low-fi/high-fi references until the submit-side spot checks land.
  - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
+ - High-fidelity VMEC-backed `submit` is too expensive to serve as the normal RL inner loop. Keep training rollouts on low-fidelity `run`, then use high-fidelity calls for paired fixtures, submit-side traces, sparse checkpoint evaluation, and final evidence.
  - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
  - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
@@ -130,6 +131,7 @@ uv sync --extra notebooks
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
  - [ ] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
+ - [ ] Keep any checkpoint high-fidelity evaluation sparse enough that the low-fidelity inner loop stays fast.
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
  - [ ] Deploy the environment to HF Space.
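The schedule codified above — low-fidelity `run` as the training inner loop, high-fidelity `submit` only at sparse checkpoints — can be sketched as below. All function names and values are illustrative placeholders, not this repository's real API; the point is only the call pattern.

```python
# Sketch of the multifidelity schedule: the RL inner loop stays on cheap
# low-fidelity evaluation, and expensive high-fidelity `submit`-style
# evaluation runs only at sparse checkpoints. Everything here is a
# placeholder, not this repository's actual training code.

def low_fi_eval(epoch: int) -> float:
    """Placeholder for a cheap low-fidelity `run` metric."""
    return float(epoch)

def high_fi_eval(epoch: int) -> float:
    """Placeholder for an expensive high-fidelity `submit` re-evaluation."""
    return float(epoch) * 0.9

def train(num_epochs: int, high_fi_every: int = 50) -> list[dict]:
    history = []
    for epoch in range(1, num_epochs + 1):
        # ... low-fidelity rollouts and policy updates would happen here ...
        entry = {"epoch": epoch, "low_fi": low_fi_eval(epoch), "high_fi": None}
        if epoch % high_fi_every == 0:  # sparse checkpoint only
            entry["high_fi"] = high_fi_eval(epoch)
        history.append(entry)
    return history

history = train(num_epochs=120, high_fi_every=50)
high_fi_calls = [e for e in history if e["high_fi"] is not None]
print(len(high_fi_calls))  # only 2 high-fidelity calls across 120 epochs
```

With `high_fi_every=50`, a 120-epoch run triggers the expensive evaluation only at epochs 50 and 100, which is the "sparse enough that the low-fidelity inner loop stays fast" property the checklist item asks for.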
TODO.md CHANGED
@@ -198,6 +198,7 @@ flowchart TD
  treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
  stop after a few readable trajectories or one clear failure mode
  paired high-fidelity fixture checks must happen immediately after this smoke pass
+ high-fidelity VMEC-backed `submit` should stay out of the normal RL inner loop

  - [ ] Manual-playtest 5-10 episodes
  Goal:
@@ -270,7 +271,7 @@ flowchart TD
  Files:
  [README.md](README.md)

- - [ ] Only add training evidence if it is actually persuasive
+ - [ ] Only treat training evidence as submission-ready if low-fidelity gains survive sparse high-fidelity evaluation
  Related:
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
@@ -283,6 +284,7 @@ flowchart TD
  - [ ] Do not port the old `ai-sci-feasible-designs` harness
  - [ ] Do not let notebook or demo work outrun environment evidence
  - [ ] Do not let tiny low-fi smoke training replace paired high-fidelity checks or submit-side manual playtesting
+ - [ ] Do not move high-fidelity VMEC-backed `submit` into the normal RL inner loop
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -50,6 +50,7 @@ Current caution:

  - do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
  - do not narrate low-fidelity rollout metrics as final submission truth
+ - do not move high-fidelity VMEC-backed `submit` into the normal RL inner loop; keep it for truth checks and sparse evaluation

  ## 3. Locked Decisions

@@ -83,6 +84,7 @@ Practical fail-fast rule:
  - stop after a few readable trajectories or one clear failure mode
  - run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
  - do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
+ - keep any checkpoint high-fidelity evaluation sparse enough that it does not replace the low-fidelity inner loop

  ## 5. Document Roles

@@ -107,6 +109,7 @@ Compute surfaces:
  - Northflank is the main compute workspace for verifier-heavy work
  - HF Space is the hosted environment surface
  - Colab is the required public artifact and should show trained-policy behavior against the live environment
+ - trained-policy work should still iterate on low-fidelity `run`; use high-fidelity `submit` only for sparse checkpoint evaluation and final evidence

  Evidence order:

docs/P1_ENV_CONTRACT_V1.md CHANGED
@@ -163,6 +163,12 @@ The verifier should stay boundary-based:

  Do not treat parameterization-specific logic as verifier truth.

+ Training and evaluation rule:
+
+ - use low-fidelity `run` as the RL inner-loop surface
+ - keep high-fidelity `submit` for terminal truth, paired fixture checks, submit-side manual traces, and sparse checkpoint evaluation
+ - do not move high-fidelity VMEC-backed evaluation into every training step unless the contract is deliberately redefined
+
  ## 9. Reward V0

  `Reward V0` is the live reward contract until playtesting proves a concrete pathology.
docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED
@@ -17,6 +17,7 @@ Current execution priority remains:
  2. tiny PPO smoke pass as a diagnostic-only check
  3. tracked fixtures with paired high-fidelity submit checks
  4. one submit-side manual trace, then broader manual playtest
- 5. heuristic baseline refresh
- 6. HF Space proof
- 7. notebook, demo, and repo polish
+ 5. sparse checkpoint high-fidelity evaluation to confirm low-fidelity gains survive `submit`
+ 6. heuristic baseline refresh
+ 7. HF Space proof
+ 8. notebook, demo, and repo polish
training/README.md CHANGED
@@ -2,6 +2,12 @@ Training and evaluation notebooks belong here.

  This repository treats notebooks and trained-policy runs as supporting evidence for the environment, not the primary product.

+ Training policy:
+
+ - train on the low-fidelity `run` surface for the normal RL inner loop
+ - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, submit-side traces, and final evidence
+ - if low-fidelity gains do not survive high-fidelity `submit`, stop training and fix the environment or reward before pushing further
+
  ## Status

  - [ ] Northflank notebook artifacts saved
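The "gains must survive `submit`" stop rule in the training policy above amounts to a simple promotion gate, sketched below. The retention threshold and the helper name are assumptions for illustration; the repo's actual acceptance criterion may differ.

```python
# Sketch of the promotion gate: low-fidelity gains only count as
# submission-ready if they survive high-fidelity re-evaluation.
# `gains_survive` and its 0.5 retention threshold are hypothetical,
# not values taken from this repository.

def gains_survive(low_fi_delta: float, high_fi_delta: float,
                  min_retained_fraction: float = 0.5) -> bool:
    """Return True if the high-fidelity improvement retains enough of
    the low-fidelity improvement to be treated as real progress."""
    if low_fi_delta <= 0:
        return False  # no low-fidelity gain to begin with
    return high_fi_delta >= min_retained_fraction * low_fi_delta

# A +0.10 low-fi gain that collapses to +0.01 under high-fi `submit`
# should stop training rather than be narrated as progress.
print(gains_survive(low_fi_delta=0.10, high_fi_delta=0.08))  # True
print(gains_survive(low_fi_delta=0.10, high_fi_delta=0.01))  # False
```

When the gate fails, the policy says to stop training and fix the environment or reward first, which keeps the final story consistent with the "do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results" rule.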
training/notebooks/README.md CHANGED
@@ -25,6 +25,8 @@ Operational defaults:

  - use the same Python dependency set as the repo runtime
  - keep heavy verifier and training work on Northflank
+ - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
+ - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
  - keep the Colab notebook focused on connecting to the deployed HF Space and exporting visible traces
  - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook