Commit e8e5af5 (parent ddcb837): docs: align training workflow and plan ssot
README.md
CHANGED

@@ -15,7 +15,7 @@ An RL environment where agents optimize stellarator fusion reactor designs by ad
 | `average_triangularity` | ≤ -0.5 |
 | `edge_iota_over_nfp` | ≥ 0.3 |
 
-The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit.
+The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. The live environment still exposes **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit), but the standard GRPO notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on the low-fidelity `run` surface and ignore `submit` by default.
 
 ## Architecture
 
@@ -30,8 +30,10 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
 - `P1` is locked as the benchmark task with `constellaration` as verifier of record
 - The repaired 4-knob low-dimensional boundary family is wired into the runtime path
 - Environment deployed to HF Spaces and verified (health, reset, step all operational)
-- GRPO training notebook
+- GRPO training notebook is checked into the repo and aligned with the shared `fusion_lab/llm_agent.py` contract
+- LLM rollout tooling can now generate fresh model completions per seed and save fixed-seed reward/outcome summaries
 - Low-fidelity PPO smoke artifacts and paired high-fidelity fixture checks exist
+- Before/after trained-policy evidence on the current low-fidelity-only workflow is still open
 
 ## Execution Status
 
@@ -65,7 +67,7 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
 - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
 - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
-
+- The standard LLM training and evaluation workflow is now low-fidelity-only: the repo notebook and `training/llm_rollout.py` `monitor` / `evaluate` ignore `submit` by default. Reserve `submit` for explicit replay/debug work, paired fixture checks, submit-side traces, and final evidence.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
 - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
@@ -135,7 +137,9 @@ uv sync --extra notebooks
 - [x] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
 - [x] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
 - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
-- [ ]
+- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
+- [ ] Run one short H100 GRPO pass with the repository notebook on the same low-fidelity-only workflow.
+- [ ] Re-run the same seeds after training and save one before/after artifact.
 - [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
 - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
 - [x] Deploy the environment to HF Space.
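The 26-action arithmetic in the updated README line can be sanity-checked with a short sketch. The parameter and magnitude names below are invented placeholders for illustration, not the environment's real identifiers:

```python
from itertools import product

# Sketch of the 26-action discrete space described in the README diff:
# 4 boundary parameters x 2 directions x 3 step magnitudes, plus the two
# special actions restore_best and submit. Names here are placeholders.
PARAMS = ["knob_0", "knob_1", "knob_2", "knob_3"]
DIRECTIONS = [+1, -1]
MAGNITUDES = ["small", "medium", "large"]

actions = [(p, d, m) for p, d, m in product(PARAMS, DIRECTIONS, MAGNITUDES)]
actions += [("restore_best",), ("submit",)]

assert len(actions) == 4 * 2 * 3 + 2 == 26
```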
docs/{findings/FUSION_DESIGN_LAB_PLAN_V2.md → FUSION_DESIGN_LAB_PLAN_V2.md}
RENAMED

@@ -37,19 +37,23 @@ Completed:
 - an initial low-fidelity manual playtest note now exists
 - paired high-fidelity fixture checks for those tracked fixtures now exist
 - one submit-side manual playtest trace exists
+- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
+- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines
 
 Still open:
 
 - decision on whether reset-seed pool should change from paired checks
 - HF Space deployment evidence
-- public notebook
+- public Colab mirror or notebook submission link, if the submission surface still requires it
+- before/after trained-policy evidence on the current low-fidelity-only workflow
 - demo and README polish after the artifacts are real
 
 Current caution:
 
 - do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
 - do not narrate low-fidelity rollout metrics as final submission truth
-
+- the standard notebook and `training/llm_rollout.py` `monitor` / `evaluate` paths now stay on low-fidelity `run` only and ignore `submit` by default
+- reserve VMEC-backed `submit` for replay/debug work, paired fixture checks, submit-side traces, and final evidence
 
 ## 3. Locked Decisions
 
@@ -98,8 +102,9 @@ Use the docs like this:
 
 Visible artifacts:
 
-- [
-- [
+- [x] HF Space environment
+- [x] Repository training notebook
+- [ ] Public Colab mirror or submission notebook link if required
 - [ ] 1-minute demo video
 - [x] Public repo and README
 
@@ -116,10 +121,12 @@ Evidence order:
 - [x] fixture checks
 - [x] manual playtest log
 - [x] tiny low-fi PPO smoke trace
+- [x] shared-helper notebook alignment
+- [x] model-driven low-fi LLM evaluation tooling
 - [ ] reward iteration note
 - [ ] stable local and remote episodes
 - [x] random and heuristic baselines
-- [ ]
+- [ ] before/after trained-policy evidence
 - [ ] demo and repo polish
 
 ## 7. Environment Summary
@@ -141,11 +148,14 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
 - [x] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
 - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
 - [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
+- [ ] Save one fixed-seed untrained baseline with the low-fidelity-only `training/llm_rollout.py evaluate` workflow.
+- [ ] Run one short H100 GRPO pass with the repository notebook on that same low-fidelity-only workflow.
+- [ ] Re-run the same seeds after training and save one before/after artifact.
 - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
 - [x] Refresh the heuristic baseline using the repaired-family evidence.
 - [ ] Prove a stable local episode path.
 - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
-- [ ]
+- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
 - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
 - [ ] Polish the public repo only after the artifacts above exist.
 
@@ -189,6 +199,12 @@ Gate 8: submission artifacts exist
 
 - the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
 
+Gate 9: trained-policy evidence is real
+
+- one fixed-seed untrained baseline exists
+- one short low-fidelity training pass exists on the same workflow
+- the repo can show a before/after comparison on the same seeds without relying on `submit`
+
 ## 10. Fallback Rules
 
 If training evidence is weak:
@@ -196,6 +212,7 @@ If training evidence is weak:
 - keep claims conservative about policy quality
 - still ship a trained-policy demonstration and document its limitations plainly
 - do not skip the paired high-fidelity checks or submit-side manual trace
+- do not swap back to submit-included reward traces and present them as the current GRPO path
 
 If HF Space deployment is delayed:
 
@@ -225,4 +242,7 @@ If the repaired family is too easy:
 - [x] Pair the tracked fixtures with high-fidelity submit checks.
 - [x] Record one submit-side manual trace.
 - [x] Refresh the heuristic baseline from that playtest evidence.
+- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
+- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
+- [ ] Re-run the same seeds and save a before/after artifact.
 - [ ] Verify one clean HF Space episode with the same contract.
docs/P1_ENV_CONTRACT_V1.md
CHANGED

@@ -1,7 +1,7 @@
 # P1 Environment Contract V1
 
 **Role:** Live technical contract SSOT for the current implementation phase
-**Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](FUSION_DESIGN_LAB_PLAN_V2.md)
+**Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](./FUSION_DESIGN_LAB_PLAN_V2.md)
 **Evidence dependency:** [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md)
 
 ## 1. Scope
@@ -179,7 +179,8 @@ VMEC preset mapping:
 Training and evaluation rule:
 
 - use low-fidelity `run` as the RL inner-loop surface
-
+- the standard repository notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on low-fidelity `run` only and ignore `submit` by default
+- keep higher-fidelity `submit` for terminal truth, explicit replay/debug work, paired fixture checks, and submit-side manual traces
 - do not move VMEC-backed submit evaluation into every training step unless the contract is deliberately redefined
 
 ## 9. Reward V0
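The training-and-evaluation rule in this contract diff (cheap low-fidelity `run` calls in the inner loop, VMEC-backed `submit` reserved for sparse checkpoint evaluation) can be sketched as a loop shape. The `env` and `policy` methods here are invented placeholders, not the real environment client API:

```python
# Sketch of the fidelity rule: low-fidelity `run` in every step,
# high-fidelity `submit` only at sparse checkpoints. All method names
# on `env` and `policy` are hypothetical stand-ins.
def train_loop(env, policy, n_steps: int, checkpoint_every: int = 500):
    checkpoint_results = []
    for step in range(n_steps):
        action = policy.act(env.observe())
        obs = env.run(action)  # low-fidelity verifier call, ~0.6s
        policy.update(obs)
        if (step + 1) % checkpoint_every == 0:
            # high-fidelity verifier call, ~4s; never on every step
            checkpoint_results.append(env.submit())
    return checkpoint_results
```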
docs/VERIFIER_REWARD_REVIEW_2026-03-08.md
CHANGED

@@ -2,6 +2,10 @@
 
 Date: 2026-03-08
 
+Historical note:
+
+This review captured repo state earlier on March 8, 2026, before the later notebook/helper alignment and low-fidelity-only LLM evaluation workflow landed. Fixtures, manual playtest notes, notebook wiring, and model-driven low-fidelity rollout tooling were added later the same day. Treat the repo-level gap callouts below as a dated snapshot; the main remaining validation gap is before/after trained-policy evidence on the current low-fidelity-only workflow.
+
 ## Scope
 
 This note reviews how well the current verifier path and reward function serve the repo's stated goal:
@@ -30,26 +34,26 @@ The verifier implementation is directionally strong and mostly correct now. The
 
 The reward is also directionally good. `Reward V0` remains simple, mostly verifier-driven, and aligned with the repo docs.
 
-The main gap
-
-
-- no manual playtest log
-- no documented reward pathology/fix or explicit "Reward V0 survived playtest" note
+The main gap at review time was no longer basic verifier wiring. The main gap was that the repo lacked the validation evidence its own docs required before calling the environment "good."
+
+Later on March 8, 2026, the repo closed the fixture/manual-playtest/notebook-wiring gaps. The remaining open validation gap is still before/after trained-policy evidence on the current low-fidelity-only workflow.
 
 ## Validated Findings
 
-### 1. Missing validation artifacts
-
-
+### 1. Missing validation artifacts were the biggest repo-level gap at review time
+
+At review time, the planning docs explicitly said the environment artifact was the product, not just the code. The repo still lacked:
 
 - tracked `P1` fixtures
 - manual playtest evidence
 - a documented `Reward V0 -> V1` change, or an explicit record that `Reward V0` survived playtesting unchanged
 
+Later updates on March 8, 2026 closed the tracked-fixture and manual-playtest parts of that gap. The live open item is still trained-policy evidence on the current low-fidelity-only workflow.
+
 Relevant references:
 
 - [FUSION_DESIGN_LAB_PLAN_V2.md](./FUSION_DESIGN_LAB_PLAN_V2.md)
-- [FUSION_DELIVERABLES_MAP.md](./FUSION_DELIVERABLES_MAP.md)
+- [archive/FUSION_DELIVERABLES_MAP.md](./archive/FUSION_DELIVERABLES_MAP.md)
 - [server/data/p1/README.md](../server/data/p1/README.md)
 
 ### 2. Observation legibility still conflicts with the official tolerance semantics
training/README.md
CHANGED

@@ -11,9 +11,11 @@ Training policy:
 ## Status
 
 - [ ] Northflank notebook artifacts saved
-- [
+- [x] repository GRPO notebook saved
+- [ ] Colab mirror or public notebook link saved if required by the submission surface
 - [x] tiny low-fi PPO smoke artifact saved
-- [ ]
+- [ ] fixed-seed untrained baseline artifact saved
+- [ ] before/after trained-policy evidence saved
 
 ## Runnable paths
 
@@ -26,6 +28,15 @@ Training policy:
 - generate fresh model completions per seed and save aggregate reward/outcome metrics:
   `uv run python training/llm_rollout.py evaluate --completion-command 'python path/to/model_cli.py' --seeds 0,1,2`
 
+Use `monitor` when you already have one completion or one action plan and want a fixed replay across seeds.
+Use `evaluate` for before/after policy comparison because it generates a fresh completion per seed.
+
+## Current validation target
+
+- save one untrained fixed-seed baseline with `evaluate`
+- run one short GRPO pass on Northflank H100 with the repository notebook
+- rerun the same seeds and compare reward plus low-fidelity feasibility before versus after
+
 ## Shared LLM Contract
 
 The prompt/action/replay contract for LLM training lives in:
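The before/after validation target in this diff (run `evaluate` on fixed seeds before training, rerun the same seeds after, then compare) amounts to diffing two per-seed reward summaries. A minimal sketch, assuming a simple seed-to-reward dict shape rather than the tool's actual output format:

```python
# Sketch of the before/after comparison on fixed seeds. The dict shape
# {seed: mean_reward} is an assumption for illustration, not the real
# output schema of `training/llm_rollout.py evaluate`.
def compare_runs(before: dict[int, float], after: dict[int, float]) -> dict:
    seeds = sorted(before)
    assert sorted(after) == seeds, "before/after must use the same fixed seeds"
    deltas = {s: after[s] - before[s] for s in seeds}
    return {
        "per_seed_delta": deltas,
        "mean_before": sum(before.values()) / len(seeds),
        "mean_after": sum(after.values()) / len(seeds),
        "mean_delta": sum(deltas.values()) / len(seeds),
    }
```

Feasibility rates could be compared the same way, with per-seed booleans aggregated instead of rewards.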
training/notebooks/README.md
CHANGED

@@ -18,8 +18,9 @@ Recommended split:
 - [x] Northflank smoke notebook note saved
 - [x] runnable Northflank smoke script saved
 - [x] Northflank smoke test passed on the team H100
+- [x] repository GRPO notebook saved
 - [ ] manual-playtest notebook or trace notebook saved
-- [ ] public submission notebook link saved
+- [ ] public Colab mirror or submission notebook link saved if still required
 
 Operational defaults:
 
@@ -27,6 +28,8 @@ Operational defaults:
 - keep heavy verifier and training work on Northflank
 - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
 - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
+- keep the repository GRPO notebook aligned to the shared helper contract in `fusion_lab/llm_agent.py`
+- the standard notebook reward/eval path is low-fidelity-only and ignores `submit` by default
 - keep the public submission notebook focused on connecting to the deployed HF Space and exporting visible traces
 - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook
 
@@ -47,4 +50,8 @@ LLM notebook helpers should use the packaged prompt/action contract in:
 
 - `fusion_lab/llm_agent.py`
 
+Current repository notebook:
+
+- `training/notebooks/fusion_design_lab_training.ipynb`
+
 The notebooks are supporting evidence for the environment, not the primary product. The required artifact is the notebook plus trained-policy evidence; a standalone checkpoint file is optional only if the notebook can still demonstrate the trained behavior.