Commit e8e5af5 (parent ddcb837): docs: align training workflow and plan ssot
README.md
CHANGED

@@ -15,7 +15,7 @@ An RL environment where agents optimize stellarator fusion reactor designs by ad
 | `average_triangularity` | ≤ -0.5 |
 | `edge_iota_over_nfp` | ≥ 0.3 |
 
-The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit.
+The environment uses [`constellaration`](https://pypi.org/project/constellaration/) as the physics verifier — low-fidelity (~0.6s) for the RL inner loop, high-fidelity (~4s) for terminal submit. The live environment still exposes **26 discrete actions** (4 parameters × 2 directions × 3 magnitudes + restore_best + submit), but the standard GRPO notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on the low-fidelity `run` surface and ignore `submit` by default.
 
 ## Architecture
 
@@ -30,8 +30,10 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
 - `P1` is locked as the benchmark task with `constellaration` as verifier of record
 - The repaired 4-knob low-dimensional boundary family is wired into the runtime path
 - Environment deployed to HF Spaces and verified (health, reset, step all operational)
-- GRPO training notebook
+- GRPO training notebook is checked into the repo and aligned with the shared `fusion_lab/llm_agent.py` contract
+- LLM rollout tooling can now generate fresh model completions per seed and save fixed-seed reward/outcome summaries
 - Low-fidelity PPO smoke artifacts and paired high-fidelity fixture checks exist
+- Before/after trained-policy evidence on the current low-fidelity-only workflow is still open
 
 ## Execution Status
 
@@ -65,7 +67,7 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
 - The repaired family now has a first coarse measured sweep note in [docs/P1_MEASURED_SWEEP_NOTE.md](docs/P1_MEASURED_SWEEP_NOTE.md), but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks.
 - The paired low-fi/high-fi fixture snapshots are now written into each fixture JSON and summarized in `baselines/fixture_high_fidelity_pairs.json`.
 - `run` uses low-fidelity `constellaration` metrics, while `submit` re-evaluates the current design with high-fidelity `skip_qi`; do not present step-time metrics as final submission metrics.
-
+- The standard LLM training and evaluation workflow is now low-fidelity-only: the repo notebook and `training/llm_rollout.py` `monitor` / `evaluate` ignore `submit` by default. Reserve `submit` for explicit replay/debug work, paired fixture checks, submit-side traces, and final evidence.
 - VMEC failure semantics are now explicit in the runtime path. Failed evaluations cost budget, produce a visible failure observation, and apply a penalty.
 - Terminal reward/reporting now uses a fidelity-consistent basis: `submit` compares against high-fidelity reference state instead of low-fidelity rollout score state.
 - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
@@ -135,7 +137,9 @@ uv sync --extra notebooks
 - [x] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
 - [x] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
 - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
-- [ ]
+- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
+- [ ] Run one short H100 GRPO pass with the repository notebook on the same low-fidelity-only workflow.
+- [ ] Re-run the same seeds after training and save one before/after artifact.
 - [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
 - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
 - [x] Deploy the environment to HF Space.
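The 26-action arithmetic in the updated README line can be sanity-checked with a short sketch. The parameter and magnitude names below are invented placeholders for illustration, not the environment's real identifiers:

```python
from itertools import product

# Sketch of the 26-action discrete space described in the README diff:
# 4 boundary parameters x 2 directions x 3 step magnitudes, plus the two
# special actions restore_best and submit. Names here are placeholders.
PARAMS = ["knob_0", "knob_1", "knob_2", "knob_3"]
DIRECTIONS = [+1, -1]
MAGNITUDES = ["small", "medium", "large"]

actions = [(p, d, m) for p, d, m in product(PARAMS, DIRECTIONS, MAGNITUDES)]
actions += [("restore_best",), ("submit",)]

assert len(actions) == 4 * 2 * 3 + 2 == 26
```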
docs/{findings/FUSION_DESIGN_LAB_PLAN_V2.md → FUSION_DESIGN_LAB_PLAN_V2.md}
RENAMED

@@ -37,19 +37,23 @@ Completed:
 - an initial low-fidelity manual playtest note now exists
 - paired high-fidelity fixture checks for those tracked fixtures now exist
 - one submit-side manual playtest trace exists
+- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
+- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines
 
 Still open:
 
 - decision on whether reset-seed pool should change from paired checks
 - HF Space deployment evidence
-- public notebook
+- public Colab mirror or notebook submission link, if the submission surface still requires it
+- before/after trained-policy evidence on the current low-fidelity-only workflow
 - demo and README polish after the artifacts are real
 
 Current caution:
 
 - do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
 - do not narrate low-fidelity rollout metrics as final submission truth
-
+- the standard notebook and `training/llm_rollout.py` `monitor` / `evaluate` paths now stay on low-fidelity `run` only and ignore `submit` by default
+- reserve VMEC-backed `submit` for replay/debug work, paired fixture checks, submit-side traces, and final evidence
 
 ## 3. Locked Decisions
 
@@ -98,8 +102,9 @@ Use the docs like this:
 
 Visible artifacts:
 
-- [
-- [
+- [x] HF Space environment
+- [x] Repository training notebook
+- [ ] Public Colab mirror or submission notebook link if required
 - [ ] 1-minute demo video
 - [x] Public repo and README
 
@@ -116,10 +121,12 @@ Evidence order:
 - [x] fixture checks
 - [x] manual playtest log
 - [x] tiny low-fi PPO smoke trace
+- [x] shared-helper notebook alignment
+- [x] model-driven low-fi LLM evaluation tooling
 - [ ] reward iteration note
 - [ ] stable local and remote episodes
 - [x] random and heuristic baselines
-- [ ]
+- [ ] before/after trained-policy evidence
 - [ ] demo and repo polish
 
 ## 7. Environment Summary
@@ -141,11 +148,14 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
 - [x] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
 - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
 - [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
+- [ ] Save one fixed-seed untrained baseline with the low-fidelity-only `training/llm_rollout.py evaluate` workflow.
+- [ ] Run one short H100 GRPO pass with the repository notebook on that same low-fidelity-only workflow.
+- [ ] Re-run the same seeds after training and save one before/after artifact.
 - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
 - [x] Refresh the heuristic baseline using the repaired-family evidence.
 - [ ] Prove a stable local episode path.
 - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
-- [ ]
+- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
 - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
 - [ ] Polish the public repo only after the artifacts above exist.
 
@@ -189,6 +199,12 @@ Gate 8: submission artifacts exist
 
 - the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
 
+Gate 9: trained-policy evidence is real
+
+- one fixed-seed untrained baseline exists
+- one short low-fidelity training pass exists on the same workflow
+- the repo can show a before/after comparison on the same seeds without relying on `submit`
+
 ## 10. Fallback Rules
 
 If training evidence is weak:
@@ -196,6 +212,7 @@ If training evidence is weak:
 - keep claims conservative about policy quality
 - still ship a trained-policy demonstration and document its limitations plainly
 - do not skip the paired high-fidelity checks or submit-side manual trace
+- do not swap back to submit-included reward traces and present them as the current GRPO path
 
 If HF Space deployment is delayed:
 
@@ -225,4 +242,7 @@ If the repaired family is too easy:
 - [x] Pair the tracked fixtures with high-fidelity submit checks.
 - [x] Record one submit-side manual trace.
 - [x] Refresh the heuristic baseline from that playtest evidence.
+- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
+- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
+- [ ] Re-run the same seeds and save a before/after artifact.
 - [ ] Verify one clean HF Space episode with the same contract.
docs/P1_ENV_CONTRACT_V1.md
CHANGED

@@ -1,7 +1,7 @@
 # P1 Environment Contract V1
 
 **Role:** Live technical contract SSOT for the current implementation phase
-**Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](FUSION_DESIGN_LAB_PLAN_V2.md)
+**Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](./FUSION_DESIGN_LAB_PLAN_V2.md)
 **Evidence dependency:** [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md)
 
 ## 1. Scope
@@ -179,7 +179,8 @@ VMEC preset mapping:
 Training and evaluation rule:
 
 - use low-fidelity `run` as the RL inner-loop surface
-
+- the standard repository notebook and `training/llm_rollout.py` `monitor` / `evaluate` workflows stay on low-fidelity `run` only and ignore `submit` by default
+- keep higher-fidelity `submit` for terminal truth, explicit replay/debug work, paired fixture checks, and submit-side manual traces
 - do not move VMEC-backed submit evaluation into every training step unless the contract is deliberately redefined
 
 ## 9. Reward V0
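The training-and-evaluation rule in this contract diff (cheap low-fidelity `run` calls in the inner loop, VMEC-backed `submit` reserved for sparse checkpoint evaluation) can be sketched as a loop shape. The `env` and `policy` methods here are invented placeholders, not the real environment client API:

```python
# Sketch of the fidelity rule: low-fidelity `run` in every step,
# high-fidelity `submit` only at sparse checkpoints. All method names
# on `env` and `policy` are hypothetical stand-ins.
def train_loop(env, policy, n_steps: int, checkpoint_every: int = 500):
    checkpoint_results = []
    for step in range(n_steps):
        action = policy.act(env.observe())
        obs = env.run(action)  # low-fidelity verifier call, ~0.6s
        policy.update(obs)
        if (step + 1) % checkpoint_every == 0:
            # high-fidelity verifier call, ~4s; never on every step
            checkpoint_results.append(env.submit())
    return checkpoint_results
```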
docs/VERIFIER_REWARD_REVIEW_2026-03-08.md
CHANGED

@@ -2,6 +2,10 @@
 
 Date: 2026-03-08
 
+Historical note:
+
+This review captured repo state earlier on March 8, 2026, before the later notebook/helper alignment and low-fidelity-only LLM evaluation workflow landed. Fixtures, manual playtest notes, notebook wiring, and model-driven low-fidelity rollout tooling were added later the same day. Treat the repo-level gap callouts below as a dated snapshot; the main remaining validation gap is before/after trained-policy evidence on the current low-fidelity-only workflow.
+
 ## Scope
 
 This note reviews how well the current verifier path and reward function serve the repo's stated goal:
@@ -30,26 +34,26 @@ The verifier implementation is directionally strong and mostly correct now. The
 
 The reward is also directionally good. `Reward V0` remains simple, mostly verifier-driven, and aligned with the repo docs.
 
-The main gap
-
-
-- no manual playtest log
-- no documented reward pathology/fix or explicit "Reward V0 survived playtest" note
+The main gap at review time was no longer basic verifier wiring. The main gap was that the repo lacked the validation evidence its own docs required before calling the environment "good."
+
+Later on March 8, 2026, the repo closed the fixture/manual-playtest/notebook-wiring gaps. The remaining open validation gap is still before/after trained-policy evidence on the current low-fidelity-only workflow.
 
 ## Validated Findings
 
-### 1. Missing validation artifacts
-
-
+### 1. Missing validation artifacts were the biggest repo-level gap at review time
+
+At review time, the planning docs explicitly said the environment artifact was the product, not just the code. The repo still lacked:
 
 - tracked `P1` fixtures
 - manual playtest evidence
 - a documented `Reward V0 -> V1` change, or an explicit record that `Reward V0` survived playtesting unchanged
 
+Later updates on March 8, 2026 closed the tracked-fixture and manual-playtest parts of that gap. The live open item is still trained-policy evidence on the current low-fidelity-only workflow.
+
 Relevant references:
 
 - [FUSION_DESIGN_LAB_PLAN_V2.md](./FUSION_DESIGN_LAB_PLAN_V2.md)
-- [FUSION_DELIVERABLES_MAP.md](./FUSION_DELIVERABLES_MAP.md)
+- [archive/FUSION_DELIVERABLES_MAP.md](./archive/FUSION_DELIVERABLES_MAP.md)
 - [server/data/p1/README.md](../server/data/p1/README.md)
 
 ### 2. Observation legibility still conflicts with the official tolerance semantics
training/README.md
CHANGED

@@ -11,9 +11,11 @@ Training policy:
 ## Status
 
 - [ ] Northflank notebook artifacts saved
-- [
+- [x] repository GRPO notebook saved
+- [ ] Colab mirror or public notebook link saved if required by the submission surface
 - [x] tiny low-fi PPO smoke artifact saved
-- [ ]
+- [ ] fixed-seed untrained baseline artifact saved
+- [ ] before/after trained-policy evidence saved
 
 ## Runnable paths
 
@@ -26,6 +28,15 @@ Training policy:
 - generate fresh model completions per seed and save aggregate reward/outcome metrics:
   `uv run python training/llm_rollout.py evaluate --completion-command 'python path/to/model_cli.py' --seeds 0,1,2`
 
+Use `monitor` when you already have one completion or one action plan and want a fixed replay across seeds.
+Use `evaluate` for before/after policy comparison because it generates a fresh completion per seed.
+
+## Current validation target
+
+- save one untrained fixed-seed baseline with `evaluate`
+- run one short GRPO pass on Northflank H100 with the repository notebook
+- rerun the same seeds and compare reward plus low-fidelity feasibility before versus after
+
 ## Shared LLM Contract
 
 The prompt/action/replay contract for LLM training lives in:
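The before/after validation target in this diff (run `evaluate` on fixed seeds before training, rerun the same seeds after, then compare) amounts to diffing two per-seed reward summaries. A minimal sketch, assuming a simple seed-to-reward dict shape rather than the tool's actual output format:

```python
# Sketch of the before/after comparison on fixed seeds. The dict shape
# {seed: mean_reward} is an assumption for illustration, not the real
# output schema of `training/llm_rollout.py evaluate`.
def compare_runs(before: dict[int, float], after: dict[int, float]) -> dict:
    seeds = sorted(before)
    assert sorted(after) == seeds, "before/after must use the same fixed seeds"
    deltas = {s: after[s] - before[s] for s in seeds}
    return {
        "per_seed_delta": deltas,
        "mean_before": sum(before.values()) / len(seeds),
        "mean_after": sum(after.values()) / len(seeds),
        "mean_delta": sum(deltas.values()) / len(seeds),
    }
```

Feasibility rates could be compared the same way, with per-seed booleans aggregated instead of rewards.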
training/notebooks/README.md
CHANGED

@@ -18,8 +18,9 @@ Recommended split:
 - [x] Northflank smoke notebook note saved
 - [x] runnable Northflank smoke script saved
 - [x] Northflank smoke test passed on the team H100
+- [x] repository GRPO notebook saved
 - [ ] manual-playtest notebook or trace notebook saved
-- [ ] public submission notebook link saved
+- [ ] public Colab mirror or submission notebook link saved if still required
 
 Operational defaults:
 
@@ -27,6 +28,8 @@ Operational defaults:
 - keep heavy verifier and training work on Northflank
 - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
 - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
+- keep the repository GRPO notebook aligned to the shared helper contract in `fusion_lab/llm_agent.py`
+- the standard notebook reward/eval path is low-fidelity-only and ignores `submit` by default
 - keep the public submission notebook focused on connecting to the deployed HF Space and exporting visible traces
 - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook
 
@@ -47,4 +50,8 @@ LLM notebook helpers should use the packaged prompt/action contract in:
 
 - `fusion_lab/llm_agent.py`
 
+Current repository notebook:
+
+- `training/notebooks/fusion_design_lab_training.ipynb`
+
 The notebooks are supporting evidence for the environment, not the primary product. The required artifact is the notebook plus trained-policy evidence; a standalone checkpoint file is optional only if the notebook can still demonstrate the trained behavior.