# Fusion Design Lab — Plan V2
**Hackathon:** OpenEnv Hackathon, March 7-8, 2026
**Track:** Statement 3.1 (World Modeling — Professional Tasks)
**Role:** Planning and execution SSOT for this repo
**Updated:** March 8, 2026
## 1. Submission Thesis
Fusion Design Lab is not only a "trained model for fusion" submission.
It is a clear, reproducible environment for one constrained scientific design task:
- official `P1` benchmark semantics
- narrow, human-playable action space
- real verifier feedback from `constellaration`
- explicit constraints and failure semantics
- reward logic that can be explained and iterated
The environment is the product. A trained policy is required as supporting evidence: it demonstrates that the environment is learnable in practice, not merely manually playable.
## 2. Current State
Completed:
- `P1` is locked as the single benchmark task
- the repaired 4-knob low-dimensional runtime is live in code
- the official `constellaration` verifier path is wired
- the live environment is now unified onto one low-fidelity reward and verifier surface
- `submit` remains an explicit terminal action on that same live contract
- explicit VMEC failure semantics are implemented
- the Northflank smoke workflow is committed
- the Northflank smoke test passed on the team H100
- baseline comparison has been rerun on the real verifier path
- a coarse measured sweep note now exists
- the first tracked low-fidelity fixtures now exist
- an initial low-fidelity manual playtest note now exists
- paired high-fidelity fixture checks for those tracked fixtures now exist
- one submit-side manual playtest trace exists
- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines
Still open:
- decision on whether the reset-seed pool should change, based on the paired checks
- HF Space deployment evidence
- public Colab mirror or notebook submission link, if the submission surface still requires it
- before/after trained-policy evidence on the current unified low-fidelity workflow
- demo and README polish after the artifacts are real
Current caution:
- do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
- do not narrate low-fidelity rollout metrics as final submission truth
- the standard notebook and `training/llm_rollout.py` paths should stay on the same live low-fidelity contract as the environment, including explicit `submit`
- reserve higher-fidelity validation for paired fixture checks, offline validation scripts, and final evidence
## 3. Locked Decisions
These decisions are fixed unless a hard blocker appears:
- benchmark task: `P1`
- submission framing: `Statement 3.1`
- verifier of record: `constellaration.problems.GeometricalProblem`
- repo strategy: fresh wiring in this repo
- reuse policy: do not port the old `ai-sci-feasible-designs` harness
- scope rule: one stable task only
Execution rule:
- do not reopen strategy unless a real blocker appears
- convert decisions into code, fixtures, traces, baselines, or deployment work
## 4. Non-Negotiables
- Keep scope to one stable task.
- Keep claims conservative and evidence-backed.
- Do not let training-first work outrun environment stability.
- Do not rely on reward curves alone; keep trajectory evidence.
- Do not use reward complexity to hide a blocked action family.
- Do not polish repo or video before the environment and baselines are real.
Practical fail-fast rule:
- allow a tiny low-fidelity PPO smoke run before full submit-side validation (sketched after this list)
- use it only to surface obvious learnability bugs, reward exploits, or action-space problems
- stop after a few readable trajectories or one clear failure mode
- run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
- do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
- keep any checkpoint high-fidelity evaluation sparse enough that it does not replace the low-fidelity inner loop
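A minimal sketch of what "tiny smoke run" means here, assuming a Gym-style wrapper around the live low-fidelity environment; `FusionDesignEnv`, its import path, and the `reset`/`step` signature are all hypothetical, not the repo's actual API:

```python
# Hypothetical smoke harness: FusionDesignEnv and its reset/step
# signature are assumptions, not the repo's real interface.
import random

from fusion_lab.env import FusionDesignEnv  # hypothetical import path

ACTIONS = ["run", "submit", "restore_best"]
MAX_EPISODES = 3  # stop after a few readable trajectories

env = FusionDesignEnv(fidelity="low", seed=0)
for ep in range(MAX_EPISODES):
    obs = env.reset()
    done, step = False, 0
    while not done:
        action = random.choice(ACTIONS)  # a tiny/random policy is enough
        obs, reward, done, info = env.step(action)
        step += 1
        # Log every transition: the point is readable trajectories and
        # obvious bugs, not reward curves.
        print(f"ep={ep} step={step} action={action} reward={reward:.3f}")
```

One clear failure mode (a reward exploit, a stuck action family, an unreadable observation) is a valid stopping point; the harness should not grow into a training loop.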
## 5. Document Roles
Use the docs like this:
- this file defines planning order, status, gates, and fallback rules
- [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md) defines the live technical contract
- [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md) keeps blocker evidence, sweep evidence, and supporting rationale
- archived legacy planning docs live under [`archive/`](archive/) and are not active SSOT surfaces
## 6. Artifact Plan
Visible artifacts:
- [x] HF Space environment
- [x] Repository training notebook
- [ ] Public Colab mirror or submission notebook link if required
- [ ] 1-minute demo video
- [x] Public repo and README
Compute surfaces:
- Northflank is the main compute workspace for verifier-heavy work
- HF Space is the hosted environment surface
- the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
- trained-policy work should iterate on the same live low-fidelity environment contract that will be demoed publicly
Evidence order:
- [x] measured sweep note
- [x] fixture checks
- [x] manual playtest log
- [x] tiny low-fi PPO smoke trace
- [x] shared-helper notebook alignment
- [x] model-driven low-fi LLM evaluation tooling
- [ ] reward iteration note
- [ ] stable local and remote episodes
- [x] random and heuristic baselines
- [ ] before/after trained-policy evidence
- [ ] demo and repo polish
## 7. Environment Summary
The environment contract must stay narrow and legible:
- one repaired low-dimensional boundary family derived from a rotating-ellipse seed
- discrete `run | submit | restore_best` interaction
- one low-fidelity verifier surface for all live environment actions
- readable observation surface with explicit fidelity labeling
- `Reward V2` keeps the verifier-native `Reward V1` core and adds small best-so-far / anti-stagnation shaping for the low-fi repair loop
The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here; the sketch below is orientation only.
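As orientation, a minimal sketch of the observation surface and `Reward V2` shaping described above; every name and constant is illustrative (the dataclass fields, the bonus and penalty magnitudes, and the failure value are assumptions, not the tuned contract):

```python
from dataclasses import dataclass

@dataclass
class Observation:
    # Hypothetical observation surface; field names are illustrative.
    knobs: tuple[float, float, float, float]  # 4-knob repaired family
    verifier_score: float                     # low-fidelity verifier output
    best_so_far: float
    fidelity: str                             # explicit label, e.g. "low"
    vmec_failed: bool                         # explicit failure semantics

def shaped_reward(score: float, best_so_far: float,
                  stagnant_steps: int, vmec_failed: bool) -> float:
    """Reward V2 sketch: verifier-native core plus small shaping.
    All constants here are illustrative, not the tuned values."""
    if vmec_failed:
        return -1.0                     # explicit failure, not silent clipping
    reward = score                      # Reward V1 core: verifier-native
    if score > best_so_far:
        reward += 0.1                   # small best-so-far bonus
    if stagnant_steps > 5:
        reward -= 0.05                  # small anti-stagnation pressure
    return reward
```

The design point the sketch is meant to show: the verifier-native score stays the core signal, and the shaping terms are small enough that they cannot dominate it.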
## 8. Execution Order
- [x] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
- [x] Pair the tracked low-fidelity fixtures with higher-fidelity validation checks immediately after the PPO smoke pass.
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
- [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
- [ ] Save one fixed-seed untrained baseline with the unified live `training/llm_rollout.py evaluate` workflow.
- [ ] Run one short H100 GRPO pass with the repository notebook on that same unified low-fidelity workflow.
- [ ] Re-run the same seeds after training and save one before/after artifact (see the comparison sketch after this list).
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
- [x] Refresh the heuristic baseline using the repaired-family evidence.
- [ ] Prove a stable local episode path.
- [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
- [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
- [ ] Polish the public repo only after the artifacts above exist.
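A sketch of that before/after artifact, assuming a hypothetical `evaluate_policy` callable that runs one fixed-seed episode through the live low-fidelity workflow, including explicit `submit`, and returns the final score:

```python
import json

SEEDS = [0, 1, 2, 3, 4]  # fixed seed pool; the actual pool is a repo decision

def before_after(evaluate_policy, untrained, trained, path="before_after.json"):
    """Run the same seeds through the same live low-fi workflow before
    and after training, and save one directly comparable artifact."""
    report = {}
    for seed in SEEDS:
        report[seed] = {
            "untrained": evaluate_policy(untrained, seed=seed),
            "trained": evaluate_policy(trained, seed=seed),
        }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return report
```

Keeping both runs inside one artifact, keyed by seed, is what makes the comparison auditable rather than anecdotal.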
## 9. Success Gates
Gate 1: measured sweep exists
- repaired-family ranges, deltas, and reset seeds are justified by recorded evidence
Gate 2: tiny PPO smoke is sane
- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
- trajectories are readable enough to debug
- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
- current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in [`P1_PPO_SMOKE_NOTE.md`](P1_PPO_SMOKE_NOTE.md)
Gate 3: fixture checks pass
- good, boundary, and bad references behave as expected
- the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work (see the test sketch after the gates)
Gate 4: manual playtest passes
- a human can read the observation
- a human can choose a plausible next action
- a human can explain the reward change
Gate 5: local episode is stable
- one clean trajectory is reproducible enough for demo use
Gate 6: baseline story is credible
- heuristic behavior is at least interpretable and preferable to random on the repaired task
Gate 7: remote surface is real
- HF Space preserves the same task contract as local
Gate 8: submission artifacts exist
- the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
Gate 9: trained-policy evidence is real
- one fixed-seed untrained baseline exists
- one short low-fidelity training pass exists on the same workflow
- the repo can show a before/after comparison on the same seeds using the live environment contract, including `submit`
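Gate 3's paired checks, sketched as a test; `load_fixture`, `score_low_fi`, and `score_high_fi` are hypothetical stand-ins for the repo's real fixture helpers on the `constellaration` verifier path:

```python
# Hypothetical paired fixture check; the loader and scoring wrappers are
# assumptions standing in for the repo's real helpers.
from fusion_lab.fixtures import load_fixture, score_low_fi, score_high_fi

def test_fixture_ordering_is_preserved():
    names = ["good", "boundary", "bad"]
    low = [score_low_fi(load_fixture(n)) for n in names]
    high = [score_high_fi(load_fixture(n)) for n in names]
    # Qualitative agreement is the gate: both fidelity surfaces should
    # rank the tracked references good >= boundary >= bad.
    assert low == sorted(low, reverse=True)
    assert high == sorted(high, reverse=True)
```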
## 10. Fallback Rules
If training evidence is weak:
- keep claims conservative about policy quality
- still ship a trained-policy demonstration and document its limitations plainly
- do not skip the paired higher-fidelity validation artifacts
- do not split the notebook back onto a different submit contract from the live environment
If HF Space deployment is delayed:
- keep local and Northflank evidence first
- document the deployment blocker plainly
- do not invent remote claims without a real run
If reward behavior is confusing:
- fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity
If the repaired family is too hard:
- adjust ranges, deltas, or seeds from measured evidence
- do not expand into a broad Fourier action space just to rescue the hackathon scope
If the repaired family is too easy:
- prefer fixture and seed adjustments before broadening the action schema
## 11. Immediate Next Actions
- [x] Record the measured sweep and choose provisional defaults from evidence.
- [x] Check in tracked fixtures.
- [x] Record the first manual playtest log.
- [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
- [x] Pair the tracked fixtures with higher-fidelity validation checks.
- [x] Record one submit-side manual trace.
- [x] Refresh the heuristic baseline from that playtest evidence.
- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
- [ ] Re-run the same seeds and save a before/after artifact.
- [ ] Verify one clean HF Space episode with the same contract.
|