# Fusion Design Lab — Plan V2

**Hackathon:** OpenEnv Hackathon, March 7-8, 2026
**Track:** Statement 3.1 (World Modeling — Professional Tasks)
**Role:** Planning and execution SSOT for this repo
**Updated:** March 8, 2026

## 1. Submission Thesis

Fusion Design Lab is not only a "trained model for fusion" submission. It is a clear, reproducible environment for one constrained scientific design task:

- official `P1` benchmark semantics
- narrow, human-playable action space
- real verifier feedback from `constellaration`
- explicit constraints and failure semantics
- reward logic that can be explained and iterated

The environment is the product. A trained policy is required supporting evidence because it demonstrates that the environment is learnable in practice rather than only manually playable.

## 2. Current State

Completed:

- `P1` is locked as the single benchmark task
- the repaired 4-knob low-dimensional runtime is live in code
- the official `constellaration` verifier path is wired
- the live environment is now unified onto one low-fidelity reward and verifier surface
- `submit` remains an explicit terminal action on that same live contract
- explicit VMEC failure semantics are implemented
- the Northflank smoke workflow is committed
- the Northflank smoke test passed on the team H100
- baseline comparison has been rerun on the real verifier path
- a coarse measured sweep note now exists
- the first tracked low-fidelity fixtures now exist
- an initial low-fidelity manual playtest note now exists
- paired high-fidelity fixture checks for those tracked fixtures now exist
- one submit-side manual playtest trace exists
- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines

Still open:

- decision on whether the reset-seed pool should change, based on the paired checks
- HF Space deployment evidence
- public Colab mirror or notebook submission link, if the submission surface still requires it
- before/after trained-policy evidence on the current unified low-fidelity workflow
- demo and README polish after the artifacts are real

Current cautions:

- do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
- do not narrate low-fidelity rollout metrics as final submission truth
- the standard notebook and `training/llm_rollout.py` paths should stay on the same live low-fidelity contract as the environment, including explicit `submit`
- reserve higher-fidelity validation for paired fixture checks, offline validation scripts, and final evidence

## 3. Locked Decisions

These decisions are fixed unless a hard blocker appears:

- benchmark task: `P1`
- submission framing: Statement 3.1
- verifier of record: `constellaration.problems.GeometricalProblem`
- repo strategy: fresh wiring in this repo
- reuse policy: do not port the old `ai-sci-feasible-designs` harness
- scope rule: one stable task only

Execution rules:

- do not reopen strategy unless a real blocker appears
- convert decisions into code, fixtures, traces, baselines, or deployment work

## 4. Non-Negotiables

- Keep scope to one stable task.
- Keep claims conservative and evidence-backed.
- Do not let training-first work outrun environment stability.
- Do not rely on reward curves alone; keep trajectory evidence.
- Do not use reward complexity to hide a blocked action family.
- Do not polish the repo or the video before the environment and baselines are real.
Practical fail-fast rule:

- allow a tiny low-fidelity PPO smoke run before full submit-side validation
- use it only to surface obvious learnability bugs, reward exploits, or action-space problems
- stop after a few readable trajectories or one clear failure mode
- run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
- do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
- keep any checkpoint high-fidelity evaluation sparse enough that it does not replace the low-fidelity inner loop

## 5. Document Roles

Use the docs like this:

- this file defines planning order, status, gates, and fallback rules
- [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md) defines the live technical contract
- [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md) keeps blocker evidence, sweep evidence, and supporting rationale
- archived legacy planning docs live under [`archive/`](archive/) and are not active SSOT surfaces

## 6. Artifact Plan

Visible artifacts:

- [x] HF Space environment
- [x] Repository training notebook
- [ ] Public Colab mirror or submission notebook link, if required
- [ ] 1-minute demo video
- [x] Public repo and README

Compute surfaces:

- Northflank is the main compute workspace for verifier-heavy work
- HF Space is the hosted environment surface
- the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
- trained-policy work should iterate on the same live low-fidelity environment contract that will be demoed publicly

Evidence order:

- [x] measured sweep note
- [x] fixture checks
- [x] manual playtest log
- [x] tiny low-fi PPO smoke trace
- [x] shared-helper notebook alignment
- [x] model-driven low-fi LLM evaluation tooling
- [ ] reward iteration note
- [ ] stable local and remote episodes
- [x] random and heuristic baselines
- [ ] before/after trained-policy evidence
- [ ] demo and repo polish

## 7. Environment Summary

The environment contract must stay narrow and legible:

- one repaired low-dimensional boundary family derived from a rotating-ellipse seed
- discrete `run | submit | restore_best` interaction
- one low-fidelity verifier surface for all live environment actions
- readable observation surface with explicit fidelity labeling
- `Reward V2` keeps the verifier-native `Reward V1` core and adds small best-so-far / anti-stagnation shaping for the low-fi repair loop

The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here.

## 8. Execution Order

- [x] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
- [x] Pair the tracked low-fidelity fixtures with higher-fidelity validation checks immediately after the PPO smoke pass.
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
- [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
- [ ] Save one fixed-seed untrained baseline with the unified live `training/llm_rollout.py evaluate` workflow.
- [ ] Run one short H100 GRPO pass with the repository notebook on that same unified low-fidelity workflow.
- [ ] Re-run the same seeds after training and save one before/after artifact.
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
- [x] Refresh the heuristic baseline using the repaired-family evidence.
- [ ] Prove a stable local episode path.
- [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
- [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
- [ ] Polish the public repo only after the artifacts above exist.

## 9. Success Gates

Gate 1: measured sweep exists

- repaired-family ranges, deltas, and reset seeds are justified by recorded evidence

Gate 2: tiny PPO smoke is sane

- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
- trajectories are readable enough to debug
- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
- current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in [`P1_PPO_SMOKE_NOTE.md`](P1_PPO_SMOKE_NOTE.md)

Gate 3: fixture checks pass

- good, boundary, and bad references behave as expected
- the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work

Gate 4: manual playtest passes

- a human can read the observation
- a human can choose a plausible next action
- a human can explain the reward change

Gate 5: local episode is stable

- one clean trajectory is reproducible enough for demo use

Gate 6: baseline story is credible

- heuristic behavior is at least interpretable and preferable to random on the repaired task

Gate 7: remote surface is real

- HF Space preserves the same task contract as local

Gate 8: submission artifacts exist

- the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one

Gate 9: trained-policy evidence is real

- one fixed-seed untrained baseline exists
- one short low-fidelity training pass exists on the same workflow
- the repo can show a before/after comparison on the same seeds using the live environment contract, including `submit`

## 10. Fallback Rules

If training evidence is weak:

- keep claims conservative about policy quality
- still ship a trained-policy demonstration and document its limitations plainly
- do not skip the paired higher-fidelity validation artifacts
- do not split the notebook back onto a different submit contract than the live environment

If HF Space deployment is delayed:

- keep local and Northflank evidence first
- document the deployment blocker plainly
- do not invent remote claims without a real run

If reward behavior is confusing:

- fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity

If the repaired family is too hard:

- adjust ranges, deltas, or seeds from measured evidence
- do not expand into a broad Fourier action space just to rescue the hackathon scope

If the repaired family is too easy:

- prefer fixture and seed adjustments before broadening the action schema

## 11. Immediate Next Actions

- [x] Record the measured sweep and choose provisional defaults from evidence.
- [x] Check in tracked fixtures.
- [x] Record the first manual playtest log.
- [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
- [x] Pair the tracked fixtures with higher-fidelity validation checks.
- [x] Record one submit-side manual trace.
- [x] Refresh the heuristic baseline from that playtest evidence.
- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
- [ ] Re-run the same seeds and save a before/after artifact.
- [ ] Verify one clean HF Space episode with the same contract.
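The fixed-seed before/after protocol above (untrained baseline, then the same seeds after training) can be sketched as follows. This is a minimal illustrative sketch only: `ToyEnv`, `run_episode`, and the two policies are hypothetical stand-ins, not the repo's actual `fusion_lab` or `training/llm_rollout.py` interfaces. The point it shows is the protocol itself: evaluate both policies on the same seed pool, with `submit` as an explicit terminal action, and record the per-seed deltas as the before/after artifact.

```python
# Hedged sketch of the fixed-seed before/after comparison.
# ToyEnv, run_episode, and the policies are illustrative assumptions,
# not the repo's real environment or evaluation API.
import random
from typing import Callable

class ToyEnv:
    """Minimal stand-in for the low-fidelity environment."""
    ACTIONS = ("run", "submit", "restore_best")

    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # fixed seed -> reproducible episode
        self.best = 0.0
        self.done = False

    def step(self, action: str) -> float:
        if action == "submit":                 # explicit terminal action
            self.done = True
            return self.best                   # terminal reward from best design
        score = self.rng.random()              # stand-in verifier score
        self.best = max(self.best, score)      # best-so-far shaping, as in Reward V2
        return score - 0.5

def run_episode(seed: int, policy: Callable[[int], str], max_steps: int = 8) -> float:
    """Roll one episode from a fixed seed and return the total reward."""
    env = ToyEnv(seed)
    total = 0.0
    for t in range(max_steps):
        total += env.step(policy(t))
        if env.done:
            break
    return total

def before_after(seeds, baseline, trained) -> dict:
    """Evaluate both policies on the SAME seeds; deltas are the artifact."""
    return {s: run_episode(s, trained) - run_episode(s, baseline) for s in seeds}

# Untrained baseline never submits; the "trained" policy submits on step 3.
deltas = before_after(
    seeds=range(3),
    baseline=lambda t: "run",
    trained=lambda t: "submit" if t == 3 else "run",
)
```

Because each episode is fully determined by its seed, the same comparison can be re-run after any reward or training change to regenerate the artifact.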