# Fusion Design Lab — Plan V2
**Hackathon:** OpenEnv Hackathon, March 7-8, 2026
**Track:** Statement 3.1 (World Modeling — Professional Tasks)
**Role:** Planning and execution SSOT for this repo
**Updated:** March 8, 2026
## 1. Submission Thesis
Fusion Design Lab is not only a "trained model for fusion" submission.
It is a clear, reproducible environment for one constrained scientific design task:
- official `P1` benchmark semantics
- narrow, human-playable action space
- real verifier feedback from `constellaration`
- explicit constraints and failure semantics
- reward logic that can be explained and iterated
The environment is the product. A trained policy is required supporting evidence: it demonstrates that the environment is learnable in practice, not only manually playable.
## 2. Current State
Completed:
- `P1` is locked as the single benchmark task
- the repaired 4-knob low-dimensional runtime is live in code
- the official `constellaration` verifier path is wired
- the live environment is now unified onto one low-fidelity reward and verifier surface
- `submit` remains an explicit terminal action on that same live contract
- explicit VMEC failure semantics are implemented
- the Northflank smoke workflow is committed
- the Northflank smoke test passed on the team H100
- baseline comparison has been rerun on the real verifier path
- a coarse measured sweep note now exists
- the first tracked low-fidelity fixtures now exist
- an initial low-fidelity manual playtest note now exists
- paired high-fidelity fixture checks for those tracked fixtures now exist
- one submit-side manual playtest trace exists
- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines
Still open:
- decision, based on the paired checks, on whether the reset-seed pool should change
- HF Space deployment evidence
- public Colab mirror or notebook submission link, if the submission surface still requires it
- before/after trained-policy evidence on the current unified low-fidelity workflow
- demo and README polish after the artifacts are real
Current caution:
- do not present repaired-family ranges, deltas, or budget choices as settled defaults beyond what the recorded measured sweep supports
- do not narrate low-fidelity rollout metrics as final submission truth
- the standard notebook and `training/llm_rollout.py` paths should stay on the same live low-fidelity contract as the environment, including explicit `submit`
- reserve higher-fidelity validation for paired fixture checks, offline validation scripts, and final evidence
## 3. Locked Decisions
These decisions are fixed unless a hard blocker appears:
- benchmark task: `P1`
- submission framing: `Statement 3.1`
- verifier of record: `constellaration.problems.GeometricalProblem`
- repo strategy: fresh wiring in this repo
- reuse policy: do not port the old `ai-sci-feasible-designs` harness
- scope rule: one stable task only
Execution rule:
- do not reopen strategy unless a real blocker appears
- convert decisions into code, fixtures, traces, baselines, or deployment work
## 4. Non-Negotiables
- Keep scope to one stable task.
- Keep claims conservative and evidence-backed.
- Do not let training-first work outrun environment stability.
- Do not rely on reward curves alone; keep trajectory evidence.
- Do not use reward complexity to hide a blocked action family.
- Do not polish repo or video before the environment and baselines are real.
Practical fail-fast rule:
- allow a tiny low-fidelity PPO smoke run before full submit-side validation (see the sketch after this list)
- use it only to surface obvious learnability bugs, reward exploits, or action-space problems
- stop after a few readable trajectories or one clear failure mode
- run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
- do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
- keep any checkpoint high-fidelity evaluation sparse enough that it does not replace the low-fidelity inner loop
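To make the stop condition concrete, here is a minimal sketch of the loop shape this rule implies. The `env` and `policy` handles and the `"failure"` info key are assumptions for illustration, not the repo's actual API; only the early-stop structure is the point.

```python
# A sketch of the fail-fast smoke loop, assuming a Gym-style env wrapper.
# `env`, `policy`, and the "failure" info key are hypothetical stand-ins.
def smoke_run(env, policy, max_episodes: int = 5):
    trajectories = []
    for _ in range(max_episodes):
        obs, done, steps = env.reset(), False, []
        while not done:
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            steps.append((action, reward, info.get("failure")))
        trajectories.append(steps)
        # One clear failure mode (reward exploit, stuck action family,
        # VMEC error) is enough: stop here rather than drifting into a
        # broader training phase.
        if any(failure for _, _, failure in steps):
            break
    return trajectories
```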
## 5. Document Roles
Use the docs like this:
- this file defines planning order, status, gates, and fallback rules
- [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md) defines the live technical contract
- [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md) keeps blocker evidence, sweep evidence, and supporting rationale
- archived legacy planning docs live under [`archive/`](archive/) and are not active SSOT surfaces
## 6. Artifact Plan
Visible artifacts:
- [x] HF Space environment
- [x] Repository training notebook
- [ ] Public Colab mirror or submission notebook link if required
- [ ] 1-minute demo video
- [x] Public repo and README
Compute surfaces:
- Northflank is the main compute workspace for verifier-heavy work
- HF Space is the hosted environment surface
- the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
- trained-policy work should iterate on the same live low-fidelity environment contract that will be demoed publicly
Evidence order:
- [x] measured sweep note
- [x] fixture checks
- [x] manual playtest log
- [x] tiny low-fi PPO smoke trace
- [x] shared-helper notebook alignment
- [x] model-driven low-fi LLM evaluation tooling
- [ ] reward iteration note
- [ ] stable local and remote episodes
- [x] random and heuristic baselines
- [ ] before/after trained-policy evidence
- [ ] demo and repo polish
## 7. Environment Summary
The environment contract must stay narrow and legible:
- one repaired low-dimensional boundary family derived from a rotating-ellipse seed
- discrete `run | submit | restore_best` interaction
- one low-fidelity verifier surface for all live environment actions
- readable observation surface with explicit fidelity labeling
- `Reward V2` keeps the verifier-native `Reward V1` core and adds small best-so-far / anti-stagnation shaping for the low-fi repair loop
The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here; the sketch below is illustrative orientation only.
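As orientation, a minimal sketch of the shape this contract implies. Every name and the shaping magnitudes are assumptions, not the live surface:

```python
# Hypothetical illustration of the narrow contract above; the real names
# and values live in P1_ENV_CONTRACT_V1.md, not here.
from dataclasses import dataclass
from typing import Literal

Action = Literal["run", "submit", "restore_best"]  # discrete, human-playable

@dataclass
class Observation:
    knobs: tuple[float, float, float, float]  # repaired 4-knob boundary family
    fidelity: Literal["low"]                  # explicit fidelity labeling
    best_score: float                         # best-so-far, backs restore_best

def reward_v2(verifier_score: float, best_so_far: float) -> float:
    """Verifier-native Reward V1 core plus small best-so-far shaping.

    The bonus/penalty magnitudes here are placeholders, not tuned values.
    """
    shaping = 0.1 if verifier_score > best_so_far else -0.01  # anti-stagnation
    return verifier_score + shaping
```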
## 8. Execution Order
- [x] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
- [x] Pair the tracked low-fidelity fixtures with higher-fidelity validation checks immediately after the PPO smoke pass.
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
- [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
- [ ] Save one fixed-seed untrained baseline with the unified live `training/llm_rollout.py evaluate` workflow (see the sketch after this list).
- [ ] Run one short H100 GRPO pass with the repository notebook on that same unified low-fidelity workflow.
- [ ] Re-run the same seeds after training and save one before/after artifact.
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
- [x] Refresh the heuristic baseline using the repaired-family evidence.
- [ ] Prove a stable local episode path.
- [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
- [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
- [ ] Polish the public repo only after the artifacts above exist.
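A sketch of the fixed-seed before/after step referenced above. `rollout` and `before_after` are hypothetical stand-ins for whatever `training/llm_rollout.py evaluate` exposes; the invariant being illustrated is that the seed pool is identical on both sides of training:

```python
import json

SEEDS = [0, 1, 2, 3, 4]  # placeholder pool; the same seeds on both sides

def before_after(rollout, untrained_policy, trained_policy,
                 out: str = "before_after.json"):
    """Run identical seeds through both policies and save one artifact."""
    record = {
        "before": {seed: rollout(untrained_policy, seed=seed) for seed in SEEDS},
        "after": {seed: rollout(trained_policy, seed=seed) for seed in SEEDS},
    }
    with open(out, "w") as f:
        json.dump(record, f, indent=2)  # int keys are serialized as strings
    return record
```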
## 9. Success Gates
Gate 1: measured sweep exists
- repaired-family ranges, deltas, and reset seeds are justified by recorded evidence
Gate 2: tiny PPO smoke is sane
- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
- trajectories are readable enough to debug
- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
- current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in [`P1_PPO_SMOKE_NOTE.md`](P1_PPO_SMOKE_NOTE.md)
Gate 3: fixture checks pass
- good, boundary, and bad references behave as expected
- the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work
Gate 4: manual playtest passes
- a human can read the observation
- a human can choose a plausible next action
- a human can explain the reward change
Gate 5: local episode is stable
- one clean trajectory is reproducible enough for demo use
Gate 6: baseline story is credible
- heuristic behavior is at least interpretable and preferable to random on the repaired task
Gate 7: remote surface is real
- HF Space preserves the same task contract as local
Gate 8: submission artifacts exist
- the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
Gate 9: trained-policy evidence is real
- one fixed-seed untrained baseline exists
- one short low-fidelity training pass exists on the same workflow
- the repo can show a before/after comparison on the same seeds using the live environment contract, including `submit`
## 10. Fallback Rules
If training evidence is weak:
- keep claims conservative about policy quality
- still ship a trained-policy demonstration and document its limitations plainly
- do not skip the paired higher-fidelity validation artifacts
- do not split the notebook back onto a different submit contract than the live environment
If HF Space deployment is delayed:
- keep local and Northflank evidence first
- document the deployment blocker plainly
- do not invent remote claims without a real run
If reward behavior is confusing:
- fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity
If the repaired family is too hard:
- adjust ranges, deltas, or seeds from measured evidence
- do not expand into a broad Fourier action space just to rescue the hackathon scope
If the repaired family is too easy:
- prefer fixture and seed adjustments before broadening the action schema
## 11. Immediate Next Actions
- [x] Record the measured sweep and choose provisional defaults from evidence.
- [x] Check in tracked fixtures.
- [x] Record the first manual playtest log.
- [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
- [x] Pair the tracked fixtures with higher-fidelity validation checks.
- [x] Record one submit-side manual trace.
- [x] Refresh the heuristic baseline from that playtest evidence.
- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
- [ ] Re-run the same seeds and save a before/after artifact.
- [ ] Verify one clean HF Space episode with the same contract.