# Fusion Design Lab — Plan V2
**Hackathon:** OpenEnv Hackathon, March 7-8, 2026
**Track:** Statement 3.1 (World Modeling — Professional Tasks)
**Role:** Planning and execution SSOT for this repo
**Updated:** March 8, 2026
## 1. Submission Thesis
Fusion Design Lab is not only a "trained model for fusion" submission.
It is a clear, reproducible environment for one constrained scientific design task:
- official `P1` benchmark semantics
- narrow, human-playable action space
- real verifier feedback from `constellaration`
- explicit constraints and failure semantics
- reward logic that can be explained and iterated
The environment is the product. A trained policy is required supporting evidence: it demonstrates that the environment is learnable in practice, not only manually playable.
## 2. Current State
Completed:
- `P1` is locked as the single benchmark task
- the repaired 4-knob low-dimensional runtime is live in code
- the official `constellaration` verifier path is wired
- the live environment is now unified onto one low-fidelity reward and verifier surface
- `submit` remains an explicit terminal action on that same live contract
- explicit VMEC failure semantics are implemented
- the Northflank smoke workflow is committed
- the Northflank smoke test passed on the team H100
- baseline comparison has been rerun on the real verifier path
- a coarse measured sweep note now exists
- the first tracked low-fidelity fixtures now exist
- an initial low-fidelity manual playtest note now exists
- paired high-fidelity fixture checks for those tracked fixtures now exist
- one submit-side manual playtest trace exists
- the repository GRPO notebook is checked in and aligned to the shared `fusion_lab/llm_agent.py` helper contract
- model-driven fixed-seed low-fidelity `monitor` / `evaluate` tooling exists for LLM baselines
Still open:
- decision, based on the paired checks, on whether the reset-seed pool should change
- HF Space deployment evidence
- public Colab mirror or notebook submission link, if the submission surface still requires it
- before/after trained-policy evidence on the current unified low-fidelity workflow
- demo and README polish after the artifacts are real
Current caution:
- do not present repaired-family ranges, deltas, or budget choices as settled defaults beyond what the recorded measured sweep supports
- do not narrate low-fidelity rollout metrics as final submission truth
- the standard notebook and `training/llm_rollout.py` paths should stay on the same live low-fidelity contract as the environment, including explicit `submit`
- reserve higher-fidelity validation for paired fixture checks, offline validation scripts, and final evidence
## 3. Locked Decisions
These decisions are fixed unless a hard blocker appears:
- benchmark task: `P1`
- submission framing: `Statement 3.1`
- verifier of record: `constellaration.problems.GeometricalProblem`
- repo strategy: fresh wiring in this repo
- reuse policy: do not port the old `ai-sci-feasible-designs` harness
- scope rule: one stable task only
Execution rule:
- do not reopen strategy unless a real blocker appears
- convert decisions into code, fixtures, traces, baselines, or deployment work
## 4. Non-Negotiables
- Keep scope to one stable task.
- Keep claims conservative and evidence-backed.
- Do not let training-first work outrun environment stability.
- Do not rely on reward curves alone; keep trajectory evidence.
- Do not use reward complexity to hide a blocked action family.
- Do not polish repo or video before the environment and baselines are real.
Practical fail-fast rule:
- allow a tiny low-fidelity PPO smoke run before full submit-side validation (see the sketch after this list)
- use it only to surface obvious learnability bugs, reward exploits, or action-space problems
- stop after a few readable trajectories or one clear failure mode
- run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
- do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
- keep any checkpoint high-fidelity evaluation sparse enough that it does not replace the low-fidelity inner loop
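To make the stop condition concrete, here is a minimal sketch of the loop shape this rule implies. The `env` and `policy` handles and the `"failure"` info key are assumptions for illustration, not the repo's actual API; only the early-stop structure is the point.

```python
# A sketch of the fail-fast smoke loop, assuming a Gym-style env wrapper.
# `env`, `policy`, and the "failure" info key are hypothetical stand-ins.
def smoke_run(env, policy, max_episodes: int = 5):
    trajectories = []
    for _ in range(max_episodes):
        obs, done, steps = env.reset(), False, []
        while not done:
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            steps.append((action, reward, info.get("failure")))
        trajectories.append(steps)
        # One clear failure mode (reward exploit, stuck action family,
        # VMEC error) is enough: stop here rather than drifting into a
        # broader training phase.
        if any(failure for _, _, failure in steps):
            break
    return trajectories
```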
## 5. Document Roles
Use the docs like this:
- this file defines planning order, status, gates, and fallback rules
- [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md) defines the live technical contract
- [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md) keeps blocker evidence, sweep evidence, and supporting rationale
- archived legacy planning docs live under [`archive/`](archive/) and are not active SSOT surfaces
## 6. Artifact Plan
Visible artifacts:
- [x] HF Space environment
- [x] Repository training notebook
- [ ] Public Colab mirror or submission notebook link if required
- [ ] 1-minute demo video
- [x] Public repo and README
Compute surfaces:
- Northflank is the main compute workspace for verifier-heavy work
- HF Space is the hosted environment surface
- the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
- trained-policy work should iterate on the same live low-fidelity environment contract that will be demoed publicly
Evidence order:
- [x] measured sweep note
- [x] fixture checks
- [x] manual playtest log
- [x] tiny low-fi PPO smoke trace
- [x] shared-helper notebook alignment
- [x] model-driven low-fi LLM evaluation tooling
- [ ] reward iteration note
- [ ] stable local and remote episodes
- [x] random and heuristic baselines
- [ ] before/after trained-policy evidence
- [ ] demo and repo polish
## 7. Environment Summary
The environment contract must stay narrow and legible:
- one repaired low-dimensional boundary family derived from a rotating-ellipse seed
- discrete `run | submit | restore_best` interaction
- one low-fidelity verifier surface for all live environment actions
- readable observation surface with explicit fidelity labeling
- `Reward V2` keeps the verifier-native `Reward V1` core and adds small best-so-far / anti-stagnation shaping for the low-fi repair loop
The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md), not here; the sketch below is illustrative orientation only.
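As orientation, a minimal sketch of the shape this contract implies. Every name and the shaping magnitudes are assumptions, not the live surface:

```python
# Hypothetical illustration of the narrow contract above; the real names
# and values live in P1_ENV_CONTRACT_V1.md, not here.
from dataclasses import dataclass
from typing import Literal

Action = Literal["run", "submit", "restore_best"]  # discrete, human-playable

@dataclass
class Observation:
    knobs: tuple[float, float, float, float]  # repaired 4-knob boundary family
    fidelity: Literal["low"]                  # explicit fidelity labeling
    best_score: float                         # best-so-far, backs restore_best

def reward_v2(verifier_score: float, best_so_far: float) -> float:
    """Verifier-native Reward V1 core plus small best-so-far shaping.

    The bonus/penalty magnitudes here are placeholders, not tuned values.
    """
    shaping = 0.1 if verifier_score > best_so_far else -0.01  # anti-stagnation
    return verifier_score + shaping
```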
## 8. Execution Order
- [x] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
- [x] Pair the tracked low-fidelity fixtures with higher-fidelity validation checks immediately after the PPO smoke pass.
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
- [x] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
- [ ] Save one fixed-seed untrained baseline with the unified live `training/llm_rollout.py evaluate` workflow (see the sketch after this list).
- [ ] Run one short H100 GRPO pass with the repository notebook on that same unified low-fidelity workflow.
- [ ] Re-run the same seeds after training and save one before/after artifact.
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
- [x] Refresh the heuristic baseline using the repaired-family evidence.
- [ ] Prove a stable local episode path.
- [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
- [ ] Publish or mirror the notebook artifact only after the live before/after path is real.
- [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
- [ ] Polish the public repo only after the artifacts above exist.
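A sketch of the fixed-seed before/after step referenced above. `rollout` and `before_after` are hypothetical stand-ins for whatever `training/llm_rollout.py evaluate` exposes; the invariant being illustrated is that the seed pool is identical on both sides of training:

```python
import json

SEEDS = [0, 1, 2, 3, 4]  # placeholder pool; the same seeds on both sides

def before_after(rollout, untrained_policy, trained_policy,
                 out: str = "before_after.json"):
    """Run identical seeds through both policies and save one artifact."""
    record = {
        "before": {seed: rollout(untrained_policy, seed=seed) for seed in SEEDS},
        "after": {seed: rollout(trained_policy, seed=seed) for seed in SEEDS},
    }
    with open(out, "w") as f:
        json.dump(record, f, indent=2)  # int keys are serialized as strings
    return record
```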
## 9. Success Gates
Gate 1: measured sweep exists
- repaired-family ranges, deltas, and reset seeds are justified by recorded evidence
Gate 2: tiny PPO smoke is sane
- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
- trajectories are readable enough to debug
- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
- current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in [`P1_PPO_SMOKE_NOTE.md`](P1_PPO_SMOKE_NOTE.md)
Gate 3: fixture checks pass
- good, boundary, and bad references behave as expected
- the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work
Gate 4: manual playtest passes
- a human can read the observation
- a human can choose a plausible next action
- a human can explain the reward change
Gate 5: local episode is stable
- one clean trajectory is reproducible enough for demo use
Gate 6: baseline story is credible
- heuristic behavior is at least interpretable and preferable to random on the repaired task
Gate 7: remote surface is real
- HF Space preserves the same task contract as local
Gate 8: submission artifacts exist
- the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
Gate 9: trained-policy evidence is real
- one fixed-seed untrained baseline exists
- one short low-fidelity training pass exists on the same workflow
- the repo can show a before/after comparison on the same seeds using the live environment contract, including `submit`
## 10. Fallback Rules
If training evidence is weak:
- keep claims conservative about policy quality
- still ship a trained-policy demonstration and document its limitations plainly
- do not skip the paired higher-fidelity validation artifacts
- do not split the notebook back onto a different submit contract than the live environment
If HF Space deployment is delayed:
- keep local and Northflank evidence first
- document the deployment blocker plainly
- do not invent remote claims without a real run
If reward behavior is confusing:
- fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity
If the repaired family is too hard:
- adjust ranges, deltas, or seeds from measured evidence
- do not expand into a broad Fourier action space just to rescue the hackathon scope
If the repaired family is too easy:
- prefer fixture and seed adjustments before broadening the action schema
## 11. Immediate Next Actions
- [x] Record the measured sweep and choose provisional defaults from evidence.
- [x] Check in tracked fixtures.
- [x] Record the first manual playtest log.
- [x] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
- [x] Pair the tracked fixtures with higher-fidelity validation checks.
- [x] Record one submit-side manual trace.
- [x] Refresh the heuristic baseline from that playtest evidence.
- [ ] Save one fixed-seed untrained baseline with `training/llm_rollout.py evaluate`.
- [ ] Run one short H100 GRPO pass with `training/notebooks/fusion_design_lab_training.ipynb`.
- [ ] Re-run the same seeds and save a before/after artifact.
- [ ] Verify one clean HF Space episode with the same contract.