
Fusion Design Lab — Plan V2

Hackathon: OpenEnv Hackathon, March 7-8, 2026
Track: Statement 3.1 (World Modeling — Professional Tasks)
Role: Planning and execution SSOT for this repo
Updated: March 8, 2026

1. Submission Thesis

Fusion Design Lab is not only a "trained model for fusion" submission.

It is a clear, reproducible environment for one constrained scientific design task:

  • official P1 benchmark semantics
  • narrow, human-playable action space
  • real verifier feedback from constellaration
  • explicit constraints and failure semantics
  • reward logic that can be explained and iterated

The environment is the product. A trained policy is required as supporting evidence because it demonstrates that the environment is learnable in practice, not just manually playable.

2. Current State

Completed:

  • P1 is locked as the single benchmark task
  • the repaired 4-knob low-dimensional runtime is live in code
  • the official constellaration verifier path is wired
  • the live environment is now unified onto one low-fidelity reward and verifier surface
  • submit remains an explicit terminal action on that same live contract
  • explicit VMEC failure semantics are implemented
  • the Northflank smoke workflow is committed
  • the Northflank smoke test passed on the team H100
  • baseline comparison has been rerun on the real verifier path
  • a coarse measured sweep note now exists
  • the first tracked low-fidelity fixtures now exist
  • an initial low-fidelity manual playtest note now exists
  • paired high-fidelity fixture checks for those tracked fixtures now exist
  • one submit-side manual playtest trace exists
  • the repository GRPO notebook is checked in and aligned to the shared fusion_lab/llm_agent.py helper contract
  • model-driven fixed-seed low-fidelity monitor / evaluate tooling exists for LLM baselines

Still open:

  • decision on whether the reset-seed pool should change, based on the paired checks
  • HF Space deployment evidence
  • public Colab mirror or notebook submission link, if the submission surface still requires it
  • before/after trained-policy evidence on the current unified low-fidelity workflow
  • demo and README polish after the artifacts are real

Current caution:

  • do not present repaired-family ranges, deltas, or budget choices as settled defaults until the measured sweep is recorded
  • do not narrate low-fidelity rollout metrics as final submission truth
  • the standard notebook and training/llm_rollout.py paths should stay on the same live low-fidelity contract as the environment, including explicit submit
  • reserve higher-fidelity validation for paired fixture checks, offline validation scripts, and final evidence

3. Locked Decisions

These decisions are fixed unless a hard blocker appears:

  • benchmark task: P1
  • submission framing: Statement 3.1
  • verifier of record: constellaration.problems.GeometricalProblem
  • repo strategy: fresh wiring in this repo
  • reuse policy: do not port the old ai-sci-feasible-designs harness
  • scope rule: one stable task only

Execution rule:

  • do not reopen strategy unless a real blocker appears
  • convert decisions into code, fixtures, traces, baselines, or deployment work

4. Non-Negotiables

  • Keep scope to one stable task.
  • Keep claims conservative and evidence-backed.
  • Do not let training-first work outrun environment stability.
  • Do not rely on reward curves alone; keep trajectory evidence.
  • Do not use reward complexity to hide a blocked action family.
  • Do not polish repo or video before the environment and baselines are real.

Practical fail-fast rule:

  • allow a tiny low-fidelity PPO smoke run before full submit-side validation
  • use it only to surface obvious learnability bugs, reward exploits, or action-space problems
  • stop after a few readable trajectories or one clear failure mode
  • run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
  • do not use low-fidelity training alone as proof that the terminal submit contract is trustworthy
  • keep any checkpoint high-fidelity evaluation sparse enough that it does not replace the low-fidelity inner loop
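The fail-fast rule above can be sketched as a bounded smoke loop. Everything below is illustrative, not the repo's actual training code: `SmokeEnv`, the trajectory cap, and the `looks_broken` diagnostic are hypothetical stand-ins for the live environment and the "one clear failure mode" stopping criterion.

```python
import random

# Illustrative stand-in for the live low-fidelity environment; the real
# contract (run | submit | restore_best) lives in P1_ENV_CONTRACT_V1.md.
class SmokeEnv:
    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.t = 0
        return {"step": 0, "fidelity": "low"}

    def step(self, action):
        self.t += 1
        reward = self.rng.uniform(-1.0, 1.0)
        done = action == "submit" or self.t >= 8
        return {"step": self.t, "fidelity": "low"}, reward, done

def looks_broken(traj):
    # Hypothetical diagnostic: flag a trajectory whose rewards never vary,
    # a cheap proxy for a blocked action family or a constant-reward bug.
    rewards = {r for _, r in traj}
    return len(rewards) == 1 and len(traj) > 1

def smoke_run(env, policy, max_trajectories=3):
    """Collect a few readable trajectories, stopping at the first clear
    failure mode instead of drifting into a broader training phase."""
    trajectories = []
    for seed in range(max_trajectories):
        obs = env.reset(seed=seed)
        traj, done = [], False
        while not done:
            action = policy(obs)
            obs, reward, done = env.step(action)
            traj.append((action, round(reward, 3)))
        trajectories.append(traj)
        if looks_broken(traj):
            break
    return trajectories
```

The point of the cap and the early break is exactly the rule above: the smoke run is a diagnostic, not a training phase.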

5. Document Roles

Use the docs like this:

  • this file defines planning order, status, gates, and fallback rules
  • P1_ENV_CONTRACT_V1.md defines the live technical contract
  • P1_PARAMETERIZATION_DEEPDIVE.md keeps blocker evidence, sweep evidence, and supporting rationale
  • archived legacy planning docs live under archive/ and are not active SSOT surfaces

6. Artifact Plan

Visible artifacts:

  • HF Space environment
  • Repository training notebook
  • Public Colab mirror or submission notebook link if required
  • 1-minute demo video
  • Public repo and README

Compute surfaces:

  • Northflank is the main compute workspace for verifier-heavy work
  • HF Space is the hosted environment surface
  • the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
  • trained-policy work should iterate on the same live low-fidelity environment contract that will be demoed publicly

Evidence order:

  • measured sweep note
  • fixture checks
  • manual playtest log
  • tiny low-fi PPO smoke trace
  • shared-helper notebook alignment
  • model-driven low-fi LLM evaluation tooling
  • reward iteration note
  • stable local and remote episodes
  • random and heuristic baselines
  • before/after trained-policy evidence
  • demo and repo polish

7. Environment Summary

The environment contract must stay narrow and legible:

  • one repaired low-dimensional boundary family derived from a rotating-ellipse seed
  • discrete run | submit | restore_best interaction
  • one low-fidelity verifier surface for all live environment actions
  • readable observation surface with explicit fidelity labeling
  • Reward V2 keeps the verifier-native Reward V1 core and adds small best-so-far / anti-stagnation shaping for the low-fi repair loop

The live technical details belong in P1_ENV_CONTRACT_V1.md, not here.
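The contract above can be sketched in a few lines. This is a minimal illustration, not the live implementation: `low_fi_score`, the knob representation, and the shaping constants are assumptions, and the real verifier path is `constellaration.problems.GeometricalProblem` as defined in P1_ENV_CONTRACT_V1.md.

```python
# Minimal sketch of the narrow action contract and Reward V2 shaping.
ACTIONS = ("run", "submit", "restore_best")

def low_fi_score(knobs):
    # Stand-in for the low-fidelity verifier surface; the real path calls
    # the official constellaration verifier at low fidelity.
    return -sum((k - 0.5) ** 2 for k in knobs)

class P1EnvSketch:
    def __init__(self, seed_knobs):
        self.knobs = list(seed_knobs)
        self.best_knobs = list(seed_knobs)
        self.best_score = low_fi_score(seed_knobs)
        self.stall = 0

    def step(self, action, deltas=None):
        assert action in ACTIONS
        if action == "restore_best":
            self.knobs = list(self.best_knobs)
        elif action == "run" and deltas is not None:
            self.knobs = [k + d for k, d in zip(self.knobs, deltas)]
        score = low_fi_score(self.knobs)
        reward = score  # verifier-native Reward V1 core
        if score > self.best_score:
            reward += 0.1  # small best-so-far shaping bonus
            self.best_score, self.best_knobs = score, list(self.knobs)
            self.stall = 0
        else:
            self.stall += 1
            if self.stall >= 3:
                reward -= 0.05  # small anti-stagnation penalty
        done = action == "submit"  # submit is the explicit terminal action
        obs = {"knobs": list(self.knobs), "fidelity": "low",
               "best_score": self.best_score}
        return obs, reward, done
```

Note the shaping terms are additive on top of the verifier-native score, so the reward logic stays explainable term by term.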

8. Execution Order

  • Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
  • Pair the tracked low-fidelity fixtures with higher-fidelity validation checks immediately after the PPO smoke pass.
  • Decide whether the reset pool should change based on the measured sweep plus those paired checks.
  • Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
  • Save one fixed-seed untrained baseline with the unified live training/llm_rollout.py evaluate workflow.
  • Run one short H100 GRPO pass with the repository notebook on that same unified low-fidelity workflow.
  • Re-run the same seeds after training and save one before/after artifact.
  • Adjust reward or penalties only if playtesting exposes a concrete problem.
  • Refresh the heuristic baseline using the repaired-family evidence.
  • Prove a stable local episode path.
  • Deploy the same task contract to HF Space and prove one clean remote episode.
  • Publish or mirror the notebook artifact only after the live before/after path is real.
  • Record the demo around environment clarity, reward iteration, and baseline evidence.
  • Polish the public repo only after the artifacts above exist.
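The fixed-seed before/after steps above can be sketched as a small comparison harness. The `evaluate` helper and the policy callables below are hypothetical stand-ins for the unified `training/llm_rollout.py` evaluate workflow, and the env interface mirrors the sketch contract rather than the real one.

```python
def evaluate(env_factory, policy, seeds):
    """Return per-seed total rewards on a fixed seed pool."""
    results = {}
    for seed in seeds:
        env = env_factory(seed)
        obs, total, done = env.reset(), 0.0, False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
        results[seed] = total
    return results

def before_after(env_factory, untrained, trained, seeds=range(5)):
    """Run the untrained and trained policies on the same seed pool and
    report per-seed deltas; same seeds plus same live contract makes the
    comparison directly attributable to training."""
    before = evaluate(env_factory, untrained, seeds)
    after = evaluate(env_factory, trained, seeds)
    return {s: after[s] - before[s] for s in seeds}
```

Keeping the seed pool fixed between the two passes is what makes the before/after artifact evidence rather than anecdote.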

9. Success Gates

Gate 1: measured sweep exists

  • repaired-family ranges, deltas, and reset seeds are justified by recorded evidence

Gate 2: tiny PPO smoke is sane

  • a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
  • trajectories are readable enough to debug
  • the smoke run stops at that diagnostic threshold instead of turning into a broader training phase
  • current status: passed as a plumbing/debugging gate, with the first exposed failure mode recorded in P1_PPO_SMOKE_NOTE.md

Gate 3: fixture checks pass

  • good, boundary, and bad references behave as expected
  • the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work
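Gate 3 can be expressed as assertions over the tracked references. The fixture payloads, score function, and thresholds below are illustrative assumptions, not the repository's tracked fixtures; the real checks call the constellaration verifier at low fidelity plus the paired high-fidelity pass.

```python
# Hypothetical fixture-check sketch for Gate 3.
FIXTURES = {
    "good":     {"knobs": [0.50, 0.50, 0.50, 0.50], "expect": "pass"},
    "boundary": {"knobs": [0.50, 0.50, 0.50, 0.90], "expect": "marginal"},
    "bad":      {"knobs": [0.05, 0.95, 0.05, 0.95], "expect": "fail"},
}

def low_fi_score(knobs):
    # Stand-in low-fidelity scorer (assumption, not the real verifier).
    return -sum((k - 0.5) ** 2 for k in knobs)

def classify(score, pass_at=-0.05, fail_at=-0.5):
    if score >= pass_at:
        return "pass"
    return "fail" if score <= fail_at else "marginal"

def check_fixtures():
    for name, fx in FIXTURES.items():
        got = classify(low_fi_score(fx["knobs"]))
        assert got == fx["expect"], f"{name}: expected {fx['expect']}, got {got}"

check_fixtures()
```

The value of the gate is that each reference has a stated expectation, so a drifting reward or verifier surface fails loudly instead of silently.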

Gate 4: manual playtest passes

  • a human can read the observation
  • a human can choose a plausible next action
  • a human can explain the reward change
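Gate 4's readability requirement can be illustrated with an explicitly labeled observation and a one-line human rendering. The field and knob names below are assumptions for illustration; the live schema is defined in P1_ENV_CONTRACT_V1.md.

```python
# Illustrative observation payload with explicit fidelity labeling.
obs = {
    "knobs": {"elongation": 1.20, "triangularity": 0.15,
              "rotation": 0.40, "aspect": 4.5},
    "fidelity": "low",
    "score": -0.31,
    "best_score": -0.27,
    "steps_since_best": 2,
}

def render(obs):
    """Render the observation so a human can read it, pick a plausible
    next action, and explain the reward change (Gate 4)."""
    knobs = ", ".join(f"{k}={v:g}" for k, v in obs["knobs"].items())
    return (f"[{obs['fidelity']}-fi] {knobs} | score={obs['score']:g} "
            f"(best {obs['best_score']:g}, "
            f"{obs['steps_since_best']} steps since)")

print(render(obs))
```

If the rendered line cannot support a human's next-action choice, that is an observation-clarity bug under the fallback rules, not a reward problem.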

Gate 5: local episode is stable

  • one clean trajectory is reproducible enough for demo use

Gate 6: baseline story is credible

  • heuristic behavior is at least interpretable and preferable to random on the repaired task

Gate 7: remote surface is real

  • HF Space preserves the same task contract as local

Gate 8: submission artifacts exist

  • the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one

Gate 9: trained-policy evidence is real

  • one fixed-seed untrained baseline exists
  • one short low-fidelity training pass exists on the same workflow
  • the repo can show a before/after comparison on the same seeds using the live environment contract, including submit

10. Fallback Rules

If training evidence is weak:

  • keep claims conservative about policy quality
  • still ship a trained-policy demonstration and document its limitations plainly
  • do not skip the paired higher-fidelity validation artifacts
  • do not split the notebook back onto a different submit contract than the live environment

If HF Space deployment is delayed:

  • keep local and Northflank evidence first
  • document the deployment blocker plainly
  • do not invent remote claims without a real run

If reward behavior is confusing:

  • fix observation clarity, step magnitudes, seed choice, or terminal semantics before adding reward complexity

If the repaired family is too hard:

  • adjust ranges, deltas, or seeds from measured evidence
  • do not expand into a broad Fourier action space just to rescue the hackathon scope

If the repaired family is too easy:

  • prefer fixture and seed adjustments before broadening the action schema

11. Immediate Next Actions

  • Record the measured sweep and choose provisional defaults from evidence.
  • Check in tracked fixtures.
  • Record the first manual playtest log.
  • Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
  • Pair the tracked fixtures with higher-fidelity validation checks.
  • Record one submit-side manual trace.
  • Refresh the heuristic baseline from that playtest evidence.
  • Save one fixed-seed untrained baseline with training/llm_rollout.py evaluate.
  • Run one short H100 GRPO pass with training/notebooks/fusion_design_lab_training.ipynb.
  • Re-run the same seeds and save a before/after artifact.
  • Verify one clean HF Space episode with the same contract.