
Fusion Design Lab TODO

This is the execution tracker for the hackathon repo.

Use this file for day-of build progress. Use the linked docs for rationale, contract truth, and submission framing:

Archived legacy references:

Priority source:

Current State

  • P1 strategy is locked
  • shared models reflect the repaired low-dimensional P1 contract
  • environment loop reflects the repaired low-dimensional P1 contract
  • API/task surface reflects P1
  • baselines reflect the P1 contract
  • repo docs call out the low-fi/high-fi constellaration split honestly
  • post-terminal guard in step()
  • constellaration verifier wiring
  • verify the current 3-knob family against the real low-fidelity verifier
  • repair the low-dimensional parameterization so triangularity is controllable
  • split boundary building from boundary evaluation
  • update the action schema from 3 knobs to the repaired low-dimensional family
  • add explicit VMEC failure semantics
  • label low-fi vs high-fi truth in the observation/task surface
  • separate high-fi submit scoring/reporting from low-fi rollout score state
  • tracked P1 fixtures
  • manual playtest log
  • settle the non-submit terminal reward policy
  • baseline comparison has been re-run on the constellaration branch state
  • tiny low-fi PPO smoke run exists. Note: training/ppo_smoke.py now runs a diagnostic-only low-fidelity PPO smoke pass; the first artifact is summarized in docs/P1_PPO_SMOKE_NOTE.md
  • refresh the heuristic baseline for the real verifier path. Note: the refreshed heuristic now uses the measured rotational_transform -> triangularity_scale -> elongation -> submit path; a fresh uv run python baselines/compare.py 5 rerun finished at 5/5 feasible high-fidelity finals and 5/5 wins over random
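One recurring contract item above is the post-terminal guard in step(). A minimal sketch of what that guard means in a gym-style loop (the class name, budget handling, reward values, and observation shape here are illustrative assumptions, not the actual server/environment.py API):

```python
class FusionEnv:
    """Illustrative gym-style environment with a post-terminal guard.

    FusionEnv, the budget field, and the reward values are hypothetical;
    the real environment lives in server/environment.py.
    """

    def __init__(self, budget: int = 10):
        self.budget = budget
        self.reset()

    def reset(self):
        self._terminated = False
        self._steps = 0
        return {"steps_left": self.budget}

    def step(self, action):
        # Post-terminal guard: refuse further transitions after the episode
        # has ended instead of silently returning stale state.
        if self._terminated:
            raise RuntimeError("step() called after termination; call reset() first")
        self._steps += 1
        terminated = action == "submit" or self._steps >= self.budget
        self._terminated = terminated
        reward = 1.0 if action == "submit" else 0.0
        obs = {"steps_left": self.budget - self._steps}
        return obs, reward, terminated, {}
```

The guard turns a silent correctness bug (agents stepping a dead episode) into an immediate, debuggable failure.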

Execution Graph

```mermaid
flowchart TD
    A["Northflank Smoke Test"] --> E["Fixture Checks"]
    B["P1 Contract Lock"] --> D["P1 Models + Environment"]
    C["constellaration Physics Wiring"] --> D
    D --> P["Parameterization Repair"]
    P --> F["Tiny PPO Smoke"]
    F --> E
    E --> G["Submit-side Manual Playtest"]
    G --> H["Reward V2"]
    H --> I["Baselines"]
    I --> J["HF Space Deploy"]
    J --> K["Colab Notebook"]
    K --> L["Demo + README"]
```

Hour 0-2

Fresh Wiring

Validation and Reward

  • Run a small measured sweep on the repaired low-dimensional family. Goal: choose useful parameter ranges, step deltas, and reset seeds from the repaired action family instead of guessing them from prose. Related: P1 Environment Contract

  • Clarify or split fidelity-dependent best-state observation fields. Goal: replace ambiguous mixed best-state reporting with explicit low-fidelity and high-fidelity best-state fields before fixture evidence or baseline comparisons. Related: P1 Environment Contract

  • Add 1-2 tracked P1 fixtures. Files: server/data/p1/README.md, P1 Pivot Record. Note: paired high-fidelity submit checks are now written into each tracked fixture and summarized in baselines/fixture_high_fidelity_pairs.json

  • Run fixture sanity checks. Goal: confirm paired low-fi/high-fi verifier outputs, objective direction, and reward ordering. Related: Plan V2, Next 12 Hours Checklist

  • Run a tiny low-fi PPO smoke pass. Goal: fail quickly on learnability, reward exploits, and action-space problems before investing in longer training. Note: treat this as a smoke test, not as proof that the terminal submit contract is already validated; stop after a few readable trajectories or one clear failure mode, and run paired high-fidelity fixture checks immediately after this smoke pass. Status: first smoke artifact exists; rerun this step only if a follow-up reward or observation change needs re-checking. High-fidelity VMEC-backed submit should stay out of the normal RL inner loop

  • Manual-playtest 5-10 episodes. Goal: start with one submit-side trace, then expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity. Related: Plan V2, Deliverables Map

  • Update reward from V0 to V1 after playtesting exposed a real repair-path pathology. Goal: keep a short exploit -> fix -> behavior improvement story. Related: AGENTS.md, Plan V2

  • Update reward from V1 to V2 after the verifier-native shaping exposed short-horizon gaps. Goal: add bounded new-best, near-feasible, and anti-stagnation terms without breaking the verifier-native reward story. Related: AGENTS.md, P1 Environment Contract

  • Write down why Reward V0 did not survive unchanged. Goal: document the concrete pathology, namely that pure Δ official_feasibility hid useful non-dominant repairs because official feasibility is a max over normalized constraint violations. Related: README.md, Plan V2

  • Decide the non-submit terminal reward policy. Goal: budget exhaustion now yields a smaller end-of-episode reward than submit, so non-submitting agents still get terminal feedback without outranking explicit submit behavior. Files: server/environment.py, README.md
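The reward evolution above (V0 → V1 → V2) can be sketched as bounded shaping terms layered on the verifier-native delta. Everything here is an assumption for illustration: the function name, the weights, the metrics dict, and the stagnation threshold are hypothetical, not the repo's actual reward code.

```python
def reward_v2(prev, curr, steps_since_best, *,
              new_best_bonus=0.2, near_feasible_bonus=0.1,
              stagnation_penalty=0.05, feasible_eps=0.05):
    """Illustrative Reward V2 sketch.

    prev/curr are hypothetical dicts carrying 'feasibility' (the official
    max over normalized constraint violations, lower is better) and the
    episode's 'best_feasibility' so far.
    """
    # Base term: improvement in official feasibility. Pure delta alone
    # (Reward V0) hid useful non-dominant repairs, since the official
    # number is a max over normalized constraint violations.
    r = prev["feasibility"] - curr["feasibility"]
    # Bounded new-best term: fires only when the episode best improves.
    if curr["feasibility"] < prev["best_feasibility"]:
        r += new_best_bonus
    # Bounded near-feasible term: reward crossing into the feasible band.
    if curr["feasibility"] < feasible_eps <= prev["feasibility"]:
        r += near_feasible_bonus
    # Anti-stagnation term: small penalty when no new best for a while.
    if steps_since_best > 5:
        r -= stagnation_penalty
    return r
```

Each shaping term is bounded by a fixed constant, so the verifier-native delta remains the dominant signal and the reward story survives the V1 → V2 transition intact.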

Baselines

  • Implement the random baseline. Files: baselines/random_agent.py, baselines/compare.py

  • Implement the heuristic baseline. Files: baselines/heuristic_agent.py, baselines/compare.py

  • Run the baseline comparison on the current constellaration branch state. Files: baselines/compare.py

  • Refresh the heuristic baseline after the constellaration rerun. Goal: the old synthetic-path heuristic no longer gives a useful anchor on the real verifier path; redesign it after manual playtesting

  • Save one comparison trace that is presentation-ready. Goal: show at least one stable trajectory and one heuristic-vs-random comparison
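A minimal harness in the spirit of the heuristic-vs-random comparison above might look like this. The policy interface, the env_factory stub, and the win criterion are assumptions for illustration; the real baselines/compare.py drives the actual environment and verifier.

```python
import random


def run_episode(env, policy, budget=10):
    """Roll one episode and return the best (lowest) feasibility seen."""
    state = env.reset()
    best = state["feasibility"]
    for _ in range(budget):
        state = env.step(policy(state))
        best = min(best, state["feasibility"])
    return best


def compare(env_factory, heuristic, n_episodes=5, seed=0):
    """Count heuristic wins over a random policy across paired episodes."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_episodes):
        random_policy = lambda s: rng.uniform(-1.0, 1.0)
        h_best = run_episode(env_factory(), heuristic)
        r_best = run_episode(env_factory(), random_policy)
        if h_best < r_best:  # lower feasibility is better
            wins += 1
    return wins, n_episodes
```

Creating a fresh environment per episode via env_factory keeps the paired rollouts independent, and seeding the random policy makes the comparison trace reproducible for the presentation artifact.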

Submission Surfaces

Guardrails

  • Do not reopen P1 + rotating-ellipse strategy without a real blocker
  • Do not pretend the current 3-knob family is sufficient for P1 after the verified triangularity blocker
  • Do not guess repaired-family ranges, deltas, or budget changes without measurement
  • Do not port the old ai-sci-feasible-designs harness
  • Do not let notebook or demo work outrun environment evidence
  • Do not let tiny low-fi smoke training replace paired high-fidelity checks or submit-side manual playtesting
  • Do not move high-fidelity VMEC-backed submit into the normal RL inner loop
  • Do not describe low-fidelity run metrics as equivalent to high-fidelity submit results
  • Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
  • Do not describe the current baseline reset state as feasible or near-feasible
  • Do not force a new reward-version story until the previous reward version shows a real pathology. Note: completed by recording the concrete Reward V0 pathology before Reward V1, then recording the concrete short-horizon Reward V1 gaps before Reward V2