CreativeEngineer committed on
Commit
1c1f314
·
1 Parent(s): 27d58b3

docs: adopt hybrid fail-fast validation order

Files changed (3)
  1. README.md +5 -4
  2. TODO.md +16 -8
  3. docs/FUSION_DESIGN_LAB_PLAN_V2.md +24 -6
README.md CHANGED
@@ -30,7 +30,7 @@ Implementation status:
30
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
31
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
32
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
33
- - the next runtime work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, and deployment evidence
34
 
35
  ## Execution Status
36
 
@@ -67,12 +67,12 @@ Implementation status:
67
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
68
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
69
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
70
- - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next meaningful playtest step is a real `submit` trace, not more abstract reward debate.
71
 
72
  Current mode:
73
 
74
  - strategic task choice is already locked
75
- - the next work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, smoke validation, and deployment
76
  - new planning text should only appear when a real blocker forces a decision change
77
 
78
  ## Planned Repository Layout
@@ -127,8 +127,9 @@ uv sync --extra notebooks
127
  ## Immediate Next Steps
128
 
129
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
 
130
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
131
- - [ ] Run submit-side manual playtest episodes and record the first real reward pathology, if any.
132
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
133
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
134
  - [ ] Deploy the environment to HF Space.
 
30
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
31
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
32
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
33
+ - the next runtime work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, and deployment evidence
34
 
35
  ## Execution Status
36
 
 
67
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
68
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
69
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
70
+ - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run alongside high-fidelity fixture pairing, then a real `submit` trace.
71
 
72
  Current mode:
73
 
74
  - strategic task choice is already locked
75
+ - the next work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, smoke validation, and deployment
76
  - new planning text should only appear when a real blocker forces a decision change
77
 
78
  ## Planned Repository Layout
 
127
  ## Immediate Next Steps
128
 
129
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
130
+ - [ ] Run a tiny low-fidelity PPO smoke run and save a few trajectories.
131
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
132
+ - [ ] Run at least one submit-side manual trace and record the first real reward pathology, if any.
133
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
134
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
135
  - [ ] Deploy the environment to HF Space.
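The README note above says budget exhaustion returns a smaller terminal reward than explicit `submit`. A minimal sketch of that asymmetry, assuming hypothetical names (`SUBMIT_SCALE`, `EXHAUSTION_SCALE`, `terminal_reward`) rather than the environment's actual API:

```python
# Illustrative sketch of the terminal-reward asymmetry described above:
# an explicit `submit` ends the episode with full terminal credit, while
# budget exhaustion ends it with strictly smaller credit. The names and
# scale values here are assumptions, not the real environment contract.

SUBMIT_SCALE = 1.0       # full credit for a deliberate submit
EXHAUSTION_SCALE = 0.5   # strictly smaller credit for running out of budget

def terminal_reward(best_score: float, submitted: bool) -> float:
    """Terminal reward that preserves the submit > exhaustion asymmetry."""
    scale = SUBMIT_SCALE if submitted else EXHAUSTION_SCALE
    return scale * best_score

# Same best score, smaller reward when the budget runs out without a submit,
# so agents still prefer deliberate submission.
assert terminal_reward(0.8, submitted=True) > terminal_reward(0.8, submitted=False)
```

Keeping the asymmetry as a single scale factor makes it easy to tune without breaking the ordering the README asks to preserve.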
TODO.md CHANGED
@@ -43,6 +43,7 @@ Priority source:
43
  - [x] manual playtest log
44
  - [x] settle the non-submit terminal reward policy
45
  - [x] baseline comparison has been re-run on the `constellaration` branch state
 
46
  - [ ] refresh the heuristic baseline for the real verifier path
47
 
48
  ## Execution Graph
@@ -54,12 +55,13 @@ flowchart TD
54
  C["constellaration Physics Wiring"] --> D
55
  D --> P["Parameterization Repair"]
56
  P --> E["Fixture Checks"]
57
- E --> F["Manual Playtest"]
58
- F --> G["Reward V1"]
59
- G --> H["Baselines"]
60
- H --> I["HF Space Deploy"]
61
- I --> J["Colab Notebook"]
62
- J --> K["Demo + README"]
 
63
  ```
64
 
65
  ## Hour 0-2
@@ -189,9 +191,15 @@ flowchart TD
189
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
190
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
191

192
  - [ ] Manual-playtest 5-10 episodes
193
  Goal:
194
- expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
195
  Related:
196
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
197
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
@@ -272,7 +280,7 @@ flowchart TD
272
  - [ ] Do not guess repaired-family ranges, deltas, or budget changes without measurement
273
  - [ ] Do not port the old `ai-sci-feasible-designs` harness
274
  - [ ] Do not let notebook or demo work outrun environment evidence
275
- - [ ] Do not add training-first complexity before manual playtesting
276
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
277
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
278
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
 
43
  - [x] manual playtest log
44
  - [x] settle the non-submit terminal reward policy
45
  - [x] baseline comparison has been re-run on the `constellaration` branch state
46
+ - [ ] tiny low-fi PPO smoke run exists
47
  - [ ] refresh the heuristic baseline for the real verifier path
48
 
49
  ## Execution Graph
 
55
  C["constellaration Physics Wiring"] --> D
56
  D --> P["Parameterization Repair"]
57
  P --> E["Fixture Checks"]
58
+ E --> F["Tiny PPO Smoke"]
59
+ F --> G["Submit-side Manual Playtest"]
60
+ G --> H["Reward V1"]
61
+ H --> I["Baselines"]
62
+ I --> J["HF Space Deploy"]
63
+ J --> K["Colab Notebook"]
64
+ K --> L["Demo + README"]
65
  ```
66
 
67
  ## Hour 0-2
 
191
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
192
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
193
 
194
+ - [ ] Run a tiny low-fi PPO smoke pass
195
+ Goal:
196
+ fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
197
+ Note:
198
+ treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
199
+
200
  - [ ] Manual-playtest 5-10 episodes
201
  Goal:
202
+ start with one submit-side trace, then expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
203
  Related:
204
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
205
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
 
280
  - [ ] Do not guess repaired-family ranges, deltas, or budget changes without measurement
281
  - [ ] Do not port the old `ai-sci-feasible-designs` harness
282
  - [ ] Do not let notebook or demo work outrun environment evidence
283
+ - [ ] Do not let tiny low-fi smoke training replace paired high-fidelity checks or submit-side manual playtesting
284
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
285
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
286
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
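The "tiny low-fi PPO smoke run" item above is about failing fast on learnability bugs and reward exploits before longer training. A hedged sketch of the shape of that check, with a toy stand-in environment (`ToyLowFiEnv` and its 4-knob objective are assumptions, not the real constellaration-backed task), rolling out a few cheap episodes and asserting basic invariants on the saved trajectories:

```python
# Sketch of a low-fi smoke rollout: collect a few trajectories and fail
# fast on obvious pathologies (non-terminating episodes, non-finite
# rewards) before investing in real PPO training. ToyLowFiEnv is a
# hypothetical stand-in for the actual environment.
import math
import random

class ToyLowFiEnv:
    """Stand-in low-fidelity env: 4 knobs, a fixed step budget."""

    def __init__(self, budget: int = 16, seed: int = 0):
        self.budget = budget
        self.rng = random.Random(seed)

    def reset(self):
        self.steps = 0
        self.knobs = [self.rng.uniform(-1, 1) for _ in range(4)]
        return list(self.knobs)

    def step(self, action):
        # Perturb the 4 knobs, clipped to [-1, 1]; toy objective drives them to 0.
        self.knobs = [max(-1.0, min(1.0, k + a)) for k, a in zip(self.knobs, action)]
        self.steps += 1
        reward = -sum(k * k for k in self.knobs)
        done = self.steps >= self.budget
        return list(self.knobs), reward, done

def smoke_rollout(env, episodes: int = 3):
    """Collect a few trajectories with a random policy; assert basic sanity."""
    trajectories = []
    for _ in range(episodes):
        obs, done, traj = env.reset(), False, []
        while not done:
            action = [env.rng.uniform(-0.1, 0.1) for _ in range(4)]
            obs, reward, done = env.step(action)
            assert math.isfinite(reward), "non-finite reward: likely env bug"
            traj.append((obs, action, reward))
        trajectories.append(traj)
    return trajectories

trajs = smoke_rollout(ToyLowFiEnv())
assert len(trajs) == 3 and all(len(t) == 16 for t in trajs)
```

As the guardrail above says, passing a check like this only rules out obvious bugs; it does not validate the terminal `submit` contract.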
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -38,6 +38,7 @@ Completed:
38
 
39
  Still open:
40
 
 
41
  - paired high-fidelity checks for the tracked fixtures
42
  - submit-side manual playtest evidence
43
  - heuristic baseline refresh on the repaired real-verifier path
@@ -75,6 +76,12 @@ Execution rule:
75
  - Do not use reward complexity to hide a blocked action family.
76
  - Do not polish repo or video before the environment and baselines are real.
77

78
  ## 5. Document Roles
79
 
80
  Use the docs like this:
@@ -104,6 +111,7 @@ Evidence order:
104
  - [x] measured sweep note
105
  - [ ] fixture checks
106
  - [x] manual playtest log
 
107
  - [ ] reward iteration note
108
  - [ ] stable local and remote episodes
109
  - [x] random and heuristic baselines
@@ -125,9 +133,10 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
125
 
126
  ## 8. Execution Order
127
 
 
128
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
129
  - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
130
- - [ ] Manual-playtest 5 to 10 episodes, including real submit traces, and record the first real confusion point, exploit, or reward pathology.
131
  - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
132
  - [ ] Refresh the heuristic baseline using the repaired-family evidence.
133
  - [ ] Prove a stable local episode path.
@@ -146,25 +155,30 @@ Gate 2: fixture checks pass
146
 
147
  - good, boundary, and bad references behave as expected
148
 
149
- Gate 3: manual playtest passes
150
 
151
  - a human can read the observation
152
  - a human can choose a plausible next action
153
  - a human can explain the reward change
154
 
155
- Gate 4: local episode is stable
156
 
157
  - one clean trajectory is reproducible enough for demo use
158
 
159
- Gate 5: baseline story is credible
160
 
161
  - heuristic behavior is at least interpretable and preferable to random on the repaired task
162
 
163
- Gate 6: remote surface is real
164
 
165
  - HF Space preserves the same task contract as local
166
 
167
- Gate 7: submission artifacts exist
168
 
169
  - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
170
 
@@ -174,6 +188,7 @@ If training evidence is weak:
174
 
175
  - keep claims conservative about policy quality
176
  - still ship a trained-policy demonstration and document its limitations plainly
 
177
 
178
  If HF Space deployment is delayed:
179
 
@@ -199,5 +214,8 @@ If the repaired family is too easy:
199
  - [x] Record the measured sweep and choose provisional defaults from evidence.
200
  - [x] Check in tracked fixtures.
201
  - [x] Record the first manual playtest log.
202
  - [ ] Refresh the heuristic baseline from that playtest evidence.
203
  - [ ] Verify one clean HF Space episode with the same contract.
 
38
 
39
  Still open:
40
 
41
+ - tiny low-fidelity PPO smoke evidence
42
  - paired high-fidelity checks for the tracked fixtures
43
  - submit-side manual playtest evidence
44
  - heuristic baseline refresh on the repaired real-verifier path
 
76
  - Do not use reward complexity to hide a blocked action family.
77
  - Do not polish repo or video before the environment and baselines are real.
78
 
79
+ Practical fail-fast rule:
80
+
81
+ - allow a tiny low-fidelity PPO smoke run before full submit-side validation
82
+ - use it only to surface obvious learnability bugs, reward exploits, or action-space problems
83
+ - do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
84
+
85
  ## 5. Document Roles
86
 
87
  Use the docs like this:
 
111
  - [x] measured sweep note
112
  - [ ] fixture checks
113
  - [x] manual playtest log
114
+ - [ ] tiny low-fi PPO smoke trace
115
  - [ ] reward iteration note
116
  - [ ] stable local and remote episodes
117
  - [x] random and heuristic baselines
 
133
 
134
  ## 8. Execution Order
135
 
136
+ - [ ] Run a tiny low-fidelity PPO smoke pass and inspect a few trajectories for obvious learnability failures or reward exploits.
137
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
138
  - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
139
+ - [ ] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
140
  - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
141
  - [ ] Refresh the heuristic baseline using the repaired-family evidence.
142
  - [ ] Prove a stable local episode path.
 
155
 
156
  - good, boundary, and bad references behave as expected
157
 
158
+ Gate 3: tiny PPO smoke is sane
159
+
160
+ - a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
161
+ - trajectories are readable enough to debug
162
+
163
+ Gate 4: manual playtest passes
164
 
165
  - a human can read the observation
166
  - a human can choose a plausible next action
167
  - a human can explain the reward change
168
 
169
+ Gate 5: local episode is stable
170
 
171
  - one clean trajectory is reproducible enough for demo use
172
 
173
+ Gate 6: baseline story is credible
174
 
175
  - heuristic behavior is at least interpretable and preferable to random on the repaired task
176
 
177
+ Gate 7: remote surface is real
178
 
179
  - HF Space preserves the same task contract as local
180
 
181
+ Gate 8: submission artifacts exist
182
 
183
  - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
184
 
 
188
 
189
  - keep claims conservative about policy quality
190
  - still ship a trained-policy demonstration and document its limitations plainly
191
+ - do not skip the paired high-fidelity checks or submit-side manual trace
192
 
193
  If HF Space deployment is delayed:
194
 
 
214
  - [x] Record the measured sweep and choose provisional defaults from evidence.
215
  - [x] Check in tracked fixtures.
216
  - [x] Record the first manual playtest log.
217
+ - [ ] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
218
+ - [ ] Pair the tracked fixtures with high-fidelity submit checks.
219
+ - [ ] Record one submit-side manual trace.
220
  - [ ] Refresh the heuristic baseline from that playtest evidence.
221
  - [ ] Verify one clean HF Space episode with the same contract.
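The plan's execution order asks to pair each tracked low-fidelity fixture with a high-fidelity submit spot check. A minimal sketch of the shape of that pairing, assuming hypothetical evaluator functions and fixture data (none of the names or scores below are measured values): any fixture whose two scores disagree beyond a tolerance gets flagged for review instead of silently trusted.

```python
# Sketch of paired fixture checking: one cheap low-fi score and one
# expensive high-fi submit score per tracked fixture, with a drift flag.
# Both evaluators and the fixture values are illustrative stand-ins.

def low_fi_score(fixture):   # stand-in for the cheap `run`-path evaluation
    return fixture["low_fi"]

def high_fi_score(fixture):  # stand-in for the expensive `submit`-path evaluation
    return fixture["high_fi"]

def paired_fixture_check(fixtures, tolerance=0.2):
    """Return (name, gap) for fixtures whose scores disagree beyond tolerance."""
    flagged = []
    for fx in fixtures:
        gap = abs(low_fi_score(fx) - high_fi_score(fx))
        if gap > tolerance:
            flagged.append((fx["name"], gap))
    return flagged

# Illustrative fixture data mirroring the good/boundary/bad references above.
fixtures = [
    {"name": "good",     "low_fi": 0.70, "high_fi": 0.65},
    {"name": "boundary", "low_fi": 0.40, "high_fi": 0.05},  # drifts: gets flagged
    {"name": "bad",      "low_fi": 0.05, "high_fi": 0.02},
]
```

A flagged fixture is exactly the kind of evidence the plan wants before trusting low-fidelity `run` metrics as a proxy for high-fidelity `submit` results.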