Commit · 1c1f314
Parent(s): 27d58b3
docs: adopt hybrid fail-fast validation order
- README.md +5 -4
- TODO.md +16 -8
- docs/FUSION_DESIGN_LAB_PLAN_V2.md +24 -6
README.md
CHANGED
|
@@ -30,7 +30,7 @@ Implementation status:
|
|
| 30 |
- the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
|
| 31 |
- the repaired 4-knob low-dimensional family is now wired into the runtime path
|
| 32 |
- the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
|
| 33 |
-
- the next runtime work is paired high-fidelity fixture checks, submit-side manual
|
| 34 |
|
| 35 |
## Execution Status
|
| 36 |
|
|
@@ -67,12 +67,12 @@ Implementation status:
|
|
| 67 |
- Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
|
| 68 |
- Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
|
| 69 |
- The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
|
| 70 |
-
- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next
|
| 71 |
|
| 72 |
Current mode:
|
| 73 |
|
| 74 |
- strategic task choice is already locked
|
| 75 |
-
- the next work is paired high-fidelity fixture checks, submit-side manual
|
| 76 |
- new planning text should only appear when a real blocker forces a decision change
|
| 77 |
|
| 78 |
## Planned Repository Layout
|
|
@@ -127,8 +127,9 @@ uv sync --extra notebooks
|
|
| 127 |
## Immediate Next Steps
|
| 128 |
|
| 129 |
- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
|
|
|
|
| 130 |
- [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
|
| 131 |
-
- [ ] Run submit-side manual
|
| 132 |
- [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
|
| 133 |
- [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
|
| 134 |
- [ ] Deploy the environment to HF Space.
|
|
|
|
| 30 |
- the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
|
| 31 |
- the repaired 4-knob low-dimensional family is now wired into the runtime path
|
| 32 |
- the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
|
| 33 |
+
- the next runtime work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, and deployment evidence
|
| 34 |
|
| 35 |
## Execution Status
|
| 36 |
|
|
|
|
| 67 |
- Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
|
| 68 |
- Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
|
| 69 |
- The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
|
| 70 |
+
- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run alongside high-fidelity fixture pairing, then a real `submit` trace.
|
| 71 |
|
| 72 |
Current mode:
|
| 73 |
|
| 74 |
- strategic task choice is already locked
|
| 75 |
+
- the next work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, smoke validation, and deployment
|
| 76 |
- new planning text should only appear when a real blocker forces a decision change
|
| 77 |
|
| 78 |
## Planned Repository Layout
|
|
|
|
| 127 |
## Immediate Next Steps
|
| 128 |
|
| 129 |
- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
|
| 130 |
+
- [ ] Run a tiny low-fidelity PPO smoke run and save a few trajectories.
|
| 131 |
- [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
|
| 132 |
+
- [ ] Run at least one submit-side manual trace and record the first real reward pathology, if any.
|
| 133 |
- [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
|
| 134 |
- [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
|
| 135 |
- [ ] Deploy the environment to HF Space.
|
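The README hunks above state that budget exhaustion returns a smaller terminal reward than an explicit `submit`. A minimal sketch of that asymmetry follows. The `terminal_reward` helper, the `Termination` enum, the `[0, 1]` score range, and the `0.5` scale factor are all assumptions made for illustration; none of them come from the repository.

```python
from enum import Enum


class Termination(Enum):
    SUBMIT = "submit"                # agent deliberately submitted
    BUDGET_EXHAUSTED = "exhausted"   # episode ran out of budget


# Hypothetical scale factor; the real value is not given in the source.
EXHAUSTION_SCALE = 0.5


def terminal_reward(score: float, how: Termination) -> float:
    """Return the terminal reward for a final score assumed to lie in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if how is Termination.SUBMIT:
        return score
    # Budget exhaustion pays out less than a deliberate submit.
    return EXHAUSTION_SCALE * score
```

With a scale factor like this, the two outcomes only differ when the score is positive, so a reward-tuning pass would still need to check the all-zero case separately.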
TODO.md
CHANGED
|
@@ -43,6 +43,7 @@ Priority source:
|
|
| 43 |
- [x] manual playtest log
|
| 44 |
- [x] settle the non-submit terminal reward policy
|
| 45 |
- [x] baseline comparison has been re-run on the `constellaration` branch state
|
|
|
|
| 46 |
- [ ] refresh the heuristic baseline for the real verifier path
|
| 47 |
|
| 48 |
## Execution Graph
|
|
@@ -54,12 +55,13 @@ flowchart TD
|
|
| 54 |
C["constellaration Physics Wiring"] --> D
|
| 55 |
D --> P["Parameterization Repair"]
|
| 56 |
P --> E["Fixture Checks"]
|
| 57 |
-
E --> F["
|
| 58 |
-
F --> G["
|
| 59 |
-
G --> H["
|
| 60 |
-
H --> I["
|
| 61 |
-
I --> J["
|
| 62 |
-
J --> K["
|
|
|
|
| 63 |
```
|
| 64 |
|
| 65 |
## Hour 0-2
|
|
@@ -189,9 +191,15 @@ flowchart TD
|
|
| 189 |
[Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
|
| 190 |
[Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
|
| 191 |
|
| 192 |
- [ ] Manual-playtest 5-10 episodes
|
| 193 |
Goal:
|
| 194 |
-
expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
|
| 195 |
Related:
|
| 196 |
[Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
|
| 197 |
[Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
|
|
@@ -272,7 +280,7 @@ flowchart TD
|
|
| 272 |
- [ ] Do not guess repaired-family ranges, deltas, or budget changes without measurement
|
| 273 |
- [ ] Do not port the old `ai-sci-feasible-designs` harness
|
| 274 |
- [ ] Do not let notebook or demo work outrun environment evidence
|
| 275 |
-
- [ ] Do not
|
| 276 |
- [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
|
| 277 |
- [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
|
| 278 |
- [ ] Do not describe the current baseline reset state as feasible or near-feasible
|
|
|
|
| 43 |
- [x] manual playtest log
|
| 44 |
- [x] settle the non-submit terminal reward policy
|
| 45 |
- [x] baseline comparison has been re-run on the `constellaration` branch state
|
| 46 |
+
- [ ] tiny low-fi PPO smoke run exists
|
| 47 |
- [ ] refresh the heuristic baseline for the real verifier path
|
| 48 |
|
| 49 |
## Execution Graph
|
|
|
|
| 55 |
C["constellaration Physics Wiring"] --> D
|
| 56 |
D --> P["Parameterization Repair"]
|
| 57 |
P --> E["Fixture Checks"]
|
| 58 |
+
E --> F["Tiny PPO Smoke"]
|
| 59 |
+
F --> G["Submit-side Manual Playtest"]
|
| 60 |
+
G --> H["Reward V1"]
|
| 61 |
+
H --> I["Baselines"]
|
| 62 |
+
I --> J["HF Space Deploy"]
|
| 63 |
+
J --> K["Colab Notebook"]
|
| 64 |
+
K --> L["Demo + README"]
|
| 65 |
```
|
| 66 |
|
| 67 |
## Hour 0-2
|
|
|
|
| 191 |
[Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
|
| 192 |
[Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
|
| 193 |
|
| 194 |
+
- [ ] Run a tiny low-fi PPO smoke pass
|
| 195 |
+
Goal:
|
| 196 |
+
fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
|
| 197 |
+
Note:
|
| 198 |
+
treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
|
| 199 |
+
|
| 200 |
- [ ] Manual-playtest 5-10 episodes
|
| 201 |
Goal:
|
| 202 |
+
start with one submit-side trace, then expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
|
| 203 |
Related:
|
| 204 |
[Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
|
| 205 |
[Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
|
|
|
|
| 280 |
- [ ] Do not guess repaired-family ranges, deltas, or budget changes without measurement
|
| 281 |
- [ ] Do not port the old `ai-sci-feasible-designs` harness
|
| 282 |
- [ ] Do not let notebook or demo work outrun environment evidence
|
| 283 |
+
- [ ] Do not let tiny low-fi smoke training replace paired high-fidelity checks or submit-side manual playtesting
|
| 284 |
- [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
|
| 285 |
- [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
|
| 286 |
- [ ] Do not describe the current baseline reset state as feasible or near-feasible
|
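The TODO hunks above give the smoke pass one goal: fail quickly on learnability, reward exploits, and action-space problems. One way to inspect a few saved trajectories for those three symptoms is sketched below. The trajectory format (lists of `(action, reward)` pairs), the `smoke_flags` name, and the `max_step_reward` threshold are hypothetical; the repository's actual trace format is not shown in this commit.

```python
from collections import Counter
from statistics import mean

Step = tuple[str, float]  # (action name, reward); hypothetical trajectory format


def smoke_flags(episodes: list[list[Step]], actions: set[str],
                max_step_reward: float = 10.0) -> list[str]:
    """Return human-readable warnings for a handful of saved trajectories."""
    flags = []
    returns = [sum(r for _, r in ep) for ep in episodes]
    half = len(returns) // 2
    # Learnability: later episodes should beat earlier ones on average.
    if half and mean(returns[half:]) <= mean(returns[:half]):
        flags.append("no improvement in mean return")
    # Action space: every action should be tried at least once.
    used = Counter(a for ep in episodes for a, _ in ep)
    unused = sorted(actions - set(used))
    if unused:
        flags.append(f"unused actions: {unused}")
    # Reward exploit: an implausibly large single-step reward is suspicious.
    if any(r > max_step_reward for ep in episodes for _, r in ep):
        flags.append("suspiciously large step reward")
    return flags
```

An empty result only means these crude checks found nothing. As the note in the diff says, it is not evidence that the terminal `submit` contract is validated.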
docs/FUSION_DESIGN_LAB_PLAN_V2.md
CHANGED
|
@@ -38,6 +38,7 @@ Completed:
|
|
| 38 |
|
| 39 |
Still open:
|
| 40 |
|
|
|
|
| 41 |
- paired high-fidelity checks for the tracked fixtures
|
| 42 |
- submit-side manual playtest evidence
|
| 43 |
- heuristic baseline refresh on the repaired real-verifier path
|
|
@@ -75,6 +76,12 @@ Execution rule:
|
|
| 75 |
- Do not use reward complexity to hide a blocked action family.
|
| 76 |
- Do not polish repo or video before the environment and baselines are real.
|
| 77 |
|
| 78 |
## 5. Document Roles
|
| 79 |
|
| 80 |
Use the docs like this:
|
|
@@ -104,6 +111,7 @@ Evidence order:
|
|
| 104 |
- [x] measured sweep note
|
| 105 |
- [ ] fixture checks
|
| 106 |
- [x] manual playtest log
|
|
|
|
| 107 |
- [ ] reward iteration note
|
| 108 |
- [ ] stable local and remote episodes
|
| 109 |
- [x] random and heuristic baselines
|
|
@@ -125,9 +133,10 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
|
|
| 125 |
|
| 126 |
## 8. Execution Order
|
| 127 |
|
|
|
|
| 128 |
- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
|
| 129 |
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
|
| 130 |
-
- [ ]
|
| 131 |
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
|
| 132 |
- [ ] Refresh the heuristic baseline using the repaired-family evidence.
|
| 133 |
- [ ] Prove a stable local episode path.
|
|
@@ -146,25 +155,30 @@ Gate 2: fixture checks pass
|
|
| 146 |
|
| 147 |
- good, boundary, and bad references behave as expected
|
| 148 |
|
| 149 |
-
Gate 3:
|
| 150 |
|
| 151 |
- a human can read the observation
|
| 152 |
- a human can choose a plausible next action
|
| 153 |
- a human can explain the reward change
|
| 154 |
|
| 155 |
-
Gate
|
| 156 |
|
| 157 |
- one clean trajectory is reproducible enough for demo use
|
| 158 |
|
| 159 |
-
Gate
|
| 160 |
|
| 161 |
- heuristic behavior is at least interpretable and preferable to random on the repaired task
|
| 162 |
|
| 163 |
-
Gate
|
| 164 |
|
| 165 |
- HF Space preserves the same task contract as local
|
| 166 |
|
| 167 |
-
Gate
|
| 168 |
|
| 169 |
- Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
|
| 170 |
|
|
@@ -174,6 +188,7 @@ If training evidence is weak:
|
|
| 174 |
|
| 175 |
- keep claims conservative about policy quality
|
| 176 |
- still ship a trained-policy demonstration and document its limitations plainly
|
|
|
|
| 177 |
|
| 178 |
If HF Space deployment is delayed:
|
| 179 |
|
|
@@ -199,5 +214,8 @@ If the repaired family is too easy:
|
|
| 199 |
- [x] Record the measured sweep and choose provisional defaults from evidence.
|
| 200 |
- [x] Check in tracked fixtures.
|
| 201 |
- [x] Record the first manual playtest log.
|
| 202 |
- [ ] Refresh the heuristic baseline from that playtest evidence.
|
| 203 |
- [ ] Verify one clean HF Space episode with the same contract.
|
|
|
|
| 38 |
|
| 39 |
Still open:
|
| 40 |
|
| 41 |
+
- tiny low-fidelity PPO smoke evidence
|
| 42 |
- paired high-fidelity checks for the tracked fixtures
|
| 43 |
- submit-side manual playtest evidence
|
| 44 |
- heuristic baseline refresh on the repaired real-verifier path
|
|
|
|
| 76 |
- Do not use reward complexity to hide a blocked action family.
|
| 77 |
- Do not polish repo or video before the environment and baselines are real.
|
| 78 |
|
| 79 |
+
Practical fail-fast rule:
|
| 80 |
+
|
| 81 |
+
- allow a tiny low-fidelity PPO smoke run before full submit-side validation
|
| 82 |
+
- use it only to surface obvious learnability bugs, reward exploits, or action-space problems
|
| 83 |
+
- do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
|
| 84 |
+
|
| 85 |
## 5. Document Roles
|
| 86 |
|
| 87 |
Use the docs like this:
|
|
|
|
| 111 |
- [x] measured sweep note
|
| 112 |
- [ ] fixture checks
|
| 113 |
- [x] manual playtest log
|
| 114 |
+
- [ ] tiny low-fi PPO smoke trace
|
| 115 |
- [ ] reward iteration note
|
| 116 |
- [ ] stable local and remote episodes
|
| 117 |
- [x] random and heuristic baselines
|
|
|
|
| 133 |
|
| 134 |
## 8. Execution Order
|
| 135 |
|
| 136 |
+
- [ ] Run a tiny low-fidelity PPO smoke pass and inspect a few trajectories for obvious learnability failures or reward exploits.
|
| 137 |
- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
|
| 138 |
- [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
|
| 139 |
+
- [ ] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
|
| 140 |
- [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
|
| 141 |
- [ ] Refresh the heuristic baseline using the repaired-family evidence.
|
| 142 |
- [ ] Prove a stable local episode path.
|
|
|
|
| 155 |
|
| 156 |
- good, boundary, and bad references behave as expected
|
| 157 |
|
| 158 |
+
Gate 3: tiny PPO smoke is sane
|
| 159 |
+
|
| 160 |
+
- a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
|
| 161 |
+
- trajectories are readable enough to debug
|
| 162 |
+
|
| 163 |
+
Gate 4: manual playtest passes
|
| 164 |
|
| 165 |
- a human can read the observation
|
| 166 |
- a human can choose a plausible next action
|
| 167 |
- a human can explain the reward change
|
| 168 |
|
| 169 |
+
Gate 5: local episode is stable
|
| 170 |
|
| 171 |
- one clean trajectory is reproducible enough for demo use
|
| 172 |
|
| 173 |
+
Gate 6: baseline story is credible
|
| 174 |
|
| 175 |
- heuristic behavior is at least interpretable and preferable to random on the repaired task
|
| 176 |
|
| 177 |
+
Gate 7: remote surface is real
|
| 178 |
|
| 179 |
- HF Space preserves the same task contract as local
|
| 180 |
|
| 181 |
+
Gate 8: submission artifacts exist
|
| 182 |
|
| 183 |
- Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
|
| 184 |
|
|
|
|
| 188 |
|
| 189 |
- keep claims conservative about policy quality
|
| 190 |
- still ship a trained-policy demonstration and document its limitations plainly
|
| 191 |
+
- do not skip the paired high-fidelity checks or submit-side manual trace
|
| 192 |
|
| 193 |
If HF Space deployment is delayed:
|
| 194 |
|
|
|
|
| 214 |
- [x] Record the measured sweep and choose provisional defaults from evidence.
|
| 215 |
- [x] Check in tracked fixtures.
|
| 216 |
- [x] Record the first manual playtest log.
|
| 217 |
+
- [ ] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
|
| 218 |
+
- [ ] Pair the tracked fixtures with high-fidelity submit checks.
|
| 219 |
+
- [ ] Record one submit-side manual trace.
|
| 220 |
- [ ] Refresh the heuristic baseline from that playtest evidence.
|
| 221 |
- [ ] Verify one clean HF Space episode with the same contract.
|
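The renumbered gates in the plan hunks above describe a strict order in which later gates are only meaningful once earlier ones hold. A minimal sketch of that fail-fast ordering follows. The gate names are copied from the diff. Gate 1 lies outside the visible hunks and is left out. The `first_failing_gate` function and the idea of representing each gate as a zero-argument callable are assumptions for illustration.

```python
from typing import Callable, Dict, Optional

# Gate names as they appear in the diff; Gate 1 is not visible there.
GATES = [
    "Gate 2: fixture checks pass",
    "Gate 3: tiny PPO smoke is sane",
    "Gate 4: manual playtest passes",
    "Gate 5: local episode is stable",
    "Gate 6: baseline story is credible",
    "Gate 7: remote surface is real",
    "Gate 8: submission artifacts exist",
]


def first_failing_gate(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run gates in order; return the first that fails or has no check yet."""
    for gate in GATES:
        check = checks.get(gate)
        # A missing check counts as a failure, so no gate is skipped silently.
        if check is None or not check():
            return gate
    return None  # every gate passed
```

Treating a missing check as a failure mirrors the rule in the diff that the smoke run must not replace the paired high-fidelity checks or the submit-side manual trace.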