CreativeEngineer committed on
Commit
1c1f314
·
1 Parent(s): 27d58b3

docs: adopt hybrid fail-fast validation order

Files changed (3)
  1. README.md +5 -4
  2. TODO.md +16 -8
  3. docs/FUSION_DESIGN_LAB_PLAN_V2.md +24 -6
README.md CHANGED
@@ -30,7 +30,7 @@ Implementation status:
30
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
31
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
32
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
33
- - the next runtime work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, and deployment evidence
34
 
35
  ## Execution Status
36
 
@@ -67,12 +67,12 @@ Implementation status:
67
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
68
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
69
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
70
- - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next meaningful playtest step is a real `submit` trace, not more abstract reward debate.
71
 
72
  Current mode:
73
 
74
  - strategic task choice is already locked
75
- - the next work is paired high-fidelity fixture checks, submit-side manual playtesting, heuristic refresh, smoke validation, and deployment
76
  - new planning text should only appear when a real blocker forces a decision change
77
 
78
  ## Planned Repository Layout
@@ -127,8 +127,9 @@ uv sync --extra notebooks
127
  ## Immediate Next Steps
128
 
129
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
 
130
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
131
- - [ ] Run submit-side manual playtest episodes and record the first real reward pathology, if any.
132
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
133
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
134
  - [ ] Deploy the environment to HF Space.
 
30
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
31
  - the repaired 4-knob low-dimensional family is now wired into the runtime path
32
  - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
33
+ - the next runtime work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, and deployment evidence
34
 
35
  ## Execution Status
36
 
 
67
  - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
68
  - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
69
  - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
70
+ - The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run alongside high-fidelity fixture pairing, then a real `submit` trace.
71
 
72
  Current mode:
73
 
74
  - strategic task choice is already locked
75
+ - the next work is a tiny low-fi PPO smoke run plus paired high-fidelity fixture checks, then a submit-side manual trace, heuristic refresh, smoke validation, and deployment
76
  - new planning text should only appear when a real blocker forces a decision change
77
 
78
  ## Planned Repository Layout
 
127
  ## Immediate Next Steps
128
 
129
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks.
130
+ - [ ] Run a tiny low-fidelity PPO smoke run and save a few trajectories.
131
  - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
132
+ - [ ] Run at least one submit-side manual trace and record the first real reward pathology, if any.
133
  - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
134
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
135
  - [ ] Deploy the environment to HF Space.
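The README note above says budget exhaustion returns a smaller terminal reward than explicit `submit`. A minimal sketch of that asymmetry, assuming hypothetical names (`SUBMIT_SCALE`, `EXHAUSTION_SCALE`, `terminal_reward`) rather than the environment's actual API:

```python
# Illustrative sketch of the terminal-reward asymmetry described above:
# an explicit `submit` ends the episode with full terminal credit, while
# budget exhaustion ends it with strictly smaller credit. The names and
# scale values here are assumptions, not the real environment contract.

SUBMIT_SCALE = 1.0       # full credit for a deliberate submit
EXHAUSTION_SCALE = 0.5   # strictly smaller credit for running out of budget

def terminal_reward(best_score: float, submitted: bool) -> float:
    """Terminal reward that preserves the submit > exhaustion asymmetry."""
    scale = SUBMIT_SCALE if submitted else EXHAUSTION_SCALE
    return scale * best_score

# Same best score, smaller reward when the budget runs out without a submit,
# so agents still prefer deliberate submission.
assert terminal_reward(0.8, submitted=True) > terminal_reward(0.8, submitted=False)
```

Keeping the asymmetry as a single scale factor makes it easy to tune without breaking the ordering the README asks to preserve.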
TODO.md CHANGED
@@ -43,6 +43,7 @@ Priority source:
43
  - [x] manual playtest log
44
  - [x] settle the non-submit terminal reward policy
45
  - [x] baseline comparison has been re-run on the `constellaration` branch state
 
46
  - [ ] refresh the heuristic baseline for the real verifier path
47
 
48
  ## Execution Graph
@@ -54,12 +55,13 @@ flowchart TD
54
  C["constellaration Physics Wiring"] --> D
55
  D --> P["Parameterization Repair"]
56
  P --> E["Fixture Checks"]
57
- E --> F["Manual Playtest"]
58
- F --> G["Reward V1"]
59
- G --> H["Baselines"]
60
- H --> I["HF Space Deploy"]
61
- I --> J["Colab Notebook"]
62
- J --> K["Demo + README"]
 
63
  ```
64
 
65
  ## Hour 0-2
@@ -189,9 +191,15 @@ flowchart TD
189
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
190
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
191

192
  - [ ] Manual-playtest 5-10 episodes
193
  Goal:
194
- expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
195
  Related:
196
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
197
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
@@ -272,7 +280,7 @@ flowchart TD
272
  - [ ] Do not guess repaired-family ranges, deltas, or budget changes without measurement
273
  - [ ] Do not port the old `ai-sci-feasible-designs` harness
274
  - [ ] Do not let notebook or demo work outrun environment evidence
275
- - [ ] Do not add training-first complexity before manual playtesting
276
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
277
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
278
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
 
43
  - [x] manual playtest log
44
  - [x] settle the non-submit terminal reward policy
45
  - [x] baseline comparison has been re-run on the `constellaration` branch state
46
+ - [ ] tiny low-fi PPO smoke run exists
47
  - [ ] refresh the heuristic baseline for the real verifier path
48
 
49
  ## Execution Graph
 
55
  C["constellaration Physics Wiring"] --> D
56
  D --> P["Parameterization Repair"]
57
  P --> E["Fixture Checks"]
58
+ E --> F["Tiny PPO Smoke"]
59
+ F --> G["Submit-side Manual Playtest"]
60
+ G --> H["Reward V1"]
61
+ H --> I["Baselines"]
62
+ I --> J["HF Space Deploy"]
63
+ J --> K["Colab Notebook"]
64
+ K --> L["Demo + README"]
65
  ```
66
 
67
  ## Hour 0-2
 
191
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
192
  [Next 12 Hours Checklist](docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md)
193
 
194
+ - [ ] Run a tiny low-fi PPO smoke pass
195
+ Goal:
196
+ fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
197
+ Note:
198
+ treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
199
+
200
  - [ ] Manual-playtest 5-10 episodes
201
  Goal:
202
+ start with one submit-side trace, then expand the initial low-fidelity playtest note into 5-10 episodes and surface at least one pathology or ambiguity
203
  Related:
204
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md),
205
  [Deliverables Map](docs/archive/FUSION_DELIVERABLES_MAP.md)
 
280
  - [ ] Do not guess repaired-family ranges, deltas, or budget changes without measurement
281
  - [ ] Do not port the old `ai-sci-feasible-designs` harness
282
  - [ ] Do not let notebook or demo work outrun environment evidence
283
+ - [ ] Do not let tiny low-fi smoke training replace paired high-fidelity checks or submit-side manual playtesting
284
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
285
  - [x] Do not compare high-fidelity submit scores against low-fidelity best/initial score state in the final story
286
  - [ ] Do not describe the current baseline reset state as feasible or near-feasible
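The "tiny low-fi PPO smoke run" item above is about failing fast on learnability bugs and reward exploits before longer training. A hedged sketch of the shape of that check, with a toy stand-in environment (`ToyLowFiEnv` and its 4-knob objective are assumptions, not the real constellaration-backed task), rolling out a few cheap episodes and asserting basic invariants on the saved trajectories:

```python
# Sketch of a low-fi smoke rollout: collect a few trajectories and fail
# fast on obvious pathologies (non-terminating episodes, non-finite
# rewards) before investing in real PPO training. ToyLowFiEnv is a
# hypothetical stand-in for the actual environment.
import math
import random

class ToyLowFiEnv:
    """Stand-in low-fidelity env: 4 knobs, a fixed step budget."""

    def __init__(self, budget: int = 16, seed: int = 0):
        self.budget = budget
        self.rng = random.Random(seed)

    def reset(self):
        self.steps = 0
        self.knobs = [self.rng.uniform(-1, 1) for _ in range(4)]
        return list(self.knobs)

    def step(self, action):
        # Perturb the 4 knobs, clipped to [-1, 1]; toy objective drives them to 0.
        self.knobs = [max(-1.0, min(1.0, k + a)) for k, a in zip(self.knobs, action)]
        self.steps += 1
        reward = -sum(k * k for k in self.knobs)
        done = self.steps >= self.budget
        return list(self.knobs), reward, done

def smoke_rollout(env, episodes: int = 3):
    """Collect a few trajectories with a random policy; assert basic sanity."""
    trajectories = []
    for _ in range(episodes):
        obs, done, traj = env.reset(), False, []
        while not done:
            action = [env.rng.uniform(-0.1, 0.1) for _ in range(4)]
            obs, reward, done = env.step(action)
            assert math.isfinite(reward), "non-finite reward: likely env bug"
            traj.append((obs, action, reward))
        trajectories.append(traj)
    return trajectories

trajs = smoke_rollout(ToyLowFiEnv())
assert len(trajs) == 3 and all(len(t) == 16 for t in trajs)
```

As the guardrail above says, passing a check like this only rules out obvious bugs; it does not validate the terminal `submit` contract.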
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -38,6 +38,7 @@ Completed:
38
 
39
  Still open:
40
 
 
41
  - paired high-fidelity checks for the tracked fixtures
42
  - submit-side manual playtest evidence
43
  - heuristic baseline refresh on the repaired real-verifier path
@@ -75,6 +76,12 @@ Execution rule:
75
  - Do not use reward complexity to hide a blocked action family.
76
  - Do not polish repo or video before the environment and baselines are real.
77

78
  ## 5. Document Roles
79
 
80
  Use the docs like this:
@@ -104,6 +111,7 @@ Evidence order:
104
  - [x] measured sweep note
105
  - [ ] fixture checks
106
  - [x] manual playtest log
 
107
  - [ ] reward iteration note
108
  - [ ] stable local and remote episodes
109
  - [x] random and heuristic baselines
@@ -125,9 +133,10 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
125
 
126
  ## 8. Execution Order
127
 
 
128
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
129
  - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
130
- - [ ] Manual-playtest 5 to 10 episodes, including real submit traces, and record the first real confusion point, exploit, or reward pathology.
131
  - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
132
  - [ ] Refresh the heuristic baseline using the repaired-family evidence.
133
  - [ ] Prove a stable local episode path.
@@ -146,25 +155,30 @@ Gate 2: fixture checks pass
146
 
147
  - good, boundary, and bad references behave as expected
148
 
149
- Gate 3: manual playtest passes
150
 
151
  - a human can read the observation
152
  - a human can choose a plausible next action
153
  - a human can explain the reward change
154
 
155
- Gate 4: local episode is stable
156
 
157
  - one clean trajectory is reproducible enough for demo use
158
 
159
- Gate 5: baseline story is credible
160
 
161
  - heuristic behavior is at least interpretable and preferable to random on the repaired task
162
 
163
- Gate 6: remote surface is real
164
 
165
  - HF Space preserves the same task contract as local
166
 
167
- Gate 7: submission artifacts exist
168
 
169
  - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
170
 
@@ -174,6 +188,7 @@ If training evidence is weak:
174
 
175
  - keep claims conservative about policy quality
176
  - still ship a trained-policy demonstration and document its limitations plainly
 
177
 
178
  If HF Space deployment is delayed:
179
 
@@ -199,5 +214,8 @@ If the repaired family is too easy:
199
  - [x] Record the measured sweep and choose provisional defaults from evidence.
200
  - [x] Check in tracked fixtures.
201
  - [x] Record the first manual playtest log.
202
  - [ ] Refresh the heuristic baseline from that playtest evidence.
203
  - [ ] Verify one clean HF Space episode with the same contract.
 
38
 
39
  Still open:
40
 
41
+ - tiny low-fidelity PPO smoke evidence
42
  - paired high-fidelity checks for the tracked fixtures
43
  - submit-side manual playtest evidence
44
  - heuristic baseline refresh on the repaired real-verifier path
 
76
  - Do not use reward complexity to hide a blocked action family.
77
  - Do not polish repo or video before the environment and baselines are real.
78
 
79
+ Practical fail-fast rule:
80
+
81
+ - allow a tiny low-fidelity PPO smoke run before full submit-side validation
82
+ - use it only to surface obvious learnability bugs, reward exploits, or action-space problems
83
+ - do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy
84
+
85
  ## 5. Document Roles
86
 
87
  Use the docs like this:
 
111
  - [x] measured sweep note
112
  - [ ] fixture checks
113
  - [x] manual playtest log
114
+ - [ ] tiny low-fi PPO smoke trace
115
  - [ ] reward iteration note
116
  - [ ] stable local and remote episodes
117
  - [x] random and heuristic baselines
 
133
 
134
  ## 8. Execution Order
135
 
136
+ - [ ] Run a tiny low-fidelity PPO smoke pass and inspect a few trajectories for obvious learnability failures or reward exploits.
137
  - [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
138
  - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
139
+ - [ ] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
140
  - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.
141
  - [ ] Refresh the heuristic baseline using the repaired-family evidence.
142
  - [ ] Prove a stable local episode path.
 
155
 
156
  - good, boundary, and bad references behave as expected
157
 
158
+ Gate 3: tiny PPO smoke is sane
159
+
160
+ - a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
161
+ - trajectories are readable enough to debug
162
+
163
+ Gate 4: manual playtest passes
164
 
165
  - a human can read the observation
166
  - a human can choose a plausible next action
167
  - a human can explain the reward change
168
 
169
+ Gate 5: local episode is stable
170
 
171
  - one clean trajectory is reproducible enough for demo use
172
 
173
+ Gate 6: baseline story is credible
174
 
175
  - heuristic behavior is at least interpretable and preferable to random on the repaired task
176
 
177
+ Gate 7: remote surface is real
178
 
179
  - HF Space preserves the same task contract as local
180
 
181
+ Gate 8: submission artifacts exist
182
 
183
  - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
184
 
 
188
 
189
  - keep claims conservative about policy quality
190
  - still ship a trained-policy demonstration and document its limitations plainly
191
+ - do not skip the paired high-fidelity checks or submit-side manual trace
192
 
193
  If HF Space deployment is delayed:
194
 
 
214
  - [x] Record the measured sweep and choose provisional defaults from evidence.
215
  - [x] Check in tracked fixtures.
216
  - [x] Record the first manual playtest log.
217
+ - [ ] Run a tiny low-fidelity PPO smoke pass and save a few trajectories.
218
+ - [ ] Pair the tracked fixtures with high-fidelity submit checks.
219
+ - [ ] Record one submit-side manual trace.
220
  - [ ] Refresh the heuristic baseline from that playtest evidence.
221
  - [ ] Verify one clean HF Space episode with the same contract.
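The plan's execution order asks to pair each tracked low-fidelity fixture with a high-fidelity submit spot check. A minimal sketch of the shape of that pairing, assuming hypothetical evaluator functions and fixture data (none of the names or scores below are measured values): any fixture whose two scores disagree beyond a tolerance gets flagged for review instead of silently trusted.

```python
# Sketch of paired fixture checking: one cheap low-fi score and one
# expensive high-fi submit score per tracked fixture, with a drift flag.
# Both evaluators and the fixture values are illustrative stand-ins.

def low_fi_score(fixture):   # stand-in for the cheap `run`-path evaluation
    return fixture["low_fi"]

def high_fi_score(fixture):  # stand-in for the expensive `submit`-path evaluation
    return fixture["high_fi"]

def paired_fixture_check(fixtures, tolerance=0.2):
    """Return (name, gap) for fixtures whose scores disagree beyond tolerance."""
    flagged = []
    for fx in fixtures:
        gap = abs(low_fi_score(fx) - high_fi_score(fx))
        if gap > tolerance:
            flagged.append((fx["name"], gap))
    return flagged

# Illustrative fixture data mirroring the good/boundary/bad references above.
fixtures = [
    {"name": "good",     "low_fi": 0.70, "high_fi": 0.65},
    {"name": "boundary", "low_fi": 0.40, "high_fi": 0.05},  # drifts: gets flagged
    {"name": "bad",      "low_fi": 0.05, "high_fi": 0.02},
]
```

A flagged fixture is exactly the kind of evidence the plan wants before trusting low-fidelity `run` metrics as a proxy for high-fidelity `submit` results.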