CreativeEngineer committed
Commit ba716cf · 1 Parent(s): 16065b1

docs: sync status trackers to verifier state
README.md CHANGED

@@ -22,6 +22,7 @@ Implementation status:
  - docs are aligned to fresh `P1` wiring in this repo
  - shared models, baselines, and server/client entry points now reflect the locked `P1` contract
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
+ - the remaining runtime work is fixture coverage, manual playtesting, heuristic refresh, and deployment evidence

  ## Execution Status

@@ -35,6 +36,8 @@ Implementation status:
  - [x] Replace the synthetic evaluator with `constellaration`
  - [ ] Add tracked `P1` fixtures under `server/data/p1/`
  - [ ] Run manual playtesting and record the first reward pathology
+ - [ ] Refresh the heuristic baseline for the real verifier path
+ - [ ] Pass the Northflank smoke test on the H100 workspace
  - [ ] Deploy the real environment to HF Space

  ## Known Gaps

@@ -47,7 +50,7 @@ Implementation status:
  Current mode:

  - strategic task choice is already locked
- - the next work is implementation, smoke validation, and manual playtesting
+ - the next work is fixtures, manual playtesting, heuristic refresh, smoke validation, and deployment
  - new planning text should only appear when a real blocker forces a decision change

  ## Planned Repository Layout

@@ -100,15 +103,15 @@ uv sync --extra notebooks

  ## Immediate Next Steps

- 1. Set up the Northflank Jupyter Notebook with PyTorch and attach persistent storage.
- 2. Pass a Northflank smoke test:
+ 1. Add tracked `P1` fixtures under `server/data/p1`.
+ 2. Run manual playtest episodes and record the first real reward pathology, if any.
+ 3. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
+ 4. Pass a Northflank smoke test:
     - import `constellaration`
     - run one rotating-ellipse generation plus one low-fidelity verifier call
     - write an artifact to persistent storage
- 3. Add tracked `P1` fixtures under `server/data/p1`.
- 4. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
- 5. Add the Colab notebook under `training/notebooks`.
- 6. Run manual playtest episodes before heavy training work.
+ 5. Deploy the environment to HF Space.
+ 6. Add the Colab notebook under `training/notebooks`.

  These are implementation steps, not another planning phase.
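Editorial note on the smoke test above: its final step, writing an artifact to persistent storage, can be sketched in a few lines of plain Python. The helper name and filename scheme here are illustrative assumptions, not an existing repo convention.

```python
import json
import time
from pathlib import Path


def write_smoke_artifact(storage_dir: str, payload: dict) -> Path:
    """Persist one smoke-test result so the run leaves durable evidence.

    `storage_dir` would be the mounted persistent volume; the
    `smoke_<timestamp>.json` naming is an assumption for this sketch.
    """
    root = Path(storage_dir)
    root.mkdir(parents=True, exist_ok=True)
    out = root / f"smoke_{int(time.time())}.json"
    out.write_text(json.dumps(payload, indent=2))
    return out
```

A smoke run would call this once after the low-fidelity verifier call, passing the returned score and the parameters that produced it.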
 
TODO.md CHANGED

@@ -36,9 +36,9 @@ Priority source:

  ```mermaid
  flowchart TD
- A["Northflank Smoke Test"] --> C["constellaration Physics Wiring"]
+ A["Northflank Smoke Test"] --> E["Fixture Checks"]
  B["P1 Contract Lock"] --> D["P1 Models + Environment"]
- C --> D
+ C["constellaration Physics Wiring"] --> D
  D --> E["Fixture Checks"]
  E --> F["Manual Playtest"]
  F --> G["Reward V1"]

@@ -121,6 +121,13 @@ flowchart TD
  [AGENTS.md](AGENTS.md),
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)

+ - [ ] Write down whether `Reward V0` survives unchanged
+ Goal:
+ if playtesting does not reveal a real pathology, record that outcome explicitly instead of forcing a `V1`
+ Related:
+ [README.md](README.md),
+ [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
+
  - [x] Decide the non-submit terminal reward policy
  Goal:
  budget exhaustion now yields a smaller end-of-episode reward than `submit`, so non-submitting agents still get terminal feedback without outranking explicit submit behavior

@@ -184,3 +191,4 @@ flowchart TD
  - [ ] Do not add training-first complexity before manual playtesting
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
  - [ ] Do not describe the current baseline reset state as already feasible
+ - [ ] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
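Editorial note on the flowchart edits above: the corrected dependency edges can be sanity-checked mechanically. This sketch copies the node labels from the mermaid chart and verifies the ordering with a standard-library topological sort.

```python
from graphlib import TopologicalSorter

# Edges as in the mermaid chart: X --> Y means X must finish before Y.
edges = [
    ("Northflank Smoke Test", "Fixture Checks"),
    ("P1 Contract Lock", "P1 Models + Environment"),
    ("constellaration Physics Wiring", "P1 Models + Environment"),
    ("P1 Models + Environment", "Fixture Checks"),
    ("Fixture Checks", "Manual Playtest"),
    ("Manual Playtest", "Reward V1"),
]

# TopologicalSorter takes a mapping of node -> set of predecessors.
deps = {}
for src, dst in edges:
    deps.setdefault(dst, set()).add(src)

order = list(TopologicalSorter(deps).static_order())
```

Every other node in the chart is an ancestor of "Reward V1", so any valid order places it last; that is exactly the sequencing the checklist relies on.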
baselines/README.md CHANGED

@@ -1,5 +1,14 @@
  Random and heuristic baselines will live here.

+ ## Status
+
+ - [x] random baseline exists
+ - [x] heuristic baseline exists
+ - [x] baseline comparison script exists
+ - [x] baseline comparison rerun completed on the real verifier path
+ - [ ] heuristic refreshed after the real-verifier rerun
+ - [ ] presentation-ready comparison trace exported
+
  The first baseline milestone is:

  - one random agent
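Editorial note on the comparison milestone above: the "presentation-ready comparison trace" can be backed by a tiny pure-Python helper. The agent names and the mean/best metrics here are illustrative assumptions, not the repo's actual comparison script.

```python
def comparison_table(results: dict[str, list[float]]) -> str:
    """Render per-agent episode scores as a markdown comparison table.

    `results` maps an agent name (e.g. "random", "heuristic") to its
    episode scores; the mean/best columns are assumed metrics.
    """
    lines = [
        "| agent | episodes | mean score | best score |",
        "|---|---|---|---|",
    ]
    for agent, scores in sorted(results.items()):
        mean = sum(scores) / len(scores)
        lines.append(
            f"| {agent} | {len(scores)} | {mean:.3f} | {max(scores):.3f} |"
        )
    return "\n".join(lines)
```

Feeding it the random and heuristic episode scores yields a markdown table that can be pasted directly into the demo notes.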
demo/README.md CHANGED

@@ -1,5 +1,12 @@
  Demo assets belong here.

+ ## Status
+
+ - [ ] stable episode capture exported
+ - [ ] reward-iteration story note exported
+ - [ ] baseline comparison figure exported
+ - [ ] final 1-minute script drafted
+
  Expected contents:

  - one stable episode capture
docs/FUSION_DELIVERABLES_MAP.md CHANGED

@@ -6,6 +6,16 @@ Northflank is the recommended compute workspace behind those artifacts. HF Space

  Use this map to sequence execution, not to reopen already-locked task choices.

+ ## Current Branch Status
+
+ - [x] `P1` contract is frozen in code
+ - [x] official `constellaration` verifier loop is wired
+ - [x] baseline comparison has been rerun on the real verifier path
+ - [ ] tracked fixtures are checked in
+ - [ ] manual playtest evidence exists
+ - [ ] heuristic baseline has been refreshed for the real verifier path
+ - [ ] HF Space deployment is live
+
  ## Deliverables Tree

  ```mermaid

@@ -90,14 +100,12 @@ flowchart LR

  ## Priority Order

- 1. Bring up the Northflank H100 workspace with persistent storage.
- 2. Pass the Northflank smoke test.
- 3. Prove the fresh local `P1` verifier loop.
- 4. Freeze the environment contract and mark the initial reward as `V0`.
- 5. Run verifier/fixture checks and then manual-playtest the environment.
- 6. Fix obvious reward/pathology issues.
- 7. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
- 8. Get random and heuristic baselines.
- 9. Use the notebook to show traces and comparisons; include training only if it adds signal.
- 10. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
- 11. Polish the repo only after the artifacts are real.
+ 1. Add tracked fixtures and run fixture sanity checks.
+ 2. Manual-playtest the environment and record the first real pathology, if any.
+ 3. Refresh the heuristic baseline from that evidence.
+ 4. Bring up the Northflank H100 workspace with persistent storage.
+ 5. Pass the Northflank smoke test.
+ 6. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
+ 7. Use the notebook to show traces and comparisons; include training only if it adds signal.
+ 8. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
+ 9. Polish the repo only after the artifacts are real.
 
 
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED

@@ -4,6 +4,19 @@
  **Track:** Statement 3.1 (World Modeling — Professional Tasks)
  **Status:** Judge-aligned plan with `P1` locked

+ ## 0. Current Branch Status
+
+ - [x] `P1` task family is locked
+ - [x] rotating-ellipse `P1` contract is implemented in code
+ - [x] real `constellaration` verifier wiring is in place
+ - [x] low-fidelity `run` plus high-fidelity `submit` split is documented
+ - [x] post-terminal `step()` guard is in place
+ - [x] baseline comparison has been rerun on the real verifier path
+ - [ ] tracked `P1` fixtures are added
+ - [ ] manual playtest evidence is recorded
+ - [ ] heuristic baseline is refreshed for the real verifier path
+ - [ ] HF Space deployment evidence is recorded
+
  ## 1. Submission Thesis

  We are not primarily submitting "a trained model for fusion."

@@ -154,7 +167,7 @@ Allowed reuse:

  Implementation handoff:

- - the remaining work is now wiring, smoke validation, manual playtesting, baselines, and deployment
+ - the remaining work is now fixture coverage, manual playtesting, heuristic refresh, smoke validation, and deployment
  - do not treat supporting decision notes as a new planning backlog

  ## 8.1 Compute Surfaces

@@ -296,6 +309,10 @@ We should expect at least some of these:

  The reward is only acceptable after we test for those behaviors.

+ Important execution rule:
+
+ - if manual playtesting does not reveal a real pathology, keep `Reward V0` and document that outcome rather than forcing a `Reward V1`
+
  ## 12. Verifier and Reward Fixture Checks

  Before training, we should validate environment wiring with a few fixed fixtures.
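Editorial note on the reward policy this plan references: the locked non-submit terminal rule (budget exhaustion pays out less than an explicit submit at the same score) reduces to a tiny shaping function. The scale constants below are assumptions for illustration, not the repo's tuned values.

```python
def terminal_reward(
    best_score: float,
    submitted: bool,
    submit_scale: float = 1.0,   # full credit for an explicit submit (assumed)
    timeout_scale: float = 0.5,  # discounted credit on budget exhaustion (assumed)
) -> float:
    """End-of-episode reward: an explicit submit gets full credit for the
    best score; budget exhaustion still gets terminal feedback, but scaled
    down so it never outranks a submit at the same score."""
    scale = submit_scale if submitted else timeout_scale
    return scale * best_score
```

Any `timeout_scale` strictly between 0 and `submit_scale` preserves the intended ordering: non-submitting agents get signal without outranking submit behavior.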
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED

@@ -13,7 +13,9 @@ Do not expand scope beyond one stable task. Training is supporting evidence, not
  - [x] baselines and API surface have been moved to the `P1` contract
  - [x] add a post-terminal guard in `step()`
  - [x] replace the synthetic evaluator with `constellaration`
+ - [x] re-run baselines on the real verifier path
  - [ ] add tracked fixtures and manual playtest evidence
+ - [ ] refresh the heuristic baseline after the real-verifier rerun

  ## Plan V2 Inheritance

@@ -103,6 +105,7 @@ Transition rule:
  - bad behavior
  - refinement to reward V1
  - improved behavior
+ 8. If no real pathology appears, record that `Reward V0` survived playtesting and move on.

  Exit condition: you can explain why the environment now rewards the intended behavior.
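Editorial note on the checked-off post-terminal guard above: the guard amounts to refusing `step()` calls on a finished episode instead of silently producing rewards. A minimal sketch with illustrative names (the repo's actual environment class differs):

```python
class P1Episode:
    """Toy episode illustrating the post-terminal step() guard.

    Class, method, and action names here are illustrative, not the
    repo's actual API.
    """

    def __init__(self, budget: int = 6):
        self.budget = budget
        self.done = False

    def step(self, action: str) -> str:
        # The guard: once terminal, fail loudly rather than keep scoring.
        if self.done:
            raise RuntimeError("episode already terminated; call reset() first")
        self.budget -= 1
        if action == "submit" or self.budget == 0:
            self.done = True
        return "terminal" if self.done else "running"
```

The important property is that a post-terminal `step()` raises instead of returning another observation, so agents cannot farm reward after `submit` or after budget exhaustion.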
 
docs/PIVOT_P1_ROTATING_ELLIPSE.md CHANGED

@@ -6,6 +6,15 @@

  Use this file as rationale for the pivot, not as a fresh planning queue. Once the pivot is accepted, implementation should follow the SSOT plan docs.

+ ## Current Branch Status
+
+ - [x] pivot accepted
+ - [x] rotating-ellipse `P1` contract is implemented
+ - [x] `constellaration` verifier path is wired
+ - [ ] tracked fixtures are added
+ - [ ] manual playtest evidence is recorded
+ - [ ] heuristic baseline is refreshed for the real verifier path
+
  ## Decision

  Pivot the OpenEnv environment to use the official ConStellaration P1 benchmark with real VMEC physics, scoped to the rotating-ellipse low-dimensional parameter space.

@@ -147,7 +156,6 @@ current_params: {aspect_ratio, elongation, rotational_transform}
  best_params: {aspect_ratio, elongation, rotational_transform}
  initial_score: float
  best_score: float
- current_feasibility: float
  best_feasibility: float
  history: list[str]
  ```

@@ -171,13 +179,17 @@ history: list[str]

  ### Phase 1: Physics Backend (~1 hour)

+ Status: done.
+
  Rewrite `server/physics.py` to wrap:
  - `constellaration.initial_guess.generate_rotating_ellipse` for boundary generation
  - `constellaration.forward_model.forward_model` with low-fi settings for evaluation
- - `constellaration.problems.GeometricalProblem` for official P1 scoring on submit
+ - `constellaration.problems.GeometricalProblem` for official P1 scoring on every evaluation

  ### Phase 2: Environment Contract (~1 hour)

+ Status: done.
+
  Update `server/environment.py`:
  - New observation schema with P1 metrics
  - New action schema for rotating-ellipse perturbations

@@ -188,6 +200,8 @@ Update `fusion_lab/models.py` for new schemas.

  ### Phase 3: Manual Playtest (~30 min)

+ Status: open.
+
  Validate hypothesis: "6 actions is enough."
  - Play 5-10 episodes manually
  - Log: can a human reach feasibility? Improve elongation?

@@ -196,12 +210,16 @@ Validate hypothesis: "6 actions is enough."

  ### Phase 4: Baselines (~30 min)

+ Status: partial. Baselines exist, but the heuristic needs refresh on the real verifier path.
+
  - Random agent
  - Heuristic agent (greedy toward known-good parameter region)
  - Comparison table

  ### Phase 5: Deploy + Evidence (~2 hours)

+ Status: open.
+
  - Update Dockerfile/deps for constellaration
  - `openenv validate` + `openenv push`
  - Colab notebook connecting to live environment
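Editorial note on the post-pivot observation schema shown in this file: after dropping `current_feasibility`, the schema maps directly onto typed models. A dataclass sketch of the shape (the repo's `fusion_lab/models.py` likely uses pydantic, so treat this as shape only):

```python
from dataclasses import dataclass, field


@dataclass
class RotatingEllipseParams:
    """The three rotating-ellipse knobs named in the schema."""
    aspect_ratio: float
    elongation: float
    rotational_transform: float


@dataclass
class P1Observation:
    """Shape of the P1 observation after the schema change above."""
    current_params: RotatingEllipseParams
    best_params: RotatingEllipseParams
    initial_score: float
    best_score: float
    best_feasibility: float
    history: list[str] = field(default_factory=list)
```

Keeping only `best_feasibility` (and not a per-step feasibility) matches the diff: the agent sees its best verified design, not a running feasibility readout.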
server/data/README.md CHANGED

@@ -1,3 +1,7 @@
  Baseline VMEC inputs and related static assets belong here.

  Do not commit generated solver outputs or large transient artifacts.
+
+ ## Status
+
+ - [ ] tracked `P1` fixture assets added under `server/data/p1/`
server/data/p1/README.md CHANGED

@@ -10,4 +10,11 @@ Intended contents:

  These fixtures are for verifier and reward sanity checks.

+ ## Status
+
+ - [ ] known-good or near-winning fixture added
+ - [ ] near-boundary fixture added
+ - [ ] clearly infeasible fixture added
+ - [ ] fixture sanity note written
+
  Do not copy the old `ai-sci-feasible-designs` harness here. Reuse only the specific JSON artifacts needed for the fresh `P1` environment.
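Editorial note on the three fixtures listed above: once they exist, the "fixture sanity note" can be backed by an automated ordering check. This sketch assumes hypothetical fixture filenames (`good.json`, `boundary.json`, `infeasible.json`) and takes the evaluator as a parameter rather than importing the real verifier.

```python
import json
from pathlib import Path


def check_fixture_ordering(fixture_dir: str, evaluate) -> dict:
    """Sanity check: the verifier's scores across the three fixtures
    should be monotone: good >= boundary >= infeasible.

    `evaluate` is whatever callable maps fixture params to a score
    (the real verifier in the repo, a stub in tests).
    """
    scores = {}
    for name in ("good", "boundary", "infeasible"):
        params = json.loads((Path(fixture_dir) / f"{name}.json").read_text())
        scores[name] = evaluate(params)
    assert scores["good"] >= scores["boundary"] >= scores["infeasible"], scores
    return scores
```

If the real verifier ever inverts this ordering, that is exactly the kind of wiring bug these fixtures exist to catch before training.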
training/README.md CHANGED

@@ -1,3 +1,9 @@
  Training and evaluation notebooks belong here.

  This repository treats notebooks as supporting evidence for the environment, not the primary product.
+
+ ## Status
+
+ - [ ] Northflank notebook artifacts saved
+ - [ ] Colab notebook saved
+ - [ ] training evidence included only if it is persuasive
training/notebooks/README.md CHANGED

@@ -12,6 +12,12 @@ Recommended split:
  - Northflank notebook: main compute workspace on the team H100
  - Colab notebook: thin public artifact required by the hackathon

+ ## Status
+
+ - [ ] Northflank smoke notebook note saved
+ - [ ] manual-playtest notebook or trace notebook saved
+ - [ ] thin public Colab notebook saved
+
  Operational defaults:

  - use the same Python dependency set as the repo runtime