## Checkpoints (sync-or-die contract)



Goal: keep three engineers aligned and prevent "cool demo" scope creep from killing the submission. Source: `../prd.md` §12.



### Checkpoint 1: Midnight (00:00 IST), scope freeze + Phase 1 gate



**Everyone must demonstrate (live, locally or on Space):**

- **Env server runs** and responds to `GET /health`

- **OpenEnv loop works**: `reset` → `step` → done, without crashing

- **Action parser is robust**: malformed XML doesn't crash the server; the parser returns a safe error

- **No-leak invariant**: observation contains no ground truth fields
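
The last two gates can be sketched in a few lines of Python. This is a minimal illustration, assuming each action arrives as a single XML element and observations are plain dicts. The action names `analyze` and `verdict` come from the 2-action fallback in this document; `parse_action`, `leaked_keys`, and the forbidden key names are hypothetical and should be replaced with whatever the env actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical names for illustration; not taken from the PRD.
ALLOWED_ACTIONS = {"analyze", "verdict"}
GROUND_TRUTH_KEYS = {"label", "ground_truth", "cwe_id"}


def parse_action(raw: str) -> dict:
    """Parse an XML action string; never raise, return an error dict instead."""
    try:
        root = ET.fromstring(raw)
    except (ET.ParseError, TypeError, ValueError):
        return {"ok": False, "error": "malformed_xml"}
    if root.tag not in ALLOWED_ACTIONS:
        return {"ok": False, "error": "unknown_action"}
    return {"ok": True, "action": root.tag, "text": (root.text or "").strip()}


def leaked_keys(observation, forbidden=GROUND_TRUTH_KEYS) -> set:
    """Recursively collect any forbidden keys present in an observation."""
    found = set()
    if isinstance(observation, dict):
        for key, value in observation.items():
            if key in forbidden:
                found.add(key)
            found |= leaked_keys(value, forbidden)
    elif isinstance(observation, (list, tuple)):
        for item in observation:
            found |= leaked_keys(item, forbidden)
    return found
```

In a test, asserting `leaked_keys(obs) == set()` on every observation returned by `reset` and `step` is one way to turn the no-leak invariant into a red/green check.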



**Role deliverables:**

- **Env/Server owner**: endpoints exist (`/health`, `/reset`, `/step`, `/state`, `/docs`)

- **Reward owner**: reward function wired and deterministic on handcrafted cases

- **Training owner**: mock training loop can call env repeatedly (even if reward is dummy)
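
For the training owner's deliverable, a mock loop only needs something with `reset` and `step`. The sketch below uses a hypothetical in-process `StubEnv` with a dummy reward; it is not the real OpenEnv client, and the action strings and step limit are assumptions made for illustration.

```python
import random


class StubEnv:
    """Hypothetical in-process stand-in for the env; not the real OpenEnv client."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> dict:
        self.steps = 0
        return {"code": "int main() { return 0; }"}

    def step(self, action: str) -> tuple:
        self.steps += 1
        # An episode ends on a verdict or when the step budget runs out.
        done = action.startswith("<verdict>") or self.steps >= self.max_steps
        reward = 0.0  # dummy reward; the real reward function plugs in here
        return {"code": ""}, reward, done


def run_mock_training(env, episodes: int, seed: int = 0) -> list:
    """Run random-policy episodes and return the per-episode step counts."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(episodes):
        env.reset()
        done, steps = False, 0
        while not done:
            action = rng.choice(["<analyze>look</analyze>", "<verdict>safe</verdict>"])
            _, _, done = env.step(action)
            steps += 1
        lengths.append(steps)
    return lengths
```

Swapping `StubEnv` for a thin wrapper around the HTTP endpoints keeps the loop unchanged once the server is up.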



**If any of these are red, trigger a scope cut immediately:**

- 3-action env incomplete → cut to 2-action env (analyze + verdict)

- Tiered reward unstable → cut to binary reward only



**After this checkpoint:**

- **Scope freeze is active.** New features go to `.agent/FUTURE_WORK.md` only.



### Checkpoint 2: 9:00 AM Sunday, training evidence gate



**Everyone must demonstrate:**

- Training run launched (HF Jobs A10G preferred) or fallback running

- Wandb logging works (reward curve visible)

- Evaluation script/notebook can run 100 held-out samples



**Scope-cut triggers:**

- Training blocked by infra >30 min → move to GCP A10G fallback

- Training curve still flat by 10:00 AM → commit to qualitative narrative (no more training tweaks)
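
"Flat" is easier to agree on at 10:00 AM if it is defined ahead of time. One possible definition, sketched below, compares the mean reward of the last window of logged steps with the first window. The function name, window size, and `min_gain` threshold are assumptions for illustration, not values from the PRD.

```python
from statistics import mean


def curve_is_flat(rewards: list, window: int = 20, min_gain: float = 0.05) -> bool:
    """Return True if the last window's mean beats the first window's by less than min_gain."""
    if len(rewards) < 2 * window:
        return True  # too little data to claim improvement
    return mean(rewards[-window:]) - mean(rewards[:window]) < min_gain
```

The reward history can be exported from the wandb run or read from a local log; either way the decision becomes a single boolean rather than a debate.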



**What gets cut first (in order):**

1. P2 items (web UI polish, blog post)

2. Per-CWE breakdown (keep overall accuracy)

3. Exploit sketch bonus (keep binary + CWE if stable)

4. CWE classification bonus (keep binary only)
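
Cuts 3 and 4 become one-line config changes if the reward is written in tiers from the start. The sketch below assumes a binary verdict reward with two optional bonuses; `RewardConfig`, the dict keys, the `exploit_ok` flag, and both weights are hypothetical placeholders rather than the project's actual reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Flags mirror the cut order: disable the exploit bonus first, then the CWE bonus.
    use_cwe_bonus: bool = True
    use_exploit_bonus: bool = True
    # Hypothetical weights; the PRD values are not reproduced here.
    cwe_weight: float = 0.3
    exploit_weight: float = 0.2


def tiered_reward(pred: dict, truth: dict, cfg: RewardConfig = RewardConfig()) -> float:
    """Binary verdict reward plus optional bonuses, gated on a correct verdict."""
    if pred.get("vulnerable") != truth["vulnerable"]:
        return 0.0
    reward = 1.0
    if cfg.use_cwe_bonus and truth.get("cwe") and pred.get("cwe") == truth["cwe"]:
        reward += cfg.cwe_weight
    if cfg.use_exploit_bonus and pred.get("exploit_ok", False):
        reward += cfg.exploit_weight
    return reward
```

With both flags off, `tiered_reward` reduces to the binary reward. The function is pure, so the same handcrafted case always yields the same value.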



### Checkpoint 3: 3:00 PM Sunday, feature freeze gate



**Everyone must demonstrate:**

- HF Space is live and stable; `/health` 200; `/docs` loads

- `tests/` pass (see `.agent/test_contracts.md`)

- Demo artifact path is locked (video or text-trace fallback)

- README has all submission links (Space, notebook, video, wandb, repo)
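
The first item in this list can be checked with a small standard-library script. This is a sketch: `smoke_check` is a hypothetical helper, and the base URL is whatever address the Space or local server is reachable at.

```python
import urllib.error
import urllib.request


def smoke_check(base_url: str, paths=("/health", "/docs"), timeout: float = 5.0) -> dict:
    """Return {path: HTTP status or None} for each endpoint under base_url."""
    results = {}
    for path in paths:
        try:
            with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as exc:
            results[path] = exc.code  # server answered with an error status
        except (urllib.error.URLError, OSError):
            results[path] = None  # no response at all
    return results
```

The gate is green when every value in the returned dict is `200`.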



**Hard rule:**

- **No changes after 3:00 PM** except emergency fixes that prevent submission failure.



**Final scope cuts (if needed to protect submission):**

1. Video → text trace in README

2. Training curve → single plot + narrative

3. Held-out eval → small N sanity check