Checkpoints (sync-or-die contract)
Goal: keep three engineers aligned and prevent "cool demo" scope creep from killing the submission. Source: ../prd.md 12.
Checkpoint 1: Midnight (00:00 IST), scope freeze + Phase 1 gate
Everyone must demonstrate (live, locally or on Space):
Env server runs and responds to GET /health
OpenEnv loop works: reset → step → done, without crashing
Action parser is robust: malformed XML doesn't crash; returns safe error
No-leak invariant: observation contains no ground truth fields
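The parser and no-leak items above can be checked with a few lines of Python. This is a minimal sketch, not the project's actual implementation: the `<action name="...">` XML shape, the error codes, and the ground-truth key names in `GROUND_TRUTH_KEYS` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical action names and ground-truth keys; the real ones live in the env code.
ALLOWED_ACTIONS = {"analyze", "verdict"}
GROUND_TRUTH_KEYS = {"label", "cwe_id", "is_vulnerable"}


def parse_action(raw: str) -> dict:
    """Parse an XML action string; never raise, always return a dict."""
    try:
        root = ET.fromstring(raw)
    except (ET.ParseError, TypeError, ValueError):
        return {"ok": False, "error": "malformed_xml"}
    if root.tag != "action":
        return {"ok": False, "error": "unexpected_root"}
    name = root.get("name", "")
    if name not in ALLOWED_ACTIONS:
        return {"ok": False, "error": "unknown_action"}
    return {"ok": True, "name": name, "body": (root.text or "").strip()}


def leaked_keys(observation, forbidden=GROUND_TRUTH_KEYS) -> set:
    """Return forbidden keys found anywhere in a nested observation."""
    found = set()
    if isinstance(observation, dict):
        for key, value in observation.items():
            if key in forbidden:
                found.add(key)
            found |= leaked_keys(value, forbidden)
    elif isinstance(observation, (list, tuple)):
        for item in observation:
            found |= leaked_keys(item, forbidden)
    return found
```

At the gate, a passing demo would show `parse_action` returning an error dict for truncated or empty input, and `leaked_keys(obs)` returning an empty set for every observation the env emits.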
Role deliverables:
Env/Server owner: endpoints exist (/health, /reset, /step, /state, /docs)
Reward owner: reward function wired and deterministic on handcrafted cases
Training owner: mock training loop can call env repeatedly (even if reward is dummy)
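For the Training owner deliverable, a mock loop only needs to exercise reset and step repeatedly. The sketch below uses a hypothetical `MockEnv` with an assumed `reset()` / `step(action)` interface and a dummy reward of 0.0; the real env client may look different.

```python
import random


class MockEnv:
    """Stand-in for the real env client; the interface is an assumption."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> dict:
        self.steps = 0
        return {"code": "int main() { return 0; }"}

    def step(self, action: str):
        self.steps += 1
        # Episode ends on a verdict action or when the step budget runs out.
        done = "verdict" in action or self.steps >= self.max_steps
        reward = 0.0  # dummy reward until the real function is wired
        return {"code": ""}, reward, done


def run_mock_training(env, episodes: int, seed: int = 0) -> list:
    """Call reset/step repeatedly and return per-episode step counts."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(episodes):
        env.reset()
        done, steps = False, 0
        while not done:
            name = rng.choice(["analyze", "verdict"])
            _, _, done = env.step(f"<action name='{name}'/>")
            steps += 1
        lengths.append(steps)
    return lengths
```

Seeding the random policy keeps the run reproducible, so a crash seen during the demo can be replayed exactly.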
If any of these are red, trigger a scope cut immediately:
3-action env incomplete → cut to 2-action env (analyze + verdict)
Tiered reward unstable → cut to binary reward only
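One way to make the tiered-to-binary cut cheap is to put it behind a single flag. The sketch below is illustrative only: the `RewardConfig` name, the dict keys, and the bonus weights are assumptions, not values from the PRD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Bonus weights are illustrative assumptions, not values from the PRD.
    cwe_bonus: float = 0.3
    exploit_bonus: float = 0.2
    tiered: bool = True  # set False to apply the binary-only scope cut


def compute_reward(pred: dict, truth: dict, cfg: RewardConfig = RewardConfig()) -> float:
    """Pure function of its inputs, so repeated calls give identical rewards."""
    if pred.get("vulnerable") != truth["vulnerable"]:
        return 0.0
    reward = 1.0  # binary tier: the verdict is correct
    if cfg.tiered and truth["vulnerable"]:
        cwe = truth.get("cwe")
        if cwe is not None and pred.get("cwe") == cwe:
            reward += cfg.cwe_bonus
        if pred.get("exploit_sketch"):
            reward += cfg.exploit_bonus
    return reward
```

Because the function has no randomness or hidden state, the determinism check on handcrafted cases reduces to calling it twice on the same inputs and comparing the results.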
After this checkpoint:
- Scope freeze is active. New features go to .agent/FUTURE_WORK.md only.
Checkpoint 2: 9:00 AM Sunday, training evidence gate
Everyone must demonstrate:
Training run launched (HF Jobs A10G preferred) or fallback running
Wandb logging works (reward curve visible)
Evaluation script/notebook can run 100 held-out samples
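The held-out evaluation can stay very small. The following is a sketch under assumptions: each sample is taken to be a dict with hypothetical `code`, `vulnerable`, and optional `cwe` keys, and `predict` is any callable that returns a boolean verdict for a code string.

```python
from collections import defaultdict


def evaluate(samples, predict, limit=100, per_cwe=False):
    """Score predict() on up to `limit` held-out samples."""
    subset = list(samples)[:limit]
    if not subset:
        raise ValueError("no held-out samples to evaluate")
    correct = 0
    by_cwe = defaultdict(lambda: [0, 0])  # cwe -> [correct, total]
    for sample in subset:
        hit = predict(sample["code"]) == sample["vulnerable"]
        correct += hit
        bucket = by_cwe[sample.get("cwe", "none")]
        bucket[0] += hit
        bucket[1] += 1
    report = {"n": len(subset), "accuracy": correct / len(subset)}
    if per_cwe:
        report["per_cwe"] = {k: c / t for k, (c, t) in by_cwe.items()}
    return report
```

Lowering `limit` gives a quick sanity check on a handful of samples, and leaving `per_cwe` off reports overall accuracy only.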
Scope-cut triggers:
Training blocked by infra >30 min → move to GCP A10G fallback
Training curve still flat by 10:00 AM → commit to qualitative narrative (no more training tweaks)
What gets cut first (in order):
P2 items (web UI polish, blog post)
Per-CWE breakdown (keep overall accuracy)
Exploit sketch bonus (keep binary + CWE if stable)
CWE classification bonus (keep binary only)
Checkpoint 3: 3:00 PM Sunday, feature freeze gate
Everyone must demonstrate:
HF Space is live and stable; /health returns 200; /docs loads
tests/ pass (see .agent/test_contracts.md)
Demo artifact path is locked (video or text-trace fallback)
README has all submission links (Space, notebook, video, wandb, repo)
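The /health and /docs checks in this gate could be scripted so that anyone can re-run them before the freeze. This is a rough sketch using only the standard library; `smoke_check` and `http_status` are hypothetical helper names, and the Space base URL must be supplied by the caller.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a GET, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return 0  # DNS failure, timeout, refused connection, bad URL


def smoke_check(base_url: str, fetch=http_status) -> dict:
    """Check the endpoints named in the gate; True means HTTP 200."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in ("/health", "/docs")}
```

The `fetch` parameter is injectable so the check itself can be tested offline with a stub; in normal use, `smoke_check(space_url)` should return `True` for both paths.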
Hard rule:
- No changes after 3:00 PM except emergency fixes that prevent submission failure.
Final scope cuts (if needed to protect submission):
Video → text trace in README
Training curve → single plot + narrative
Held-out eval → small N sanity check