Baladithya Balamurugan

Wave 3: close the HIGH review findings (kill-switch wiring, HeldoutSplit, EKS entrypoint bug)

bd0c358 16 days ago


	# Architecture Decision Records

	| # | Title | Status | Date |
	|---|-------|--------|------|
	| [ADR-001](ADR-001-gpu-venue.md) | GPU venue | accepted | — |
	| [ADR-002](ADR-002-trace-source.md) | Trace source | accepted | — |
	| [ADR-003](ADR-003-diloco-impl.md) | DiLoCo implementation | accepted | — |
	| [ADR-004](ADR-004-replaysim-normalization.md) | ReplaySim normalization | accepted | — |
	| [ADR-005](ADR-005-serverless-diloco.md) | Serverless DiLoCo | accepted | — |
	| [ADR-006](ADR-006-rl-frameworks.md) | RL framework strategy: TRL + VeRL + PRIME-RL | accepted (amended-by ADR-008) | 2026-05-26 |
	| [ADR-007](ADR-007-self-distillation-losses.md) | Self-distillation losses landscape | accepted | 2026-05-26 |
	| [ADR-008](ADR-008-drgrpo-sdpo-live-channel.md) | Target Dr. GRPO + host live SDPO channel in TRL trainer | accepted | 2026-05-29 |
	| [ADR-009](ADR-009-layered-hint-generator.md) | Layered HintGenerator for SDPO textual feedback | accepted | 2026-05-29 |
	| [ADR-010](ADR-010-feature-deletion-datagen.md) | FeatureDeletionEnv synthetic-data subsystem over OSS SWE substrates | accepted | 2026-05-29 |
	| [ADR-011](ADR-011-sdpo-alignment-indices.md) | Collator-emitted SDPO alignment indices (close strict-guard regression) | accepted (amends ADR-008) | 2026-05-29 |
	| [ADR-012](ADR-012-close-review-findings.md) | Close open cross-family-review findings (KL/hint-routing/provenance/curriculum) | accepted (amends 008/009/010) | 2026-05-29 |
	| [ADR-013](ADR-013-lma-integration-channel-ladder.md) | LMA integration — isolated-channel ladder (supersedes tie-in Phase-3 hyperparams) | accepted | 2026-05-29 |
	| [ADR-014](ADR-014-policy-optimization-objective-menu.md) | Policy-optimization objective MENU: base RL objective selectable (default Dr.GRPO) over TRL 1.5.0 GRPOConfig (builds-on ADR-006/007/008) | accepted | 2026-05-30 |
	| [ADR-015](ADR-015-holdout-killswitch.md) | Held-out disjoint eval + depth/generation kill-switch (run-level collapse safeguard #2): `HeldOutGuard` + `HeldoutSplit` in `composer_replication.safety` | accepted | 2026-06-09 |

	Sorted by number ascending. ADRs are immutable after `accepted`; supersede or amend rather than edit.
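
	The ordering rule above could be checked mechanically. The following is a minimal sketch, not part of this repository: the helper names `adr_numbers` and `check_index` are hypothetical, and it assumes each table row links its ADR as `[ADR-NNN](file.md)` as in the table above.

	```python
	import re

	# Matches the leading link of an index row, e.g. [ADR-007](ADR-007-....md).
	# Assumption: every row starts with exactly this link form.
	ROW_LINK = re.compile(r"\[ADR-(\d{3})\]\(([^)]+)\)")


	def adr_numbers(index_text):
	    """Return ADR numbers in the order their rows appear in the index."""
	    numbers = []
	    for line in index_text.splitlines():
	        match = ROW_LINK.search(line)  # first link on the line only
	        if match:
	            numbers.append(int(match.group(1)))
	    return numbers


	def check_index(index_text):
	    """Return a list of problems; an empty list means the index looks fine."""
	    numbers = adr_numbers(index_text)
	    problems = []
	    if numbers != sorted(numbers):
	        problems.append("rows are not sorted by number ascending")
	    if len(numbers) != len(set(numbers)):
	        problems.append("duplicate ADR numbers")
	    return problems
	```

	Run against the text of this index, `check_index` would be expected to return an empty list; a non-empty result points at rows to reorder, not at an ADR body to edit.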

	> Provenance note (ADR-014). ADR-014 also records the canonical correction that the
	> framework's **trace-replay-DPO channel (channel 3) is an additive research channel, NOT
	> part of Cursor's Composer recipe** -- Composer's primary sources contain no DPO / preference
	> pairs / reward models / multiple teachers. Genuine replication is channels 1 (Dr.GRPO base)
	> + 2 (SDPO). See [`docs/OVERVIEW.md`](../OVERVIEW.md) for the honest three-channel summary.
