composer-replication-framework/research/critic-findings-depth.json

Baladithya Balamurugan

	{
	"critic": "depth",
	"findings": [
	{
	"severity": "high",
	"section": "10. Cost, Throughput, Failure Modes (and the §3 callback at line 55)",
	"issue": "The single quantitative anchor for the entire 'divergence-gating is mandatory' argument misreads its own source. The report frames '~$0.98/trace flat-ungated versus ~$64/trace for an ungated eight-teacher thousand-step branching tree.' But research/05:256 derives $64 explicitly as a FLAT replay cost ($0.008/step x 1000 steps x 8 teachers = 8000 forward passes, no branching). The repo's own flat-to-tree note (flat-multi-teacher-to-branching...md:40) states this directly: 'research/05 ... already prices the FLAT case at ~$64/trace ungated for 8 teachers x 1000 steps; a tree makes [gating] mandatory.' Both numbers the report compares are flat costs that differ only in scale (N=3 short trace = $0.98 from teacher_replay.py:7-8 spike-001; N=8 x 1000 steps = $64). Labeling the $64 figure a 'branching tree' conflates a teacher-count/length scale difference with the flat-vs-tree distinction, and badly UNDERSELLS the real tree cost: a true O(N^D) branching tree is combinatorially worse than $64, not equal to it. The argument's headline number is wrong in the direction that weakens the report's own thesis.",
	"fix": "Reframe to: flat Channel-3 replay is ~$0.98/trace at N=3 (teacher_replay.py:7-8) and ~$64/trace at the 8-teacher x 1000-step scale (research/05:256) — both FLAT, O(N*T). A branching tree is O(N^D), strictly worse than either flat figure; that combinatorial blow-up (not the $0.98-to-$64 gap) is what makes divergence-gating mandatory. Drop 'branching tree' from the $64 clause.",
	"anchor_quote": "~$0.98/trace flat-ungated versus ~$64/trace for an ungated eight-teacher thousand-step branching tree"
	},
	{
	"severity": "medium",
	"section": "9. The SageMaker Path and the Recommended Hybrid (also §6 reuse/build table)",
	"issue": "The '~150 LOC each' executor estimate (and the '~300 LOC' combined figure in §6) undershoots the repo's only working ServerlessExecutor backend by ~2.5x and is not grounded in the existence proof the report itself cites. The report leans on ModalSpawnExecutor as the 'working proof' that calibrates the delta [42], but modal_spawn.py is 390 LOC and the executor.py reference (Protocol + LocalProcessExecutor) is 310 LOC. An EKS adapter that must handle Indexed Jobs, JOB_COMPLETION_INDEX->REPLICA_RANK mapping, GPU limits, IRSA, optional runtimeClassName, plus poll/cancel/stream_logs/collect against the Batch/Pod APIs is unlikely to be half the size of the Modal adapter. The figure reads as optimistic rather than measured, which weakens the report's load-bearing 'nine-tenths already exists / bounded delta' claim.",
	"fix": "Either ground the estimate (e.g. 'ModalSpawnExecutor is 390 LOC; expect EKSExecutor in the same 300-400 LOC range') or soften to an order-of-magnitude ('a few hundred LOC each, comparable to the existing Modal adapter') instead of the precise '~150 LOC each'.",
	"anchor_quote": "`EKSExecutor` (~150 LOC, primary)"
	},
	{
	"severity": "low",
	"section": "6. Grounding in the composer-replication-framework (reuse/build table) and §8/§9",
	"issue": "The report consistently presents `EKSExecutor` as the repo's own reserved slot ('AWS leaf adapters \| Build (~300 LOC) \| `EKSExecutor` + `SageMakerExecutor`'). But the repo never names an EKSExecutor: the ServerlessExecutor Protocol docstring (executor.py:41) lists 'RunPodExecutor, SageMakerExecutor, K8sExecutor' as Future, and INTEGRATION_RECIPES.md:685 lists `K8sExecutor` (KubeRay/Volcano) as Roadmap. `EKSExecutor` is the report's coinage. SageMakerExecutor is a genuine repo-reserved name; EKSExecutor is not. This slightly overstates how pre-slotted the EKS path is.",
	"fix": "Either note that the repo's roadmap slot is `K8sExecutor` (executor.py:41 / INTEGRATION_RECIPES.md:685) and EKSExecutor is the proposed concrete K8s implementation, or rename to `K8sExecutor` to match the repo. A one-clause parenthetical ('the repo's reserved `K8sExecutor` slot, here specialized to EKS') closes the gap.",
	"anchor_quote": "`EKSExecutor` + `SageMakerExecutor` [42]"
	},
	{
	"severity": "low",
	"section": "7. What the Literature Says (Endorsements, the counterfactual-credit backbone)",
	"issue": "The divergence-as-counterfactual-credit claim slightly conflates two distinct mechanisms. The report says siblings from a shared parent are 'low-variance because the shared parent differences out the baseline,' then attributes this to 'the quantity learned counterfactual-credit methods approximate with a hindsight model' [33]. But 2011.09464 (the cited note) achieves low variance via a FUTURE-CONDITIONAL (hindsight) baseline that conditions on the realized trajectory — not via a shared-parent/leave-one-out baseline (which is the standard MC advantage the repo's GRPO LOO already does). 'Shared parent differences out the baseline' is really the LOO/group-relative argument (closer to Tree-GRPO [44]), whereas the hindsight-model framing is CCA. The two are run together as if one mechanism.",
	"fix": "Separate the two: the shared-parent differencing is a group-relative/LOO baseline (Tree-GRPO [44]); CCA [33] is the stronger, hindsight-conditioned variant that the executed-sibling structure approximates non-parametrically. Stating both as distinct sources of the low-variance claim is more accurate and actually strengthens the backbone.",
	"anchor_quote": "low-variance because the shared parent differences out the baseline"
	}
	],
	"overall": "The report's core mechanism claims are unusually well-grounded — I verified each axis against source and most are faithful to the byte level. The flat->tree fitness delta (extract_dpo_pairs breaks after one teacher-plurality pair; _grade() returns masked pass-fraction) is exact. The SDPO-carrier-for-world-model claim is mechanically sound: the world-model 'splice realized observation into ctx_teacher as privileged info' reuses the same ctx_teacher = ctx_student + hint pattern, post-hint mask, and ADR-011 aligned-index gather that the real collator already implements (data_collator.py, ADR-011) — no hand-waving. Both prune gates are real: oracle-cleanliness = _grade() 0-masking (env.py:90), per-turn signal-presence = the collator empty-recovery row-drop (data_collator.py L308). ObjectStoreAllReduce is verified to the line: PUT round_{NNNNNN}/rank_{RRRR}.pt, poll-until-all-peers, mean, and the 'straggler blocks at the poll loop bounded by timeout_s=1800' claim is exactly what the code does (allreduce.py:151-162). The counterfactual-credit backbone is grounded (2011.09464 + Tree-GRPO step-level DPO equivalence), with only a minor mechanism conflation. The depth weaknesses are concentrated in the QUANTITATIVE concreteness, not the conceptual substance: the headline cost anchor mislabels a flat-scale figure as a tree cost (and thereby undersells the tree's true O(N^D) cost), the executor LOC estimates undershoot the only working backend by ~2.5x, and EKSExecutor is presented as a repo slot when the repo reserves K8sExecutor. None touch the load-bearing argument; all are surgical fixes that make the numbers honest.",
	"findings_count_note": "4 findings: 1 high (cost-anchor misread), 1 medium (LOC estimate ungrounded), 2 low (naming + credit-mechanism conflation). The conceptual axes the checklist flagged are solid and I say so in overall rather than inventing nits."
	}