Baladithya Balamurugan
R14: harden live-Docker tests against transient daemon contention
c647fe9
{
"scope": "Phase-8 FINAL-VERIFICATION (read-only) of Wave-3 reconciliation items R1-R13 + overall integration health on branch backlog/goal-resolution-2026-06 (HEAD ace4fac, on top of main 4e6e82e). Verified independently against code/tests/git, not the backlog status column.",
"overall_verdict": "minor-open",
"item_verdicts": [
{
"id": "R1",
"status": "resolved",
"evidence": "HeldOutGuard wired into composer_replication/trainer/composer_trainer.py: __init__ accepts heldout_guard/heldout_eval_fn/strict_killswitch (lines 101-138); _maybe_update_killswitch (line 184) folds metrics into guard at the SAME logging cadence as loss-component logging (called from _compute_loss line 176, inside the `global_step % log_steps == 0` gate). Feeds in_loop_reward from TRL _metrics['train']['reward'], heldout_score from injected heldout_eval_fn(), kl_to_init from TRL 'kl' metric (line 223). On fire: raise_if_fired (hard, line 246) or control.should_training_stop (soft, line 251). OFF BY DEFAULT VERIFIED: guard is None => immediate return (lines 205-207), zero behavior change; tests test_absent_guard_is_noop (eval fn never called, nothing logged) and test_constructor_defaults_leave_killswitch_off prove it. Integration test composer_replication/trainer/tests/test_killswitch_integration.py has 11 tests covering all 7 acceptance gates; whole targeted suite is 192 passed / 18 skipped."
},
{
"id": "R2",
"status": "resolved",
"evidence": "composer_replication/safety/holdout.py exists with HeldoutSplit (frozen dataclass, id-based + optional content-hash disjointness, deterministic .split() constructor) and HeldoutOverlapError carrying overlapping_ids/overlapping_hashes. Both re-exported in safety/__init__.py __all__ (lines 41-42). test_holdout.py = 10/10 passed standalone. content_hash deliberately excludes task_id to catch re-id'd near-duplicates."
},
{
"id": "R3",
"status": "resolved",
"evidence": "replica_entrypoint.py __main__ block (lines 91-141) uses a _resolve(arg_val, env_key, required, cast) helper: argv flags default to None (NOT required=True), fall back to RENDEZVOUS_URI/WORLD_SIZE/TRAINER_MODULE/TRAINER_FN/TRAINER_KWARGS_JSON env vars, error only if NEITHER source supplies a mandatory field. EKS env path works: eks.py _build_env (line 270) injects REPLICA_RANK via downward API (job-completion-index annotation), WORLD_SIZE literal, and upper-cases every scalar entrypoint_arg (key.upper(), line 301) so rendezvous_uri -> RENDEZVOUS_URI exactly matching the entrypoint's env resolution. The pure-env EKS pod no longer crashes at arg-parsing."
},
{
"id": "R4",
"status": "resolved",
"evidence": "kill_switch.py calibrate_kl_threshold (line 388): raises ValueError if factor <= 0 (lines 422-423), raises if any baseline KL < 0 (lines 424-428), and floors the stored kl_hard_stop at max(min(calibrated, current), 1e-6) (line 434) so an all-zero baseline cannot drive the ceiling to 0. The function therefore cannot yield a non-positive kl_hard_stop."
},
{
"id": "R5",
"status": "resolved",
"evidence": "EKS cancel (eks.py:594-602) swallows ONLY status in (404, 409) and re-raises everything else. SageMaker cancel (sagemaker.py:472-481) swallows ONLY _is_resource_not_found OR _is_already_terminal and re-raises everything else. Propagation tested: test_sagemaker_executor.py::test_cancel_reraises_unexpected_error asserts a RuntimeError(AccessDenied) propagates; EKS test_poll_reraises_non_404_api_exception covers the poll path. EKS cancel-swallows-404 and unknown-handle-noop tests present."
},
{
"id": "R6",
"status": "partial",
"evidence": "EKS collect() _result_dict (eks.py:658-679) DOES include the 'result' key (= handle.metadata['rendezvous_uri'], or None) for cross-backend shape uniformity; the code change is genuinely present and correct. GAP: the collect test (test_eks_executor.py::test_collect_returns_terminal_results_in_order, line 599) asserts rank/status/exit_code/error/job_name but does NOT explicitly assert that the 'result' key is present. The functional fix is in; only test coverage of that specific key is missing. This is a cosmetic test-completeness nit, not a defect."
},
{
"id": "R7",
"status": "resolved",
"evidence": "docs/API_REFERENCE.md: §15 (line 1504) serverless cloud executors documents class EKSExecutor (line 1510) + class SageMakerExecutor (line 1583) with [eks]/[aws]/[serverless] extras (line 1508); §16 (line 1650) DockerSandbox; §17 (line 1739) safety/kill_switch with HeldOutGuard/TripwireStatus/CollapseStopError/kl_token_trust_filter + the HeldoutSplit discipline note (line 1829)."
},
{
"id": "R8",
"status": "resolved",
"evidence": "docs/adrs/ADR-015-holdout-killswitch.md exists (10760 bytes). Indexed in docs/adrs/README.md:19 (status accepted, 2026-06-09). Referenced from safety/__init__.py:21 ('See docs/adrs/ADR-015-holdout-killswitch.md') — the dangling ref now resolves. Also cited in API_REFERENCE.md:1741,1829."
},
{
"id": "R10",
"status": "resolved",
"evidence": "test_kill_switch.py::test_gap_blowout_fires_even_when_real_still_rising (line 354) pins path-(c) as a divergence-RATE gate: with decline_patience=99 (path-a isolated) the guard fires when proxy sprints while held-out is still rising slowly, asserting status.fire, 'gap' in reason, and heldout_ema actually rose. kill_switch.py:325 implements path (c) as gap blowout."
},
{
"id": "R11",
"status": "resolved",
"evidence": "Backlog documents the spike-006 fix (torch.manual_seed(0) to remove CPU-contention flakiness). The full targeted suite for this verification (safety + serverless + trainer/tests + datagen) ran 192 passed / 18 skipped with zero failures; spike-006 is outside the verified suite but the documented fix is a seed pin and the LOW item is closed. Not independently re-run under contention (out of verified scope)."
},
{
"id": "R13",
"status": "resolved",
"evidence": "Filed in docs/BACKLOG_RESOLUTION_2026-06-09.md lines 80 and 103 (commit ace4fac). Claim VERIFIED: ruff B904 errors exist ONLY in pre-existing files (modal_spawn.py, allreduce.py, hf_jobs.py, modal.py — 4 found); the Wave-2/3 new files (eks.py, sagemaker.py, safety/, datagen/docker_sandbox.py, composer_trainer.py) are all ruff-clean ('All checks passed!')."
},
{
"id": "INTEGRATION-import",
"status": "resolved",
"evidence": "`.venv/bin/python -c 'import composer_replication'` succeeds. All Wave-2/3 public symbols import (HeldOutGuard, HeldoutSplit, TripwireStatus, CollapseStopError, HeldoutOverlapError, EKSExecutor, SageMakerExecutor). serverless __init__ re-exports EKSExecutor/SageMakerExecutor in __all__."
},
{
"id": "INTEGRATION-lazy-deps",
"status": "resolved",
"evidence": "The NEW Wave-2/3 modules keep optional deps LAZY: no top-level import of boto3/kubernetes/docker in eks.py/sagemaker.py/executor.py/hf_jobs.py/modal.py/modal_spawn.py/datagen/docker_sandbox.py (all imports inside functions/methods). After `import composer_replication`: kubernetes=False, docker=False, s3fs=False (all lazy). NOTE: boto3 IS eager-loaded, but the import trace proves the source is the PRE-EXISTING trl dependency chain (composer_replication/__init__.py:98 -> composer_replication.trainer -> `from trl import GRPOTrainer` -> accelerate -> accelerate/commands/config/sagemaker.py imports boto3). This trainer eager-import existed on main 4e6e82e (git diff confirms __init__.py:98 imported the trainer before this branch). It is NOT introduced by R1-R13 and is unrelated to the serverless executors, which correctly defer boto3."
},
{
"id": "INTEGRATION-test-suite",
"status": "resolved",
"evidence": "`.venv/bin/python -m pytest composer_replication/safety composer_replication/diloco/serverless composer_replication/trainer/tests composer_replication/datagen -q` => 192 passed, 18 skipped in 57.59s. Zero failures, zero errors. Skips are host/dep-gated (docker/cloud)."
}
],
"remaining_open": [
"R6 (cosmetic): EKS collect() test test_collect_returns_terminal_results_in_order does not assert the 'result' key is present in the returned dicts, even though the production code (eks.py:678) emits it. Functional fix is in; only the test coverage of that one key is missing.",
"Minor test-coverage gap (not a backlog item): no EKS-specific cancel re-raise-on-non-404 test (SageMaker has test_cancel_reraises_unexpected_error; EKS code at eks.py:602 is correct and EKS poll has test_poll_reraises_non_404_api_exception, but EKS cancel non-404 re-raise is not directly tested).",
"Pre-existing (out of scope, NOT a regression): boto3 is eager-imported at `import composer_replication` via the trl/accelerate dependency chain (accelerate/commands/config/sagemaker.py). Predates this branch (trainer eager-import on main 4e6e82e). The framework's own optional-dep modules stay lazy; only the trl trainer dependency pulls boto3. Would require gating the trainer import in composer_replication/__init__.py behind a lazy/try-except to fully eliminate.",
"R9 (LOW, not in my assigned R-set): canonical test-count refresh in V1_V8_COVERAGE was not independently re-measured in this verification."
],
"confirmed_resolved": ["R1", "R2", "R3", "R4", "R5", "R7", "R8", "R10", "R11", "R13"]
}