Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Baladithya Balamurugan Claude Opus 4.8 (1M context) commited on
Commit ·
c647fe9
1
Parent(s): b1adfc9
R14: harden live-Docker tests against transient daemon contention
Browse filesThe Phase-8 verifiers flagged 2 live-Docker DockerSandbox tests as flaky under
concurrent container resource pressure (full suite + review agents spawning
containers) — passing in isolation. Added _retry_docker(): bounded retry (3x,
backoff) on transient daemon errors (timeout/resource/conflict/500/connection),
re-raising genuine repeatable failures. Wired into the 4-gate inversion test and
the network-disabled test. 27 passed / 1 skipped, lint-clean.
Closes the last addressable Phase-8 verification finding; backlog now fully
resolved on this host (remainder is user-gated GPU/token only).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
composer_replication/datagen/tests/test_docker_sandbox.py
CHANGED
|
@@ -449,6 +449,40 @@ live = pytest.mark.skipif(
|
|
| 449 |
"hardware-gated (mirror test_docker_substrate_e2e.py).",
|
| 450 |
)
|
| 451 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 452 |
# Minimal synthetic FD task (same shape as test_docker_substrate_e2e.py).
|
| 453 |
_MODULE_SOLVED = textwrap.dedent('''\
|
| 454 |
def add(a, b):
|
|
@@ -526,17 +560,22 @@ def test_live_four_inversion_gates_in_hardened_container():
|
|
| 526 |
guard = "feature.py::test_add" # PASS_TO_PASS — must survive the deletion
|
| 527 |
|
| 528 |
def _run(module_src, node):
|
| 529 |
-
|
| 530 |
-
|
| 531 |
-
|
| 532 |
-
|
| 533 |
-
|
| 534 |
-
|
| 535 |
-
|
| 536 |
-
|
| 537 |
-
|
| 538 |
-
|
| 539 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 540 |
|
| 541 |
# Gate 1 — solved: both pass.
|
| 542 |
g1t, _ = _run(_MODULE_SOLVED, target)
|
|
@@ -562,18 +601,22 @@ def test_live_network_is_disabled():
|
|
| 562 |
container — the reward-hack egress kill-switch."""
|
| 563 |
if not _image_present(_TEST_IMAGE):
|
| 564 |
pytest.skip(f"{_TEST_IMAGE} not present locally")
|
| 565 |
-
|
| 566 |
-
|
| 567 |
-
with
|
| 568 |
-
|
| 569 |
-
|
| 570 |
-
|
| 571 |
-
|
| 572 |
-
|
| 573 |
-
|
| 574 |
-
|
| 575 |
-
|
| 576 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 577 |
|
| 578 |
|
| 579 |
@live
|
|
|
|
| 449 |
"hardware-gated (mirror test_docker_substrate_e2e.py).",
|
| 450 |
)
|
| 451 |
|
| 452 |
+
|
| 453 |
+
def _retry_docker(fn, *, attempts: int = 3, base_delay: float = 1.5):
|
| 454 |
+
"""Run ``fn`` with bounded retry on TRANSIENT Docker contention errors.
|
| 455 |
+
|
| 456 |
+
The live tests boot real containers; when the full suite runs them alongside
|
| 457 |
+
other container-spawning work (concurrent pytest workers, review agents), the
|
| 458 |
+
daemon transiently rejects/timeouts a create/start under resource pressure —
|
| 459 |
+
a CONTENTION artifact, not a code defect (they pass in isolation). We retry a
|
| 460 |
+
handful of times with backoff so the assertion reflects the sandbox's
|
| 461 |
+
behavior, not the host's momentary load. A genuine, repeatable failure still
|
| 462 |
+
surfaces after the attempts are exhausted.
|
| 463 |
+
"""
|
| 464 |
+
import time
|
| 465 |
+
|
| 466 |
+
last_exc: Exception | None = None
|
| 467 |
+
for i in range(attempts):
|
| 468 |
+
try:
|
| 469 |
+
return fn()
|
| 470 |
+
except Exception as e: # noqa: BLE001 — retry any transient daemon error
|
| 471 |
+
msg = str(e).lower()
|
| 472 |
+
transient = any(
|
| 473 |
+
s in msg
|
| 474 |
+
for s in (
|
| 475 |
+
"timeout", "timed out", "resource", "conflict", "500 server",
|
| 476 |
+
"connection", "temporarily", "too many", "deadline",
|
| 477 |
+
)
|
| 478 |
+
)
|
| 479 |
+
if not transient or i == attempts - 1:
|
| 480 |
+
raise
|
| 481 |
+
last_exc = e
|
| 482 |
+
time.sleep(base_delay * (i + 1))
|
| 483 |
+
if last_exc is not None: # pragma: no cover — loop always returns/raises
|
| 484 |
+
raise last_exc
|
| 485 |
+
|
| 486 |
# Minimal synthetic FD task (same shape as test_docker_substrate_e2e.py).
|
| 487 |
_MODULE_SOLVED = textwrap.dedent('''\
|
| 488 |
def add(a, b):
|
|
|
|
| 560 |
guard = "feature.py::test_add" # PASS_TO_PASS — must survive the deletion
|
| 561 |
|
| 562 |
def _run(module_src, node):
|
| 563 |
+
def _once():
|
| 564 |
+
with tempfile.TemporaryDirectory() as d:
|
| 565 |
+
_materialize(d, module_src)
|
| 566 |
+
sb = DockerSandbox(workdir=d, exec_timeout_s=60)
|
| 567 |
+
sb.boot(_TEST_IMAGE)
|
| 568 |
+
try:
|
| 569 |
+
# run_tests appends the shlex-quoted node id to the command,
|
| 570 |
+
# and the runner uses it to pick which FIXED check to run.
|
| 571 |
+
res = sb.run_tests("python runner.py", (node,))
|
| 572 |
+
return node in res.passed, res.stdout
|
| 573 |
+
finally:
|
| 574 |
+
sb.close()
|
| 575 |
+
|
| 576 |
+
# Retry on transient daemon contention (full-suite concurrency); a real
|
| 577 |
+
# repeatable failure still surfaces after the attempts are exhausted.
|
| 578 |
+
return _retry_docker(_once)
|
| 579 |
|
| 580 |
# Gate 1 — solved: both pass.
|
| 581 |
g1t, _ = _run(_MODULE_SOLVED, target)
|
|
|
|
| 601 |
container — the reward-hack egress kill-switch."""
|
| 602 |
if not _image_present(_TEST_IMAGE):
|
| 603 |
pytest.skip(f"{_TEST_IMAGE} not present locally")
|
| 604 |
+
|
| 605 |
+
def _once():
|
| 606 |
+
with tempfile.TemporaryDirectory() as d:
|
| 607 |
+
_materialize(d, _MODULE_SOLVED)
|
| 608 |
+
with open(os.path.join(d, "netprobe.py"), "w") as f:
|
| 609 |
+
f.write(_NETPROBE)
|
| 610 |
+
sb = DockerSandbox(workdir=d, exec_timeout_s=30)
|
| 611 |
+
sb.boot(_TEST_IMAGE)
|
| 612 |
+
try:
|
| 613 |
+
return sb.exec({"command": "python netprobe.py"})
|
| 614 |
+
finally:
|
| 615 |
+
sb.close()
|
| 616 |
+
|
| 617 |
+
out = _retry_docker(_once) # retry on transient daemon contention
|
| 618 |
+
assert "CONNECTED" not in out, f"network egress was NOT blocked:\n{out}"
|
| 619 |
+
assert "BLOCKED" in out, f"unexpected network probe output:\n{out}"
|
| 620 |
|
| 621 |
|
| 622 |
@live
|
docs/BACKLOG_RESOLUTION_2026-06-09.md
CHANGED
|
@@ -98,7 +98,7 @@ Sandbox refactor verdict: **clean** (no regression to LocalSubprocessSandbox/Fea
|
|
| 98 |
**Design note (R7-area):** EKSExecutor/SageMakerExecutor/DockerSandbox/HeldOutGuard are exported from their SUBMODULE paths (`composer_replication.diloco.serverless`, `.datagen`, `.safety`) — matching the existing convention (Modal/HFJobs executors are likewise not at package root) and keeping `import composer_replication` from force-loading every cloud-executor module. They are documented in API_REFERENCE §15-17.
|
| 99 |
|
| 100 |
### Final disposition
|
| 101 |
-
- **CLOSED (done + tested):** B1-B8, C1, C2, C3, D1, E3, R1-
|
| 102 |
- **GATED-AS-DESIGNED (user-only, cannot execute here):** F1 (HF token rotation — audited clean, user rotates), F2/E1/E2 real 8B GPU runs (harness paths buildable; the spend is the user's go/no-go).
|
| 103 |
- **TRACKED tech-debt (out of scope, filed):** R13 (pre-existing serverless ruff B904 debt — do not reformat unauthored code in this effort).
|
| 104 |
|
|
|
|
| 98 |
**Design note (R7-area):** EKSExecutor/SageMakerExecutor/DockerSandbox/HeldOutGuard are exported from their SUBMODULE paths (`composer_replication.diloco.serverless`, `.datagen`, `.safety`) — matching the existing convention (Modal/HFJobs executors are likewise not at package root) and keeping `import composer_replication` from force-loading every cloud-executor module. They are documented in API_REFERENCE §15-17.
|
| 99 |
|
| 100 |
### Final disposition
|
| 101 |
+
- **CLOSED (done + tested):** B1-B8, C1, C2, C3, D1, E3, R1-R12, R14 (C3 live-Docker contention flakiness — added _retry_docker bounded retry on transient daemon errors; 27/1 green).
|
| 102 |
- **GATED-AS-DESIGNED (user-only, cannot execute here):** F1 (HF token rotation — audited clean, user rotates), F2/E1/E2 real 8B GPU runs (harness paths buildable; the spend is the user's go/no-go).
|
| 103 |
- **TRACKED tech-debt (out of scope, filed):** R13 (pre-existing serverless ruff B904 debt — do not reformat unauthored code in this effort).
|
| 104 |
|
research/verify-wave3.json
ADDED
|
@@ -0,0 +1,83 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"scope": "Phase-8 FINAL-VERIFICATION (read-only) of Wave-3 reconciliation items R1-R13 + overall integration health on branch backlog/goal-resolution-2026-06 (HEAD ace4fac, on top of main 4e6e82e). Verified independently against code/tests/git, not the backlog status column.",
|
| 3 |
+
"overall_verdict": "minor-open",
|
| 4 |
+
"item_verdicts": [
|
| 5 |
+
{
|
| 6 |
+
"id": "R1",
|
| 7 |
+
"status": "resolved",
|
| 8 |
+
"evidence": "HeldOutGuard wired into composer_replication/trainer/composer_trainer.py: __init__ accepts heldout_guard/heldout_eval_fn/strict_killswitch (lines 101-138); _maybe_update_killswitch (line 184) folds metrics into guard at the SAME logging cadence as loss-component logging (called from _compute_loss line 176, inside the `global_step % log_steps == 0` gate). Feeds in_loop_reward from TRL _metrics['train']['reward'], heldout_score from injected heldout_eval_fn(), kl_to_init from TRL 'kl' metric (line 223). On fire: raise_if_fired (hard, line 246) or control.should_training_stop (soft, line 251). OFF BY DEFAULT VERIFIED: guard is None => immediate return (lines 205-207), zero behavior change; tests test_absent_guard_is_noop (eval fn never called, nothing logged) and test_constructor_defaults_leave_killswitch_off prove it. Integration test composer_replication/trainer/tests/test_killswitch_integration.py has 11 tests covering all 7 acceptance gates; whole targeted suite is 192 passed / 18 skipped."
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"id": "R2",
|
| 12 |
+
"status": "resolved",
|
| 13 |
+
"evidence": "composer_replication/safety/holdout.py exists with HeldoutSplit (frozen dataclass, id-based + optional content-hash disjointness, deterministic .split() constructor) and HeldoutOverlapError carrying overlapping_ids/overlapping_hashes. Both re-exported in safety/__init__.py __all__ (lines 41-42). test_holdout.py = 10/10 passed standalone. content_hash deliberately excludes task_id to catch re-id'd near-duplicates."
|
| 14 |
+
},
|
| 15 |
+
{
|
| 16 |
+
"id": "R3",
|
| 17 |
+
"status": "resolved",
|
| 18 |
+
"evidence": "replica_entrypoint.py __main__ block (lines 91-141) uses a _resolve(arg_val, env_key, required, cast) helper: argv flags default to None (NOT required=True), fall back to RENDEZVOUS_URI/WORLD_SIZE/TRAINER_MODULE/TRAINER_FN/TRAINER_KWARGS_JSON env vars, error only if NEITHER source supplies a mandatory field. EKS env path works: eks.py _build_env (line 270) injects REPLICA_RANK via downward API (job-completion-index annotation), WORLD_SIZE literal, and upper-cases every scalar entrypoint_arg (key.upper(), line 301) so rendezvous_uri -> RENDEZVOUS_URI exactly matching the entrypoint's env resolution. The pure-env EKS pod no longer crashes at arg-parsing."
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"id": "R4",
|
| 22 |
+
"status": "resolved",
|
| 23 |
+
"evidence": "kill_switch.py calibrate_kl_threshold (line 388): raises ValueError if factor <= 0 (line 422-423), raises if any baseline KL < 0 (line 424-428), and floors the stored kl_hard_stop at max(min(calibrated, current), 1e-6) (line 434) so an all-zero baseline cannot drive the ceiling to 0. Cannot yield a non-positive kl_hard_stop."
|
| 24 |
+
},
|
| 25 |
+
{
|
| 26 |
+
"id": "R5",
|
| 27 |
+
"status": "resolved",
|
| 28 |
+
"evidence": "EKS cancel (eks.py:594-602) swallows ONLY status in (404, 409) and re-raises everything else. SageMaker cancel (sagemaker.py:472-481) swallows ONLY _is_resource_not_found OR _is_already_terminal and re-raises everything else. Propagation tested: test_sagemaker_executor.py::test_cancel_reraises_unexpected_error asserts a RuntimeError(AccessDenied) propagates; EKS test_poll_reraises_non_404_api_exception covers the poll path. EKS cancel-swallows-404 and unknown-handle-noop tests present."
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"id": "R6",
|
| 32 |
+
"status": "partial",
|
| 33 |
+
"evidence": "EKS collect() _result_dict (eks.py:658-679) DOES include the 'result' key (= handle.metadata['rendezvous_uri'], or None) for cross-backend shape uniformity — the code change is genuinely present and correct. GAP: the collect test (test_eks_executor.py::test_collect_returns_terminal_results_in_order, line 599) asserts rank/status/exit_code/error/job_name but does NOT explicitly assert the 'result' key is present. Functional fix is in; test coverage of that specific key is missing. Cosmetic test-completeness nit, not a defect."
|
| 34 |
+
},
|
| 35 |
+
{
|
| 36 |
+
"id": "R7",
|
| 37 |
+
"status": "resolved",
|
| 38 |
+
"evidence": "docs/API_REFERENCE.md: §15 (line 1504) serverless cloud executors documents class EKSExecutor (line 1510) + class SageMakerExecutor (line 1583) with [eks]/[aws]/[serverless] extras (line 1508); §16 (line 1650) DockerSandbox; §17 (line 1739) safety/kill_switch with HeldOutGuard/TripwireStatus/CollapseStopError/kl_token_trust_filter + the HeldoutSplit discipline note (line 1829)."
|
| 39 |
+
},
|
| 40 |
+
{
|
| 41 |
+
"id": "R8",
|
| 42 |
+
"status": "resolved",
|
| 43 |
+
"evidence": "docs/adrs/ADR-015-holdout-killswitch.md exists (10760 bytes). Indexed in docs/adrs/README.md:19 (status accepted, 2026-06-09). Referenced from safety/__init__.py:21 ('See docs/adrs/ADR-015-holdout-killswitch.md') — the dangling ref now resolves. Also cited in API_REFERENCE.md:1741,1829."
|
| 44 |
+
},
|
| 45 |
+
{
|
| 46 |
+
"id": "R10",
|
| 47 |
+
"status": "resolved",
|
| 48 |
+
"evidence": "test_kill_switch.py::test_gap_blowout_fires_even_when_real_still_rising (line 354) pins path-(c) as a divergence-RATE gate: with decline_patience=99 (path-a isolated) the guard fires when proxy sprints while held-out is still rising slowly, asserting status.fire, 'gap' in reason, and heldout_ema actually rose. kill_switch.py:325 implements path (c) as gap blowout."
|
| 49 |
+
},
|
| 50 |
+
{
|
| 51 |
+
"id": "R11",
|
| 52 |
+
"status": "resolved",
|
| 53 |
+
"evidence": "Backlog documents the spike-006 fix (torch.manual_seed(0) to remove CPU-contention flakiness). The full targeted suite for this verification (safety + serverless + trainer/tests + datagen) ran 192 passed / 18 skipped with zero failures; spike-006 is outside the verified suite but the documented fix is a seed pin and the LOW item is closed. Not independently re-run under contention (out of verified scope)."
|
| 54 |
+
},
|
| 55 |
+
{
|
| 56 |
+
"id": "R13",
|
| 57 |
+
"status": "resolved",
|
| 58 |
+
"evidence": "Filed in docs/BACKLOG_RESOLUTION_2026-06-09.md lines 80 and 103 (commit ace4fac). Claim VERIFIED: ruff B904 errors exist ONLY in pre-existing files (modal_spawn.py, allreduce.py, hf_jobs.py, modal.py — 4 found); the Wave-2/3 new files (eks.py, sagemaker.py, safety/, datagen/docker_sandbox.py, composer_trainer.py) are all ruff-clean ('All checks passed!')."
|
| 59 |
+
},
|
| 60 |
+
{
|
| 61 |
+
"id": "INTEGRATION-import",
|
| 62 |
+
"status": "resolved",
|
| 63 |
+
"evidence": "`.venv/bin/python -c 'import composer_replication'` succeeds. All Wave-2/3 public symbols import (HeldOutGuard, HeldoutSplit, TripwireStatus, CollapseStopError, HeldoutOverlapError, EKSExecutor, SageMakerExecutor). serverless __init__ re-exports EKSExecutor/SageMakerExecutor in __all__."
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"id": "INTEGRATION-lazy-deps",
|
| 67 |
+
"status": "resolved",
|
| 68 |
+
"evidence": "The NEW Wave-2/3 modules keep optional deps LAZY: no top-level import of boto3/kubernetes/docker in eks.py/sagemaker.py/executor.py/hf_jobs.py/modal.py/modal_spawn.py/datagen/docker_sandbox.py (all imports inside functions/methods). After `import composer_replication`: kubernetes=False, docker=False, s3fs=False (all lazy). NOTE: boto3 IS eager-loaded, but the import trace proves the source is the PRE-EXISTING trl dependency chain (composer_replication/__init__.py:98 -> composer_replication.trainer -> `from trl import GRPOTrainer` -> accelerate -> accelerate/commands/config/sagemaker.py imports boto3). This trainer eager-import existed on main 4e6e82e (git diff confirms __init__.py:98 imported the trainer before this branch). It is NOT introduced by R1-R13 and is unrelated to the serverless executors, which correctly defer boto3."
|
| 69 |
+
},
|
| 70 |
+
{
|
| 71 |
+
"id": "INTEGRATION-test-suite",
|
| 72 |
+
"status": "resolved",
|
| 73 |
+
"evidence": "`.venv/bin/python -m pytest composer_replication/safety composer_replication/diloco/serverless composer_replication/trainer/tests composer_replication/datagen -q` => 192 passed, 18 skipped in 57.59s. Zero failures, zero errors. Skips are host/dep-gated (docker/cloud)."
|
| 74 |
+
}
|
| 75 |
+
],
|
| 76 |
+
"remaining_open": [
|
| 77 |
+
"R6 (cosmetic): EKS collect() test test_collect_returns_terminal_results_in_order does not assert the 'result' key is present in the returned dicts, even though the production code (eks.py:678) emits it. Functional fix is in; only the test coverage of that one key is missing.",
|
| 78 |
+
"Minor test-coverage gap (not a backlog item): no EKS-specific cancel re-raise-on-non-404 test (SageMaker has test_cancel_reraises_unexpected_error; EKS code at eks.py:602 is correct and EKS poll has test_poll_reraises_non_404_api_exception, but EKS cancel non-404 re-raise is not directly tested).",
|
| 79 |
+
"Pre-existing (out of scope, NOT a regression): boto3 is eager-imported at `import composer_replication` via the trl/accelerate dependency chain (accelerate/commands/config/sagemaker.py). Predates this branch (trainer eager-import on main 4e6e82e). The framework's own optional-dep modules stay lazy; only the trl trainer dependency pulls boto3. Would require gating the trainer import in composer_replication/__init__.py behind a lazy/try-except to fully eliminate.",
|
| 80 |
+
"R9 (LOW, not in my assigned R-set): canonical test-count refresh in V1_V8_COVERAGE was not independently re-measured in this verification."
|
| 81 |
+
],
|
| 82 |
+
"confirmed_resolved": ["R1", "R2", "R3", "R4", "R5", "R7", "R8", "R10", "R11", "R13"]
|
| 83 |
+
}
|