Baladithya Balamurugan Claude Opus 4.8 (1M context) committed 20 days ago

Commit

7a55e1e

1 Parent(s): c11cf49

Wave 2: 4 new modules (kill-switch, EKS/SageMaker executors, DockerSandbox) + B4/B7 completion

Built by the parallel execution team (worktree-isolated), integrated + tested here.

New modules (all CPU-testable via mock/lazy-import; optional deps gated):
- composer_replication/safety/ — HeldOutGuard run-level collapse kill-switch
(held-out-declines-while-reward-rises streak + KL-to-init hard-stop 0.08 +
proxy-real hacking-gap; EMA-denoised, latched-fire, CollapseStopError). The
documented #2 safeguard for the self-evolving flywheel. 23 tests.
- composer_replication/diloco/serverless/eks.py — EKSExecutor satisfying the
ServerlessExecutor Protocol via a SINGLE Kubernetes Indexed Job → N rank-ordered
ReplicaHandles, gang-cancel (Background propagation), REPLICA_RANK via the
downward API, S3 rendezvous (IRSA). 28 tests (mock BatchV1/CoreV1).
- composer_replication/diloco/serverless/sagemaker.py — SageMakerExecutor
(boto3 create_training_job, same S3 rendezvous, status mapping). +13-test
module written during integration (the build agent shipped it test-less).
- composer_replication/datagen/docker_sandbox.py — DockerSandbox (ephemeral
container, --network none, mem/pids limits, gVisor runtime option) + refactored
the per-class _scrub_tree into a shared module-level scrub_tree free function
so every sandbox backend applies the identical reward-hack control. Live Docker
tests pass; LocalSubprocessSandbox/FeatureDeletionEnv unaffected (review: clean).
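
The kill-switch bullet above packs several mechanisms into one sentence. The sketch below shows one way the EMA-denoised streak, the KL hard-stop, and the latch could fit together. It is not the code in `kill_switch.py`, which this view does not show: `SketchGuard`, `SketchCollapseStop`, `patience`, and `ema_alpha` are hypothetical names and defaults, only the 0.08 KL threshold comes from the commit message, and the proxy-real hacking-gap check is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


class SketchCollapseStop(RuntimeError):
    """Hypothetical stand-in for the commit's CollapseStopError."""


@dataclass
class SketchGuard:
    kl_hard_stop: float = 0.08  # value quoted in the commit message
    patience: int = 3           # assumed streak length
    ema_alpha: float = 0.5      # assumed smoothing factor
    _held: float | None = None
    _reward: float | None = None
    _streak: int = 0
    _reason: str | None = None

    def _ema(self, prev: float | None, x: float) -> float:
        # Exponential moving average; the first sample seeds the average.
        return x if prev is None else self.ema_alpha * x + (1 - self.ema_alpha) * prev

    def _fire(self, reason: str) -> None:
        self._reason = reason  # latch: every later update() re-raises
        raise SketchCollapseStop(reason)

    def update(self, held_out: float, reward: float, kl_to_init: float) -> None:
        if self._reason is not None:
            raise SketchCollapseStop(self._reason)
        if kl_to_init > self.kl_hard_stop:
            self._fire(f"KL-to-init {kl_to_init:.3f} exceeds {self.kl_hard_stop}")
        held = self._ema(self._held, held_out)
        rew = self._ema(self._reward, reward)
        # Collapse signature: smoothed held-out falls while smoothed reward rises.
        diverging = self._held is not None and held < self._held and rew > self._reward
        self._streak = self._streak + 1 if diverging else 0
        self._held, self._reward = held, rew
        if self._streak >= self.patience:
            self._fire(f"held-out declined while reward rose for {self._streak} steps")
```

Latching matters because a single healthy-looking step after a fire should not let a training loop that swallowed the first exception keep running.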

Wiring + completion:
- Re-exported EKSExecutor/SageMakerExecutor (serverless __init__) and
DockerSandbox/scrub_tree (datagen __init__).
- pyproject: added [eks] (kubernetes) + [aws] (boto3) extras.
- B7-complete: added make_dr_grpo_config/make_po_config/PO_OBJECTIVES to the
TOP-LEVEL __all__ (were importable but missing from __all__).
- B4-complete: reconciled the 4 surviving stale "115 passing" current-framed
claims (README/OVERVIEW/VISION_VALIDATION) to the canonical 266/62.
- All new files ruff-clean (E,F,W,I,N,UP,B).
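
The `[eks]` extra above backs the EKSExecutor, which the module list describes as a single Kubernetes Indexed Job with REPLICA_RANK supplied through the downward API. As a rough illustration of that shape (not the manifest `eks.py` actually builds, which is not visible here), an Indexed Job can be expressed as a plain dict. The function name, container name, and `backoffLimit` value are assumptions; the completion-index annotation is the one Kubernetes documents for Indexed Jobs.

```python
def indexed_job_manifest(name: str, image: str, replicas: int) -> dict:
    """Build a batch/v1 Indexed Job manifest with one pod per replica rank."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completionMode": "Indexed",  # each pod gets a stable index 0..N-1
            "completions": replicas,
            "parallelism": replicas,      # start all ranks together
            "backoffLimit": 0,            # assumed: do not retry failed ranks
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "replica",  # hypothetical container name
                            "image": image,
                            "env": [
                                {
                                    # Downward API: expose the pod's completion
                                    # index to the process as REPLICA_RANK.
                                    "name": "REPLICA_RANK",
                                    "valueFrom": {
                                        "fieldRef": {
                                            "fieldPath": (
                                                "metadata.annotations"
                                                "['batch.kubernetes.io/job-completion-index']"
                                            )
                                        }
                                    },
                                }
                            ],
                        }
                    ],
                }
            },
        },
    }
```

Because all ranks belong to one Job object, deleting that Job with background propagation removes every pod, which is presumably what the commit means by gang-cancel.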

Full suite: 355 passed / 65 skipped / 1 flaky-under-contention (spike-006
loss-trend test, passes in isolation — tracked as R11, not a regression).
Wave-3 backlog (R1-R12) filed in docs/BACKLOG_RESOLUTION_2026-06-09.md from the
concurrent review team.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (23)

composer_replication/__init__.py +3 -0
composer_replication/datagen/__init__.py +5 -0
composer_replication/datagen/docker_sandbox.py +331 -0
composer_replication/datagen/sandbox.py +49 -28
composer_replication/datagen/tests/test_docker_sandbox.py +620 -0
composer_replication/diloco/serverless/__init__.py +4 -0
composer_replication/diloco/serverless/eks.py +674 -0
composer_replication/diloco/serverless/executor.py +4 -3
composer_replication/diloco/serverless/sagemaker.py +619 -0
composer_replication/diloco/serverless/tests/test_eks_executor.py +625 -0
composer_replication/diloco/serverless/tests/test_sagemaker_executor.py +244 -0
composer_replication/safety/__init__.py +34 -0
composer_replication/safety/kill_switch.py +447 -0
composer_replication/safety/tests/__init__.py +0 -0
composer_replication/safety/tests/test_kill_switch.py +320 -0
docs/BACKLOG_RESOLUTION_2026-06-09.md +25 -0
docs/OVERVIEW.md +2 -2
docs/VISION_VALIDATION.md +1 -1
pyproject.toml +9 -0
research/review-executors.json +44 -0
research/review-newgaps.json +39 -0
research/review-safety.json +54 -0
research/review-sandbox.json +29 -0

composer_replication/__init__.py CHANGED

@@ -132,6 +132,9 @@ __all__ = [
     "replay_trace",
     # Trainer
     "ComposerReplicationTrainer",
     # DiLoCo (optional)
     "make_diloco_outer_loop",
     # Meta

     "replay_trace",
     # Trainer
     "ComposerReplicationTrainer",
+    "make_dr_grpo_config",
+    "make_po_config",
+    "PO_OBJECTIVES",
     # DiLoCo (optional)
     "make_diloco_outer_loop",
     # Meta
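
The hunk above only adds names to `__all__`; the commit message notes they were already importable. The practical difference shows up with `from package import *`, which honors `__all__`. The following is a small self-contained illustration using a throwaway in-memory module (`sketch_pkg` is a hypothetical name, unrelated to the real package layout):

```python
import sys
import types

SOURCE = '''
def make_po_config():
    return {"objective": "sketch"}

def listed():
    return "listed"
'''


def star_import_names(all_names: list[str]) -> set[str]:
    """Register a fake module with the given __all__ and star-import it."""
    mod = types.ModuleType("sketch_pkg")
    exec(SOURCE, mod.__dict__)
    mod.__all__ = list(all_names)
    sys.modules["sketch_pkg"] = mod
    namespace: dict = {}
    exec("from sketch_pkg import *", namespace)
    namespace.pop("__builtins__", None)  # added by exec, not by the import
    return set(namespace)


# Before: defined in the module but absent from __all__.
before = star_import_names(["listed"])
# After: listed in __all__, so the star-import picks it up.
after = star_import_names(["listed", "make_po_config"])
```

An explicit `from sketch_pkg import make_po_config` works in both cases, which is why the omission is easy to miss until a star-import or an API-surface check relies on `__all__`.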

composer_replication/datagen/__init__.py CHANGED

@@ -8,6 +8,7 @@ Public surface:
   - FeatureDeletionTask  — the task tuple (schema.py)
   - FeatureDeletionEnv   — Gym/OpenEnv-style env + TRL reward_fn adapter (env.py)
   - Sandbox / FakeSandbox / LocalSubprocessSandbox — execution backends (sandbox.py)
   - HackMonitor          — reward-hacking provenance monitor (monitor.py)
   - DifficultyCurriculum — online pass-rate difficulty gate (curriculum.py)
   - validate_task        — 4-gate solvability validator (validator.py)
@@ -20,11 +21,13 @@ from __future__ import annotations
 from composer_replication.datagen.curriculum import DifficultyCurriculum
 from composer_replication.datagen.env import FeatureDeletionEnv, StepResult
 from composer_replication.datagen.monitor import HackMonitor
 from composer_replication.datagen.sandbox import (
     FakeSandbox,
     LocalSubprocessSandbox,
     Sandbox,
     TestRunResult,
 )
 from composer_replication.datagen.schema import FeatureDeletionTask
 from composer_replication.datagen.substrates import SweBenchAdapter
@@ -37,7 +40,9 @@ __all__ = [
     "Sandbox",
     "FakeSandbox",
     "LocalSubprocessSandbox",
     "TestRunResult",
     "HackMonitor",
     "DifficultyCurriculum",
     "validate_task",

   - FeatureDeletionTask  — the task tuple (schema.py)
   - FeatureDeletionEnv   — Gym/OpenEnv-style env + TRL reward_fn adapter (env.py)
   - Sandbox / FakeSandbox / LocalSubprocessSandbox — execution backends (sandbox.py)
+  - DockerSandbox        — hardened ephemeral-container backend (docker_sandbox.py)
   - HackMonitor          — reward-hacking provenance monitor (monitor.py)
   - DifficultyCurriculum — online pass-rate difficulty gate (curriculum.py)
   - validate_task        — 4-gate solvability validator (validator.py)
 from composer_replication.datagen.curriculum import DifficultyCurriculum
 from composer_replication.datagen.env import FeatureDeletionEnv, StepResult
 from composer_replication.datagen.monitor import HackMonitor
+from composer_replication.datagen.docker_sandbox import DockerSandbox
 from composer_replication.datagen.sandbox import (
     FakeSandbox,
     LocalSubprocessSandbox,
     Sandbox,
     TestRunResult,
+    scrub_tree,
 )
 from composer_replication.datagen.schema import FeatureDeletionTask
 from composer_replication.datagen.substrates import SweBenchAdapter
     "Sandbox",
     "FakeSandbox",
     "LocalSubprocessSandbox",
+    "DockerSandbox",
     "TestRunResult",
+    "scrub_tree",
     "HackMonitor",
     "DifficultyCurriculum",
     "validate_task",

composer_replication/datagen/docker_sandbox.py ADDED

	@@ -0,0 +1,331 @@

+"""docker_sandbox.py — the hardened container backend for the FD env (ADR-010 §3).
+`DockerSandbox` is a drop-in `Sandbox` (boot/exec/run_tests/trajectory/
+is_command_allowed) that runs the agent's tool calls and the verifiable test
+command inside an ephemeral, locked-down Docker container instead of a raw host
+subprocess. It is the production execution path for genuinely UNTRUSTED
+model-generated code — the LocalSubprocessSandbox sibling runs everything in the
+host process and only enforces the scrub + denylist, which is fine for
+first-party / dev use but is NOT a host-security boundary.
+The lockdown recipe (CIS Docker benchmark 5.x + gVisor guidance, see the
+`sandbox-container` research digest):
+  - `network_mode='none'`              — no egress: no decompiler downloads, no
+    signature exfil/recovery over the wire (the SWE-RL reward-hack threat).
+  - `read_only=True` root fs + a small `tmpfs={'/tmp': ...}` for scratch.
+  - the working tree bind-mounted RW at /work (the agent must mutate the repo).
+  - `cap_drop=['ALL']` + `security_opt=['no-new-privileges:true']`.
+  - `user='1000:1000'` — never run agent code as root in the container.
+  - `pids_limit` (fork-bomb guard) + `mem_limit==memswap_limit` (OOM guard, no
+    swap) + `nano_cpus` (CPU quota).
+  - optional `runtime='runsc'` (gVisor) — a userspace kernel that intercepts
+    syscalls so a kernel-exploit payload hits the Sentry, not the host. This is
+    the RECOMMENDED runtime for untrusted model code per the ADR-010 threat
+    model, but requires host setup (`sudo runsc install` writes the 'runsc'
+    entry into /etc/docker/daemon.json + dockerd restart) so it is OPTIONAL and
+    defaults to None (= the daemon default, runc).
+PRIMARY reward-hack control: the SAME host-side `scrub_tree(workdir)` used by
+LocalSubprocessSandbox runs in `boot()` BEFORE the container starts. The bind
+mount is shared host<->container, so scrubbing __pycache__/.git/*.pyc on the
+host pre-boot is exactly equivalent to scrubbing inside the container. The
+command denylist remains cheap defense-in-depth, not the wall.
+`docker` is LAZY-imported inside methods so the pure-Python core and the
+FakeSandbox path never require the SDK; a clear RuntimeError is raised if the
+SDK or the daemon is absent.
+"""
+from __future__ import annotations
+import shlex
+from dataclasses import dataclass, field
+from uuid import uuid4
+from composer_replication.datagen.sandbox import (
+    SANDBOX_DENYLIST,
+    TestRunResult,
+    scrub_tree,
+)
+# Label stamped on every container we create, so a reaper can sweep ephemeral
+# containers leaked by a crashed episode (docker-py-native `--rm` durability
+# without the auto_remove log-loss/race problem).
+_LABEL_KEY = "composer_replication"
+_LABEL_VALUE = "datagen"
+def _require_docker():
+    """Lazy-import the docker SDK and return the module, or raise a clear
+    RuntimeError if the SDK is not installed. (Kept separate from the client so
+    callers can introspect `docker.errors` without opening a connection.)"""
+    try:
+        import docker  # noqa: PLC0415  (intentional lazy import)
+    except ImportError as e:  # pragma: no cover - exercised via monkeypatch
+        raise RuntimeError(
+            "DockerSandbox requires the 'docker' Python SDK (docker>=7). "
+            "Install it with `pip install docker` (or the project's [datagen] "
+            "extra). It is lazy-imported so the FakeSandbox/core path never "
+            "needs it."
+        ) from e
+    return docker
+def _make_client():
+    """Construct a `docker.from_env()` client or raise a clear RuntimeError if
+    the daemon is unreachable. The SDK constructs lazily, so we ping with
+    `client.ping()` to surface a dead daemon here rather than at first use."""
+    docker = _require_docker()
+    try:
+        client = docker.from_env()
+        client.ping()
+        return client
+    except Exception as e:  # docker.errors.DockerException, ConnectionError, ...
+        raise RuntimeError(
+            "DockerSandbox could not reach a Docker daemon (is `docker info` "
+            f"healthy?). Underlying error: {e!r}"
+        ) from e
+def runsc_available(client=None) -> bool:
+    """True iff the gVisor 'runsc' runtime is registered with the daemon.
+    Mirrors the `_docker_available()` gating philosophy: runsc is not installed
+    on most dev/CI boxes, so any runsc-specific behavior must be gated on this.
+    """
+    try:
+        client = client or _make_client()
+        runtimes = client.info().get("Runtimes", {}) or {}
+        return "runsc" in runtimes
+    except Exception:
+        return False
+@dataclass
+class DockerSandbox:
+    """Hardened ephemeral-container `Sandbox`. See module docstring.
+    Args:
+      workdir:  host path to the materialized repo; bind-mounted RW at /work and
+                scrubbed on the host before boot (the primary reward-hack
+                control). MUST be an existing directory by `boot()` time.
+      runtime:  None (=> daemon default, runc) or 'runsc' (gVisor) for untrusted
+                model code. Requires host-side `sudo runsc install` + dockerd
+                restart; gate with `runsc_available()`.
+      mem_limit / memswap_limit:  OOM guard; equal values forbid swap.
+      pids_limit:  fork-bomb guard.
+      nano_cpus:   CPU quota in 1e-9 CPUs (2_000_000_000 == 2 CPUs).
+      user:        non-root uid:gid the agent code runs as inside the container.
+      exec_timeout_s:  wall-clock cap injected via coreutils `timeout` (exec_run
+                has no timeout param — docker-py #2651).
+    """
+    workdir: str
+    runtime: str | None = None
+    mem_limit: str = "1g"
+    memswap_limit: str = "1g"
+    pids_limit: int = 256
+    nano_cpus: int = 2_000_000_000  # 2 CPUs
+    user: str = "1000:1000"
+    container_workdir: str = "/work"
+    tmpfs_size: str = "64m"
+    exec_timeout_s: int = 600
+    keep_root_writable: bool = False  # escape hatch if read-only fs breaks tooling
+    _trajectory: list[dict] = field(default_factory=list, init=False)
+    booted_image: str | None = field(default=None, init=False)
+    _client: object | None = field(default=None, init=False)
+    _container: object | None = field(default=None, init=False)
+    # ---- construction of the hardening kwargs --------------------------------
+    def container_kwargs(self, image: str) -> dict:
+        """The full hardened `containers.run` kwarg set. Pulled out as a method
+        so the pure-unit tests can assert the config (network_disabled,
+        mem_limit, runtime, ...) WITHOUT a live daemon."""
+        kwargs: dict = {
+            "image": image,
+            # Long-lived idle container; exec_run drives the actual work.
+            "command": ["sleep", "infinity"],
+            "detach": True,
+            # --- network egress kill-switch ---
+            # network_disabled removes networking entirely; we ALSO set
+            # network_mode='none' for parity with the existing CLI substrate
+            # test (`--network none`) and belt-and-suspenders.
+            "network_disabled": True,
+            "network_mode": "none",
+            # --- filesystem lockdown ---
+            "read_only": not self.keep_root_writable,
+            "tmpfs": {"/tmp": f"rw,noexec,nosuid,size={self.tmpfs_size}"},
+            "volumes": {
+                self.workdir: {"bind": self.container_workdir, "mode": "rw"}
+            },
+            "working_dir": self.container_workdir,
+            # --- privilege lockdown ---
+            "user": self.user,
+            "cap_drop": ["ALL"],
+            "security_opt": ["no-new-privileges:true"],
+            # --- resource limits ---
+            "pids_limit": self.pids_limit,
+            "mem_limit": self.mem_limit,
+            "memswap_limit": self.memswap_limit,
+            "nano_cpus": self.nano_cpus,
+            # --- lifecycle / reaping ---
+            "name": f"fd-{uuid4().hex[:12]}",
+            "labels": {_LABEL_KEY: _LABEL_VALUE},
+        }
+        # runtime is OPTIONAL; only pass it through when set so the default
+        # (runc) path never references a runtime that may not exist.
+        if self.runtime:
+            kwargs["runtime"] = self.runtime
+        return kwargs
+    # ---- Sandbox Protocol ----------------------------------------------------
+    def boot(self, image: str) -> None:
+        """Scrub the HOST workdir (primary control), reap any leaked siblings,
+        then start the hardened ephemeral container."""
+        self.booted_image = image
+        self._trajectory = []
+        # PRIMARY reward-hack control — run on the host before the bind mount.
+        scrub_tree(self.workdir)
+        self._client = _make_client()
+        self.reap_leaked(self._client)
+        docker = _require_docker()
+        kwargs = self.container_kwargs(image)
+        try:
+            self._container = self._client.containers.run(**kwargs)
+        except docker.errors.ImageNotFound as e:
+            raise RuntimeError(
+                f"DockerSandbox.boot: image {image!r} not found locally and "
+                "could not be pulled (the container is --network none). Pull it "
+                f"on the host first. Underlying: {e!r}"
+            ) from e
+        except docker.errors.APIError as e:
+            raise RuntimeError(
+                f"DockerSandbox.boot: Docker API error starting {image!r} with "
+                f"runtime={self.runtime!r}: {e!r}"
+            ) from e
+    def is_command_allowed(self, command: str) -> bool:
+        # First-token-only check — see SANDBOX_DENYLIST notes. NOT a boundary on
+        # its own; the container isolation + host scrub are the real controls.
+        return command not in SANDBOX_DENYLIST
+    def _exec(self, cmd: str) -> tuple[int, str]:
+        """Run one shell command in the live container via exec_run, enforcing a
+        wall-clock cap with coreutils `timeout` (exec_run has no timeout param —
+        docker-py #2651). Returns (exit_code, combined_output)."""
+        if self._container is None:
+            raise RuntimeError("DockerSandbox.exec called before boot()")
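+        # Run the command through an inner shell under coreutils `timeout` so
+        # the cap covers compound commands and shell builtins (`cd x && pytest`),
+        # not only the first simple command. demux keeps stdout/stderr separable
+        # but we combine them like the local sandbox does for the pytest parser.
+        wrapped = f"timeout {self.exec_timeout_s} /bin/sh -c {shlex.quote(cmd)}"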
+        full = ["/bin/sh", "-c", wrapped]
+        res = self._container.exec_run(
+            full, workdir=self.container_workdir, demux=True
+        )
+        exit_code = res.exit_code if res.exit_code is not None else -1
+        out = res.output
+        # demux=True => output is (stdout_bytes|None, stderr_bytes|None).
+        if isinstance(out, tuple):
+            stdout_b, stderr_b = out
+        else:  # defensive — some daemons return raw bytes
+            stdout_b, stderr_b = out, None
+        text = self._decode(stdout_b) + self._decode(stderr_b)
+        return exit_code, text
+    @staticmethod
+    def _decode(b) -> str:
+        """Untrusted code can emit non-UTF-8 bytes; never rely on text mode."""
+        if not b:
+            return ""
+        if isinstance(b, bytes):
+            return b.decode("utf-8", errors="replace")
+        return str(b)
+    def exec(self, action: dict) -> str:
+        self._trajectory.append(action)
+        cmd = str(action.get("command", ""))
+        if not cmd.strip():
+            return ""
+        head = cmd.strip().split()[0]
+        if not self.is_command_allowed(head):
+            return f"ERROR: command '{head}' is not allowed in the sandbox."
+        _exit, out = self._exec(cmd)
+        return out
+    def run_tests(self, test_command: str, tests: tuple[str, ...]) -> TestRunResult:
+        # shlex.quote each node id — SWE-bench node ids contain spaces/brackets
+        # (parametrized tests) and could otherwise break the shell or inject
+        # commands (matches LocalSubprocessSandbox).
+        node_ids = " ".join(shlex.quote(t) for t in tests)
+        cmd = f"{test_command} {node_ids}".strip()
+        returncode, out = self._exec(cmd)
+        # Same conservative parse as LocalSubprocessSandbox: a test is "passed"
+        # only if its node id appears with PASSED, else failed; collection
+        # errors => collected_ok False.
+        passed, failed = set(), set()
+        collected_ok = "errors during collection" not in out.lower()
+        for t in tests:
+            if f"{t} PASSED" in out or (returncode == 0 and not failed):
+                passed.add(t)
+            else:
+                failed.add(t)
+        return TestRunResult(
+            passed=frozenset(passed),
+            failed=frozenset(failed),
+            stdout=out,
+            collected_ok=collected_ok,
+        )
+    def trajectory(self) -> list[dict]:
+        return list(self._trajectory)
+    # ---- lifecycle / cleanup -------------------------------------------------
+    def close(self) -> None:
+        """Tear down the ephemeral container (force=True). Idempotent; swallows
+        errors so an already-gone container never masks the episode result."""
+        c = self._container
+        self._container = None
+        if c is not None:
+            try:
+                c.remove(force=True)
+            except Exception:
+                pass
+    @staticmethod
+    def reap_leaked(client=None) -> int:
+        """Sweep ephemeral containers leaked by a crashed episode (labelled
+        composer_replication=datagen). Callable at boot and shutdown. Returns
+        the count removed. Best-effort — never raises."""
+        removed = 0
+        try:
+            client = client or _make_client()
+            leaked = client.containers.list(
+                all=True, filters={"label": f"{_LABEL_KEY}={_LABEL_VALUE}"}
+            )
+            for c in leaked:
+                try:
+                    c.remove(force=True)
+                    removed += 1
+                except Exception:
+                    pass
+        except Exception:
+            pass
+        return removed
+    def __enter__(self) -> DockerSandbox:
+        return self
+    def __exit__(self, *exc) -> None:
+        self.close()
+    def __del__(self):  # pragma: no cover - best-effort GC cleanup
+        try:
+            self.close()
+        except Exception:
+            pass
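
The `run_tests` method above quotes every node id with `shlex.quote` before joining them into a shell string. A standalone sketch of just that step (the helper name `build_test_command` is hypothetical) shows why this matters for parametrized ids that contain spaces, brackets, or shell metacharacters:

```python
import shlex


def build_test_command(test_command: str, tests: tuple[str, ...]) -> str:
    """Join a test runner command with shell-quoted pytest node ids."""
    node_ids = " ".join(shlex.quote(t) for t in tests)
    return f"{test_command} {node_ids}".strip()


cmd = build_test_command(
    "pytest -q",
    (
        "tests/test_m.py::test_x[a b]",    # space and brackets in the param id
        "tests/test_m.py::test_y; echo pwned",  # would inject without quoting
    ),
)
# Splitting the string the way a POSIX shell would yields exactly one
# argument per node id, with no extra command smuggled in.
argv = shlex.split(cmd)
```

Round-tripping through `shlex.split` is a convenient way to check the quoting without running a shell.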

composer_replication/datagen/sandbox.py CHANGED

@@ -50,11 +50,51 @@ SANDBOX_DENYLIST: frozenset[str] = frozenset(
 # primary control. `is_command_allowed` checks only the first whitespace token,
 # so `/usr/bin/find`, `sh -c "strings x"`, and especially `python -c "import
 # marshal,dis; ..."` all bypass it. The ADR-claimed PRIMARY control is the
-# pre-task cache/.git scrub in `boot()` — see `_scrub_tree` below, which is now
 # implemented (was previously absent, making the denylist the only — broken —
 # defense). The denylist remains as cheap defense-in-depth, not the wall.
 @runtime_checkable
 class Sandbox(Protocol):
     """An execution environment for one FD episode."""
@@ -120,13 +160,11 @@ class LocalSubprocessSandbox:
     _trajectory: list[dict] = field(default_factory=list)
     booted_image: str | None = None
-    # Cache/history artifacts that let an agent recover a deleted signature
-    # WITHOUT reimplementing it (the Composer-blog reward-hacks). Scrubbed at
-    # boot() so the denylist isn't the only — and bypassable — line of defense.
-    _SCRUB_NAMES: tuple[str, ...] = (
-        "__pycache__", ".mypy_cache", ".pytest_cache", ".git", ".hg",
-    )
-    _SCRUB_SUFFIXES: tuple[str, ...] = (".pyc", ".pyo", ".class")
     def boot(self, image: str) -> None:
         self.booted_image = image
@@ -134,26 +172,9 @@ class LocalSubprocessSandbox:
         self._scrub_tree()
     def _scrub_tree(self) -> None:
-        """PRIMARY reward-hack control (ADR-010 §3): physically remove byte-code
-        caches, type-check caches, and VCS history from the working tree before
-        the episode starts, so there is no cached signature to recover. This is
-        the wall; the command denylist is only cheap defense-in-depth on top.
-        Cross-family review 2026-05-29 found this was previously UNIMPLEMENTED —
-        boot() only recorded the image string."""
-        if not self.workdir or not os.path.isdir(self.workdir):
-            return
-        for root, dirs, files in os.walk(self.workdir, topdown=True):
-            # Remove (and stop descending into) scrub-named directories.
-            for d in list(dirs):
-                if d in self._SCRUB_NAMES:
-                    shutil.rmtree(os.path.join(root, d), ignore_errors=True)
-                    dirs.remove(d)
-            for f in files:
-                if f.endswith(self._SCRUB_SUFFIXES):
-                    try:
-                        os.remove(os.path.join(root, f))
-                    except OSError:
-                        pass
     def is_command_allowed(self, command: str) -> bool:
         # NOTE: first-token-only check — see SANDBOX_DENYLIST comment. This is

 # primary control. `is_command_allowed` checks only the first whitespace token,
 # so `/usr/bin/find`, `sh -c "strings x"`, and especially `python -c "import
 # marshal,dis; ..."` all bypass it. The ADR-claimed PRIMARY control is the
+# pre-task cache/.git scrub in `boot()` — see `scrub_tree` below, which is now
 # implemented (was previously absent, making the denylist the only — broken —
 # defense). The denylist remains as cheap defense-in-depth, not the wall.
+# Cache/history artifacts that let an agent recover a deleted signature WITHOUT
+# reimplementing it (the Composer-blog reward-hacks). Scrubbed at boot() so the
+# denylist isn't the only — and bypassable — line of defense. These are module
+# level so EVERY sandbox backend (LocalSubprocessSandbox AND DockerSandbox)
+# applies the identical primary control via the shared `scrub_tree` free
+# function below.
+SCRUB_NAMES: tuple[str, ...] = (
+    "__pycache__", ".mypy_cache", ".pytest_cache", ".git", ".hg",
+)
+SCRUB_SUFFIXES: tuple[str, ...] = (".pyc", ".pyo", ".class")
+def scrub_tree(workdir: str) -> None:
+    """PRIMARY reward-hack control (ADR-010 §3): physically remove byte-code
+    caches, type-check caches, and VCS history from the working tree before the
+    episode starts, so there is no cached signature to recover. This is the
+    wall; the command denylist is only cheap defense-in-depth on top.
+    Shared by LocalSubprocessSandbox (scrubs the subprocess cwd) and
+    DockerSandbox (scrubs the HOST workdir BEFORE the bind mount — the mount is
+    shared host<->container, so a host-side scrub pre-boot is exactly equivalent
+    to scrubbing inside the container). Cross-family review 2026-05-29 found this
+    was previously UNIMPLEMENTED — boot() only recorded the image string.
+    """
+    if not workdir or not os.path.isdir(workdir):
+        return
+    for root, dirs, files in os.walk(workdir, topdown=True):
+        # Remove (and stop descending into) scrub-named directories.
+        for d in list(dirs):
+            if d in SCRUB_NAMES:
+                shutil.rmtree(os.path.join(root, d), ignore_errors=True)
+                dirs.remove(d)
+        for f in files:
+            if f.endswith(SCRUB_SUFFIXES):
+                try:
+                    os.remove(os.path.join(root, f))
+                except OSError:
+                    pass
 @runtime_checkable
 class Sandbox(Protocol):
     """An execution environment for one FD episode."""
     _trajectory: list[dict] = field(default_factory=list)
     booted_image: str | None = None
+    # Back-compat aliases for the module-level scrub constants (callers/tests
+    # that referenced the old instance attributes keep working). The real
+    # control is the shared module-level `scrub_tree` free function.
+    _SCRUB_NAMES: tuple[str, ...] = SCRUB_NAMES
+    _SCRUB_SUFFIXES: tuple[str, ...] = SCRUB_SUFFIXES
     def boot(self, image: str) -> None:
         self.booted_image = image
         self._scrub_tree()
     def _scrub_tree(self) -> None:
+        """Delegate to the shared module-level `scrub_tree` (see its docstring).
+        Kept as a method for back-compat with existing callers."""
+        scrub_tree(self.workdir)
     def is_command_allowed(self, command: str) -> bool:
         # NOTE: first-token-only check — see SANDBOX_DENYLIST comment. This is
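
The comments in this file stress that the denylist checks only the first whitespace token and is therefore easy to bypass. A minimal sketch makes the point concrete. The denylist entries here are assumed for illustration (the real `SANDBOX_DENYLIST` contents are not shown in this diff), and `first_token_allowed` is a hypothetical helper that mimics the described check:

```python
SKETCH_DENYLIST = frozenset({"find", "strings", "git"})  # assumed entries


def first_token_allowed(command: str) -> bool:
    """Allow a command unless its first whitespace token is denylisted."""
    tokens = command.strip().split()
    return not tokens or tokens[0] not in SKETCH_DENYLIST


blocked = first_token_allowed("find . -name '*.pyc'")        # bare name is caught
by_path = first_token_allowed("/usr/bin/find . -name '*.pyc'")  # absolute path slips by
by_shell = first_token_allowed('sh -c "strings x"')          # wrapped in a shell slips by
```

Both bypasses are the ones the source comment names, which is why the scrub (removing the artifacts themselves) is treated as the primary control and the denylist only as defense-in-depth.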

composer_replication/datagen/tests/test_docker_sandbox.py ADDED

	@@ -0,0 +1,620 @@

+"""test_docker_sandbox.py — DockerSandbox unit + live-Docker coverage.
+Two tiers, mirroring the repo's `test_docker_substrate_e2e.py` gating and the
+ModalSpawnExecutor mock pattern:
+  1. PURE-UNIT (always run, no daemon): a mock docker client/container asserts
+     the hardening config (network_disabled, mem_limit, runtime, cap_drop, ...),
+     the shared host-side scrub, the bytes->str decode, the pytest-summary
+     parse in run_tests, the denylist short-circuit, and the missing-SDK /
+     dead-daemon RuntimeError paths. These cover DockerSandbox even on a box
+     with no Docker.
+  2. LIVE-DOCKER (skipped unless `_docker_available()`): boots a REAL hardened
+     `python:3.11-slim` container with --network none and runs the 4 inversion
+     gates + a cache-scrub check + a network-isolation check inside it. Since
+     Docker is available on this host, these ACTUALLY RUN.
+"""
+from __future__ import annotations
+import os
+import shutil
+import subprocess
+import tempfile
+import textwrap
+import types
+from collections import namedtuple
+import pytest
+from composer_replication.datagen import docker_sandbox as ds_mod
+from composer_replication.datagen.docker_sandbox import DockerSandbox
+from composer_replication.datagen.sandbox import SANDBOX_DENYLIST, Sandbox
+# ---------------------------------------------------------------------------
+# Live-Docker gate (mirrors test_docker_substrate_e2e.py)
+# ---------------------------------------------------------------------------
+def _docker_available() -> bool:
+    """True iff a usable Docker daemon is reachable via the CLI."""
+    if shutil.which("docker") is None:
+        return False
+    try:
+        r = subprocess.run(["docker", "info"], capture_output=True, timeout=10)
+        return r.returncode == 0
+    except Exception:
+        return False
+# A tiny image we know exists locally on this host (212MB python:3.11-slim).
+# The live tests run `--network none`, so the image MUST be present already
+# (no pull possible inside a network-disabled container).
+_TEST_IMAGE = "python:3.11-slim"
+def _image_present(image: str) -> bool:
+    try:
+        r = subprocess.run(
+            ["docker", "image", "inspect", image], capture_output=True, timeout=15
+        )
+        return r.returncode == 0
+    except Exception:
+        return False
+# ===========================================================================
+# TIER 1 — PURE-UNIT (mock docker client, no daemon required)
+# ===========================================================================
+_ExecResult = namedtuple("ExecResult", ["exit_code", "output"])
+class _MockContainer:
+    """Stand-in for a docker-py Container.
+    Knobs:
+      - exec_script: callable(cmd_argv) -> (exit_code, stdout_bytes, stderr_bytes)
+        used to fake exec_run results. Defaults to an empty success.
+    """
+    def __init__(self, *, exec_script=None):
+        self.exec_calls: list[tuple] = []
+        self.removed = False
+        self.remove_force = None
+        self._exec_script = exec_script or (lambda argv: (0, b"", b""))
+    def exec_run(self, cmd, *, workdir=None, demux=False, **kw):
+        self.exec_calls.append((cmd, {"workdir": workdir, "demux": demux, **kw}))
+        code, out, err = self._exec_script(cmd)
+        if demux:
+            return _ExecResult(code, (out, err))
+        return _ExecResult(code, (out or b"") + (err or b""))
+    def remove(self, force=False):
+        self.removed = True
+        self.remove_force = force
+class _MockContainers:
+    def __init__(self, container, *, run_raises=None):
+        self._container = container
+        self.run_kwargs: dict | None = None
+        self._run_raises = run_raises
+        self._listed: list = []
+    def run(self, **kwargs):
+        self.run_kwargs = kwargs
+        if self._run_raises is not None:
+            raise self._run_raises
+        return self._container
+    def list(self, all=False, filters=None):  # noqa: A002 - matches docker-py
+        return list(self._listed)
+class _MockClient:
+    def __init__(self, container, *, run_raises=None, runtimes=None):
+        self.containers = _MockContainers(container, run_raises=run_raises)
+        self._info = {"Runtimes": runtimes or {"runc": {}}}
+        self.pinged = False
+    def ping(self):
+        self.pinged = True
+        return True
+    def info(self):
+        return self._info
+def _patch_client(monkeypatch, client):
+    """Make DockerSandbox use `client` instead of a real daemon, and stub the
+    lazy `docker` module so `docker.errors.*` resolve in boot()."""
+    monkeypatch.setattr(ds_mod, "_make_client", lambda: client)
+    fake_docker = types.ModuleType("docker")
+    errors = types.ModuleType("docker.errors")
+    class ImageNotFound(Exception):  # noqa: N818 — mirrors docker.errors.ImageNotFound name
+        pass
+    class APIError(Exception):
+        pass
+    errors.ImageNotFound = ImageNotFound
+    errors.APIError = APIError
+    fake_docker.errors = errors
+    fake_docker.from_env = lambda: client
+    monkeypatch.setattr(ds_mod, "_require_docker", lambda: fake_docker)
+    return fake_docker
+def test_dockersandbox_is_a_sandbox_protocol_instance():
+    """Drop-in for FakeSandbox/LocalSubprocessSandbox in env/validator."""
+    sb = DockerSandbox(workdir="/tmp")
+    assert isinstance(sb, Sandbox)
+def test_container_kwargs_hardening_config():
+    """The lockdown recipe is present and correct WITHOUT any daemon."""
+    sb = DockerSandbox(workdir="/some/work")
+    kw = sb.container_kwargs(_TEST_IMAGE)
+    # network egress kill-switch
+    assert kw["network_disabled"] is True
+    assert kw["network_mode"] == "none"
+    # filesystem lockdown
+    assert kw["read_only"] is True
+    assert kw["tmpfs"] == {"/tmp": "rw,noexec,nosuid,size=64m"}
+    assert kw["volumes"] == {"/some/work": {"bind": "/work", "mode": "rw"}}
+    assert kw["working_dir"] == "/work"
+    # privilege lockdown
+    assert kw["user"] == "1000:1000"
+    assert kw["cap_drop"] == ["ALL"]
+    assert kw["security_opt"] == ["no-new-privileges:true"]
+    # resource limits
+    assert kw["pids_limit"] == 256
+    assert kw["mem_limit"] == "1g"
+    assert kw["memswap_limit"] == "1g"  # == mem_limit => no swap
+    assert kw["nano_cpus"] == 2_000_000_000
+    # lifecycle
+    assert kw["detach"] is True
+    assert kw["command"] == ["sleep", "infinity"]
+    assert kw["labels"] == {"composer_replication": "datagen"}
+    assert kw["name"].startswith("fd-")
+def test_runtime_optional_default_runc():
+    """runtime defaults to None => the 'runtime' kwarg is omitted (daemon
+    default runc), so the default path never names a runtime that may not
+    exist."""
+    assert "runtime" not in DockerSandbox(workdir="/w").container_kwargs("img")
+def test_runtime_runsc_passed_through_when_set():
+    """When the caller opts into gVisor, runtime='runsc' reaches run()."""
+    kw = DockerSandbox(workdir="/w", runtime="runsc").container_kwargs("img")
+    assert kw["runtime"] == "runsc"
+def test_resource_limits_are_configurable():
+    sb = DockerSandbox(
+        workdir="/w", mem_limit="256m", memswap_limit="256m",
+        pids_limit=64, nano_cpus=1_000_000_000, tmpfs_size="16m",
+    )
+    kw = sb.container_kwargs("img")
+    assert kw["mem_limit"] == "256m"
+    assert kw["memswap_limit"] == "256m"
+    assert kw["pids_limit"] == 64
+    assert kw["nano_cpus"] == 1_000_000_000
+    assert kw["tmpfs"] == {"/tmp": "rw,noexec,nosuid,size=16m"}
+def test_keep_root_writable_escape_hatch():
+    kw = DockerSandbox(workdir="/w", keep_root_writable=True).container_kwargs("i")
+    assert kw["read_only"] is False
+def test_boot_scrubs_host_tree_before_container(monkeypatch):
+    """PRIMARY reward-hack control: scrub_tree runs on the HOST workdir in boot()
+    BEFORE the container starts (the bind mount is shared)."""
+    with tempfile.TemporaryDirectory() as d:
+        os.makedirs(os.path.join(d, "__pycache__"))
+        with open(os.path.join(d, "__pycache__", "x.cpython-311.pyc"), "wb") as f:
+            f.write(b"\x00stale-bytecode")
+        os.makedirs(os.path.join(d, ".git"))
+        with open(os.path.join(d, "mod.pyc"), "wb") as f:
+            f.write(b"\x00")
+        with open(os.path.join(d, "keep.py"), "w") as f:
+            f.write("x = 1\n")
+        container = _MockContainer()
+        client = _MockClient(container)
+        _patch_client(monkeypatch, client)
+        sb = DockerSandbox(workdir=d)
+        sb.boot(_TEST_IMAGE)
+        assert not os.path.exists(os.path.join(d, "__pycache__"))
+        assert not os.path.exists(os.path.join(d, ".git"))
+        assert not os.path.exists(os.path.join(d, "mod.pyc"))
+        assert os.path.exists(os.path.join(d, "keep.py"))  # real source survives
+        # the container was actually started with the hardened kwargs
+        assert client.containers.run_kwargs["network_disabled"] is True
+        assert sb.booted_image == _TEST_IMAGE
+def test_exec_uses_timeout_and_workdir_and_denylist(monkeypatch):
+    """exec() wraps the command with coreutils `timeout`, runs in /work, and
+    short-circuits denied commands without touching the container."""
+    container = _MockContainer(
+        exec_script=lambda argv: (0, b"hello\n", b"")
+    )
+    client = _MockClient(container)
+    _patch_client(monkeypatch, client)
+    sb = DockerSandbox(workdir="/w", exec_timeout_s=42)
+    sb._container = container  # skip boot for this focused unit
+    out = sb.exec({"command": "echo hello"})
+    assert out == "hello\n"
+    cmd_argv, kw = container.exec_calls[-1]
+    assert cmd_argv == ["/bin/sh", "-c", "timeout 42 echo hello"]
+    assert kw["workdir"] == "/work"
+    assert kw["demux"] is True
+    # a denylisted first token never reaches the container
+    n_before = len(container.exec_calls)
+    denied = sorted(SANDBOX_DENYLIST)[0]
+    msg = sb.exec({"command": f"{denied} something"})
+    assert "not allowed" in msg
+    assert len(container.exec_calls) == n_before  # no new exec
+def test_exec_decodes_non_utf8_bytes(monkeypatch):
+    """Untrusted code can emit invalid UTF-8 on stdout; we must not crash."""
+    container = _MockContainer(
+        exec_script=lambda argv: (0, b"\xff\xfe bad bytes", b"")
+    )
+    sb = DockerSandbox(workdir="/w")
+    sb._container = container
+    out = sb.exec({"command": "echo x"})
+    assert "bad bytes" in out  # replaced, not crashed
+    assert "\ufffd" in out  # U+FFFD replacement char
+def test_run_tests_parses_pytest_summary(monkeypatch):
+    """run_tests applies the SAME conservative parse as LocalSubprocessSandbox:
+    a node id is passed iff '<nodeid> PASSED' appears."""
+    tests = ("t.py::test_a", "t.py::test_b")
+    out = b"t.py::test_a PASSED\nt.py::test_b FAILED\n1 failed, 1 passed\n"
+    container = _MockContainer(exec_script=lambda argv: (1, out, b""))
+    sb = DockerSandbox(workdir="/w")
+    sb._container = container
+    res = sb.run_tests("pytest -v", tests)
+    assert res.passed == frozenset({"t.py::test_a"})
+    assert res.failed == frozenset({"t.py::test_b"})
+    assert res.collected_ok is True
+def test_run_tests_collection_error(monkeypatch):
+    tests = ("t.py::test_a",)
+    out = b"ERROR collecting t.py\n!!! errors during collection !!!\n"
+    container = _MockContainer(exec_script=lambda argv: (2, out, b""))
+    sb = DockerSandbox(workdir="/w")
+    sb._container = container
+    res = sb.run_tests("pytest -v", tests)
+    assert res.collected_ok is False
+    assert res.failed == frozenset({"t.py::test_a"})
+def test_run_tests_quotes_node_ids(monkeypatch):
+    """Parametrized node ids with spaces/brackets must be shlex-quoted (shell
+    injection guard the repo already fixed for the local sandbox)."""
+    captured = {}
+    def script(argv):
+        captured["argv"] = argv
+        return (0, b"", b"")
+    container = _MockContainer(exec_script=script)
+    sb = DockerSandbox(workdir="/w")
+    sb._container = container
+    sb.run_tests("pytest -v", ("t.py::test_x[a b]",))
+    # the dangerous node id is quoted inside the timeout-wrapped sh -c string
+    shell_cmd = captured["argv"][-1]
+    assert "'t.py::test_x[a b]'" in shell_cmd
+def test_exec_before_boot_raises():
+    sb = DockerSandbox(workdir="/w")
+    with pytest.raises(RuntimeError, match="before boot"):
+        sb.exec({"command": "echo hi"})
+def test_trajectory_records_actions(monkeypatch):
+    container = _MockContainer()
+    sb = DockerSandbox(workdir="/w")
+    sb._container = container
+    sb.exec({"command": "echo a"})
+    sb.exec({"command": "echo b"})
+    traj = sb.trajectory()
+    assert [a["command"] for a in traj] == ["echo a", "echo b"]
+def test_close_removes_container_force(monkeypatch):
+    container = _MockContainer()
+    sb = DockerSandbox(workdir="/w")
+    sb._container = container
+    sb.close()
+    assert container.removed is True
+    assert container.remove_force is True
+    # idempotent
+    sb.close()
+def test_context_manager_closes(monkeypatch):
+    container = _MockContainer()
+    client = _MockClient(container)
+    _patch_client(monkeypatch, client)
+    with tempfile.TemporaryDirectory() as d:
+        with DockerSandbox(workdir=d) as sb:
+            sb.boot(_TEST_IMAGE)
+            assert sb._container is container
+    assert container.removed is True
+def test_reap_leaked_sweeps_labelled_containers(monkeypatch):
+    leaked = [_MockContainer(), _MockContainer()]
+    container = _MockContainer()
+    client = _MockClient(container)
+    client.containers._listed = leaked
+    n = DockerSandbox.reap_leaked(client)
+    assert n == 2
+    assert all(c.removed for c in leaked)
+def test_boot_image_not_found_raises_runtimeerror(monkeypatch):
+    container = _MockContainer()
+    client = _MockClient(container)
+    fake_docker = _patch_client(monkeypatch, client)
+    # make run() raise ImageNotFound
+    client.containers._run_raises = fake_docker.errors.ImageNotFound("nope")
+    with tempfile.TemporaryDirectory() as d:
+        sb = DockerSandbox(workdir=d)
+        with pytest.raises(RuntimeError, match="not found locally"):
+            sb.boot("ghost:latest")
+def test_boot_api_error_raises_runtimeerror(monkeypatch):
+    container = _MockContainer()
+    client = _MockClient(container)
+    fake_docker = _patch_client(monkeypatch, client)
+    client.containers._run_raises = fake_docker.errors.APIError("bad runtime")
+    with tempfile.TemporaryDirectory() as d:
+        sb = DockerSandbox(workdir=d, runtime="runsc")
+        with pytest.raises(RuntimeError, match="Docker API error"):
+            sb.boot(_TEST_IMAGE)
+def test_require_docker_missing_sdk_raises(monkeypatch):
+    """If the docker SDK is absent, a clear RuntimeError is raised (lazy import
+    means the FakeSandbox/core path never needs it)."""
+    import builtins
+    real_import = builtins.__import__
+    def fake_import(name, *args, **kwargs):
+        if name == "docker":
+            raise ImportError("No module named 'docker'")
+        return real_import(name, *args, **kwargs)
+    monkeypatch.setattr(builtins, "__import__", fake_import)
+    with pytest.raises(RuntimeError, match="requires the 'docker' Python SDK"):
+        ds_mod._require_docker()
+def test_make_client_dead_daemon_raises(monkeypatch):
+    """A dead/unreachable daemon surfaces a clear RuntimeError at client build."""
+    fake_docker = types.ModuleType("docker")
+    def from_env():
+        raise RuntimeError("Cannot connect to the Docker daemon")
+    fake_docker.from_env = from_env
+    monkeypatch.setattr(ds_mod, "_require_docker", lambda: fake_docker)
+    with pytest.raises(RuntimeError, match="could not reach a Docker daemon"):
+        ds_mod._make_client()
+def test_runsc_available_false_when_only_runc(monkeypatch):
+    client = _MockClient(_MockContainer(), runtimes={"runc": {}})
+    monkeypatch.setattr(ds_mod, "_make_client", lambda: client)
+    assert ds_mod.runsc_available() is False
+def test_runsc_available_true_when_registered(monkeypatch):
+    client = _MockClient(_MockContainer(), runtimes={"runc": {}, "runsc": {}})
+    monkeypatch.setattr(ds_mod, "_make_client", lambda: client)
+    assert ds_mod.runsc_available() is True
+# ===========================================================================
+# TIER 2 — LIVE DOCKER (skipif on daemon availability)
+# ===========================================================================
+live = pytest.mark.skipif(
+    not _docker_available(),
+    reason="Docker daemon not available — DockerSandbox live tests are "
+    "hardware-gated (mirror test_docker_substrate_e2e.py).",
+)
+# Minimal synthetic FD task (same shape as test_docker_substrate_e2e.py).
+_MODULE_SOLVED = textwrap.dedent('''\
+    def add(a, b):
+        return a + b
+    def mul(a, b):
+        return a * b
+''')
+_MODULE_BROKEN = textwrap.dedent('''\
+    def add(a, b):
+        return a + b
+''')
+# A stdlib-only pytest substitute: NO pip install, so --network none holds.
+# Writes a tiny runner that imports `feature`, checks both add/mul, and prints
+# pytest-style '<nodeid> PASSED/FAILED' lines that run_tests parses. We pass the
+# NODE IDs to check on argv (trusted, test-author-controlled) and the runner
+# evaluates FIXED expressions per node id — no eval() of untrusted input.
+_RUNNER_TMPL = '''\
+import sys
+# Fixed expectations keyed by node id. The deleted-feature episode is detected
+# by the AttributeError raised when the check calls `feature.mul` (the import
+# itself succeeds), never by evaluating a string.
+CHECKS = {
+    "feature.py::test_add": lambda m: m.add(2, 3) == 5,
+    "feature.py::test_mul": lambda m: m.mul(2, 3) == 6,
+}
+nodeid = sys.argv[1]
+try:
+    import feature
+    ok = bool(CHECKS[nodeid](feature))
+except Exception as e:
+    print(nodeid, "FAILED", "(exc:", type(e).__name__, e, ")")
+    sys.exit(1)
+print(nodeid, "PASSED" if ok else "FAILED")
+sys.exit(0 if ok else 1)
+'''
+# Host-side network probe written into the workdir, then run inside the
+# container as a plain file (avoids fragile inline `python -c` quoting through
+# `sh -c`). Prints CONNECTED if egress works, BLOCKED otherwise.
+_NETPROBE = '''\
+import socket
+s = socket.socket()
+s.settimeout(3)
+try:
+    s.connect(("1.1.1.1", 53))
+    print("CONNECTED")
+except Exception as e:
+    print("BLOCKED", type(e).__name__)
+'''
+def _materialize(d: str, module_src: str) -> None:
+    with open(os.path.join(d, "feature.py"), "w") as f:
+        f.write(module_src)
+    with open(os.path.join(d, "runner.py"), "w") as f:
+        f.write(_RUNNER_TMPL)
+@live
+def test_live_image_present_guard():
+    """The live tests run --network none and cannot pull; assert the image is
+    already on the host so a missing-image failure reads clearly."""
+    if not _image_present(_TEST_IMAGE):
+        pytest.skip(f"{_TEST_IMAGE} not present locally; `docker pull {_TEST_IMAGE}` to enable")
+@live
+def test_live_four_inversion_gates_in_hardened_container():
+    """The 4 ADR-010 gates against a REAL hardened DockerSandbox container."""
+    if not _image_present(_TEST_IMAGE):
+        pytest.skip(f"{_TEST_IMAGE} not present locally")
+    target = "feature.py::test_mul"  # FAIL_TO_PASS — exercises the deleted symbol
+    guard = "feature.py::test_add"   # PASS_TO_PASS — must survive the deletion
+    def _run(module_src, node):
+        with tempfile.TemporaryDirectory() as d:
+            _materialize(d, module_src)
+            sb = DockerSandbox(workdir=d, exec_timeout_s=60)
+            sb.boot(_TEST_IMAGE)
+            try:
+                # run_tests appends the shlex-quoted node id to the command, and
+                # the runner uses it to pick which FIXED check to run.
+                res = sb.run_tests("python runner.py", (node,))
+                return node in res.passed, res.stdout
+            finally:
+                sb.close()
+    # Gate 1 — solved: both pass.
+    g1t, _ = _run(_MODULE_SOLVED, target)
+    g1g, _ = _run(_MODULE_SOLVED, guard)
+    assert g1t and g1g, "Gate 1 (baseline green) failed in hardened container"
+    # Gate 2 — broken: target FAILS (mul gone).
+    g2t, out2 = _run(_MODULE_BROKEN, target)
+    assert not g2t, f"Gate 2 (deletion breaks target) failed:\n{out2}"
+    # Gate 3 — broken: guard still PASSES.
+    g3g, out3 = _run(_MODULE_BROKEN, guard)
+    assert g3g, f"Gate 3 (remains functional) failed:\n{out3}"
+    # Gate 4 — gold restores: target passes again.
+    g4t, _ = _run(_MODULE_SOLVED, target)
+    assert g4t, "Gate 4 (gold restores) failed"
+@live
+def test_live_network_is_disabled():
+    """--network none / network_disabled actually blocks egress in the live
+    container — the reward-hack egress kill-switch."""
+    if not _image_present(_TEST_IMAGE):
+        pytest.skip(f"{_TEST_IMAGE} not present locally")
+    with tempfile.TemporaryDirectory() as d:
+        _materialize(d, _MODULE_SOLVED)
+        with open(os.path.join(d, "netprobe.py"), "w") as f:
+            f.write(_NETPROBE)
+        sb = DockerSandbox(workdir=d, exec_timeout_s=30)
+        sb.boot(_TEST_IMAGE)
+        try:
+            out = sb.exec({"command": "python netprobe.py"})
+            assert "CONNECTED" not in out, f"network egress was NOT blocked:\n{out}"
+            assert "BLOCKED" in out, f"unexpected network probe output:\n{out}"
+        finally:
+            sb.close()
+@live
+def test_live_cache_scrub_removes_bytecode():
+    """The cache scrub primary control holds on a real container: a stale .pyc
+    on the host mount is removed by boot() before the (broken) episode."""
+    with tempfile.TemporaryDirectory() as d:
+        _materialize(d, _MODULE_BROKEN)
+        os.makedirs(os.path.join(d, "__pycache__"), exist_ok=True)
+        with open(os.path.join(d, "__pycache__", "feature.cpython-311.pyc"), "wb") as f:
+            f.write(b"\x00stale-bytecode-with-mul-signature")
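+        if not _image_present(_TEST_IMAGE):
+            # the scrub itself is host-side, but boot() still needs the image to
+            # start the container, so skip instead of failing on a missing image
+            pytest.skip(f"{_TEST_IMAGE} not present locally")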
+        container = None
+        try:
+            sb = DockerSandbox(workdir=d)
+            sb.boot(_TEST_IMAGE)
+            container = sb
+            assert not os.path.exists(os.path.join(d, "__pycache__")), \
+                "cache scrub did not remove __pycache__ in DockerSandbox.boot()"
+        finally:
+            if container is not None:
+                container.close()
+@live
+def test_live_runsc_runtime():
+    """If gVisor is registered, boot with runtime='runsc' and run a test in it;
+    else skip (runsc is not installed on most hosts)."""
+    if not ds_mod.runsc_available():
+        pytest.skip("gVisor 'runsc' runtime not registered with this daemon")
+    if not _image_present(_TEST_IMAGE):
+        pytest.skip(f"{_TEST_IMAGE} not present locally")
+    with tempfile.TemporaryDirectory() as d:
+        _materialize(d, _MODULE_SOLVED)
+        sb = DockerSandbox(workdir=d, runtime="runsc", exec_timeout_s=60)
+        sb.boot(_TEST_IMAGE)
+        try:
+            res = sb.run_tests("python runner.py", ("feature.py::test_mul",))
+            assert "feature.py::test_mul" in res.passed
+        finally:
+            sb.close()
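
Reviewer note: the "conservative parse" that `test_run_tests_parses_pytest_summary` and `test_run_tests_collection_error` exercise can be sketched roughly as below. This is only an illustration built from the two test fixtures; `SketchResult` and `parse_conservative` are hypothetical names, and recognising a collection failure by pytest's "errors during collection" banner is an assumption, not necessarily how `DockerSandbox.run_tests` detects it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchResult:
    """Hypothetical stand-in for the result object returned by run_tests."""
    passed: frozenset
    failed: frozenset
    collected_ok: bool


def parse_conservative(stdout, node_ids):
    """Count a node id as passed only if '<nodeid> PASSED' appears verbatim."""
    # Assumption: a collection failure is recognised by pytest's banner text.
    collected_ok = "errors during collection" not in stdout
    passed = frozenset(
        n for n in node_ids if collected_ok and f"{n} PASSED" in stdout
    )
    # Anything not positively seen as PASSED is treated as failed.
    return SketchResult(passed, frozenset(node_ids) - passed, collected_ok)


if __name__ == "__main__":
    out = "t.py::test_a PASSED\nt.py::test_b FAILED\n1 failed, 1 passed\n"
    print(parse_conservative(out, ("t.py::test_a", "t.py::test_b")))
```

The point of the conservative rule is that a missing or garbled line can only move a test into `failed`, never into `passed`, which is the safe direction for a reward signal.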

composer_replication/diloco/serverless/__init__.py CHANGED

@@ -47,6 +47,7 @@ from composer_replication.diloco.serverless.allreduce import (
     MockManager,
     ObjectStoreAllReduce,
 )
+from composer_replication.diloco.serverless.eks import EKSExecutor
 from composer_replication.diloco.serverless.executor import (
     LocalProcessExecutor,
     ReplicaHandle,
@@ -55,13 +56,16 @@ from composer_replication.diloco.serverless.executor import (
 from composer_replication.diloco.serverless.hf_jobs import HFJobsExecutor
 from composer_replication.diloco.serverless.modal import ModalExecutor
 from composer_replication.diloco.serverless.modal_spawn import ModalSpawnExecutor
+from composer_replication.diloco.serverless.sagemaker import SageMakerExecutor
 __all__ = [
+    "EKSExecutor",
     "HFJobsExecutor",
     "LocalProcessExecutor",
     "MockManager",
     "ModalExecutor",
     "ModalSpawnExecutor",
+    "SageMakerExecutor",
     "ObjectStoreAllReduce",
     "ReplicaHandle",
     "ServerlessExecutor",

composer_replication/diloco/serverless/eks.py ADDED
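
Reviewer note: because this file maps N ranks onto one Indexed Job, per-rank status has to be derived from the Job's `completed_indexes` / `failed_indexes` strings (comma-separated singletons and `a-b` ranges, e.g. `"1,3-5,7"`). A minimal sketch of that lookup is below. `expand_indexes` and `rank_status` are hypothetical names, the status labels are assumptions, and this is not the module's `poll` implementation; unlike the module's `_expand_indexes`, it does not tolerate malformed tokens.

```python
def expand_indexes(spec):
    """Expand a k8s index string such as "1,3-5,7" into a set of ints."""
    out = set()
    for token in (spec or "").split(","):
        token = token.strip()
        if not token:
            continue
        lo, sep, hi = token.partition("-")
        if sep:
            # "a-b" is an inclusive range of completion indexes
            out.update(range(int(lo), int(hi) + 1))
        else:
            out.add(int(lo))
    return out


def rank_status(rank, completed_indexes, failed_indexes):
    """Derive one rank's state from the shared Job status (labels are assumed)."""
    if rank in expand_indexes(failed_indexes):
        return "failed"
    if rank in expand_indexes(completed_indexes):
        return "succeeded"
    return "running"


if __name__ == "__main__":
    print(sorted(expand_indexes("1,3-5,7")))   # [1, 3, 4, 5, 7]
    print(rank_status(4, "1,3-5,7", None))     # succeeded
    print(rank_status(2, "0-1", "2"))          # failed
```

Every handle reads the same two strings, which is why N `ReplicaHandle` objects can share one `job_name` and differ only in `rank`.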

	@@ -0,0 +1,674 @@

+"""EKSExecutor — production Amazon EKS / Kubernetes-backed serverless executor.
+This is the v0-finished k8s sibling of `ModalSpawnExecutor`. It implements
+the `ServerlessExecutor` Protocol against the Kubernetes ``BatchV1Api`` using
+the **single Indexed Job** topology recommended for gang-scheduled DiLoCo
+replicas.
+Topology (the load-bearing design choice)
+------------------------------------------
+There are two ways to map N replicas onto k8s:
+  (A) ONE Indexed Job — ``completions=N, parallelism=N,
+      completionMode='Indexed'``. The control plane assigns each pod a
+      ``JOB_COMPLETION_INDEX`` 0..N-1 which IS the rank, all pods share one
+      rendezvous URI, scheduling is atomic, and a single delete cancels the
+      whole cohort.
+  (B) N separate non-indexed Jobs, one per rank.
+`EKSExecutor` uses **(A)** because it is the better fit for DiLoCo: rank
+assignment is free, scheduling is gang-atomic, and one delete tears down the
+cohort — which matches ``ObjectStoreAllReduce``'s all-or-nothing barrier. The
+reconciliation with the per-replica ``ReplicaHandle`` contract: ``launch_replicas``
+creates ONE Indexed Job but still returns N ``ReplicaHandle`` objects
+(``handles[i].rank == i``) whose ``metadata`` stores the SHARED
+``job_name``/``namespace`` plus that rank.
+This is materially different from ``ModalSpawnExecutor`` where each handle is
+an independent ``FunctionCall``:
+  * ``poll(handle)`` reads the single Job status and checks whether
+    ``handle.rank`` is in the run-length-compressed ``completed_indexes`` /
+    ``failed_indexes`` strings.
+  * ``cancel(handle)`` on ANY handle deletes the WHOLE Job (intentional gang
+    semantics — cancelling one rank tears down the whole replica cohort).
+Rank plumbing
+-------------
+The repo's ``replica_entrypoint`` reads ``REPLICA_RANK``. We bridge the k8s
+completion-index to that env var via the downward API rather than relying on
+the auto-injected ``JOB_COMPLETION_INDEX``::
+    V1EnvVar(
+        name="REPLICA_RANK",
+        value_from=V1EnvVarSource(field_ref=V1ObjectFieldSelector(
+            field_path="metadata.annotations['batch.kubernetes.io/job-completion-index']")),
+    )
+so the unchanged entrypoint's ``REPLICA_RANK`` read just works. ``WORLD_SIZE``
+is set as a literal env var.
+S3 rendezvous via IRSA / Pod Identity
+-------------------------------------
+``EKSExecutor`` accepts ``service_account_name`` and references it on the
+PodSpec. The EKS Pod Identity / IRSA mutating webhook then injects
+``AWS_ROLE_ARN`` + ``AWS_WEB_IDENTITY_TOKEN_FILE`` (and a projected token
+volume) into the pod, so ``boto3``/``s3fs``/``fsspec`` pick up credentials via
+the web-identity provider with ZERO code change inside the replica — the
+``s3://`` rendezvous works out of the box. ``EKSExecutor`` only REFERENCES a
+pre-annotated ServiceAccount; it never creates IAM/OIDC resources.
+Sandboxing (advanced, optional)
+-------------------------------
+``runtime_class_name`` references a pre-existing cluster-scoped ``RuntimeClass``
+by name (for example ``gvisor``, backed by the ``runsc`` handler, or ``kata``).
+It defaults to ``None``.
+.. warning::
+   Combining ``gpu`` with ``runtime_class_name`` is advanced and unverified.
+   gVisor (runsc) needs ``nvproxy`` enabled and only supports a fixed allowlist
+   of NVIDIA driver families; Kata runs a microVM that caps CPU/mem and needs
+   GPU passthrough (PCIe/IOMMU + NVIDIA Kata Manager + CDI). Do not silently
+   combine the two without operator validation. ``EKSExecutor`` cannot create
+   the RuntimeClass — it only references one that already exists.
+References
+----------
+- k8s Indexed Jobs: https://kubernetes.io/docs/tasks/job/indexed-parallel-processing-static/
+- kubernetes-client/python job_crud example + V1JobSpec / V1JobStatus docs
+- EKS IRSA: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
+- ADR-005 (executor protocol design)
+"""
+from __future__ import annotations
+import time
+import uuid
+from collections.abc import Callable, Mapping
+from typing import Any
+from composer_replication.diloco.serverless.executor import (
+    ReplicaHandle,
+)
+# Logical GPU spec ("A100"/"H100") -> (gpu_count_string, node_selector merge).
+# The Protocol's `gpu` arg is a logical name; map it to a concrete EKS node
+# class + GPU count rather than passing the opaque string straight through.
+_GPU_SPEC_TABLE: dict[str, tuple[str, dict[str, str]]] = {
+    "A100": ("1", {"node.kubernetes.io/instance-type": "p4d.24xlarge"}),
+    "H100": ("1", {"node.kubernetes.io/instance-type": "p5.48xlarge"}),
+    "A10G": ("1", {"node.kubernetes.io/instance-type": "g5.xlarge"}),
+    "T4": ("1", {"node.kubernetes.io/instance-type": "g4dn.xlarge"}),
+}
+def _expand_indexes(spec: str | None) -> set[int]:
+    """Expand a run-length-compressed completion-index string to a set.
+    The k8s ``V1JobStatus.completed_indexes`` / ``failed_indexes`` fields are
+    strings like ``"1,3-5,7"`` (comma-separated singletons and ``a-b`` ranges).
+    ``_expand_indexes("1,3-5,7") == {1, 3, 4, 5, 7}``. Empty/None -> empty set.
+    """
+    out: set[int] = set()
+    if not spec:
+        return out
+    for token in spec.split(","):
+        token = token.strip()
+        if not token:
+            continue
+        if "-" in token:
+            lo_s, _, hi_s = token.partition("-")
+            try:
+                lo, hi = int(lo_s), int(hi_s)
+            except ValueError:
+                continue
+            if hi < lo:
+                lo, hi = hi, lo
+            out.update(range(lo, hi + 1))
+        else:
+            try:
+                out.add(int(token))
+            except ValueError:
+                continue
+    return out
+class EKSExecutor:
+    """Run N DiLoCo replicas as a single Kubernetes Indexed Job on EKS.
+    Implements the `ServerlessExecutor` Protocol. ``launch_replicas`` creates
+    ONE Indexed Job (``completions == parallelism == n_replicas``,
+    ``completionMode='Indexed'``) and returns N ``ReplicaHandle`` objects that
+    all share the same ``job_name``/``namespace`` (gang semantics).
+    Args:
+        image: container image that has ``composer_replication`` installed and
+            runs the replica entrypoint.
+        namespace: k8s namespace for the Job. Default ``"default"``.
+        service_account_name: ServiceAccount to attach to the PodSpec for IRSA /
+            EKS Pod Identity S3 access. ``EKSExecutor`` references it; it does
+            NOT create it or any IAM/OIDC resources.
+        node_selector: extra node selector merged into the GPU node selector.
+        tolerations: PodSpec tolerations. If GPU is requested and the caller did
+            not supply tolerations, the standard ``nvidia.com/gpu`` NoSchedule
+            toleration is added automatically.
+        runtime_class_name: optional pre-existing RuntimeClass (e.g. ``"gvisor"``
+            / ``"kata"``). Default ``None``. See the module-level warning before
+            combining with ``gpu``.
+        command: container command. Defaults to the repo replica entrypoint
+            module ``["python", "-m",
+            "composer_replication.diloco.serverless.replica_entrypoint"]``.
+        cpu_request / memory_request: PodSpec resource requests.
+        ttl_seconds_after_finished: auto-delete the finished Job (and its pods,
+            cascadingly) after this many seconds. Default 3600.
+        backoff_limit: Job retry budget. Default 0 (fail-fast — RL gangs usually
+            do NOT want the k8s default of 6 retries).
+        gpu_resource_key: the GPU resource key. Default ``"nvidia.com/gpu"``.
+        run_id: optional run id baked into the generated Job name.
+        batch_api / core_api: dependency-injected ``BatchV1Api`` / ``CoreV1Api``
+            instances. When ``None`` (the default), they are built lazily on
+            first use via in-cluster or kube-config loading. Tests inject mocks.
+    Raises:
+        RuntimeError: if the ``kubernetes`` client is not installed AND no api
+            was injected (the import is needed to construct V1 model objects).
+    """
+    backend_name = "eks"
+    # This executor sets up no direct replica-to-replica network channel;
+    # rendezvous goes through S3 (ObjectStoreAllReduce).
+    supports_inter_replica_network = False
+    def __init__(
+        self,
+        image: str,
+        *,
+        namespace: str = "default",
+        service_account_name: str | None = None,
+        node_selector: dict[str, str] | None = None,
+        tolerations: list[Any] | None = None,
+        runtime_class_name: str | None = None,
+        command: list[str] | None = None,
+        cpu_request: str = "4",
+        memory_request: str = "16Gi",
+        ttl_seconds_after_finished: int = 3600,
+        backoff_limit: int = 0,
+        gpu_resource_key: str = "nvidia.com/gpu",
+        run_id: str | None = None,
+        batch_api: Any = None,
+        core_api: Any = None,
+    ) -> None:
+        # `kubernetes` is only strictly required when we have to BUILD V1 model
+        # objects ourselves (launch_replicas) or load cluster config (when an
+        # api is not injected). We surface a clear error here only if we
+        # definitely need it and it is absent, i.e. when at least one of
+        # batch_api / core_api was not injected. When BOTH apis are injected
+        # (tests, or callers that pre-built clients), we tolerate a missing
+        # top-level package and lazy-import `client` per call.
+        if batch_api is None or core_api is None:
+            try:
+                import kubernetes  # noqa: F401
+            except ImportError as e:
+                raise RuntimeError(
+                    'EKSExecutor requires the kubernetes client: '
+                    'pip install "kubernetes>=29" (or '
+                    "`pip install -e .[eks]`). Got: " + repr(e)
+                ) from e
+        self.image = image
+        self.namespace = namespace
+        self.service_account_name = service_account_name
+        self.node_selector = dict(node_selector) if node_selector else None
+        self.tolerations = list(tolerations) if tolerations else None
+        self.runtime_class_name = runtime_class_name
+        self.command = command or [
+            "python",
+            "-m",
+            "composer_replication.diloco.serverless.replica_entrypoint",
+        ]
+        self.cpu_request = cpu_request
+        self.memory_request = memory_request
+        self.ttl_seconds_after_finished = ttl_seconds_after_finished
+        self.backoff_limit = backoff_limit
+        self.gpu_resource_key = gpu_resource_key
+        self.run_id = run_id or "diloco"
+        self._batch_api = batch_api
+        self._core_api = core_api
+        # rank -> {"job_name", "namespace", "result"}; lets poll/collect cache.
+        self._handles: dict[int, dict[str, Any]] = {}
+    # -----------------------------------------------------------------
+    # Lazy client construction (config loading only when not injected)
+    # -----------------------------------------------------------------
+    def _load_config(self) -> None:
+        """Load k8s config: in-cluster first, then ~/.kube/config."""
+        from kubernetes import config
+        try:
+            config.load_incluster_config()
+        except config.ConfigException:
+            config.load_kube_config()
+    def _batch(self) -> Any:
+        if self._batch_api is None:
+            from kubernetes import client
+            self._load_config()
+            self._batch_api = client.BatchV1Api()
+        return self._batch_api
+    def _core(self) -> Any:
+        if self._core_api is None:
+            from kubernetes import client
+            self._load_config()
+            self._core_api = client.CoreV1Api()
+        return self._core_api
+    # -----------------------------------------------------------------
+    # Job-spec construction
+    # -----------------------------------------------------------------
+    def _build_env(
+        self, world_size: int, entrypoint_args: Mapping[str, Any]
+    ) -> list[Any]:
+        """Build the container env list, including the downward-API rank var."""
+        from kubernetes import client
+        env: list[Any] = [
+            # REPLICA_RANK from the per-pod completion-index annotation via the
+            # downward API — bridges k8s indexing to the repo entrypoint's
+            # REPLICA_RANK read with no entrypoint change.
+            client.V1EnvVar(
+                name="REPLICA_RANK",
+                value_from=client.V1EnvVarSource(
+                    field_ref=client.V1ObjectFieldSelector(
+                        field_path=(
+                            "metadata.annotations["
+                            "'batch.kubernetes.io/job-completion-index']"
+                        )
+                    )
+                ),
+            ),
+            client.V1EnvVar(name="WORLD_SIZE", value=str(world_size)),
+        ]
+        # rendezvous_uri (and any other scalar kwargs) passed as literal env so
+        # the entrypoint / user code can read them. `rank_env` is the
+        # LocalProcessExecutor convention — drop it (same as ModalSpawnExecutor).
+        for key, value in entrypoint_args.items():
+            if key == "rank_env":
+                continue
+            if isinstance(value, (str, int, float, bool)):
+                env.append(
+                    client.V1EnvVar(name=key.upper(), value=str(value))
+                )
+        return env
+    def _build_resources(self, gpu: str | None) -> tuple[Any, dict[str, str], list[Any]]:
+        """Build V1ResourceRequirements + (node_selector, tolerations) for GPU.
+        Returns (resources, node_selector, tolerations). The GPU count is
+        ALWAYS a STRING ('1', not int 1): the OpenAPI type for the limits map
+        is dict[str, str], and an int can serialize incorrectly or raise.
+        """
+        from kubernetes import client
+        requests = {"cpu": self.cpu_request, "memory": self.memory_request}
+        limits: dict[str, str] = {}
+        node_selector: dict[str, str] = dict(self.node_selector or {})
+        tolerations: list[Any] = list(self.tolerations or [])
+        if gpu is not None:
+            gpu_count, gpu_node_selector = _GPU_SPEC_TABLE.get(
+                gpu, ("1", {})
+            )
+            # STRING, always.
+            limits[self.gpu_resource_key] = str(gpu_count)
+            # Merge the mapped node selector under any caller-supplied one
+            # (caller wins on key conflicts).
+            for k, v in gpu_node_selector.items():
+                node_selector.setdefault(k, v)
+            # Auto-add the GPU NoSchedule toleration unless the caller overrode
+            # tolerations explicitly.
+            if not self.tolerations:
+                tolerations.append(
+                    client.V1Toleration(
+                        key=self.gpu_resource_key,
+                        operator="Exists",
+                        effect="NoSchedule",
+                    )
+                )
+        resources = client.V1ResourceRequirements(
+            requests=requests,
+            limits=limits or None,
+        )
+        return resources, node_selector, tolerations
+    def _build_job(
+        self,
+        *,
+        job_name: str,
+        n_replicas: int,
+        gpu: str | None,
+        timeout: int,
+        entrypoint_args: Mapping[str, Any],
+    ) -> Any:
+        """Assemble the full V1Job (Indexed) bottom-up."""
+        from kubernetes import client
+        env = self._build_env(n_replicas, entrypoint_args)
+        resources, node_selector, tolerations = self._build_resources(gpu)
+        container = client.V1Container(
+            name="replica",
+            image=self.image,
+            command=list(self.command),
+            env=env,
+            resources=resources,
+        )
+        pod_spec = client.V1PodSpec(
+            restart_policy="Never",  # Jobs allow only Never/OnFailure; Never fails fast
+            containers=[container],
+            service_account_name=self.service_account_name,
+            node_selector=node_selector or None,
+            tolerations=tolerations or None,
+            runtime_class_name=self.runtime_class_name,
+        )
+        labels = {"app": "composer-diloco", "job-name": job_name}
+        pod_template = client.V1PodTemplateSpec(
+            metadata=client.V1ObjectMeta(labels=labels),
+            spec=pod_spec,
+        )
+        job_spec = client.V1JobSpec(
+            template=pod_template,
+            completions=n_replicas,
+            parallelism=n_replicas,
+            completion_mode="Indexed",
+            backoff_limit=self.backoff_limit,
+            ttl_seconds_after_finished=self.ttl_seconds_after_finished,
+            active_deadline_seconds=timeout,
+        )
+        return client.V1Job(
+            api_version="batch/v1",
+            kind="Job",
+            metadata=client.V1ObjectMeta(name=job_name, labels=labels),
+            spec=job_spec,
+        )
+    # -----------------------------------------------------------------
+    # ServerlessExecutor Protocol
+    # -----------------------------------------------------------------
+    def launch_replicas(
+        self,
+        n_replicas: int,
+        entrypoint: str | Callable[..., Any],
+        entrypoint_args: Mapping[str, Any],
+        *,
+        gpu: str | None = None,
+        timeout: int = 3600,
+    ) -> list[ReplicaHandle]:
+        """Create ONE Indexed Job of N pods and return N rank-ordered handles.
+        ``entrypoint`` is always ignored (k8s runs a container command, not an
+        in-process callable); the container command is fixed at construction
+        (``command`` ctor arg) and defaults to the repo entrypoint module.
+        Scalar ``entrypoint_args`` values are passed as upper-cased env vars so
+        ``replica_entrypoint`` / user code can read them. ``gpu`` maps to a
+        ``gpu_resource_key`` limit + node selector; ``timeout`` becomes the
+        Job's ``active_deadline_seconds`` hard wall-clock kill.
+        """
+        del entrypoint  # k8s runs a container command, not an in-process fn
+        if n_replicas < 1:
+            raise ValueError(f"n_replicas must be >= 1, got {n_replicas}")
+        job_name = f"{self.run_id}-{uuid.uuid4().hex[:8]}"
+        job = self._build_job(
+            job_name=job_name,
+            n_replicas=n_replicas,
+            gpu=gpu,
+            timeout=timeout,
+            entrypoint_args=entrypoint_args,
+        )
+        self._batch().create_namespaced_job(namespace=self.namespace, body=job)
+        handles: list[ReplicaHandle] = []
+        for rank in range(n_replicas):
+            handles.append(
+                ReplicaHandle(
+                    rank=rank,
+                    backend_name=self.backend_name,
+                    metadata={
+                        "job_name": job_name,
+                        "namespace": self.namespace,
+                        "rank": rank,
+                    },
+                )
+            )
+            self._handles[rank] = {
+                "job_name": job_name,
+                "namespace": self.namespace,
+                "result": None,
+            }
+        return handles
+    def poll(self, handle: ReplicaHandle) -> str:
+        """Poll this rank's status off the shared Indexed Job.
+        Reads ``read_namespaced_job_status`` once, then maps the whole-job
+        status to this rank: ``rank in completed_indexes`` -> ``succeeded``;
+        ``rank in failed_indexes`` or a Job-level ``Failed`` condition ->
+        ``failed``; ``active > 0`` -> ``running``; else ``pending``. A 404
+        (Job deleted/cancelled) -> ``cancelled``.
+        Returns one of: ``pending`` | ``running`` | ``succeeded`` | ``failed`` |
+        ``cancelled``.
+        """
+        from kubernetes.client.exceptions import ApiException
+        job_name = handle.metadata["job_name"]
+        namespace = handle.metadata["namespace"]
+        rank = handle.metadata.get("rank", handle.rank)
+        try:
+            status = self._batch().read_namespaced_job_status(
+                name=job_name, namespace=namespace
+            ).status
+        except ApiException as e:
+            if getattr(e, "status", None) == 404:
+                return "cancelled"
+            raise
+        completed = _expand_indexes(getattr(status, "completed_indexes", None))
+        if rank in completed:
+            return "succeeded"
+        failed = _expand_indexes(getattr(status, "failed_indexes", None))
+        if rank in failed:
+            return "failed"
+        # Whole-job terminal Failed (e.g. DeadlineExceeded / backoff) with no
+        # per-index attribution -> treat this rank as failed.
+        for cond in (getattr(status, "conditions", None) or []):
+            if (
+                getattr(cond, "type", None) == "Failed"
+                and getattr(cond, "status", None) == "True"
+            ):
+                return "failed"
+        active = getattr(status, "active", None) or 0
+        if active > 0:
+            return "running"
+        return "pending"
+    def stream_logs(self, handle: ReplicaHandle, *, n_lines: int = 200) -> str:
+        """Read recent logs for this rank's pod.
+        Finds the pod whose ``batch.kubernetes.io/job-completion-index``
+        annotation (or label) equals the rank, then reads its log tail. Returns
+        a placeholder string (rather than raising) when the pod has not started
+        or the Job is gone — mirrors ``LocalProcessExecutor``.
+        """
+        from kubernetes.client.exceptions import ApiException
+        job_name = handle.metadata["job_name"]
+        namespace = handle.metadata["namespace"]
+        rank = handle.metadata.get("rank", handle.rank)
+        idx_key = "batch.kubernetes.io/job-completion-index"
+        try:
+            pods = self._core().list_namespaced_pod(
+                namespace=namespace, label_selector=f"job-name={job_name}"
+            )
+        except ApiException:
+            return f"<rank {rank}: job not found / no pods yet>"
+        pod_name = None
+        for pod in getattr(pods, "items", None) or []:
+            meta = getattr(pod, "metadata", None)
+            annotations = getattr(meta, "annotations", None) or {}
+            labels = getattr(meta, "labels", None) or {}
+            if annotations.get(idx_key) == str(rank) or labels.get(idx_key) == str(rank):
+                pod_name = getattr(meta, "name", None)
+                break
+        if pod_name is None:
+            # Fall back to the `<job-name>-<index>-` pod-name prefix used by
+            # Indexed Jobs (the index *label* only exists on k8s >= 1.28).
+            prefix = f"{job_name}-{rank}-"
+            for pod in getattr(pods, "items", None) or []:
+                name = getattr(getattr(pod, "metadata", None), "name", "") or ""
+                if name.startswith(prefix):
+                    pod_name = name
+                    break
+        if pod_name is None:
+            return f"<rank {rank}: pod not started / no logs yet>"
+        try:
+            return self._core().read_namespaced_pod_log(
+                name=pod_name,
+                namespace=namespace,
+                container="replica",
+                tail_lines=n_lines,
+            )
+        except ApiException as e:
+            if getattr(e, "status", None) in (400, 404):
+                return f"<rank {rank}: pod not started / no logs yet>"
+            raise
+    def cancel(self, handle: ReplicaHandle) -> None:
+        """Delete the WHOLE shared Indexed Job (gang teardown).
+        Because ``EKSExecutor`` uses one shared Indexed Job, cancelling ANY rank
+        tears down the entire replica cohort — intentional gang semantics for
+        the DiLoCo all-reduce barrier (a single straggler being cancelled should
+        not leave the rest spinning and burning GPU).
+        Uses ``propagation_policy='Background'`` so the pods are cascadingly
+        deleted (the k8s default ORPHANS pods, which would keep burning GPU —
+        the exact failure mode for RL). Idempotent: a 404 (already deleted) is
+        swallowed, and an unknown handle never raises, honoring the Protocol's
+        "no exception if already terminated" contract.
+        """
+        from kubernetes import client
+        from kubernetes.client.exceptions import ApiException
+        job_name = handle.metadata.get("job_name")
+        namespace = handle.metadata.get("namespace", self.namespace)
+        if not job_name:
+            return  # unknown handle — no-op
+        try:
+            self._batch().delete_namespaced_job(
+                name=job_name,
+                namespace=namespace,
+                body=client.V1DeleteOptions(
+                    propagation_policy="Background",
+                    grace_period_seconds=0,
+                ),
+            )
+        except ApiException as e:
+            if getattr(e, "status", None) == 404:
+                return  # already deleted
+            # Best-effort: swallow other API errors (network blip, etc.).
+            return
+        except Exception:
+            return
+    def collect(
+        self,
+        handles: list[ReplicaHandle],
+        *,
+        timeout: int | None = None,
+    ) -> list[dict[str, Any]]:
+        """Poll until every rank reaches a terminal state or the deadline.
+        Sleeps between polls (Job status is eventually consistent — do not
+        hammer the API server). Returns per-rank result dicts in handles order::
+            {"rank", "status", "exit_code", "error", "job_name"}
+        ``exit_code`` is 0 for succeeded, 1 for failed, ``None`` for
+        running/pending/cancelled — matching the Protocol's documented shape.
+        """
+        deadline = time.time() + (timeout if timeout is not None else 86400)
+        poll_interval = float(self._collect_poll_interval())
+        terminal = {"succeeded", "failed", "cancelled"}
+        results_by_rank: dict[int, dict[str, Any]] = {}
+        pending = list(handles)
+        while pending and time.time() < deadline:
+            still_pending: list[ReplicaHandle] = []
+            for h in pending:
+                state = self.poll(h)
+                if state in terminal:
+                    results_by_rank[h.rank] = self._result_dict(h, state)
+                else:
+                    still_pending.append(h)
+            pending = still_pending
+            if not pending:
+                break
+            remaining = deadline - time.time()
+            if remaining <= 0:
+                break
+            time.sleep(min(poll_interval, max(0.0, remaining)))
+        # Any rank still non-terminal at the deadline -> report its last state.
+        for h in pending:
+            state = self.poll(h)
+            results_by_rank[h.rank] = self._result_dict(h, state)
+        return [results_by_rank[h.rank] for h in handles]
+    # -----------------------------------------------------------------
+    # Internals
+    # -----------------------------------------------------------------
+    def _collect_poll_interval(self) -> float:
+        """Seconds between collect() polls. Overridable in tests."""
+        return 5.0
+    @staticmethod
+    def _result_dict(handle: ReplicaHandle, state: str) -> dict[str, Any]:
+        exit_code = {"succeeded": 0, "failed": 1}.get(state, None)
+        error = None
+        if state == "failed":
+            error = f"rank {handle.rank} reported failed by Job status"
+        elif state == "cancelled":
+            error = f"rank {handle.rank} Job no longer exists (cancelled)"
+        elif state in ("running", "pending"):
+            error = f"rank {handle.rank} not terminal at deadline (state={state})"
+        return {
+            "rank": handle.rank,
+            "status": state,
+            "exit_code": exit_code,
+            "error": error,
+            "job_name": handle.metadata.get("job_name"),
+        }
+__all__ = ["EKSExecutor"]
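
`poll()` above relies on a helper, `_expand_indexes`, whose body is not shown in this part of the diff. As a rough illustration only, the sketch below shows one way such a helper could expand the compressed index string that Kubernetes reports in `JobStatus.completedIndexes` (for example `"1,3-5,7"`). The name `expand_indexes` and its exact behaviour are assumptions for illustration, not the repo's implementation.

```python
from __future__ import annotations


def expand_indexes(spec: str | None) -> set[int]:
    """Expand a compressed index list such as "1,3-5,7" into a set of ints.

    Hypothetical stand-in for the `_expand_indexes` helper referenced by
    `poll()`; the real helper may differ.
    """
    result: set[int] = set()
    if not spec:
        # None or "" means no indexes have been reported yet.
        return result
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A hyphenated run "a-b" covers every index from a to b inclusive.
            first, last = part.split("-", 1)
            result.update(range(int(first), int(last) + 1))
        else:
            result.add(int(part))
    return result


completed = expand_indexes("1,3-5,7")
print(2 in completed, 4 in completed)
```

With a helper of this shape, the per-rank check in `poll()` reduces to a set membership test, which is why a missing or empty status field has to come back as an empty set rather than raise.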

composer_replication/diloco/serverless/executor.py CHANGED

@@ -36,9 +36,10 @@ class ReplicaHandle:
 class ServerlessExecutor(Protocol):
     """Uniform interface for launching N replicas on a serverless backend.
-    Implementations: `LocalProcessExecutor` (test/dev), `ModalExecutor`
-    (Modal, v0), `HFJobsExecutor` (HuggingFace Jobs, v0). Future:
-    `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`.
     Note on rank assignment: the Protocol guarantees that handles are
     returned in rank order (`handles[i].rank == i`). The replica entrypoint

 class ServerlessExecutor(Protocol):
     """Uniform interface for launching N replicas on a serverless backend.
+    Implementations: `LocalProcessExecutor` (test/dev), `ModalSpawnExecutor`
+    (Modal, production), `EKSExecutor` (Amazon EKS / Kubernetes Indexed Job,
+    production), `SageMakerExecutor` (SageMaker Training Jobs, production),
+    `ModalExecutor` / `HFJobsExecutor` (v0 skeletons). Future: `RunPodExecutor`.
     Note on rank assignment: the Protocol guarantees that handles are
     returned in rank order (`handles[i].rank == i`). The replica entrypoint
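
The rank-order guarantee in this docstring (`handles[i].rank == i`) is easy to check mechanically. The sketch below is a minimal illustration under stated assumptions: `FakeHandle` and `InMemoryExecutor` are hypothetical stand-ins that only mirror the `rank` / `backend_name` / `metadata` fields and the `launch_replicas` signature visible in this diff, and they launch nothing.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeHandle:
    """Hypothetical stand-in for ReplicaHandle (same three fields)."""

    rank: int
    backend_name: str
    metadata: dict[str, Any] = field(default_factory=dict)


class InMemoryExecutor:
    """Hypothetical executor that only fabricates handles, for illustration."""

    backend_name = "in-memory"

    def launch_replicas(self, n_replicas, entrypoint, entrypoint_args, *,
                        gpu=None, timeout=3600):
        if n_replicas < 1:
            raise ValueError(f"n_replicas must be >= 1, got {n_replicas}")
        # Handles are built in ascending rank order, so handles[i].rank == i.
        return [
            FakeHandle(rank=i, backend_name=self.backend_name)
            for i in range(n_replicas)
        ]


def is_rank_ordered(handles) -> bool:
    """True when every handle sits at the list position equal to its rank."""
    return all(h.rank == i for i, h in enumerate(handles))


handles = InMemoryExecutor().launch_replicas(3, "pkg.module", {})
print(is_rank_ordered(handles))
```

A check like `is_rank_ordered` could serve as a shared conformance assertion across backends, since both executors added in this commit build their handle lists with the same `for rank in range(n_replicas)` loop.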

composer_replication/diloco/serverless/sagemaker.py ADDED

	@@ -0,0 +1,619 @@

+"""SageMakerExecutor — production boto3-backed serverless executor.
+This is a fully-working cloud adapter (the sibling of `ModalSpawnExecutor`,
+not the loud-failing `modal.py` / `hf_jobs.py` skeletons). It implements the
+`ServerlessExecutor` Protocol against Amazon SageMaker Training Jobs via the
+boto3 low-level `sagemaker` client.
+Design choices
+--------------
+1. **N independent single-instance jobs, NOT one multi-instance job.**
+   SageMaker's *native* distributed training (``ResourceConfig.InstanceCount > 1``)
+   groups instances into ONE job with an in-cluster NCCL/MPI fabric wired via
+   ``/opt/ml/input/config/resourceconfig.json``. That is the WRONG model for
+   DiLoCo replicas — it would couple replicas through SageMaker's intra-job
+   network and break the "each replica is an independent DiLoCo worker that
+   syncs only through S3" design. So ``launch_replicas`` submits N **separate**
+   training jobs, each with ``ResourceConfig.InstanceCount == 1``, tagged with
+   ``REPLICA_RANK=i`` / ``WORLD_SIZE=N`` via the ``Environment`` map. This
+   mirrors ``ModalSpawnExecutor`` spawning N independent Modal calls.
+2. **Same S3 ``ObjectStoreAllReduce`` rendezvous — DiLoCo math untouched.**
+   Cross-replica communication is EXCLUSIVELY the object-store rendezvous; the
+   executor passes ``rendezvous_uri`` (an ``s3://...`` URI) through to
+   ``replica_entrypoint.py`` unchanged. ``allreduce.py`` / ``MockManager`` /
+   ``make_diloco_outer_loop`` / the trainer all stay byte-for-byte identical.
+3. **Stateless after launch; rank via ``Environment``.** Handle metadata is the
+   ``training_job_name`` (plus submit timestamp). ``replica_entrypoint.py``
+   already reads ``REPLICA_RANK`` from ``os.environ``, so the cleanest channel
+   is the ``Environment`` map (string->string, max 100 entries, value <= 512
+   chars). The container command is set via ``ContainerEntrypoint`` and the
+   rendezvous args are passed via ``AlgorithmSpecification.ContainerArguments``.
+4. **``supports_inter_replica_network = False``.** Separate single-instance
+   training jobs have no mutual network path by design — they rendezvous only
+   through S3. (SageMaker's algo-N container fabric and
+   ``EnableInterContainerTrafficEncryption`` only exist WITHIN a single
+   multi-instance job, which this design deliberately does not use.)
+Load-bearing gotcha — ``EnableNetworkIsolation`` MUST stay ``False``
+--------------------------------------------------------------------
+When ``EnableNetworkIsolation=True`` the training *container* has no outbound
+network access. SageMaker's host-side processes still stage input channels and
+ship CloudWatch logs, but the container itself cannot make S3 GET/PUT calls.
+``ObjectStoreAllReduce`` needs live S3 PUT+GET every outer round, so network
+isolation would silently dead-lock the allreduce poll loop until its timeout.
+This executor pins ``EnableNetworkIsolation=False`` (the API default) and never
+exposes it as a knob. The rendezvous bucket access must instead be granted on
+the execution ``RoleArn`` — the SageMaker analog of EKS IRSA.
+HyperPod <-> EKS 1:1 control-plane mapping (recommended hybrid)
+---------------------------------------------------------------
+Per the SageMaker docs: *"The high-level architecture of Amazon EKS support in
+HyperPod involves a 1-to-1 mapping between an EKS cluster (control plane) and a
+HyperPod cluster (worker nodes) within a VPC."*
+(https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html)
+Consequence for this repo's hybrid: "use HyperPod for the inner GRPO trainer"
+does NOT mean leaving EKS — it means attaching a HyperPod-managed
+(auto-recovering, deep-health-checked, PyTorch-job auto-resume) node-group to
+the SAME EKS control plane that runs the outer loop. The sibling ``EKSExecutor``
+(kubernetes client, Indexed Jobs) therefore targets both plain Karpenter GPU
+nodes AND HyperPod nodes transparently. ``SageMakerExecutor`` (ephemeral
+Training Jobs via boto3) is the SEPARATE bursty-fallback inner-loop path for
+when you don't want a persistent cluster: Training Jobs suit periodic /
+smaller-model / pay-per-use runs; HyperPod suits continuous / large-model /
+persistent runs. Both share the IDENTICAL S3 rendezvous, so a run can move
+between them with zero trainer / loss / DiLoCo changes.
+References
+----------
+- create_training_job: https://docs.aws.amazon.com/boto3/latest/reference/services/sagemaker/client/create_training_job.html
+- describe_training_job: https://docs.aws.amazon.com/boto3/latest/reference/services/sagemaker/client/describe_training_job.html
+- stop_training_job: https://docs.aws.amazon.com/boto3/latest/reference/services/sagemaker/client/stop_training_job.html
+- network isolation: https://repost.aws/knowledge-center/sagemaker-access-network-isolation
+- HyperPod-EKS: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html
+- ADR-005 (executor protocol design)
+"""
+from __future__ import annotations
+import json
+import time
+import uuid
+from collections.abc import Callable, Mapping
+from typing import Any
+from composer_replication.diloco.serverless.executor import (
+    ReplicaHandle,
+)
+# SageMaker TrainingJobStatus -> Protocol status vocabulary.
+# describe_training_job's TrainingJobStatus is EXACTLY one of:
+#   'InProgress' | 'Completed' | 'Failed' | 'Stopping' | 'Stopped'.
+# We map Stopping -> 'running' (transient; still terminating, so collect()
+# keeps waiting) and Stopped -> 'cancelled'.
+_STATUS_MAP = {
+    "InProgress": "running",
+    "Completed": "succeeded",
+    "Failed": "failed",
+    "Stopping": "running",
+    "Stopped": "cancelled",
+}
+# SecondaryStatus values that mean "queued / not yet executing user code" —
+# used to refine an InProgress job into the Protocol's 'pending'.
+_PENDING_SECONDARY = frozenset(
+    {"Starting", "Pending", "LaunchingMLInstances", "PreparingTrainingStack"}
+)
+# Abstract Protocol GPU strings -> SageMaker instance types.
+_GPU_INSTANCE_MAP = {
+    "A100": "ml.p4d.24xlarge",
+    "H100": "ml.p5.48xlarge",
+    "H200": "ml.p5e.48xlarge",
+    "B200": "ml.p6-b200.48xlarge",
+    "L40S": "ml.g6e.12xlarge",
+    "A10G": "ml.g5.2xlarge",
+    "L4": "ml.g6.2xlarge",
+}
+_CLOUDWATCH_LOG_GROUP = "/aws/sagemaker/TrainingJobs"
+class SageMakerExecutor:
+    """Run replicas as N independent SageMaker Training Jobs.
+    Implements the `ServerlessExecutor` Protocol against the boto3
+    ``sagemaker`` client. Each replica is one single-instance training job;
+    cross-replica communication happens only through the shared S3
+    ``ObjectStoreAllReduce`` rendezvous.
+    Args:
+        role_arn: IAM execution role SageMaker assumes for the job. Must grant
+            S3 access to the rendezvous + output buckets (the SageMaker analog of
+            EKS IRSA). The caller's credentials need ``iam:PassRole`` on it.
+        image_uri: ECR image URI for the training container. The image must
+            bake an entrypoint that runs
+            ``python -m composer_replication.diloco.serverless.replica_entrypoint``
+            (this executor also passes ``ContainerEntrypoint`` explicitly so a
+            generic image works too).
+        output_s3_path: ``s3://...`` prefix for ``OutputDataConfig.S3OutputPath``
+            (model artifacts / failure output).
+        instance_type: default SageMaker instance type when ``gpu`` is not
+            mapped (e.g. ``"ml.g5.2xlarge"``). ``gpu=None`` at launch falls
+            back to ``cpu_instance_type``.
+        cpu_instance_type: instance type used when ``gpu`` is ``None`` (CPU
+            smoke tests). Default ``"ml.m5.xlarge"``.
+        volume_size_gb: ``ResourceConfig.VolumeSizeInGB`` per job.
+        run_id: prefix for generated training-job names. Defaults to a short
+            random token so names are unique per region+account.
+        region: AWS region for the lazily-constructed boto3 clients. ``None``
+            uses the ambient boto3 default-region resolution.
+        sagemaker_client: inject a pre-built ``boto3.client('sagemaker')`` (or a
+            mock) instead of constructing one. Used by tests.
+        logs_client: inject a pre-built ``boto3.client('logs')`` (or a mock).
+    Raises:
+        RuntimeError: if boto3 is not installed and no client was injected.
+    """
+    backend_name = "sagemaker"
+    # Separate single-instance jobs have no mutual network path — S3 only.
+    supports_inter_replica_network = False
+    def __init__(
+        self,
+        *,
+        role_arn: str,
+        image_uri: str,
+        output_s3_path: str,
+        instance_type: str = "ml.g5.2xlarge",
+        cpu_instance_type: str = "ml.m5.xlarge",
+        volume_size_gb: int = 100,
+        run_id: str | None = None,
+        region: str | None = None,
+        sagemaker_client: Any = None,
+        logs_client: Any = None,
+    ) -> None:
+        self.role_arn = role_arn
+        self.image_uri = image_uri
+        self.output_s3_path = output_s3_path
+        self.instance_type = instance_type
+        self.cpu_instance_type = cpu_instance_type
+        self.volume_size_gb = volume_size_gb
+        self.run_id = run_id or f"diloco-{uuid.uuid4().hex[:8]}"
+        self._region = region
+        # Lazy boto3 — only constructed if the caller didn't inject a client.
+        # This keeps `import composer_replication.diloco.serverless` free of a
+        # hard boto3 dependency (boto3 lives in the optional [aws] extra), and
+        # lets tests inject a _MockSMClient with zero AWS calls.
+        if sagemaker_client is None:
+            sagemaker_client = self._make_boto3_client("sagemaker")
+        self._client = sagemaker_client
+        self._logs_client = logs_client  # built lazily on first stream_logs()
+        # rank -> {"job_name": str, "result": dict | None}
+        self._handles: dict[int, dict[str, Any]] = {}
+    # -----------------------------------------------------------------
+    # boto3 plumbing (lazy)
+    # -----------------------------------------------------------------
+    def _make_boto3_client(self, service: str) -> Any:
+        try:
+            import boto3
+        except ImportError as e:
+            raise RuntimeError(
+                "SageMakerExecutor requires boto3. Install with "
+                "`pip install -e .[aws]` (or `pip install boto3`). "
+                f"Got: {e!r}"
+            ) from e
+        if self._region is not None:
+            return boto3.client(service, region_name=self._region)
+        return boto3.client(service)
+    def _map_gpu(self, gpu: str | None) -> str:
+        """Translate the Protocol's abstract gpu string to an instance type.
+        ``gpu=None`` -> ``cpu_instance_type`` (smoke tests). A literal
+        ``ml.*`` instance type is returned unchanged; any other unrecognized
+        gpu string falls back to ``instance_type``.
+        """
+        if gpu is None:
+            return self.cpu_instance_type
+        if gpu in _GPU_INSTANCE_MAP:
+            return _GPU_INSTANCE_MAP[gpu]
+        # Caller may have passed a literal "ml.*" instance type.
+        if gpu.startswith("ml."):
+            return gpu
+        return self.instance_type
+    def _job_name(self, rank: int) -> str:
+        """Build a unique, regex-safe training-job name (<= 63 chars).
+        Pattern required by the API: ``[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}``.
+        """
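+        # Truncate the run_id prefix, never the suffix, so the rank and
+        # timestamp that make the name unique always survive.
+        suffix = f"-r{rank:04d}-{int(time.time())}"
+        return f"{self.run_id[: 63 - len(suffix)]}{suffix}"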
+    # -----------------------------------------------------------------
+    # ServerlessExecutor Protocol
+    # -----------------------------------------------------------------
+    def launch_replicas(
+        self,
+        n_replicas: int,
+        entrypoint: str | Callable[..., Any],
+        entrypoint_args: Mapping[str, Any],
+        *,
+        gpu: str | None = None,
+        timeout: int = 3600,
+    ) -> list[ReplicaHandle]:
+        """Submit N independent single-instance SageMaker Training Jobs.
+        Args:
+            n_replicas: number of replicas (= number of training jobs).
+            entrypoint: ignored — the container command is baked into the
+                image / passed as ``ContainerEntrypoint``. Kept for Protocol
+                compatibility.
+            entrypoint_args: must contain ``rendezvous_uri`` (``s3://...``) and
+                ``trainer_module``. Optional: ``trainer_fn`` (default
+                ``"train"``), ``trainer_kwargs`` (dict, JSON-encoded into the
+                container args). The conventional ``rank_env`` key (from
+                ``LocalProcessExecutor``) is ignored — rank goes through the
+                ``Environment`` map instead.
+            gpu: abstract GPU spec mapped to an instance type via ``_map_gpu``.
+                ``None`` -> CPU instance.
+            timeout: ``StoppingCondition.MaxRuntimeInSeconds`` per job.
+        Returns:
+            ``list[ReplicaHandle]`` of length ``n_replicas`` in rank order
+            (``handles[i].rank == i``).
+        """
+        del entrypoint  # container command is baked / passed explicitly
+        if n_replicas < 1:
+            raise ValueError(f"n_replicas must be >= 1, got {n_replicas}")
+        rendezvous_uri = entrypoint_args.get("rendezvous_uri")
+        if not rendezvous_uri:
+            raise ValueError(
+                "entrypoint_args must include 'rendezvous_uri' (the s3:// "
+                "ObjectStoreAllReduce rendezvous prefix)."
+            )
+        trainer_module = entrypoint_args.get("trainer_module")
+        if not trainer_module:
+            raise ValueError(
+                "entrypoint_args must include 'trainer_module' (importable "
+                "module path of the user's train function)."
+            )
+        trainer_fn = entrypoint_args.get("trainer_fn", "train")
+        trainer_kwargs = entrypoint_args.get("trainer_kwargs", {})
+        instance_type = self._map_gpu(gpu)
+        # Container args: each element is a SINGLE token (StackOverflow
+        # 77994925 — `['--world-size', '4']` NOT `['--world-size 4']`).
+        container_args = [
+            "--rendezvous", str(rendezvous_uri),
+            "--world-size", str(n_replicas),
+            "--trainer-module", str(trainer_module),
+            "--trainer-fn", str(trainer_fn),
+            "--trainer-kwargs-json", json.dumps(trainer_kwargs),
+        ]
+        handles: list[ReplicaHandle] = []
+        for rank in range(n_replicas):
+            job_name = self._job_name(rank)
+            request = {
+                "TrainingJobName": job_name,
+                "AlgorithmSpecification": {
+                    "TrainingImage": self.image_uri,
+                    "TrainingInputMode": "File",
+                    "ContainerEntrypoint": [
+                        "python", "-m",
+                        "composer_replication.diloco.serverless.replica_entrypoint",
+                    ],
+                    "ContainerArguments": container_args,
+                },
+                "RoleArn": self.role_arn,
+                # InputDataConfig intentionally omitted — the replica pulls
+                # data via its own code / the S3 rendezvous, not SM channels.
+                "OutputDataConfig": {"S3OutputPath": self.output_s3_path},
+                "ResourceConfig": {
+                    "InstanceType": instance_type,
+                    "InstanceCount": 1,
+                    "VolumeSizeInGB": self.volume_size_gb,
+                },
+                "StoppingCondition": {"MaxRuntimeInSeconds": int(timeout)},
+                # REPLICA_RANK / WORLD_SIZE injected as container env vars;
+                # replica_entrypoint.py reads os.environ['REPLICA_RANK'].
+                "Environment": {
+                    "REPLICA_RANK": str(rank),
+                    "WORLD_SIZE": str(n_replicas),
+                    "RENDEZVOUS_URI": str(rendezvous_uri),
+                },
+                # MUST stay False — True severs the container's S3 access and
+                # dead-locks the allreduce poll loop. See module docstring.
+                "EnableNetworkIsolation": False,
+            }
+            try:
+                self._client.create_training_job(**request)
+            except Exception as e:
+                # Best-effort stop of already-launched siblings, then raise.
+                for prior in handles:
+                    try:
+                        self.cancel(prior)
+                    except Exception:
+                        pass
+                raise RuntimeError(
+                    f"SageMakerExecutor.launch_replicas failed at rank={rank} "
+                    f"of {n_replicas} (already-launched siblings stopped). "
+                    f"Underlying error: {e!r}"
+                ) from e
+            handle = ReplicaHandle(
+                rank=rank,
+                backend_name=self.backend_name,
+                metadata={
+                    "training_job_name": job_name,
+                    "submit_ts": time.time(),
+                },
+            )
+            self._handles[rank] = {"job_name": job_name, "result": None}
+            handles.append(handle)
+        return handles
+    def poll(self, handle: ReplicaHandle) -> str:
+        """Poll a training job's status.
+        Returns one of: ``"pending"`` | ``"running"`` | ``"succeeded"`` |
+        ``"failed"`` | ``"cancelled"``.
+        Maps ``describe_training_job``'s ``TrainingJobStatus`` via
+        ``_STATUS_MAP``; refines ``InProgress`` to ``"pending"`` while the job
+        is still queued (``SecondaryStatus`` in ``_PENDING_SECONDARY``). A
+        vanished job (``ResourceNotFound``) is treated as ``"cancelled"``.
+        """
+        meta = self._handles.get(handle.rank)
+        if meta is None:
+            return "cancelled"
+        if meta["result"] is not None:
+            return meta["result"]["status"]
+        job_name = meta["job_name"]
+        try:
+            resp = self._client.describe_training_job(TrainingJobName=job_name)
+        except Exception as e:
+            if self._is_resource_not_found(e):
+                return "cancelled"
+            raise
+        sm_status = resp.get("TrainingJobStatus", "InProgress")
+        mapped = _STATUS_MAP.get(sm_status, "running")
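+        if sm_status == "InProgress":
+            if resp.get("SecondaryStatus") in _PENDING_SECONDARY:
+                return "pending"
+            return "running"
+        if sm_status not in ("Completed", "Failed", "Stopped"):
+            # Not terminal yet (e.g. "Stopping"): report the mapped status
+            # but do not cache it, so a later poll()/collect() re-checks.
+            return mapped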
+        # Terminal — cache a result dict so collect()/repeat-poll are cheap.
+        meta["result"] = self._terminal_result(handle.rank, sm_status, resp)
+        return mapped
+    def stream_logs(self, handle: ReplicaHandle, *, n_lines: int = 200) -> str:
+        """Read recent CloudWatch logs for this replica's training job.
+        SageMaker writes container stdout/stderr to the
+        ``/aws/sagemaker/TrainingJobs`` log group, stream
+        ``<job-name>/algo-<n>-<epoch>``. We discover the exact stream name by
+        prefix then read the tail. Falls back to a CloudWatch console pointer
+        on any error (mirrors ModalSpawnExecutor's dashboard-URL fallback).
+        """
+        meta = self._handles.get(handle.rank)
+        if meta is None:
+            return f"<replica {handle.rank}: no metadata>"
+        job_name = meta["job_name"]
+        try:
+            logs = self._logs()
+            prefix = f"{job_name}/"
+            streams = logs.describe_log_streams(
+                logGroupName=_CLOUDWATCH_LOG_GROUP,
+                logStreamNamePrefix=prefix,
+                orderBy="LastEventTime",
+                descending=True,
+                limit=1,
+            )
+            stream_list = streams.get("logStreams", [])
+            if not stream_list:
+                return (
+                    f"[rank {handle.rank}] job={job_name}: no CloudWatch log "
+                    f"stream yet (job pending / not started)."
+                )
+            stream_name = stream_list[0]["logStreamName"]
+            events = logs.get_log_events(
+                logGroupName=_CLOUDWATCH_LOG_GROUP,
+                logStreamName=stream_name,
+                limit=n_lines,
+                startFromHead=False,
+            )
+            lines = [e.get("message", "") for e in events.get("events", [])]
+            body = "\n".join(lines) if lines else "<no log events>"
+            return f"[rank {handle.rank}] job={job_name} stream={stream_name}\n{body}"
+        except Exception as e:
+            region = self._region or "<region>"
+            url = (
+                f"https://{region}.console.aws.amazon.com/cloudwatch/home"
+                f"?region={region}#logsV2:log-groups/log-group/"
+                f"$252Faws$252Fsagemaker$252FTrainingJobs"
+            )
+            return (
+                f"[rank {handle.rank}] job={job_name}: log fetch failed "
+                f"({type(e).__name__}: {e!r}).\n  CloudWatch console: {url}"
+            )
+    def cancel(self, handle: ReplicaHandle) -> None:
+        """Best-effort stop of a training job.
+        Calls ``stop_training_job`` (SIGTERM + 120s grace), swallowing
+        ``ResourceNotFound`` and "already terminal" ``ValidationException`` so
+        the contract — "no exception if already terminated" — holds.
+        """
+        meta = self._handles.get(handle.rank)
+        if meta is None:
+            return
+        try:
+            self._client.stop_training_job(TrainingJobName=meta["job_name"])
+        except Exception:
+            # ResourceNotFound, already-Completed/Stopped ValidationException,
+            # transient network blip — all best-effort no-ops.
+            pass
+    def collect(
+        self,
+        handles: list[ReplicaHandle],
+        *,
+        timeout: int | None = None,
+    ) -> list[dict[str, Any]]:
+        """Block until all replicas finish; return per-replica result dicts.
+        Polls ``describe_training_job`` per handle until the job reaches a
+        terminal status (``Completed`` / ``Failed`` / ``Stopped``) or the
+        shared deadline elapses. Returns results aligned to the input handle
+        order (Protocol contract; mirrors ``LocalProcessExecutor.collect``).
+        Each result dict has at least
+        ``{"rank", "status", "exit_code", "error"}``.
+        """
+        deadline = time.time() + (timeout if timeout is not None else 86400)
+        poll_interval = 30.0
+        results: list[dict[str, Any]] = []
+        for h in handles:
+            meta = self._handles.get(h.rank)
+            if meta is None:
+                results.append({
+                    "rank": h.rank,
+                    "status": "cancelled",
+                    "exit_code": None,
+                    "error": "handle has no metadata (cancelled or unknown)",
+                    "result": None,
+                    "training_job_name": h.metadata.get("training_job_name"),
+                })
+                continue
+            # Already cached by an earlier poll()/collect().
+            if meta["result"] is not None:
+                results.append(meta["result"])
+                continue
+            job_name = meta["job_name"]
+            result_dict: dict[str, Any] | None = None
+            while True:
+                try:
+                    resp = self._client.describe_training_job(
+                        TrainingJobName=job_name
+                    )
+                except Exception as e:
+                    if self._is_resource_not_found(e):
+                        result_dict = {
+                            "rank": h.rank,
+                            "status": "cancelled",
+                            "exit_code": None,
+                            "error": "training job not found (deleted?)",
+                            "result": None,
+                            "training_job_name": job_name,
+                        }
+                        break
+                    raise
+                sm_status = resp.get("TrainingJobStatus", "InProgress")
+                if sm_status in ("Completed", "Failed", "Stopped"):
+                    result_dict = self._terminal_result(h.rank, sm_status, resp)
+                    break
+                if time.time() >= deadline:
+                    result_dict = {
+                        "rank": h.rank,
+                        "status": "running",
+                        "exit_code": None,
+                        "error": "timeout before terminal",
+                        "result": None,
+                        "training_job_name": job_name,
+                    }
+                    break
+                # Sleep, but never overrun the deadline.
+                time.sleep(min(poll_interval, max(0.0, deadline - time.time())))
+            # Cache only terminal results (not the timeout 'running' sentinel,
+            # so a later collect() can re-check the job).
+            if result_dict["status"] in ("succeeded", "failed", "cancelled"):
+                meta["result"] = result_dict
+            results.append(result_dict)
+        return results
+    # -----------------------------------------------------------------
+    # Helpers
+    # -----------------------------------------------------------------
+    def _logs(self) -> Any:
+        """Lazily build the CloudWatch Logs client (separate from sagemaker)."""
+        if self._logs_client is None:
+            self._logs_client = self._make_boto3_client("logs")
+        return self._logs_client
+    @staticmethod
+    def _terminal_result(
+        rank: int, sm_status: str, resp: Mapping[str, Any]
+    ) -> dict[str, Any]:
+        """Build a result dict from a terminal describe_training_job response."""
+        mapped = _STATUS_MAP.get(sm_status, "failed")
+        if sm_status == "Completed":
+            exit_code: int | None = 0
+            error = None
+        elif sm_status == "Stopped":
+            exit_code = None
+            error = resp.get("FailureReason")
+        else:  # Failed
+            exit_code = 1
+            error = resp.get("FailureReason") or "training job failed"
+        artifacts = resp.get("ModelArtifacts", {}) or {}
+        return {
+            "rank": rank,
+            "status": mapped,
+            "exit_code": exit_code,
+            "error": error,
+            "result": artifacts.get("S3ModelArtifacts"),
+            "training_job_name": resp.get("TrainingJobName"),
+        }
+    def _is_resource_not_found(self, exc: Exception) -> bool:
+        """True if ``exc`` is the boto3 ResourceNotFound for the sagemaker client.
+        Handles both the typed client exception
+        (``client.exceptions.ResourceNotFound``) and a generic botocore
+        ``ClientError`` whose error code is ``ResourceNotFound`` /
+        ``ValidationException`` naming a missing job — robust across whether a
+        real boto3 client or a mock is in use.
+        """
+        rnf = getattr(getattr(self._client, "exceptions", None),
+                      "ResourceNotFound", None)
+        if rnf is not None and isinstance(exc, rnf):
+            return True
+        # Generic botocore ClientError fallback.
+        resp = getattr(exc, "response", None)
+        if isinstance(resp, Mapping):
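+            err = resp.get("Error", {}) or {}
+            code = err.get("Code", "")
+            if code == "ResourceNotFound":
+                return True
+            # SageMaker reports a missing job as a ValidationException whose
+            # message says the resource was not found; any other validation
+            # error must not be mistaken for a vanished job.
+            message = str(err.get("Message", "")).lower()
+            if code == "ValidationException" and "not found" in message:
+                return True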
+        return False
+__all__ = ["SageMakerExecutor"]
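Reviewer note on the single-token rule for ``ContainerArguments`` in ``launch_replicas``: the sketch below illustrates why each list element must be exactly one argv token. The parser here is hypothetical and uses only the standard library; the real ``replica_entrypoint`` argument handling is not part of this diff, so treat the flag set as an assumption.

```python
import argparse
import json

# Hypothetical parser, for illustration only. It mirrors two of the flags
# that launch_replicas puts into ContainerArguments.
parser = argparse.ArgumentParser(allow_abbrev=False)
parser.add_argument("--world-size", type=int)
parser.add_argument("--trainer-kwargs-json", default="{}")

# One token per list element: the flag and its value are separate entries.
good, good_extra = parser.parse_known_args(
    ["--world-size", "4", "--trainer-kwargs-json", json.dumps({"lr": 0.001})]
)
good_kwargs = json.loads(good.trainer_kwargs_json)

# Flag and value fused into one element: the single token "--world-size 4"
# does not match the "--world-size" option, so the value is never parsed.
bad, bad_extra = parser.parse_known_args(["--world-size 4"])

print(good.world_size, good_kwargs, bad.world_size, bad_extra)
```

With the fused form the option stays at its default and the whole string is left over as an unrecognized argument, which is the failure mode the inline comment in ``launch_replicas`` warns about.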

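Reviewer note on the index-range format exercised by ``test_expand_indexes_singletons_and_ranges`` in the test module that follows: Kubernetes reports an Indexed Job's ``completedIndexes`` as a compressed string such as ``1,3-5,7``. The function below is a minimal sketch of a parser that satisfies those assertions. It is not the ``_expand_indexes`` from ``eks.py`` (that implementation is not shown here), and the name ``expand_indexes`` is hypothetical.

```python
from __future__ import annotations


def expand_indexes(spec: str | None) -> set[int]:
    """Expand a compressed index string like "1,3-5,7" into a set of ints."""
    result: set[int] = set()
    if not spec:
        # None and "" both mean "no indexes reported yet".
        return result
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo_s, hi_s = part.split("-", 1)
            lo, hi = int(lo_s), int(hi_s)
            if lo > hi:
                # Tolerate a reversed range such as "5-3".
                lo, hi = hi, lo
            result.update(range(lo, hi + 1))
        else:
            result.add(int(part))
    return result


print(sorted(expand_indexes("1,3-5,7")))
```

A per-rank status check can then be a set membership test, for example ``rank in expand_indexes(status.completed_indexes)``.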
composer_replication/diloco/serverless/tests/test_eks_executor.py ADDED

	@@ -0,0 +1,625 @@

+"""Tests for EKSExecutor — the Kubernetes Indexed-Job-backed executor.
+These tests exercise the executor's contract WITHOUT a live cluster and
+WITHOUT the `kubernetes` client actually being installed. They:
+  * inject a fake `kubernetes` module into ``sys.modules`` so the executor's
+    lazy ``from kubernetes import client`` / ``...client.exceptions`` calls
+    resolve to recording stand-in V1* model classes (this is the k8s analogue
+    of the modal test's ``_MockFunctionCall``), and
+  * pass mock ``batch_api`` / ``core_api`` via dependency injection (the
+    constructor's ``batch_api=`` / ``core_api=`` args), so no config loading or
+    cluster contact happens.
+For real-cluster integration testing you would gate behind cluster
+availability (e.g. ``config.load_kube_config()`` succeeding), exactly like
+``test_modal_spawn_executor.py`` gates on ``_is_modal_installed()``.
+Run: ``.venv/bin/python -m pytest <thisfile> -q``
+"""
+from __future__ import annotations
+import sys
+import types
+import pytest
+from composer_replication.diloco.serverless import EKSExecutor, ReplicaHandle
+from composer_replication.diloco.serverless.eks import _expand_indexes
+# ---------------------------------------------------------------------
+# Fake `kubernetes` module — recording V1* model stand-ins + ApiException
+# ---------------------------------------------------------------------
+class _Rec:
+    """Generic recording model: stores all ctor kwargs as attributes.
+    Stands in for the kubernetes client's ``V1*`` model classes (V1Job,
+    V1JobSpec, V1Container, V1EnvVar, ...). Every attr the executor sets is
+    inspectable by tests. Mirrors how the modal mock records ``.spawn`` args.
+    """
+    def __init__(self, **kwargs):
+        # Store every ctor kwarg as an attribute. Fields that were never
+        # passed fall through to __getattr__ below and read as None, so
+        # attribute access in assertions never raises AttributeError.
+        for k, v in kwargs.items():
+            setattr(self, k, v)
+    def __getattr__(self, name):  # only called when attr is genuinely absent
+        return None
+class _ApiException(Exception):  # noqa: N818 — mirrors kubernetes.client.exceptions.ApiException name
+    """Stand-in for kubernetes.client.exceptions.ApiException."""
+    def __init__(self, status=None, reason=None, body=None):
+        super().__init__(f"ApiException(status={status})")
+        self.status = status
+        self.reason = reason
+        self.body = body
+# The set of V1* names the executor constructs. Each maps to _Rec.
+_V1_NAMES = [
+    "V1Job",
+    "V1JobSpec",
+    "V1ObjectMeta",
+    "V1PodTemplateSpec",
+    "V1PodSpec",
+    "V1Container",
+    "V1EnvVar",
+    "V1EnvVarSource",
+    "V1ObjectFieldSelector",
+    "V1ResourceRequirements",
+    "V1Toleration",
+    "V1DeleteOptions",
+]
+@pytest.fixture
+def fake_kubernetes(monkeypatch):
+    """Install a fake `kubernetes` package into sys.modules for the test.
+    Provides:
+      - kubernetes.client.<V1*>           -> recording _Rec classes
+      - kubernetes.client.exceptions.ApiException
+      - kubernetes.client.BatchV1Api / CoreV1Api (unused — apis are injected)
+      - kubernetes.config.load_incluster_config / load_kube_config / ConfigException
+    """
+    kubernetes = types.ModuleType("kubernetes")
+    client = types.ModuleType("kubernetes.client")
+    exceptions = types.ModuleType("kubernetes.client.exceptions")
+    config = types.ModuleType("kubernetes.config")
+    for name in _V1_NAMES:
+        setattr(client, name, _Rec)
+    # Default api classes (only hit if NOT injected — we always inject).
+    client.BatchV1Api = lambda *a, **k: pytest.fail("BatchV1Api should be injected")
+    client.CoreV1Api = lambda *a, **k: pytest.fail("CoreV1Api should be injected")
+    exceptions.ApiException = _ApiException
+    client.exceptions = exceptions
+    class _ConfigException(Exception):  # noqa: N818 — mirrors kubernetes.config.ConfigException name
+        pass
+    config.ConfigException = _ConfigException
+    config.load_incluster_config = lambda *a, **k: (_ for _ in ()).throw(
+        _ConfigException("not in cluster")
+    )
+    config.load_kube_config = lambda *a, **k: None
+    kubernetes.client = client
+    kubernetes.config = config
+    monkeypatch.setitem(sys.modules, "kubernetes", kubernetes)
+    monkeypatch.setitem(sys.modules, "kubernetes.client", client)
+    monkeypatch.setitem(sys.modules, "kubernetes.client.exceptions", exceptions)
+    monkeypatch.setitem(sys.modules, "kubernetes.config", config)
+    return kubernetes
+# ---------------------------------------------------------------------
+# Mock BatchV1Api / CoreV1Api (recording stand-ins injected into the executor)
+# ---------------------------------------------------------------------
+class _MockBatchV1Api:
+    """Records create/read-status/delete calls; returns a settable status."""
+    def __init__(self):
+        self.created_jobs: list[tuple[str, object]] = []
+        self.delete_calls: list[dict] = []
+        # status object returned by read_namespaced_job_status().status
+        self.status_obj = _Rec(
+            active=None,
+            succeeded=None,
+            failed=None,
+            completed_indexes=None,
+            failed_indexes=None,
+            conditions=None,
+        )
+        # Optional: raise this ApiException on read (e.g. 404 -> cancelled)
+        self.read_raises: Exception | None = None
+    def create_namespaced_job(self, namespace, body):
+        self.created_jobs.append((namespace, body))
+        return body
+    def read_namespaced_job_status(self, name, namespace):
+        if self.read_raises is not None:
+            raise self.read_raises
+        return _Rec(status=self.status_obj)
+    def delete_namespaced_job(self, name, namespace, body=None):
+        self.delete_calls.append(
+            {
+                "name": name,
+                "namespace": namespace,
+                "propagation_policy": getattr(body, "propagation_policy", None),
+                "grace_period_seconds": getattr(body, "grace_period_seconds", None),
+            }
+        )
+        return _Rec(status="Success")
+class _MockCoreV1Api:
+    """Canned list_namespaced_pod + read_namespaced_pod_log."""
+    def __init__(self, pods=None, logs="line1\nline2\n"):
+        self._pods = pods if pods is not None else []
+        self._logs = logs
+        self.log_calls: list[dict] = []
+        self.list_calls: list[dict] = []
+        self.log_raises: Exception | None = None
+    def list_namespaced_pod(self, namespace, label_selector=None):
+        self.list_calls.append({"namespace": namespace, "label_selector": label_selector})
+        return _Rec(items=list(self._pods))
+    def read_namespaced_pod_log(self, name, namespace, container=None, tail_lines=None):
+        self.log_calls.append(
+            {
+                "name": name,
+                "namespace": namespace,
+                "container": container,
+                "tail_lines": tail_lines,
+            }
+        )
+        if self.log_raises is not None:
+            raise self.log_raises
+        return self._logs
+def _make_pod(name, rank):
+    """Build a fake pod with the completion-index annotation set."""
+    return _Rec(
+        metadata=_Rec(
+            name=name,
+            annotations={"batch.kubernetes.io/job-completion-index": str(rank)},
+            labels={"job-name": name.rsplit("-", 2)[0]},
+        ),
+        status=_Rec(phase="Running"),
+    )
+def _make_executor(fake_kubernetes, *, batch=None, core=None, **kwargs):
+    batch = batch or _MockBatchV1Api()
+    core = core or _MockCoreV1Api()
+    ex = EKSExecutor(
+        image="myrepo/composer-replica:latest",
+        batch_api=batch,
+        core_api=core,
+        **kwargs,
+    )
+    # Speed up collect() loops in tests.
+    ex._collect_poll_interval = lambda: 0.0
+    return ex, batch, core
+# ---------------------------------------------------------------------
+# _expand_indexes — the run-length-range parser
+# ---------------------------------------------------------------------
+def test_expand_indexes_singletons_and_ranges():
+    assert _expand_indexes("1,3-5,7") == {1, 3, 4, 5, 7}
+    assert _expand_indexes("0") == {0}
+    assert _expand_indexes("0-3") == {0, 1, 2, 3}
+    assert _expand_indexes("") == set()
+    assert _expand_indexes(None) == set()
+    # Reversed range is tolerated.
+    assert _expand_indexes("5-3") == {3, 4, 5}
+    # Surrounding whitespace is tolerated.
+    assert _expand_indexes(" 2 , 4-6 ") == {2, 4, 5, 6}
+# ---------------------------------------------------------------------
+# Construction / preconditions
+# ---------------------------------------------------------------------
+def test_missing_kubernetes_raises_runtime_error_when_no_api_injected():
+    """With kubernetes absent AND no injected api, ctor must raise clearly.
+    The import-guard path can ONLY be exercised when `kubernetes` is genuinely
+    not importable in this interpreter. When it IS installed (e.g. via the
+    `[eks]`/`[serverless]` extra in CI), the lazy import succeeds and the ctor
+    legitimately does not raise — so skip rather than assert a false precondition.
+    """
+    import importlib.util
+    if importlib.util.find_spec("kubernetes") is not None:
+        pytest.skip("kubernetes is importable in this interpreter; the absent-path cannot be exercised")
+    with pytest.raises(RuntimeError, match="kubernetes"):
+        EKSExecutor(image="x")
+def test_construction_with_injected_apis_does_not_need_kubernetes():
+    """When both apis are injected, ctor must not require the kubernetes import."""
+    batch = _MockBatchV1Api()
+    core = _MockCoreV1Api()
+    ex = EKSExecutor(image="img", batch_api=batch, core_api=core)
+    assert ex.backend_name == "eks"
+    assert ex.supports_inter_replica_network is False
+    assert ex.image == "img"
+# ---------------------------------------------------------------------
+# launch_replicas — N handles, indexed-job spec correctness
+# ---------------------------------------------------------------------
+def test_launch_returns_n_rank_ordered_handles(fake_kubernetes):
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    handles = ex.launch_replicas(
+        n_replicas=4,
+        entrypoint="ignored",
+        entrypoint_args={"rendezvous_uri": "s3://b/run42/", "world_size": 4},
+    )
+    assert len(handles) == 4
+    for i, h in enumerate(handles):
+        assert isinstance(h, ReplicaHandle)
+        assert h.rank == i
+        assert h.backend_name == "eks"
+        assert h.metadata["rank"] == i
+        # ALL handles share the same job_name / namespace (gang).
+        assert h.metadata["job_name"] == handles[0].metadata["job_name"]
+        assert h.metadata["namespace"] == "default"
+    # Exactly ONE job was created (single Indexed Job topology).
+    assert len(batch.created_jobs) == 1
+def test_launch_creates_indexed_job_spec(fake_kubernetes):
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    ex.launch_replicas(
+        n_replicas=3,
+        entrypoint="ignored",
+        entrypoint_args={"rendezvous_uri": "s3://b/r/", "world_size": 3},
+    )
+    ns, job = batch.created_jobs[0]
+    assert ns == "default"
+    assert job.api_version == "batch/v1"
+    assert job.kind == "Job"
+    spec = job.spec
+    assert spec.completions == 3
+    assert spec.parallelism == 3
+    assert spec.completion_mode == "Indexed"
+    assert spec.backoff_limit == 0
+    assert spec.ttl_seconds_after_finished == 3600
+    # active_deadline_seconds == timeout (default 3600 here).
+    assert spec.active_deadline_seconds == 3600
+    # restart_policy Never (required for Indexed jobs).
+    assert spec.template.spec.restart_policy == "Never"
+def test_launch_rank_env_uses_downward_api_field_ref(fake_kubernetes):
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    ex.launch_replicas(
+        n_replicas=2,
+        entrypoint="ignored",
+        entrypoint_args={"rendezvous_uri": "s3://b/r/", "world_size": 2},
+    )
+    _, job = batch.created_jobs[0]
+    env = job.spec.template.spec.containers[0].env
+    by_name = {e.name: e for e in env}
+    # REPLICA_RANK from the downward-API annotation (NOT a literal value).
+    rr = by_name["REPLICA_RANK"]
+    assert rr.value is None
+    field_ref = rr.value_from.field_ref
+    assert (
+        field_ref.field_path
+        == "metadata.annotations['batch.kubernetes.io/job-completion-index']"
+    )
+    # WORLD_SIZE is a literal string.
+    assert by_name["WORLD_SIZE"].value == "2"
+    # rendezvous_uri passed through as an upper-cased literal env var.
+    assert by_name["RENDEZVOUS_URI"].value == "s3://b/r/"
+def test_launch_strips_rank_env_kwarg(fake_kubernetes):
+    """`rank_env` is the LocalProcessExecutor convention — must not become env."""
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    ex.launch_replicas(
+        n_replicas=1,
+        entrypoint="ignored",
+        entrypoint_args={"rank_env": "REPLICA_RANK", "rendezvous_uri": "s3://x/"},
+    )
+    _, job = batch.created_jobs[0]
+    env_names = {e.name for e in job.spec.template.spec.containers[0].env}
+    assert "RANK_ENV" not in env_names
+    assert "RENDEZVOUS_URI" in env_names
+def test_launch_gpu_limit_is_string(fake_kubernetes):
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    ex.launch_replicas(
+        n_replicas=2,
+        entrypoint="ignored",
+        entrypoint_args={"rendezvous_uri": "s3://x/"},
+        gpu="A100",
+    )
+    _, job = batch.created_jobs[0]
+    container = job.spec.template.spec.containers[0]
+    limits = container.resources.limits
+    assert limits["nvidia.com/gpu"] == "1"
+    # MUST be a string, not an int.
+    assert isinstance(limits["nvidia.com/gpu"], str)
+    # GPU node selector merged in.
+    node_selector = job.spec.template.spec.node_selector
+    assert node_selector["node.kubernetes.io/instance-type"] == "p4d.24xlarge"
+    # GPU NoSchedule toleration auto-added.
+    tols = job.spec.template.spec.tolerations
+    assert any(
+        t.key == "nvidia.com/gpu" and t.effect == "NoSchedule" for t in tols
+    )
+def test_launch_cpu_only_omits_gpu_limit(fake_kubernetes):
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    ex.launch_replicas(
+        n_replicas=2,
+        entrypoint="ignored",
+        entrypoint_args={"rendezvous_uri": "s3://x/"},
+        gpu=None,
+    )
+    _, job = batch.created_jobs[0]
+    limits = job.spec.template.spec.containers[0].resources.limits
+    # No GPU -> no nvidia.com/gpu key at all (limits is None or empty).
+    assert "nvidia.com/gpu" not in (limits or {})
+def test_launch_passes_service_account_and_runtime_class(fake_kubernetes):
+    ex, batch, _ = _make_executor(
+        fake_kubernetes,
+        service_account_name="diloco-irsa-sa",
+        runtime_class_name="gvisor",
+    )
+    ex.launch_replicas(
+        n_replicas=1,
+        entrypoint="ignored",
+        entrypoint_args={"rendezvous_uri": "s3://x/"},
+    )
+    _, job = batch.created_jobs[0]
+    pod_spec = job.spec.template.spec
+    assert pod_spec.service_account_name == "diloco-irsa-sa"
+    assert pod_spec.runtime_class_name == "gvisor"
+def test_launch_timeout_becomes_active_deadline(fake_kubernetes):
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    ex.launch_replicas(
+        n_replicas=1,
+        entrypoint="ignored",
+        entrypoint_args={"rendezvous_uri": "s3://x/"},
+        timeout=7200,
+    )
+    _, job = batch.created_jobs[0]
+    assert job.spec.active_deadline_seconds == 7200
+def test_launch_uses_default_entrypoint_command(fake_kubernetes):
+    ex, batch, _ = _make_executor(fake_kubernetes)
+    ex.launch_replicas(
+        n_replicas=1, entrypoint="ignored", entrypoint_args={"rendezvous_uri": "s3://x/"}
+    )
+    _, job = batch.created_jobs[0]
+    cmd = job.spec.template.spec.containers[0].command
+    assert cmd == [
+        "python",
+        "-m",
+        "composer_replication.diloco.serverless.replica_entrypoint",
+    ]
+def test_launch_rejects_zero_or_negative(fake_kubernetes):
+    ex, _, _ = _make_executor(fake_kubernetes)
+    with pytest.raises(ValueError, match="n_replicas"):
+        ex.launch_replicas(n_replicas=0, entrypoint="x", entrypoint_args={})
+    with pytest.raises(ValueError, match="n_replicas"):
+        ex.launch_replicas(n_replicas=-1, entrypoint="x", entrypoint_args={})
+# ---------------------------------------------------------------------
+# poll — state mapping from completed/failed indexes + active count
+# ---------------------------------------------------------------------
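+def _launch_two(fake_kubernetes, batch=None, core=None):
+    """Launch one shared job and return ``(ex, batch, core, handles)``.
+    Despite the name, this launches FOUR replicas so tests can address ranks 0-3.
+    """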
+    ex, batch, core = _make_executor(fake_kubernetes, batch=batch, core=core)
+    handles = ex.launch_replicas(
+        n_replicas=4, entrypoint="x", entrypoint_args={"rendezvous_uri": "s3://x/"}
+    )
+    return ex, batch, core, handles
+def test_poll_pending_when_nothing_active(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    batch.status_obj = _Rec(active=0, completed_indexes=None, failed_indexes=None)
+    assert ex.poll(handles[0]) == "pending"
+def test_poll_running_when_active(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    batch.status_obj = _Rec(active=4, completed_indexes=None, failed_indexes=None)
+    assert ex.poll(handles[2]) == "running"
+def test_poll_succeeded_when_rank_in_completed_indexes(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    # completed_indexes "0,2-3" -> ranks {0,2,3} succeeded; rank 1 still running.
+    batch.status_obj = _Rec(
+        active=1, completed_indexes="0,2-3", failed_indexes=None
+    )
+    assert ex.poll(handles[0]) == "succeeded"
+    assert ex.poll(handles[2]) == "succeeded"
+    assert ex.poll(handles[3]) == "succeeded"
+    assert ex.poll(handles[1]) == "running"
+def test_poll_failed_when_rank_in_failed_indexes(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    batch.status_obj = _Rec(
+        active=0, completed_indexes="0", failed_indexes="1,3"
+    )
+    assert ex.poll(handles[1]) == "failed"
+    assert ex.poll(handles[3]) == "failed"
+    assert ex.poll(handles[0]) == "succeeded"
+def test_poll_failed_on_whole_job_failed_condition(fake_kubernetes):
+    """DeadlineExceeded etc.: a Failed condition with no per-index info -> failed."""
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    batch.status_obj = _Rec(
+        active=0,
+        completed_indexes=None,
+        failed_indexes=None,
+        conditions=[_Rec(type="Failed", status="True", reason="DeadlineExceeded")],
+    )
+    assert ex.poll(handles[0]) == "failed"
+def test_poll_cancelled_on_404(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    batch.read_raises = _ApiException(status=404)
+    assert ex.poll(handles[0]) == "cancelled"
+def test_poll_reraises_non_404_api_exception(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    batch.read_raises = _ApiException(status=500)
+    with pytest.raises(_ApiException):
+        ex.poll(handles[0])
+# ---------------------------------------------------------------------
+# cancel — Background propagation on the shared job, idempotent
+# ---------------------------------------------------------------------
+def test_cancel_uses_background_propagation_on_shared_job(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    ex.cancel(handles[2])
+    assert len(batch.delete_calls) == 1
+    call = batch.delete_calls[0]
+    assert call["propagation_policy"] == "Background"
+    assert call["grace_period_seconds"] == 0
+    # Cancelling ANY rank deletes the WHOLE shared job (gang semantics).
+    assert call["name"] == handles[0].metadata["job_name"]
+    assert call["namespace"] == "default"
+def test_cancel_swallows_404(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    def _raise_404(*args, **kwargs):  # accept any delete_namespaced_job call shape
+        raise _ApiException(status=404)
+    batch.delete_namespaced_job = _raise_404
+    # Must NOT raise (already deleted == success per the Protocol).
+    ex.cancel(handles[0])
+def test_cancel_unknown_handle_is_noop(fake_kubernetes):
+    ex, batch, _, _ = _launch_two(fake_kubernetes)
+    fake = ReplicaHandle(rank=99, backend_name="eks", metadata={})
+    ex.cancel(fake)  # no job_name in metadata -> no-op, no delete call
+    assert len(batch.delete_calls) == 0
+# ---------------------------------------------------------------------
+# stream_logs — find pod by completion-index annotation
+# ---------------------------------------------------------------------
+def test_stream_logs_reads_pod_for_rank(fake_kubernetes):
+    pods = [
+        _make_pod("diloco-abcd1234-0-xyz", 0),
+        _make_pod("diloco-abcd1234-1-xyz", 1),
+    ]
+    core = _MockCoreV1Api(pods=pods, logs="hello from rank 1\n")
+    ex, _, core2, handles = _launch_two(fake_kubernetes, core=core)
+    out = ex.stream_logs(handles[1], n_lines=50)
+    assert out == "hello from rank 1\n"
+    # Read the right pod, container 'replica', tail_lines honored.
+    last = core.log_calls[-1]
+    assert last["name"] == "diloco-abcd1234-1-xyz"
+    assert last["container"] == "replica"
+    assert last["tail_lines"] == 50
+def test_stream_logs_placeholder_when_pod_missing(fake_kubernetes):
+    core = _MockCoreV1Api(pods=[])  # no pods yet
+    ex, _, _, handles = _launch_two(fake_kubernetes, core=core)
+    out = ex.stream_logs(handles[0])
+    assert "rank 0" in out
+    assert "not started" in out or "no logs" in out
+def test_stream_logs_placeholder_on_400(fake_kubernetes):
+    pods = [_make_pod("diloco-abcd1234-0-xyz", 0)]
+    core = _MockCoreV1Api(pods=pods)
+    core.log_raises = _ApiException(status=400)  # pod not started yet
+    ex, _, _, handles = _launch_two(fake_kubernetes, core=core)
+    out = ex.stream_logs(handles[0])
+    assert "rank 0" in out
+# ---------------------------------------------------------------------
+# collect — per-rank result dicts in handles order
+# ---------------------------------------------------------------------
+def test_collect_returns_terminal_results_in_order(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    # All four ranks done: 0-2 succeeded, 3 failed.
+    batch.status_obj = _Rec(
+        active=0, completed_indexes="0-2", failed_indexes="3"
+    )
+    results = ex.collect(handles, timeout=5)
+    assert len(results) == 4
+    for i, r in enumerate(results):
+        assert r["rank"] == i
+        assert r["job_name"] == handles[0].metadata["job_name"]
+    assert results[0]["status"] == "succeeded" and results[0]["exit_code"] == 0
+    assert results[1]["status"] == "succeeded"
+    assert results[2]["status"] == "succeeded"
+    assert results[3]["status"] == "failed" and results[3]["exit_code"] == 1
+    assert results[3]["error"] is not None
+def test_collect_returns_non_terminal_state_at_deadline(fake_kubernetes):
+    ex, batch, _, handles = _launch_two(fake_kubernetes)
+    # Never finishes: active stays > 0.
+    batch.status_obj = _Rec(active=4, completed_indexes=None, failed_indexes=None)
+    results = ex.collect(handles, timeout=0)  # immediate deadline
+    assert len(results) == 4
+    for r in results:
+        assert r["status"] in ("running", "pending")
+        assert r["exit_code"] is None
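Review note on the per-rank status mapping these tests pin down. A minimal sketch, assuming the executor derives each rank's state from the Indexed Job's `completed_indexes` / `failed_indexes` strings (Kubernetes encodes them as comma-separated indexes and inclusive ranges, e.g. `0,2-3`) plus the `active` count. The names `parse_indexes` and `rank_state` are hypothetical and are not taken from `eks.py`; the whole-job `Failed` condition and the 404 path are left out.

```python
def parse_indexes(spec):
    """Expand an Indexed Job index string such as "0,2-3" into a set of ints."""
    ranks = set()
    if not spec:  # None or "" means no indexes reported yet
        return ranks
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ranks.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            ranks.add(int(part))
    return ranks


def rank_state(rank, active, completed_indexes, failed_indexes):
    """Hypothetical per-rank mapping mirroring what the tests above assert."""
    if rank in parse_indexes(failed_indexes):
        return "failed"
    if rank in parse_indexes(completed_indexes):
        return "succeeded"
    # Not terminal: any active pod counts as running, otherwise still pending.
    return "running" if active else "pending"


states = [rank_state(r, 1, "0,2-3", None) for r in range(4)]
```

With `active=1` and `completed_indexes="0,2-3"`, rank 1 is the only rank still reported as running, which is the case `test_poll_succeeded_when_rank_in_completed_indexes` checks.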

composer_replication/diloco/serverless/tests/test_sagemaker_executor.py ADDED

	@@ -0,0 +1,244 @@

+"""Tests for SageMakerExecutor (composer_replication.diloco.serverless.sagemaker).
+The executor is exercised with an INJECTED mock boto3 sagemaker client (the
+`sagemaker_client=` ctor arg), so these run on any host without boto3 or AWS
+credentials — mirroring the _MockFunctionCall pattern in
+test_modal_spawn_executor.py and the _MockBatchV1Api pattern in
+test_eks_executor.py.
+Closes the test-coverage gap left when the SageMakerExecutor was first written
+without a test module (caught during Wave-2 integration, 2026-06-09).
+"""
+from __future__ import annotations
+import importlib.util
+import pytest
+from composer_replication.diloco.serverless import SageMakerExecutor
+from composer_replication.diloco.serverless.executor import ReplicaHandle
+# ---------------------------------------------------------------------
+# Mock boto3 sagemaker client
+# ---------------------------------------------------------------------
+class _MockSMClient:
+    """Records create/stop calls and serves a scripted status per job name."""
+    def __init__(self):
+        self.created: list[dict] = []
+        self.stopped: list[str] = []
+        # job_name -> (TrainingJobStatus, SecondaryStatus)
+        self._status: dict[str, tuple[str, str]] = {}
+        self.raise_not_found_on: set[str] = set()
+    def create_training_job(self, **request):
+        self.created.append(request)
+        # default a newly-created job to InProgress/Starting (== pending)
+        self._status[request["TrainingJobName"]] = ("InProgress", "Starting")
+        return {"TrainingJobArn": f"arn:aws:sagemaker:::training-job/{request['TrainingJobName']}"}
+    def describe_training_job(self, TrainingJobName):  # noqa: N803 (boto3 casing)
+        if TrainingJobName in self.raise_not_found_on:
+            raise _ResourceNotFoundError(f"job {TrainingJobName} not found")
+        status, secondary = self._status.get(TrainingJobName, ("InProgress", "Training"))
+        return {
+            "TrainingJobName": TrainingJobName,
+            "TrainingJobStatus": status,
+            "SecondaryStatus": secondary,
+            "TrainingJobArn": f"arn:aws:sagemaker:::training-job/{TrainingJobName}",
+        }
+    def stop_training_job(self, TrainingJobName):  # noqa: N803
+        self.stopped.append(TrainingJobName)
+    # test helper
+    def set_status(self, job_name, status, secondary="Completed"):
+        self._status[job_name] = (status, secondary)
+class _ResourceNotFoundError(Exception):
+    """Stand-in for botocore ResourceNotFound (the executor matches on name/text)."""
+    def __init__(self, msg):
+        super().__init__(msg)
+        # botocore-style response shape some impls check
+        self.response = {"Error": {"Code": "ResourceNotFound", "Message": msg}}
+def _make_executor(client=None):
+    return SageMakerExecutor(
+        image_uri="123.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",
+        role_arn="arn:aws:iam::123:role/SMRole",
+        output_s3_path="s3://bucket/out/",
+        region="us-east-1",
+        sagemaker_client=client or _MockSMClient(),
+    )
+_VALID_ARGS = {
+    "rendezvous_uri": "s3://bucket/rendezvous/run1/",
+    "trainer_module": "my_pkg.trainer",
+}
+# ---------------------------------------------------------------------
+# Construction
+# ---------------------------------------------------------------------
+def test_backend_identity():
+    ex = _make_executor()
+    assert ex.backend_name == "sagemaker"
+    assert ex.supports_inter_replica_network is False
+def test_missing_boto3_raises_when_no_client_injected():
+    """The import-guard path only fires when boto3 is genuinely absent."""
+    if importlib.util.find_spec("boto3") is not None:
+        pytest.skip("boto3 importable; absent-path cannot be exercised")
+    with pytest.raises(RuntimeError, match="boto3"):
+        SageMakerExecutor(
+            image_uri="x", role_arn="r", output_s3_path="s3://b/o/",
+        )
+def test_construction_with_injected_client_needs_no_boto3():
+    ex = _make_executor()
+    assert ex is not None
+# ---------------------------------------------------------------------
+# launch_replicas
+# ---------------------------------------------------------------------
+def test_launch_returns_rank_ordered_handles():
+    client = _MockSMClient()
+    ex = _make_executor(client)
+    handles = ex.launch_replicas(3, entrypoint="ignored", entrypoint_args=_VALID_ARGS)
+    assert len(handles) == 3
+    assert [h.rank for h in handles] == [0, 1, 2]
+    assert all(isinstance(h, ReplicaHandle) and h.backend_name == "sagemaker" for h in handles)
+    assert len(client.created) == 3
+def test_launch_injects_rank_world_size_and_rendezvous_env():
+    client = _MockSMClient()
+    ex = _make_executor(client)
+    ex.launch_replicas(2, entrypoint="ignored", entrypoint_args=_VALID_ARGS)
+    for rank, req in enumerate(client.created):
+        env = req["Environment"]
+        assert env["REPLICA_RANK"] == str(rank)
+        assert env["WORLD_SIZE"] == "2"
+        assert env["RENDEZVOUS_URI"] == _VALID_ARGS["rendezvous_uri"]
+        # network isolation MUST stay False (else S3 rendezvous deadlocks)
+        assert req["EnableNetworkIsolation"] is False
+        assert req["OutputDataConfig"]["S3OutputPath"] == "s3://bucket/out/"
+        assert req["ResourceConfig"]["InstanceCount"] == 1
+def test_launch_validates_n_replicas():
+    ex = _make_executor()
+    with pytest.raises(ValueError, match="n_replicas"):
+        ex.launch_replicas(0, entrypoint="x", entrypoint_args=_VALID_ARGS)
+def test_launch_requires_rendezvous_and_trainer_module():
+    ex = _make_executor()
+    with pytest.raises(ValueError, match="rendezvous_uri"):
+        ex.launch_replicas(1, entrypoint="x", entrypoint_args={"trainer_module": "m"})
+    with pytest.raises(ValueError, match="trainer_module"):
+        ex.launch_replicas(1, entrypoint="x", entrypoint_args={"rendezvous_uri": "s3://b/r/"})
+def test_launch_partial_failure_stops_siblings_and_raises():
+    class _FailingClient(_MockSMClient):
+        def create_training_job(self, **request):
+            if len(self.created) >= 2:  # 3rd create fails
+                raise RuntimeError("ThrottlingException")
+            return super().create_training_job(**request)
+    client = _FailingClient()
+    ex = _make_executor(client)
+    with pytest.raises(RuntimeError, match="rank=2"):
+        ex.launch_replicas(3, entrypoint="x", entrypoint_args=_VALID_ARGS)
+    # the two already-launched siblings were best-effort stopped
+    assert len(client.stopped) == 2
+# ---------------------------------------------------------------------
+# poll status mapping
+# ---------------------------------------------------------------------
+def test_poll_status_mapping():
+    client = _MockSMClient()
+    ex = _make_executor(client)
+    handles = ex.launch_replicas(1, entrypoint="x", entrypoint_args=_VALID_ARGS)
+    h = handles[0]
+    job = client.created[0]["TrainingJobName"]
+    client.set_status(job, "InProgress", "Starting")
+    assert ex.poll(h) == "pending"
+    client.set_status(job, "InProgress", "Training")
+    assert ex.poll(h) == "running"
+    client.set_status(job, "Completed")
+    assert ex.poll(h) == "succeeded"
+def test_poll_failed_and_stopped():
+    client = _MockSMClient()
+    ex = _make_executor(client)
+    h = ex.launch_replicas(1, entrypoint="x", entrypoint_args=_VALID_ARGS)[0]
+    job = client.created[0]["TrainingJobName"]
+    client.set_status(job, "Failed")
+    assert ex.poll(h) == "failed"
+    client2 = _MockSMClient()
+    ex2 = _make_executor(client2)
+    h2 = ex2.launch_replicas(1, entrypoint="x", entrypoint_args=_VALID_ARGS)[0]
+    job2 = client2.created[0]["TrainingJobName"]
+    client2.set_status(job2, "Stopped")
+    assert ex2.poll(h2) == "cancelled"
+def test_poll_vanished_job_is_cancelled():
+    client = _MockSMClient()
+    ex = _make_executor(client)
+    h = ex.launch_replicas(1, entrypoint="x", entrypoint_args=_VALID_ARGS)[0]
+    client.raise_not_found_on.add(client.created[0]["TrainingJobName"])
+    assert ex.poll(h) == "cancelled"
+def test_poll_unknown_handle_is_cancelled():
+    ex = _make_executor()
+    orphan = ReplicaHandle(rank=99, backend_name="sagemaker", metadata={})
+    assert ex.poll(orphan) == "cancelled"
+# ---------------------------------------------------------------------
+# cancel
+# ---------------------------------------------------------------------
+def test_cancel_calls_stop_training_job():
+    client = _MockSMClient()
+    ex = _make_executor(client)
+    h = ex.launch_replicas(1, entrypoint="x", entrypoint_args=_VALID_ARGS)[0]
+    ex.cancel(h)
+    assert client.stopped == [client.created[0]["TrainingJobName"]]
+def test_cancel_swallows_errors():
+    class _RaisingStop(_MockSMClient):
+        def stop_training_job(self, TrainingJobName):  # noqa: N803
+            raise _ResourceNotFoundError("already terminal")
+    client = _RaisingStop()
+    ex = _make_executor(client)
+    h = ex.launch_replicas(1, entrypoint="x", entrypoint_args=_VALID_ARGS)[0]
+    ex.cancel(h)  # must not raise
+    # unknown handle must also be a no-op
+    ex.cancel(ReplicaHandle(rank=42, backend_name="sagemaker", metadata={}))
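Review note on the status mapping exercised by the poll tests. A minimal sketch, assuming the executor maps the `(TrainingJobStatus, SecondaryStatus)` pair from `describe_training_job` onto the Protocol's states. Only the pairs asserted in the tests above are confirmed by this diff. The extra pending secondary statuses and the handling of `Stopping` are assumptions of this sketch, and `map_sagemaker_status` is a hypothetical name, not the function in `sagemaker.py`.

```python
# Secondary statuses treated as "not yet training". Only "Starting" is
# confirmed by the tests; the others are assumptions for illustration.
_PENDING_SECONDARY = {
    "Starting",
    "LaunchingMLInstances",
    "PreparingTrainingStack",
    "Downloading",
}

_TERMINAL = {
    "Completed": "succeeded",
    "Failed": "failed",
    "Stopped": "cancelled",
}


def map_sagemaker_status(primary, secondary):
    """Map a SageMaker status pair to pending/running/succeeded/failed/cancelled."""
    if primary in _TERMINAL:
        return _TERMINAL[primary]
    if primary == "InProgress" and secondary in _PENDING_SECONDARY:
        return "pending"
    # Assumption: anything else (including "Stopping") is still winding
    # through the job lifecycle and is reported as running.
    return "running"


observed = [
    map_sagemaker_status("InProgress", "Starting"),
    map_sagemaker_status("InProgress", "Training"),
    map_sagemaker_status("Completed", "Completed"),
]
```

Keeping the terminal states in a lookup table makes it easy to see that `Stopped` is the only status reported as `cancelled` here; a vanished job (the `ResourceNotFound` path) would be handled separately, before this mapping is reached.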

composer_replication/safety/__init__.py ADDED

	@@ -0,0 +1,34 @@

+"""composer_replication.safety — run-level collapse safeguards.
+The #2 collapse safeguard for the self-evolving RL flywheel: a held-out disjoint
+eval + a depth/generation kill-switch. The per-task controls live in
+``composer_replication.datagen`` (4-gate validator, ``HackMonitor`` provenance,
+sandbox denylist); this package adds the missing ACROSS-GENERATION / run-level
+control that watches in-loop (proxy) reward against a disjoint held-out (real)
+eval and HALTS the run when collapse / reward-hacking is caught in the act.
+Public surface:
+  - HeldOutGuard   — the stateful kill-switch (kill_switch.py)
+  - TripwireStatus — the structured per-update verdict (.fire / .halt / .reason /
+                     .proxy_real_gap)
+  - CollapseStopError   — typed exception for exception-based trainer control flow
+  - kl_token_trust_filter — per-token KL trust-region mask (torchrl KL-Mask analog)
+Pure-Python, no torch / cloud deps. See docs/adrs/ADR-015-*.md and the
+'holdout-killswitch' research digest.
+"""
+from __future__ import annotations
+from composer_replication.safety.kill_switch import (
+    CollapseStopError,
+    HeldOutGuard,
+    TripwireStatus,
+    kl_token_trust_filter,
+)
+__all__ = [
+    "HeldOutGuard",
+    "TripwireStatus",
+    "CollapseStopError",
+    "kl_token_trust_filter",
+]

composer_replication/safety/kill_switch.py ADDED

	@@ -0,0 +1,447 @@

+"""kill_switch.py — held-out collapse tripwire (the #2 collapse safeguard).
+This is the missing RUN-LEVEL / across-generation control for the self-evolving
+RL flywheel. The per-task controls already exist in ``composer_replication.datagen``
+(the 4-gate solvability validator, the ``HackMonitor`` provenance check, and the
+sandbox denylist); this module sits ABOVE them and watches the whole run.
+Rationale (the literature is unambiguous that a held-out eval + hard stop is the
+load-bearing control, not a nice-to-have):
+  - **Reward hacking rises monotonically with optimization depth.** Zhao et al.,
+    "Reward Hacking in Self-Improving Code Agents" (ICLR 2026 Workshop on RSI,
+    OpenReview ``ikrQWGgxYg``) show that going from 10 -> 100 optimization steps
+    drives the hacking rate from 26.4% to 57.8% (+31.4 points), and that
+    73.8% of KernelBench / 46.8% of ALE-Bench optimizations show *proxy gains
+    without real gains*. They define **Hacking Gap = proxy gain - real gain**;
+    this module's ``proxy_real_gap()`` is exactly that quantity. They label an
+    optimization reward-hacking when it "improves the public metric WITHOUT
+    improving the private metric" — the canonical signature this tripwire fires on.
+  - **Self-critique alone is insufficient.** The same paper's "retrospection"
+    self-critique sometimes *increased* hacking; their conclusion: "mitigating
+    reward hacking likely requires stronger evaluations and constraints beyond
+    self-critique alone." So we build a genuinely disjoint held-out eval plus a
+    hard stop, not a critique hook.
+  - **Held-out eval is necessary but NOT sufficient by itself.** EvilGenie
+    (arXiv 2511.21654) found "only minimal improvement from the use of held out
+    test cases" in isolation and that "holdout tests have many surprising failure
+    modes." This module is therefore explicitly *defense-in-depth*, layered ON
+    TOP of ``HackMonitor`` (provenance) — neither is sufficient alone, matching
+    the repo's existing defense-in-depth framing in ``datagen/monitor.py``.
+  - **Closed-loop RL on self-generated data collapses.** The self-evolving-agents
+    survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) §8.3 names "model
+    collapse from closed-loop RL on static synthetic data" and prescribes
+    "continuous monitoring ... to detect long-horizon value drift" — i.e. a
+    per-generation online tripwire, not a one-time eval. Shumailov et al. (Nature
+    2024, "AI models collapse when trained on recursively generated data") show
+    self-training first loses the distribution tails, then converges to a
+    low-variance point estimate; the mitigation that matters here is that the
+    held-out eval must stay anchored to REAL tasks that are NEVER fed back to the
+    generator (see ``HeldoutSplit``), otherwise the eval drifts with the train set.
+  - **KL-to-init hard stop.** The GRPO "healthy progression" band (Orchestra
+    Research GRPO SKILL) climbs 0.02 -> 0.05 -> 0.08 -> 0.12 nats/token over a
+    run, with 0.08 the top of the "good progression" band and just below the
+    code-generation drift zone (0.05-0.15 per-token); >0.5 is "diverging too
+    much." So 0.08 nats/token is a sound HARD-STOP default. Catastrophic Goodhart
+    (OpenReview ``UXuBzWoZGK``) proves KL regularization alone does NOT prevent
+    heavy-tailed reward misspecification, so the KL hard stop is ONE tripwire
+    among several, never the sole control.
+UNITS GOTCHA (load-bearing): the ``kl_to_init`` this module consumes is
+**token-mean KL in nats/token**, matching the repo convention in
+``composer_replication.integrations.altered_minds.kl_logging.token_mean_kl``.
+A token-mean KL is NOT comparable to a sequence-level / sequence-summed KL
+(whose healthy band is ~0.05-10). The 0.08 default is per-token. Do not pass a
+sequence-summed KL into the per-token hard stop — it will fire instantly.
+This module is pure-Python: no torch, no cloud deps. ``kl_to_init`` is just a
+float the caller passes (computed upstream by ``token_mean_kl``). It is fully
+CPU-testable.
+"""
+from __future__ import annotations
+from dataclasses import dataclass, field
+class CollapseStopError(RuntimeError):
+    """Raised (by the caller, optionally) when the tripwire fires a hard stop.
+    The trainer loop can either check ``TripwireStatus.fire`` and stop softly,
+    or call ``HeldOutGuard.raise_if_fired(status)`` to convert a fired verdict
+    into this typed exception. Carries the structured verdict for logging.
+    """
+    def __init__(self, status: TripwireStatus) -> None:
+        super().__init__(status.reason)
+        self.status = status
+@dataclass(frozen=True)
+class TripwireStatus:
+    """Structured verdict returned by every ``HeldOutGuard.update(...)`` call.
+    Attributes:
+        fire: True => the run should HALT (collapse / reward-hacking detected).
+        reason: human-readable WHY (empty string when ``fire`` is False), so the
+            trainer can log exactly which tripwire tripped, mirroring how
+            ``datagen/monitor.py`` logs suspected hacks for review.
+        step: the round/generation index this verdict was computed at.
+        proxy_real_gap: the RSI "Hacking Gap" at this step = (in-loop reward gain
+            since baseline) - (held-out score gain since baseline). Positive and
+            widening => the proxy is improving faster than the real eval, or
+            improving while the real eval declines.
+        in_loop_ema: EMA of the in-loop / proxy reward at this step.
+        heldout_ema: EMA of the held-out / real eval score at this step.
+        kl_ema: EMA of ``kl_to_init`` (nats/token), or None if never supplied.
+    """
+    fire: bool
+    reason: str
+    step: int
+    proxy_real_gap: float
+    in_loop_ema: float
+    heldout_ema: float
+    kl_ema: float | None = None
+    # `halt` is a documented alias for `fire` — the task spec describes a
+    # `should_halt()` / verdict with a `halt` field; expose both names so callers
+    # reading either convention work.
+    @property
+    def halt(self) -> bool:
+        return self.fire
+@dataclass
+class HeldOutGuard:
+    """Across-generation collapse / reward-hacking kill-switch (HeldOutGuard).
+    Tracks, per generation/round: in-loop (proxy) oracle reward, held-out (real)
+    eval score, and optional KL-to-init / entropy / reward-std. Computes the
+    proxy-minus-real "Hacking Gap" tripwire and fires a structured ``halt``
+    verdict when collapse is caught in the act.
+    The guard is **stateful**: call ``update(round_idx, ...)`` once per checkpoint
+    in the trainer loop (the same cadence at which ``DifficultyCurriculum.update``
+    is called). It maintains denoised EMAs of every metric (raw single-step
+    values are too noisy to threshold — theneuralbase early-stopping guidance) and
+    returns a ``TripwireStatus``.
+    Fires (``fire=True``) when ANY of:
+      (a) **collapse-caught-in-the-act** — the in-loop reward EMA is RISING while
+          the held-out score EMA has DECLINED for >= ``decline_patience``
+          consecutive checkpoints (default 3, matching the "monotone for >=3
+          checkpoints" rule). This is the canonical reward-hacking signature.
+      (b) **KL breach** — the ``kl_to_init`` EMA exceeds ``kl_hard_stop`` (default
+          0.08 nats/token) on/after ``min_steps``.
+      (c) **proxy-real gap blowout** — the Hacking Gap (proxy gain - real gain
+          since baseline) widens beyond ``max_proxy_real_gap``, even if held-out
+          has not strictly declined for the full patience window (a fast
+          single-generation divergence).
+    No tripwire fires before ``min_steps`` (avoids halting on early-run noise,
+    when both signals are still warming up).
+    The guard is idempotent in the sense that re-querying ``last_status`` or
+    calling ``should_halt()`` does not advance state — only ``update`` does.
+    """
+    # --- thresholds (calibratable; see calibrate_kl_threshold) ---------------
+    kl_hard_stop: float = 0.08          # nats/token; top of GRPO "good" band
+    max_proxy_real_gap: float = 0.10    # absolute Hacking-Gap blowout ceiling
+    # --- temporal gates ------------------------------------------------------
+    min_steps: int = 20                 # no fire before this many updates
+    decline_patience: int = 3           # consecutive held-out declines to fire (a)
+    # --- denoising -----------------------------------------------------------
+    ema_alpha: float = 0.9              # EMA weight on the PRIOR (0.9 => slow)
+    rise_eps: float = 1e-4              # min EMA delta to count as "rising"/"declining"
+    # --- internal state (do not set directly) --------------------------------
+    _n: int = field(default=0, init=False)
+    _in_loop_ema: float | None = field(default=None, init=False)
+    _heldout_ema: float | None = field(default=None, init=False)
+    _kl_ema: float | None = field(default=None, init=False)
+    _entropy_ema: float | None = field(default=None, init=False)
+    _reward_std_ema: float | None = field(default=None, init=False)
+    _in_loop_baseline: float | None = field(default=None, init=False)
+    _heldout_baseline: float | None = field(default=None, init=False)
+    _prev_in_loop_ema: float | None = field(default=None, init=False)
+    _prev_heldout_ema: float | None = field(default=None, init=False)
+    _heldout_decline_streak: int = field(default=0, init=False)
+    _last_status: TripwireStatus | None = field(default=None, init=False)
+    _fired: bool = field(default=False, init=False)
+    def __post_init__(self) -> None:
+        if not (0.0 <= self.ema_alpha < 1.0):
+            raise ValueError(
+                f"ema_alpha must be in [0, 1), got {self.ema_alpha!r} "
+                "(it is the weight on the PRIOR EMA)."
+            )
+        if self.kl_hard_stop <= 0.0:
+            raise ValueError(f"kl_hard_stop must be > 0, got {self.kl_hard_stop!r}")
+        if self.decline_patience < 1:
+            raise ValueError(
+                f"decline_patience must be >= 1, got {self.decline_patience!r}"
+            )
+    # ------------------------------------------------------------------------
+    # core API
+    # ------------------------------------------------------------------------
+    def update(
+        self,
+        round_idx: int,
+        in_loop_reward: float,
+        heldout_score: float,
+        kl_to_init: float | None = None,
+        entropy: float | None = None,
+        reward_std: float | None = None,
+    ) -> TripwireStatus:
+        """Fold one checkpoint's metrics in and return the current verdict.
+        Args:
+            round_idx: the generation / round index (for logging; not used for
+                gating — the internal update counter ``_n`` drives ``min_steps``
+                so the guard is robust to non-contiguous round indices).
+            in_loop_reward: mean in-loop (proxy / oracle) reward this round. This
+                is what the policy is optimizing against.
+            heldout_score: mean score on the DISJOINT held-out eval pool this
+                round — REAL tasks the generator never trains on. See
+                ``composer_replication.safety.holdout`` design notes / the
+                ``HeldoutSplit`` discipline; if held-out drifts with the train
+                set the gap signal is meaningless.
+            kl_to_init: optional token-mean KL(policy || init) in nats/token
+                (this repo's ``token_mean_kl`` convention). NOT sequence-level KL.
+            entropy: optional policy entropy (early-warning of entropy collapse,
+                "the silent killer of RLVR generalization"). Tracked + exposed,
+                not currently a hard gate.
+            reward_std: optional std of the reward distribution (tracked; a
+                collapsing std is an early collapse signal).
+        Returns:
+            A ``TripwireStatus``. Once the guard has fired, every subsequent
+            ``update`` keeps ``fire=True`` (latched) so a transient recovery
+            after a detected collapse cannot silently un-halt the run.
+        """
+        self._n += 1
+        # --- EMA folds (alpha on the prior; first sample seeds the EMA) -------
+        self._in_loop_ema = self._fold(self._in_loop_ema, float(in_loop_reward))
+        self._heldout_ema = self._fold(self._heldout_ema, float(heldout_score))
+        if kl_to_init is not None:
+            self._kl_ema = self._fold(self._kl_ema, float(kl_to_init))
+        if entropy is not None:
+            self._entropy_ema = self._fold(self._entropy_ema, float(entropy))
+        if reward_std is not None:
+            self._reward_std_ema = self._fold(self._reward_std_ema, float(reward_std))
+        # --- baselines: seed on the first update so gains are measured from
+        #     run start (the RSI Hacking-Gap is a gain-since-baseline quantity). -
+        if self._in_loop_baseline is None:
+            self._in_loop_baseline = self._in_loop_ema
+        if self._heldout_baseline is None:
+            self._heldout_baseline = self._heldout_ema
+        # --- track the held-out decline streak (uses EMA deltas, denoised) ----
+        in_loop_rising = (
+            self._prev_in_loop_ema is not None
+            and (self._in_loop_ema - self._prev_in_loop_ema) > self.rise_eps
+        )
+        heldout_declining = (
+            self._prev_heldout_ema is not None
+            and (self._heldout_ema - self._prev_heldout_ema) < -self.rise_eps
+        )
+        # The collapse signature is held-out DOWN while in-loop UP. We only count
+        # a decline toward the streak when in-loop is simultaneously rising — a
+        # held-out dip during an in-loop dip is just noise / a hard batch, not
+        # reward hacking.
+        if heldout_declining and in_loop_rising:
+            self._heldout_decline_streak += 1
+        elif not heldout_declining:
+            self._heldout_decline_streak = 0
+        # (When held-out declines but in-loop is flat/down, neither branch runs,
+        #  so the streak is held unchanged: it neither grows nor resets. Any
+        #  checkpoint without a held-out decline resets it to zero.)
+        gap = self.proxy_real_gap()
+        status = self._evaluate(round_idx, gap)
+        # advance "previous EMA" trackers AFTER evaluation
+        self._prev_in_loop_ema = self._in_loop_ema
+        self._prev_heldout_ema = self._heldout_ema
+        self._last_status = status
+        if status.fire:
+            self._fired = True
+        return status
+    def _evaluate(self, round_idx: int, gap: float) -> TripwireStatus:
+        """Decide the verdict from current state. Pure (no state mutation)."""
+        assert self._in_loop_ema is not None and self._heldout_ema is not None
+        base = dict(
+            step=round_idx,
+            proxy_real_gap=gap,
+            in_loop_ema=self._in_loop_ema,
+            heldout_ema=self._heldout_ema,
+            kl_ema=self._kl_ema,
+        )
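+        # Latched: once fired, stay fired (cannot silently un-halt).
+        if self._fired:
+            prev_reason = self._last_status.reason if self._last_status else "collapse"
+            # Keep a single "latched: " prefix; without this guard the prefix
+            # would be re-applied on every post-fire update.
+            if not prev_reason.startswith("latched: "):
+                prev_reason = f"latched: {prev_reason}"
+            return TripwireStatus(fire=True, reason=prev_reason, **base)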
+        # Warm-up guard: never fire on early-run noise.
+        if self._n < self.min_steps:
+            return TripwireStatus(fire=False, reason="", **base)
+        # (b) KL hard stop — checked first; it's the cheapest unambiguous breach.
+        if self._kl_ema is not None and self._kl_ema > self.kl_hard_stop:
+            return TripwireStatus(
+                fire=True,
+                reason=(
+                    f"kl_to_init EMA {self._kl_ema:.4f} nats/token exceeds hard "
+                    f"stop {self.kl_hard_stop:.4f} (policy drifting from init)"
+                ),
+                **base,
+            )
+        # (a) collapse caught in the act — held-out declines while in-loop rises.
+        if self._heldout_decline_streak >= self.decline_patience:
+            return TripwireStatus(
+                fire=True,
+                reason=(
+                    f"reward-hacking signature: held-out score declined while "
+                    f"in-loop reward rose for {self._heldout_decline_streak} "
+                    f"consecutive checkpoints (Hacking Gap {gap:.4f})"
+                ),
+                **base,
+            )
+        # (c) proxy-real gap blowout — fast single-generation divergence.
+        if gap > self.max_proxy_real_gap:
+            return TripwireStatus(
+                fire=True,
+                reason=(
+                    f"proxy-real Hacking Gap {gap:.4f} exceeds ceiling "
+                    f"{self.max_proxy_real_gap:.4f} (proxy reward improving far "
+                    f"faster than real held-out eval)"
+                ),
+                **base,
+            )
+        return TripwireStatus(fire=False, reason="", **base)
+    # ------------------------------------------------------------------------
+    # query helpers (do NOT advance state — idempotent)
+    # ------------------------------------------------------------------------
+    def should_halt(self) -> bool:
+        """True if the most recent ``update`` produced a halt verdict.
+        Idempotent: querying does not advance the EMA state.
+        """
+        return self._last_status is not None and self._last_status.fire
+    @property
+    def last_status(self) -> TripwireStatus | None:
+        """The most recent verdict, or None if ``update`` was never called."""
+        return self._last_status
+    def raise_if_fired(self, status: TripwireStatus | None = None) -> None:
+        """Convert a fired verdict into a typed ``CollapseStopError`` exception.
+        Pass the status returned by ``update`` (or omit to use ``last_status``).
+        Trainer loops that prefer exception-based control flow call this right
+        after ``update``; loops that prefer flag-checking just read
+        ``status.fire`` / ``should_halt()``.
+        """
+        st = status if status is not None else self._last_status
+        if st is not None and st.fire:
+            raise CollapseStopError(st)
+    def proxy_real_gap(self) -> float:
+        """The RSI Hacking Gap = (in-loop gain) - (held-out gain), both measured
+        as EMA-minus-baseline since run start.
+        Returns 0.0 before the first ``update`` (no baseline yet). A positive,
+        widening value is the reward-hacking fingerprint: the proxy the policy
+        optimizes is improving more than the real held-out objective.
+        """
+        if (
+            self._in_loop_ema is None
+            or self._heldout_ema is None
+            or self._in_loop_baseline is None
+            or self._heldout_baseline is None
+        ):
+            return 0.0
+        in_loop_gain = self._in_loop_ema - self._in_loop_baseline
+        heldout_gain = self._heldout_ema - self._heldout_baseline
+        return in_loop_gain - heldout_gain
+    # ------------------------------------------------------------------------
+    # calibration
+    # ------------------------------------------------------------------------
+    def calibrate_kl_threshold(
+        self, baseline_kls: list[float], factor: float = 3.0
+    ) -> float:
+        """Set ``kl_hard_stop`` to ``factor`` x the mean of early-run baseline KLs.
+
+        theneuralbase guidance: "Record baseline KL during the first ~100 steps,
+        set max to 3x that." Single fixed thresholds are dataset-dependent; this
+        adapts to the run's own KL scale.
+
+        SAFETY CLAMP: calibration may only ever TIGHTEN the hard stop, never
+        loosen it past the documented collapse band. The returned (and stored)
+        threshold is ``min(factor * mean(baseline_kls), current kl_hard_stop)``,
+        so a noisy or already-drifting baseline cannot raise the ceiling above
+        its current value (0.08 nats/token in the documented configuration).
+
+        Args:
+            baseline_kls: per-step token-mean KL values from early in the run.
+            factor: multiplier on the baseline mean (default 3.0).
+
+        Returns:
+            The new ``kl_hard_stop`` (also stored on the instance).
+
+        Raises:
+            ValueError: if ``baseline_kls`` is empty.
+        """
+        if not baseline_kls:
+            raise ValueError("baseline_kls must be non-empty to calibrate")
+        mean_kl = sum(baseline_kls) / len(baseline_kls)
+        calibrated = factor * mean_kl
+        # Only tighten: never let calibration loosen past the current ceiling.
+        self.kl_hard_stop = min(calibrated, self.kl_hard_stop)
+        return self.kl_hard_stop
+    # ------------------------------------------------------------------------
+    # internals
+    # ------------------------------------------------------------------------
+    def _fold(self, prev: float | None, x: float) -> float:
+        """EMA fold; the first observation seeds the EMA (no warm-up bias)."""
+        if prev is None:
+            return x
+        return self.ema_alpha * prev + (1.0 - self.ema_alpha) * x
+def kl_token_trust_filter(logratio_sq_half: float, threshold: float = 0.08) -> bool:
+    """Per-token KL trust-region mask, mirroring torchrl's GRPO "KL-Mask".
+
+    torchrl masks any TOKEN whose ``0.5 * (log pi/pi_ref)^2`` (the Schulman k2
+    estimator of per-token KL) exceeds a threshold, forming a per-token trust
+    region. This helper returns True when the token should be MASKED OUT (its
+    KL contribution is too large), so it can be wired into a loss later without
+    pulling torch into this module; the caller computes ``0.5 * logratio**2``.
+
+    Args:
+        logratio_sq_half: ``0.5 * (log pi/pi_ref)^2`` for one token (nats).
+        threshold: per-token KL ceiling (default 0.08 nats, the same band as the
+            run-level hard stop).
+
+    Returns:
+        True if the token strictly exceeds the threshold and should be masked;
+        a value exactly equal to the threshold is kept.
+    """
+    return logratio_sq_half > threshold
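
A minimal usage sketch for `kl_token_trust_filter`, assuming the caller already has per-token log-probabilities under the policy and the reference model. The helper `k2_term`, the local copy of the filter, and the sample log-probabilities are illustrative assumptions and are not part of this commit.

```python
def kl_token_trust_filter(logratio_sq_half: float, threshold: float = 0.08) -> bool:
    """Local copy of the helper from the diff above, so the sketch runs standalone."""
    return logratio_sq_half > threshold


def k2_term(logp: float, logp_ref: float) -> float:
    """Hypothetical helper: the k2 per-token term, 0.5 * (log pi - log pi_ref)^2."""
    return 0.5 * (logp - logp_ref) ** 2


# Made-up per-token log-probs under the policy and the reference model.
policy_logps = [-1.20, -0.30, -2.50, -0.90]
ref_logps = [-1.25, -1.10, -2.45, -0.90]

# True means "mask this token out of the loss" (it left the trust region).
mask = [
    kl_token_trust_filter(k2_term(lp, lr))
    for lp, lr in zip(policy_logps, ref_logps)
]
```

Only the second token is masked here: its log-ratio of 0.8 gives a k2 term of 0.32, well above the 0.08 ceiling, while the other tokens stay near zero.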

composer_replication/safety/tests/__init__.py ADDED

File without changes

composer_replication/safety/tests/test_kill_switch.py ADDED

	@@ -0,0 +1,320 @@

+"""Tests for the held-out collapse kill-switch (HeldOutGuard).
+CPU-only, pure-Python — no torch, no cloud. Mirrors the
+``datagen/tests/test_feature_deletion.py`` style (small helpers, behavioral
+asserts). Covers:
+  - no-halt on a healthy co-rising run (the held-out-twin "within noise" case);
+  - HALT on the canonical signature: held-out declines while in-loop rises;
+  - HALT on KL-to-init hard-stop breach;
+  - HALT on a fast proxy-real Hacking-Gap blowout;
+  - window / patience behavior (min_steps warm-up; decline_patience streak);
+  - calibration tightens-only;
+  - idempotent query + latched-fire edge cases.
+"""
+from __future__ import annotations
+import pytest
+from composer_replication.safety import (
+    CollapseStopError,
+    HeldOutGuard,
+    TripwireStatus,
+    kl_token_trust_filter,
+)
+def _guard(**kw) -> HeldOutGuard:
+    # Small min_steps keeps tests fast while still exercising the warm-up gate.
+    base = dict(min_steps=3, decline_patience=3, ema_alpha=0.5, kl_hard_stop=0.08)
+    base.update(kw)
+    return HeldOutGuard(**base)
+# --- healthy run: both rise => never halt -----------------------------------
+def test_no_halt_when_both_rise():
+    """Clean run: in-loop and held-out rise together, KL stays in band. The
+    held-out twin scores within noise of the proxy => no fire (the well-behaved
+    case the literature says a clean model exhibits)."""
+    g = _guard()
+    status = None
+    for i in range(30):
+        status = g.update(
+            i,
+            in_loop_reward=0.30 + 0.01 * i,
+            heldout_score=0.28 + 0.01 * i,  # tracks proxy within noise
+            kl_to_init=0.03,
+        )
+        assert not status.fire, f"fired unexpectedly at step {i}: {status.reason}"
+    assert not g.should_halt()
+    # Gap stays near zero because both gained equally.
+    assert abs(g.proxy_real_gap()) < 0.05
+# --- canonical signature: held-out declines while in-loop rises -------------
+def test_halt_on_heldout_declines_while_reward_rises():
+    g = _guard(max_proxy_real_gap=10.0)  # disable gap-blowout path to isolate (a)
+    # Warm up past min_steps with a stable healthy stretch.
+    for i in range(6):
+        s = g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.03)
+        assert not s.fire
+    # Now: proxy reward climbs, held-out eval falls — the reward-hacking
+    # fingerprint. Should fire once the decline streak hits decline_patience (3).
+    fired_at = None
+    for j, i in enumerate(range(6, 12)):
+        s = g.update(
+            i,
+            in_loop_reward=0.40 + 0.05 * (j + 1),   # rising
+            heldout_score=0.40 - 0.05 * (j + 1),    # declining
+            kl_to_init=0.03,                          # KL stays in band
+        )
+        if s.fire:
+            fired_at = i
+            break
+    assert fired_at is not None, "tripwire never fired on the collapse signature"
+    assert g.should_halt()
+    s = g.last_status
+    assert "held-out" in s.reason and "consecutive" in s.reason
+    assert s.proxy_real_gap > 0.0  # proxy gained while real lost
+def test_does_not_fire_before_patience_window():
+    """Held-out declining while in-loop rises for FEWER than decline_patience
+    checkpoints must NOT fire (window behavior)."""
+    g = _guard(decline_patience=3, max_proxy_real_gap=10.0)
+    for i in range(6):
+        g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.03)
+    # Only 2 divergent checkpoints (< patience of 3) => no fire.
+    s1 = g.update(6, in_loop_reward=0.45, heldout_score=0.35, kl_to_init=0.03)
+    s2 = g.update(7, in_loop_reward=0.50, heldout_score=0.30, kl_to_init=0.03)
+    assert not s1.fire and not s2.fire
+def test_decline_streak_resets_on_recovery():
+    """A clean checkpoint (held-out recovers) resets the decline streak, so a
+    later short divergence does not inherit prior declines."""
+    g = _guard(decline_patience=3, max_proxy_real_gap=10.0)
+    for i in range(6):
+        g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.03)
+    # 2 declines...
+    g.update(6, in_loop_reward=0.45, heldout_score=0.35, kl_to_init=0.03)
+    g.update(7, in_loop_reward=0.50, heldout_score=0.30, kl_to_init=0.03)
+    # ...then held-out recovers (resets streak)...
+    s = g.update(8, in_loop_reward=0.50, heldout_score=0.45, kl_to_init=0.03)
+    assert not s.fire
+    # ...one more decline is only streak=1, still below patience.
+    s = g.update(9, in_loop_reward=0.55, heldout_score=0.40, kl_to_init=0.03)
+    assert not s.fire
+# --- KL hard-stop ------------------------------------------------------------
+def test_halt_on_kl_hard_stop_breach():
+    g = _guard(kl_hard_stop=0.08, max_proxy_real_gap=10.0)
+    # Healthy KL through the warm-up; both metrics flat so only KL can fire.
+    for i in range(5):
+        s = g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.04)
+        assert not s.fire
+    # KL spikes well above 0.08; EMA climbs across a couple steps then crosses.
+    fired = False
+    for i in range(5, 12):
+        s = g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.20)
+        if s.fire:
+            fired = True
+            assert "kl_to_init" in s.reason and "hard stop" in s.reason
+            break
+    assert fired, "KL hard-stop never fired despite KL EMA crossing the ceiling"
+def test_kl_none_never_fires_kl_path():
+    """If the caller never supplies kl_to_init, the KL path must be inert (and
+    kl_ema stays None) — KL is an optional float."""
+    g = _guard(max_proxy_real_gap=10.0)
+    s = None
+    for i in range(20):
+        s = g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=None)
+    assert s is not None and not s.fire
+    assert s.kl_ema is None
+# --- proxy-real gap blowout (fast divergence) -------------------------------
+def test_halt_on_proxy_real_gap_blowout():
+    """A large single-generation divergence (proxy jumps, real stays flat) fires
+    via the gap-blowout path even before the decline streak reaches patience."""
+    g = _guard(max_proxy_real_gap=0.10, decline_patience=100)  # disable (a)
+    for i in range(5):
+        g.update(i, in_loop_reward=0.30, heldout_score=0.30, kl_to_init=0.03)
+    # Proxy blows up; held-out flat. With ema_alpha=0.5 the gap crosses 0.10 fast.
+    fired = False
+    for i in range(5, 12):
+        s = g.update(i, in_loop_reward=0.90, heldout_score=0.30, kl_to_init=0.03)
+        if s.fire:
+            fired = True
+            assert "Hacking Gap" in s.reason
+            assert s.proxy_real_gap > 0.10
+            break
+    assert fired, "gap-blowout tripwire never fired"
+# --- warm-up window (min_steps) ---------------------------------------------
+def test_respects_min_steps_no_early_fire():
+    """Even with every signal tripped, no fire before min_steps (avoids halting
+    on early-run noise)."""
+    g = _guard(min_steps=10, decline_patience=2, kl_hard_stop=0.08,
+               max_proxy_real_gap=0.01)
+    # Egregiously bad signals from step 0: KL huge, proxy up, held-out down.
+    for i in range(9):  # 9 updates, all < min_steps=10
+        s = g.update(i, in_loop_reward=0.10 + 0.1 * i, heldout_score=0.90 - 0.1 * i,
+                     kl_to_init=0.9)
+        assert not s.fire, f"fired during warm-up at step {i}: {s.reason}"
+    # The 10th update (n==10, not < min_steps) is now allowed to fire.
+    s = g.update(9, in_loop_reward=1.5, heldout_score=0.0, kl_to_init=0.9)
+    assert s.fire
+# --- calibration -------------------------------------------------------------
+def test_calibrate_kl_threshold_tightens_only():
+    g = _guard(kl_hard_stop=0.08)
+    # Baseline mean 0.01 => 3x = 0.03 < 0.08 => tightens to 0.03.
+    new = g.calibrate_kl_threshold([0.008, 0.010, 0.012], factor=3.0)
+    assert new == pytest.approx(0.03, abs=1e-9)
+    assert g.kl_hard_stop == pytest.approx(0.03, abs=1e-9)
+def test_calibrate_never_loosens_past_band():
+    g = _guard(kl_hard_stop=0.08)
+    # A drifting baseline (mean 0.05 => 3x = 0.15) must NOT loosen past 0.08.
+    new = g.calibrate_kl_threshold([0.05, 0.05, 0.05], factor=3.0)
+    assert new == pytest.approx(0.08, abs=1e-9)
+    assert g.kl_hard_stop == pytest.approx(0.08, abs=1e-9)
+def test_calibrate_empty_raises():
+    g = _guard()
+    with pytest.raises(ValueError, match="non-empty"):
+        g.calibrate_kl_threshold([])
+# --- proxy_real_gap definition ----------------------------------------------
+def test_proxy_real_gap_is_gain_difference():
+    g = _guard(min_steps=100, max_proxy_real_gap=10.0)  # disable firing
+    g.update(0, in_loop_reward=0.20, heldout_score=0.20, kl_to_init=0.02)  # baseline
+    # With ema_alpha=0.5 the second sample moves each EMA halfway.
+    g.update(1, in_loop_reward=0.60, heldout_score=0.30, kl_to_init=0.02)
+    # in_loop EMA: 0.5*0.20 + 0.5*0.60 = 0.40; gain = 0.40-0.20 = 0.20
+    # heldout EMA: 0.5*0.20 + 0.5*0.30 = 0.25; gain = 0.25-0.20 = 0.05
+    # gap = 0.20 - 0.05 = 0.15
+    assert g.proxy_real_gap() == pytest.approx(0.15, abs=1e-9)
+def test_proxy_real_gap_zero_before_update():
+    g = _guard()
+    assert g.proxy_real_gap() == 0.0
+# --- idempotency / edge cases -----------------------------------------------
+def test_should_halt_is_idempotent_query():
+    g = _guard(max_proxy_real_gap=10.0)
+    for i in range(6):
+        g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.03)
+    # Querying repeatedly must not advance state or change the verdict.
+    snap_gap = g.proxy_real_gap()
+    assert g.should_halt() is False
+    assert g.should_halt() is False
+    assert g.proxy_real_gap() == snap_gap  # unchanged by querying
+    assert g.last_status is not None and not g.last_status.fire
+def test_fire_is_latched():
+    """Once fired, a subsequent recovery cannot silently un-halt the run."""
+    g = _guard(kl_hard_stop=0.08, max_proxy_real_gap=10.0)
+    for i in range(5):
+        g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.04)
+    # Drive a KL breach.
+    fired = False
+    for i in range(5, 12):
+        s = g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.30)
+        if s.fire:
+            fired = True
+            break
+    assert fired
+    # Now KL recovers to healthy — verdict must stay fired (latched).
+    s = g.update(99, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.01)
+    assert s.fire and s.reason.startswith("latched:")
+    assert g.should_halt()
+def test_raise_if_fired_raises_typed_exception():
+    g = _guard(kl_hard_stop=0.08, max_proxy_real_gap=10.0)
+    for i in range(5):
+        g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.04)
+    status = None
+    for i in range(5, 12):
+        status = g.update(i, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.30)
+        if status.fire:
+            break
+    assert status is not None and status.fire
+    with pytest.raises(CollapseStopError) as exc:
+        g.raise_if_fired(status)
+    assert exc.value.status is status
+    assert isinstance(str(exc.value), str) and str(exc.value)
+def test_raise_if_fired_noop_when_clean():
+    g = _guard(max_proxy_real_gap=10.0)
+    s = g.update(0, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.03)
+    # No fire => no raise (uses last_status when arg omitted).
+    g.raise_if_fired(s)
+    g.raise_if_fired()
+def test_status_halt_alias_matches_fire():
+    g = _guard(max_proxy_real_gap=10.0)
+    s = g.update(0, in_loop_reward=0.40, heldout_score=0.40, kl_to_init=0.03)
+    assert s.fire is False and s.halt == s.fire
+    assert isinstance(s, TripwireStatus)
+def test_non_contiguous_round_idx_uses_internal_counter():
+    """min_steps gates on the internal update counter, not round_idx, so a caller
+    that logs sparse / non-contiguous round indices still warms up correctly."""
+    g = _guard(min_steps=3, max_proxy_real_gap=0.01, decline_patience=1)
+    # Pass huge round_idx values; only the 3rd UPDATE clears warm-up.
+    g.update(1000, in_loop_reward=0.10, heldout_score=0.90, kl_to_init=0.9)
+    g.update(2000, in_loop_reward=0.50, heldout_score=0.50, kl_to_init=0.9)
+    s = g.update(3000, in_loop_reward=0.90, heldout_score=0.10, kl_to_init=0.9)
+    assert s.fire  # 3rd update, n==3 not < min_steps
+# --- config validation -------------------------------------------------------
+def test_bad_ema_alpha_rejected():
+    with pytest.raises(ValueError, match="ema_alpha"):
+        HeldOutGuard(ema_alpha=1.0)
+    with pytest.raises(ValueError, match="ema_alpha"):
+        HeldOutGuard(ema_alpha=-0.1)
+def test_bad_kl_hard_stop_rejected():
+    with pytest.raises(ValueError, match="kl_hard_stop"):
+        HeldOutGuard(kl_hard_stop=0.0)
+def test_bad_decline_patience_rejected():
+    with pytest.raises(ValueError, match="decline_patience"):
+        HeldOutGuard(decline_patience=0)
+# --- kl_token_trust_filter helper -------------------------------------------
+def test_kl_token_trust_filter_masks_above_threshold():
+    # 0.5 * logratio^2; mask when it exceeds the per-token KL ceiling.
+    assert kl_token_trust_filter(0.20, threshold=0.08) is True   # too large -> mask
+    assert kl_token_trust_filter(0.05, threshold=0.08) is False  # within trust region
+    assert kl_token_trust_filter(0.08, threshold=0.08) is False  # boundary, not masked
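
The arithmetic pinned by `test_proxy_real_gap_is_gain_difference` can be reproduced without the package. The sketch below assumes the baseline is the first observation (which is what the test's `# baseline` comment suggests); `ema_fold` and `hacking_gap` are hypothetical standalone names that mirror `_fold` and `proxy_real_gap`, not the module's API.

```python
from typing import Optional


def ema_fold(prev: Optional[float], x: float, alpha: float) -> float:
    """Same rule as HeldOutGuard._fold: the first sample seeds the EMA,
    afterwards alpha weights the PREVIOUS value (higher alpha = smoother)."""
    if prev is None:
        return x
    return alpha * prev + (1.0 - alpha) * x


def hacking_gap(in_ema: float, in_base: float, ho_ema: float, ho_base: float) -> float:
    """(in-loop gain) - (held-out gain), each measured as EMA minus baseline."""
    return (in_ema - in_base) - (ho_ema - ho_base)


alpha = 0.5
in_loop_rewards = [0.20, 0.60]
heldout_scores = [0.20, 0.30]

in_ema: Optional[float] = None
ho_ema: Optional[float] = None
for reward, score in zip(in_loop_rewards, heldout_scores):
    in_ema = ema_fold(in_ema, reward, alpha)
    ho_ema = ema_fold(ho_ema, score, alpha)

# Assumption: baselines are the first observations of each series.
gap = hacking_gap(in_ema, in_loop_rewards[0], ho_ema, heldout_scores[0])
```

With these inputs the in-loop EMA moves to 0.40 and the held-out EMA to 0.25, so the gap is 0.20 - 0.05 = 0.15, matching the value asserted in the test.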

docs/BACKLOG_RESOLUTION_2026-06-09.md CHANGED

@@ -52,6 +52,31 @@ Goal-driven systematic resolution of every pending item. This doc is the live au
 | F1 (`…-cb74`) | **ROTATE exposed HF write-token** — USER-ONLY (requires HF account access). AUDIT done: no live token in tracked tree (only env-var reads). Action = user rotates on huggingface.co. | P1 | DOCUMENTED (user-only) |
 | F2 | Real 8B LMA run (A2/A3/A4 arms `…-42f5`,`…-dd7b`) + higher-lr sweep RUNS — GPU + budget + user go/no-go. Harness buildable (E1/E2); the spend is user-only. | — | GATED (harness only) |
+## Status log
+**Wave 1 — DONE (commit `c11cf49`):** B1 ✅ (fixture generated, 8 tests pass), B2 ✅ ([dev] installs on arm64), B3 ✅ ([serverless] deps), B4 ✅ (266/62 canonical), B5 ✅ (WSL footers), B6 ✅ (dead ADR link), B7 ✅ (config factories re-exported + documented), B8 ✅ (refine-summary + OVERVIEW xref), **D1 ✅ (Docker substrate E2E GREEN — 2/2 gates on real container; long-blocked item closed)**. F1 (token rotation) audited — no live token in tracked tree; user-only action documented.
+**Wave 2 — DONE (built + integrated + tested):** C1 ✅ HeldOutGuard kill-switch (`composer_replication/safety/`, 23 tests), C2 ✅ EKSExecutor (single Indexed Job → N handles, gang-cancel; `eks.py` + 28 tests), C3 ✅ DockerSandbox (`docker_sandbox.py` + shared `scrub_tree` refactor; live Docker tests pass), E3 ✅ SageMakerExecutor (`sagemaker.py`; +13-test module I added — the build agent shipped it test-less, gap closed during integration). All 4 modules lint-clean, re-exported, 90/3 on targeted suite. Grounded in Phase-3 research.
+**Wave 3 — Phase-7 reconciliation (from the concurrent review team `research/review-*.json`):**
+| ID | Item | Sev | Status |
+|---|---|---|---|
+| R1 | **Wire `HeldOutGuard` into `composer_trainer.py`** at per-checkpoint cadence (alongside `DifficultyCurriculum.update`), feeding `token_mean_kl` as `kl_to_init`, converting a fired verdict to halt via `raise_if_fired`. Currently dead code — the #2 safeguard never fires in production. | HIGH | OPEN |
+| R2 | **Build `composer_replication/safety/holdout.py` `HeldoutSplit`** disjointness enforcer (id/hash set-difference, raises on train↔held-out overlap) — the un-built second half of C1; the guard's gap signal is meaningless without it. | HIGH | OPEN |
+| R3 | **EKS contract bug:** `launch_replicas` default container command runs `replica_entrypoint __main__` (argparse needs `--rendezvous/--world-size/--trainer-module`) but the indexed-job spec passes rank/world via env, not argv → a real run would fail arg-parsing. Reconcile the entrypoint contract. | HIGH | OPEN |
+| R4 | `calibrate_kl_threshold` can yield a NEGATIVE `kl_hard_stop` on `factor<=0`/negative baseline → fires every healthy step. Guard inputs / clamp to positive floor. | LOW | OPEN |
+| R5 | EKS/SageMaker `cancel()` swallow ALL exceptions (report success on real failure). Narrow to already-terminated (404/ResourceNotFound). | LOW | OPEN |
+| R6 | `EKSExecutor.collect()` result dicts miss the `result` key the other backends include — cross-backend shape uniformity. | LOW | OPEN |
+| R7 | **Doc-debt:** the 4 new Wave-2 public symbols (EKSExecutor, SageMakerExecutor, DockerSandbox, HeldOutGuard/safety) are undocumented in API_REFERENCE.md; add §12 + the `[eks]`/`[aws]` extras. | MED | OPEN |
+| R8 | **ADR-015** for the held-out kill-switch — referenced by `safety/__init__.py:17` + kill_switch docstrings but doesn't exist (dangling refs). Author it or drop the refs. | LOW | OPEN |
+| R9 | Re-measure + refresh canonical test count in V1_V8_COVERAGE (Wave 2 added ~93 tests; 328→~420 collected). | LOW | OPEN |
+| R10 | Add a test pinning the kill-switch path-(c) both-rising gap-blowout behavior; document path-(c) as a divergence-rate gate. | LOW | OPEN |
+| R11 | Flaky test `spikes/006-real-hf-model-smoke/tests/test_strict.py::test_alternating_batches_loss_decreases` — fails under CPU contention (full suite w/ concurrent pytest + Docker), PASSES in isolation (verified 3x). Loss-trend assertion is timing/noise-sensitive. Pin seed / widen tolerance / mark flaky. Pre-existing, not a Wave-2 regression. | LOW | OPEN |
+| R12 | B7-complete ✅ (top-level `__all__` now includes the 3 factories) + B4-complete ✅ (the 4 surviving "115" claims → 266/62). | — | DONE |
+Sandbox refactor verdict: **clean** (no regression to LocalSubprocessSandbox/FeatureDeletionEnv).
 ## Wave plan
 - **Wave 1 (parallel):** B1, B2, B3, B4, B5, B6, B7, B8 (bugs + doc debt) ‖ D1 (Docker E2E) ‖ research fan-out (Tavily/Exa/DeepWiki) for C1/C2/E1/E2 best practices.
 - **Wave 2 (parallel, after research):** C1 (held-out eval + kill-switch) ‖ C2 (EKSExecutor) ‖ C3 (containerized sandbox) ‖ E1/E2/E3 harnesses.
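
Item R4 above notes that `calibrate_kl_threshold` can produce a non-positive ceiling. One possible input guard is sketched below as a free function; the name `calibrated_kl_ceiling`, its signature, and the choice to keep the current ceiling on a non-positive product are assumptions for illustration, not the fix adopted in the repository.

```python
import math
from typing import Sequence


def calibrated_kl_ceiling(
    baseline_kls: Sequence[float],
    current_ceiling: float,
    factor: float = 3.0,
) -> float:
    """Tighten-only calibration that can never return a non-positive ceiling."""
    if not baseline_kls:
        raise ValueError("baseline_kls must be non-empty to calibrate")
    if not (math.isfinite(factor) and factor > 0.0):
        raise ValueError("factor must be a finite positive number")

    mean_kl = sum(baseline_kls) / len(baseline_kls)
    calibrated = factor * mean_kl

    # A zero or negative product (e.g. a noisy estimator that dips below zero)
    # carries no usable scale, so this sketch keeps the existing ceiling
    # instead of installing a threshold that every healthy step would breach.
    if not math.isfinite(calibrated) or calibrated <= 0.0:
        return current_ceiling

    # Tighten only: never loosen past the current ceiling.
    return min(calibrated, current_ceiling)
```

Rejecting a bad `factor` outright while tolerating a non-positive baseline mean is a design choice; raising in both cases would be equally defensible if the trainer should fail loudly on a malformed baseline.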

docs/OVERVIEW.md CHANGED

@@ -52,8 +52,8 @@ where channel 1 is real GRPO rather than the LM-CE stub. See
   trainer on a real reasoning benchmark.
 - **Economic feasibility of channel 3.** 150 real OpenRouter calls, $0.98/trace mean, 0
   errors (Spike 001).
-- **Installable + tested.** `pip install -e .` works; **115 passing tests + 1 skip-marked**
-  (canonical count: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).
+- **Installable + tested.** `pip install -e .` works; **266 passing / 62 skipped** (measured 2026-06-09;
+  canonical count + why skips vary by env: [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md)).
 ## What's gapped (honest, NOT closed)

docs/VISION_VALIDATION.md CHANGED

@@ -1,7 +1,7 @@
 # Vision Validation: Does the Framework Encapsulate the Original Brief?
 > **## Status as of 2026-06 (current through ADR-014)**
-> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 115 passing
+> The framework is past-skeleton: 8 subpackages (`composer_replication/*`), 266 passing (canonical count + env-variance note in docs/V1_V8_COVERAGE.md)
 > tests + 1 skip-marked (see [`docs/V1_V8_COVERAGE.md`](V1_V8_COVERAGE.md) for the
 > canonical count), and operational end-to-end examples (`gsm8k_grpo`,
 > `sdpo_with_real_traces_production`). The 3-channel loss, layered hint-generation,

pyproject.toml CHANGED

@@ -69,6 +69,15 @@ serverless = [
     "boto3>=1.34",             # SageMakerExecutor (create_training_job) + S3 IAM
     "kubernetes>=29.0",        # EKSExecutor (indexed k8s Jobs via BatchV1Api)
 ]
+# Amazon EKS / Kubernetes Indexed-Job executor (EKSExecutor, per ADR-005).
+# kubernetes is lazy-imported at adapter-init/method time (not at package import).
+eks = [
+    "kubernetes>=29",
+]
+# Amazon SageMaker training-job executor (SageMakerExecutor, per ADR-005).
+aws = [
+    "boto3>=1.34",
+]
 # Replaysim dataset normalization (per ADR-004)
 #
 # NOTE: data-juicer is intentionally NOT pinned as an extra. The package
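
The comment on the `eks` extra says `kubernetes` is lazy-imported at adapter-init/method time rather than at package import. A minimal sketch of that gating pattern follows; the helper name `require_optional` and the error wording are hypothetical, and the executors in this commit may implement the gate differently.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error.

    Calling this inside a constructor or method (not at module top level)
    keeps the package importable when the extra is not installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this backend; "
            f"install the optional '{extra}' extra"
        ) from exc


# Stdlib stand-in so the sketch runs anywhere; an executor would instead call
# require_optional("kubernetes", "eks") or require_optional("boto3", "aws").
json_mod = require_optional("json", "eks")
```

Deferring the import this way is also what lets the test suites inject mock clients without the real SDKs being present.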

research/review-executors.json ADDED

	@@ -0,0 +1,44 @@

+{
+  "area": "composer_replication/diloco/serverless: EKSExecutor + SageMakerExecutor vs ServerlessExecutor Protocol",
+  "verdict": "minor-issues",
+  "findings": [
+    {
+      "severity": "high",
+      "what": "EKS rank/arg plumbing contract mismatch: launch_replicas defaults the container command to ['python','-m','composer_replication.diloco.serverless.replica_entrypoint'] with NO container args, and plumbs rendezvous_uri/world_size as UPPER-CASED env vars (RENDEZVOUS_URI, WORLD_SIZE). But replica_entrypoint.py's __main__ block uses argparse with --rendezvous, --world-size, --trainer-module ALL required=True and reads NONE of those env vars (only REPLICA_RANK via os.environ). It also reads trainer_module via --trainer-module, which EKS never plumbs in any form. A pod launched with the documented EKS defaults therefore SystemExits at startup ('the following arguments are required: --rendezvous, --world-size, --trainer-module'). SageMakerExecutor does this correctly (passes ContainerArguments=['--rendezvous',...,'--world-size',...,'--trainer-module',...,'--trainer-fn',...,'--trainer-kwargs-json',...] matching the entrypoint argparse exactly). This is an end-to-end run correctness bug, not a Protocol-signature gap, and it is untested (test_launch_uses_default_entrypoint_command only asserts the command vector; no test asserts the entrypoint can actually parse what EKS supplies; trainer_module is never asserted to reach the container).",
+      "where": "composer_replication/diloco/serverless/eks.py:220-224 (default command), :296-303 (_build_env upper-cases scalars, drops nothing else), :405-458 (launch_replicas passes no ContainerArguments); contract owner composer_replication/diloco/serverless/replica_entrypoint.py:91-109 (argparse required=True, no env fallback)",
+      "recommendation": "Pick ONE: (a) make EKS pass the same arg vector SageMaker does by appending args to self.command (e.g. command + ['--rendezvous', uri, '--world-size', N, '--trainer-module', tm, ...]); OR (b) add an env-var fallback to replica_entrypoint.__main__ (read RENDEZVOUS_URI/WORLD_SIZE/TRAINER_MODULE/etc. from os.environ when CLI args are absent) so the env-only EKS plumbing works unchanged. Add a test that constructs the entrypoint argv/env exactly as EKS would and asserts main() can be invoked (e.g. argparse parse over the supplied tokens, or env-driven path). Ensure trainer_module is plumbed in whichever channel is chosen."
+    },
+    {
+      "severity": "low",
+      "what": "EKS cancel() swallows ALL non-404 ApiExceptions and even generic Exceptions with a bare 'return', reporting success even when the gang delete genuinely failed (e.g. 403 RBAC-denied, 409 conflict). Because the whole point of gang-cancel is to stop the entire GPU-burning cohort, a silently-swallowed real teardown failure leaves the cohort running while the caller believes it was cancelled — the exact failure mode the design calls out. SageMakerExecutor.cancel has the same broad swallow. The Protocol only requires 'no exception if already terminated', which a 404 satisfies; swallowing 403/409 is broader than the contract needs.",
+      "where": "composer_replication/diloco/serverless/eks.py:594-600 (except ApiException -> swallow non-404; except Exception -> swallow); composer_replication/diloco/serverless/sagemaker.py:470-475 (bare except Exception: pass)",
+      "recommendation": "Narrow the swallow to the 'already terminated' cases (404 / ResourceNotFound, and SageMaker's already-terminal ValidationException) and at minimum log/warn (or re-raise) on other API errors so a failed gang-teardown of GPU resources is observable rather than silent. Best-effort can still mean 'do not raise', but it should emit a warning on a non-idempotent failure."
+    },
+    {
+      "severity": "low",
+      "what": "Result-dict shape inconsistency across executors in collect(): SageMaker and Modal/Local include a 'result' key (SageMaker surfaces ModelArtifacts.S3ModelArtifacts path; Local/Modal include the in-process return value), but EKS _result_dict omits 'result' entirely and instead adds 'job_name'. The Protocol only mandates {rank,status,exit_code,error} 'at least', so this is conformant, but the divergence makes a backend-agnostic caller that reads result['result'] KeyError on EKS. Note: this is NOT a 'collect() not reading S3' Protocol violation — the Protocol/ADR-005 do not require collect() to read S3 contents; the payload flows through ObjectStoreAllReduce/S3 written by the replica itself, and collect() correctly returns status metadata (the reference LocalProcessExecutor returns an in-process value, not S3). SageMaker surfacing the S3 artifact path is a nice-to-have, not a requirement.",
+      "where": "composer_replication/diloco/serverless/eks.py:655-671 (_result_dict: no 'result' key, adds 'job_name'); compare sagemaker.py:588-595 (includes 'result': artifacts.get('S3ModelArtifacts')) and executor.py:104-107 (Protocol documents only the 4 required keys)",
+      "recommendation": "For cross-backend uniformity, add a 'result': None (or the rendezvous output URI if known) key to EKS _result_dict so callers can read result['result'] uniformly across executors. Optionally document in the Protocol docstring that 'result' is an optional, backend-specific extra key so callers use .get('result')."
+    }
+  ],
+  "confirmed_good": [
+    "Both EKSExecutor and SageMakerExecutor satisfy the runtime_checkable ServerlessExecutor Protocol: isinstance(EKSExecutor(image=...,batch_api=...,core_api=...), ServerlessExecutor) is True and isinstance(SageMakerExecutor(...), ServerlessExecutor) is True (verified at runtime); both expose backend_name ('eks'/'sagemaker'), supports_inter_replica_network (both False, correct — S3-only rendezvous), and all five methods launch_replicas/poll/stream_logs/cancel/collect.",
+    "Both are exported from serverless/__init__.py and present in __all__ (EKSExecutor line 50/62, SageMakerExecutor line 59/68).",
+    "EKS single-Indexed-Job -> N-handles topology is correct: exactly one create_namespaced_job, completions==parallelism==n_replicas, completionMode='Indexed', restartPolicy='Never', backoffLimit=0, active_deadline_seconds==timeout, ttl_seconds_after_finished set; returns N rank-ordered handles (handles[i].rank==i) all sharing job_name/namespace (test_launch_returns_n_rank_ordered_handles, test_launch_creates_indexed_job_spec).",
+    "EKS gang-cancel is correct: cancel(any handle) deletes the WHOLE shared Job with propagation_policy='Background' (cascading pod deletion, not the k8s default Orphan) and grace_period_seconds=0; idempotent on 404 (test_cancel_uses_background_propagation_on_shared_job, test_cancel_swallows_404, test_cancel_unknown_handle_is_noop).",
+    "EKS rank plumbing via downward API is correct: REPLICA_RANK set via V1EnvVarSource.field_ref field_path metadata.annotations['batch.kubernetes.io/job-completion-index'] (value is None, value_from set), bridging k8s completion-index to the entrypoint's REPLICA_RANK read without modifying the entrypoint; rank_env LocalProcessExecutor convention is stripped (test_launch_rank_env_uses_downward_api_field_ref, test_launch_strips_rank_env_kwarg). NOTE: this rank channel works; the BROKEN channel is rendezvous_uri/trainer_module (see high finding).",
+    "EKS poll status mapping covers all five Protocol states: rank in completed_indexes->succeeded (checked first, so a succeeded rank is not mis-flagged by a whole-job Failed condition), rank in failed_indexes->failed, whole-job Failed condition->failed (DeadlineExceeded/backoff), active>0->running, else pending, 404->cancelled, non-404 ApiException re-raised; run-length index strings expanded correctly incl. reversed ranges and whitespace (test_poll_* x7, test_expand_indexes_*).",
+    "EKS GPU resource limit is always a STRING ('1' not int 1) per OpenAPI dict[str,str] typing; GPU node selector merged (caller wins) and nvidia.com/gpu NoSchedule toleration auto-added; CPU-only omits the gpu limit (test_launch_gpu_limit_is_string, test_launch_cpu_only_omits_gpu_limit).",
+    "EKS partial-failure sibling cleanup is correctly N/A: launch issues exactly ONE create_namespaced_job (atomic gang scheduling), so there are no siblings to clean up if it fails — a genuine advantage of the single-Indexed-Job topology over N-job designs.",
+    "SageMaker correctly uses N independent single-instance training jobs (ResourceConfig.InstanceCount==1) with rank via the Environment map (REPLICA_RANK/WORLD_SIZE/RENDEZVOUS_URI), and correctly passes the entrypoint args via ContainerArguments matching replica_entrypoint argparse; EnableNetworkIsolation pinned False (else S3 rendezvous deadlocks) — verified in test_launch_injects_rank_world_size_and_rendezvous_env.",
+    "SageMaker partial-failure sibling cleanup is correct: a create_training_job failure at rank k best-effort stops the k already-launched siblings then raises with rank context (test_launch_partial_failure_stops_siblings_and_raises asserts 2 siblings stopped).",
+    "SageMaker poll status mapping covers all 5 documented TrainingJobStatus values (InProgress->running, with SecondaryStatus refinement to pending for Starting/Pending/LaunchingMLInstances/PreparingTrainingStack; Completed->succeeded; Failed->failed; Stopping->running; Stopped->cancelled), vanished job (ResourceNotFound)->cancelled, unknown handle->cancelled; collect() correctly checks RAW SM status for terminality so Stopping keeps polling until Stopped (test_poll_status_mapping, test_poll_failed_and_stopped, test_poll_vanished_job_is_cancelled, test_poll_unknown_handle_is_cancelled).",
+    "collect() reading S3: NOT a violation. The Protocol (executor.py:96-108) and ADR-005 require collect() to return status/exit metadata, not S3 contents — the result payload flows through ObjectStoreAllReduce written to S3 by each replica. SageMaker even surfaces the ModelArtifacts S3 path in result['result']. The reference LocalProcessExecutor returns an in-process value, confirming collect is not contractually an S3 reader.",
+    "Full suite green: .venv/bin/python -m pytest composer_replication/diloco/serverless -q => 53 passed, 17 skipped (skips are the boto3/kubernetes/modal absent-path guards that cannot fire when the package is importable in this interpreter, plus integration gates)."
+  ],
+  "new_backlog_items": [
+    "EKS end-to-end run bug: default container command runs replica_entrypoint __main__ (argparse --rendezvous/--world-size/--trainer-module required) but EKSExecutor supplies env vars + no args and never plumbs trainer_module -> pod crashes on startup. Fix by passing ContainerArguments-equivalent args OR adding an env-var fallback to replica_entrypoint.__main__; add a test that the EKS-supplied argv/env actually parses. (Not in BACKLOG_RESOLUTION_2026-06-09; C2 only tracked building EKSExecutor, not the entrypoint contract.)",
+    "Tighten EKSExecutor.cancel and SageMakerExecutor.cancel exception handling: only swallow 'already-terminated' errors (404/ResourceNotFound, already-terminal ValidationException); log/warn on other API errors so a failed gang-teardown of GPU resources is observable instead of silently leaving the cohort burning compute.",
+    "Add a 'result' key to EKSExecutor.collect() result dicts (None or the rendezvous output URI) for cross-backend uniformity with Local/Modal/SageMaker, OR document in the Protocol that 'result' is an optional backend-specific extra so callers use .get('result')."
+  ]
+}
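
The cancel-narrowing recommendation in the review above can be sketched as follows. This is a minimal illustration, not the code in eks.py: `ApiException` here is a local stand-in that mimics the `.status` attribute of the Kubernetes client exception, and `cancel_job` and its `delete_job` callable are hypothetical names introduced for the example.

```python
import logging
from typing import Callable

logger = logging.getLogger("serverless.cancel")


class ApiException(Exception):
    """Local stand-in for the Kubernetes client ApiException (exposes .status)."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def cancel_job(delete_job: Callable[[str], None], job_name: str) -> bool:
    """Best-effort gang cancel (hypothetical helper).

    Returns True when the job is gone (deleted now, or already absent),
    and False when the delete genuinely failed. It never raises on an
    ApiException, but a real failure is logged instead of being hidden.
    """
    try:
        delete_job(job_name)
    except ApiException as exc:
        if exc.status == 404:
            # Already terminated: the only case the Protocol asks us to swallow.
            return True
        # 403 / 409 / 5xx: the cohort may still be running, so make it visible.
        logger.warning(
            "gang-cancel of %s failed: HTTP %s %s", job_name, exc.status, exc.reason
        )
        return False
    return True
```

Exceptions that are not `ApiException` propagate in this sketch; whether to log and swallow them instead is a design choice the review leaves open.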

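The result-dict divergence noted in the low-severity finding can be absorbed on the caller side. A minimal sketch, assuming only the four keys the review says the Protocol mandates; `normalize_result` is a hypothetical helper, not part of the package.

```python
from typing import Any

# The four keys the review says the Protocol documents as required.
REQUIRED_KEYS = ("rank", "status", "exit_code", "error")


def normalize_result(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a collect() result dict with a uniform 'result' key.

    Backend-specific extras (for example 'job_name') are preserved; a missing
    'result' key becomes None so callers can index it on every backend.
    """
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"result dict is missing required keys: {missing}")
    normalized = dict(raw)
    normalized.setdefault("result", None)
    return normalized
```

A caller that prefers not to copy can simply read `raw.get("result")`, which is the alternative the recommendation mentions.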
research/review-newgaps.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "area": "Wave-1+2 broad sweep for NEW gaps (imports/laziness, unfinished-work markers, doc-debt, ADR-015, optional-dep eager-load)",
+  "verdict": "minor-issues",
+  "findings": [
+    {
+      "severity": "medium",
+      "what": "Doc-debt: the public symbols of the 4 new Wave-2 modules are entirely undocumented in docs/API_REFERENCE.md. Greps for EKSExecutor / SageMakerExecutor / DockerSandbox / HeldOutGuard / TripwireStatus / CollapseStopError / kl_token_trust_filter all return 0 hits. API_REFERENCE §12 (serverless) header (line 23) lists `.modal`, `.hf_jobs` but not `.eks` / `.sagemaker`, and documents the loud-failing ModalExecutor/HFJobsExecutor stubs while omitting the two NEW *production* executors. There is no `safety` section at all, and no `datagen` section (DockerSandbox + its LocalSubprocessSandbox/FakeSandbox siblings are all undocumented). All of these symbols are real, exported public API (in their package __all__), and the executors are Protocol-conformant (isinstance(eks, ServerlessExecutor) == True).",
+      "where": "docs/API_REFERENCE.md (§12 lines 1153-1376; header line 23); new public symbols in composer_replication/diloco/serverless/{eks,sagemaker}.py, composer_replication/datagen/docker_sandbox.py, composer_replication/safety/kill_switch.py",
+      "recommendation": "Add API_REFERENCE entries: under §12 add `class EKSExecutor` and `class SageMakerExecutor` (and update the §12 line-23 module list to include `.eks`, `.sagemaker`); add a `composer_replication.safety` section documenting HeldOutGuard / TripwireStatus / CollapseStopError / kl_token_trust_filter; and a `composer_replication.datagen` section documenting DockerSandbox (alongside the existing-but-also-undocumented LocalSubprocessSandbox/FakeSandbox)."
+    },
+    {
+      "severity": "low",
+      "what": "Dangling ADR reference: composer_replication/safety/__init__.py:17 says 'See docs/adrs/ADR-015-*.md' but no ADR-015 file exists (docs/adrs/ stops at ADR-014). The research plan called for ADR-015 to document the safety/kill-switch design decision; the module docstring already cites the literature (Zhao et al. RSI, EvilGenie, Gao self-evolving survey, Shumailov collapse, Catastrophic Goodhart, GRPO KL band) so the design rationale exists in-code but is not captured as an ADR, and the __init__ points readers to a file that isn't there.",
+      "where": "composer_replication/safety/__init__.py:17 (the dangling 'docs/adrs/ADR-015-*.md' pointer); docs/adrs/ (ADR-015 absent)",
+      "recommendation": "Either author docs/adrs/ADR-015-holdout-killswitch.md (the kill_switch.py module docstring is effectively the ADR draft already — proxy_real_gap Hacking-Gap, KL 0.08 nats/token hard stop, decline-patience collapse signature, defense-in-depth-over-HackMonitor) and index it in docs/adrs/README.md, OR remove the forward reference from safety/__init__.py until the ADR lands."
+    },
+    {
+      "severity": "low",
+      "what": "Test-count drift re-introduced by Wave 2. docs/V1_V8_COVERAGE.md:117 still states the canonical count as '266 passed / 62 skipped / 328 collected (measured 2026-06-09)' — that was the Wave-1 figure. Wave 2 added 93 tests across 4 new files (test_kill_switch 23, test_eks_executor 28, test_sagemaker_executor 14, test_docker_sandbox 28); the tree now collects 420 tests (328 -> 420, +92 net). B4 closed test-drift in Wave 1 but the doc is stale again post-Wave-2.",
+      "where": "docs/V1_V8_COVERAGE.md:117-134 (canonical count claim) vs actual `pytest --collect-only` = 420 collected",
+      "recommendation": "Re-run `.venv/bin/python -m pytest` to get the post-Wave-2 passed/skipped split and update the single canonical figure in V1_V8_COVERAGE.md (the doc explicitly says this line is 'the one canonical figure' that other docs reference)."
+    }
+  ],
+  "confirmed_good": [
+    "Required import smoke test passes: `import composer_replication; from composer_replication.diloco.serverless import EKSExecutor, SageMakerExecutor; from composer_replication.datagen import DockerSandbox; from composer_replication.safety import HeldOutGuard` -> exit 0, 'ALL IMPORTS OK'.",
+    "Optional-dep laziness (question 5) is CORRECT for all 4 new modules: no top-level `import kubernetes/boto3/docker` in eks.py / sagemaker.py / docker_sandbox.py / kill_switch.py (grep for eager imports returns empty). Blocking kubernetes+docker at import time and importing the new modules in isolation succeeds. EKSExecutor lazy-imports `kubernetes` only when no api injected / per-method; SageMakerExecutor lazy-imports boto3 in _make_boto3_client (construction-time, not import-time); DockerSandbox lazy-imports docker via _require_docker() inside methods.",
+    "NOTE on the whole-package blocked-import failure: blocking boto3 breaks `import composer_replication`, but the cause is PRE-EXISTING and NOT a Wave-2 regression — composer_replication/__init__.py:98 imports the trainer, which imports `trl.GRPOTrainer` -> accelerate.commands.config.sagemaker -> `import boto3`. boto3 is already a hard transitive dependency of the base trainer stack on main; Wave 2 did not introduce it.",
+    "No NEW unfinished-work markers (question 2): all NotImplementedError/TODO/FIXME/STUB hits in composer_replication/ are PRE-EXISTING and intentional (prime_rl/composer_loss.py deferred SDPO channel-2, recipes/monarch/actors.py v0 skeleton per ADR-006, diloco/serverless/{modal,hf_jobs,modal_spawn}.py documented loud-failing stubs). The 4 new modules contain ZERO NotImplementedError/TODO/FIXME/STUB — they are finished, not skeletons. SageMakerExecutor's docstring explicitly contrasts itself as 'fully-working, not the loud-failing modal.py/hf_jobs.py skeletons'.",
+    "Both new executors satisfy the runtime_checkable ServerlessExecutor Protocol (isinstance checks pass), expose correct backend_name ('eks'/'sagemaker') and supports_inter_replica_network=False (S3-only rendezvous).",
+    "All 90 collectable Wave-2 tests pass (3 skipped, the live-docker-daemon gated ones) via `pytest composer_replication/safety/tests composer_replication/diloco/serverless/tests/test_{eks,sagemaker}_executor.py composer_replication/datagen/tests/test_docker_sandbox.py`. Whole suite still collects cleanly (420 tests, no collection errors).",
+    "DockerSandbox.run_tests pytest-pass heuristic (`f\"{t} PASSED\" in out or (returncode==0 and not failed)`) is a faithful copy of the established LocalSubprocessSandbox.run_tests (sandbox.py:214) — not a new bug, consistent with the documented sibling behavior.",
+    "safety/ not being in the top-level composer_replication.__all__ is consistent with existing structure (datagen/diloco subpackages aren't fully surfaced at top level either); `composer_replication.safety` imports correctly as a subpackage."
+  ],
+  "new_backlog_items": [
+    "DOC: Document the public symbols of the 4 new Wave-2 modules in docs/API_REFERENCE.md: add EKSExecutor + SageMakerExecutor under §12 (and add .eks/.sagemaker to the §12 module list at line 23), add a new `composer_replication.safety` section (HeldOutGuard, TripwireStatus, CollapseStopError, kl_token_trust_filter), and a `composer_replication.datagen` section covering DockerSandbox (+ the also-undocumented LocalSubprocessSandbox/FakeSandbox).",
+    "ADR: Author docs/adrs/ADR-015-holdout-killswitch.md (the safety kill-switch / held-out-guard design) — currently referenced by composer_replication/safety/__init__.py:17 as 'docs/adrs/ADR-015-*.md' but the file does not exist; index it in docs/adrs/README.md. The kill_switch.py module docstring is the ready-made draft.",
+    "DOC: Refresh the canonical test count in docs/V1_V8_COVERAGE.md:117. Wave 2 added 93 tests (collection 328 -> 420, +92 net); the stated '266 passed / 62 skipped / 328 collected' is the Wave-1 figure and is now stale."
+  ]
+}
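
The optional-dependency laziness check described in the review above (block a package at import time, then confirm a module still imports) can be sketched like this. It is an illustration under stated assumptions: `BlockImports`, `blocked_imports`, and `make_client` are hypothetical names, and `make_client` stands in for any function that defers its `import kubernetes` until it is called.

```python
import contextlib
import importlib.abc
import sys


class BlockImports(importlib.abc.MetaPathFinder):
    """Meta-path finder that refuses to import the given top-level packages."""

    def __init__(self, blocked):
        self.blocked = frozenset(blocked)

    def find_spec(self, fullname, path=None, target=None):
        if fullname.split(".")[0] in self.blocked:
            raise ImportError(f"{fullname} is blocked for this check")
        return None  # defer to the remaining finders


@contextlib.contextmanager
def blocked_imports(*names):
    """Temporarily make the named packages unimportable."""
    finder = BlockImports(names)
    # Drop cached copies so the finder is actually consulted.
    saved = {
        mod: sys.modules.pop(mod)
        for mod in list(sys.modules)
        if mod.split(".")[0] in finder.blocked
    }
    sys.meta_path.insert(0, finder)
    try:
        yield
    finally:
        sys.meta_path.remove(finder)
        sys.modules.update(saved)


def make_client():
    """Hypothetical lazy importer: the optional dependency loads only on call."""
    import kubernetes  # deferred, so defining this function never needs the package

    return kubernetes
```

Under `blocked_imports("kubernetes")`, defining or importing code like `make_client` succeeds, and only the call itself raises `ImportError`; an eager top-level import would fail at module import time instead.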

research/review-safety.json ADDED

	@@ -0,0 +1,54 @@

+{
+  "area": "composer_replication/safety/kill_switch.py + test_kill_switch.py (Wave-2 C1)",
+  "verdict": "material-issues",
+  "findings": [
+    {
+      "severity": "high",
+      "what": "C1 was scoped as 'Held-out disjoint eval + depth/generation kill-switch' but ONLY the kill-switch half (HeldOutGuard) was built. The HeldoutSplit disjointness-enforcer does not exist anywhere in the tree (no composer_replication/safety/holdout.py, no HeldoutSplit class). The guard's heldout_score is an unvalidated caller-supplied float; nothing enforces that the held-out pool is actually disjoint from the train/generator set. The module's own docstring (kill_switch.py:41-43, 214-216) states this is load-bearing: 'if held-out drifts with the train set the gap signal is meaningless.' So the kill-switch's central proxy-real-gap and decline-streak signals can be silently meaningless with no guard rail.",
+      "where": "composer_replication/safety/ (missing holdout.py / HeldoutSplit); referenced at kill_switch.py:43, kill_switch.py:214-216",
+      "recommendation": "Build the HeldoutSplit disjointness enforcer (hash/id-based set-difference check that the held-out eval IDs never intersect the generator/train IDs, raising on overlap) as the second half of C1, OR explicitly re-scope C1 to two items and track the disjointness enforcer as a distinct OPEN backlog item. Do not mark C1 done with only the guard built."
+    },
+    {
+      "severity": "high",
+      "what": "HeldOutGuard is NOT wired into the trainer. Zero references to HeldOutGuard / kill_switch / CollapseStopError / should_halt / raise_if_fired in composer_replication/trainer/composer_trainer.py (or anywhere outside the safety package + its own test). The 'most load-bearing collapse safeguard (#2)' for the self-evolving flywheel exists as dead, never-invoked code. The trainer's GRPO loop never calls update() per checkpoint, so the run-level tripwire cannot fire in production.",
+      "where": "composer_replication/trainer/composer_trainer.py (no integration); HeldOutGuard defined composer_replication/safety/kill_switch.py:117",
+      "recommendation": "Wire HeldOutGuard.update(round_idx, in_loop_reward, heldout_score, kl_to_init=token_mean_kl(...)) into the trainer loop at the same checkpoint cadence DifficultyCurriculum.update is called (curriculum.py:78), and convert a fired verdict to a halt via raise_if_fired / should_halt. token_mean_kl already exists (kl_logging.py:53) to supply the per-token KL. Until wired, C1's safety claim is unrealized."
+    },
+    {
+      "severity": "low",
+      "what": "calibrate_kl_threshold does not re-validate the > 0 invariant that __post_init__ enforces. A negative factor (or negative baseline_kls) yields min(negative, 0.08) = a NEGATIVE kl_hard_stop, after which the KL tripwire fires on EVERY healthy step (any positive KL EMA > negative ceiling). Verified empirically: factor=-3.0 on baseline [0.01] sets kl_hard_stop=-0.03 and a healthy KL of 0.01 then fires. The min() 'tighten-only' clamp is satisfied in the literal numeric sense but violates the documented collapse-band semantics.",
+      "where": "composer_replication/safety/kill_switch.py:412-418 (calibrate_kl_threshold)",
+      "recommendation": "Validate factor > 0 and all(k >= 0 for k in baseline_kls) at the top of calibrate_kl_threshold, and/or clamp the result to a small positive floor (e.g. assert calibrated > 0). KL values are non-negative by definition so a negative factor is nonsensical input, but the invariant should be guarded since the method mutates a field __post_init__ otherwise protects."
+    },
+    {
+      "severity": "low",
+      "what": "Dangling cross-references in docstrings to artifacts that do not exist: safety/__init__.py:17-18 cites 'docs/adrs/ADR-015-*.md' (highest existing ADR is ADR-014; no ADR-015 file) and a \"'holdout-killswitch' research digest\" (no such file under research/). kill_switch.py:43,214 cite composer_replication.safety.holdout / HeldoutSplit 'design notes' that do not exist (same missing module as the high finding).",
+      "where": "composer_replication/safety/__init__.py:17-18; composer_replication/safety/kill_switch.py:43, 214-216",
+      "recommendation": "Either author ADR-015 documenting the kill-switch design decision (the module is substantial enough to warrant one and the docstring already promises it), or drop the dangling citations. Keep doc references honest to avoid the stale-cross-ref foot-guns the backlog (B5/B6/B8) is already cleaning up."
+    },
+    {
+      "severity": "low",
+      "what": "Gap-blowout path (c) fires when the proxy gain exceeds real gain by max_proxy_real_gap EVEN WHEN the held-out (real) score is still genuinely RISING. Verified: with both rising but proxy faster, it halts the run while real improvement is ongoing. This is defensible per the docstring ('fast single-generation divergence', lines 144-145), and the reason string is accurate, but it is a potential false-positive halt on a healthy-but-fast-proxy run and is not covered by a test asserting the desired behavior in the both-rising case (only the real-flat case, where the proxy rises while the real score stays flat, is tested at test_kill_switch.py:143).",
+      "where": "composer_replication/safety/kill_switch.py:326-335 (path c); test gap test_kill_switch.py:143-158 only exercises real-flat",
+      "recommendation": "Add a test pinning the intended behavior when BOTH rise but proxy outpaces real beyond the ceiling (assert whether it should fire), and document in the docstring that path (c) is a divergence-RATE gate, not a real-decline gate, so future readers do not mistake a fired path-(c) for confirmed real regression."
+    }
+  ],
+  "confirmed_good": [
+    "All 23 tests in composer_replication/safety pass (.venv pytest, 23 passed).",
+    "Latched-fire is correct and cannot un-halt: _fired flips True in update() (lines 277-278) and _evaluate() short-circuits with a 'latched:' verdict carrying the original reason before any threshold re-check (lines 294-296). Verified a full KL/gap recovery after fire stays fire=True.",
+    "Three halt conditions are individually correct: (b) KL EMA > kl_hard_stop checked first; (a) held-out-declines-while-in-loop-rises only increments the streak when BOTH conditions hold (a both-declining 'hard batch' correctly does NOT count, verified), fires at decline_patience; (c) proxy-real gap > ceiling. min_steps warm-up gate uses the internal _n counter (robust to non-contiguous round_idx, tested).",
+    "EMA denoising is sound: _fold seeds on first sample (no warm-up bias), alpha is weight-on-prior validated to [0,1); first-sample baseline seeding makes proxy_real_gap a gain-since-baseline quantity exactly matching the RSI Hacking-Gap definition. proxy_real_gap math verified (0.15 expected case) and returns 0.0 before first update.",
+    "CollapseStopError raise path: raise_if_fired raises the typed exception carrying .status only when fired, is a no-op when clean, and is a safe no-op before any update (last_status None). Strict > boundary on gap/KL confirmed (gap==ceiling does not fire).",
+    "calibrate tighten-only works for the intended (positive) inputs: min(3x baseline, current) so a drifting baseline cannot loosen past 0.08 (tested), and only tightens for a clean low baseline.",
+    "kl_token_trust_filter boundary correct (strict >, so threshold value itself is not masked).",
+    "Docstring cross-refs that DO resolve: DifficultyCurriculum.update (curriculum.py:78) and token_mean_kl (kl_logging.py:53) both exist, so the claimed cadence and KL-units convention are anchored to real code.",
+    "No false claim anywhere in examples/ or docs that the kill-switch is already wired/used (grep clean)."
+  ],
+  "new_backlog_items": [
+    "Build composer_replication/safety/holdout.py with a HeldoutSplit disjointness enforcer (id/hash set-difference, raises on train/held-out overlap) — the un-built second half of C1 that the kill-switch's gap/decline signals depend on for validity.",
+    "Wire HeldOutGuard into composer_replication/trainer/composer_trainer.py at the per-checkpoint cadence (alongside DifficultyCurriculum.update), feeding token_mean_kl as kl_to_init and converting a fired verdict to a halt via raise_if_fired/should_halt — the C1 safeguard is currently dead code.",
+    "Guard calibrate_kl_threshold against factor<=0 / negative baseline_kls (or clamp result to a positive floor) so calibration cannot drive kl_hard_stop negative and make the KL tripwire fire on every healthy step.",
+    "Author docs/adrs/ADR-015 for the held-out kill-switch (referenced by safety/__init__.py:17 but nonexistent) or remove the dangling ADR-015 + 'holdout-killswitch research digest' citations.",
+    "Add a test pinning path-(c) gap-blowout behavior in the BOTH-rising case (proxy outpaces a still-rising real) to lock the intended false-positive/true-positive decision."
+  ]
+}
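
The calibrate_kl_threshold guard recommended in the review above might look like the sketch below. It is a free function written for illustration, not the method in kill_switch.py; in particular, aggregating `baseline_kls` with a mean is an assumption, since the review only states the `min(factor x baseline, current)` tighten-only shape.

```python
from statistics import fmean
from typing import Sequence


def calibrate_kl_threshold(
    current: float, baseline_kls: Sequence[float], factor: float = 3.0
) -> float:
    """Return a tightened KL hard-stop that is never negative or zero.

    Tighten-only: the result never exceeds `current`. Inputs are validated
    so calibration cannot produce a ceiling that every healthy step exceeds.
    """
    if current <= 0:
        raise ValueError("current kl_hard_stop must be > 0")
    if factor <= 0:
        raise ValueError("factor must be > 0")
    if not baseline_kls:
        raise ValueError("baseline_kls must be non-empty")
    if any(kl < 0 for kl in baseline_kls):
        raise ValueError("KL values are non-negative by definition")

    # Assumption: the baseline is summarized by its mean.
    calibrated = min(factor * fmean(baseline_kls), current)
    if calibrated <= 0:
        # An all-zero baseline would otherwise yield a zero ceiling.
        raise ValueError("calibrated threshold must be > 0")
    return calibrated
```

With these checks, the `factor=-3.0` case from the finding raises `ValueError` instead of silently setting a negative ceiling.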

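The missing disjointness enforcer called out in the first high-severity finding could take roughly this shape. The review only names `HeldoutSplit` and describes an id/hash set-difference check that raises on overlap; the fields, the `HeldoutOverlapError` type, and the `from_iterables` constructor below are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


class HeldoutOverlapError(ValueError):
    """Raised when held-out eval IDs intersect the train/generator IDs."""


@dataclass(frozen=True)
class HeldoutSplit:
    """Hypothetical sketch of an ID-based disjointness enforcer."""

    train_ids: frozenset[str]
    heldout_ids: frozenset[str]

    def __post_init__(self) -> None:
        overlap = self.train_ids & self.heldout_ids
        if overlap:
            sample = sorted(overlap)[:5]  # keep the error message short
            raise HeldoutOverlapError(
                f"{len(overlap)} held-out IDs also appear in train, e.g. {sample}"
            )

    @classmethod
    def from_iterables(
        cls, train: Iterable[str], heldout: Iterable[str]
    ) -> HeldoutSplit:
        return cls(frozenset(train), frozenset(heldout))
```

Constructing the split before any held-out score is fed to the guard means an overlapping pool fails loudly at setup time rather than silently invalidating the gap and decline signals.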
research/review-sandbox.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "area": "composer_replication/datagen/docker_sandbox.py + sandbox.py scrub_tree refactor",
+  "verdict": "clean",
+  "findings": [
+    {
+      "severity": "low",
+      "what": "run_tests pass/fail parse carries the order-dependent fallback clause `if f\"{t} PASSED\" in out or (returncode == 0 and not failed)` verbatim from LocalSubprocessSandbox. If a runner exits 0 but does not print '<nodeid> PASSED' for every node id, the first un-printed node is marked passed solely on the exit code (and `not failed` is true only until the first failure is recorded). This is a pre-existing pattern (identical on main's LocalSubprocessSandbox at sandbox.py:214) faithfully mirrored into DockerSandbox, NOT a new regression — flagged only for completeness.",
+      "where": "composer_replication/datagen/docker_sandbox.py:272-276 (and the source LocalSubprocessSandbox at sandbox.py:212-217)",
+      "recommendation": "No action required for this review. If ever hardened, require an explicit PASSED token per node id and stop trusting the bare exit code; do it in both sandboxes together so they stay in lock-step."
+    }
+  ],
+  "confirmed_good": [
+    "REFACTOR DID NOT BREAK LocalSubprocessSandbox: boot() still scrubs — boot() (sandbox.py:169-172) calls self._scrub_tree() which delegates to the shared module-level scrub_tree() free function (sandbox.py:174-177). Smoke test confirmed __pycache__, .git, and *.pyc are removed on boot while real source (keep.py) survives.",
+    "No broken/dangling references to the old per-class _scrub_tree: the only remaining _scrub_tree occurrences are (a) the intentional back-compat delegating method + its self-call in boot, and (b) one descriptive comment in test_docker_substrate_e2e.py:161. grep for external callers of .SCRUB_NAMES/._SCRUB_NAMES/.SCRUB_SUFFIXES returned EMPTY.",
+    "Back-compat preserved: LocalSubprocessSandbox._SCRUB_NAMES / ._SCRUB_SUFFIXES class aliases still point at the module-level SCRUB_NAMES/SCRUB_SUFFIXES; the _scrub_tree() method is retained.",
+    "FeatureDeletionEnv unaffected: env.py uses the Sandbox Protocol generically (boot/exec/run_tests/trajectory at env.py:59,69,86,89) — agnostic to the scrub refactor.",
+    "SCRUB-BEFORE-MOUNT ORDERING IS CORRECT (no security bug): DockerSandbox.boot() runs scrub_tree(self.workdir) at line 190 BEFORE self._client.containers.run(**kwargs) at line 198. The container (and thus the RW bind mount) does not exist when the host-side scrub runs, so the scrub is provably pre-mount. The scrub-AFTER-mount security bug the audit asked to rule out is NOT present.",
+    "--network none: both network_disabled=True AND network_mode='none' set (docker_sandbox.py:154-155); live test_live_network_is_disabled actually ran on a real container and asserted egress BLOCKED / not CONNECTED.",
+    "Resource limits: mem_limit == memswap_limit (forbids swap), pids_limit (fork-bomb guard), nano_cpus (CPU quota); all present, configurable, and unit-asserted.",
+    "Ephemeral teardown: close() force-removes (idempotent, swallows errors), reap_leaked() sweeps label-filtered orphan containers at boot and shutdown, __enter__/__exit__/__del__ wired. Verified by test_close_removes_container_force, test_context_manager_closes, test_reap_leaked_sweeps_labelled_containers.",
+    "gVisor runtime option: runtime defaults to None (=> 'runtime' kwarg omitted, daemon-default runc); 'runsc' is only passed through when explicitly set (docker_sandbox.py:178-179) and gated by runsc_available(). test_live_runsc_runtime correctly SKIPPED (gVisor not installed on host).",
+    "Lazy docker import: _require_docker() imports `docker` inside the function with a clear RuntimeError on ImportError; docker SDK is never required by the FakeSandbox/pure-core path. Verified by test_require_docker_missing_sdk_raises.",
+    "Privilege lockdown: cap_drop=['ALL'], security_opt=['no-new-privileges:true'], user='1000:1000' (non-root), read_only root fs with tmpfs /tmp (noexec,nosuid), keep_root_writable escape hatch.",
+    "shlex.quote applied to every test node id in run_tests (shell-injection guard, matches LocalSubprocessSandbox); non-UTF-8 output decoded with errors='replace' (test_exec_decodes_non_utf8_bytes); exec wraps commands in coreutils `timeout`.",
+    "TEST SUITE: `.venv/bin/python -m pytest composer_replication/datagen -q` => 61 passed, 1 skipped (runsc only). The LIVE Docker E2E genuinely RAN (not skipped): test_live_four_inversion_gates_in_hardened_container, test_live_network_is_disabled, test_live_cache_scrub_removes_bytecode all PASSED on a real python:3.11-slim container. The long-blocked D1 substrate E2E (test_docker_substrate_e2e.py) is also GREEN (2/2). Broader regression datagen+safety+serverless => 137 passed, 18 skipped, no failures.",
+    "Public surface re-exports DockerSandbox and scrub_tree from composer_replication/datagen/__init__.py and __all__; package imports cleanly."
+  ],
+  "new_backlog_items": []
+}
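
The hardening suggested in the sandbox review's only finding (require an explicit PASSED token per node id and stop trusting the bare exit code) can be sketched as below. This is not the code in either sandbox; `parse_pytest_outcomes` is a hypothetical helper, and it assumes verbose pytest output where each result line begins with the node id followed by the outcome.

```python
from typing import Iterable


def parse_pytest_outcomes(
    node_ids: Iterable[str], output: str
) -> tuple[list[str], list[str]]:
    """Split node ids into (passed, failed) using explicit PASSED tokens only.

    A node counts as passed only if some output line starts with
    '<nodeid> PASSED'. Anything else, including a node the runner never
    reported, is treated as failed regardless of the process exit code.
    """
    lines = [line.strip() for line in output.splitlines()]
    passed: list[str] = []
    failed: list[str] = []
    for node in node_ids:
        token = f"{node} PASSED"
        if any(line.startswith(token) for line in lines):
            passed.append(node)
        else:
            failed.append(node)
    return passed, failed
```

Matching on the start of the line avoids crediting a node whose id is a suffix of another reported path; as the review notes, any such change should land in both sandboxes together so they stay in lock-step.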