{ "area": "composer_replication/diloco/serverless: EKSExecutor + SageMakerExecutor vs ServerlessExecutor Protocol", "verdict": "minor-issues", "findings": [ { "severity": "high", "what": "EKS rank/arg plumbing contract mismatch: launch_replicas defaults the container command to ['python','-m','composer_replication.diloco.serverless.replica_entrypoint'] with NO container args, and plumbs rendezvous_uri/world_size as UPPER-CASED env vars (RENDEZVOUS_URI, WORLD_SIZE). But replica_entrypoint.py's __main__ block uses argparse with --rendezvous, --world-size, --trainer-module ALL required=True and reads NONE of those env vars (only REPLICA_RANK via os.environ). It also reads trainer_module via --trainer-module, which EKS never plumbs in any form. A pod launched with the documented EKS defaults therefore SystemExits at startup ('the following arguments are required: --rendezvous, --world-size, --trainer-module'). SageMakerExecutor does this correctly (passes ContainerArguments=['--rendezvous',...,'--world-size',...,'--trainer-module',...,'--trainer-fn',...,'--trainer-kwargs-json',...] matching the entrypoint argparse exactly). This is an end-to-end run correctness bug, not a Protocol-signature gap, and it is untested (test_launch_uses_default_entrypoint_command only asserts the command vector; no test asserts the entrypoint can actually parse what EKS supplies; trainer_module is never asserted to reach the container).", "where": "composer_replication/diloco/serverless/eks.py:220-224 (default command), :296-303 (_build_env upper-cases scalars, drops nothing else), :405-458 (launch_replicas passes no ContainerArguments); contract owner composer_replication/diloco/serverless/replica_entrypoint.py:91-109 (argparse required=True, no env fallback)", "recommendation": "Pick ONE: (a) make EKS pass the same arg vector SageMaker does by appending args to self.command (e.g. 
command + ['--rendezvous', uri, '--world-size', N, '--trainer-module', tm, ...]); OR (b) add an env-var fallback to replica_entrypoint.__main__ (read RENDEZVOUS_URI/WORLD_SIZE/TRAINER_MODULE/etc. from os.environ when CLI args are absent) so the env-only EKS plumbing works unchanged. Add a test that constructs the entrypoint argv/env exactly as EKS would and asserts main() can be invoked (e.g. argparse parse over the supplied tokens, or env-driven path). Ensure trainer_module is plumbed in whichever channel is chosen." }, { "severity": "low", "what": "EKS cancel() swallows ALL non-404 ApiExceptions and even generic Exceptions with a bare 'return', reporting success even when the gang delete genuinely failed (e.g. 403 RBAC-denied, 409 conflict). Because the whole point of gang-cancel is to stop the entire GPU-burning cohort, a silently-swallowed real teardown failure leaves the cohort running while the caller believes it was cancelled — the exact failure mode the design calls out. SageMakerExecutor.cancel has the same broad swallow. The Protocol only requires 'no exception if already terminated', which a 404 satisfies; swallowing 403/409 is broader than the contract needs.", "where": "composer_replication/diloco/serverless/eks.py:594-600 (except ApiException -> swallow non-404; except Exception -> swallow); composer_replication/diloco/serverless/sagemaker.py:470-475 (bare except Exception: pass)", "recommendation": "Narrow the swallow to the 'already terminated' cases (404 / ResourceNotFound, and SageMaker's already-terminal ValidationException) and at minimum log/warn (or re-raise) on other API errors so a failed gang-teardown of GPU resources is observable rather than silent. Best-effort can still mean 'do not raise', but it should emit a warning on a non-idempotent failure." 
}, { "severity": "low", "what": "Result-dict shape inconsistency across executors in collect(): SageMaker and Modal/Local include a 'result' key (SageMaker surfaces the ModelArtifacts.S3ModelArtifacts path; Local/Modal include the in-process return value), but EKS _result_dict omits 'result' entirely and instead adds 'job_name'. The Protocol only mandates {rank,status,exit_code,error} 'at least', so this is conformant, but the divergence means a backend-agnostic caller that reads result['result'] raises KeyError on EKS. Note: this is NOT a 'collect() not reading S3' Protocol violation: the Protocol/ADR-005 do not require collect() to read S3 contents; the payload flows through ObjectStoreAllReduce/S3 written by the replica itself, and collect() correctly returns status metadata (the reference LocalProcessExecutor returns an in-process value, not S3). SageMaker surfacing the S3 artifact path is a nice-to-have, not a requirement.", "where": "composer_replication/diloco/serverless/eks.py:655-671 (_result_dict: no 'result' key, adds 'job_name'); compare sagemaker.py:588-595 (includes 'result': artifacts.get('S3ModelArtifacts')) and executor.py:104-107 (Protocol documents only the 4 required keys)", "recommendation": "For cross-backend uniformity, add a 'result': None (or the rendezvous output URI if known) key to EKS _result_dict so callers can read result['result'] uniformly across executors. Optionally document in the Protocol docstring that 'result' is an optional, backend-specific extra key so callers use .get('result')." 
} ], "confirmed_good": [ "Both EKSExecutor and SageMakerExecutor satisfy the runtime_checkable ServerlessExecutor Protocol: isinstance(EKSExecutor(image=...,batch_api=...,core_api=...), ServerlessExecutor) is True and isinstance(SageMakerExecutor(...), ServerlessExecutor) is True (verified at runtime); both expose backend_name ('eks'/'sagemaker'), supports_inter_replica_network (both False, correct — S3-only rendezvous), and all five methods launch_replicas/poll/stream_logs/cancel/collect.", "Both are exported from serverless/__init__.py and present in __all__ (EKSExecutor line 50/62, SageMakerExecutor line 59/68).", "EKS single-Indexed-Job -> N-handles topology is correct: exactly one create_namespaced_job, completions==parallelism==n_replicas, completionMode='Indexed', restartPolicy='Never', backoffLimit=0, active_deadline_seconds==timeout, ttl_seconds_after_finished set; returns N rank-ordered handles (handles[i].rank==i) all sharing job_name/namespace (test_launch_returns_n_rank_ordered_handles, test_launch_creates_indexed_job_spec).", "EKS gang-cancel is correct: cancel(any handle) deletes the WHOLE shared Job with propagation_policy='Background' (cascading pod deletion, not the k8s default Orphan) and grace_period_seconds=0; idempotent on 404 (test_cancel_uses_background_propagation_on_shared_job, test_cancel_swallows_404, test_cancel_unknown_handle_is_noop).", "EKS rank plumbing via downward API is correct: REPLICA_RANK set via V1EnvVarSource.field_ref field_path metadata.annotations['batch.kubernetes.io/job-completion-index'] (value is None, value_from set), bridging k8s completion-index to the entrypoint's REPLICA_RANK read without modifying the entrypoint; rank_env LocalProcessExecutor convention is stripped (test_launch_rank_env_uses_downward_api_field_ref, test_launch_strips_rank_env_kwarg). 
NOTE: this rank channel works; the BROKEN channel is rendezvous_uri/trainer_module (see high finding).", "EKS poll status mapping covers all five Protocol states: rank in completed_indexes->succeeded (checked first, so a succeeded rank is not mis-flagged by a whole-job Failed condition), rank in failed_indexes->failed, whole-job Failed condition->failed (DeadlineExceeded/backoff), active>0->running, else pending, 404->cancelled, non-404 ApiException re-raised; run-length index strings expanded correctly incl. reversed ranges and whitespace (test_poll_* x7, test_expand_indexes_*).", "EKS GPU resource limit is always a STRING ('1' not int 1) per OpenAPI dict[str,str] typing; GPU node selector merged (caller wins) and nvidia.com/gpu NoSchedule toleration auto-added; CPU-only omits the gpu limit (test_launch_gpu_limit_is_string, test_launch_cpu_only_omits_gpu_limit).", "EKS partial-failure sibling cleanup is correctly N/A: launch issues exactly ONE create_namespaced_job (atomic gang scheduling), so there are no siblings to clean up if it fails — a genuine advantage of the single-Indexed-Job topology over N-job designs.", "SageMaker correctly uses N independent single-instance training jobs (ResourceConfig.InstanceCount==1) with rank via the Environment map (REPLICA_RANK/WORLD_SIZE/RENDEZVOUS_URI), and correctly passes the entrypoint args via ContainerArguments matching replica_entrypoint argparse; EnableNetworkIsolation pinned False (else S3 rendezvous deadlocks) — verified in test_launch_injects_rank_world_size_and_rendezvous_env.", "SageMaker partial-failure sibling cleanup is correct: a create_training_job failure at rank k best-effort stops the k already-launched siblings then raises with rank context (test_launch_partial_failure_stops_siblings_and_raises asserts 2 siblings stopped).", "SageMaker poll status mapping covers all 5 documented TrainingJobStatus values (InProgress->running, with SecondaryStatus refinement to pending for 
Starting/Pending/LaunchingMLInstances/PreparingTrainingStack; Completed->succeeded; Failed->failed; Stopping->running; Stopped->cancelled), vanished job (ResourceNotFound)->cancelled, unknown handle->cancelled; collect() correctly checks RAW SM status for terminality so Stopping keeps polling until Stopped (test_poll_status_mapping, test_poll_failed_and_stopped, test_poll_vanished_job_is_cancelled, test_poll_unknown_handle_is_cancelled).", "collect() reading S3: NOT a violation. The Protocol (executor.py:96-108) and ADR-005 require collect() to return status/exit metadata, not S3 contents — the result payload flows through ObjectStoreAllReduce written to S3 by each replica. SageMaker even surfaces the ModelArtifacts S3 path in result['result']. The reference LocalProcessExecutor returns an in-process value, confirming collect is not contractually an S3 reader.", "Full suite green: .venv/bin/python -m pytest composer_replication/diloco/serverless -q => 53 passed, 17 skipped (skips are the boto3/kubernetes/modal absent-path guards that cannot fire when the package is importable in this interpreter, plus integration gates)." ], "new_backlog_items": [ "EKS end-to-end run bug: default container command runs replica_entrypoint __main__ (argparse --rendezvous/--world-size/--trainer-module required) but EKSExecutor supplies env vars + no args and never plumbs trainer_module -> pod crashes on startup. Fix by passing ContainerArguments-equivalent args OR adding an env-var fallback to replica_entrypoint.__main__; add a test that the EKS-supplied argv/env actually parses. 
(Not in BACKLOG_RESOLUTION_2026-06-09; C2 only tracked building EKSExecutor, not the entrypoint contract.)", "Tighten EKSExecutor.cancel and SageMakerExecutor.cancel exception handling: only swallow 'already-terminated' errors (404/ResourceNotFound, already-terminal ValidationException); log/warn on other API errors so a failed gang-teardown of GPU resources is observable instead of silently leaving the cohort burning compute.", "Add a 'result' key to EKSExecutor.collect() result dicts (None or the rendezvous output URI) for cross-backend uniformity with Local/Modal/SageMaker, OR document in the Protocol that 'result' is an optional backend-specific extra so callers use .get('result')." ] }