Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Deep-Read: RL Infra & Frameworks — Critical Findings
Cluster 8 of the dataset-pipeline review series
Reviewer: automated critical pipeline, 2026-06-09
Primary sources fetched: TRL GRPOTrainer live docs (v1.5.1, huggingface.co/docs/trl/en/grpo_trainer), vLLM co-locate blog (June 2025, huggingface.co/blog/vllm-colocate), verl GitHub README (main branch, verl-project/verl — 21.9k stars), SWE-MiniSandbox paper (arXiv:2602.11210v5), secure EKS sandboxes article (AWS Builder Center). Repo files inspected: research/04-verl-trl.md, research/03-monarch-torchforge-openenv.md, docs/adrs/ADR-006-rl-frameworks.md, docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md, composer_replication/datagen/env.py, composer_replication/trainer/composer_trainer.py, composer_replication/recipes/prime_rl/composer_loss.py.
1. TRL GRPOTrainer: Does It Actually Support Multi-Turn Agentic Rollouts?
1.1 What the live TRL docs say (as of TRL v1.5.1, fetched 2026-06-10)
The TRL GRPOTrainer docs (primary source, not the repo's research/04 note written in May 2025) describe two distinct agentic mechanisms:
Mechanism A — tools parameter: Pass a list of Python callables. GRPOTrainer runs a tool-call loop. Quote from docs: "GRPO supports agent training through the tools argument in GRPOTrainer." The loop has a hard cap max_tool_calling_iterations (default: unlimited, stops on no-tool-call response or max_model_length). Each tool call is synchronous — the training GPU waits.
Mechanism B — environment_factory parameter: Pass a callable that creates environment instances. "GRPOTrainer creates one environment instance per rollout and exposes the environment's public methods as tools." Requires transformers>=5.2.0. Marked experimental: "This feature is experimental and may change or be removed at any time without prior notice." The reset() method can return a string that gets appended to the last user message. rollout_func is similarly experimental.
Mechanism C — rollout_func (custom rollout): A callable that receives prompts and the trainer, returns {"prompt_ids", "completion_ids", "logprobs"}. Also experimental. This is the escape hatch for fully custom multi-turn generation.
Key constraint confirmed from primary source: TRL has no async GPU-decoupled agent loop. The docs explicitly state the training-inference mismatch and handle it via Truncated Importance Sampling (vllm_importance_sampling_correction=True by default), not by async GPU handoff. When a tool call is executing, the GPU waits. This is not a flaw in the repo's research note — research/04-verl-trl.md correctly identified this gap — but the docs now show TRL has partially closed the multi-turn gap via tools / environment_factory.
1.2 What research/04-verl-trl.md claims vs. primary source
| Claim in research/04 | Primary source (TRL docs, 2026) | Verdict |
|---|---|---|
| "TRL does NOT have an async GPU-decoupled agent loop" | Confirmed | CORRECT |
| "OpenEnv integration (October 2025)" | Confirmed; environment_factory + TRL's OpenEnv guide |
CORRECT |
| "VLM support" | Confirmed — tools can return list of content blocks incl. images |
CORRECT |
| "GRPOTrainer supports multi-step agentic rollouts" (04:173) | Confirmed via tools + environment_factory |
CORRECT |
| TRL v1.0 released March 2026 | Confirmed; docs show versions v1.0.0 through v1.5.1 | CORRECT |
Default loss_type is "dapo" |
CONFIRMED from source: loss_type: str = 'dapo' in GRPOConfig |
CORRECT |
Default scale_rewards is... |
CONFIRMED: default is "group" (not False/"none") |
CORRECT |
1.3 Critical discovery: TRL's default is DAPO, not GRPO
The TRL GRPOConfig shows loss_type = 'dapo' as the default. ADR-008 claims to configure loss_type="dr_grpo" to match Composer 2.5. The source confirms "dr_grpo" is a valid value (uses max_completion_length as the constant denominator). This is consistent with ADR-008's decision.
However: ADR-008 states "KL estimator (k1 vs k3) is not configured or asserted" as an OPEN item. The TRL docs show beta=0.0 as default (KL term disabled). If beta=0, the k1/k3 distinction is moot — there is no KL term in the loss at all. The ADR-008 open item is therefore low-priority when beta=0 (the current default). If the repo ever enables beta>0 to use k1 KL-in-loss (distinct from the k1-in-reward path the trainer already implements), the open item becomes relevant.
1.4 scale_rewards drift assertion
ADR-008 checks str(cfg.scale_rewards).lower() in ("none","false"). Primary source confirms scale_rewards accepts: True/"group" (default), "batch", False/"none". The check is correct.
2. The Colocate-vLLM Blog: What It Actually Says
Primary source: "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" (June 3, 2025, huggingface.co/blog/vllm-colocate).
What the blog confirms:
- Co-locate mode (
vllm_mode="colocate") runs training and vLLM in the same process, sharing GPUs. No REST API overhead. - Speedups measured: 1.43× for 1.5B model, 1.35× for 7B model, 1.73× for 7B with TP>1. For 72B model (Qwen2.5-Math-72B): co-locate is ~1.26× faster than plain TRL with 4 fewer GPUs.
- vLLM sleep mode (level 2) is not yet merged into TRL upstream (as of the blog date) due to a segfault on exit (vLLM issue #16993). The docs now show
vllm_enable_sleep_modeas a parameter, implying it was eventually merged, but the blog notes a real production bug. - FSDP + co-locate + LoRA has known issues: "GRPO + FSDP + LoRA + VLLM colocate" doesn't work; DeepSpeed ZeRO-3 is the recommended path. FSDP1 has a NaN bug with co-located vLLM (issue #14443).
What research/04 says: Does not cite the co-locate blog specifically but correctly describes the "NO GPU left behind" feature (June 2025 update, row in §2.7 table). No material misread.
What the repo's SageMaker smoke recipe uses: The SageMaker GRPO smoke (from git history context) uses use_vllm=False for initial tests, which is fine — the co-locate mode requires enough GPU memory for both model and vLLM, and a single g5.2xlarge (1× A10G, 24 GB) may not accommodate it.
3. VeRL: Agentic Mode and the AsyncServer
Primary source: verl GitHub README (main branch, fetched 2026-06-10). Key finding from README:
"[2026/05] uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." "[2026/01] transfer_queue, fully_async_policy, one_step_off_policy ... are kept under verl/experimental since they are planned to be merged into the main library." Feature list includes: "Multi-turn with tool calling", "Sandbox Fusion Integration", "SGLang, verl, OpenBMB and Tsinghua University: Pioneering End-to-End Multi-Turn RLHF"
The verl README confirms multi-turn tool-calling exists and uni-agent was released May 2026 as a unified agent framework. The AsyncServer/AgentLoop architecture described in research/04 is consistent with what the README describes, though the README doesn't use those exact terms. The experimental async features (fully_async_policy, transfer_queue) are available but not yet in main.
What research/04 claims about VeRL agentic support:
- "First-class agentic RL support" with
AsyncServer/AgentLoop— the README confirms the direction but notes these are underverl/experimental. The research/04 characterization of "first-class" slightly overclaims what is in the stable API; the full async path is experimental. SandboxFusionTool— mentioned in the README as a documented integration ("Sandbox Fusion Integration" link). Consistent.- "Multi-turn tokenisation: noted as complex; naive concatenation of per-turn token IDs can introduce distribution drift" — confirmed by the README community blog link "When Reasoning Models Break Tokenization: The Hidden Complexity of Multiturn Training."
4. PRIME-RL in the Repo
4.1 What ADR-006 claims
ADR-006 claims PRIME-RL ships a CustomLossConfig with import_path for dropping in a Python loss function, exposing LossInputs with trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask. It was used to train INTELLECT-1 (10B, 30 nodes) and INTELLECT-2 (32B QwQ).
4.2 What composer_replication/recipes/prime_rl/composer_loss.py confirms
The code reads (lines 21-28):
@dataclass
class LossInputs:
trainer_logprobs: Float[Tensor, ' seq']
inference_logprobs: Float[Tensor, ' seq']
teacher_logprobs: Float[Tensor, ' seq'] | None
advantages: Float[Tensor, ' seq']
loss_mask: Bool[Tensor, ' seq']
This is marked as "verified against PrimeIntellect-ai/prime-rl src/prime_rl/trainer/rl/loss.py lines 13-22." The code correctly raises NotImplementedError when alpha_sdpo > 0 (logits not available, only log-probs). This is a real constraint, not a placeholder.
4.3 The DPPO upstream loss — a subtle accuracy point
The composer_loss.py reproduces the upstream DPPO loss verbatim (lines 40-60 of the file). It uses:
probs_diff = exp(trainer_logprobs) - exp(inference_logprobs) # probability-space diff
This is notably not a log-ratio but a probability-space difference gating the drop/keep mask. This is PRIME-RL's design, not a repo mistake. But it means the DPPO channel is more like PRIME-RL's INTELLECT-style training than standard GRPO — the repo's framing in ADR-006 as "channels 1+3" needs to be understood in that context: channel 1 is DPPO-shaped (probability-gated policy update), not raw GRPO.
5. The Key Question: Is TRL's Single-Submit reward_fn a Dead End for Multi-Turn?
5.1 What env.py::reward_fn actually does
def reward_fn(self, prompts, completions, *, task_id, **kwargs) -> list[float]:
...
for comp, tid in zip(completions, task_id):
task = self.registry[tid]
self.reset(task)
if self._replay is not None:
res = self._replay(self, comp)
else:
res = self.step({"type": "submit"}) # <-- single submit
rewards.append(res.reward)
The fallback path (no _replay function) treats the entire completion as a single submit — this is an outcome reward on a single-turn completion. This is what the unit tests exercise and what a standard TRL reward_funcs call would do.
The intended multi-turn path is _replay: a callable that takes (env, completion) and drives multi-turn turns by parsing the agent's encoded tool-call history from the completion string. This is a custom deserializer that replays the agent turns and grades at the end.
5.2 Is this a dead end for multi-turn RL?
For current TRL integration: partly yes, mostly no.
The single-submit fallback IS a dead end for genuine multi-turn RL credit assignment — it cannot grade intermediate tool-call steps. But there are two viable paths:
Path A — _replay + rollout_func (TRL experimental): The rollout_func parameter in GRPOTrainer can drive multi-turn generation externally (running the env's step() loop), serialize the full trajectory into completion tokens, then call reward_fn which uses _replay to deserialize and grade. This makes the reward_fn the grader, not the rollout driver. This works in TRL today but requires the experimental rollout_func interface.
Path B — environment_factory (TRL experimental): Pass FeatureDeletionEnv (or an adapter) as environment_factory. GRPOTrainer calls reset() and then uses the env's public methods as tools. The reward_fn is replaced by a reward function that reads environments[i].reward after generation. This is the more principled path for true multi-turn RL and is what TRL's environment_factory was designed for. It requires transformers>=5.2.0 and is still experimental.
Path C — VeRL's AgentLoop: For the tree-of-work (multiple parallel rollout branches per prompt, credit assigned at trajectory end), VeRL's AsyncServer+AgentLoop is architecturally the right fit. Each branch is a coroutine; GPU is not blocked during sandbox.exec() calls. The repo acknowledges this in research/04 §5.3 recommendation.
5.3 The honest migration path
The current TRL single-submit reward_fn is:
- Correct for the Phase 1 use case: offline dataset generation where the model produces a diff and we grade it. This is the "GRPO on completions" paradigm.
- Insufficient for genuine multi-turn RL over FeatureDeletionEnv episodes, especially the tree-of-work vision where the model takes tool-call steps, explores branches, and gets rewards at trajectory end.
Migration path (in order of complexity):
Immediate (low cost): Use TRL's
environment_factorywithFeatureDeletionEnvas the adapter. The env'sstep()becomes a tool. Grade viareward_funcsreadingenv.reward. Marked experimental but low integration cost. This supports genuine multi-turn GRPO with the current TRL host.Medium term (single-GPU scale): Implement
rollout_functhat drives the env loop directly, returns serialized trajectories with log-probs. Full control over multi-turn; TRL handles the GRPO update.Scale-out (multi-GPU, async, tree-of-work): Migrate to VeRL's
AgentLoop. TheFeatureDeletionEnvmaps onto verl'sSandboxFusionToolprotocol. The tree-of-work branching requires N parallel rollout workers per prompt, which VeRL's asyncio architecture supports and TRL's synchronous loop does not.
The tree-of-work IS multi-turn. The vision in framework/composer-replication-framework.md of a "multi-model Monte-Carlo tree-of-work" requires:
- Many concurrent rollout branches per prompt
- Reward propagated back through the tree (not just at leaf)
- Asynchronous sandbox execution without blocking GPU
None of these are provided by TRL's current GRPOTrainer (even with tools/environment_factory). VeRL's experimental fully_async_policy + AgentLoop is the right substrate. The repo's research/04 correctly identifies this but the ADR layer has not formally acknowledged this migration requirement.
6. Sandboxing for Code Execution at Scale
6.1 What the secure-EKS article says (primary source)
"gVisor added negligible launch latency... handles isolation for most agent workloads." "Cold start was around 5 seconds per sandbox" for Kata+Firecracker. "EKS Managed Node Groups do not work yet: they override the CPU Options stanza needed for nested virtualization, forcing the use of self-managed node groups." "Managed sandbox platforms skip Kubernetes entirely. E2B and Vercel Sandbox provision Firecracker microVMs directly... sandbox creation in under a second, versus ~5 seconds Kata with Firecracker on EKS."
6.2 What SWE-MiniSandbox (arXiv:2602.11210v5) adds
Abstract: "SWE-MiniSandbox lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline." Uses "kernel-level mechanisms" (not containers) with "lightweight environment pre-caching." Empirical performance comparable to container-based pipelines on SWE-bench-style tasks.
This is directly relevant to the repo's FeatureDeletionEnv/Sandbox design. The current sandbox.py uses LocalSubprocessSandbox (plain subprocess, Docker-gated for real tests) — essentially no isolation for the subprocess case. For production RL training at scale with multiple rollout workers, the SWE-MiniSandbox approach (kernel-level isolation without per-task container builds) could reduce env setup from minutes to seconds.
6.3 What the repo's sandbox.py actually provides
sandbox.py defines:
LocalSubprocessSandbox— runs commands viasubprocess.runin the repo tree. No container, no kernel isolation. The security model relies on the denylist + cache scrub (commented as "INSUFFICIENT as a primary control").DockerSandbox(indocker_sandbox.py) — real isolation, referenced in tests.
The gap: For RL training at scale (many parallel rollout workers), neither LocalSubprocessSandbox nor per-task Docker containers are adequate:
- Subprocess: no isolation, reward hacking possible via Python import tricks (acknowledged in code comments).
- Docker: isolation is good, but per-task container boot is slow (typical: 2–5s without pre-warming), and at 8 rollouts × N prompts × G generations = hundreds of container launches per training step.
The repo's research/review-sandbox.json presumably tracks this; the production path requires pre-warmed sandbox pools (gVisor RuntimeClass for speed, Kata/Firecracker for stronger isolation).
7. Misreads, Overclaims, and Gaps in Repo Research
7.1 OVERCLAIM: VeRL "first-class" agentic RL
research/04 §1.5: "VeRL has first-class agentic RL support" and describes AsyncServer/AgentLoop as stable. The verl README (main branch, 2026-06-10) shows:
transfer_queue,fully_async_policy,one_step_off_policyare kept underverl/experimental— "planned to be merged into the main library."uni-agent(May 2026) provides the higher-level agent framework, but it's a separate release on top of verl, not part of the stable library.
The agentic async path exists but is experimental in the stable API. The characterization as "first-class" is slightly ahead of the actual maturity. The repo should note this when recommending VeRL for the tree-of-work.
7.2 MISS: TRL now has environment_factory + tools for multi-turn
research/04 (written May 2025) describes TRL as having only synchronous reward functions. The live TRL docs (v1.5.1, 2026) show environment_factory and tools for genuine multi-turn generation loops. These are experimental but available. The research/04 comparison matrix says "TRL: agentic tool-calling RL ⚠️ (blocking)" — this remains accurate (it IS blocking), but misses that TRL now provides the environment_factory interface which would allow FeatureDeletionEnv to drive multi-turn episodes inside GRPOTrainer without a custom rollout_func. This is not mentioned in any ADR.
7.3 MISS: TRL default loss_type="dapo", NOT "grpo" or "dr_grpo"
ADR-008 correctly targets loss_type="dr_grpo", but research/04 (the background research) does not explicitly state that TRL 1.x defaults to DAPO loss. The live docs confirm this. The drift assertion in make_dr_grpo_config is the right mitigation.
7.4 GAP: scale_rewards default is "group", not True
The GRPOConfig shows scale_rewards: str = 'group' (a string, not a bool). ADR-008's assertion str(cfg.scale_rewards).lower() in ("none","false") correctly handles both the old bool (False) and new string ("none") forms. But the docs show True and "group" are equivalent (both mean group-level std scaling). The assertion is correct; this is a documentation note, not a bug.
7.5 GAP: KL estimator — TRL default is k3, not k1
The TRL docs show the KL approximator formula:
D_KL[π_θ || π_ref] = π_ref/π_θ - log(π_ref/π_θ) - 1
This is the k3 estimator (Schulman et al., 2020). ADR-008 claims Composer 2.5 uses k1 (-log r = log π_θ/π_ref). The ADR notes this as an OPEN item. Since beta=0.0 by default in TRL, the KL term is disabled and this doesn't affect training unless beta>0. However, composer_trainer.py implements the k1-in-reward path via kl_in_reward.py — this is a separate mechanism from TRL's in-loss KL. Verify: the k1-in-reward path computes log(π_θ/π_ref) at reward time and folds it into advantages, while TRL's in-loss k3 term (when beta>0) would add a different term. If both are enabled simultaneously, they would double-count KL. The safe configuration is: k1-in-reward active, beta=0 (TRL in-loss KL disabled). The code appears to do this but there's no explicit assertion that beta=0 when kl_in_reward=True.
7.6 MISS: num_iterations=1 narrowed claim
ADR-008 acknowledges that num_iterations=1 controls GRPO inner-loop reuse, not dataset-level epochs. Primary source confirms: "Number of iterations per batch (denoted as μ in the algorithm)." The ADR's narrowed claim is correct.
7.7 MISS: TRL default optimizer is adamw_torch_fused, not adam
ADR-008 has an OPEN item: "Adam is claimed but optim is not set." The GRPOConfig docs show:
optim: transformers.training_args.OptimizerNames | str = 'adamw_torch_fused'
Default is adamw_torch_fused (AdamW with fused CUDA kernel), not plain adam. If Composer 2.5 uses Adam (without weight decay), the ADR's open item remains relevant: set weight_decay=0.0 and optim="adam" explicitly to match. The default AdamW has weight decay (though weight_decay=0.0 is already the GRPOConfig default, making it numerically equivalent to Adam in this specific case). However, adamw_torch_fused ≠ adam in terms of the optimizer implementation; to be precise, set optim="adamw_8bit" or optim="paged_adamw_8bit" (memory efficient) or just optim="adam_torch" if plain Adam is intended.
8. verl uni-agent — A New Development Not in Research/04
The verl README (May 2026): "uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." This is a post-cutoff development (research/04 was written May 2025) that the repo has not incorporated. uni-agent could be the production-ready path for multi-turn agentic RL with verl, potentially superseding the lower-level AgentLoop/AsyncServer integration that ADR-006 contemplates.
Implication for the repo: Before committing to a custom VeRL AgentLoop integration, evaluate whether uni-agent already provides the FeatureDeletionEnv integration pattern out of the box. This could significantly reduce the engineering surface.
9. Sandboxing Recommendation Gaps
9.1 The SWE-MiniSandbox approach is not referenced anywhere in the repo
arXiv:2602.11210 (Feb 2026) directly addresses the production gap: container-free RL training with 5% disk usage and 25% env setup time vs containers. The paper's "kernel-level mechanisms" (likely Linux namespaces + cgroups without a full container runtime) with pre-caching is directly applicable to FeatureDeletionEnv at scale. The repo's sandbox design doesn't reference this work.
9.2 The repo's docker_sandbox.py is production-blocking for RL at scale
Per-task Docker container boots at GRPO scale (G=8 completions/prompt, B=4 per-device batch, many workers) means O(B*G) = O(32) container launches per training step. Without pre-warming or snapshot-based fast boot, this is the dominant latency. The gVisor RuntimeClass approach (negligible overhead, per the AWS article) or SWE-MiniSandbox's kernel-namespace approach are both faster alternatives.
10. Summary of Critical Findings
| Finding | Severity | Affected Files |
|---|---|---|
| VeRL async agent loop is EXPERIMENTAL, not "first-class" stable | MEDIUM — overclaim | research/04 §1.5, ADR-006 |
TRL environment_factory (multi-turn) not in any ADR |
MEDIUM — miss | ADR-008, env.py |
| k1-in-reward + beta=0 assertion missing (double-KL risk if beta>0 ever set) | MEDIUM — correctness gap | composer_trainer.py, ADR-008 |
optim default is adamw_torch_fused, not adam |
LOW — fidelity gap | ADR-008 OPEN item |
TRL loss_type defaults to "dapo" (not GRPO), correctly handled |
INFO — confirmed correct | ADR-008, make_dr_grpo_config |
env.py::reward_fn single-submit path is dead end for tree-of-work |
HIGH — architecture gap | env.py, no ADR exists |
uni-agent (verl, May 2026) not evaluated — may supersede custom AgentLoop |
MEDIUM — miss | ADR-006 |
| SWE-MiniSandbox approach not referenced (5% disk, 25% setup time) | MEDIUM — miss | sandbox.py, docker_sandbox.py |
| EKS Managed Node Groups incompatible with Kata+Firecracker (nested virt) | INFO — production gotcha | no ADR |
11. Migration Path for Multi-Turn Agentic RL (Honest Assessment)
The current repo architecture (TRL reward_fn with single-submit fallback) is:
Phase 1 — GRPO on completions (current): The model generates a single diff/completion, the reward function grades it. This is viable, shippable, and correct. The TRL host is appropriate. No migration needed for this phase.
Phase 2 — Multi-turn FeatureDeletionEnv (agentic GRPO): The model takes tool-call steps (bash, file edits, test runs). Reward at trajectory end. Migration options in order:
TRL
environment_factoryadapter (experimental, weeks of work): WrapFeatureDeletionEnvas a TRL environment. Methods become tools. Blocking GPU during sandbox execution — OK for small scale (≤8 GPUs), not for high parallelism.TRL
rollout_func(experimental, 1–2 weeks): Custom rollout that drives the env loop, serializes trajectories. Full control; TRL handles GRPO update.VeRL AsyncServer +
FeatureDeletionEnvas SandboxFusionTool adapter (2–4 weeks): GPU not blocked during sandbox calls. Required for tree-of-work fan-out at scale. The repo's ADR-006/ADR-008 have this on the roadmap but it's not implemented.
Phase 3 — Tree-of-Work (MCTS, multi-branch): This REQUIRES verl's async architecture. TRL cannot support N parallel branches per prompt with GPU-efficient execution. The uni-agent framework on top of verl should be evaluated first before building a custom AgentLoop integration.
Sources: TRL docs fetched 2026-06-10 (huggingface.co/docs/trl/en/grpo_trainer); vLLM co-locate blog (huggingface.co/blog/vllm-colocate); verl README (github.com/volcengine/verl/blob/main/README.md); SWE-MiniSandbox arXiv:2602.11210v5; AWS Builder Center EKS sandboxes article. Repo files: research/04, research/03, ADR-006, ADR-008, env.py, composer_trainer.py, recipes/prime_rl/composer_loss.py.