Baladithya Balamurugan

Wave 21: deep-read critical review — 8 source clusters re-read, findings verified

2a16b30 26 days ago

26.1 kB

	# Deep-Read: RL Infra & Frameworks — Critical Findings
	Cluster 8 of the dataset-pipeline review series
	Reviewer: automated critical pipeline, 2026-06-09
	Primary sources fetched: TRL GRPOTrainer live docs (v1.5.1, huggingface.co/docs/trl/en/grpo_trainer), vLLM co-locate blog (June 2025, huggingface.co/blog/vllm-colocate), verl GitHub README (main branch, verl-project/verl — 21.9k stars), SWE-MiniSandbox paper (arXiv:2602.11210v5), secure EKS sandboxes article (AWS Builder Center). Repo files inspected: `research/04-verl-trl.md`, `research/03-monarch-torchforge-openenv.md`, `docs/adrs/ADR-006-rl-frameworks.md`, `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md`, `composer_replication/datagen/env.py`, `composer_replication/trainer/composer_trainer.py`, `composer_replication/recipes/prime_rl/composer_loss.py`.

	---

	## 1. TRL GRPOTrainer: Does It Actually Support Multi-Turn Agentic Rollouts?

	### 1.1 What the live TRL docs say (as of TRL v1.5.1, fetched 2026-06-10)

	The TRL GRPOTrainer docs (primary source, not the repo's research/04 note written in May 2025) describe two distinct agentic mechanisms:

	Mechanism A — `tools` parameter: Pass a list of Python callables. GRPOTrainer runs a tool-call loop. Quote from docs: "GRPO supports agent training through the `tools` argument in `GRPOTrainer`." The loop has a hard cap `max_tool_calling_iterations` (default: unlimited, stops on no-tool-call response or `max_model_length`). Each tool call is synchronous — the training GPU waits.

	Mechanism B — `environment_factory` parameter: Pass a callable that creates environment instances. "GRPOTrainer creates one environment instance per rollout and exposes the environment's public methods as tools." Requires `transformers>=5.2.0`. Marked experimental: "This feature is experimental and may change or be removed at any time without prior notice." The `reset()` method can return a string that gets appended to the last user message. `rollout_func` is similarly experimental.

	Mechanism C — `rollout_func` (custom rollout): A callable that receives prompts and the trainer, returns `{"prompt_ids", "completion_ids", "logprobs"}`. Also experimental. This is the escape hatch for fully custom multi-turn generation.

	Key constraint confirmed from primary source: TRL has no async GPU-decoupled agent loop. The docs explicitly state the training-inference mismatch and handle it via Truncated Importance Sampling (`vllm_importance_sampling_correction=True` by default), not by async GPU handoff. When a tool call is executing, the GPU waits. This is not a flaw in the repo's research note — `research/04-verl-trl.md` correctly identified this gap — but the docs now show TRL has partially closed the multi-turn gap via `tools` / `environment_factory`.

	### 1.2 What `research/04-verl-trl.md` claims vs. primary source

	\| Claim in research/04 \| Primary source (TRL docs, 2026) \| Verdict \|
	\|---\|---\|---\|
	\| "TRL does NOT have an async GPU-decoupled agent loop" \| Confirmed \| CORRECT \|
	\| "OpenEnv integration (October 2025)" \| Confirmed; `environment_factory` + TRL's OpenEnv guide \| CORRECT \|
	\| "VLM support" \| Confirmed — tools can return `list` of content blocks incl. images \| CORRECT \|
	\| "GRPOTrainer supports multi-step agentic rollouts" (04:173) \| Confirmed via `tools` + `environment_factory` \| CORRECT \|
	\| TRL v1.0 released March 2026 \| Confirmed; docs show versions v1.0.0 through v1.5.1 \| CORRECT \|
	\| Default `loss_type` is `"dapo"` \| CONFIRMED from source: `loss_type: str = 'dapo'` in GRPOConfig \| CORRECT \|
	\| Default `scale_rewards` is... \| CONFIRMED: default is `"group"` (not `False`/`"none"`) \| CORRECT \|

	### 1.3 Critical discovery: TRL's default is DAPO, not GRPO

	The TRL GRPOConfig shows `loss_type = 'dapo'` as the default. ADR-008 claims to configure `loss_type="dr_grpo"` to match Composer 2.5. The source confirms `"dr_grpo"` is a valid value (uses `max_completion_length` as the constant denominator). This is consistent with ADR-008's decision.

	However: ADR-008 states "KL estimator (k1 vs k3) is not configured or asserted" as an OPEN item. The TRL docs show `beta=0.0` as default (KL term disabled). If `beta=0`, the k1/k3 distinction is moot — there is no KL term in the loss at all. The ADR-008 open item is therefore low-priority when beta=0 (the current default). If the repo ever enables beta>0 to use k1 KL-in-loss (distinct from the k1-in-reward path the trainer already implements), the open item becomes relevant.

	### 1.4 `scale_rewards` drift assertion

	ADR-008 checks `str(cfg.scale_rewards).lower() in ("none","false")`. Primary source confirms `scale_rewards` accepts: `True`/`"group"` (default), `"batch"`, `False`/`"none"`. The check is correct.

	---

	## 2. The Colocate-vLLM Blog: What It Actually Says

	Primary source: "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" (June 3, 2025, huggingface.co/blog/vllm-colocate).

	What the blog confirms:
	- Co-locate mode (`vllm_mode="colocate"`) runs training and vLLM in the same process, sharing GPUs. No REST API overhead.
	- Speedups measured: 1.43× for 1.5B model, 1.35× for 7B model, 1.73× for 7B with TP>1. For 72B model (Qwen2.5-Math-72B): co-locate is ~1.26× faster than plain TRL with 4 fewer GPUs.
	- vLLM sleep mode (level 2) is not yet merged into TRL upstream (as of the blog date) due to a segfault on exit (vLLM issue #16993). The docs now show `vllm_enable_sleep_mode` as a parameter, implying it was eventually merged, but the blog notes a real production bug.
	- FSDP + co-locate + LoRA has known issues: "GRPO + FSDP + LoRA + VLLM colocate" doesn't work; DeepSpeed ZeRO-3 is the recommended path. FSDP1 has a NaN bug with co-located vLLM (issue #14443).

	What `research/04` says: Does not cite the co-locate blog specifically but correctly describes the "NO GPU left behind" feature (June 2025 update, row in §2.7 table). No material misread.

	What the repo's SageMaker smoke recipe uses: The SageMaker GRPO smoke (from git history context) uses `use_vllm=False` for initial tests, which is fine — the co-locate mode requires enough GPU memory for both model and vLLM, and a single g5.2xlarge (1× A10G, 24 GB) may not accommodate it.

	---

	## 3. VeRL: Agentic Mode and the AsyncServer

	Primary source: verl GitHub README (main branch, fetched 2026-06-10). Key finding from README:

	> "[2026/05] uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl."
	> "[2026/01] transfer_queue, fully_async_policy, one_step_off_policy ... are kept under verl/experimental since they are planned to be merged into the main library."
	> Feature list includes: "Multi-turn with tool calling", "Sandbox Fusion Integration", "SGLang, verl, OpenBMB and Tsinghua University: Pioneering End-to-End Multi-Turn RLHF"

	The verl README confirms multi-turn tool-calling exists and `uni-agent` was released May 2026 as a unified agent framework. The `AsyncServer`/`AgentLoop` architecture described in `research/04` is consistent with what the README describes, though the README doesn't use those exact terms. The experimental async features (`fully_async_policy`, `transfer_queue`) are available but not yet in main.

	What `research/04` claims about VeRL agentic support:
	- "First-class agentic RL support" with `AsyncServer`/`AgentLoop` — the README confirms the direction but notes these are under `verl/experimental`. The research/04 characterization of "first-class" slightly overclaims what is in the stable API; the full async path is experimental.
	- `SandboxFusionTool` — mentioned in the README as a documented integration ("Sandbox Fusion Integration" link). Consistent.
	- "Multi-turn tokenisation: noted as complex; naive concatenation of per-turn token IDs can introduce distribution drift" — confirmed by the README community blog link "When Reasoning Models Break Tokenization: The Hidden Complexity of Multiturn Training."

	---

	## 4. PRIME-RL in the Repo

	### 4.1 What ADR-006 claims

	ADR-006 claims PRIME-RL ships a `CustomLossConfig` with `import_path` for dropping in a Python loss function, exposing `LossInputs` with `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask`. It was used to train INTELLECT-1 (10B, 30 nodes) and INTELLECT-2 (32B QwQ).

	### 4.2 What `composer_replication/recipes/prime_rl/composer_loss.py` confirms

	The code reads (lines 21-28):
	```python
	@dataclass
	class LossInputs:
	trainer_logprobs: Float[Tensor, ' seq']
	inference_logprobs: Float[Tensor, ' seq']
	teacher_logprobs: Float[Tensor, ' seq'] \| None
	advantages: Float[Tensor, ' seq']
	loss_mask: Bool[Tensor, ' seq']
	```
	This is marked as "verified against PrimeIntellect-ai/prime-rl `src/prime_rl/trainer/rl/loss.py` lines 13-22." The code correctly raises `NotImplementedError` when `alpha_sdpo > 0` (logits not available, only log-probs). This is a real constraint, not a placeholder.

	### 4.3 The DPPO upstream loss — a subtle accuracy point

	The `composer_loss.py` reproduces the upstream DPPO loss verbatim (lines 40-60 of the file). It uses:
	```python
	probs_diff = exp(trainer_logprobs) - exp(inference_logprobs) # probability-space diff
	```
	This is notably not a log-ratio but a probability-space difference gating the drop/keep mask. This is PRIME-RL's design, not a repo mistake. But it means the DPPO channel is more like PRIME-RL's INTELLECT-style training than standard GRPO — the repo's framing in ADR-006 as "channels 1+3" needs to be understood in that context: channel 1 is DPPO-shaped (probability-gated policy update), not raw GRPO.

	---

	## 5. The Key Question: Is TRL's Single-Submit `reward_fn` a Dead End for Multi-Turn?

	### 5.1 What `env.py::reward_fn` actually does

	```python
	def reward_fn(self, prompts, completions, , task_id, *kwargs) -> list[float]:
	...
	for comp, tid in zip(completions, task_id):
	task = self.registry[tid]
	self.reset(task)
	if self._replay is not None:
	res = self._replay(self, comp)
	else:
	res = self.step({"type": "submit"}) # <-- single submit
	rewards.append(res.reward)
	```

	The fallback path (no `_replay` function) treats the entire completion as a single submit — this is an outcome reward on a single-turn completion. This is what the unit tests exercise and what a standard TRL `reward_funcs` call would do.

	The intended multi-turn path is `_replay`: a callable that takes `(env, completion)` and drives multi-turn turns by parsing the agent's encoded tool-call history from the `completion` string. This is a custom deserializer that replays the agent turns and grades at the end.

	### 5.2 Is this a dead end for multi-turn RL?

	For current TRL integration: partly yes, mostly no.

	The single-submit fallback IS a dead end for genuine multi-turn RL credit assignment — it cannot grade intermediate tool-call steps. But there are two viable paths:

	Path A — `_replay` + `rollout_func` (TRL experimental): The `rollout_func` parameter in GRPOTrainer can drive multi-turn generation externally (running the env's `step()` loop), serialize the full trajectory into `completion` tokens, then call `reward_fn` which uses `_replay` to deserialize and grade. This makes the `reward_fn` the grader, not the rollout driver. This works in TRL today but requires the experimental `rollout_func` interface.

	Path B — `environment_factory` (TRL experimental): Pass `FeatureDeletionEnv` (or an adapter) as `environment_factory`. GRPOTrainer calls `reset()` and then uses the env's public methods as tools. The `reward_fn` is replaced by a reward function that reads `environments[i].reward` after generation. This is the more principled path for true multi-turn RL and is what TRL's `environment_factory` was designed for. It requires `transformers>=5.2.0` and is still experimental.

	Path C — VeRL's AgentLoop: For the tree-of-work (multiple parallel rollout branches per prompt, credit assigned at trajectory end), VeRL's `AsyncServer`+`AgentLoop` is architecturally the right fit. Each branch is a coroutine; GPU is not blocked during `sandbox.exec()` calls. The repo acknowledges this in `research/04` §5.3 recommendation.

	### 5.3 The honest migration path

	The current TRL single-submit `reward_fn` is:
	- Correct for the Phase 1 use case: offline dataset generation where the model produces a diff and we grade it. This is the "GRPO on completions" paradigm.
	- Insufficient for genuine multi-turn RL over FeatureDeletionEnv episodes, especially the tree-of-work vision where the model takes tool-call steps, explores branches, and gets rewards at trajectory end.

	Migration path (in order of complexity):

	1. Immediate (low cost): Use TRL's `environment_factory` with `FeatureDeletionEnv` as the adapter. The env's `step()` becomes a tool. Grade via `reward_funcs` reading `env.reward`. Marked experimental but low integration cost. This supports genuine multi-turn GRPO with the current TRL host.

	2. Medium term (single-GPU scale): Implement `rollout_func` that drives the env loop directly, returns serialized trajectories with log-probs. Full control over multi-turn; TRL handles the GRPO update.

	3. Scale-out (multi-GPU, async, tree-of-work): Migrate to VeRL's `AgentLoop`. The `FeatureDeletionEnv` maps onto verl's `SandboxFusionTool` protocol. The tree-of-work branching requires N parallel rollout workers per prompt, which VeRL's asyncio architecture supports and TRL's synchronous loop does not.

	The tree-of-work IS multi-turn. The vision in `framework/composer-replication-framework.md` of a "multi-model Monte-Carlo tree-of-work" requires:
	- Many concurrent rollout branches per prompt
	- Reward propagated back through the tree (not just at leaf)
	- Asynchronous sandbox execution without blocking GPU

	None of these are provided by TRL's current `GRPOTrainer` (even with `tools`/`environment_factory`). VeRL's experimental `fully_async_policy` + `AgentLoop` is the right substrate. The repo's `research/04` correctly identifies this but the ADR layer has not formally acknowledged this migration requirement.

	---

	## 6. Sandboxing for Code Execution at Scale

	### 6.1 What the secure-EKS article says (primary source)

	> "gVisor added negligible launch latency... handles isolation for most agent workloads."
	> "Cold start was around 5 seconds per sandbox" for Kata+Firecracker.
	> "EKS Managed Node Groups do not work yet: they override the CPU Options stanza needed for nested virtualization, forcing the use of self-managed node groups."
	> "Managed sandbox platforms skip Kubernetes entirely. E2B and Vercel Sandbox provision Firecracker microVMs directly... sandbox creation in under a second, versus ~5 seconds Kata with Firecracker on EKS."

	### 6.2 What SWE-MiniSandbox (arXiv:2602.11210v5) adds

	Abstract: "SWE-MiniSandbox lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline." Uses "kernel-level mechanisms" (not containers) with "lightweight environment pre-caching." Empirical performance comparable to container-based pipelines on SWE-bench-style tasks.

	This is directly relevant to the repo's `FeatureDeletionEnv`/`Sandbox` design. The current `sandbox.py` uses `LocalSubprocessSandbox` (plain subprocess, Docker-gated for real tests) — essentially no isolation for the subprocess case. For production RL training at scale with multiple rollout workers, the SWE-MiniSandbox approach (kernel-level isolation without per-task container builds) could reduce env setup from minutes to seconds.

	### 6.3 What the repo's sandbox.py actually provides

	`sandbox.py` defines:
	- `LocalSubprocessSandbox` — runs commands via `subprocess.run` in the repo tree. No container, no kernel isolation. The security model relies on the denylist + cache scrub (commented as "INSUFFICIENT as a primary control").
	- `DockerSandbox` (in `docker_sandbox.py`) — real isolation, referenced in tests.

	The gap: For RL training at scale (many parallel rollout workers), neither `LocalSubprocessSandbox` nor per-task Docker containers are adequate:
	- Subprocess: no isolation, reward hacking possible via Python import tricks (acknowledged in code comments).
	- Docker: isolation is good, but per-task container boot is slow (typical: 2–5s without pre-warming), and at 8 rollouts × N prompts × G generations = hundreds of container launches per training step.

	The repo's `research/review-sandbox.json` presumably tracks this; the production path requires pre-warmed sandbox pools (gVisor RuntimeClass for speed, Kata/Firecracker for stronger isolation).

	---

	## 7. Misreads, Overclaims, and Gaps in Repo Research

	### 7.1 OVERCLAIM: VeRL "first-class" agentic RL

	`research/04` §1.5: "VeRL has first-class agentic RL support" and describes `AsyncServer`/`AgentLoop` as stable. The verl README (main branch, 2026-06-10) shows:
	- `transfer_queue`, `fully_async_policy`, `one_step_off_policy` are kept under `verl/experimental` — "planned to be merged into the main library."
	- `uni-agent` (May 2026) provides the higher-level agent framework, but it's a separate release on top of verl, not part of the stable library.

	The agentic async path exists but is experimental in the stable API. The characterization as "first-class" is slightly ahead of the actual maturity. The repo should note this when recommending VeRL for the tree-of-work.

	### 7.2 MISS: TRL now has `environment_factory` + `tools` for multi-turn

	`research/04` (written May 2025) describes TRL as having only synchronous reward functions. The live TRL docs (v1.5.1, 2026) show `environment_factory` and `tools` for genuine multi-turn generation loops. These are experimental but available. The research/04 comparison matrix says "TRL: agentic tool-calling RL ⚠️ (blocking)" — this remains accurate (it IS blocking), but misses that TRL now provides the `environment_factory` interface which would allow `FeatureDeletionEnv` to drive multi-turn episodes inside GRPOTrainer without a custom `rollout_func`. This is not mentioned in any ADR.

	### 7.3 MISS: TRL default `loss_type="dapo"`, NOT `"grpo"` or `"dr_grpo"`

	ADR-008 correctly targets `loss_type="dr_grpo"`, but research/04 (the background research) does not explicitly state that TRL 1.x defaults to DAPO loss. The live docs confirm this. The drift assertion in `make_dr_grpo_config` is the right mitigation.

	### 7.4 GAP: `scale_rewards` default is `"group"`, not `True`

	The GRPOConfig shows `scale_rewards: str = 'group'` (a string, not a bool). ADR-008's assertion `str(cfg.scale_rewards).lower() in ("none","false")` correctly handles both the old bool (`False`) and new string (`"none"`) forms. But the docs show `True` and `"group"` are equivalent (both mean group-level std scaling). The assertion is correct; this is a documentation note, not a bug.

	### 7.5 GAP: KL estimator — TRL default is k3, not k1

	The TRL docs show the KL approximator formula:
	```
	D_KL[π_θ \|\| π_ref] = π_ref/π_θ - log(π_ref/π_θ) - 1
	```
	This is the k3 estimator (Schulman et al., 2020). ADR-008 claims Composer 2.5 uses k1 (`-log r = log π_θ/π_ref`). The ADR notes this as an OPEN item. Since `beta=0.0` by default in TRL, the KL term is disabled and this doesn't affect training unless `beta>0`. However, `composer_trainer.py` implements the k1-in-reward path via `kl_in_reward.py` — this is a separate mechanism from TRL's in-loss KL. Verify: the k1-in-reward path computes `log(π_θ/π_ref)` at reward time and folds it into advantages, while TRL's in-loss k3 term (when beta>0) would add a different term. If both are enabled simultaneously, they would double-count KL. The safe configuration is: k1-in-reward active, beta=0 (TRL in-loss KL disabled). The code appears to do this but there's no explicit assertion that `beta=0` when `kl_in_reward=True`.

	### 7.6 MISS: `num_iterations=1` narrowed claim

	ADR-008 acknowledges that `num_iterations=1` controls GRPO inner-loop reuse, not dataset-level epochs. Primary source confirms: "Number of iterations per batch (denoted as μ in the algorithm)." The ADR's narrowed claim is correct.

	### 7.7 MISS: TRL default optimizer is `adamw_torch_fused`, not `adam`

	ADR-008 has an OPEN item: "Adam is claimed but `optim` is not set." The GRPOConfig docs show:
	```
	optim: transformers.training_args.OptimizerNames \| str = 'adamw_torch_fused'
	```
	Default is `adamw_torch_fused` (AdamW with fused CUDA kernel), not plain `adam`. If Composer 2.5 uses Adam (without weight decay), the ADR's open item remains relevant: set `weight_decay=0.0` and `optim="adam"` explicitly to match. The default AdamW has weight decay (though `weight_decay=0.0` is already the GRPOConfig default, making it numerically equivalent to Adam in this specific case). However, `adamw_torch_fused` ≠ `adam` in terms of the optimizer implementation; to be precise, set `optim="adamw_8bit"` or `optim="paged_adamw_8bit"` (memory efficient) or just `optim="adam_torch"` if plain Adam is intended.

	---

	## 8. verl `uni-agent` — A New Development Not in Research/04

	The verl README (May 2026): "uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." This is a post-cutoff development (research/04 was written May 2025) that the repo has not incorporated. `uni-agent` could be the production-ready path for multi-turn agentic RL with verl, potentially superseding the lower-level `AgentLoop`/`AsyncServer` integration that ADR-006 contemplates.

	Implication for the repo: Before committing to a custom VeRL `AgentLoop` integration, evaluate whether `uni-agent` already provides the FeatureDeletionEnv integration pattern out of the box. This could significantly reduce the engineering surface.

	---

	## 9. Sandboxing Recommendation Gaps

	### 9.1 The `SWE-MiniSandbox` approach is not referenced anywhere in the repo

	arXiv:2602.11210 (Feb 2026) directly addresses the production gap: container-free RL training with 5% disk usage and 25% env setup time vs containers. The paper's "kernel-level mechanisms" (likely Linux namespaces + cgroups without a full container runtime) with pre-caching is directly applicable to `FeatureDeletionEnv` at scale. The repo's sandbox design doesn't reference this work.

	### 9.2 The repo's `docker_sandbox.py` is production-blocking for RL at scale

	Per-task Docker container boots at GRPO scale (G=8 completions/prompt, B=4 per-device batch, many workers) means O(B*G) = O(32) container launches per training step. Without pre-warming or snapshot-based fast boot, this is the dominant latency. The gVisor RuntimeClass approach (negligible overhead, per the AWS article) or SWE-MiniSandbox's kernel-namespace approach are both faster alternatives.

	---

	## 10. Summary of Critical Findings

	\| Finding \| Severity \| Affected Files \|
	\|---\|---\|---\|
	\| VeRL async agent loop is EXPERIMENTAL, not "first-class" stable \| MEDIUM — overclaim \| research/04 §1.5, ADR-006 \|
	\| TRL `environment_factory` (multi-turn) not in any ADR \| MEDIUM — miss \| ADR-008, env.py \|
	\| k1-in-reward + beta=0 assertion missing (double-KL risk if beta>0 ever set) \| MEDIUM — correctness gap \| composer_trainer.py, ADR-008 \|
	\| `optim` default is `adamw_torch_fused`, not `adam` \| LOW — fidelity gap \| ADR-008 OPEN item \|
	\| TRL `loss_type` defaults to `"dapo"` (not GRPO), correctly handled \| INFO — confirmed correct \| ADR-008, make_dr_grpo_config \|
	\| `env.py::reward_fn` single-submit path is dead end for tree-of-work \| HIGH — architecture gap \| env.py, no ADR exists \|
	\| `uni-agent` (verl, May 2026) not evaluated — may supersede custom AgentLoop \| MEDIUM — miss \| ADR-006 \|
	\| SWE-MiniSandbox approach not referenced (5% disk, 25% setup time) \| MEDIUM — miss \| sandbox.py, docker_sandbox.py \|
	\| EKS Managed Node Groups incompatible with Kata+Firecracker (nested virt) \| INFO — production gotcha \| no ADR \|

	---

	## 11. Migration Path for Multi-Turn Agentic RL (Honest Assessment)

	The current repo architecture (TRL `reward_fn` with single-submit fallback) is:

	Phase 1 — GRPO on completions (current): The model generates a single diff/completion, the reward function grades it. This is viable, shippable, and correct. The TRL host is appropriate. No migration needed for this phase.

	Phase 2 — Multi-turn FeatureDeletionEnv (agentic GRPO): The model takes tool-call steps (bash, file edits, test runs). Reward at trajectory end. Migration options in order:

	1. TRL `environment_factory` adapter (experimental, weeks of work): Wrap `FeatureDeletionEnv` as a TRL environment. Methods become tools. Blocking GPU during sandbox execution — OK for small scale (≤8 GPUs), not for high parallelism.

	2. TRL `rollout_func` (experimental, 1–2 weeks): Custom rollout that drives the env loop, serializes trajectories. Full control; TRL handles GRPO update.

	3. VeRL AsyncServer + `FeatureDeletionEnv` as SandboxFusionTool adapter (2–4 weeks): GPU not blocked during sandbox calls. Required for tree-of-work fan-out at scale. The repo's ADR-006/ADR-008 have this on the roadmap but it's not implemented.

	Phase 3 — Tree-of-Work (MCTS, multi-branch): This REQUIRES verl's async architecture. TRL cannot support N parallel branches per prompt with GPU-efficient execution. The `uni-agent` framework on top of verl should be evaluated first before building a custom AgentLoop integration.

	---

	Sources: TRL docs fetched 2026-06-10 (huggingface.co/docs/trl/en/grpo_trainer); vLLM co-locate blog (huggingface.co/blog/vllm-colocate); verl README (github.com/volcengine/verl/blob/main/README.md); SWE-MiniSandbox arXiv:2602.11210v5; AWS Builder Center EKS sandboxes article. Repo files: research/04, research/03, ADR-006, ADR-008, env.py, composer_trainer.py, recipes/prime_rl/composer_loss.py.