Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
2a16b30
|
Raw
History Blame Contribute Delete
26.1 kB
# Deep-Read: RL Infra & Frameworks — Critical Findings
**Cluster 8 of the dataset-pipeline review series**
**Reviewer:** automated critical pipeline, 2026-06-09
**Primary sources fetched:** TRL GRPOTrainer live docs (v1.5.1, huggingface.co/docs/trl/en/grpo_trainer), vLLM co-locate blog (June 2025, huggingface.co/blog/vllm-colocate), verl GitHub README (main branch, verl-project/verl — 21.9k stars), SWE-MiniSandbox paper (arXiv:2602.11210v5), secure EKS sandboxes article (AWS Builder Center). Repo files inspected: `research/04-verl-trl.md`, `research/03-monarch-torchforge-openenv.md`, `docs/adrs/ADR-006-rl-frameworks.md`, `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md`, `composer_replication/datagen/env.py`, `composer_replication/trainer/composer_trainer.py`, `composer_replication/recipes/prime_rl/composer_loss.py`.
---
## 1. TRL GRPOTrainer: Does It Actually Support Multi-Turn Agentic Rollouts?
### 1.1 What the live TRL docs say (as of TRL v1.5.1, fetched 2026-06-10)
The TRL GRPOTrainer docs (primary source, not the repo's research/04 note written in May 2025) describe **two distinct agentic mechanisms**:
**Mechanism A — `tools` parameter:** Pass a list of Python callables. GRPOTrainer runs a tool-call loop. Quote from docs: "GRPO supports agent training through the `tools` argument in `GRPOTrainer`." The loop has a hard cap `max_tool_calling_iterations` (default: unlimited, stops on no-tool-call response or `max_model_length`). Each tool call is synchronous — the training GPU waits.
**Mechanism B — `environment_factory` parameter:** Pass a callable that creates environment instances. "GRPOTrainer creates one environment instance per rollout and exposes the environment's public methods as tools." Requires `transformers>=5.2.0`. Marked **experimental**: "This feature is experimental and may change or be removed at any time without prior notice." The `reset()` method can return a string that gets appended to the last user message. `rollout_func` is similarly experimental.
**Mechanism C — `rollout_func` (custom rollout):** A callable that receives prompts and the trainer, returns `{"prompt_ids", "completion_ids", "logprobs"}`. Also experimental. This is the escape hatch for fully custom multi-turn generation.
**Key constraint confirmed from primary source:** TRL has **no async GPU-decoupled agent loop**. The docs explicitly state the training-inference mismatch and handle it via Truncated Importance Sampling (`vllm_importance_sampling_correction=True` by default), not by async GPU handoff. When a tool call is executing, the GPU waits. This is not a flaw in the repo's research note — `research/04-verl-trl.md` correctly identified this gap — but the docs now show TRL has partially closed the *multi-turn* gap via `tools` / `environment_factory`.
### 1.2 What `research/04-verl-trl.md` claims vs. primary source
| Claim in research/04 | Primary source (TRL docs, 2026) | Verdict |
|---|---|---|
| "TRL does NOT have an async GPU-decoupled agent loop" | Confirmed | CORRECT |
| "OpenEnv integration (October 2025)" | Confirmed; `environment_factory` + TRL's OpenEnv guide | CORRECT |
| "VLM support" | Confirmed — tools can return `list` of content blocks incl. images | CORRECT |
| "GRPOTrainer supports multi-step agentic rollouts" (04:173) | Confirmed via `tools` + `environment_factory` | CORRECT |
| TRL v1.0 released March 2026 | Confirmed; docs show versions v1.0.0 through v1.5.1 | CORRECT |
| Default `loss_type` is `"dapo"` | **CONFIRMED from source**: `loss_type: str = 'dapo'` in GRPOConfig | CORRECT |
| Default `scale_rewards` is... | **CONFIRMED: default is `"group"`** (not `False`/`"none"`) | CORRECT |
### 1.3 Critical discovery: TRL's default is DAPO, not GRPO
The TRL GRPOConfig shows `loss_type = 'dapo'` as the default. ADR-008 claims to configure `loss_type="dr_grpo"` to match Composer 2.5. The source confirms `"dr_grpo"` is a valid value (uses `max_completion_length` as the constant denominator). **This is consistent with ADR-008's decision.**
However: ADR-008 states "KL estimator (k1 vs k3) is not configured or asserted" as an OPEN item. The TRL docs show `beta=0.0` as default (KL term disabled). If `beta=0`, the k1/k3 distinction is moot — there is no KL term in the loss at all. The ADR-008 open item is therefore **low-priority when beta=0** (the current default). If the repo ever enables beta>0 to use k1 KL-in-loss (distinct from the k1-in-reward path the trainer already implements), the open item becomes relevant.
### 1.4 `scale_rewards` drift assertion
ADR-008 checks `str(cfg.scale_rewards).lower() in ("none","false")`. Primary source confirms `scale_rewards` accepts: `True`/`"group"` (default), `"batch"`, `False`/`"none"`. The check is correct.
---
## 2. The Colocate-vLLM Blog: What It Actually Says
Primary source: "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" (June 3, 2025, huggingface.co/blog/vllm-colocate).
**What the blog confirms:**
- Co-locate mode (`vllm_mode="colocate"`) runs training and vLLM in the same process, sharing GPUs. No REST API overhead.
- Speedups measured: 1.43× for 1.5B model, 1.35× for 7B model, 1.73× for 7B with TP>1. For 72B model (Qwen2.5-Math-72B): **co-locate is ~1.26× faster than plain TRL with 4 fewer GPUs**.
- vLLM sleep mode (level 2) is **not yet merged into TRL upstream** (as of the blog date) due to a segfault on exit (vLLM issue #16993). The docs now show `vllm_enable_sleep_mode` as a parameter, implying it was eventually merged, but the blog notes a real production bug.
- FSDP + co-locate + LoRA has known issues: "GRPO + FSDP + LoRA + VLLM colocate" doesn't work; DeepSpeed ZeRO-3 is the recommended path. FSDP1 has a NaN bug with co-located vLLM (issue #14443).
**What `research/04` says:** Does not cite the co-locate blog specifically but correctly describes the "NO GPU left behind" feature (June 2025 update, row in §2.7 table). No material misread.
**What the repo's SageMaker smoke recipe uses:** The SageMaker GRPO smoke (from git history context) uses `use_vllm=False` for initial tests, which is fine — the co-locate mode requires enough GPU memory for both model and vLLM, and a single g5.2xlarge (1× A10G, 24 GB) may not accommodate it.
---
## 3. VeRL: Agentic Mode and the AsyncServer
Primary source: verl GitHub README (main branch, fetched 2026-06-10). Key finding from README:
> "[2026/05] uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl."
> "[2026/01] transfer_queue, fully_async_policy, one_step_off_policy ... are kept under verl/experimental since they are planned to be merged into the main library."
> Feature list includes: "Multi-turn with tool calling", "Sandbox Fusion Integration", "SGLang, verl, OpenBMB and Tsinghua University: Pioneering End-to-End Multi-Turn RLHF"
The verl README confirms multi-turn tool-calling exists and `uni-agent` was released May 2026 as a unified agent framework. The `AsyncServer`/`AgentLoop` architecture described in `research/04` is consistent with what the README describes, though the README doesn't use those exact terms. The experimental async features (`fully_async_policy`, `transfer_queue`) are available but not yet in main.
**What `research/04` claims about VeRL agentic support:**
- "First-class agentic RL support" with `AsyncServer`/`AgentLoop` — the README confirms the direction but notes these are under `verl/experimental`. The research/04 characterization of "first-class" **slightly overclaims** what is in the stable API; the full async path is experimental.
- `SandboxFusionTool` — mentioned in the README as a documented integration ("Sandbox Fusion Integration" link). Consistent.
- "Multi-turn tokenisation: noted as complex; naive concatenation of per-turn token IDs can introduce distribution drift" — confirmed by the README community blog link "When Reasoning Models Break Tokenization: The Hidden Complexity of Multiturn Training."
---
## 4. PRIME-RL in the Repo
### 4.1 What ADR-006 claims
ADR-006 claims PRIME-RL ships a `CustomLossConfig` with `import_path` for dropping in a Python loss function, exposing `LossInputs` with `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask`. It was used to train INTELLECT-1 (10B, 30 nodes) and INTELLECT-2 (32B QwQ).
### 4.2 What `composer_replication/recipes/prime_rl/composer_loss.py` confirms
The code reads (lines 21-28):
```python
@dataclass
class LossInputs:
trainer_logprobs: Float[Tensor, ' seq']
inference_logprobs: Float[Tensor, ' seq']
teacher_logprobs: Float[Tensor, ' seq'] | None
advantages: Float[Tensor, ' seq']
loss_mask: Bool[Tensor, ' seq']
```
This is marked as "verified against PrimeIntellect-ai/prime-rl `src/prime_rl/trainer/rl/loss.py` lines 13-22." The code correctly raises `NotImplementedError` when `alpha_sdpo > 0` (logits not available, only log-probs). **This is a real constraint, not a placeholder.**
### 4.3 The DPPO upstream loss — a subtle accuracy point
The `composer_loss.py` reproduces the upstream DPPO loss verbatim (lines 40-60 of the file). It uses:
```python
probs_diff = exp(trainer_logprobs) - exp(inference_logprobs) # probability-space diff
```
This is notably **not** a log-ratio but a probability-space difference gating the drop/keep mask. This is PRIME-RL's design, not a repo mistake. But it means the DPPO channel is more like PRIME-RL's INTELLECT-style training than standard GRPO — the repo's framing in ADR-006 as "channels 1+3" needs to be understood in that context: channel 1 is DPPO-shaped (probability-gated policy update), not raw GRPO.
---
## 5. The Key Question: Is TRL's Single-Submit `reward_fn` a Dead End for Multi-Turn?
### 5.1 What `env.py::reward_fn` actually does
```python
def reward_fn(self, prompts, completions, *, task_id, **kwargs) -> list[float]:
...
for comp, tid in zip(completions, task_id):
task = self.registry[tid]
self.reset(task)
if self._replay is not None:
res = self._replay(self, comp)
else:
res = self.step({"type": "submit"}) # <-- single submit
rewards.append(res.reward)
```
**The fallback path** (no `_replay` function) treats the entire completion as a single submit — this is an outcome reward on a single-turn completion. This is what the unit tests exercise and what a standard TRL `reward_funcs` call would do.
**The intended multi-turn path** is `_replay`: a callable that takes `(env, completion)` and drives multi-turn turns by parsing the agent's encoded tool-call history from the `completion` string. This is a **custom deserializer** that replays the agent turns and grades at the end.
### 5.2 Is this a dead end for multi-turn RL?
**For current TRL integration: partly yes, mostly no.**
The single-submit fallback IS a dead end for genuine multi-turn RL credit assignment — it cannot grade intermediate tool-call steps. But there are two viable paths:
**Path A — `_replay` + `rollout_func` (TRL experimental):** The `rollout_func` parameter in GRPOTrainer can drive multi-turn generation externally (running the env's `step()` loop), serialize the full trajectory into `completion` tokens, then call `reward_fn` which uses `_replay` to deserialize and grade. This makes the `reward_fn` the grader, not the rollout driver. This works in TRL **today** but requires the experimental `rollout_func` interface.
**Path B — `environment_factory` (TRL experimental):** Pass `FeatureDeletionEnv` (or an adapter) as `environment_factory`. GRPOTrainer calls `reset()` and then uses the env's public methods as tools. The `reward_fn` is replaced by a reward function that reads `environments[i].reward` after generation. This is the more principled path for true multi-turn RL and is what TRL's `environment_factory` was designed for. It requires `transformers>=5.2.0` and is still experimental.
**Path C — VeRL's AgentLoop:** For the tree-of-work (multiple parallel rollout branches per prompt, credit assigned at trajectory end), VeRL's `AsyncServer`+`AgentLoop` is architecturally the right fit. Each branch is a coroutine; GPU is not blocked during `sandbox.exec()` calls. The repo acknowledges this in `research/04` §5.3 recommendation.
### 5.3 The honest migration path
The current TRL single-submit `reward_fn` is:
- **Correct** for the Phase 1 use case: offline dataset generation where the model produces a diff and we grade it. This is the "GRPO on completions" paradigm.
- **Insufficient** for genuine multi-turn RL over FeatureDeletionEnv episodes, especially the tree-of-work vision where the model takes tool-call steps, explores branches, and gets rewards at trajectory end.
**Migration path (in order of complexity):**
1. **Immediate (low cost):** Use TRL's `environment_factory` with `FeatureDeletionEnv` as the adapter. The env's `step()` becomes a tool. Grade via `reward_funcs` reading `env.reward`. Marked experimental but low integration cost. This supports genuine multi-turn GRPO with the current TRL host.
2. **Medium term (single-GPU scale):** Implement `rollout_func` that drives the env loop directly, returns serialized trajectories with log-probs. Full control over multi-turn; TRL handles the GRPO update.
3. **Scale-out (multi-GPU, async, tree-of-work):** Migrate to VeRL's `AgentLoop`. The `FeatureDeletionEnv` maps onto verl's `SandboxFusionTool` protocol. The tree-of-work branching requires N parallel rollout workers per prompt, which VeRL's asyncio architecture supports and TRL's synchronous loop does not.
**The tree-of-work IS multi-turn.** The vision in `framework/composer-replication-framework.md` of a "multi-model Monte-Carlo tree-of-work" requires:
- Many concurrent rollout branches per prompt
- Reward propagated back through the tree (not just at leaf)
- Asynchronous sandbox execution without blocking GPU
None of these are provided by TRL's current `GRPOTrainer` (even with `tools`/`environment_factory`). VeRL's experimental `fully_async_policy` + `AgentLoop` is the right substrate. The repo's `research/04` correctly identifies this but the ADR layer has not formally acknowledged this migration requirement.
---
## 6. Sandboxing for Code Execution at Scale
### 6.1 What the secure-EKS article says (primary source)
> "gVisor added negligible launch latency... handles isolation for most agent workloads."
> "Cold start was around 5 seconds per sandbox" for Kata+Firecracker.
> "EKS Managed Node Groups do not work yet: they override the CPU Options stanza needed for nested virtualization, forcing the use of self-managed node groups."
> "Managed sandbox platforms skip Kubernetes entirely. E2B and Vercel Sandbox provision Firecracker microVMs directly... sandbox creation in under a second, versus ~5 seconds Kata with Firecracker on EKS."
### 6.2 What SWE-MiniSandbox (arXiv:2602.11210v5) adds
Abstract: "SWE-MiniSandbox lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline." Uses "kernel-level mechanisms" (not containers) with "lightweight environment pre-caching." Empirical performance comparable to container-based pipelines on SWE-bench-style tasks.
This is directly relevant to the repo's `FeatureDeletionEnv`/`Sandbox` design. The current `sandbox.py` uses `LocalSubprocessSandbox` (plain subprocess, Docker-gated for real tests) — essentially no isolation for the subprocess case. For production RL training at scale with multiple rollout workers, the SWE-MiniSandbox approach (kernel-level isolation without per-task container builds) could reduce env setup from minutes to seconds.
### 6.3 What the repo's sandbox.py actually provides
`sandbox.py` defines:
- `LocalSubprocessSandbox` — runs commands via `subprocess.run` in the repo tree. No container, no kernel isolation. The security model relies on the denylist + cache scrub (commented as "INSUFFICIENT as a primary control").
- `DockerSandbox` (in `docker_sandbox.py`) — real isolation, referenced in tests.
**The gap:** For RL training at scale (many parallel rollout workers), neither `LocalSubprocessSandbox` nor per-task Docker containers are adequate:
- Subprocess: no isolation, reward hacking possible via Python import tricks (acknowledged in code comments).
- Docker: isolation is good, but per-task container boot is slow (typical: 2–5s without pre-warming), and at 8 rollouts × N prompts × G generations = hundreds of container launches per training step.
The repo's `research/review-sandbox.json` presumably tracks this; the production path requires pre-warmed sandbox pools (gVisor RuntimeClass for speed, Kata/Firecracker for stronger isolation).
---
## 7. Misreads, Overclaims, and Gaps in Repo Research
### 7.1 OVERCLAIM: VeRL "first-class" agentic RL
`research/04` §1.5: "VeRL has first-class agentic RL support" and describes `AsyncServer`/`AgentLoop` as stable. The verl README (main branch, 2026-06-10) shows:
- `transfer_queue`, `fully_async_policy`, `one_step_off_policy` are kept under `verl/experimental` — "planned to be merged into the main library."
- `uni-agent` (May 2026) provides the higher-level agent framework, but it's a separate release on top of verl, not part of the stable library.
The agentic async path **exists** but is experimental in the stable API. The characterization as "first-class" is slightly ahead of the actual maturity. The repo should note this when recommending VeRL for the tree-of-work.
### 7.2 MISS: TRL now has `environment_factory` + `tools` for multi-turn
`research/04` (written May 2025) describes TRL as having only synchronous reward functions. The live TRL docs (v1.5.1, 2026) show `environment_factory` and `tools` for genuine multi-turn generation loops. These are experimental but available. The research/04 comparison matrix says "TRL: agentic tool-calling RL ⚠️ (blocking)" — this remains accurate (it IS blocking), but misses that TRL now provides the `environment_factory` interface which would allow `FeatureDeletionEnv` to drive multi-turn episodes inside GRPOTrainer without a custom `rollout_func`. This is not mentioned in any ADR.
### 7.3 MISS: TRL default `loss_type="dapo"`, NOT `"grpo"` or `"dr_grpo"`
ADR-008 correctly targets `loss_type="dr_grpo"`, but research/04 (the background research) does not explicitly state that TRL 1.x defaults to DAPO loss. The live docs confirm this. The drift assertion in `make_dr_grpo_config` is the right mitigation.
### 7.4 GAP: `scale_rewards` default is `"group"`, not `True`
The GRPOConfig shows `scale_rewards: str = 'group'` (a string, not a bool). ADR-008's assertion `str(cfg.scale_rewards).lower() in ("none","false")` correctly handles both the old bool (`False`) and new string (`"none"`) forms. But the docs show `True` and `"group"` are equivalent (both mean group-level std scaling). The assertion is correct; this is a documentation note, not a bug.
### 7.5 GAP: KL estimator — TRL default is k3, not k1
The TRL docs show the KL approximator formula:
```
D_KL[π_θ || π_ref] = π_ref/π_θ - log(π_ref/π_θ) - 1
```
This is the **k3 estimator** (Schulman et al., 2020). ADR-008 claims Composer 2.5 uses k1 (`-log r = log π_θ/π_ref`). The ADR notes this as an OPEN item. Since `beta=0.0` by default in TRL, the KL term is disabled and this doesn't affect training unless `beta>0`. However, `composer_trainer.py` implements the k1-in-reward path via `kl_in_reward.py` — this is a separate mechanism from TRL's in-loss KL. Verify: the k1-in-reward path computes `log(π_θ/π_ref)` at reward time and folds it into advantages, while TRL's in-loss k3 term (when beta>0) would add a different term. If both are enabled simultaneously, they would double-count KL. The safe configuration is: **k1-in-reward active, beta=0 (TRL in-loss KL disabled)**. The code appears to do this but there's no explicit assertion that `beta=0` when `kl_in_reward=True`.
### 7.6 MISS: `num_iterations=1` narrowed claim
ADR-008 acknowledges that `num_iterations=1` controls GRPO inner-loop reuse, not dataset-level epochs. Primary source confirms: "Number of iterations per batch (denoted as μ in the algorithm)." The ADR's narrowed claim is correct.
### 7.7 MISS: TRL default optimizer is `adamw_torch_fused`, not `adam`
ADR-008 has an OPEN item: "Adam is claimed but `optim` is not set." The GRPOConfig docs show:
```
optim: transformers.training_args.OptimizerNames | str = 'adamw_torch_fused'
```
Default is `adamw_torch_fused` (AdamW with fused CUDA kernel), not plain `adam`. If Composer 2.5 uses Adam (without weight decay), the ADR's open item remains relevant: set `weight_decay=0.0` and `optim="adam"` explicitly to match. The default AdamW has weight decay (though `weight_decay=0.0` is already the GRPOConfig default, making it numerically equivalent to Adam in this specific case). However, `adamw_torch_fused` ≠ `adam` in terms of the optimizer implementation; to be precise, set `optim="adamw_8bit"` or `optim="paged_adamw_8bit"` (memory efficient) or just `optim="adam_torch"` if plain Adam is intended.
---
## 8. verl `uni-agent` — A New Development Not in Research/04
The verl README (May 2026): "uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." This is a post-cutoff development (research/04 was written May 2025) that the repo has not incorporated. `uni-agent` could be the production-ready path for multi-turn agentic RL with verl, potentially superseding the lower-level `AgentLoop`/`AsyncServer` integration that ADR-006 contemplates.
**Implication for the repo:** Before committing to a custom VeRL `AgentLoop` integration, evaluate whether `uni-agent` already provides the FeatureDeletionEnv integration pattern out of the box. This could significantly reduce the engineering surface.
---
## 9. Sandboxing Recommendation Gaps
### 9.1 The `SWE-MiniSandbox` approach is not referenced anywhere in the repo
arXiv:2602.11210 (Feb 2026) directly addresses the production gap: container-free RL training with 5% disk usage and 25% env setup time vs containers. The paper's "kernel-level mechanisms" (likely Linux namespaces + cgroups without a full container runtime) with pre-caching is directly applicable to `FeatureDeletionEnv` at scale. The repo's sandbox design doesn't reference this work.
### 9.2 The repo's `docker_sandbox.py` is production-blocking for RL at scale
Per-task Docker container boots at GRPO scale (G=8 completions/prompt, B=4 per-device batch, many workers) means O(B*G) = O(32) container launches per training step. Without pre-warming or snapshot-based fast boot, this is the dominant latency. The gVisor RuntimeClass approach (negligible overhead, per the AWS article) or SWE-MiniSandbox's kernel-namespace approach are both faster alternatives.
---
## 10. Summary of Critical Findings
| Finding | Severity | Affected Files |
|---|---|---|
| VeRL async agent loop is EXPERIMENTAL, not "first-class" stable | MEDIUM — overclaim | research/04 §1.5, ADR-006 |
| TRL `environment_factory` (multi-turn) not in any ADR | MEDIUM — miss | ADR-008, env.py |
| k1-in-reward + beta=0 assertion missing (double-KL risk if beta>0 ever set) | MEDIUM — correctness gap | composer_trainer.py, ADR-008 |
| `optim` default is `adamw_torch_fused`, not `adam` | LOW — fidelity gap | ADR-008 OPEN item |
| TRL `loss_type` defaults to `"dapo"` (not GRPO), correctly handled | INFO — confirmed correct | ADR-008, make_dr_grpo_config |
| `env.py::reward_fn` single-submit path is dead end for tree-of-work | HIGH — architecture gap | env.py, no ADR exists |
| `uni-agent` (verl, May 2026) not evaluated — may supersede custom AgentLoop | MEDIUM — miss | ADR-006 |
| SWE-MiniSandbox approach not referenced (5% disk, 25% setup time) | MEDIUM — miss | sandbox.py, docker_sandbox.py |
| EKS Managed Node Groups incompatible with Kata+Firecracker (nested virt) | INFO — production gotcha | no ADR |
---
## 11. Migration Path for Multi-Turn Agentic RL (Honest Assessment)
The current repo architecture (TRL `reward_fn` with single-submit fallback) is:
**Phase 1 — GRPO on completions (current):** The model generates a single diff/completion, the reward function grades it. This is viable, shippable, and correct. The TRL host is appropriate. No migration needed for this phase.
**Phase 2 — Multi-turn FeatureDeletionEnv (agentic GRPO):** The model takes tool-call steps (bash, file edits, test runs). Reward at trajectory end. Migration options in order:
1. **TRL `environment_factory` adapter (experimental, weeks of work):** Wrap `FeatureDeletionEnv` as a TRL environment. Methods become tools. Blocking GPU during sandbox execution — OK for small scale (≤8 GPUs), not for high parallelism.
2. **TRL `rollout_func` (experimental, 1–2 weeks):** Custom rollout that drives the env loop, serializes trajectories. Full control; TRL handles GRPO update.
3. **VeRL AsyncServer + `FeatureDeletionEnv` as SandboxFusionTool adapter (2–4 weeks):** GPU not blocked during sandbox calls. Required for tree-of-work fan-out at scale. The repo's ADR-006/ADR-008 have this on the roadmap but it's not implemented.
**Phase 3 — Tree-of-Work (MCTS, multi-branch):** This REQUIRES verl's async architecture. TRL cannot support N parallel branches per prompt with GPU-efficient execution. The `uni-agent` framework on top of verl should be evaluated first before building a custom AgentLoop integration.
---
*Sources: TRL docs fetched 2026-06-10 (huggingface.co/docs/trl/en/grpo_trainer); vLLM co-locate blog (huggingface.co/blog/vllm-colocate); verl README (github.com/volcengine/verl/blob/main/README.md); SWE-MiniSandbox arXiv:2602.11210v5; AWS Builder Center EKS sandboxes article. Repo files: research/04, research/03, ADR-006, ADR-008, env.py, composer_trainer.py, recipes/prime_rl/composer_loss.py.*