Baladithya Balamurugan

Wave 21: deep-read critical review — 8 source clusters re-read, findings verified

Deep-Read: RL Infra & Frameworks — Critical Findings

Cluster 8 of the dataset-pipeline review series

Reviewer: automated critical pipeline, 2026-06-09

Primary sources fetched: TRL GRPOTrainer live docs (v1.5.1, huggingface.co/docs/trl/en/grpo_trainer), vLLM co-locate blog (June 2025, huggingface.co/blog/vllm-colocate), verl GitHub README (main branch, verl-project/verl, 21.9k stars), SWE-MiniSandbox paper (arXiv:2602.11210v5), secure EKS sandboxes article (AWS Builder Center).

Repo files inspected: research/04-verl-trl.md, research/03-monarch-torchforge-openenv.md, docs/adrs/ADR-006-rl-frameworks.md, docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md, composer_replication/datagen/env.py, composer_replication/trainer/composer_trainer.py, composer_replication/recipes/prime_rl/composer_loss.py.

1. TRL GRPOTrainer: Does It Actually Support Multi-Turn Agentic Rollouts?

1.1 What the live TRL docs say (as of TRL v1.5.1, fetched 2026-06-10)

The TRL GRPOTrainer docs (primary source, not the repo's research/04 note written in May 2025) describe three distinct agentic mechanisms:

Mechanism A (tools parameter): Pass a list of Python callables. GRPOTrainer runs a tool-call loop. Quote from docs: "GRPO supports agent training through the tools argument in GRPOTrainer." The loop can be capped with max_tool_calling_iterations; by default it is uncapped and stops when the model responds without a tool call or max_model_length is reached. Each tool call is synchronous, so the training GPU waits.

Mechanism B (environment_factory parameter): Pass a callable that creates environment instances. "GRPOTrainer creates one environment instance per rollout and exposes the environment's public methods as tools." Requires transformers>=5.2.0. Marked experimental: "This feature is experimental and may change or be removed at any time without prior notice." The reset() method can return a string that gets appended to the last user message.

Mechanism C (rollout_func, custom rollout): A callable that receives prompts and the trainer, returns {"prompt_ids", "completion_ids", "logprobs"}. Also experimental. This is the escape hatch for fully custom multi-turn generation.

Key constraint confirmed from primary source: TRL has no async GPU-decoupled agent loop. The docs explicitly state the training-inference mismatch and handle it via Truncated Importance Sampling (vllm_importance_sampling_correction=True by default), not by async GPU handoff. When a tool call is executing, the GPU waits. This is not a flaw in the repo's research note — research/04-verl-trl.md correctly identified this gap — but the docs now show TRL has partially closed the multi-turn gap via tools / environment_factory.
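The Truncated Importance Sampling correction mentioned above can be illustrated with a few lines of plain Python. This is a minimal sketch of the general technique (a per-token ratio between trainer and inference probabilities, clipped from above), not TRL's implementation; the function name and the `cap` value are assumptions made for illustration.

```python
import math

def truncated_is_weights(train_logprobs, inference_logprobs, cap=2.0):
    """Per-token weights min(exp(lp_train - lp_inference), cap).

    `cap` is an illustrative value, not a documented TRL default.
    """
    return [
        min(math.exp(lp_train - lp_inf), cap)
        for lp_train, lp_inf in zip(train_logprobs, inference_logprobs)
    ]

# Three tokens: identical probabilities, a large mismatch, a small mismatch.
train = [math.log(0.30), math.log(0.10), math.log(0.50)]
infer = [math.log(0.30), math.log(0.02), math.log(0.60)]
weights = truncated_is_weights(train, infer, cap=2.0)
```

The second token has a raw ratio of 5 but is clipped to the cap, which is what keeps a single badly mismatched token from dominating the update.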

1.2 What `research/04-verl-trl.md` claims vs. primary source

| Claim in research/04 | Primary source (TRL docs, 2026) | Verdict |
|---|---|---|
| "TRL does NOT have an async GPU-decoupled agent loop" | Confirmed | CORRECT |
| "OpenEnv integration (October 2025)" | Confirmed; `environment_factory` + TRL's OpenEnv guide | CORRECT |
| "VLM support" | Confirmed: tools can return `list` of content blocks incl. images | CORRECT |
| "GRPOTrainer supports multi-step agentic rollouts" (04:173) | Confirmed via `tools` + `environment_factory` | CORRECT |
| TRL v1.0 released March 2026 | Confirmed; docs show versions v1.0.0 through v1.5.1 | CORRECT |
| Default `loss_type` is `"dapo"` | CONFIRMED from source: `loss_type: str = 'dapo'` in GRPOConfig | CORRECT |
| Default `scale_rewards` is... | CONFIRMED: default is `"group"` (not `False`/`"none"`) | CORRECT |

1.3 Critical discovery: TRL's default is DAPO, not GRPO

The TRL GRPOConfig shows loss_type = 'dapo' as the default. ADR-008 claims to configure loss_type="dr_grpo" to match Composer 2.5. The source confirms "dr_grpo" is a valid value (uses max_completion_length as the constant denominator). This is consistent with ADR-008's decision.

However: ADR-008 states "KL estimator (k1 vs k3) is not configured or asserted" as an OPEN item. The TRL docs show beta=0.0 as default (KL term disabled). If beta=0, the k1/k3 distinction is moot — there is no KL term in the loss at all. The ADR-008 open item is therefore low-priority when beta=0 (the current default). If the repo ever enables beta>0 to use k1 KL-in-loss (distinct from the k1-in-reward path the trainer already implements), the open item becomes relevant.
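To make the loss_type distinction in this section concrete, the sketch below contrasts the two normalisations on a toy batch. It assumes, as stated above, that "dr_grpo" divides by the constant max_completion_length per sequence, and it further assumes that the "dapo" setting normalises by the number of active tokens in the batch; the function names are hypothetical and nothing here calls TRL.

```python
def masked_sum(per_token_loss, mask):
    """Sum the per-token losses over positions where mask == 1."""
    return sum(
        loss * m
        for row_loss, row_mask in zip(per_token_loss, mask)
        for loss, m in zip(row_loss, row_mask)
    )

def dapo_style_loss(per_token_loss, mask):
    """Assumed token-level normalisation: divide by the count of active tokens."""
    active_tokens = sum(sum(row) for row in mask)
    return masked_sum(per_token_loss, mask) / active_tokens

def dr_grpo_style_loss(per_token_loss, mask, max_completion_length):
    """Constant denominator: number of sequences times max_completion_length."""
    return masked_sum(per_token_loss, mask) / (len(per_token_loss) * max_completion_length)

# Two completions of length 2 and 4, padded to max_completion_length = 4.
per_token_loss = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = [[1, 1, 0, 0], [1, 1, 1, 1]]

dapo_value = dapo_style_loss(per_token_loss, mask)           # 6 / 6
dr_grpo_value = dr_grpo_style_loss(per_token_loss, mask, 4)  # 6 / 8
```

With a constant denominator the short completion contributes less total loss than the long one, independent of how many tokens the rest of the batch happens to contain.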

1.4 `scale_rewards` drift assertion

ADR-008 checks str(cfg.scale_rewards).lower() in ("none","false"). Primary source confirms scale_rewards accepts: True/"group" (default), "batch", False/"none". The check is correct.
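The check quoted above can be exercised against every accepted value listed in this section. This is a minimal sketch; the helper name `reward_scaling_disabled` is hypothetical, and only the expression inside it comes from ADR-008.

```python
def reward_scaling_disabled(scale_rewards) -> bool:
    """Mirror of the ADR-008 expression: True only for the 'off' spellings."""
    return str(scale_rewards).lower() in ("none", "false")

# Values the TRL docs are reported to accept, paired with the expected result.
cases = [
    (True, False),      # legacy bool, same meaning as "group"
    ("group", False),   # default
    ("batch", False),
    (False, True),      # legacy bool "off"
    ("none", True),     # string "off"
]
results = [(value, reward_scaling_disabled(value)) for value, _ in cases]
```

One side effect worth knowing: a Python `None` also passes, because `str(None).lower()` is `"none"`.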

2. The Colocate-vLLM Blog: What It Actually Says

Primary source: "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" (June 3, 2025, huggingface.co/blog/vllm-colocate).

What the blog confirms:

Co-locate mode (vllm_mode="colocate") runs training and vLLM in the same process, sharing GPUs. No REST API overhead.
Speedups measured: 1.43× for 1.5B model, 1.35× for 7B model, 1.73× for 7B with TP>1. For 72B model (Qwen2.5-Math-72B): co-locate is ~1.26× faster than plain TRL with 4 fewer GPUs.
vLLM sleep mode (level 2) had not been merged into TRL upstream as of the blog date because of a segfault on exit (vLLM issue #16993). The docs now show vllm_enable_sleep_mode as a parameter, implying it was merged later, but the blog documents a real production bug.
FSDP + co-locate + LoRA has known issues: "GRPO + FSDP + LoRA + VLLM colocate" doesn't work; DeepSpeed ZeRO-3 is the recommended path. FSDP1 has a NaN bug with co-located vLLM (issue #14443).

What research/04 says: Does not cite the co-locate blog specifically but correctly describes the "NO GPU left behind" feature (June 2025 update, row in §2.7 table). No material misread.

What the repo's SageMaker smoke recipe uses: The SageMaker GRPO smoke (from git history context) uses use_vllm=False for initial tests, which is fine — the co-locate mode requires enough GPU memory for both model and vLLM, and a single g5.2xlarge (1× A10G, 24 GB) may not accommodate it.

3. VeRL: Agentic Mode and the AsyncServer

Primary source: verl GitHub README (main branch, fetched 2026-06-10). Key finding from README:

"[2026/05] uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." "[2026/01] transfer_queue, fully_async_policy, one_step_off_policy ... are kept under verl/experimental since they are planned to be merged into the main library." Feature list includes: "Multi-turn with tool calling", "Sandbox Fusion Integration", "SGLang, verl, OpenBMB and Tsinghua University: Pioneering End-to-End Multi-Turn RLHF"

The verl README confirms multi-turn tool-calling exists and uni-agent was released May 2026 as a unified agent framework. The AsyncServer/AgentLoop architecture described in research/04 is consistent with what the README describes, though the README doesn't use those exact terms. The experimental async features (fully_async_policy, transfer_queue) are available but not yet in main.

What research/04 claims about VeRL agentic support:

"First-class agentic RL support" with AsyncServer/AgentLoop — the README confirms the direction but notes these are under verl/experimental. The research/04 characterization of "first-class" slightly overclaims what is in the stable API; the full async path is experimental.
SandboxFusionTool — mentioned in the README as a documented integration ("Sandbox Fusion Integration" link). Consistent.
"Multi-turn tokenisation: noted as complex; naive concatenation of per-turn token IDs can introduce distribution drift" — confirmed by the README community blog link "When Reasoning Models Break Tokenization: The Hidden Complexity of Multiturn Training."

4. PRIME-RL in the Repo

4.1 What ADR-006 claims

ADR-006 claims PRIME-RL ships a CustomLossConfig with import_path for dropping in a Python loss function, exposing LossInputs with trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask. It was used to train INTELLECT-1 (10B, 30 nodes) and INTELLECT-2 (32B QwQ).

4.2 What `composer_replication/recipes/prime_rl/composer_loss.py` confirms

The code reads (lines 21-28):

@dataclass
class LossInputs:
    trainer_logprobs:   Float[Tensor, ' seq']
    inference_logprobs: Float[Tensor, ' seq']
    teacher_logprobs:   Float[Tensor, ' seq'] | None
    advantages:         Float[Tensor, ' seq']
    loss_mask:          Bool[Tensor, ' seq']

This is marked as "verified against PrimeIntellect-ai/prime-rl src/prime_rl/trainer/rl/loss.py lines 13-22." The code correctly raises NotImplementedError when alpha_sdpo > 0 (logits not available, only log-probs). This is a real constraint, not a placeholder.

4.3 The DPPO upstream loss — a subtle accuracy point

The composer_loss.py reproduces the upstream DPPO loss verbatim (lines 40-60 of the file). It uses:

probs_diff = exp(trainer_logprobs) - exp(inference_logprobs)  # probability-space diff

This is notably not a log-ratio but a probability-space difference gating the drop/keep mask. This is PRIME-RL's design, not a repo mistake. But it means the DPPO channel is more like PRIME-RL's INTELLECT-style training than standard GRPO — the repo's framing in ADR-006 as "channels 1+3" needs to be understood in that context: channel 1 is DPPO-shaped (probability-gated policy update), not raw GRPO.
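The practical difference between a probability-space gate and a log-ratio gate shows up on low-probability tokens. The sketch below is an illustration of that difference only: both thresholds and the symmetric keep rule are assumptions, and it does not reproduce the upstream PRIME-RL masking rule.

```python
import math

def prob_space_keep_mask(trainer_logprobs, inference_logprobs, max_abs_diff=0.1):
    """Keep a token when |exp(lp_trainer) - exp(lp_inference)| is small (assumed rule)."""
    keep = []
    for lp_t, lp_i in zip(trainer_logprobs, inference_logprobs):
        probs_diff = math.exp(lp_t) - math.exp(lp_i)
        keep.append(abs(probs_diff) <= max_abs_diff)
    return keep

def log_ratio_keep_mask(trainer_logprobs, inference_logprobs, max_abs_log_ratio=0.2):
    """Keep a token when |lp_trainer - lp_inference| is small (assumed rule)."""
    return [abs(lp_t - lp_i) <= max_abs_log_ratio
            for lp_t, lp_i in zip(trainer_logprobs, inference_logprobs)]

# (trainer probability, inference probability) per token
pairs = [(0.002, 0.001), (0.95, 0.80), (0.5, 0.5)]
trainer_lp = [math.log(p) for p, _ in pairs]
inference_lp = [math.log(q) for _, q in pairs]

prob_mask = prob_space_keep_mask(trainer_lp, inference_lp)
ratio_mask = log_ratio_keep_mask(trainer_lp, inference_lp)
```

The first token doubles its probability (large log-ratio) but moves only 0.001 in probability space, so the two gates disagree on it; the second token shows the opposite disagreement.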

5. The Key Question: Is TRL's Single-Submit `reward_fn` a Dead End for Multi-Turn?

5.1 What `env.py::reward_fn` actually does

def reward_fn(self, prompts, completions, *, task_id, **kwargs) -> list[float]:
    ...
    for comp, tid in zip(completions, task_id):
        task = self.registry[tid]
        self.reset(task)
        if self._replay is not None:
            res = self._replay(self, comp)
        else:
            res = self.step({"type": "submit"})   # <-- single submit
        rewards.append(res.reward)

The fallback path (no _replay function) treats the entire completion as a single submit — this is an outcome reward on a single-turn completion. This is what the unit tests exercise and what a standard TRL reward_funcs call would do.

The intended multi-turn path is _replay: a callable that takes (env, completion) and drives multi-turn turns by parsing the agent's encoded tool-call history from the completion string. This is a custom deserializer that replays the agent turns and grades at the end.
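A _replay callable of the shape described above might look like the following. This is a hedged sketch: the `(env, completion)` signature comes from the excerpt, but the JSON-lines encoding of the tool-call history, `ToyEnv`, and `StepResult` are all hypothetical stand-ins rather than the repo's FeatureDeletionEnv or its real serialization format.

```python
import json
from dataclasses import dataclass

@dataclass
class StepResult:
    reward: float
    done: bool

class ToyEnv:
    """Stand-in environment: rewards a submit only if at least one edit happened."""
    def __init__(self):
        self.edits = 0

    def step(self, action: dict) -> StepResult:
        if action["type"] == "edit":
            self.edits += 1
            return StepResult(reward=0.0, done=False)
        if action["type"] == "submit":
            return StepResult(reward=1.0 if self.edits > 0 else 0.0, done=True)
        raise ValueError(f"unknown action type: {action['type']}")

def replay_json_lines(env, completion: str) -> StepResult:
    """Replay one JSON action per line, stopping at the first terminal step."""
    for line in completion.splitlines():
        line = line.strip()
        if not line:
            continue
        result = env.step(json.loads(line))
        if result.done:
            return result
    # No explicit submit in the history: grade the state reached so far.
    return env.step({"type": "submit"})

graded = replay_json_lines(ToyEnv(), '{"type": "edit", "path": "a.py"}\n{"type": "submit"}')
empty = replay_json_lines(ToyEnv(), "")
```

Note that a replay of this kind still yields one terminal reward per completion; it reconstructs the trajectory but does not by itself assign credit to intermediate steps.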

5.2 Is this a dead end for multi-turn RL?

For current TRL integration: partly yes, mostly no.

The single-submit fallback IS a dead end for genuine multi-turn RL credit assignment: it cannot grade intermediate tool-call steps. But there are three viable paths:

Path A — _replay + rollout_func (TRL experimental): The rollout_func parameter in GRPOTrainer can drive multi-turn generation externally (running the env's step() loop), serialize the full trajectory into completion tokens, then call reward_fn which uses _replay to deserialize and grade. This makes the reward_fn the grader, not the rollout driver. This works in TRL today but requires the experimental rollout_func interface.

Path B — environment_factory (TRL experimental): Pass FeatureDeletionEnv (or an adapter) as environment_factory. GRPOTrainer calls reset() and then uses the env's public methods as tools. The reward_fn is replaced by a reward function that reads environments[i].reward after generation. This is the more principled path for true multi-turn RL and is what TRL's environment_factory was designed for. It requires transformers>=5.2.0 and is still experimental.

Path C — VeRL's AgentLoop: For the tree-of-work (multiple parallel rollout branches per prompt, credit assigned at trajectory end), VeRL's AsyncServer+AgentLoop is architecturally the right fit. Each branch is a coroutine; GPU is not blocked during sandbox.exec() calls. The repo acknowledges this in research/04 §5.3 recommendation.
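Path B above can be pictured as an environment class whose public methods act as tools, plus a reward function that reads each environment's final state. The sketch below simulates that shape without importing TRL. Everything in it is an assumption for illustration: the class and method names, the `reset(**kwargs)` signature, and the `environments` keyword argument are taken from this review's description, not verified against the trainer.

```python
class ToySandboxEnv:
    """Hypothetical per-rollout environment; public methods would be exposed as tools."""

    def __init__(self):
        self.reward = 0.0
        self._tests_passed = False

    def reset(self, **kwargs) -> str:
        """Start an episode; the returned text is meant to extend the user message."""
        self.reward = 0.0
        self._tests_passed = False
        return "Repository checked out; restore the deleted feature."

    def run_tests(self) -> str:
        """Tool: pretend to run the test suite."""
        self._tests_passed = True
        return "2 passed"

    def submit(self) -> str:
        """Tool: end the episode and record the outcome reward."""
        self.reward = 1.0 if self._tests_passed else 0.0
        return "submitted"

def environment_reward(environments, **kwargs) -> list[float]:
    """Reward function that reads the reward stored on each environment."""
    return [env.reward for env in environments]

# Simulate what a trainer would do for two rollouts of the same prompt.
envs = [ToySandboxEnv() for _ in range(2)]
for env in envs:
    env.reset()
envs[0].run_tests()
envs[0].submit()
envs[1].submit()  # this rollout never ran the tests
rewards = environment_reward(envs)
```

The design point is that grading no longer needs a serialized trajectory: the environment accumulates state as the tools are called, and the reward function only reads it.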

5.3 The honest migration path

The current TRL single-submit reward_fn is:

Correct for the Phase 1 use case: offline dataset generation where the model produces a diff and we grade it. This is the "GRPO on completions" paradigm.
Insufficient for genuine multi-turn RL over FeatureDeletionEnv episodes, especially the tree-of-work vision where the model takes tool-call steps, explores branches, and gets rewards at trajectory end.

Migration path (in order of complexity):

Immediate (low cost): Use TRL's environment_factory with FeatureDeletionEnv as the adapter. The env's step() becomes a tool. Grade via reward_funcs reading env.reward. Marked experimental but low integration cost. This supports genuine multi-turn GRPO with the current TRL host.
Medium term (single-GPU scale): Implement rollout_func that drives the env loop directly, returns serialized trajectories with log-probs. Full control over multi-turn; TRL handles the GRPO update.
Scale-out (multi-GPU, async, tree-of-work): Migrate to VeRL's AgentLoop. The FeatureDeletionEnv maps onto verl's SandboxFusionTool protocol. The tree-of-work branching requires N parallel rollout workers per prompt, which VeRL's asyncio architecture supports and TRL's synchronous loop does not.

The tree-of-work IS multi-turn. The vision in framework/composer-replication-framework.md of a "multi-model Monte-Carlo tree-of-work" requires:

Many concurrent rollout branches per prompt
Reward propagated back through the tree (not just at leaf)
Asynchronous sandbox execution without blocking GPU

None of these are provided by TRL's current GRPOTrainer (even with tools/environment_factory). VeRL's experimental fully_async_policy + AgentLoop is the right substrate. The repo's research/04 correctly identifies this but the ADR layer has not formally acknowledged this migration requirement.
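The third requirement, sandbox execution that does not serialize the branches, is the one a coroutine-per-branch design addresses. The sketch below uses only the standard library to show several branches waiting on simulated sandbox I/O at the same time; it is not verl's AgentLoop API, and `fake_sandbox_exec` and the reward rule are placeholders.

```python
import asyncio

async def fake_sandbox_exec(branch_id, stats, delay=0.01):
    """Stand-in for a sandbox call; the sleep yields control like real I/O would."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(delay)
    stats["in_flight"] -= 1
    return float(branch_id % 2)  # placeholder reward

async def rollout_branches(num_branches):
    """Run one coroutine per branch and collect the leaf rewards in branch order."""
    stats = {"in_flight": 0, "peak": 0}
    rewards = await asyncio.gather(
        *(fake_sandbox_exec(b, stats) for b in range(num_branches))
    )
    return list(rewards), stats["peak"]

rewards, peak_in_flight = asyncio.run(rollout_branches(4))
```

All four branches are in flight together, which is the property a synchronous tool-call loop cannot offer. Propagating rewards back through interior tree nodes would sit on top of this and is not shown.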

6. Sandboxing for Code Execution at Scale

6.1 What the secure-EKS article says (primary source)

"gVisor added negligible launch latency... handles isolation for most agent workloads." "Cold start was around 5 seconds per sandbox" for Kata+Firecracker. "EKS Managed Node Groups do not work yet: they override the CPU Options stanza needed for nested virtualization, forcing the use of self-managed node groups." "Managed sandbox platforms skip Kubernetes entirely. E2B and Vercel Sandbox provision Firecracker microVMs directly... sandbox creation in under a second, versus ~5 seconds Kata with Firecracker on EKS."

6.2 What SWE-MiniSandbox (arXiv:2602.11210v5) adds

Abstract: "SWE-MiniSandbox lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline." Uses "kernel-level mechanisms" (not containers) with "lightweight environment pre-caching." Empirical performance comparable to container-based pipelines on SWE-bench-style tasks.

This is directly relevant to the repo's FeatureDeletionEnv/Sandbox design. The current sandbox.py uses LocalSubprocessSandbox (plain subprocess, Docker-gated for real tests) — essentially no isolation for the subprocess case. For production RL training at scale with multiple rollout workers, the SWE-MiniSandbox approach (kernel-level isolation without per-task container builds) could reduce env setup from minutes to seconds.

6.3 What the repo's sandbox.py actually provides

sandbox.py defines:

LocalSubprocessSandbox — runs commands via subprocess.run in the repo tree. No container, no kernel isolation. The security model relies on the denylist + cache scrub (commented as "INSUFFICIENT as a primary control").
DockerSandbox (in docker_sandbox.py) — real isolation, referenced in tests.

The gap: For RL training at scale (many parallel rollout workers), neither LocalSubprocessSandbox nor per-task Docker containers are adequate:

Subprocess: no isolation, reward hacking possible via Python import tricks (acknowledged in code comments).
Docker: isolation is good, but per-task container boot is slow (typical: 2–5s without pre-warming), and 8 rollouts × N prompts × G generations adds up to hundreds of container launches per training step.

The repo's research/review-sandbox.json presumably tracks this; the production path requires pre-warmed sandbox pools (gVisor RuntimeClass for speed, Kata/Firecracker for stronger isolation).
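A pre-warmed pool of the kind mentioned above can be sketched with a thread-safe queue: sandboxes are booted once, leased per rollout, reset, and returned. This is an illustrative sketch only; `WarmSandboxPool` and `FakeSandbox` are hypothetical names, and a real `reset()` would have to scrub the filesystem and process state to preserve isolation between tasks.

```python
import queue
from contextlib import contextmanager

class FakeSandbox:
    """Stand-in for a container or microVM; counts how many were booted."""
    boots = 0

    def __init__(self):
        FakeSandbox.boots += 1
        self.dirty = False

    def exec(self, cmd: str) -> str:
        self.dirty = True
        return f"ran: {cmd}"

    def reset(self) -> None:
        self.dirty = False

class WarmSandboxPool:
    """Boot `size` sandboxes up front and hand them out on demand."""

    def __init__(self, factory, size: int):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def lease(self, timeout=None):
        sandbox = self._idle.get(timeout=timeout)  # blocks if the pool is exhausted
        try:
            yield sandbox
        finally:
            sandbox.reset()          # scrub state before the next rollout reuses it
            self._idle.put(sandbox)

pool = WarmSandboxPool(FakeSandbox, size=2)
outputs = []
for i in range(5):
    with pool.lease() as sandbox:
        outputs.append(sandbox.exec(f"pytest -k case{i}"))
```

Five rollouts are served by two boots, so the boot cost is paid at pool creation rather than inside the training step.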

7. Misreads, Overclaims, and Gaps in Repo Research

7.1 OVERCLAIM: VeRL "first-class" agentic RL

research/04 §1.5: "VeRL has first-class agentic RL support" and describes AsyncServer/AgentLoop as stable. The verl README (main branch, 2026-06-10) shows:

transfer_queue, fully_async_policy, one_step_off_policy are kept under verl/experimental — "planned to be merged into the main library."
uni-agent (May 2026) provides the higher-level agent framework, but it's a separate release on top of verl, not part of the stable library.

The agentic async path exists but is still experimental rather than part of the stable API. The characterization as "first-class" is slightly ahead of the actual maturity. The repo should note this when recommending VeRL for the tree-of-work.

7.2 MISS: TRL now has `environment_factory` + `tools` for multi-turn

research/04 (written May 2025) describes TRL as having only synchronous reward functions. The live TRL docs (v1.5.1, 2026) show environment_factory and tools for genuine multi-turn generation loops. These are experimental but available. The research/04 comparison matrix says "TRL: agentic tool-calling RL ⚠️ (blocking)" — this remains accurate (it IS blocking), but misses that TRL now provides the environment_factory interface which would allow FeatureDeletionEnv to drive multi-turn episodes inside GRPOTrainer without a custom rollout_func. This is not mentioned in any ADR.

7.3 MISS: TRL default `loss_type="dapo"`, NOT `"grpo"` or `"dr_grpo"`

ADR-008 correctly targets loss_type="dr_grpo", but research/04 (the background research) does not explicitly state that TRL 1.x defaults to DAPO loss. The live docs confirm this. The drift assertion in make_dr_grpo_config is the right mitigation.

7.4 GAP: `scale_rewards` default is `"group"`, not `True`

The GRPOConfig shows scale_rewards: str = 'group' (a string, not a bool). ADR-008's assertion str(cfg.scale_rewards).lower() in ("none","false") correctly handles both the old bool (False) and new string ("none") forms. But the docs show True and "group" are equivalent (both mean group-level std scaling). The assertion is correct; this is a documentation note, not a bug.

7.5 GAP: KL estimator — TRL default is k3, not k1

The TRL docs show the KL approximator formula:

D_KL[π_θ || π_ref] = π_ref/π_θ - log(π_ref/π_θ) - 1

This is the k3 estimator (Schulman et al., 2020). ADR-008 claims Composer 2.5 uses k1 (-log r = log π_θ/π_ref). The ADR notes this as an OPEN item. Since beta=0.0 by default in TRL, the KL term is disabled and this doesn't affect training unless beta>0. However, composer_trainer.py implements the k1-in-reward path via kl_in_reward.py — this is a separate mechanism from TRL's in-loss KL. Verify: the k1-in-reward path computes log(π_θ/π_ref) at reward time and folds it into advantages, while TRL's in-loss k3 term (when beta>0) would add a different term. If both are enabled simultaneously, they would double-count KL. The safe configuration is: k1-in-reward active, beta=0 (TRL in-loss KL disabled). The code appears to do this but there's no explicit assertion that beta=0 when kl_in_reward=True.
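The two estimators and the missing guard can be written down directly. In this sketch, `k1` and `k3` follow the formulas quoted in this section, evaluated per token from log-probabilities; `assert_single_kl_path` is a hypothetical helper showing the kind of assertion the paragraph above says is absent, not code from composer_trainer.py.

```python
import math

def k1(logp_theta: float, logp_ref: float) -> float:
    """k1 estimator: log(pi_theta / pi_ref). Can be negative for a single sample."""
    return logp_theta - logp_ref

def k3(logp_theta: float, logp_ref: float) -> float:
    """k3 estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1. Never negative."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0

def assert_single_kl_path(beta: float, kl_in_reward: bool) -> None:
    """Reject configurations that would apply KL both in the reward and in the loss."""
    if kl_in_reward and beta != 0.0:
        raise ValueError(
            f"kl_in_reward=True requires beta=0.0 to avoid double-counting KL; got beta={beta}"
        )

# A token the policy rates at 0.2 and the reference at 0.4.
sample_k1 = k1(math.log(0.2), math.log(0.4))   # about -0.693
sample_k3 = k3(math.log(0.2), math.log(0.4))   # about 0.307

assert_single_kl_path(beta=0.0, kl_in_reward=True)  # the safe configuration passes
```

The sample shows why the choice matters once a KL term is active: on the same token the two estimators differ in both sign and magnitude.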

7.6 CONFIRMED: `num_iterations=1` narrowed claim

ADR-008 acknowledges that num_iterations=1 controls GRPO inner-loop reuse, not dataset-level epochs. Primary source confirms: "Number of iterations per batch (denoted as μ in the algorithm)." The ADR's narrowed claim is correct.

7.7 MISS: TRL default optimizer is `adamw_torch_fused`, not `adam`

ADR-008 has an OPEN item: "Adam is claimed but optim is not set." The GRPOConfig docs show:

optim: transformers.training_args.OptimizerNames | str = 'adamw_torch_fused'

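Default is adamw_torch_fused (AdamW with a fused CUDA kernel), not plain adam. If Composer 2.5 uses Adam (without weight decay), the ADR's open item remains relevant, but the fix is to pin the existing values rather than switch optimizer families. weight_decay=0.0 is already the GRPOConfig default, and with zero weight decay the AdamW update reduces to the Adam update, so the fused variant differs from plain Adam only in its kernel implementation. Setting optim="adamw_torch" (or keeping "adamw_torch_fused") together with weight_decay=0.0 explicitly makes that equivalence deliberate instead of accidental. The 8-bit and paged variants (adamw_8bit, paged_adamw_8bit) trade optimizer-state precision or placement for memory and move further from plain Adam, so they are not a fidelity fix. Values such as "adam" or "adam_torch" should be checked against the OptimizerNames enum of the installed transformers version before use, since they may not be accepted.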
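The equivalence between AdamW at weight_decay=0.0 and plain Adam can be checked on a single scalar parameter. The sketch below is a hand-written textbook update, not the PyTorch or transformers implementation: the `decoupled` flag switches between AdamW-style decay (shrink the parameter directly) and Adam-style L2 decay (add the decay term to the gradient).

```python
import math

def adam_family_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, weight_decay=0.0, decoupled=True):
    """One scalar update. decoupled=True is AdamW; decoupled=False is Adam with L2."""
    beta1, beta2 = betas
    if decoupled:
        param = param * (1.0 - lr * weight_decay)   # decay applied to the weight
    else:
        grad = grad + weight_decay * param          # decay folded into the gradient
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

start = dict(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)

adamw_no_decay = adam_family_step(**start, weight_decay=0.0, decoupled=True)
adam_no_decay = adam_family_step(**start, weight_decay=0.0, decoupled=False)
adamw_decay = adam_family_step(**start, weight_decay=0.1, decoupled=True)
adam_decay = adam_family_step(**start, weight_decay=0.1, decoupled=False)
```

At zero weight decay the two variants return the same values; they diverge as soon as weight_decay is non-zero, which is why pinning weight_decay=0.0 explicitly is the part that matters for fidelity.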

8. verl `uni-agent` — A New Development Not in Research/04

The verl README (May 2026): "uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." This is a post-cutoff development (research/04 was written May 2025) that the repo has not incorporated. uni-agent could be the production-ready path for multi-turn agentic RL with verl, potentially superseding the lower-level AgentLoop/AsyncServer integration that ADR-006 contemplates.

Implication for the repo: Before committing to a custom VeRL AgentLoop integration, evaluate whether uni-agent already provides the FeatureDeletionEnv integration pattern out of the box. This could significantly reduce the engineering surface.

9. Sandboxing Recommendation Gaps

9.1 The `SWE-MiniSandbox` approach is not referenced anywhere in the repo

arXiv:2602.11210 (Feb 2026) directly addresses the production gap: container-free RL training with 5% disk usage and 25% env setup time vs containers. The paper's "kernel-level mechanisms" (likely Linux namespaces + cgroups without a full container runtime) with pre-caching is directly applicable to FeatureDeletionEnv at scale. The repo's sandbox design doesn't reference this work.

9.2 The repo's `docker_sandbox.py` is production-blocking for RL at scale

Per-task Docker container boots at GRPO scale (G=8 completions/prompt, B=4 per-device batch, many workers) means O(B*G) = O(32) container launches per training step. Without pre-warming or snapshot-based fast boot, this is the dominant latency. The gVisor RuntimeClass approach (negligible overhead, per the AWS article) or SWE-MiniSandbox's kernel-namespace approach are both faster alternatives.
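The launch count above translates into wall-clock time as follows. This is back-of-the-envelope arithmetic using the figures already given in this review (B=4, G=8, 2–5 s per cold boot); the `parallel_boots` value of 8 is a hypothetical illustration, not a measured capacity.

```python
import math

def boot_wall_clock(per_device_batch, generations, boot_seconds, parallel_boots=1):
    """Return (launch count, seconds spent booting) when boots run in fixed-size waves."""
    launches = per_device_batch * generations
    waves = math.ceil(launches / parallel_boots)
    return launches, waves * boot_seconds

launches, serial_low = boot_wall_clock(4, 8, 2.0)                 # fully serial, fast boots
_, serial_high = boot_wall_clock(4, 8, 5.0)                       # fully serial, slow boots
_, parallel_high = boot_wall_clock(4, 8, 5.0, parallel_boots=8)   # 8 boots at a time
```

Serial cold boots cost roughly one to three minutes per step under these assumptions, and even eight-way parallel booting leaves tens of seconds at the slow end, which is the case for pre-warming.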

10. Summary of Critical Findings

| Finding | Severity | Affected Files |
|---|---|---|
| VeRL async agent loop is EXPERIMENTAL, not "first-class" stable | MEDIUM (overclaim) | research/04 §1.5, ADR-006 |
| TRL `environment_factory` (multi-turn) not in any ADR | MEDIUM (miss) | ADR-008, env.py |
| k1-in-reward + beta=0 assertion missing (double-KL risk if beta>0 ever set) | MEDIUM (correctness gap) | composer_trainer.py, ADR-008 |
| `optim` default is `adamw_torch_fused`, not `adam` | LOW (fidelity gap) | ADR-008 OPEN item |
| TRL `loss_type` defaults to `"dapo"` (not GRPO), correctly handled | INFO (confirmed correct) | ADR-008, make_dr_grpo_config |
| `env.py::reward_fn` single-submit path is dead end for tree-of-work | HIGH (architecture gap) | env.py, no ADR exists |
| `uni-agent` (verl, May 2026) not evaluated; may supersede custom AgentLoop | MEDIUM (miss) | ADR-006 |
| SWE-MiniSandbox approach not referenced (5% disk, 25% setup time) | MEDIUM (miss) | sandbox.py, docker_sandbox.py |
| EKS Managed Node Groups incompatible with Kata+Firecracker (nested virt) | INFO (production gotcha) | no ADR |

11. Migration Path for Multi-Turn Agentic RL (Honest Assessment)

The current repo architecture (TRL reward_fn with single-submit fallback) is:

Phase 1 — GRPO on completions (current): The model generates a single diff/completion, the reward function grades it. This is viable, shippable, and correct. The TRL host is appropriate. No migration needed for this phase.

Phase 2 — Multi-turn FeatureDeletionEnv (agentic GRPO): The model takes tool-call steps (bash, file edits, test runs). Reward at trajectory end. Migration options in order:

TRL environment_factory adapter (experimental, weeks of work): Wrap FeatureDeletionEnv as a TRL environment. Methods become tools. Blocking GPU during sandbox execution — OK for small scale (≤8 GPUs), not for high parallelism.
TRL rollout_func (experimental, 1–2 weeks): Custom rollout that drives the env loop, serializes trajectories. Full control; TRL handles GRPO update.
VeRL AsyncServer + FeatureDeletionEnv as SandboxFusionTool adapter (2–4 weeks): GPU not blocked during sandbox calls. Required for tree-of-work fan-out at scale. The repo's ADR-006/ADR-008 have this on the roadmap but it's not implemented.

Phase 3 — Tree-of-Work (MCTS, multi-branch): This REQUIRES verl's async architecture. TRL cannot support N parallel branches per prompt with GPU-efficient execution. The uni-agent framework on top of verl should be evaluated first before building a custom AgentLoop integration.

Sources: TRL docs fetched 2026-06-10 (huggingface.co/docs/trl/en/grpo_trainer); vLLM co-locate blog (huggingface.co/blog/vllm-colocate); verl README (github.com/volcengine/verl/blob/main/README.md); SWE-MiniSandbox arXiv:2602.11210v5; AWS Builder Center EKS sandboxes article. Repo files: research/04, research/03, ADR-006, ADR-008, env.py, composer_trainer.py, recipes/prime_rl/composer_loss.py.