Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
# Deep-Read: RL Infra & Frameworks (Critical Findings)

**Cluster 8 of the dataset-pipeline review series**

**Reviewer:** automated critical pipeline, 2026-06-09

**Primary sources fetched:** TRL GRPOTrainer live docs (v1.5.1, huggingface.co/docs/trl/en/grpo_trainer), vLLM co-locate blog (June 2025, huggingface.co/blog/vllm-colocate), verl GitHub README (main branch, verl-project/verl, 21.9k stars), SWE-MiniSandbox paper (arXiv:2602.11210v5), secure EKS sandboxes article (AWS Builder Center). Repo files inspected: `research/04-verl-trl.md`, `research/03-monarch-torchforge-openenv.md`, `docs/adrs/ADR-006-rl-frameworks.md`, `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md`, `composer_replication/datagen/env.py`, `composer_replication/trainer/composer_trainer.py`, `composer_replication/recipes/prime_rl/composer_loss.py`.

---
## 1. TRL GRPOTrainer: Does It Actually Support Multi-Turn Agentic Rollouts?

### 1.1 What the live TRL docs say (as of TRL v1.5.1, fetched 2026-06-10)

The TRL GRPOTrainer docs (primary source, not the repo's research/04 note written in May 2025) describe **three distinct agentic mechanisms**:

**Mechanism A: `tools` parameter.** Pass a list of Python callables. GRPOTrainer runs a tool-call loop. Quote from docs: "GRPO supports agent training through the `tools` argument in `GRPOTrainer`." The loop has an optional cap, `max_tool_calling_iterations` (default: unlimited; the loop stops on a response with no tool call or at `max_model_length`). Each tool call is synchronous: the training GPU waits.

**Mechanism B: `environment_factory` parameter.** Pass a callable that creates environment instances. "GRPOTrainer creates one environment instance per rollout and exposes the environment's public methods as tools." Requires `transformers>=5.2.0`. Marked **experimental**: "This feature is experimental and may change or be removed at any time without prior notice." The `reset()` method can return a string that gets appended to the last user message.

**Mechanism C: `rollout_func` (custom rollout).** A callable that receives prompts and the trainer and returns `{"prompt_ids", "completion_ids", "logprobs"}`. Also experimental. This is the escape hatch for fully custom multi-turn generation.

**Key constraint confirmed from primary source:** TRL has **no async GPU-decoupled agent loop**. The docs explicitly describe the training-inference mismatch and handle it via Truncated Importance Sampling (`vllm_importance_sampling_correction=True` by default), not by async GPU handoff. When a tool call is executing, the GPU waits. This is not a flaw in the repo's research note (`research/04-verl-trl.md` correctly identified this gap), but the docs now show TRL has partially closed the *multi-turn* gap via `tools` / `environment_factory`.
### 1.2 What `research/04-verl-trl.md` claims vs. primary source

| Claim in research/04 | Primary source (TRL docs, 2026) | Verdict |
|---|---|---|
| "TRL does NOT have an async GPU-decoupled agent loop" | Confirmed | CORRECT |
| "OpenEnv integration (October 2025)" | Confirmed; `environment_factory` + TRL's OpenEnv guide | CORRECT |
| "VLM support" | Confirmed: tools can return `list` of content blocks incl. images | CORRECT |
| "GRPOTrainer supports multi-step agentic rollouts" (04:173) | Confirmed via `tools` + `environment_factory` | CORRECT |
| TRL v1.0 released March 2026 | Confirmed; docs show versions v1.0.0 through v1.5.1 | CORRECT |
| Default `loss_type` is `"dapo"` | **CONFIRMED from source**: `loss_type: str = 'dapo'` in GRPOConfig | CORRECT |
| Default `scale_rewards` is... | **CONFIRMED: default is `"group"`** (not `False`/`"none"`) | CORRECT |
### 1.3 Critical discovery: TRL's default is DAPO, not GRPO

The TRL GRPOConfig shows `loss_type = 'dapo'` as the default. ADR-008 states that it configures `loss_type="dr_grpo"` to match Composer 2.5. The source confirms `"dr_grpo"` is a valid value (it uses `max_completion_length` as the constant denominator). **This is consistent with ADR-008's decision.**

However, ADR-008 lists "KL estimator (k1 vs k3) is not configured or asserted" as an OPEN item. The TRL docs show `beta=0.0` as the default (KL term disabled). If `beta=0`, the k1/k3 distinction is moot: there is no KL term in the loss at all. The ADR-008 open item is therefore **low-priority when beta=0** (the current default). If the repo ever enables beta>0 to use k1 KL-in-loss (distinct from the k1-in-reward path the trainer already implements), the open item becomes relevant.
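
To make the k1/k3 distinction concrete, the following is a minimal sketch of the two per-token estimators, written with the ratio r = π_ref/π_θ so that k1 = -log r and k3 = r - log r - 1. The log-prob values are invented for illustration and the function names are not taken from TRL or the repo.

```python
import math


def k1(logp_theta: float, logp_ref: float) -> float:
    """k1 estimator: -log r = log(pi_theta / pi_ref). Can be negative per token."""
    return logp_theta - logp_ref


def k3(logp_theta: float, logp_ref: float) -> float:
    """k3 estimator: r - log r - 1 with r = pi_ref / pi_theta. Never negative."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - log_r - 1.0


# Hypothetical (policy, reference) log-probs for three sampled tokens.
pairs = [(-1.0, -1.2), (-0.5, -0.4), (-2.0, -2.0)]
for lp_theta, lp_ref in pairs:
    print(f"k1={k1(lp_theta, lp_ref):+.4f}  k3={k3(lp_theta, lp_ref):.4f}")

# With beta = 0.0 the in-loss penalty vanishes whichever estimator is chosen.
beta = 0.0
kl_penalty = beta * sum(k3(a, b) for a, b in pairs)
```

The two estimators agree in expectation but differ token by token, which is why the choice only matters once `beta` is non-zero.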
### 1.4 `scale_rewards` drift assertion

ADR-008 checks `str(cfg.scale_rewards).lower() in ("none","false")`. Primary source confirms `scale_rewards` accepts: `True`/`"group"` (default), `"batch"`, `False`/`"none"`. The check is correct.
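
As a quick illustration, the predicate quoted above can be run against each accepted value. This is a sketch only: `StubConfig` is a hypothetical stand-in for `GRPOConfig` (so the example runs without TRL installed), and `assert_no_reward_scaling` is an illustrative name, not the repo's helper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class StubConfig:
    """Hypothetical stand-in for GRPOConfig; only the field under test."""
    scale_rewards: Union[bool, str] = "group"


def reward_scaling_disabled(cfg) -> bool:
    # Same predicate as the ADR-008 check quoted above.
    return str(cfg.scale_rewards).lower() in ("none", "false")


def assert_no_reward_scaling(cfg) -> None:
    """Fail fast if the config would apply std scaling to rewards."""
    if not reward_scaling_disabled(cfg):
        raise ValueError(
            f"scale_rewards={cfg.scale_rewards!r} enables std scaling; "
            "expected False or 'none'"
        )


for value in (True, "group", "batch", False, "none"):
    print(repr(value), "->", reward_scaling_disabled(StubConfig(value)))
```

One side effect of the string coercion is that a Python `None` also passes, because `str(None).lower()` is `"none"`.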

---

## 2. The Colocate-vLLM Blog: What It Actually Says

Primary source: "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" (June 3, 2025, huggingface.co/blog/vllm-colocate).

**What the blog confirms:**

- Co-locate mode (`vllm_mode="colocate"`) runs training and vLLM in the same process, sharing GPUs. No REST API overhead.
- Speedups measured: 1.43× for a 1.5B model, 1.35× for a 7B model, 1.73× for 7B with TP>1. For a 72B model (Qwen2.5-Math-72B): **co-locate is ~1.26× faster than plain TRL while using 4 fewer GPUs**.
- vLLM sleep mode (level 2) was **not yet merged into TRL upstream** as of the blog date because of a segfault on exit (vLLM issue #16993). The docs now show `vllm_enable_sleep_mode` as a parameter, implying it was eventually merged, but the blog documents a real production bug.
- FSDP + co-locate + LoRA has known issues: "GRPO + FSDP + LoRA + VLLM colocate" doesn't work; DeepSpeed ZeRO-3 is the recommended path. FSDP1 has a NaN bug with co-located vLLM (issue #14443).

**What `research/04` says:** It does not cite the co-locate blog specifically but correctly describes the "No GPU left behind" feature (June 2025 update, row in the §2.7 table). No material misread.

**What the repo's SageMaker smoke recipe uses:** The SageMaker GRPO smoke (from git history context) uses `use_vllm=False` for initial tests, which is fine: co-locate mode requires enough GPU memory for both the model and vLLM, and a single g5.2xlarge (1× A10G, 24 GB) may not accommodate both.

---
## 3. VeRL: Agentic Mode and the AsyncServer

Primary source: verl GitHub README (main branch, fetched 2026-06-10). Key finding from README:

> "[2026/05] uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl."

> "[2026/01] transfer_queue, fully_async_policy, one_step_off_policy ... are kept under verl/experimental since they are planned to be merged into the main library."

> Feature list includes: "Multi-turn with tool calling", "Sandbox Fusion Integration", "SGLang, verl, OpenBMB and Tsinghua University: Pioneering End-to-End Multi-Turn RLHF"

The verl README confirms multi-turn tool-calling exists and that `uni-agent` was released in May 2026 as a unified agent framework. The `AsyncServer`/`AgentLoop` architecture described in `research/04` is consistent with what the README describes, though the README doesn't use those exact terms. The experimental async features (`fully_async_policy`, `transfer_queue`) are available but not yet in the main library.

**What `research/04` claims about VeRL agentic support:**

- "First-class agentic RL support" with `AsyncServer`/`AgentLoop`: the README confirms the direction but notes these are under `verl/experimental`. The research/04 characterization of "first-class" **slightly overclaims** what is in the stable API; the full async path is experimental.
- `SandboxFusionTool`: mentioned in the README as a documented integration ("Sandbox Fusion Integration" link). Consistent.
- "Multi-turn tokenisation: noted as complex; naive concatenation of per-turn token IDs can introduce distribution drift": confirmed by the README community blog link "When Reasoning Models Break Tokenization: The Hidden Complexity of Multiturn Training."

---
## 4. PRIME-RL in the Repo

### 4.1 What ADR-006 claims

ADR-006 claims PRIME-RL ships a `CustomLossConfig` with `import_path` for dropping in a Python loss function, exposing `LossInputs` with `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask`. It was used to train INTELLECT-1 (10B, 30 nodes) and INTELLECT-2 (32B QwQ).

### 4.2 What `composer_replication/recipes/prime_rl/composer_loss.py` confirms

The code reads (lines 21-28):

```python
@dataclass
class LossInputs:
    trainer_logprobs: Float[Tensor, ' seq']
    inference_logprobs: Float[Tensor, ' seq']
    teacher_logprobs: Float[Tensor, ' seq'] | None
    advantages: Float[Tensor, ' seq']
    loss_mask: Bool[Tensor, ' seq']
```

This is marked as "verified against PrimeIntellect-ai/prime-rl `src/prime_rl/trainer/rl/loss.py` lines 13-22." The code correctly raises `NotImplementedError` when `alpha_sdpo > 0` (logits are not available, only log-probs). **This is a real constraint, not a placeholder.**

### 4.3 The DPPO upstream loss: a subtle accuracy point

The `composer_loss.py` reproduces the upstream DPPO loss verbatim (lines 40-60 of the file). It uses:

```python
probs_diff = exp(trainer_logprobs) - exp(inference_logprobs)  # probability-space diff
```

This is notably **not** a log-ratio but a probability-space difference gating the drop/keep mask. This is PRIME-RL's design, not a repo mistake. But it means the DPPO channel is closer to PRIME-RL's INTELLECT-style training than to standard GRPO. The repo's framing in ADR-006 as "channels 1+3" needs to be understood in that context: channel 1 is DPPO-shaped (probability-gated policy update), not raw GRPO.
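
The practical difference between the two gating quantities shows up when two tokens have the same log-ratio but different base probabilities. The sketch below is illustrative only: the symmetric gate and its width `GATE` are hypothetical and are not PRIME-RL's actual mask rule or threshold.

```python
import math


def prob_diff(trainer_lp: float, inference_lp: float) -> float:
    """Probability-space difference, as in the quoted `probs_diff` line."""
    return math.exp(trainer_lp) - math.exp(inference_lp)


def log_ratio(trainer_lp: float, inference_lp: float) -> float:
    """Log importance ratio, the quantity PPO/GRPO-style clipping works on."""
    return trainer_lp - inference_lp


SHIFT = 0.5  # both tokens move by the same 0.5 nats between the two engines
rare = (math.log(0.001) + SHIFT, math.log(0.001))   # low-probability token
common = (math.log(0.5) + SHIFT, math.log(0.5))     # high-probability token

GATE = 0.1  # hypothetical gate width, chosen only to separate the two cases


def keep_token(trainer_lp: float, inference_lp: float, gate: float = GATE) -> bool:
    """Hypothetical symmetric gate on the probability-space difference."""
    return abs(prob_diff(trainer_lp, inference_lp)) <= gate


for name, (t, i) in {"rare": rare, "common": common}.items():
    print(name, f"log_ratio={log_ratio(t, i):.3f}",
          f"prob_diff={prob_diff(t, i):.5f}", f"keep={keep_token(t, i)}")
```

A log-ratio gate would treat both tokens identically, whereas the probability-space gate keeps the rare token and drops the common one.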

---

## 5. The Key Question: Is TRL's Single-Submit `reward_fn` a Dead End for Multi-Turn?

### 5.1 What `env.py::reward_fn` actually does

```python
def reward_fn(self, prompts, completions, *, task_id, **kwargs) -> list[float]:
    ...
    for comp, tid in zip(completions, task_id):
        task = self.registry[tid]
        self.reset(task)
        if self._replay is not None:
            res = self._replay(self, comp)
        else:
            res = self.step({"type": "submit"})  # <-- single submit
        rewards.append(res.reward)
```

**The fallback path** (no `_replay` function) treats the entire completion as a single submit: an outcome reward on a single-turn completion. This is what the unit tests exercise and what a standard TRL `reward_funcs` call would do.

**The intended multi-turn path** is `_replay`: a callable that takes `(env, completion)` and drives the multi-turn episode by parsing the agent's encoded tool-call history from the `completion` string. This is a **custom deserializer** that replays the agent's turns and grades at the end.
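
A minimal sketch of such a deserializer follows. It assumes a one-JSON-action-per-line trajectory encoding, which is an invented format (the repo's actual encoding is not shown here), and `ToyEnv` is a hypothetical stand-in for `FeatureDeletionEnv` with a toy grading rule.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StepResult:
    reward: float
    done: bool


@dataclass
class ToyEnv:
    """Hypothetical env: reward 1.0 if a 'fix' action was applied before submit."""
    applied: list = field(default_factory=list)

    def step(self, action: dict) -> StepResult:
        if action["type"] == "submit":
            return StepResult(reward=1.0 if "fix" in self.applied else 0.0, done=True)
        self.applied.append(action.get("name", ""))
        return StepResult(reward=0.0, done=False)


def replay(env, completion: str) -> StepResult:
    """Replay one JSON action per line; submit at the end if the agent never did."""
    for line in completion.splitlines():
        try:
            action = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip free text between tool calls
        if not isinstance(action, dict) or "type" not in action:
            continue
        result = env.step(action)
        if result.done:
            return result
    return env.step({"type": "submit"})


trajectory = 'thinking...\n{"type": "tool", "name": "fix"}\n{"type": "submit"}'
print(replay(ToyEnv(), trajectory).reward)
```

The function has the same `(env, completion)` shape that `reward_fn` passes to `_replay`, and it returns an object with a `.reward` attribute, so it grades a whole trajectory rather than a single submit.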
### 5.2 Is this a dead end for multi-turn RL?

**For current TRL integration: partly yes, mostly no.**

The single-submit fallback IS a dead end for genuine multi-turn RL credit assignment: it cannot grade intermediate tool-call steps. But there are three viable paths:

**Path A: `_replay` + `rollout_func` (TRL experimental).** The `rollout_func` parameter in GRPOTrainer can drive multi-turn generation externally (running the env's `step()` loop), serialize the full trajectory into `completion` tokens, then call `reward_fn`, which uses `_replay` to deserialize and grade. This makes the `reward_fn` the grader, not the rollout driver. This works in TRL **today** but requires the experimental `rollout_func` interface.

**Path B: `environment_factory` (TRL experimental).** Pass `FeatureDeletionEnv` (or an adapter) as `environment_factory`. GRPOTrainer calls `reset()` and then uses the env's public methods as tools. The `reward_fn` is replaced by a reward function that reads `environments[i].reward` after generation. This is the more principled path for true multi-turn RL and is what TRL's `environment_factory` was designed for. It requires `transformers>=5.2.0` and is still experimental.

**Path C: VeRL's AgentLoop.** For the tree-of-work (multiple parallel rollout branches per prompt, credit assigned at trajectory end), VeRL's `AsyncServer`+`AgentLoop` is architecturally the right fit. Each branch is a coroutine; the GPU is not blocked during `sandbox.exec()` calls. The repo acknowledges this in the `research/04` §5.3 recommendation.
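
The shape of Path B can be sketched without importing TRL. Everything here is an assumption made for illustration: `ToyToolEnv`, its toy grading rule, the `reset(**kwargs)` signature, and the `environments` keyword on the reward function are stand-ins modelled on the description above, not TRL's confirmed interface.

```python
class ToyToolEnv:
    """Hypothetical adapter; its public methods are what would be exposed as tools."""

    def __init__(self) -> None:
        self.reward = 0.0
        self._files: dict[str, str] = {}

    def reset(self, **kwargs) -> str:
        """Start a fresh episode; the returned string is the task observation."""
        self.reward = 0.0
        self._files = {"app.py": "def feature():\n    raise NotImplementedError\n"}
        return "Repository files: app.py. Implement feature() so it returns 42."

    def read_file(self, path: str) -> str:
        """Return the contents of a file, or an empty string if it is missing."""
        return self._files.get(path, "")

    def write_file(self, path: str, content: str) -> str:
        """Overwrite a file, then re-grade the episode with a toy rule."""
        self._files[path] = content
        self.reward = 1.0 if "return 42" in self._files.get("app.py", "") else 0.0
        return "ok"


def reward_from_environments(environments, **kwargs) -> list[float]:
    """Outcome reward: read what each environment recorded during its episode."""
    return [float(env.reward) for env in environments]


# Simulate what a trainer would do for a group of two rollouts.
envs = [ToyToolEnv() for _ in range(2)]
for env in envs:
    env.reset()
envs[0].write_file("app.py", "def feature():\n    return 42\n")
print(reward_from_environments(envs))
```

Keeping the reward on the environment instance means the grader needs no parsing of the completion text, which is the main contrast with the `_replay` route in Path A.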
### 5.3 The honest migration path

The current TRL single-submit `reward_fn` is:

- **Correct** for the Phase 1 use case: offline dataset generation where the model produces a diff and we grade it. This is the "GRPO on completions" paradigm.
- **Insufficient** for genuine multi-turn RL over FeatureDeletionEnv episodes, especially the tree-of-work vision where the model takes tool-call steps, explores branches, and gets rewards at trajectory end.

**Migration path (in order of complexity):**

1. **Immediate (low cost):** Use TRL's `environment_factory` with `FeatureDeletionEnv` as the adapter. The env's `step()` becomes a tool. Grade via `reward_funcs` reading `env.reward`. Marked experimental but low integration cost. This supports genuine multi-turn GRPO with the current TRL host.
2. **Medium term (single-GPU scale):** Implement a `rollout_func` that drives the env loop directly and returns serialized trajectories with log-probs. Full control over multi-turn; TRL handles the GRPO update.
3. **Scale-out (multi-GPU, async, tree-of-work):** Migrate to VeRL's `AgentLoop`. The `FeatureDeletionEnv` maps onto verl's `SandboxFusionTool` protocol. The tree-of-work branching requires N parallel rollout workers per prompt, which VeRL's asyncio architecture supports and TRL's synchronous loop does not.

**The tree-of-work IS multi-turn.** The vision in `framework/composer-replication-framework.md` of a "multi-model Monte-Carlo tree-of-work" requires:

- Many concurrent rollout branches per prompt
- Reward propagated back through the tree (not just at leaf)
- Asynchronous sandbox execution without blocking GPU

None of these is provided by TRL's current `GRPOTrainer` (even with `tools`/`environment_factory`). VeRL's experimental `fully_async_policy` + `AgentLoop` is the right substrate. The repo's `research/04` correctly identifies this, but the ADR layer has not formally acknowledged the migration requirement.
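
The third requirement, sandbox execution that does not block other branches, can be illustrated with plain `asyncio`. This is a sketch, not verl's `AgentLoop`: `sandbox_exec` uses `asyncio.sleep` as a stand-in for a slow sandbox call, and the branch rewards are placeholders.

```python
import asyncio


async def sandbox_exec(command: str, delay: float = 0.01) -> str:
    """Stand-in for a slow sandbox call; awaiting it yields to other branches."""
    await asyncio.sleep(delay)
    return f"ran: {command}"


async def rollout_branch(branch_id: int, n_steps: int = 3) -> dict:
    """One rollout branch: a sequence of tool calls, graded at trajectory end."""
    transcript = []
    for step in range(n_steps):
        transcript.append(await sandbox_exec(f"branch{branch_id}-step{step}"))
    # Placeholder reward; a real branch would ask the environment to grade.
    return {"branch": branch_id, "steps": len(transcript), "reward": float(branch_id % 2)}


async def rollout_tree(n_branches: int) -> list:
    """Run all branches concurrently; gather preserves submission order."""
    return await asyncio.gather(*(rollout_branch(b) for b in range(n_branches)))


results = asyncio.run(rollout_tree(8))
print([r["reward"] for r in results])
```

Because each branch suspends while its sandbox call is pending, total wall time is close to one branch's latency rather than the sum over branches, which is the property a synchronous tool loop lacks.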

---

## 6. Sandboxing for Code Execution at Scale

### 6.1 What the secure-EKS article says (primary source)

> "gVisor added negligible launch latency... handles isolation for most agent workloads."

> "Cold start was around 5 seconds per sandbox" for Kata+Firecracker.

> "EKS Managed Node Groups do not work yet: they override the CPU Options stanza needed for nested virtualization, forcing the use of self-managed node groups."

> "Managed sandbox platforms skip Kubernetes entirely. E2B and Vercel Sandbox provision Firecracker microVMs directly... sandbox creation in under a second, versus ~5 seconds Kata with Firecracker on EKS."

### 6.2 What SWE-MiniSandbox (arXiv:2602.11210v5) adds

Abstract: "SWE-MiniSandbox lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline." Uses "kernel-level mechanisms" (not containers) with "lightweight environment pre-caching." Empirical performance is comparable to container-based pipelines on SWE-bench-style tasks.

This is directly relevant to the repo's `FeatureDeletionEnv`/`Sandbox` design. The current `sandbox.py` uses `LocalSubprocessSandbox` (plain subprocess, Docker-gated for real tests), which provides essentially no isolation in the subprocess case. For production RL training at scale with multiple rollout workers, the SWE-MiniSandbox approach (kernel-level isolation without per-task container builds) could reduce env setup from minutes to seconds.

### 6.3 What the repo's sandbox.py actually provides

`sandbox.py` defines:

- `LocalSubprocessSandbox`: runs commands via `subprocess.run` in the repo tree. No container, no kernel isolation. The security model relies on the denylist + cache scrub (commented as "INSUFFICIENT as a primary control").
- `DockerSandbox` (in `docker_sandbox.py`): real isolation, referenced in tests.

**The gap:** For RL training at scale (many parallel rollout workers), neither `LocalSubprocessSandbox` nor per-task Docker containers are adequate:

- Subprocess: no isolation, reward hacking possible via Python import tricks (acknowledged in code comments).
- Docker: isolation is good, but per-task container boot is slow (typical: 2–5s without pre-warming), and 8 rollouts × N prompts × G generations adds up to hundreds of container launches per training step.

The repo's `research/review-sandbox.json` presumably tracks this; the production path requires pre-warmed sandbox pools (gVisor RuntimeClass for speed, Kata/Firecracker for stronger isolation).
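
A pre-warmed pool can be sketched in a few lines. This is a minimal illustration under stated assumptions: `FakeSandbox` and `SandboxPool` are hypothetical names, and a real handle would wrap a container or microVM and restore a snapshot in `scrub()`.

```python
import itertools
import queue
from contextlib import contextmanager

_ids = itertools.count(1)


class FakeSandbox:
    """Hypothetical sandbox handle; a real one would wrap a container or microVM."""

    def __init__(self) -> None:
        self.id = next(_ids)
        self.dirty = False

    def run(self, command: str) -> str:
        self.dirty = True
        return f"[sandbox {self.id}] {command}"

    def scrub(self) -> None:
        """Reset state between episodes (a real pool might restore a snapshot)."""
        self.dirty = False


class SandboxPool:
    """Boot `size` sandboxes up front and lease them out one episode at a time."""

    def __init__(self, size: int) -> None:
        self._idle: "queue.Queue[FakeSandbox]" = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeSandbox())  # boot cost is paid once, here

    @contextmanager
    def lease(self, timeout: float = 5.0):
        sandbox = self._idle.get(timeout=timeout)  # blocks while all are busy
        try:
            yield sandbox
        finally:
            sandbox.scrub()
            self._idle.put(sandbox)

    def idle_count(self) -> int:
        return self._idle.qsize()


pool = SandboxPool(size=2)
with pool.lease() as sb:
    output = sb.run("pytest -q")
    idle_while_leased = pool.idle_count()
print(output, idle_while_leased, pool.idle_count())
```

The design choice is to move boot latency out of the training step: rollouts pay only for lease and scrub, at the price of holding idle sandboxes between steps.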

---

## 7. Misreads, Overclaims, and Gaps in Repo Research

### 7.1 OVERCLAIM: VeRL "first-class" agentic RL

`research/04` §1.5: "VeRL has first-class agentic RL support" and describes `AsyncServer`/`AgentLoop` as stable. The verl README (main branch, 2026-06-10) shows:

- `transfer_queue`, `fully_async_policy`, `one_step_off_policy` are kept under `verl/experimental`: "planned to be merged into the main library."
- `uni-agent` (May 2026) provides the higher-level agent framework, but it's a separate release on top of verl, not part of the stable library.

The agentic async path **exists** but is experimental and not yet part of the stable API. The characterization as "first-class" is slightly ahead of the actual maturity. The repo should note this when recommending VeRL for the tree-of-work.

### 7.2 MISS: TRL now has `environment_factory` + `tools` for multi-turn

`research/04` (written May 2025) describes TRL as having only synchronous reward functions. The live TRL docs (v1.5.1, 2026) show `environment_factory` and `tools` for genuine multi-turn generation loops. These are experimental but available. The research/04 comparison matrix says "TRL: agentic tool-calling RL ⚠️ (blocking)". That remains accurate (it IS blocking), but it misses that TRL now provides the `environment_factory` interface, which would allow `FeatureDeletionEnv` to drive multi-turn episodes inside GRPOTrainer without a custom `rollout_func`. This is not mentioned in any ADR.

### 7.3 MISS: TRL default `loss_type="dapo"`, NOT `"grpo"` or `"dr_grpo"`

ADR-008 correctly targets `loss_type="dr_grpo"`, but research/04 (the background research) does not explicitly state that TRL 1.x defaults to DAPO loss. The live docs confirm this. The drift assertion in `make_dr_grpo_config` is the right mitigation.

### 7.4 GAP: `scale_rewards` default is `"group"`, not `True`

The GRPOConfig shows `scale_rewards: str = 'group'` (a string, not a bool). ADR-008's assertion `str(cfg.scale_rewards).lower() in ("none","false")` correctly handles both the old bool (`False`) and new string (`"none"`) forms. The docs also show that `True` and `"group"` are equivalent (both mean group-level std scaling). The assertion is correct; this is a documentation note, not a bug.

### 7.5 GAP: KL estimator: TRL default is k3, not k1

The TRL docs show the KL approximator formula:

```
D_KL[π_θ || π_ref] = π_ref/π_θ - log(π_ref/π_θ) - 1
```

This is the **k3 estimator** (Schulman et al., 2020). ADR-008 claims Composer 2.5 uses k1 (`-log r = log π_θ/π_ref`, with `r = π_ref/π_θ`). The ADR notes this as an OPEN item. Since `beta=0.0` by default in TRL, the KL term is disabled and this doesn't affect training unless `beta>0`. However, `composer_trainer.py` implements the k1-in-reward path via `kl_in_reward.py`, which is a separate mechanism from TRL's in-loss KL. Verify: the k1-in-reward path computes `log(π_θ/π_ref)` at reward time and folds it into advantages, while TRL's in-loss k3 term (when beta>0) would add a different term. If both are enabled simultaneously, they would double-count KL. The safe configuration is: **k1-in-reward active, beta=0 (TRL in-loss KL disabled)**. The code appears to do this, but there is no explicit assertion that `beta=0` when `kl_in_reward=True`.
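
The missing assertion is small. A sketch follows, assuming the two knobs are reachable as `beta` and `kl_in_reward`; `TrainerSettings` and `assert_single_kl_path` are hypothetical names and do not reflect the actual attributes of `composer_trainer.py`.

```python
from dataclasses import dataclass


@dataclass
class TrainerSettings:
    """Hypothetical bundle of the two knobs discussed above."""
    beta: float = 0.0          # weight of the in-loss KL term
    kl_in_reward: bool = True  # k1 penalty folded into rewards/advantages


def assert_single_kl_path(settings: TrainerSettings) -> None:
    """Refuse configurations that would penalize KL both in reward and in loss."""
    if settings.kl_in_reward and settings.beta != 0.0:
        raise ValueError(
            f"kl_in_reward=True with beta={settings.beta} double-counts KL; "
            "set beta=0.0 or disable kl_in_reward"
        )


assert_single_kl_path(TrainerSettings())  # safe default: passes silently
```

Calling a guard like this once at trainer construction would turn the double-KL risk into a startup error rather than a silent change in the objective.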
### 7.6 CONFIRMED: `num_iterations=1` narrowed claim

ADR-008 acknowledges that `num_iterations=1` controls GRPO inner-loop reuse, not dataset-level epochs. Primary source confirms: "Number of iterations per batch (denoted as μ in the algorithm)." The ADR's narrowed claim is correct.
### 7.7 MISS: TRL default optimizer is `adamw_torch_fused`, not `adam`

ADR-008 has an OPEN item: "Adam is claimed but `optim` is not set." The GRPOConfig docs show:

```
optim: transformers.training_args.OptimizerNames | str = 'adamw_torch_fused'
```

Default is `adamw_torch_fused` (AdamW with a fused CUDA kernel), not plain Adam. If Composer 2.5 uses Adam (without weight decay), the ADR's open item remains relevant, but mainly as a documentation point: `weight_decay=0.0` is already the GRPOConfig default, and with zero decoupled weight decay the AdamW update is numerically the same as Adam's. The fidelity fix is therefore to set `optim` and `weight_decay=0.0` explicitly and assert on both, rather than rely on defaults. Two cautions apply. The 8-bit options (`"adamw_8bit"`, `"paged_adamw_8bit"`) are quantized, memory-saving AdamW variants and are not a closer match to plain Adam. And any string such as `"adam"` should be checked against `transformers.training_args.OptimizerNames` before use, since this review did not confirm that it is an accepted value.
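
The equivalence claim can be checked with a scalar toy optimizer. This is a sketch in pure Python, not the PyTorch implementation; it assumes the usual formulation in which AdamW shrinks the parameter by `lr * weight_decay` before applying the Adam update.

```python
import math


def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam update with bias correction."""
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(p, g, m, v, t, lr=1e-3, weight_decay=0.0):
    """AdamW: decoupled decay on the parameter, then the plain Adam update."""
    p = p * (1.0 - lr * weight_decay)
    return adam_step(p, g, m, v, t, lr=lr)


def run(step_fn, steps=5, **kwargs):
    """Minimize f(p) = p**2 from p = 1.0 for a few steps."""
    p, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * p  # gradient of p**2
        p, m, v = step_fn(p, g, m, v, t, **kwargs)
    return p


plain = run(adam_step)
decay_off = run(adamw_step, weight_decay=0.0)
decay_on = run(adamw_step, weight_decay=0.01)
print(plain == decay_off, plain == decay_on)
```

With `weight_decay=0.0` the decay factor is exactly 1.0, so the two trajectories coincide; any non-zero decay separates them.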

---

## 8. verl `uni-agent`: A New Development Not in Research/04

The verl README (May 2026): "uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." This is a post-cutoff development (research/04 was written May 2025) that the repo has not incorporated. `uni-agent` could be the production-ready path for multi-turn agentic RL with verl, potentially superseding the lower-level `AgentLoop`/`AsyncServer` integration that ADR-006 contemplates.

**Implication for the repo:** Before committing to a custom VeRL `AgentLoop` integration, evaluate whether `uni-agent` already provides the FeatureDeletionEnv integration pattern out of the box. This could significantly reduce the engineering surface.

---

## 9. Sandboxing Recommendation Gaps

### 9.1 The `SWE-MiniSandbox` approach is not referenced anywhere in the repo

arXiv:2602.11210 (Feb 2026) directly addresses the production gap: container-free RL training with roughly 5% of the disk usage and about 25% of the environment preparation time of container-based pipelines. The paper's "kernel-level mechanisms" (likely Linux namespaces + cgroups without a full container runtime) with pre-caching are directly applicable to `FeatureDeletionEnv` at scale. The repo's sandbox design doesn't reference this work.

### 9.2 The repo's `docker_sandbox.py` is production-blocking for RL at scale

Booting a Docker container per task at GRPO scale (G=8 completions/prompt, B=4 per-device batch, many workers) means O(B*G) = O(32) container launches per device per training step. Without pre-warming or snapshot-based fast boot, this is the dominant latency. The gVisor RuntimeClass approach (negligible overhead, per the AWS article) and SWE-MiniSandbox's kernel-namespace approach are both faster alternatives.

---
## 10. Summary of Critical Findings

| Finding | Severity | Affected Files |
|---|---|---|
| VeRL async agent loop is EXPERIMENTAL, not "first-class" stable | MEDIUM (overclaim) | research/04 §1.5, ADR-006 |
| TRL `environment_factory` (multi-turn) not in any ADR | MEDIUM (miss) | ADR-008, env.py |
| k1-in-reward + beta=0 assertion missing (double-KL risk if beta>0 ever set) | MEDIUM (correctness gap) | composer_trainer.py, ADR-008 |
| `optim` default is `adamw_torch_fused`, not `adam` | LOW (fidelity gap) | ADR-008 OPEN item |
| TRL `loss_type` defaults to `"dapo"` (not GRPO), correctly handled | INFO (confirmed correct) | ADR-008, make_dr_grpo_config |
| `env.py::reward_fn` single-submit path is dead end for tree-of-work | HIGH (architecture gap) | env.py, no ADR exists |
| `uni-agent` (verl, May 2026) not evaluated; may supersede custom AgentLoop | MEDIUM (miss) | ADR-006 |
| SWE-MiniSandbox approach not referenced (5% disk, 25% setup time) | MEDIUM (miss) | sandbox.py, docker_sandbox.py |
| EKS Managed Node Groups incompatible with Kata+Firecracker (nested virt) | INFO (production gotcha) | no ADR |

---
## 11. Migration Path for Multi-Turn Agentic RL (Honest Assessment)

The current repo architecture (TRL `reward_fn` with single-submit fallback) covers only the first of three phases:

**Phase 1: GRPO on completions (current).** The model generates a single diff/completion and the reward function grades it. This is viable, shippable, and correct. The TRL host is appropriate. No migration is needed for this phase.

**Phase 2: Multi-turn FeatureDeletionEnv (agentic GRPO).** The model takes tool-call steps (bash, file edits, test runs). Reward at trajectory end. Migration options in order:

1. **TRL `environment_factory` adapter (experimental, weeks of work):** Wrap `FeatureDeletionEnv` as a TRL environment. Methods become tools. The GPU blocks during sandbox execution, which is acceptable at small scale (≤8 GPUs) but not for high parallelism.
2. **TRL `rollout_func` (experimental, 1–2 weeks):** Custom rollout that drives the env loop and serializes trajectories. Full control; TRL handles the GRPO update.
3. **VeRL AsyncServer + `FeatureDeletionEnv` as SandboxFusionTool adapter (2–4 weeks):** GPU not blocked during sandbox calls. Required for tree-of-work fan-out at scale. The repo's ADR-006/ADR-008 have this on the roadmap but it's not implemented.

**Phase 3: Tree-of-Work (MCTS, multi-branch).** This REQUIRES verl's async architecture. TRL cannot support N parallel branches per prompt with GPU-efficient execution. The `uni-agent` framework on top of verl should be evaluated first before building a custom AgentLoop integration.

---

*Sources: TRL docs fetched 2026-06-10 (huggingface.co/docs/trl/en/grpo_trainer); vLLM co-locate blog (huggingface.co/blog/vllm-colocate); verl README (github.com/volcengine/verl/blob/main/README.md); SWE-MiniSandbox arXiv:2602.11210v5; AWS Builder Center EKS sandboxes article. Repo files: research/04, research/03, ADR-006, ADR-008, env.py, composer_trainer.py, recipes/prime_rl/composer_loss.py.*