Verl LLM Best Practices (DAPO + Qwen3-235B)
===========================================

Last updated: 11/03/2025.

Purpose
-------

This guide uses DAPO training on Qwen3-235B as a concrete example. We unpack every parameter that appears in the optimization objective, map it to Verl configuration entries, and share field-tested recommendations so you can derive sensible settings for your own workloads.

.. note::

   1. The guide only covers the subset of parameters required to reproduce the DAPO experiments discussed here. For the full list, refer to the ``config`` components in the Verl source tree:
      https://github.com/volcengine/verl/tree/main/verl/trainer/config
   2. PPO and GRPO introduce KL-constrained policies. We therefore include that setup in the explanations below. You can treat all configurations mentioned here as a DAPO pipeline augmented with a KL penalty.

Optimization Objectives
-----------------------

DAPO objective
~~~~~~~~~~~~~~

.. math::

    \begin{aligned}
    \mathcal{J}_{\mathrm{DAPO}}(\theta) = {} & \mathbb{E}_{(q, a) \sim \mathcal{D},\, \left\{o_i\right\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\
    & \left[ \frac{1}{\sum_{i=1}^{G} \left|o_i\right|} \sum_{i=1}^{G} \sum_{t=1}^{\left|o_i\right|} \min\left( r_{i,t}(\theta) \hat{A}_{i,t},\ \operatorname{clip}\left( r_{i,t}(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}} \right) \hat{A}_{i,t} \right) \right]
    \end{aligned}

.. math::

    \text{s.t.} \quad 0 < \left| \left\{ o_i \mid \operatorname{is\_equivalent}\left(a, o_i\right) \right\} \right| < G

- [...] ``2 * model_parameters`` (bf16/fp16). Increase TP gradually to expand KV cache capacity while watching communication cost, especially once TP > 8.
- ``actor_rollout_ref.rollout.temperature`` / ``top_p`` / ``top_k``: Sampling knobs for rollout. Keep enough randomness; ``temperature=1.0``, ``top_p=1.0``, ``top_k=-1`` are good defaults.
- ``actor_rollout_ref.rollout.val_kwargs.temperature`` / ``top_p`` / ``top_k`` / ``do_sample`` / ``n``: Sampling options for validation.
  Set ``temperature > 0`` to prevent repetitive thinking chains. For small test sets (e.g., AIME24) raise ``n`` (64 is a common choice) to reduce variance. A practical starting point is ``temperature=1.0``, ``top_p=0.7``, ``top_k=-1``, ``do_sample=True``, ``n=1`` and then increase ``n`` as needed.
- ``+actor_rollout_ref.rollout.engine_kwargs.vllm.*`` / ``+actor_rollout_ref.rollout.engine_kwargs.sglang.*``: Extra backend options injected via the ``+`` syntax. Consult backend docs for exact semantics. Some switches (for example ``pipeline_parallel_size``) may not be supported yet; when TP=32, ``enable_expert_parallel=True`` can even slow down DeepSeek-V3 rollout, so benchmark carefully.

:math:`\pi_\theta`

- ``data.train_batch_size``: Total batch size per training iteration. Each rollout produces ``train_batch_size * n`` samples. Larger values reduce the number of rollouts but increase off-policy drift.
- ``actor_rollout_ref.actor.ppo_mini_batch_size``: Mini-batch size per optimization step. Tune it the same way you would for standard deep learning workloads.
- ``actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu``: Samples processed per forward pass on one GPU group (a Megatron group contains TP * PP * CP GPUs). Keep it ≤ ``ppo_mini_batch_size`` and as large as memory allows.
- ``actor_rollout_ref.actor.use_dynamic_bsz``: Enable dynamic batch sizing to adapt to sequence length and improve throughput.
- ``actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu``: Maximum tokens per GPU when computing log probabilities under dynamic batching. Set it to at least a multiple of ``max_prompt_length + max_response_length`` to prevent truncation.
- Megatron parallelism parameters (``pipeline_model_parallel_size`` / ``tensor_model_parallel_size`` / ``expert_model_parallel_size`` / ``expert_tensor_parallel_size`` / ``context_parallel_size``): Balance PP/TP/EP/ETP/CP to match memory and network constraints.
  In bf16/fp16, each parameter consumes roughly ``2 / TP`` bytes; if you keep FP32 master weights or skip optimizer offload, reserve another 4–8 bytes for Adam. Activations scale with ``micro_batch_size × sequence_length × hidden_size`` and can be mitigated with gradient checkpointing, dynamic batches, or offload. Prefer increasing TP first, add PP when necessary, extend sequence capacity with CP, align EP/ETP with TP for MoE models, and keep DP minimal on constrained clusters while combining with offload. Always align the setup with hardware topology and communication cost.
- ``actor_rollout_ref.model.use_fused_kernels``: Enable Verl’s fused kernels for supported models to squeeze out additional performance.

:math:`\hat{A}_{i,t}`

- ``algorithm.adv_estimator``: Advantage estimator. Set to ``grpo`` for DAPO/GRPO.

:math:`R_i`

- ``reward_model.reward_manager``: Reward aggregation strategy. Use ``dapo`` for DAPO and ``naive`` for GRPO.

:math:`D_{KL}`

- ``algorithm.use_kl_in_reward``: Whether to add a KL term to the reward. ``True`` for PPO, ``False`` for GRPO and DAPO.
- ``actor_rollout_ref.actor.use_kl_loss``: Whether to include a KL loss term. ``False`` for PPO, ``True`` for GRPO, ``False`` for DAPO.

:math:`\beta`

- ``actor_rollout_ref.actor.kl_loss_coef``: Weight of the KL loss. Start around 0.001. Larger values curb reward hacking but reduce exploration.
- ``algorithm.kl_ctrl.kl_coef``: KL coefficient applied within the reward. Adjust to match your tolerance for divergence.

:math:`\pi_{old}`

- ``actor_rollout_ref.rollout.log_prob_use_dynamic_bsz``: Enable dynamic batching when the old policy computes log-probabilities. Recommended.

:math:`\pi_{ref}`

- ``actor_rollout_ref.ref.log_prob_use_dynamic_bsz``: Enable dynamic batching for the reference policy. Recommended.
- Reference Megatron parallelism: Keep ``pipeline_model_parallel_size``, ``tensor_model_parallel_size``, ``expert_model_parallel_size``, ``expert_tensor_parallel_size``, and ``context_parallel_size`` in sync with the actor.
- ``actor_rollout_ref.ref.megatron.param_offload``: Offload reference parameters to CPU when the actor does so. Even without gradients or optimizer states, parity helps with capacity planning.

:math:`o_i` / :math:`|o_i|`

- ``actor_rollout_ref.actor.loss_agg_mode``: Loss aggregation mode. Token-level ``token-mean`` matches the recommendations from Dr.GRPO and DAPO; use ``seq-mean-token-mean`` to reproduce the original GRPO behavior.

:math:`\pi_\theta(o_{i,t} \mid q_i,o_{i,