arxiv:2606.12370

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Published on Jun 10 · Submitted by taesiri on Jun 11

Abstract

Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.

Reinforcement learning (RL) has become a key component of modern large language models, yet the rollout stage remains the main bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, which limits the achievable speedup. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes for integrating MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by model entropy: it shows a clear negative linear relationship with the rise of entropy during the RL stage. Second, we show that, compared to greedy draft sampling, probabilistic rejection sampling largely alleviates the disturbance that entropy introduces in RL. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and we therefore propose a novel end-to-end (e2e) total variation (TV) loss that directly optimizes the multi-step rejection sampling acceptance rate. This yields a ~10% acceptance rate improvement, with acceptance rates of up to 95% and up to 25% additional inference throughput across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with the e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL run, eliminating the need for costly online MTP updates. We provide extensive experiments and analysis that validate our findings. Experimental results show that our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
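To make the link between rejection sampling and a TV objective concrete, here is a minimal sketch of a single draft-then-verify step in standard speculative sampling. It is not the paper's implementation: the function names and the toy distributions are assumptions made for illustration, and the multi-step e2e TV loss itself is not reproduced. The sketch relies only on the well-known identity that a token drafted from q is accepted under target p with expected probability sum_x min(p(x), q(x)) = 1 - TV(p, q), so lowering the TV distance raises single-step acceptance.

```python
import random


def acceptance_probability(p, q):
    """Expected chance that a token drafted from q is accepted under target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p, q):
    """Total variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def speculative_step(p, q, rng):
    """One draft-then-verify step. Returns (token, accepted)."""
    vocab = list(range(len(p)))
    x = rng.choices(vocab, weights=q)[0]  # draft model proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):  # target keeps x with prob min(1, p/q)
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0),
    # which keeps the overall output distributed exactly as p.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


def empirical_acceptance(p, q, trials=20000, seed=0):
    """Monte Carlo estimate of the acceptance rate."""
    rng = random.Random(seed)
    accepted = sum(speculative_step(p, q, rng)[1] for _ in range(trials))
    return accepted / trials


# Toy distributions over a 3-token vocabulary (illustrative values only).
target = [0.5, 0.3, 0.2]
draft = [0.6, 0.3, 0.1]

if __name__ == "__main__":
    print("analytic acceptance:", acceptance_probability(target, draft))
    print("1 - TV(p, q):       ", 1 - total_variation(target, draft))
    print("empirical acceptance:", empirical_acceptance(target, draft))
```

The analytic and empirical values should agree up to sampling noise, which is why a loss based on TV distance is a natural proxy for acceptance rate in this setting.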

Community

Paper submitter

Accelerates RL training in LLMs using MTP with rejection sampling and a novel end-to-end TV loss.

Key finding: MTP acceptance rate is linearly constrained by target entropy, the dominant factor behind degradation during RL.

Rejection sampling with TV loss breaks this bound: its acceptance depends on distributional overlap, not just the top-1 token, making it far more robust to entropy shifts.
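The contrast between top-1 matching and distributional overlap can be illustrated with a toy sweep over sampling temperature. This is a sketch under stated assumptions, not the paper's experiment: the logits are made up, temperature is used only as a stand-in for rising entropy, and "greedy draft" is assumed to mean that the draft proposes its argmax token and a sampling target keeps it with probability p[argmax q].

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def greedy_draft_acceptance(p, q):
    """Draft proposes argmax of q; a sampling target keeps it with probability p[argmax q]."""
    top = max(range(len(q)), key=q.__getitem__)
    return p[top]


def rejection_sampling_acceptance(p, q):
    """Overlap between target p and draft q, i.e. 1 - TV(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical logits: the draft is close to, but not identical to, the target.
target_logits = [2.0, 1.0, 0.5, 0.0]
draft_logits = [1.8, 1.1, 0.4, 0.1]


def sweep(temperatures=(0.25, 0.5, 1.0, 2.0)):
    """Return (entropy, greedy acceptance, rejection acceptance) per temperature."""
    rows = []
    for t in temperatures:
        p = softmax(target_logits, t)
        q = softmax(draft_logits, t)
        rows.append((entropy(p),
                     greedy_draft_acceptance(p, q),
                     rejection_sampling_acceptance(p, q)))
    return rows


if __name__ == "__main__":
    for h, greedy, overlap in sweep():
        print(f"entropy={h:.3f}  greedy={greedy:.3f}  rejection={overlap:.3f}")
```

In this toy setup the greedy acceptance is capped by the probability of the target's top token, so it falls as entropy rises, while the overlap term stays high as long as the draft tracks the target. At very low entropy a slightly mismatched draft can still do marginally better under greedy matching, so the advantage of rejection sampling shows up mainly in the high-entropy regime.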


Hi Qwen team,

Qwen3.7-Max is already out as a closed API model — any plans to release open-weight variants of the 3.7 family (similar to how 3.6-35B-A3B and 3.6-27B were released alongside 3.6-Max)?

Would love to run them locally via llama.cpp / GGUF.

Thanks!


