arxiv:2603.04918

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Published on Mar 5
Submitted by Yuan-Li-FNLP on Mar 9

Abstract

Band-constrained Policy Optimization addresses stability issues in reinforcement learning for large language models by replacing fixed clipping with a dynamic probability-aware projection method that prevents entropy collapse.

AI-generated summary

Proximal constraints are fundamental to the stability of reinforcement learning for Large Language Models. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.

Community

Paper author Paper submitter

This paper introduces BandPO (Band-constrained Policy Optimization), which addresses a critical but often overlooked bottleneck in LLM reinforcement learning methods such as PPO, GRPO, and DAPO.

Why it matters:
The canonical clipping mechanism in PPO/GRPO/DAPO uses fixed bounds. The authors mathematically reveal that this strictly constrains the upward update margin of low-probability actions, which disproportionately suppresses high-advantage tail strategies and induces rapid entropy collapse. Simply relaxing the bounds (like Clip-Higher) leads to training instability.
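As a toy numerical illustration of that bottleneck (ours, not from the paper): under a fixed clipping interval, the maximum one-step probability gain of any token is proportional to its current probability, so tail tokens have almost no absolute headroom no matter how large their advantage is.

```python
def max_new_prob_fixed_clip(p_old: float, eps: float = 0.2) -> float:
    """Under fixed ratio clipping to [1 - eps, 1 + eps], the ratio
    p_new / p_old is capped at 1 + eps, so the largest probability a
    token can reach in one clipped update is (1 + eps) * p_old,
    regardless of its advantage."""
    return (1.0 + eps) * p_old

# A rare, high-advantage token at p = 0.001 can climb to at most 0.0012,
# while a common token at p = 0.4 may reach 0.48: the absolute upward
# headroom for tail tokens is tiny under fixed bounds.
```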

Key Contributions:

  • Dynamic, Probability-Aware Bounds: BandPO replaces fixed clipping with a unified "Band" operator, projecting trust regions defined by $f$-divergences into dynamic clipping intervals.
  • Prevents Entropy Collapse: It naturally expands the feasible upward margin for low-probability actions to prevent premature clipping, effectively preserving critical exploration gradients without losing stability.
  • Strong Empirical Results: Built on top of the GRPO framework, BandPO consistently outperforms vanilla GRPO and Clip-Higher on mathematical reasoning benchmarks (AMC 2023, AIME 2024/2025) across diverse models including Qwen2.5 (3B, 7B) and DeepSeek-R1-Distill (Llama-8B, Qwen).
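To make the contrast concrete, here is a minimal single-token sketch. Note the `alpha / p_old` widening rule is our illustrative assumption, not the Band operator itself, which derives its interval from an f-divergence trust region:

```python
def fixed_clip_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO/GRPO clipped surrogate for a single token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def band_surrogate(ratio: float, advantage: float, p_old: float,
                   eps: float = 0.2, alpha: float = 0.05) -> float:
    """Same surrogate, but the upper bound widens as p_old shrinks, so
    low-probability tokens keep their upward gradient. The widening rule
    here is a stand-in; BandPO obtains the interval from the divergence
    projection."""
    hi = 1.0 + eps + alpha / p_old
    clipped = min(max(ratio, 1.0 - eps), hi)
    return min(ratio * advantage, clipped * advantage)
```

For a rare token with `p_old = 0.01`, `ratio = 3.0`, and positive advantage, fixed clipping caps the objective at `1.2 * A`, while the widened band leaves the full `3.0 * A` in play.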

We believe this provides a highly effective and theoretically grounded improvement over standard GRPO clipping, which will be very valuable to the open-source LLM post-training community. Code is publicly available!

I'm just starting to read it. But I have a question: is this the same discovery as in DPPO?

https://huggingface.co/papers/2602.04879

The basic RL approach to LLMs requires rethinking: LLMs compute probabilities over tokens (of which there can be tens of thousands), an action space many times larger than those RL has traditionally been applied to.


wait, the idea of per-action, probability-aware clipping intervals that adapt to the old policy instead of fixed bounds is a slick way to keep tail actions in play. i'm curious how robust the convex optimization stays when you scale to huge vocabularies and try different f-divergences, especially with real-world rlhf noise. the breakdown on arxivlens was solid, found a nice walkthrough here: https://arxivlens.com/PaperView/Details/bandpo-bridging-trust-regions-and-ratio-clipping-via-probability-aware-bounds-for-llm-reinforcement-learning-283-62d2c3b7


Thanks for the thoughtful comment, and thanks for sharing the walkthrough.

This is a very natural concern, but one key point is that the runtime computation in BandPO is not a full high-dimensional optimization over the vocabulary. In our formulation, the trust-region projection can be strictly reduced to a 1D problem parameterized only by the old probability of the target token. The high-dimensional simplex constraint is scalarized into a univariate equation g_f(p, r) = \delta, whose roots give the clipping bounds directly.

So the practical scaling behavior is much lighter than “solving a huge convex program over a 100K+ vocabulary” might suggest. For TV and Pearson \chi^2, the bounds are available in closed form; for KL, the active-regime bound is the unique root of a monotone binding equation, which can be solved efficiently with standard bracketed methods such as bisection or Brent’s method, with global convergence guarantees.

In our implementation, we use a CUDA-parallelized bisection solver for the KL case, so the additional overhead is practical and parallel-friendly in LLM RL training.
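For readers curious what the 1D solve looks like, here is a host-side sketch: a bracketed bisection on a monotone binding equation. The `g_kl_toy` scalarization below is our stand-in for the paper's g_f, chosen only so the monotonicity assumption holds; the actual BandPO equation (and the CUDA-parallelized kernel) will differ.

```python
import math

def band_upper_bound(p_old, delta, g_f, r_max=100.0, iters=60):
    """Bisection for the upper clipping bound: find r > 1 with
    g_f(p_old, r) = delta. Assumes g_f(p_old, .) is monotone increasing
    on [1, r_max] with g_f(p_old, 1) = 0, as in the monotone binding
    equation described above."""
    lo, hi = 1.0, r_max
    if g_f(p_old, hi) < delta:   # constraint never binds inside the bracket
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g_f(p_old, mid) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g_kl_toy(p, r):
    """Toy monotone surrogate standing in for the KL binding equation
    (an assumption for illustration, not the paper's exact g_f)."""
    return p * (r * math.log(r) - r + 1.0)
```

With this toy g_f, shrinking `p_old` at fixed delta widens the admissible upward bound, which is exactly the probability-aware behavior the reply describes.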

Also, regarding “real-world RLHF noise”: that kind of noise can certainly affect the overall optimization dynamics, but it does not directly make the Band bound computation ill-posed. The bound solver itself is a deterministic geometric mapping from (p, \delta, f) to the admissible interval, rather than a noisy inner optimization over rewards. In that sense, the numerical stability issue is much milder than it may initially sound. The broader point of BandPO is exactly to replace fixed clipping with theoretically valid, probability-aware bounds while preserving practical usability.


