arxiv:2603.04918

BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning

Published on Mar 5
Submitted by Yuan-Li-FNLP on Mar 9

Abstract

Band-constrained Policy Optimization addresses stability issues in reinforcement learning for large language models by replacing fixed clipping with a dynamic probability-aware projection method that prevents entropy collapse.

AI-generated summary

Proximal constraints are fundamental to the stability of reinforcement learning for Large Language Models. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.

Community

Paper author Paper submitter

This paper introduces BandPO (Band-constrained Policy Optimization), which addresses a critical but often overlooked bottleneck in LLM reinforcement learning methods such as PPO, GRPO, and DAPO.

Why it matters:
The canonical clipping mechanism in PPO/GRPO/DAPO uses fixed bounds. The authors mathematically reveal that this strictly constrains the upward update margin of low-probability actions, which disproportionately suppresses high-advantage tail strategies and induces rapid entropy collapse. Simply relaxing the bounds (like Clip-Higher) leads to training instability.
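As a toy numerical illustration of that bottleneck (ours, not from the paper): under a fixed clipping interval, the maximum one-step probability gain of any token is proportional to its current probability, so tail tokens have almost no absolute headroom no matter how large their advantage is.

```python
def max_new_prob_fixed_clip(p_old: float, eps: float = 0.2) -> float:
    """Under fixed ratio clipping to [1 - eps, 1 + eps], the ratio
    p_new / p_old is capped at 1 + eps, so the largest probability a
    token can reach in one clipped update is (1 + eps) * p_old,
    regardless of its advantage."""
    return (1.0 + eps) * p_old

# A rare, high-advantage token at p = 0.001 can climb to at most 0.0012,
# while a common token at p = 0.4 may reach 0.48: the absolute upward
# headroom for tail tokens is tiny under fixed bounds.
```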

Key Contributions:

  • Dynamic, Probability-Aware Bounds: BandPO replaces fixed clipping with a unified "Band" operator, projecting trust regions defined by $f$-divergences into dynamic clipping intervals.
  • Prevents Entropy Collapse: It naturally expands the feasible upward margin for low-probability actions to prevent premature clipping, effectively preserving critical exploration gradients without losing stability.
  • Strong Empirical Results: Built on top of the GRPO framework, BandPO consistently outperforms vanilla GRPO and Clip-Higher on mathematical reasoning benchmarks (AMC 2023, AIME 2024/2025) across diverse models including Qwen2.5 (3B, 7B) and DeepSeek-R1-Distill (Llama-8B, Qwen).
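To make the contrast concrete, here is a minimal single-token sketch. Note the `alpha / p_old` widening rule is our illustrative assumption, not the Band operator itself, which derives its interval from an f-divergence trust region:

```python
def fixed_clip_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO/GRPO clipped surrogate for a single token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def band_surrogate(ratio: float, advantage: float, p_old: float,
                   eps: float = 0.2, alpha: float = 0.05) -> float:
    """Same surrogate, but the upper bound widens as p_old shrinks, so
    low-probability tokens keep their upward gradient. The widening rule
    here is a stand-in; BandPO obtains the interval from the divergence
    projection."""
    hi = 1.0 + eps + alpha / p_old
    clipped = min(max(ratio, 1.0 - eps), hi)
    return min(ratio * advantage, clipped * advantage)
```

For a rare token with `p_old = 0.01`, `ratio = 3.0`, and positive advantage, fixed clipping caps the objective at `1.2 * A`, while the widened band leaves the full `3.0 * A` in play.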

We believe this provides a highly effective and theoretically grounded improvement over standard GRPO clipping, which will be very valuable to the open-source LLM post-training community. Code is publicly available!

I'm just starting to read it. But I have a question: is this the same discovery as in DPPO?

https://huggingface.co/papers/2602.04879

The basic RL approach to LLMs requires rethinking: LLMs compute probabilities over tokens (of which there can be tens of thousands), an action space many times larger than those RL has traditionally been applied to.


wait, the idea of per-action, probability-aware clipping intervals that adapt to the old policy instead of fixed bounds is a slick way to keep tail actions in play. i'm curious how robust the convex optimization stays when you scale to huge vocabularies and try different f-divergences, especially with real-world rlhf noise. the breakdown on arxivlens was solid, found a nice walkthrough here: https://arxivlens.com/PaperView/Details/bandpo-bridging-trust-regions-and-ratio-clipping-via-probability-aware-bounds-for-llm-reinforcement-learning-283-62d2c3b7


Thanks for the thoughtful comment, and thanks for sharing the walkthrough.

This is a very natural concern, but one key point is that the runtime computation in BandPO is not a full high-dimensional optimization over the vocabulary. In our formulation, the trust-region projection can be strictly reduced to a 1D problem parameterized only by the old probability of the target token. The high-dimensional simplex constraint is scalarized into a univariate equation g_f(p, r) = \delta, whose roots give the clipping bounds directly.

So the practical scaling behavior is much lighter than “solving a huge convex program over a 100K+ vocabulary” might suggest. For TV and Pearson \chi^2, the bounds are available in closed form; for KL, the active-regime bound is the unique root of a monotone binding equation, which can be solved efficiently with standard bracketed methods such as bisection or Brent’s method, with global convergence guarantees.

In our implementation, we use a CUDA-parallelized bisection solver for the KL case, so the additional overhead is practical and parallel-friendly in LLM RL training.
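For readers curious what the 1D solve looks like, here is a host-side sketch: a bracketed bisection on a monotone binding equation. The `g_kl_toy` scalarization below is our stand-in for the paper's g_f, chosen only so the monotonicity assumption holds; the actual BandPO equation (and the CUDA-parallelized kernel) will differ.

```python
import math

def band_upper_bound(p_old, delta, g_f, r_max=100.0, iters=60):
    """Bisection for the upper clipping bound: find r > 1 with
    g_f(p_old, r) = delta. Assumes g_f(p_old, .) is monotone increasing
    on [1, r_max] with g_f(p_old, 1) = 0, as in the monotone binding
    equation described above."""
    lo, hi = 1.0, r_max
    if g_f(p_old, hi) < delta:   # constraint never binds inside the bracket
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g_f(p_old, mid) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g_kl_toy(p, r):
    """Toy monotone surrogate standing in for the KL binding equation
    (an assumption for illustration, not the paper's exact g_f)."""
    return p * (r * math.log(r) - r + 1.0)
```

With this toy g_f, shrinking `p_old` at fixed delta widens the admissible upward bound, which is exactly the probability-aware behavior the reply describes.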

Also, regarding “real-world RLHF noise”: that kind of noise can certainly affect the overall optimization dynamics, but it does not directly make the Band bound computation ill-posed. The bound solver itself is a deterministic geometric mapping from (p, \delta, f) to the admissible interval, rather than a noisy inner optimization over rewards. In that sense, the numerical stability issue is much milder than it may initially sound. The broader point of BandPO is exactly to replace fixed clipping with theoretically valid, probability-aware bounds while preserving practical usability.


