arxiv:2602.10231

Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

Published on Feb 10

· Submitted by

Alexander on Feb 12

Nebius

Upvote

Authors:

Alexander Golubev ,

Abstract

Blockwise Advantage Estimation addresses reward interference in structured generations by assigning separate advantages to different text blocks, using outcome-conditioned baselines to avoid expensive nested rollouts.

AI-generated summary

Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only within-group statistics by stratifying samples according to a prefix-derived intermediate outcome. On math tasks with uncertainty estimation, our method mitigates reward interference, is competitive with a state-of-the-art reward-designed approach, and preserves test-time gains from confidence-weighted ensembling. More broadly, it provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts.

View arXiv page View PDF Add to collection

Community

djalexj

Paper author Paper submitter about 4 hours ago

Blockwise Advantage Estimation makes GRPO work for segmented, multi-objective generations by routing each objective’s learning signal to the tokens that control it, using an outcome-conditioned baseline for later segments.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.10231 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.10231 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.10231 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.