arXiv:2511.00066

Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

Published on Oct 29, 2025

Authors:

Abstract

AI-generated summary: Sharpness-Guided GRPO improves reinforcement learning with verifiable rewards by reducing gradient sharpness and enhancing generalization through token-weighted optimization.

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
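The abstract does not state the weighting rule, so the following is only a minimal sketch of what a token-weighted GRPO objective could look like, not the authors' method. It relies on one known fact: for a softmax policy, the gradient of log p_k with respect to the logits is (one-hot(k) - p), so low-probability sampled tokens produce larger logit gradients. The weight `p ** alpha`, the parameter `alpha`, and all function names are hypothetical choices made for illustration. PPO-style ratio clipping and the KL penalty are omitted to keep the example short.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one group of responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]

def logit_grad_norm(probs, k):
    """L2 norm of d log p_k / d logits for a softmax distribution `probs`."""
    return math.sqrt(sum(((1.0 if j == k else 0.0) - p) ** 2
                         for j, p in enumerate(probs)))

def token_weight(p_token, alpha=0.5):
    """HYPOTHETICAL weight (not from the paper): smaller for low-probability tokens."""
    return p_token ** alpha

def weighted_grpo_loss(token_probs, rewards, alpha=0.5):
    """Token-weighted policy-gradient surrogate, averaged over all tokens.

    token_probs[i] holds the policy probabilities of the sampled tokens in
    response i; rewards[i] is that response's verifiable reward.
    """
    advantages = group_relative_advantages(rewards)
    total, count = 0.0, 0
    for probs, adv in zip(token_probs, advantages):
        for p in probs:
            w = token_weight(p, alpha)  # treated as a constant, no gradient through it
            total += -w * adv * math.log(p)
            count += 1
    return total / count

if __name__ == "__main__":
    # Two sampled responses for one prompt, with made-up token probabilities.
    token_probs = [[0.9, 0.05], [0.6, 0.7]]
    rewards = [1.0, 0.0]
    print("unweighted:", weighted_grpo_loss(token_probs, rewards, alpha=0.0))
    print("weighted:  ", weighted_grpo_loss(token_probs, rewards, alpha=0.5))
```

Setting `alpha=0.0` makes every weight 1 and recovers the plain unweighted surrogate, which makes the effect of the assumed weighting easy to compare against a baseline.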
