GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Abstract
Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
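The collapse described in the abstract can be illustrated with a small numerical sketch (hypothetical reward values, not taken from the paper): when per-rollout rewards are summed before group normalization, rollouts with different reward profiles but the same total receive identical advantages.

```python
import numpy as np

# Hypothetical rollout group with two binary rewards per rollout
# (e.g. answer correctness and format compliance); illustrative values only.
rewards = np.array([
    [1.0, 0.0],  # rollout A: correct, bad format
    [0.0, 1.0],  # rollout B: wrong, good format
    [1.0, 1.0],  # rollout C: correct, good format
    [0.0, 0.0],  # rollout D: wrong, bad format
])

# GRPO-style handling: aggregate the rewards first, then normalize the
# aggregated scalar over the group.
totals = rewards.sum(axis=1)                                # [1, 1, 2, 0]
adv_grpo = (totals - totals.mean()) / (totals.std() + 1e-8)

# Rollouts A and B behave very differently, yet get identical advantages.
print(adv_grpo[0] == adv_grpo[1])  # True
```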
Community
GDPO is a drop-in replacement for GRPO in verl and TRL; only minor code changes are needed.
We release a slurm-free, easy-to-run implementation supporting multiple RL frameworks (verl / TRL / NeMo-RL) so you can quickly validate GDPO on tool-calling and math reasoning tasks.
⏱️ Each run can be completed in ~1 hour on 8×A100s, or ~2.5 hours on a single A100.
Switching from GRPO to GDPO is easy.
Try it yourself: https://github.com/NVlabs/GDPO
Really cool paper!
I've created a podcast that explains the key concepts:
https://researchpod-share.vercel.app/episode/c83f1820-279a-4cc0-afe1-b927a0c20ec8
I enjoyed listening to the AI paper podcast!
When you RL models for real-world use, you care about more than one thing: accuracy, conciseness, alignment, faithfulness, etc.
But most RL pipelines still compress all of that into one scalar advantage in the loss function, and a lot of preference signal gets washed out.
We're introducing GDPO, a simple fix that preserves multi-dimensional preferences within a single advantage. Key idea: swap the order of reward normalization and aggregation.
Works out-of-the-box as a GRPO add-on; code is provided for veRL, TRL, and NeMo-RL.
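The order swap above can be sketched as follows. This is a minimal illustration assuming per-group z-score normalization of each reward dimension followed by summation, not the official GDPO implementation; the `gdpo_advantages` helper and the reward values are hypothetical.

```python
import numpy as np

def gdpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each reward dimension within the rollout group, THEN aggregate.

    rewards: (group_size, num_rewards). Sketch only; the official GDPO code
    may differ in aggregation, weighting, and epsilon handling.
    """
    per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return per_reward.sum(axis=1)

# Hypothetical group: rollouts 0 and 1 have the same reward *sum* (1.0), so
# GRPO-style normalization of the summed reward would make them identical.
rewards = np.array([
    [1.0, 0.0],  # correct answer, bad format
    [0.0, 1.0],  # wrong answer, good format
    [1.0, 1.0],  # correct answer, good format
    [1.0, 0.0],  # correct answer, bad format
])

adv = gdpo_advantages(rewards)
# Decoupled normalization keeps rollouts 0 and 1 distinguishable, because the
# two reward dimensions have different statistics within the group.
```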
The following similar papers were recommended by the Semantic Scholar API:
- Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization (2025)
- Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization (2025)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning (2025)
- Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning (2025)
- Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model (2025)
- ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning (2025)
- AMIR-GRPO: Inducing Implicit Preference Signals into GRPO (2026)