arxiv:2602.05946

f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Published on Feb 5

Abstract

Preference alignment objectives are extended to general alignment settings using f-divergence variational representations, introducing novel on-policy and hybrid policy optimization methods for LLM alignment with theoretical and empirical validation.

AI-generated summary

Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose two classes of objectives for general LLM alignment, both based on the variational representation of f-divergences: f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning objectives, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA (Safety Alignment) tasks, demonstrating superior performance and flexibility compared to current methods.
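For readers unfamiliar with the variational representation of f-divergences mentioned in the abstract, the following is a minimal sketch of the standard textbook identity for the KL case, where f(u) = u log u and the convex conjugate is f*(t) = exp(t - 1). It is not the paper's f-GRPO or f-HAL objective. The distributions `p` and `q`, the function names, and the critic values are all hypothetical choices made for illustration. The sketch checks that E_P[T] - E_Q[f*(T)] equals KL(P || Q) at the optimal critic T* = 1 + log(p/q), and is a lower bound for another critic.

```python
import math

def kl_divergence(p, q):
    """Exact KL(P || Q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def variational_bound(p, q, critic):
    """E_P[T] - E_Q[f*(T)] for the KL generator, where f*(t) = exp(t - 1)."""
    expectation_p = sum(pi * t for pi, t in zip(p, critic))
    expectation_q = sum(qi * math.exp(t - 1.0) for qi, t in zip(q, critic))
    return expectation_p - expectation_q

# Hypothetical toy distributions (illustration only, not from the paper).
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

# The bound is tight at T*(x) = f'(p(x) / q(x)) = 1 + log(p(x) / q(x)).
optimal_critic = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]
# Any other critic gives a lower bound on the divergence.
arbitrary_critic = [0.5, 0.0, -0.5]

exact = kl_divergence(p, q)
tight = variational_bound(p, q, optimal_critic)
loose = variational_bound(p, q, arbitrary_critic)

print(f"exact KL:        {exact:.6f}")
print(f"optimal critic:  {tight:.6f}")
print(f"arbitrary critic: {loose:.6f}")
```

In this toy setting the critic is a fixed table of numbers. Objectives built on the variational representation usually parameterize the critic and maximize the bound instead, but how the paper does this is not described on this page.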

Community

rhaldar97

Paper submitter about 4 hours ago

