GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization Paper • 2606.16771 • Published 10 days ago • 13
CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR Paper • 2603.10101 • Published Mar 10 • 6