Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents Paper • 2601.18217 • Published 3 days ago • 8
Self-rewarding correction for mathematical reasoning Paper • 2502.19613 • Published Feb 26, 2025 • 82
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer Paper • 2405.16436 • Published May 26, 2024 • 1