MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization • Paper • 2605.10784 • Published May 11
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning • Paper • 2605.02913 • Published Apr 8