Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex • Paper • 2605.06139 • Published 5 days ago • 57
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models • Paper • 2602.01970 • Published Feb 2 • 2
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? • Paper • 2507.04632 • Published Jul 7, 2025 • 2