Reinforcing Few-step Generators via Reward-Tilted Distribution Matching Paper • 2605.26108 • Published May 25 • 7
Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning Paper • 2606.10968 • Published 27 days ago • 42
Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models Paper • 2606.11025 • Published 27 days ago • 41
RTDMD Collection Reinforcing Few-step Generators via Reward-Tilted Distribution Matching • 5 items • Updated Jun 2 • 3
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search Paper • 2509.25454 • Published Sep 29, 2025 • 147
Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts Paper • 2509.23188 • Published Sep 27, 2025 • 3
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR Paper • 2508.14029 • Published Aug 19, 2025 • 119
Optimizing Anytime Reasoning via Budget Relative Policy Optimization Paper • 2505.13438 • Published May 19, 2025 • 36
Understanding R1-Zero-Like Training: A Critical Perspective Paper • 2503.20783 • Published Mar 26, 2025 • 60