Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization Paper • 2602.23008 • Published 15 days ago
Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs Paper • 2505.12929 • Published May 19, 2025
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key Paper • 2501.09695 • Published Jan 16, 2025