Safe and Scalable Web Agent Learning via Recreated Websites Paper • 2603.10505 • Published Mar 11 • 27
Do Vision-Language Models Respect Contextual Integrity in Location Disclosure? Paper • 2602.05023 • Published Feb 4 • 2
Learning to Reason as Action Abstractions with Scalable Mid-Training RL Paper • 2509.25810 • Published Sep 30, 2025 • 6
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning Paper • 2505.20561 • Published May 26, 2025 • 7
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer Paper • 2405.16436 • Published May 26, 2024 • 1
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs Paper • 2410.08067 • Published Oct 10, 2024 • 2
DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs Paper • 2411.13611 • Published Nov 20, 2024
Offline Reinforcement Learning for LLM Multi-Step Reasoning Paper • 2412.16145 • Published Dec 20, 2024 • 38
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment Paper • 2405.19332 • Published May 29, 2024 • 22
GeorgiaTech/0.0005_llama_nodpo_3iters_bs128_531lr_oldtrl_iter_3 Text Generation • 8B • Updated May 13, 2024 • 4