Guided Self-Evolving LLMs with Minimal Human Supervision Paper • 2512.02472 • Published 24 days ago • 50
Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning Paper • 2512.15687 • Published 8 days ago • 17
Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values Paper • 2510.20187 • Published Oct 23 • 18
TARDIS STRIDE: A Spatio-Temporal Road Image Dataset for Exploration and Autonomy Paper • 2506.11302 • Published Jun 12 • 3
CLUE: Non-parametric Verification from Experience via Hidden-State Clustering Paper • 2510.01591 • Published Oct 2 • 27
Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation Paper • 2509.15194 • Published Sep 18 • 33
Bourbaki (7b): SOTA 7B Algorithms for Putnam Bench (Part I: Reasoning MDPs) Article • Jul 13 • 11
Reward Models 06-2025 Collection Nemotron reward models. For use in RLHF pipelines and LLM-as-a-Judge • 8 items • Updated 2 days ago • 23