Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR Paper • 2509.02522 • Published Sep 2, 2025 • 26
Self-Improving Language Models with Bidirectional Evolutionary Search Paper • 2605.28814 • Published 28 days ago • 60
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning Paper • 2605.25604 • Published about 1 month ago • 138
SkillOpt: Executive Strategy for Self-Evolving Agent Skills Paper • 2605.23904 • Published May 22 • 246
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information Paper • 2605.11609 • Published May 12 • 196
RLPR: Extrapolating RLVR to General Domains without Verifiers Paper • 2506.18254 • Published Jun 23, 2025 • 35
Reinforcement-aware Knowledge Distillation for LLM Reasoning Paper • 2602.22495 • Published Feb 26 • 6
Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning Paper • 2602.01058 • Published Feb 1 • 45
LLM Embeddings Explained: A Visual and Intuitive Guide • 353 • How Language Models Turn Text into Meaning, From Traditional