You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass Paper • 2604.10966 • Published 3 days ago • 6
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation Paper • 2604.13010 • Published 2 days ago • 4
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe Paper • 2604.13016 • Published 2 days ago • 57
ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement Paper • 2604.01591 • Published 14 days ago • 40
Embarrassingly Simple Self-Distillation Improves Code Generation Paper • 2604.01193 • Published 14 days ago • 37
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights Paper • 2510.04800 • Published Oct 6, 2025 • 37
Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data Paper • 2510.03264 • Published Sep 26, 2025 • 25
andreasskyscanner/llama-31-hhrlhf-squad-rlhf-policy-model Text Generation • 1B • Updated Jul 1, 2025 • 1