Defeating the trainer-generator precision mismatch in TRL Space • Running • 17
LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs Paper • 2510.16552 • Published Oct 18, 2025 • 1
CharlesLi/qwen_vl_3b_seedbench_position_3x3blocks_300step Image-Text-to-Text • 4B • Updated Sep 22, 2025 • 2
CharlesLi/qwen_vl_3b_mmbench_position_3x3blocks_300step Image-Text-to-Text • 4B • Updated Sep 22, 2025 • 3
CharlesLi/qwen_vl_3b_contrastive_qa_20_step300 Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 2
CharlesLi/qwen_vl_3b_position_3x3blocks_step300 Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 1