AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems Paper • 2601.11354 • Published Jan 16 • 4
Pre-Trained Policy Discriminators are General Reward Models Paper • 2507.05197 • Published Jul 7, 2025 • 39
A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models Paper • 2505.07591 • Published May 12, 2025 • 11
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning Paper • 2501.06590 • Published Jan 11, 2025 • 11
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents Paper • 2407.16741 • Published Jul 23, 2024 • 78
ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks Paper • 2311.09835 • Published Nov 16, 2023 • 11