- RubricBench: Aligning Model-Generated Rubrics with Human Standards · arXiv 2603.01562 · Published 4 days ago · 50 upvotes
- T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning · arXiv 2603.03790 · Published 2 days ago · 94 upvotes
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents · arXiv 2505.20411 · Published May 26, 2025 · 93 upvotes
- SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale · arXiv 2602.23866 · Published 7 days ago · 66 upvotes
- SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks · arXiv 2506.10954 · Published Jun 12, 2025 · 52 upvotes
- SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? · arXiv 2507.12415 · Published Jul 16, 2025 · 43 upvotes
- SWE-World: Building Software Engineering Agents in Docker-Free Environments · arXiv 2602.03419 · Published about 1 month ago · 40 upvotes
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training · arXiv 2602.03411 · Published about 1 month ago · 37 upvotes
- SWE-QA: Can Language Models Answer Repository-level Code Questions? · arXiv 2509.14635 · Published Sep 18, 2025 · 35 upvotes
- BeyondSWE: Can Current Code Agents Survive Beyond Single-Repo Bug Fixing? · arXiv 2603.03194 · Published 2 days ago · 50 upvotes
- FeatureBench: Benchmarking Agentic Coding for Complex Feature Development · arXiv 2602.10975 · Published 22 days ago · 19 upvotes
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios · arXiv 2512.18470 · Published Dec 20, 2025 · 12 upvotes
- SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents · arXiv 2602.09447 · Published 24 days ago
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development · arXiv 2601.11077 · Published Jan 16 · 65 upvotes