SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale Paper • 2602.23866 • Published Feb 27 • 89
The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation Paper • 2510.23393 • Published Oct 27, 2025 • 21
The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management Paper • 2508.21433 • Published Aug 29, 2025 • 8
Diff-XYZ: A Benchmark for Evaluating Diff Understanding Paper • 2510.12487 • Published Oct 14, 2025 • 9
Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code Paper • 2412.02764 • Published Dec 3, 2024
EnvBench: A Benchmark for Automated Environment Setup Paper • 2503.14443 • Published Mar 18, 2025 • 1