BeyondSWE: Can Current Code Agents Survive Beyond Single-Repo Bug Fixing?
Abstract
Current code agent benchmarks fail to capture real-world complexity, prompting the creation of BeyondSWE to evaluate broader reasoning and knowledge scopes, alongside SearchSWE to study external knowledge integration in coding tasks.
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes, resolution scope and knowledge scope, using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
Community
BeyondSWE evaluates code agents beyond single-repo bug fixing with 500 real-world tasks across 4 challenging dimensions: cross-repo reasoning, domain-specific scientific coding, dependency migration, and full repo generation from specs — where even frontier models plateau below 45%.
SearchSWE integrates web search into the coding agent workflow, revealing a surprising disconnect: search doesn't reliably help and sometimes hurts, suggesting that search and coding capabilities have matured independently and their integration remains an open problem.
- SWE-bench Verified is 80%+ solved, but BeyondSWE shows current agents are far from real-world ready
- No single model dominates across all task types — different tasks expose different weaknesses
- Models that search more don't perform better; search quality matters far more than frequency
- Code-specialized models (e.g., Seed-Coder) actually degrade more with search than general-purpose ones