arXiv:2603.03194

BeyondSWE: Can Current Code Agents Survive Beyond Single-Repo Bug Fixing?

Published on Mar 3 · Submitted by Guoxin Chen on Mar 4

Abstract

AI-generated summary: Current code agent benchmarks fail to capture real-world complexity, prompting the creation of BeyondSWE to evaluate broader reasoning and knowledge scopes, alongside SearchSWE to study external knowledge integration in coding tasks.

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes (resolution scope and knowledge scope), using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

Community

Paper submitter

BeyondSWE evaluates code agents beyond single-repo bug fixing with 500 real-world tasks across 4 challenging dimensions: cross-repo reasoning, domain-specific scientific coding, dependency migration, and full repo generation from specs — where even frontier models plateau below 45%.
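
For concreteness, here is a minimal Python sketch of how a benchmark instance spanning those four settings might be represented and scored per setting. The schema, field names, and helper function are illustrative assumptions on my part, not the paper's actual data format.

```python
# Hypothetical sketch of a BeyondSWE-style task instance and per-setting
# scoring. All names below are assumptions, not the paper's real schema.
from dataclasses import dataclass
from enum import Enum

class TaskSetting(Enum):
    CROSS_REPO = "cross-repo reasoning"
    DOMAIN_SPECIFIC = "domain-specific scientific coding"
    DEPENDENCY_MIGRATION = "dependency migration"
    REPO_GENERATION = "full repo generation from specs"

@dataclass
class TaskInstance:
    instance_id: str
    setting: TaskSetting
    repos: list[str]         # one or more repositories involved in the task
    problem_statement: str   # issue text, migration notes, or a spec
    test_command: str        # command whose exit status decides success

def success_rate(results: dict[str, bool],
                 instances: list[TaskInstance],
                 setting: TaskSetting) -> float:
    """Success rate restricted to one setting; the paper's finding that no
    single model dominates would show up as uneven rates across settings."""
    subset = [i for i in instances if i.setting is setting]
    if not subset:
        return 0.0
    return sum(results[i.instance_id] for i in subset) / len(subset)
```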

SearchSWE integrates web search into the coding agent workflow, revealing a surprising disconnect: search doesn't reliably help and sometimes hurts, suggesting that search and coding capabilities have matured independently and their integration remains an open problem.
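
A minimal sketch of what such an interleaved search-and-code loop could look like is below; every function name (llm, search, apply_patch, run_tests) is a hypothetical stand-in, not SearchSWE's actual API.

```python
# Sketch of an agent loop that interleaves web search with coding actions,
# in the spirit of SearchSWE. All callables are hypothetical placeholders.
from typing import Callable

def agent_loop(task: str,
               llm: Callable[[str], str],
               search: Callable[[str], str],
               apply_patch: Callable[[str], None],
               run_tests: Callable[[], bool],
               max_steps: int = 10) -> bool:
    context = task
    for _ in range(max_steps):
        # The model chooses between gathering external knowledge and coding.
        decision = llm(
            f"{context}\n\nReply 'SEARCH: <query>' to look something up "
            f"or 'PATCH: <diff>' to edit the code."
        )
        if decision.startswith("SEARCH:"):
            # Search results are appended to context. Per the paper, this
            # step helps inconsistently and can even hurt performance.
            context += "\n" + search(decision[len("SEARCH:"):].strip())
        elif decision.startswith("PATCH:"):
            apply_patch(decision[len("PATCH:"):].strip())
            if run_tests():
                return True
            context += "\nTests still failing."
    return False
```

The paper's negative result suggests the hard part is not wiring these pieces together, but teaching the model when the search branch is actually worth taking.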

  • SWE-bench Verified is 80%+ solved, but BeyondSWE shows current agents are far from real-world ready
  • No single model dominates across all task types — different tasks expose different weaknesses
  • Models that search more don't perform better; search quality matters far more than frequency
  • Code-specialized models (e.g., Seed-Coder) actually degrade more with search than general-purpose ones

