---
language:
- multilingual
license: apache-2.0
license_name: kwaipilot-license
license_link: LICENSE
library_name: transformers
---
Kwaipilot

# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding

---

## 🧭 Overview

**SWE-Compass** is a unified benchmark and dataset designed to evaluate the **agentic coding** capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows. It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes. Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.

### Key Features

- **≈ 2,000 curated tasks** drawn from real GitHub issues and pull requests.
- **8 task types × 8 scenarios × 10 programming languages** for comprehensive coverage.
- **Fully reproducible pipeline**, including setup scripts, environment dependencies, and test suites.
- **Multi-dimensional evaluation** of correctness, reasoning traces, and agentic efficiency.

---

## 📁 Dataset Structure

```
SWE-Compass/
├─ data/
│  ├─ test.jsonl          # main evaluation set (~2,000 instances)
│  ├─ dev.jsonl           # optional validation split
│  └─ train.jsonl         # optional training data
├─ scripts/
│  ├─ setup_env.sh        # environment setup (dependency installation)
│  ├─ run_instance.py     # run one instance end-to-end
│  └─ eval_aggregate.py   # aggregate evaluation metrics
└─ README.md
```

Each record in the JSONL files follows this schema:

```json
{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
```

## 🧪 Usage

### Load via 🤗 `datasets`

```python
from datasets import load_dataset

dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
```

### Run Local Evaluation

```bash
bash scripts/setup_env.sh
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
```

## 📊 Metrics

| Category | Metric | Description |
|----------|--------|-------------|
| **Main** | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| **Process** | Tool Calls / Latency | Efficiency and reasoning stability |
| **Failure Types** | Build Error / Test Fail / Timeout | Root-cause classification |

---

## 🧩 Citation

```bibtex
@article{xu2025swecompass,
  title   = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
  author  = {Jingxuan Xu and others},
  journal = {arXiv preprint arXiv:2511.05459},
  year    = {2025}
}
```

## 🤝 Contributing

We welcome community contributions: new verified instances, environment fixes, or evaluation scripts. Please open a pull request or an issue on this repository.
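For orientation, the `evaluation` block in the instance schema can drive a local check along these lines. This is a simplified sketch that only assumes the field names shown in the JSON example (`setup_cmds`, `test_cmd`, `timeout_sec`); it is not the actual logic of `run_instance.py`, which also handles repository checkout and failure classification.

```python
import subprocess

def run_evaluation(instance: dict, cwd: str) -> bool:
    """Run an instance's setup commands, then its test command.

    Returns True only if every setup command and the test command exit
    with status 0. A sketch using the schema's field names; the real
    harness in run_instance.py does substantially more (cloning,
    patching, failure classification).
    """
    ev = instance["evaluation"]
    for cmd in ev["setup_cmds"]:
        # A non-zero setup exit would map to the "Build Error" failure type.
        setup = subprocess.run(cmd, shell=True, cwd=cwd,
                               timeout=ev["timeout_sec"])
        if setup.returncode != 0:
            return False
    # A non-zero test exit would map to the "Test Fail" failure type.
    test = subprocess.run(ev["test_cmd"], shell=True, cwd=cwd,
                          timeout=ev["timeout_sec"])
    return test.returncode == 0
```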
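The Solved@k metric from the Metrics section can be sketched as follows. The input format here (a mapping from instance id to a list of per-attempt pass/fail booleans) is a hypothetical stand-in for illustration, not the actual output format of `eval_aggregate.py`.

```python
# Sketch: Solved@k = fraction of instances solved within the first k attempts.
# The runs structure below is a hypothetical example, not the harness's format.

def solved_at_k(runs: dict[str, list[bool]], k: int) -> float:
    """Fraction of instances with at least one passing attempt among the first k."""
    if not runs:
        return 0.0
    solved = sum(1 for attempts in runs.values() if any(attempts[:k]))
    return solved / len(runs)

runs = {
    "compass_01234": [False, True],   # solved on attempt 2
    "compass_01235": [True],          # solved on attempt 1
    "compass_01236": [False, False],  # unsolved
}
print(solved_at_k(runs, 1))  # Solved@1 = 1/3
print(solved_at_k(runs, 2))  # Solved@2 = 2/3
```

Note that Solved@1 ≤ Solved@k by construction, so reporting both separates first-shot ability from multi-attempt recovery.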