---
language:
- multilingual
license: apache-2.0
license_name: kwaipilot-license
license_link: LICENSE
library_name: transformers
---
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="100%" alt="Kwaipilot" />
</div>

<hr>

# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding

---

## Overview

**SWE-Compass** is a unified benchmark and dataset designed to evaluate **Agentic Coding** capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.

It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.
Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.

### Key Features

- **~2,000 curated tasks** from real GitHub issues and pull requests.
- **8 task types × 8 scenarios × 10 programming languages** for comprehensive coverage.
- **Fully reproducible pipeline** including setup scripts, environment dependencies, and test suites.
- **Multi-dimensional evaluation** of correctness, reasoning traces, and agentic efficiency.

---

## Dataset Structure

```
SWE-Compass/
├── data/
│   ├── test.jsonl          # main evaluation set (~2,000 instances)
│   ├── dev.jsonl           # optional validation split
│   └── train.jsonl         # optional training data
├── scripts/
│   ├── setup_env.sh        # environment setup (dependency installation)
│   ├── run_instance.py     # run one instance end-to-end
│   └── eval_aggregate.py   # aggregate evaluation metrics
└── README.md
```

An example instance record:

```json
{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
```

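
When reading `data/test.jsonl` directly, a small sanity check on each record can catch malformed lines early. The sketch below is only illustrative: the `REQUIRED_FIELDS` set and the `parse_instance` / `load_instances` helpers are hypothetical names, not part of the released scripts, and the field names are assumed to match the example record above.

```python
import json

# Top-level keys taken from the example record above.
# Assumption: every instance in the JSONL files carries these keys.
REQUIRED_FIELDS = {
    "instance_id", "task_type", "scenario", "language",
    "instruction", "evaluation",
}


def parse_instance(line: str) -> dict:
    """Parse one JSONL line and check that the expected keys exist."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"instance is missing fields: {sorted(missing)}")
    if "test_cmd" not in record["evaluation"]:
        raise ValueError("evaluation block has no test_cmd")
    return record


def load_instances(path: str) -> list[dict]:
    """Read a JSONL file into a list of validated records, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [parse_instance(line) for line in fh if line.strip()]
```

Calling `load_instances("data/test.jsonl")` would then return a list of dictionaries, or raise `ValueError` at the first record that does not match the assumed layout.
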
## Usage

### Load via 🤗 datasets

```python
from datasets import load_dataset

# Download the evaluation split from the Hugging Face Hub
dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
```

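
To work on a slice of the benchmark, the loaded split can be narrowed by language or task type. The following is a minimal sketch, assuming the dataset columns are named `language` and `task_type` as in the example record; the `matches` helper is a hypothetical name introduced here for illustration.

```python
from typing import Optional


def matches(example: dict, language: Optional[str] = None,
            task_type: Optional[str] = None) -> bool:
    """Return True if the example satisfies every filter that was given."""
    if language is not None and example.get("language") != language:
        return False
    if task_type is not None and example.get("task_type") != task_type:
        return False
    return True


# With a `dataset` object returned by load_dataset, the predicate plugs into
# Dataset.filter (column names are assumed, see the note above):
#   subset = dataset.filter(lambda ex: matches(ex, language="python", task_type="bug_fixing"))

# The same predicate works on plain dicts, e.g. records read from data/test.jsonl.
sample = {"instance_id": "compass_01234", "language": "python", "task_type": "bug_fixing"}
print(matches(sample, language="python"))
print(matches(sample, language="java"))
```

Keeping the predicate separate from the `datasets` call makes it reusable for both the Hub-loaded split and local JSONL records.
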
### Run Local Evaluation

```bash
# Install dependencies
bash scripts/setup_env.sh
# Run one instance end-to-end
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
# Aggregate evaluation metrics over the saved runs
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
```

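
For intuition about what running one instance involves, the sketch below executes an instance's `test_cmd` under its `timeout_sec` and maps the result to a coarse status. It is not the implementation of `scripts/run_instance.py`: the `run_test_cmd` helper and the status labels `passed`, `test_fail`, and `timeout` are assumptions made for illustration, and build-error detection is left out.

```python
import subprocess


def run_test_cmd(test_cmd: str, timeout_sec: float, cwd: str = ".") -> str:
    """Run a shell test command and return 'passed', 'test_fail', or 'timeout'."""
    try:
        result = subprocess.run(
            test_cmd,
            shell=True,            # test_cmd is a shell string such as "pytest -q"
            cwd=cwd,               # directory of the checked-out repository
            capture_output=True,   # keep stdout/stderr out of the console
            text=True,
            timeout=timeout_sec,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    # A zero exit code is treated as success; anything else as a failing test run.
    return "passed" if result.returncode == 0 else "test_fail"
```

With a record shaped like the example instance, a call might look like `run_test_cmd(record["evaluation"]["test_cmd"], record["evaluation"]["timeout_sec"], cwd=repo_dir)`, where `repo_dir` is a hypothetical path to the cloned repository.
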
## Metrics

| Category | Metric | Description |
|-----------|---------|-------------|
| **Main** | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| **Process** | Tool Calls / Latency | Efficiency and reasoning stability |
| **Failure Types** | Build Error / Test Fail / Timeout | Root-cause classification |

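
The Solved@k definition in the table (fraction of tasks solved within k attempts) can be written down directly. The sketch below assumes a hypothetical input layout, a mapping from instance ID to an ordered list of per-attempt outcomes; the actual `scripts/eval_aggregate.py` may store runs differently or use another estimator. The instance IDs in the example are made up.

```python
def solved_at_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        return 0.0
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes: True means the attempt passed the instance's tests.
example_attempts = {
    "compass_00001": [False, True, False],   # solved on the second attempt
    "compass_00002": [True],                 # solved on the first attempt
    "compass_00003": [False, False, False],  # never solved
}
print(f"Solved@1 = {solved_at_k(example_attempts, 1):.3f}")
print(f"Solved@2 = {solved_at_k(example_attempts, 2):.3f}")
```
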
---

## Citation

```bibtex
@article{xu2025swecompass,
  title   = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
  author  = {Jingxuan Xu and others},
  journal = {arXiv preprint arXiv:2511.05459},
  year    = {2025}
}
```

## Contributing

We welcome community contributions: new verified instances, environment fixes, and evaluation scripts.
Please open a pull request or issue on this repository.