---
language:
- multilingual
license: apache-2.0
license_name: kwaipilot-license
license_link: LICENSE
library_name: transformers
---
# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding
---
## 🧭 Overview
**SWE-Compass** is a unified benchmark and dataset designed to evaluate **Agentic Coding** capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.
It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.
Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.
### Key Features
- **≈2,000 curated tasks** from real GitHub issues and pull requests.
- **8 task types × 8 scenarios × 10 programming languages** for comprehensive coverage.
- **Fully reproducible pipeline** including setup scripts, environment dependencies, and test suites.
- **Multi-dimensional evaluation** of correctness, reasoning trace, and agentic efficiency.
---
## 📁 Dataset Structure
```
SWE-Compass/
├── data/
│   ├── test.jsonl          # main evaluation set (~2,000 instances)
│   ├── dev.jsonl           # optional validation split
│   └── train.jsonl         # optional training data
├── scripts/
│   ├── setup_env.sh        # environment setup (dependency installation)
│   ├── run_instance.py     # run one instance end-to-end
│   └── eval_aggregate.py   # aggregate evaluation metrics
└── README.md
```
Each line of `test.jsonl` is a JSON object with the following schema (illustrative values):
```json
{
"instance_id": "compass_01234",
"task_type": "bug_fixing",
"scenario": "mono_repo_ci",
"language": "python",
"difficulty": "medium",
"source": {
"repo": "owner/project",
"commit": "abcdef123456",
"issue_or_pr": "PR#13091",
"gh_url": "https://github.com/owner/project/pull/13091"
},
"instruction": "Fix failing test in module X caused by Y...",
"context_files": ["path/to/file1.py", "path/to/file2.py"],
"tools_available": ["git", "pytest", "bash"],
"evaluation": {
"setup_cmds": ["pip install -e .", "pytest -q"],
"test_cmd": "pytest -q",
"timeout_sec": 1800
},
"reference_patch": "diff --git a/... b/...",
"verified": true
}
```
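The schema above can be consumed with nothing more than the standard library; a minimal sketch of parsing one JSONL record and pulling out the fields an evaluation harness needs (the record here reuses the illustrative values from the example, not a real instance):

```python
import json

# One line of data/test.jsonl, serialized here inline so the sketch is
# self-contained; real records carry the full schema shown above.
record_line = json.dumps({
    "instance_id": "compass_01234",
    "task_type": "bug_fixing",
    "language": "python",
    "evaluation": {
        "setup_cmds": ["pip install -e .", "pytest -q"],
        "test_cmd": "pytest -q",
        "timeout_sec": 1800,
    },
})

record = json.loads(record_line)
# A harness would run setup_cmds, then test_cmd, bounded by timeout_sec.
print(record["instance_id"], record["evaluation"]["test_cmd"])
```

Iterating a real split is the same loop over each line of the `.jsonl` file.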
## 🧪 Usage
### Load via 🤗 `datasets`
```python
from datasets import load_dataset
dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
```
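Once loaded, instances can be sliced by the taxonomy fields (`language`, `task_type`, `scenario`). A minimal sketch of per-language filtering, shown on plain dicts so it runs without downloading the dataset; a `datasets.Dataset` supports the same predicate via `dataset.filter(...)`. The `task_type` values other than `bug_fixing` are placeholders, not the dataset's actual label set:

```python
# Stand-in rows; real rows carry the full schema shown earlier.
rows = [
    {"instance_id": "a", "language": "python", "task_type": "bug_fixing"},
    {"instance_id": "b", "language": "go",     "task_type": "bug_fixing"},
    {"instance_id": "c", "language": "python", "task_type": "bug_fixing"},
]

# Keep only Python instances.
python_rows = [r for r in rows if r["language"] == "python"]
print(len(python_rows))  # 2
```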
### Run Local Evaluation
```bash
bash scripts/setup_env.sh
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
```
## 📊 Metrics
| Category | Metric | Description |
|-----------|---------|-------------|
| **Main** | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| **Process** | Tool Calls / Latency | Number of tool invocations and wall-clock latency per instance |
| **Failure Types** | Build Error / Test Fail / Timeout | Root-cause classification |
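The main metric above, Solved@k, is the fraction of instances with at least one passing attempt among the first k. A minimal sketch of that aggregation (the actual logic lives in `scripts/eval_aggregate.py` and may differ; the run data here is fabricated for illustration):

```python
def solved_at_k(attempts_by_instance, k):
    """Fraction of instances solved within the first k attempts."""
    solved = sum(
        any(a["passed"] for a in attempts[:k])
        for attempts in attempts_by_instance.values()
    )
    return solved / len(attempts_by_instance)

# Hypothetical per-instance attempt logs: each attempt records pass/fail.
runs = {
    "compass_01234": [{"passed": False}, {"passed": True}],
    "compass_01235": [{"passed": False}, {"passed": False}],
}
print(solved_at_k(runs, 1))  # 0.0 -- neither instance passes on attempt 1
print(solved_at_k(runs, 2))  # 0.5 -- one of two solved within 2 attempts
```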
---
## 🧩 Citation
```bibtex
@article{xu2025swecompass,
title = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
author = {Jingxuan Xu and others},
journal = {arXiv preprint arXiv:2511.05459},
year = {2025}
}
```
## 🤝 Contributing
We welcome community contributions: new verified instances, environment fixes, or evaluation scripts.
Please open a pull request or issue on this repository.