---
language:
- multilingual
license: apache-2.0
license_name: kwaipilot-license
license_link: LICENSE
library_name: transformers
---
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="100%" alt="Kwaipilot" />
</div>
<hr>
# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding
---
## 🧭 Overview
**SWE-Compass** is a unified benchmark and dataset designed to evaluate **Agentic Coding** capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.
It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.
Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.
### Key Features
- **≈ 2,000 curated tasks** from real GitHub issues and pull requests.
- **8 task types × 8 scenarios × 10 programming languages** for comprehensive coverage.
- **Fully reproducible pipeline** including setup scripts, environment dependencies, and test suites.
- **Multi-dimensional evaluation** of correctness, reasoning trace, and agentic efficiency.
---
## Dataset Structure
```
SWE-Compass/
├─ data/
│  ├─ test.jsonl          # main evaluation set (~2,000 instances)
│  ├─ dev.jsonl           # optional validation split
│  └─ train.jsonl         # optional training data
├─ scripts/
│  ├─ setup_env.sh        # environment setup (dependency installation)
│  ├─ run_instance.py     # run one instance end-to-end
│  └─ eval_aggregate.py   # aggregate evaluation metrics
└─ README.md
```
Each line of a `.jsonl` file holds one task instance as a JSON object. An illustrative record:
```json
{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
```
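As a rough illustration of how records in this shape might be consumed, the sketch below parses JSONL text and tallies instances per (task type, language) pair. The helper names `parse_instances` and `coverage`, the `REQUIRED_FIELDS` tuple, and the two sample records are hypothetical and exist only for this example; only the field names are taken from the schema above.

```python
import json
from collections import Counter

# Two made-up records that reuse field names from the schema above.
SAMPLE_JSONL = """\
{"instance_id": "compass_00001", "task_type": "bug_fixing", "language": "python", "verified": true}
{"instance_id": "compass_00002", "task_type": "bug_fixing", "language": "go", "verified": false}
"""

# Assumed minimum set of fields; the real files may require more.
REQUIRED_FIELDS = ("instance_id", "task_type", "language")


def parse_instances(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    instances = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        instances.append(record)
    return instances


def coverage(instances, verified_only=False):
    """Count instances per (task_type, language) pair."""
    return Counter(
        (inst["task_type"], inst["language"])
        for inst in instances
        if not verified_only or inst.get("verified", False)
    )


instances = parse_instances(SAMPLE_JSONL)
print(coverage(instances))
```

The same tally could be extended with `scenario` to check coverage across all three dimensions listed under Key Features.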
# 🧪 Usage
## Load via 🤗 datasets
```python
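from datasets import load_dataset

# Load the main evaluation split from the Hugging Face Hub.
dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
# Number of instances, then the field names of the first record.
print(len(dataset), dataset[0].keys())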
```
## Run Local Evaluation
```bash
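# Install dependencies for the evaluation environment.
bash scripts/setup_env.sh
# Run a single instance end-to-end.
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
# Aggregate metrics over saved model outputs.
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs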
```
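The internals of `scripts/run_instance.py` are not shown here. As a minimal sketch of how the `evaluation` block of an instance could be executed, the hypothetical `run_evaluation` helper below runs each setup command, then the test command, and maps the outcome to the failure types listed under Metrics. It assumes the commands contain no shell operators (they are split with `shlex`), that a failing setup command counts as a build error, and that `timeout_sec` applies to each command separately; none of this is confirmed by the repository.

```python
import shlex
import subprocess
import sys
import tempfile


def run_evaluation(evaluation, workdir):
    """Run setup commands, then the test command, and classify the outcome.

    Returns one of: "solved", "build_error", "test_fail", "timeout".
    """
    timeout = evaluation.get("timeout_sec", 1800)
    try:
        for cmd in evaluation.get("setup_cmds", []):
            setup = subprocess.run(
                shlex.split(cmd), cwd=workdir, timeout=timeout, capture_output=True
            )
            # Assumption: any failing setup command is a build error.
            if setup.returncode != 0:
                return "build_error"
        tests = subprocess.run(
            shlex.split(evaluation["test_cmd"]),
            cwd=workdir,
            timeout=timeout,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "solved" if tests.returncode == 0 else "test_fail"


# Harmless stand-in commands so the sketch runs without a real repository.
py = shlex.quote(sys.executable)
demo = {
    "setup_cmds": [f"{py} -c pass"],
    "test_cmd": f"{py} -c 'raise SystemExit(1)'",
    "timeout_sec": 30,
}
with tempfile.TemporaryDirectory() as tmp:
    print(run_evaluation(demo, tmp))
```

A real runner would also need to check out the recorded commit and apply the model's patch before this step, as described in the Overview.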
# Metrics
| Category | Metric | Description |
|-----------|---------|-------------|
| **Main** | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| **Process** | Tool Calls / Latency | Efficiency and reasoning stability |
| **Failure Types** | Build Error / Test Fail / Timeout | Root-cause classification |
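Taking the table's description of Solved@k literally, the metric can be sketched as the fraction of instances with at least one successful attempt among the first k. The `solved_at_k` function and the sample outcomes below are hypothetical, and `scripts/eval_aggregate.py` may compute the metric differently.

```python
def solved_at_k(attempts, k):
    """Fraction of instances with at least one success in the first k attempts.

    `attempts` maps instance_id -> list of booleans, one per attempt, in order.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        return 0.0
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Made-up outcomes for three hypothetical instances.
runs = {
    "compass_00001": [True, False, False],
    "compass_00002": [False, False, True],
    "compass_00003": [False, False, False],
}
print(solved_at_k(runs, 1))  # 1 of 3 instances solved on the first attempt
print(solved_at_k(runs, 3))  # 2 of 3 instances solved within three attempts
```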
---
# 🧩 Citation
```bibtex
@article{xu2025swecompass,
title = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
author = {Jingxuan Xu and others},
journal = {arXiv preprint arXiv:2511.05459},
year = {2025}
}
```
# Contributing
We welcome community contributions: new verified instances, environment fixes, or evaluation scripts.
Please open a pull request or issue on this repository.