---
language:
- multilingual
license: apache-2.0
license_name: kwaipilot-license
license_link: LICENSE
library_name: transformers
---
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="100%" alt="Kwaipilot" />
</div>

<hr>


# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding


---

## 🧭 Overview

**SWE-Compass** is a unified benchmark and dataset designed to evaluate **Agentic Coding** capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.

It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.  
Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.

### Key Features

- **≈ 2,000 curated tasks** from real GitHub issues and pull requests.
- **8 task types × 8 scenarios × 10 programming languages** for comprehensive coverage.
- **Fully reproducible pipeline** including setup scripts, environment dependencies, and test suites.  
- **Multi-dimensional evaluation** of correctness, reasoning trace, and agentic efficiency.

---

## πŸ“ Dataset Structure

```
SWE-Compass/
├─ data/
│  ├─ test.jsonl              # main evaluation set (~2,000 instances)
│  ├─ dev.jsonl               # optional validation split
│  └─ train.jsonl             # optional training data
├─ scripts/
│  ├─ setup_env.sh            # environment setup (dependency installation)
│  ├─ run_instance.py         # run one instance end-to-end
│  └─ eval_aggregate.py       # aggregate evaluation metrics
└─ README.md
```

Each line of a split file holds one instance as a JSON object. An illustrative record:


```json
{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
```
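
As a rough illustration of working with records in this shape, the sketch below reads JSONL lines and counts instances per `(task_type, language)` pair. The helper names `read_jsonl` and `coverage` and the two-record sample are hypothetical and not part of the repository; only the field names come from the example record above.

```python
import io
import json
from collections import Counter


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def coverage(records):
    """Count instances per (task_type, language) pair."""
    return Counter((r["task_type"], r["language"]) for r in records)


# Hypothetical two-record sample standing in for data/test.jsonl.
sample = io.StringIO(
    '{"instance_id": "compass_00001", "task_type": "bug_fixing", "language": "python"}\n'
    '{"instance_id": "compass_00002", "task_type": "bug_fixing", "language": "go"}\n'
)

counts = coverage(read_jsonl(sample))
for (task_type, language), n in sorted(counts.items()):
    print(task_type, language, n)
```

To run the same tally on a real split, pass an open file handle (for example `open("data/test.jsonl", encoding="utf-8")`) in place of `sample`.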

# 🧪 Usage
## Load via 🤗 datasets

```python
from datasets import load_dataset

dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
```

## Run Local Evaluation

```bash
bash scripts/setup_env.sh
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
```
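
The internals of `scripts/run_instance.py` are not shown here, so the following is only a minimal sketch of how the `evaluation` block of a record could be executed. It assumes that setup commands run first, that a failing setup command counts as a build error, and that `timeout_sec` applies to each command separately. The function name `run_evaluation` and the outcome labels are hypothetical.

```python
import subprocess


def run_evaluation(evaluation, cwd="."):
    """Run setup commands, then the test command, and return an outcome label.

    Assumption: labels mirror the failure types used for reporting
    ("build_error", "test_fail", "timeout"), plus "solved" on success.
    """
    timeout = evaluation.get("timeout_sec", 1800)
    try:
        # Any failing setup command is treated as a build error.
        for cmd in evaluation.get("setup_cmds", []):
            setup = subprocess.run(cmd, shell=True, cwd=cwd, timeout=timeout)
            if setup.returncode != 0:
                return "build_error"
        # The test command decides between solved and test_fail.
        result = subprocess.run(
            evaluation["test_cmd"], shell=True, cwd=cwd, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "solved" if result.returncode == 0 else "test_fail"
```

In practice the commands would run inside a checkout of the instance's repository at the recorded commit, with the candidate patch applied, and ideally inside an isolated environment, since `shell=True` executes the strings as given.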

# 📊 Metrics

| Category | Metric | Description |
|-----------|---------|-------------|
| **Main** | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| **Process** | Tool Calls / Latency | Efficiency and reasoning stability |
| **Failure Types** | Build Error / Test Fail / Timeout | Root-cause classification |
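
To make the main metric concrete, here is a small sketch of Solved@k computed directly from the definition in the table: the fraction of tasks with at least one successful attempt among the first k. The input layout (a dict from instance ID to an ordered list of per-attempt booleans) and the function name `solved_at_k` are assumptions for illustration; `scripts/eval_aggregate.py` may use a different format or estimator.

```python
def solved_at_k(attempts, k):
    """Fraction of instances solved by any of their first k attempts.

    attempts: dict mapping instance_id -> list of booleans, in attempt order.
    """
    if not attempts:
        return 0.0
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical results for four instances.
runs = {
    "compass_00001": [False, True],
    "compass_00002": [True],
    "compass_00003": [False, False],
    "compass_00004": [False, False],
}

print(solved_at_k(runs, 1))  # 0.25
print(solved_at_k(runs, 2))  # 0.5
```

This is the plain empirical fraction, not the unbiased pass@k estimator used in some code-generation benchmarks; the two coincide only when exactly k attempts are sampled per task.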

---


# 🧩 Citation

```bibtex
@article{xu2025swecompass,
  title   = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
  author  = {Jingxuan Xu and others},
  journal = {arXiv preprint arXiv:2511.05459},
  year    = {2025}
}
```

# 🤝 Contributing
We welcome community contributions: new verified instances, environment fixes, or evaluation scripts.
Please open a pull request or issue on this repository.