---
language:
- multilingual
license: apache-2.0
license_name: kwaipilot-license
license_link: LICENSE
library_name: transformers
---
Kwaipilot

# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding

---

## 🧭 Overview

**SWE-Compass** is a unified benchmark and dataset designed to evaluate the **agentic coding** capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows. It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes. Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.

### Key Features

- **≈ 2,000 curated tasks** drawn from real GitHub issues and pull requests.
- **8 task types × 8 scenarios × 10 programming languages** for comprehensive coverage.
- **Fully reproducible pipeline**, including setup scripts, environment dependencies, and test suites.
- **Multi-dimensional evaluation** of correctness, reasoning traces, and agentic efficiency.

---

## 📁 Dataset Structure

```
SWE-Compass/
├─ data/
│  ├─ test.jsonl          # main evaluation set (~2,000 instances)
│  ├─ dev.jsonl           # optional validation split
│  └─ train.jsonl         # optional training data
├─ scripts/
│  ├─ setup_env.sh        # environment setup (dependency installation)
│  ├─ run_instance.py     # run one instance end-to-end
│  └─ eval_aggregate.py   # aggregate evaluation metrics
└─ README.md
```

Each record in the JSONL files follows this schema:

```json
{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
```

## 🧪 Usage

### Load via 🤗 `datasets`

```python
from datasets import load_dataset

dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
```

### Run Local Evaluation

```bash
bash scripts/setup_env.sh
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
```

## 📊 Metrics

| Category | Metric | Description |
|----------|--------|-------------|
| **Main** | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| **Process** | Tool Calls / Latency | Efficiency and reasoning stability |
| **Failure Types** | Build Error / Test Fail / Timeout | Root-cause classification |

---

## 🧩 Citation

```bibtex
@article{xu2025swecompass,
  title   = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
  author  = {Jingxuan Xu and others},
  journal = {arXiv preprint arXiv:2511.05459},
  year    = {2025}
}
```

## 🤝 Contributing

We welcome community contributions: new verified instances, environment fixes, or evaluation scripts. Please open a pull request or an issue on this repository.
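For orientation, the `evaluation` block in the instance schema can drive a local check along these lines. This is a simplified sketch that only assumes the field names shown in the JSON example (`setup_cmds`, `test_cmd`, `timeout_sec`); it is not the actual logic of `run_instance.py`, which also handles repository checkout and failure classification.

```python
import subprocess

def run_evaluation(instance: dict, cwd: str) -> bool:
    """Run an instance's setup commands, then its test command.

    Returns True only if every setup command and the test command exit
    with status 0. A sketch using the schema's field names; the real
    harness in run_instance.py does substantially more (cloning,
    patching, failure classification).
    """
    ev = instance["evaluation"]
    for cmd in ev["setup_cmds"]:
        # A non-zero setup exit would map to the "Build Error" failure type.
        setup = subprocess.run(cmd, shell=True, cwd=cwd,
                               timeout=ev["timeout_sec"])
        if setup.returncode != 0:
            return False
    # A non-zero test exit would map to the "Test Fail" failure type.
    test = subprocess.run(ev["test_cmd"], shell=True, cwd=cwd,
                          timeout=ev["timeout_sec"])
    return test.returncode == 0
```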
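The Solved@k metric from the Metrics section can be sketched as follows. The input format here (a mapping from instance id to a list of per-attempt pass/fail booleans) is a hypothetical stand-in for illustration, not the actual output format of `eval_aggregate.py`.

```python
# Sketch: Solved@k = fraction of instances solved within the first k attempts.
# The runs structure below is a hypothetical example, not the harness's format.

def solved_at_k(runs: dict[str, list[bool]], k: int) -> float:
    """Fraction of instances with at least one passing attempt among the first k."""
    if not runs:
        return 0.0
    solved = sum(1 for attempts in runs.values() if any(attempts[:k]))
    return solved / len(runs)

runs = {
    "compass_01234": [False, True],   # solved on attempt 2
    "compass_01235": [True],          # solved on attempt 1
    "compass_01236": [False, False],  # unsolved
}
print(solved_at_k(runs, 1))  # Solved@1 = 1/3
print(solved_at_k(runs, 2))  # Solved@2 = 2/3
```

Note that Solved@1 ≤ Solved@k by construction, so reporting both separates first-shot ability from multi-attempt recovery.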