---
language:
- multilingual
license: apache-2.0
license_name: kwaipilot-license
license_link: LICENSE
library_name: transformers
---
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="100%" alt="Kwaipilot" />
</div>
<hr>
# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding
---
## 🧭 Overview
**SWE-Compass** is a unified benchmark and dataset designed to evaluate **Agentic Coding** capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.
It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.
Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.
### Key Features
- **≈ 2,000 curated tasks** from real GitHub issues and pull requests.
- **8 task types × 8 scenarios × 10 programming languages** for comprehensive coverage.
- **Fully reproducible pipeline** including setup scripts, environment dependencies, and test suites.
- **Multi-dimensional evaluation** of correctness, reasoning trace, and agentic efficiency.
---
## 📁 Dataset Structure
```
SWE-Compass/
├─ data/
│  ├─ test.jsonl          # main evaluation set (~2,000 instances)
│  ├─ dev.jsonl           # optional validation split
│  └─ train.jsonl         # optional training data
├─ scripts/
│  ├─ setup_env.sh        # environment setup (dependency installation)
│  ├─ run_instance.py     # run one instance end-to-end
│  └─ eval_aggregate.py   # aggregate evaluation metrics
└─ README.md
```
Each line of a `.jsonl` split file is a single JSON object describing one instance. An illustrative record:
```json
{
  "instance_id": "compass_01234",
  "task_type": "bug_fixing",
  "scenario": "mono_repo_ci",
  "language": "python",
  "difficulty": "medium",
  "source": {
    "repo": "owner/project",
    "commit": "abcdef123456",
    "issue_or_pr": "PR#13091",
    "gh_url": "https://github.com/owner/project/pull/13091"
  },
  "instruction": "Fix failing test in module X caused by Y...",
  "context_files": ["path/to/file1.py", "path/to/file2.py"],
  "tools_available": ["git", "pytest", "bash"],
  "evaluation": {
    "setup_cmds": ["pip install -e .", "pytest -q"],
    "test_cmd": "pytest -q",
    "timeout_sec": 1800
  },
  "reference_patch": "diff --git a/... b/...",
  "verified": true
}
```
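The record layout above can be read with the standard library alone. The following is a minimal sketch, assuming each line of a split file holds one JSON object with the fields shown in the example. The helper name `load_instances` and the `REQUIRED_KEYS` set are illustrative and are not part of the released scripts.

```python
import json
from typing import Iterable, Iterator

# Illustrative subset of the fields in the example record (an assumption,
# not an official schema).
REQUIRED_KEYS = {"instance_id", "task_type", "language", "instruction", "evaluation"}


def load_instances(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, checking for required keys."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        yield record
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `list(load_instances(open("data/test.jsonl", encoding="utf-8")))`.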
## 🧪 Usage
### Load via 🤗 datasets
```python
from datasets import load_dataset
dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
print(len(dataset), dataset[0].keys())
```
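Once the split is loaded, it can be useful to check how instances are distributed across task types and languages. The sketch below assumes rows expose the `task_type` and `language` fields from the example record. `count_by` is a hypothetical helper, not part of `datasets` or of this repository.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by(rows: Iterable[Mapping], *fields: str) -> Counter:
    """Count rows grouped by the values of the given field names."""
    return Counter(tuple(row[f] for f in fields) for row in rows)
```

Iterating over a `datasets.Dataset` yields one dict per row, so `count_by(dataset, "task_type", "language").most_common(5)` would list the five largest groups.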
### Run Local Evaluation
```bash
bash scripts/setup_env.sh
python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
```
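To make the role of the `evaluation` block concrete, here is a rough sketch of how its `setup_cmds`, `test_cmd`, and `timeout_sec` fields could be executed. It is not the logic of `scripts/run_instance.py`, which may differ. The function name `run_evaluation`, the outcome labels, and the choice to apply the timeout to each command separately are all assumptions. The labels loosely mirror the failure types listed in the Metrics table.

```python
import subprocess
from typing import Mapping


def run_evaluation(evaluation: Mapping, cwd: str = ".") -> str:
    """Run setup and test commands; return a coarse outcome label."""
    # Assumption: the timeout applies to each command, not to the whole run.
    timeout = evaluation.get("timeout_sec")
    try:
        for cmd in evaluation.get("setup_cmds", []):
            # shell=True because the commands are stored as shell strings.
            done = subprocess.run(cmd, shell=True, cwd=cwd, timeout=timeout)
            if done.returncode != 0:
                return "build_error"
        test = subprocess.run(
            evaluation["test_cmd"], shell=True, cwd=cwd, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "solved" if test.returncode == 0 else "test_fail"
```

Since the commands come from dataset records and run through a shell, a sketch like this is best run inside a container or other sandbox rather than on a development machine.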
## 📊 Metrics
| Category | Metric | Description |
|-----------|---------|-------------|
| **Main** | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
| **Process** | Tool Calls / Latency | Efficiency and reasoning stability |
| **Failure Types** | Build Error / Test Fail / Timeout | Root-cause classification |
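
A literal reading of the Solved@k definition in the table can be sketched as follows. The input layout (a mapping from instance ID to an ordered list of pass/fail results) and the name `solved_at_k` are assumptions for illustration. The aggregation in `scripts/eval_aggregate.py` may be computed differently.

```python
from typing import Mapping, Sequence


def solved_at_k(attempts: Mapping[str, Sequence[bool]], k: int) -> float:
    """Fraction of tasks with at least one passing attempt among the first k."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if not attempts:
        return 0.0
    # A task counts as solved if any of its first k attempts passed.
    solved = sum(any(results[:k]) for results in attempts.values())
    return solved / len(attempts)
```

With `k=1` this reduces to Solved@1, the share of tasks whose first attempt passes.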
---
## 🧩 Citation
```bibtex
@article{xu2025swecompass,
title = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
author = {Jingxuan Xu and others},
journal = {arXiv preprint arXiv:2511.05459},
year = {2025}
}
```
## 🤝 Contributing
We welcome community contributions: new verified instances, environment fixes, or evaluation scripts.
Please open a pull request or issue on this repository.