	---
	language:
	- multilingual
	license: apache-2.0
	license_name: kwaipilot-license
	license_link: LICENSE
	library_name: transformers
	---
	<div align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/KIYEa1c_WJEWPpeS0L_k1.png" width="100%" alt="Kwaipilot" />
	</div>

	<hr>


	# SWE-Compass: Unified Evaluation Benchmark for Agentic Coding


	---

	## 🧭 Overview

	SWE-Compass is a unified benchmark and dataset for evaluating the agentic coding capabilities of large language models (LLMs) in realistic, multi-step software engineering workflows.

	It bridges the gap between conventional static code-generation benchmarks and real-world, tool-driven development processes.
	Each instance corresponds to a reproducible issue-fixing or feature-implementation task that can be executed end-to-end: cloning a repository, applying patches, running tests, and verifying solutions.

	### Key Features

	- ≈ 2,000 curated tasks from real GitHub issues and pull requests.
	- 8 task types × 8 scenarios × 10 programming languages for comprehensive coverage.
	- Fully reproducible pipeline including setup scripts, environment dependencies, and test suites.
	- Multi-dimensional evaluation of correctness, reasoning trace, and agentic efficiency.

	---

	## 📁 Dataset Structure

	```
	SWE-Compass/
	├─ data/
	│  ├─ test.jsonl          # main evaluation set (~2,000 instances)
	│  ├─ dev.jsonl           # optional validation split
	│  └─ train.jsonl         # optional training data
	├─ scripts/
	│  ├─ setup_env.sh        # environment setup (dependency installation)
	│  ├─ run_instance.py     # run one instance end-to-end
	│  └─ eval_aggregate.py   # aggregate evaluation metrics
	└─ README.md
	```


	```json
	{
	  "instance_id": "compass_01234",
	  "task_type": "bug_fixing",
	  "scenario": "mono_repo_ci",
	  "language": "python",
	  "difficulty": "medium",
	  "source": {
	    "repo": "owner/project",
	    "commit": "abcdef123456",
	    "issue_or_pr": "PR#13091",
	    "gh_url": "https://github.com/owner/project/pull/13091"
	  },
	  "instruction": "Fix failing test in module X caused by Y...",
	  "context_files": ["path/to/file1.py", "path/to/file2.py"],
	  "tools_available": ["git", "pytest", "bash"],
	  "evaluation": {
	    "setup_cmds": ["pip install -e .", "pytest -q"],
	    "test_cmd": "pytest -q",
	    "timeout_sec": 1800
	  },
	  "reference_patch": "diff --git a/... b/...",
	  "verified": true
	}
	```
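A record like the one above can be read and sanity-checked with the standard library alone. The following is a minimal sketch, not part of the shipped scripts: the helper names `parse_instance` and `read_instances` are hypothetical, and treating the listed keys as required is an assumption based only on the example record.

```python
import json

# Keys taken from the example record; treating them as required is an
# assumption of this sketch, not a documented schema guarantee.
REQUIRED_KEYS = {
    "instance_id", "task_type", "scenario", "language",
    "instruction", "evaluation",
}


def parse_instance(line: str) -> dict:
    """Parse one JSONL line and check that the expected keys are present."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"instance is missing keys: {sorted(missing)}")
    if "test_cmd" not in record["evaluation"]:
        raise ValueError("evaluation block has no test_cmd")
    return record


def read_instances(path: str) -> list[dict]:
    """Read every non-empty line of a JSONL file into a list of records."""
    with open(path, encoding="utf-8") as handle:
        return [parse_instance(line) for line in handle if line.strip()]
```

For example, `read_instances("data/test.jsonl")` would return one dictionary per task, assuming the file follows the layout shown in the directory tree.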

	# 🧪 Usage
	## Load via 🤗 datasets

	```python
	from datasets import load_dataset

	dataset = load_dataset("Kwaipilot/SWE-Compass", split="test")
	print(len(dataset), dataset[0].keys())
	```
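Once loaded, the split can be narrowed to a single task type or language. The sketch below is one way to do that in plain Python; `select_instances` and `coverage` are hypothetical helpers, and the field names are assumed to match the example record in the Dataset Structure section. It accepts any iterable of dictionaries, which includes a loaded `datasets` split since iterating one yields a dictionary per row.

```python
from collections import Counter
from typing import Iterable, Optional


def select_instances(
    records: Iterable[dict],
    task_type: Optional[str] = None,
    language: Optional[str] = None,
) -> list[dict]:
    """Keep records that match every filter that was given."""
    selected = []
    for record in records:
        if task_type is not None and record["task_type"] != task_type:
            continue
        if language is not None and record["language"] != language:
            continue
        selected.append(record)
    return selected


def coverage(records: Iterable[dict]) -> Counter:
    """Count instances per (task_type, language) pair."""
    return Counter((r["task_type"], r["language"]) for r in records)
```

A call such as `select_instances(dataset, task_type="bug_fixing", language="python")` would return only the Python bug-fixing tasks, and `coverage(dataset).most_common(5)` gives a quick view of how the instances are distributed.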

	## Run Local Evaluation

	```bash
	bash scripts/setup_env.sh
	python scripts/run_instance.py --data data/test.jsonl --instance_id compass_01234
	python scripts/eval_aggregate.py --data data/test.jsonl --runs ./runs/your_model_outputs
	```
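To make the flow concrete, here is a rough sketch of how the `evaluation` block of an instance could be executed. It is not the logic of `scripts/run_instance.py`: the function `run_evaluation` is hypothetical, the outcome labels are borrowed from the failure types listed under Metrics, and mapping a failed setup command to a build error is an assumption. Commands run through the shell, so this should only be used on trusted instances, ideally inside a sandbox or container.

```python
import subprocess
import time


def run_evaluation(evaluation: dict, workdir: str = ".") -> str:
    """Run setup and test commands under one shared time budget.

    Returns "solved", "build_error", "test_fail", or "timeout".
    """
    # Default budget mirrors the example record; this is an assumption.
    deadline = time.monotonic() + evaluation.get("timeout_sec", 1800)

    # Pair each command with the label to report if it exits non-zero.
    commands = [(cmd, "build_error") for cmd in evaluation.get("setup_cmds", [])]
    commands.append((evaluation["test_cmd"], "test_fail"))

    for cmd, failure_label in commands:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "timeout"
        try:
            result = subprocess.run(
                cmd,
                shell=True,
                cwd=workdir,
                capture_output=True,
                timeout=remaining,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
        if result.returncode != 0:
            return failure_label
    return "solved"
```

The single deadline treats `timeout_sec` as a budget for the whole instance rather than for each command; if the benchmark intends a per-command limit, the `timeout` argument would be passed unchanged instead.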

	# 📊 Metrics

	| Category | Metric | Description |
	|-----------|---------|-------------|
	| Main | Solved@1 / Solved@k | Fraction of tasks solved within k attempts |
	| Process | Tool Calls / Latency | Efficiency and reasoning stability |
	| Failure Types | Build Error / Test Fail / Timeout | Root-cause classification |
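The Solved@k definition in the table can be written out as a short function. This is a sketch under stated assumptions: the input format (a mapping from instance ID to an ordered list of pass/fail outcomes) and the name `solved_at_k` are hypothetical, and the output format of `eval_aggregate.py` may differ.

```python
def solved_at_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Fraction of instances with at least one success in the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        return 0.0
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Three instances: solved on the first try, on the second try, and never.
runs = {
    "compass_00001": [True],
    "compass_00002": [False, True],
    "compass_00003": [False, False],
}
print(solved_at_k(runs, 1))  # one of three instances solved
print(solved_at_k(runs, 2))  # two of three instances solved
```

Instances with fewer than k recorded attempts are judged on the attempts they have, which is a design choice of this sketch rather than a documented rule.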

	---


	# 🧩 Citation

	```bibtex
	@article{xu2025swecompass,
	  title   = {SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
	  author  = {Jingxuan Xu and others},
	  journal = {arXiv preprint arXiv:2511.05459},
	  year    = {2025}
	}
	```

	# 🤝 Contributing
	We welcome community contributions: new verified instances, environment fixes, or evaluation scripts.
	Please open a pull request or issue on this repository.