---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- code-evaluation
- swe-bench
- llm-judge
- code-quality
- merge-score
---
# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents
**PatchJudge** evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.
## The Problem
The industry default for evaluating coding agents is a single question: "does the test suite pass?" That signal is broken:
- **OpenAI abandoned SWE-bench Verified** because 16.4%+ of test cases are flawed
- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
- **7.8% of "correct" patches are actually wrong** — they pass tests but are incomplete/broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch
## The Solution: MergeScore
PatchJudge scores every patch on **5 dimensions**, then computes a single **MergeScore (0-100)**:
| Dimension | What It Measures | Weight |
|-----------|-----------------|--------|
| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
| **Completeness** | Are edge cases handled? Error handling present? | 20% |
| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
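To make the aggregation concrete, here is a minimal sketch assuming each dimension is scored 0-10 by the judge (as in the per-dimension results below) and combined as a weighted mean scaled to 0-100; the function name and exact formula are illustrative, not the project's actual aggregator API.

```python
# Illustrative MergeScore aggregation (assumed formula: weighted mean of
# 0-10 dimension scores, scaled to 0-100). Weights match the table above.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def merge_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into a single 0-100 MergeScore."""
    weighted = sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
    return round(weighted * 10, 1)

# Example: a patch that fixes the issue but is rough around the edges
print(merge_score({
    "correctness": 7, "completeness": 5, "code_quality": 6,
    "non_regression_risk": 6, "merge_readiness": 5,
}))  # -> 59.5
```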
## Architecture
```
Input: {issue_text, patch_diff, test_results, repo_context}
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
   Feature           LLM Judge          Score
   Extractor         (structured        Aggregator
   (AST, diff        5-dimension        (weighted avg
   stats)            eval)              → MergeScore)
```
## Components
### 1. Data Loader (`patchjudge/data_loader.py`)
- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
  - **OpenHands+O1**: 499 instances, 229 passed
  - **SWE-bench S3 bucket**: 139 verified agent submissions
- Builds unified `PatchExample` format
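The unified record is roughly a dataclass like the sketch below; the field names are assumptions based on the inputs listed under Architecture and the sources above, not the actual `PatchExample` definition.

```python
from dataclasses import dataclass, field

# Assumed shape of the unified record built by data_loader.py; field names
# are illustrative and may differ from the real PatchExample.
@dataclass
class PatchExample:
    instance_id: str            # SWE-bench task id, e.g. "astropy__astropy-12907"
    problem_statement: str      # issue text
    agent_patch: str            # unified diff produced by the agent
    gold_patch: str             # reference patch from SWE-bench Verified
    test_passed: bool           # did the agent patch pass the task's tests?
    source: str                 # "coderforge", "o1", "s3", or "synthetic"
    features: dict = field(default_factory=dict)  # filled in by the feature extractor
```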
### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)
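A minimal sketch of the kind of signals involved, computed here with plain string and regex heuristics; the real extractor is AST-based and more thorough, so treat the names and thresholds below as illustrative only.

```python
import re

def extract_features(patch_diff: str, issue_text: str) -> dict:
    """Toy version of a few diff/quality signals described above."""
    added = [l[1:] for l in patch_diff.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    removed = [l[1:] for l in patch_diff.splitlines()
               if l.startswith("-") and not l.startswith("---")]
    files = re.findall(r"^diff --git a/(\S+)", patch_diff, flags=re.M)

    # Issue-patch alignment via naive keyword overlap
    issue_words = set(re.findall(r"[a-zA-Z_]{4,}", issue_text.lower()))
    patch_words = set(re.findall(r"[a-zA-Z_]{4,}", " ".join(added).lower()))
    overlap = len(issue_words & patch_words) / max(len(issue_words), 1)

    return {
        "files_changed": len(files),
        "lines_added": len(added),
        "lines_removed": len(removed),
        "issue_keyword_overlap": round(overlap, 2),
        "has_todo": any("TODO" in l for l in added),
        "has_debug_print": any(re.search(r"\bprint\(", l) for l in added),
    }
```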
### 3. LLM Judge (`patchjudge/judge.py`)
- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic
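A sketch of what the judge call could look like using `huggingface_hub`'s `InferenceClient`; the prompt, response schema, and retry logic are simplified stand-ins for what `judge.py` actually does.

```python
import json
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")

PROMPT = """Score this patch on correctness, completeness, code_quality,
non_regression_risk, and merge_readiness (each 0-10). Reply with JSON only:
{{"scores": {{...}}, "reasoning": {{...}}}}

Issue:
{issue}

Patch:
{patch}
"""

def judge_patch(issue: str, patch: str, max_retries: int = 3) -> dict:
    """Query the judge model at low temperature and parse its JSON answer."""
    for attempt in range(max_retries):
        resp = client.chat_completion(
            messages=[{"role": "user",
                       "content": PROMPT.format(issue=issue, patch=patch)}],
            temperature=0.1,   # low temperature for scoring consistency
            max_tokens=1024,
        )
        text = resp.choices[0].message.content
        try:
            # Tolerate prose around the JSON object
            return json.loads(text[text.index("{"): text.rindex("}") + 1])
        except (ValueError, json.JSONDecodeError):
            continue
    raise RuntimeError("Judge did not return valid JSON")
```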
### 4. Validation (`patchjudge/validation.py`)
- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis
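A minimal sketch of known-bad pattern detection over a patch's added lines; the pattern names mirror the list above, but the regexes are illustrative and the real checks in `validation.py` are assumed to be richer.

```python
import re

# Regexes over the added lines of a diff; illustrative, not the real rule set
KNOWN_BAD_PATTERNS = {
    "hardcoded_return": re.compile(
        r"^\s*return\s+(True|False|None|\d+|'[^']*'|\"[^\"]*\")\s*$"),
    "broad_except": re.compile(r"except\s*(Exception\s*)?:\s*(pass)?\s*$"),
    "disabled_test": re.compile(r"@(unittest\.)?skip|pytest\.mark\.skip"),
}

def detect_known_bad(patch_diff: str) -> list[str]:
    """Return the names of suspicious patterns found in a patch's added lines."""
    added = [l[1:] for l in patch_diff.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    return sorted({name for name, rx in KNOWN_BAD_PATTERNS.items()
                   for line in added if rx.search(line)})
```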
## Dataset
999 patch examples from SWE-bench Verified:
- **526 test-passing** patches across 2 agents
- **473 test-failing** patches
- **126 synthetically generated known-bad** patches for validation
- Features extracted for all examples
## Evaluation Results (v1)
Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:
### Score Distribution
| Metric | Value |
|--------|-------|
| Mean MergeScore | **50.6/100** |
| Median MergeScore | **49.5/100** |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |
### METR Alignment ✅
- **50% of test-passing patches scored below 50**, consistent with the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9, Test-failing mean: 42.5
- Clear separation between resolved and unresolved patches
### Per-Dimension Averages (0-10 scale)
| Dimension | Mean | Std |
|-----------|------|-----|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |
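Assuming MergeScore is the weighted mean of these dimensions scaled to 100, the reported averages are internally consistent: 0.30×5.8 + 0.20×4.3 + 0.20×5.1 + 0.15×5.2 + 0.15×4.5 ≈ 5.08, i.e. ≈ 50.8/100, close to the observed mean MergeScore of 50.6.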
### Per-Agent Comparison
| Agent | Mean MergeScore | Patches |
|-------|----------------|---------|
| CoderForge (Qwen3-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |
### Known-Bad Detection
In earlier testing, the judge correctly identified known-bad patterns:
- **noop patch** (just adds `pass`): 18.5/100
- **broad try/except** patches: flagged as low quality
- **hardcoded returns**: flagged as non-genuine fixes
## Quick Start
```python
from patchjudge.judge import PatchJudge, quick_judge
# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)
print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```
## Batch Evaluation
```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```
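Based on the flag names, `--sources coderforge,o1` presumably selects which agent runs to load, `--judge-count 100` caps how many patches are sent to the LLM judge, and `--validate-known-bad` runs the known-bad pattern checks afterwards; check `run_patchjudge.py` itself for the authoritative option list.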
## Key Research References
- [PatchDiff](https://arxiv.org/abs/2503.15223) — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- [CodeJudgeBench](https://arxiv.org/abs/2507.10535) — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- [SWE-smith](https://arxiv.org/abs/2504.21798) — "Scaling Data for Software Engineering Agents"
- [UTBoost](https://arxiv.org/abs/2506.09289) — "Rigorous Evaluation of Coding Agents on SWE-Bench"
## Data Sources
- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`
## License
MIT