---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- code-evaluation
- swe-bench
- llm-judge
- code-quality
- merge-score
---
# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents
**PatchJudge** evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.
## The Problem
The industry default for evaluating coding agents is a single question: "does the test suite pass?" That signal is broken:
- **OpenAI abandoned SWE-bench Verified** because 16.4%+ of test cases are flawed
- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
- **7.8% of "correct" patches are actually wrong** — they pass tests but are incomplete/broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch
## The Solution: MergeScore
PatchJudge scores every patch on **5 dimensions**, then computes a single **MergeScore (0-100)**:
| Dimension | What It Measures | Weight |
|-----------|-----------------|--------|
| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
| **Completeness** | Are edge cases handled? Error handling present? | 20% |
| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
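To make the aggregation concrete, here is a minimal sketch assuming each dimension is scored 0-10 by the judge (as in the per-dimension results below) and combined as a weighted mean scaled to 0-100; the function name and exact formula are illustrative, not the project's actual aggregator API.

```python
# Illustrative MergeScore aggregation (assumed formula: weighted mean of
# 0-10 dimension scores, scaled to 0-100). Weights match the table above.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def merge_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into a single 0-100 MergeScore."""
    weighted = sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
    return round(weighted * 10, 1)

# Example: a patch that fixes the issue but is rough around the edges
print(merge_score({
    "correctness": 7, "completeness": 5, "code_quality": 6,
    "non_regression_risk": 6, "merge_readiness": 5,
}))  # -> 59.5
```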
## Architecture
```
Input: {issue_text, patch_diff, test_results, repo_context}
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
   Feature           LLM Judge          Score
   Extractor         (structured        Aggregator
   (AST, diff        5-dimension        (weighted avg
   stats)            eval)              → MergeScore)
```
## Components
### 1. Data Loader (`patchjudge/data_loader.py`)
- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
  - **OpenHands+O1**: 499 instances, 229 passed
  - **SWE-bench S3 bucket**: 139 verified agent submissions
- Builds unified `PatchExample` format
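The unified record is roughly a dataclass like the sketch below; the field names are assumptions based on the inputs listed under Architecture and the sources above, not the actual `PatchExample` definition.

```python
from dataclasses import dataclass, field

# Assumed shape of the unified record built by data_loader.py; field names
# are illustrative and may differ from the real PatchExample.
@dataclass
class PatchExample:
    instance_id: str            # SWE-bench task id, e.g. "astropy__astropy-12907"
    problem_statement: str      # issue text
    agent_patch: str            # unified diff produced by the agent
    gold_patch: str             # reference patch from SWE-bench Verified
    test_passed: bool           # did the agent patch pass the task's tests?
    source: str                 # "coderforge", "o1", "s3", or "synthetic"
    features: dict = field(default_factory=dict)  # filled in by the feature extractor
```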
### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)
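A minimal sketch of the kind of signals involved, computed here with plain string and regex heuristics; the real extractor is AST-based and more thorough, so treat the names and thresholds below as illustrative only.

```python
import re

def extract_features(patch_diff: str, issue_text: str) -> dict:
    """Toy version of a few diff/quality signals described above."""
    added = [l[1:] for l in patch_diff.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    removed = [l[1:] for l in patch_diff.splitlines()
               if l.startswith("-") and not l.startswith("---")]
    files = re.findall(r"^diff --git a/(\S+)", patch_diff, flags=re.M)

    # Issue-patch alignment via naive keyword overlap
    issue_words = set(re.findall(r"[a-zA-Z_]{4,}", issue_text.lower()))
    patch_words = set(re.findall(r"[a-zA-Z_]{4,}", " ".join(added).lower()))
    overlap = len(issue_words & patch_words) / max(len(issue_words), 1)

    return {
        "files_changed": len(files),
        "lines_added": len(added),
        "lines_removed": len(removed),
        "issue_keyword_overlap": round(overlap, 2),
        "has_todo": any("TODO" in l for l in added),
        "has_debug_print": any(re.search(r"\bprint\(", l) for l in added),
    }
```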
### 3. LLM Judge (`patchjudge/judge.py`)
- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic
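A sketch of what the judge call could look like using `huggingface_hub`'s `InferenceClient`; the prompt, response schema, and retry logic are simplified stand-ins for what `judge.py` actually does.

```python
import json
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")

PROMPT = """Score this patch on correctness, completeness, code_quality,
non_regression_risk, and merge_readiness (each 0-10). Reply with JSON only:
{{"scores": {{...}}, "reasoning": {{...}}}}

Issue:
{issue}

Patch:
{patch}
"""

def judge_patch(issue: str, patch: str, max_retries: int = 3) -> dict:
    """Query the judge model at low temperature and parse its JSON answer."""
    for attempt in range(max_retries):
        resp = client.chat_completion(
            messages=[{"role": "user",
                       "content": PROMPT.format(issue=issue, patch=patch)}],
            temperature=0.1,   # low temperature for scoring consistency
            max_tokens=1024,
        )
        text = resp.choices[0].message.content
        try:
            # Tolerate prose around the JSON object
            return json.loads(text[text.index("{"): text.rindex("}") + 1])
        except (ValueError, json.JSONDecodeError):
            continue
    raise RuntimeError("Judge did not return valid JSON")
```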
### 4. Validation (`patchjudge/validation.py`)
- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis
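A minimal sketch of known-bad pattern detection over a patch's added lines; the pattern names mirror the list above, but the regexes are illustrative and the real checks in `validation.py` are assumed to be richer.

```python
import re

# Regexes over the added lines of a diff; illustrative, not the real rule set
KNOWN_BAD_PATTERNS = {
    "hardcoded_return": re.compile(
        r"^\s*return\s+(True|False|None|\d+|'[^']*'|\"[^\"]*\")\s*$"),
    "broad_except": re.compile(r"except\s*(Exception\s*)?:\s*(pass)?\s*$"),
    "disabled_test": re.compile(r"@(unittest\.)?skip|pytest\.mark\.skip"),
}

def detect_known_bad(patch_diff: str) -> list[str]:
    """Return the names of suspicious patterns found in a patch's added lines."""
    added = [l[1:] for l in patch_diff.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    return sorted({name for name, rx in KNOWN_BAD_PATTERNS.items()
                   for line in added if rx.search(line)})
```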
## Dataset
999 patch examples from SWE-bench Verified:
- **526 test-passing** patches across 2 agents
- **473 test-failing** patches
- **126 synthetically generated known-bad** patches for validation
- Features extracted for all examples
## Evaluation Results (v1)
Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:
### Score Distribution
| Metric | Value |
|--------|-------|
| Mean MergeScore | **50.6/100** |
| Median MergeScore | **49.5/100** |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |
### METR Alignment ✅
- **50% of test-passing patches scored below 50**, consistent with the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9, Test-failing mean: 42.5
- Clear separation between resolved and unresolved patches
### Per-Dimension Averages (0-10 scale)
| Dimension | Mean | Std |
|-----------|------|-----|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |
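Assuming MergeScore is the weighted mean of these dimensions scaled to 100, the reported averages are internally consistent: 0.30×5.8 + 0.20×4.3 + 0.20×5.1 + 0.15×5.2 + 0.15×4.5 ≈ 5.08, i.e. ≈ 50.8/100, close to the observed mean MergeScore of 50.6.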
### Per-Agent Comparison
| Agent | Mean MergeScore | Patches |
|-------|----------------|---------|
| CoderForge (Qwen3-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |
### Known-Bad Detection
In earlier testing, the judge correctly identified known-bad patterns:
- **noop patch** (just adds `pass`): 18.5/100
- **broad try/except** patches: flagged as low quality
- **hardcoded returns**: flagged as non-genuine fixes
## Quick Start
```python
from patchjudge.judge import PatchJudge, quick_judge
# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)
print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```
## Batch Evaluation
```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```
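Based on the flag names, `--sources coderforge,o1` presumably selects which agent runs to load, `--judge-count 100` caps how many patches are sent to the LLM judge, and `--validate-known-bad` runs the known-bad pattern checks afterwards; check `run_patchjudge.py` itself for the authoritative option list.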
## Key Research References
- [PatchDiff](https://arxiv.org/abs/2503.15223) — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- [CodeJudgeBench](https://arxiv.org/abs/2507.10535) — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- [SWE-smith](https://arxiv.org/abs/2504.21798) — "Scaling Data for Software Engineering Agents"
- [UTBoost](https://arxiv.org/abs/2506.09289) — "Rigorous Evaluation of Coding Agents on SWE-Bench"
## Data Sources
- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`
## License
MIT