---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- code-evaluation
- swe-bench
- llm-judge
- code-quality
- merge-score
---
# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents
**PatchJudge** evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.
## The Problem
The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:
- **OpenAI abandoned SWE-bench Verified** after more than 16.4% of its test cases were found to be flawed
- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
- **7.8% of "correct" patches are actually wrong** — they pass tests but are incomplete/broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch
## The Solution: MergeScore
PatchJudge scores every patch on **5 dimensions**, then computes a single **MergeScore (0-100)**:
| Dimension | What It Measures | Weight |
|-----------|-----------------|--------|
| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
| **Completeness** | Are edge cases handled? Error handling present? | 20% |
| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
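The aggregation step can be sketched as a weighted average over the five dimensions above. A minimal sketch, assuming each dimension is scored on the 0-10 scale used in the results tables below (the function name and dictionary keys are illustrative, not the actual API):

```python
# Weights follow the table above. Dimension scores are assumed to be 0-10,
# so the weighted average is rescaled to a 0-100 MergeScore.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def merge_score(dimension_scores: dict) -> float:
    """Weighted average of 0-10 dimension scores, rescaled to 0-100."""
    total = sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
    return round(total * 10, 1)
```

For example, a patch scored 8/6/7/5/6 across the five dimensions would land at 66.5/100.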
## Architecture
```
Input: {issue_text, patch_diff, test_results, repo_context}
│
┌────────────────┼────────────────┐
▼ ▼ ▼
Feature LLM Judge Score
Extractor (structured Aggregator
(AST, diff 5-dimension (weighted avg
stats) eval) → MergeScore)
```
## Components
### 1. Data Loader (`patchjudge/data_loader.py`)
- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
- **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
- **OpenHands+O1**: 499 instances, 229 passed
- **SWE-bench S3 bucket**: 139 verified agent submissions
- Builds unified `PatchExample` format
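A unified record like `PatchExample` might look as follows. This is a sketch only; the field names are illustrative and the actual schema lives in `patchjudge/data_loader.py`:

```python
from dataclasses import dataclass, field

@dataclass
class PatchExample:
    """One agent patch paired with its task context.
    Field names are illustrative, not the project's actual schema."""
    instance_id: str        # e.g. a SWE-bench task ID
    problem_statement: str  # issue text from SWE-bench
    agent_patch: str        # unified diff produced by the agent
    gold_patch: str         # reference patch from the repo maintainers
    test_passed: bool       # did the SWE-bench test suite pass?
    source: str             # "coderforge", "o1", "s3", ...
    features: dict = field(default_factory=dict)  # extractor output
```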
### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)
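The diff statistics above (files, lines, hunks) can be computed directly from the unified-diff text. A simplified sketch of that idea, not the project's actual implementation:

```python
def diff_stats(patch: str) -> dict:
    """Count files, hunks, and added/removed lines in a unified diff."""
    files = hunks = added = removed = 0
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            files += 1
        elif line.startswith("@@"):
            hunks += 1
        # "+++" / "---" are file headers, not content changes
        elif line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"files": files, "hunks": hunks, "added": added, "removed": removed}
```

Signals like this feed the scope and risk heuristics: a one-line fix touching one file reads very differently from a 300-line diff across ten files.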
### 3. LLM Judge (`patchjudge/judge.py`)
- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic
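The robust-parsing-with-retry step can be sketched as: extract the first JSON object from the model's free-text reply, and re-request if parsing fails. Function names and the retry count here are illustrative:

```python
import json
import re

def parse_judge_json(reply: str):
    """Pull the first {...} block out of a model reply and parse it.
    Returns None if no valid JSON object is found."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def judge_with_retry(call_model, prompt: str, retries: int = 3) -> dict:
    """Call the model up to `retries` times until valid JSON comes back."""
    for _ in range(retries):
        parsed = parse_judge_json(call_model(prompt))
        if parsed is not None:
            return parsed
    raise ValueError("judge returned no parseable JSON")
```

Chat models often wrap their verdict in prose ("Here is my assessment: {...}"), so stripping to the JSON object before parsing avoids most failures; the retry loop catches the rest.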
### 4. Validation (`patchjudge/validation.py`)
- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis
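The known-bad pattern detection amounts to textual checks on a patch's added lines. A hedged sketch, with illustrative regexes (the real checks may differ, and the hardcoded-return check in particular is a noisy heuristic):

```python
import re

# Regexes applied to *added* diff lines. These are illustrative, not the
# project's actual patterns; "hardcoded_return" will also hit legitimate code.
KNOWN_BAD_PATTERNS = {
    "broad_except": re.compile(r"except\s*(Exception\s*)?:"),
    "hardcoded_return": re.compile(r"return\s+(True|False|None|-?\d+)\s*$"),
    "test_disabling": re.compile(r"@(pytest\.mark\.skip|unittest\.skip)"),
}

def flag_known_bad(patch: str) -> list:
    """Return names of suspicious patterns found in a patch's added lines."""
    added = [l[1:] for l in patch.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    return [name for name, pattern in KNOWN_BAD_PATTERNS.items()
            if any(pattern.search(line) for line in added)]
```

Synthetic patches built from exactly these patterns are what the 126 known-bad validation examples exercise: a judge that scores them highly is not measuring merge-worthiness.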
## Dataset
999 patch examples from SWE-bench Verified:
- **526 test-passing** patches across 2 agents
- **473 test-failing** patches
- **126 synthetically generated known-bad** patches for validation
- Features extracted for all examples
## Evaluation Results (v1)
Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:
### Score Distribution
| Metric | Value |
|--------|-------|
| Mean MergeScore | **50.6/100** |
| Median MergeScore | **49.5/100** |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |
### METR Alignment ✅
- **50% of test-passing patches scored below 50** — exactly matching the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9, Test-failing mean: 42.5
- Clear separation between resolved and unresolved patches
### Per-Dimension Averages (0-10 scale)
| Dimension | Mean | Std |
|-----------|------|-----|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |
### Per-Agent Comparison
| Agent | Mean MergeScore | Patches |
|-------|----------------|---------|
| CoderForge (Qwen3-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |
### Known-Bad Detection
In earlier testing, the judge correctly identified known-bad patterns:
- **no-op patch** (just adds `pass`): 18.5/100
- **broad try/except** patches: flagged as low quality
- **hardcoded returns**: flagged as non-genuine fixes
## Quick Start
```python
from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)
print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```
## Batch Evaluation
```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```
## Key Research References
- [PatchDiff](https://arxiv.org/abs/2503.15223) — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- [CodeJudgeBench](https://arxiv.org/abs/2507.10535) — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- [SWE-smith](https://arxiv.org/abs/2504.21798) — "Scaling Data for Software Engineering Agents"
- [UTBoost](https://arxiv.org/abs/2506.09289) — "Rigorous Evaluation of Coding Agents on SWE-Bench"
## Data Sources
- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`
## License
MIT