---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- code-evaluation
- swe-bench
- llm-judge
- code-quality
- merge-score
---

# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents

**PatchJudge** evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.

## The Problem

The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:
- **OpenAI abandoned SWE-bench Verified** because 16.4%+ of test cases are flawed
- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
- **7.8% of "correct" patches are actually wrong** — they pass tests but are incomplete/broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch

## The Solution: MergeScore

PatchJudge scores every patch on **5 dimensions**, then computes a single **MergeScore (0-100)**:

| Dimension | What It Measures | Weight |
|-----------|-----------------|--------|
| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
| **Completeness** | Are edge cases handled? Error handling present? | 20% |
| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
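
The aggregation itself is a weighted average of the five dimension scores (each judged on a 0-10 scale) rescaled to 0-100. A minimal sketch, assuming the weights above; the constant and function names here are illustrative, not the exact PatchJudge API:

```python
# Illustrative sketch only: weights mirror the table above, names are hypothetical.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def merge_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of 0-10 dimension scores, rescaled to 0-100."""
    weighted = sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
    return round(weighted * 10, 1)

# Example: a plausible but not merge-ready patch.
print(merge_score({
    "correctness": 6, "completeness": 4, "code_quality": 5,
    "non_regression_risk": 5, "merge_readiness": 4,
}))  # -> 49.5
```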

## Architecture

```
Input: {issue_text, patch_diff, test_results, repo_context}
                    │
   ┌────────────────┼────────────────┐
   ▼                ▼                ▼
Feature         LLM Judge        Score
Extractor       (structured      Aggregator
(AST, diff      5-dimension      (weighted avg
 stats)         eval)            → MergeScore)
```

## Components

### 1. Data Loader (`patchjudge/data_loader.py`)
- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
  - **OpenHands+O1**: 499 instances, 229 passed
  - **SWE-bench S3 bucket**: 139 verified agent submissions
- Builds a unified `PatchExample` record (a field-level sketch is shown below)
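
A minimal sketch of what such a record might contain; the field names are assumptions drawn from the sources above, not the exact `PatchExample` definition:

```python
from dataclasses import dataclass, field

@dataclass
class PatchExample:
    """One agent patch paired with its SWE-bench Verified task (illustrative fields)."""
    instance_id: str            # SWE-bench task id, e.g. "django__django-11099"
    problem_statement: str      # issue text from SWE-bench Verified
    agent_patch: str            # unified diff produced by the agent
    gold_patch: str             # reference patch from the benchmark
    test_passed: bool           # did the benchmark test suite pass?
    source: str                 # "coderforge", "o1", "s3", ...
    features: dict = field(default_factory=dict)  # filled in by the feature extractor
```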

### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis); a simplified sketch of these signal checks follows
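
The diff-statistic and quality-signal checks can be approximated with simple pattern counts over a patch's added lines. A rough sketch (the real extractor also runs AST-based analysis, which is omitted here):

```python
import re

def diff_signals(patch_diff: str) -> dict:
    """Count coarse diff statistics and quality/risk signals in a unified diff."""
    lines = patch_diff.splitlines()
    added = [l[1:] for l in lines if l.startswith("+") and not l.startswith("+++")]
    return {
        "files_touched": sum(l.startswith("diff --git") for l in lines),
        "hunks": sum(l.startswith("@@") for l in lines),
        "lines_added": len(added),
        "todo_markers": sum("TODO" in l or "FIXME" in l for l in added),
        "debug_statements": sum(bool(re.search(r"\bprint\(|\bpdb\b", l)) for l in added),
        "bare_excepts": sum(bool(re.match(r"\s*except\s*:", l)) for l in added),
    }
```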

### 3. LLM Judge (`patchjudge/judge.py`)
- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic (a minimal call sketch is shown below)
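
In spirit, a single judge call looks something like the following, using `huggingface_hub.InferenceClient`; the actual prompt, output schema, and retry handling in `judge.py` are more elaborate:

```python
import json
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")

PROMPT = """Score this patch on correctness, completeness, code_quality,
non_regression_risk and merge_readiness (0-10 each), with a short reason per
dimension. Reply with JSON only.

Issue:
{issue}

Patch:
{patch}
"""

def judge_patch(issue: str, patch: str, retries: int = 3) -> dict:
    """Ask the judge model for per-dimension scores; retry if the JSON is malformed."""
    for _ in range(retries):
        resp = client.chat_completion(
            messages=[{"role": "user", "content": PROMPT.format(issue=issue, patch=patch)}],
            temperature=0.1,   # low temperature for scoring consistency
            max_tokens=800,
        )
        try:
            return json.loads(resp.choices[0].message.content)
        except json.JSONDecodeError:
            continue
    raise ValueError("Judge did not return valid JSON after retries")
```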

### 4. Validation (`patchjudge/validation.py`)
- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling), sketched after this list
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis
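
The known-bad checks are largely string and regex heuristics over the added lines of a diff. A condensed sketch, with illustrative patterns:

```python
import re

KNOWN_BAD_PATTERNS = {
    # Hardcoded return value instead of a real fix
    "hardcoded_return": re.compile(r"^\+\s*return\s+(True|False|None|0|1|\[\]|\{\})\s*$", re.M),
    # Swallowing errors with a broad try/except
    "broad_except": re.compile(r"^\+\s*except(\s+Exception)?\s*:", re.M),
    # Disabling or skipping tests instead of fixing the code
    "test_disabled": re.compile(r"^\+\s*@(pytest\.mark\.skip|unittest\.skip)", re.M),
}

def known_bad_hits(patch_diff: str) -> list[str]:
    """Return the names of known-bad patterns that appear in a patch's added lines."""
    return [name for name, pat in KNOWN_BAD_PATTERNS.items() if pat.search(patch_diff)]
```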

## Dataset

999 patch examples from SWE-bench Verified:
- **526 test-passing** patches across 2 agents
- **473 test-failing** patches
- **126 synthetically generated known-bad** patches for validation
- Features extracted for all examples

## Evaluation Results (v1)

Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:

### Score Distribution
| Metric | Value |
|--------|-------|
| Mean MergeScore | **50.6/100** |
| Median MergeScore | **49.5/100** |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |

### METR Alignment ✅
- **50% of test-passing patches scored below 50** — exactly matching the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9, Test-failing mean: 42.5
- Clear separation between resolved and unresolved patches

### Per-Dimension Averages (0-10 scale)
| Dimension | Mean | Std |
|-----------|------|-----|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |

### Per-Agent Comparison
| Agent | Mean MergeScore | Patches |
|-------|----------------|---------|
| CoderForge (Qwen3-Coder-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |

### Known-Bad Detection
In earlier testing, the judge correctly identified known-bad patterns:
- **no-op patch** (adds only `pass`): 18.5/100
- **broad try/except** patches: flagged as low quality
- **hardcoded returns**: flagged as non-genuine fixes

## Quick Start

```python
from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)

print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```

## Batch Evaluation

```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```

## Key Research References

- [PatchDiff](https://arxiv.org/abs/2503.15223) — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- [CodeJudgeBench](https://arxiv.org/abs/2507.10535) — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- [SWE-smith](https://arxiv.org/abs/2504.21798) — "Scaling Data for Software Engineering Agents"
- [UTBoost](https://arxiv.org/abs/2506.09289) — "Rigorous Evaluation of Coding Agents on SWE-Bench"

## Data Sources

- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`

## License

MIT