---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - code-evaluation
  - swe-bench
  - llm-judge
  - code-quality
  - merge-score
---
# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents
PatchJudge evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.
## The Problem
The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:
- OpenAI abandoned SWE-bench Verified because 16.4%+ of test cases are flawed
- METR found ~50% of test-passing PRs wouldn't be merged into real codebases
- 7.8% of "correct" patches are actually wrong — they pass tests but are incomplete/broken (PatchDiff, 2503.15223)
- 23% of SWE-bench tasks can be "solved" by a trivial regex patch
## The Solution: MergeScore
PatchJudge scores every patch on 5 dimensions, then computes a single MergeScore (0-100):
| Dimension | What It Measures | Weight |
|---|---|---|
| Correctness | Does the fix address the actual issue, not just the test? | 30% |
| Completeness | Are edge cases handled? Error handling present? | 20% |
| Code Quality | Clean, idiomatic, maintainable, follows conventions? | 20% |
| Non-Regression Risk | Could this break unrelated functionality? | 15% |
| Merge-Readiness | Would a senior engineer approve this PR as-is? | 15% |
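
For concreteness, here is a minimal sketch of the aggregation step, assuming each dimension comes back from the judge on a 0-10 scale (matching the per-dimension results reported below); the function name is illustrative, not the package's actual API:

```python
# Illustrative only: weights mirror the table above.
DIMENSION_WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def aggregate_merge_score(dimension_scores: dict[str, float]) -> float:
    """Combine 0-10 per-dimension scores into a 0-100 MergeScore (weighted average x 10)."""
    weighted = sum(DIMENSION_WEIGHTS[d] * dimension_scores[d] for d in DIMENSION_WEIGHTS)
    return round(weighted * 10, 1)

# Example: a plausible but not clearly merge-ready patch.
print(aggregate_merge_score({
    "correctness": 6, "completeness": 4, "code_quality": 5,
    "non_regression_risk": 5, "merge_readiness": 4,
}))  # -> 49.5
```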
## Architecture

```
Input: {issue_text, patch_diff, test_results, repo_context}
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
    Feature          LLM Judge          Score
    Extractor        (structured        Aggregator
    (AST, diff       5-dimension        (weighted avg
    stats)           eval)              → MergeScore)
```
## Components
### 1. Data Loader (`patchjudge/data_loader.py`)

- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - CoderForge (Qwen3-Coder-32B): 500 instances, 297 passed
  - OpenHands+O1: 499 instances, 229 passed
  - SWE-bench S3 bucket: 139 verified agent submissions
- Builds a unified `PatchExample` format (sketched below)
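
A rough idea of what that unified record could look like; the field names below are illustrative, not necessarily those defined in `patchjudge/data_loader.py`:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a unified patch record; real field names may differ.
@dataclass
class PatchExample:
    instance_id: str        # SWE-bench Verified task id
    source: str             # agent that produced the patch, e.g. "coderforge" or "o1"
    problem_statement: str  # issue text from the benchmark task
    agent_patch: str        # unified diff produced by the agent
    gold_patch: str         # reference patch shipped with SWE-bench Verified
    test_passed: bool       # outcome of the task's test suite
    features: dict = field(default_factory=dict)  # filled in later by the Feature Extractor
```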
### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)
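
The sketch below shows the kind of lightweight, diff-level signals this stage can produce with only the standard library; the actual extractor also performs AST analysis, which is omitted here, and the regexes and field names are assumptions:

```python
import re

# Hypothetical feature extraction over a unified diff; patterns are illustrative.
def extract_features(patch_diff: str, issue_text: str) -> dict:
    added = [l[1:] for l in patch_diff.splitlines() if l.startswith("+") and not l.startswith("+++")]
    removed = [l[1:] for l in patch_diff.splitlines() if l.startswith("-") and not l.startswith("---")]
    files = re.findall(r"^diff --git a/(\S+)", patch_diff, flags=re.MULTILINE)
    issue_terms = set(re.findall(r"[a-zA-Z_]{4,}", issue_text.lower()))
    patch_terms = set(re.findall(r"[a-zA-Z_]{4,}", " ".join(added).lower()))
    return {
        # Diff statistics
        "files_touched": len(files),
        "lines_added": len(added),
        "lines_removed": len(removed),
        "hunks": patch_diff.count("@@") // 2,
        # Code-quality smells in added lines
        "todo_count": sum("TODO" in l or "FIXME" in l for l in added),
        "debug_statements": sum("print(" in l or "pdb.set_trace" in l for l in added),
        # Crude issue-patch alignment via keyword overlap
        "issue_overlap": len(issue_terms & patch_terms) / max(len(issue_terms), 1),
    }
```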
### 3. LLM Judge (`patchjudge/judge.py`)
- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic
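
A hedged sketch of how one might call the judge model and parse its structured output; the use of `huggingface_hub.InferenceClient` and the prompt wording are assumptions here, not a transcript of what `patchjudge/judge.py` actually does:

```python
import json
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")

PROMPT = """You are reviewing a code patch for the issue below.
Score it 0-10 on correctness, completeness, code_quality, non_regression_risk, merge_readiness.
Respond with JSON only, e.g.
{{"correctness": 7, "completeness": 5, "code_quality": 6,
  "non_regression_risk": 6, "merge_readiness": 5, "reasoning": "..."}}

Issue:
{issue}

Patch:
{patch}
"""

def judge_patch(issue: str, patch: str, retries: int = 3) -> dict:
    """Query the judge at low temperature and retry until the reply parses as JSON."""
    for _ in range(retries):
        reply = client.chat_completion(
            messages=[{"role": "user", "content": PROMPT.format(issue=issue, patch=patch)}],
            temperature=0.1,
            max_tokens=512,
        )
        text = reply.choices[0].message.content
        try:
            # Extract the outermost JSON object even if the model adds surrounding prose.
            return json.loads(text[text.index("{"): text.rindex("}") + 1])
        except ValueError:
            continue
    raise RuntimeError("Judge did not return valid JSON after retries")
```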
### 4. Validation (`patchjudge/validation.py`)
- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis
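
A sketch of the known-bad pattern checks and the METR-style alignment statistic; the patterns and the 50-point threshold are assumptions about what such a validator might look for, not the actual contents of `patchjudge/validation.py`:

```python
import re

# Illustrative heuristics for patches that pass tests without genuinely fixing the issue.
KNOWN_BAD_PATTERNS = {
    "hardcoded_return": re.compile(r"^\+\s*return\s+(True|False|None|\d+|\"[^\"]*\")\s*$", re.MULTILINE),
    "broad_except": re.compile(r"^\+\s*except(\s+Exception)?\s*:\s*(pass)?\s*$", re.MULTILINE),
    "test_disabled": re.compile(r"^\+\s*@(unittest\.skip|pytest\.mark\.skip)", re.MULTILINE),
}

def known_bad_flags(patch_diff: str) -> list[str]:
    """Names of suspicious patterns found in the patch's added lines."""
    return [name for name, pattern in KNOWN_BAD_PATTERNS.items() if pattern.search(patch_diff)]

def metr_alignment(test_passing_scores: list[float], threshold: float = 50.0) -> float:
    """Fraction of test-passing patches scoring below the merge-worthiness threshold.

    METR's finding suggests this should land near 0.5 on a representative sample.
    """
    return sum(score < threshold for score in test_passing_scores) / max(len(test_passing_scores), 1)
```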
## Dataset
999 patch examples from SWE-bench Verified:
- 526 test-passing patches across 2 agents
- 473 test-failing patches
- 126 synthetically generated known-bad patches for validation
- Features extracted for all examples
## Evaluation Results (v1)
Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:
### Score Distribution
| Metric | Value |
|---|---|
| Mean MergeScore | 50.6/100 |
| Median MergeScore | 49.5/100 |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |
### METR Alignment ✅
- 50% of test-passing patches scored below 50 — exactly matching the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9, Test-failing mean: 42.5
- Clear separation between resolved and unresolved patches
### Per-Dimension Averages (0-10 scale)
| Dimension | Mean | Std |
|---|---|---|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |
### Per-Agent Comparison
| Agent | Mean MergeScore | Patches |
|---|---|---|
| CoderForge (Qwen3-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |
### Known-Bad Detection

In earlier testing, the judge correctly identified known-bad patterns:

- no-op patch (just adds `pass`): 18.5/100
- broad try/except patches: flagged as low quality
- hardcoded returns: flagged as non-genuine fixes
## Quick Start

```python
from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)

print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```
### Batch Evaluation

```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```
## Key Research References
- PatchDiff — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- CodeJudgeBench — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- SWE-smith — "Scaling Data for Software Engineering Agents"
- UTBoost — "Rigorous Evaluation of Coding Agents on SWE-Bench"
## Data Sources

- Gold patches: `princeton-nlp/SWE-bench_Verified`
- CoderForge agent: `togethercomputer/CoderForge-Preview-32B`
- OpenHands+O1: `AlexCuadron/SWE-Bench-Verified-O1`
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`
## License
MIT