+---
+title: PatchJudge
+emoji: ⚖️
+colorFrom: purple
+colorTo: blue
+sdk: docker
+app_port: 7860
+pinned: false
+license: mit
+tags:
+- code-evaluation
+- swe-bench
+- llm-judge
+- code-quality
+- merge-score
+---
+# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents
+**PatchJudge** evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.
+## The Problem
+Coding agents are evaluated almost exclusively by asking "does the test suite pass?", and that criterion is unreliable:
+- **OpenAI abandoned SWE-bench Verified** because at least 16.4% of its test cases are flawed
+- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
+- **7.8% of "correct" patches are actually wrong** — they pass tests but are incomplete/broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
+- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch
+## The Solution: MergeScore
+PatchJudge scores every patch on **5 dimensions**, then combines them into a single **MergeScore (0-100)** as a weighted average:
+| Dimension | What It Measures | Weight |
+|-----------|-----------------|--------|
+| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
+| **Completeness** | Are edge cases handled? Error handling present? | 20% |
+| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
+| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
+| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
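
The weighting in the table can be sketched as a plain weighted average. This is a minimal illustration, not the project's aggregator: the dictionary keys are hypothetical identifiers, and it assumes each dimension is scored on a 0-100 scale where higher is better (the README only states the range of the final MergeScore).

```python
# Weights copied from the table above; the key names are hypothetical.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}


def merge_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed 0-100 each)."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    return round(total, 1)


example = {
    "correctness": 80,
    "completeness": 60,
    "code_quality": 70,
    "non_regression_risk": 90,
    "merge_readiness": 50,
}
# 24 + 12 + 14 + 13.5 + 7.5 = 71.0
print(merge_score(example))
```

Because the weights sum to 1, a patch that scores the same value on every dimension receives that value as its MergeScore.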
+## Architecture
+```
+Input: {issue_text, patch_diff, test_results, repo_context}
+                    │
+   ┌────────────────┼────────────────┐
+   ▼                ▼                ▼
+Feature         LLM Judge        Score
+Extractor       (structured      Aggregator
+(AST, diff      5-dimension      (weighted avg
+ stats)         eval)            → MergeScore)
+```
+## Components
+### 1. Data Loader (`patchjudge/data_loader.py`)
+- Loads SWE-bench Verified (500 gold-standard tasks)
+- Collects agent patches from multiple sources:
+  - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
+  - **OpenHands+O1**: 499 instances, 229 passed
+  - **SWE-bench S3 bucket**: 139 verified agent submissions
+- Builds unified `PatchExample` format
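
A unified record of this kind might look like the sketch below. The actual fields of `PatchExample` are not documented here, so everything in this block is an assumption: the field names are borrowed from the `quick_judge` arguments shown in Quick Start, plus `instance_id` and `agent_name` as hypothetical identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class PatchExample:
    """Hypothetical unified record; the real class may differ."""

    instance_id: str        # task identifier (assumed field)
    agent_name: str         # which agent produced the patch (assumed field)
    problem_statement: str  # issue text
    agent_patch: str        # unified diff produced by the agent
    gold_patch: str         # reference diff from the benchmark
    test_passed: bool       # whether the task's tests passed
    features: dict = field(default_factory=dict)  # filled in later


def pass_rate(examples: list[PatchExample]) -> float:
    """Fraction of examples whose tests passed."""
    if not examples:
        return 0.0
    return sum(e.test_passed for e in examples) / len(examples)


examples = [
    PatchExample("task-1", "agent-a", "issue text", "diff ...", "diff ...", True),
    PatchExample("task-2", "agent-a", "issue text", "diff ...", "diff ...", False),
]
print(pass_rate(examples))
```

Applied to the counts listed above, the same ratio gives 297/500 = 59.4% for CoderForge and 229/499 ≈ 45.9% for OpenHands+O1.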
+### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
+- AST-based analysis for Python patches
+- Diff statistics (files, lines, hunks, scope)
+- Issue-patch alignment via keyword matching
+- Code quality signals (TODOs, hardcoded values, debug statements)
+- Risk assessment (core file modifications, scope analysis)
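
As an illustration of the diff statistics and code quality signals listed above, the sketch below counts files, hunks, and changed lines in a unified diff and flags TODO markers and debug statements on added lines. It is a simplified stand-in, not the contents of `feature_extractor.py`: the function name, the result keys, and the regular expressions are assumptions, and no AST analysis is attempted.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")

# Hypothetical quality signals, checked on added lines only.
QUALITY_PATTERNS = {
    "todo": re.compile(r"\b(?:TODO|FIXME|XXX)\b"),
    "debug_statement": re.compile(r"\b(?:print|breakpoint)\(|\bpdb\.set_trace\("),
}


def diff_features(patch: str) -> dict:
    """Count basic diff statistics and quality signals in a unified diff."""
    stats = {"files": 0, "hunks": 0, "added": 0, "removed": 0}
    signals = {name: 0 for name in QUALITY_PATTERNS}
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            stats["files"] += 1
        elif HUNK_HEADER.match(line):
            stats["hunks"] += 1
        elif line.startswith("+") and not line.startswith("+++"):
            stats["added"] += 1
            for name, pattern in QUALITY_PATTERNS.items():
                if pattern.search(line):
                    signals[name] += 1
        elif line.startswith("-") and not line.startswith("---"):
            stats["removed"] += 1
    return {**stats, **signals}


sample = """\
diff --git a/utils.py b/utils.py
--- a/utils.py
+++ b/utils.py
@@ -1,2 +1,6 @@
 def calculate_average(values):
-    return sum(values) / len(values)
+    # TODO: decide what an empty input should return
+    if not values:
+        return 0.0
+    print("debug", values)
+    return sum(values) / len(values)
"""
print(diff_features(sample))
```

The `+++`/`---` checks skip file headers; an added line whose own content begins with `++` would be miscounted, which is acceptable for a sketch but worth handling in real code.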
+### 3. LLM Judge (`patchjudge/judge.py`)
+- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
+- Structured JSON output with reasoning per dimension
+- Temperature 0.1 for scoring consistency
+- Robust JSON parsing with retry logic
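
One way to make JSON parsing tolerant of chatty model replies is to scan for the first decodable JSON object and retry when none is found or a dimension is missing. The sketch below shows that idea only; it does not call the HF Inference API, and the function names, the `ask` callable, the dimension keys, and the `{"score", "reasoning"}` layout are all assumptions rather than the schema used in `judge.py`.

```python
import json
import re
from typing import Callable

# Hypothetical keys for the five dimensions.
DIMENSIONS = (
    "correctness",
    "completeness",
    "code_quality",
    "non_regression_risk",
    "merge_readiness",
)


def extract_json(reply: str) -> dict:
    """Return the first JSON object embedded in a model reply."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", reply):
        try:
            obj, _ = decoder.raw_decode(reply, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in reply")


def judge_with_retry(ask: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `ask` until it yields JSON covering every dimension."""
    last_error = None
    for _ in range(max_attempts):
        try:
            scores = extract_json(ask())
        except ValueError as exc:
            last_error = exc
            continue
        if all(dim in scores for dim in DIMENSIONS):
            return scores
        last_error = ValueError("reply is missing one or more dimensions")
    raise RuntimeError(f"no valid judgement after {max_attempts} attempts") from last_error


# Simulated model: the first reply is unusable, the second wraps JSON in prose.
good = {dim: {"score": 70, "reasoning": "stub"} for dim in DIMENSIONS}
replies = iter(["I cannot answer in JSON.", "Evaluation: " + json.dumps(good) + " Done."])
result = judge_with_retry(lambda: next(replies))
print(result["correctness"]["score"])
```

Passing the model call in as a callable keeps the parsing and retry logic testable without network access.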
+### 4. Validation (`patchjudge/validation.py`)
+- METR alignment check (~50% of test-passing patches should score below 50)
+- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
+- Resolved vs. unresolved separation analysis
+- Per-dimension statistical analysis
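
The first two checks can be sketched as follows. This is an illustration under stated assumptions, not the code in `validation.py`: the function names, the ±0.1 tolerance around the 50% target, and the regular expressions are hypothetical, and hardcoded-return detection is omitted because it needs more than a line-level regex.

```python
import re


def below_threshold_rate(scores: list[float], threshold: float = 50.0) -> float:
    """Fraction of MergeScores strictly below the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s < threshold for s in scores) / len(scores)


def metr_aligned(passing_scores: list[float], target: float = 0.5,
                 tolerance: float = 0.1) -> bool:
    """True if roughly half of the test-passing patches score below 50."""
    return abs(below_threshold_rate(passing_scores) - target) <= tolerance


# Hypothetical known-bad patterns, matched against added lines of a diff.
KNOWN_BAD = {
    "broad_except": re.compile(
        r"^\+\s*except(?:\s+(?:Exception|BaseException))?\s*:", re.MULTILINE
    ),
    "test_disabling": re.compile(
        r"^\+.*(?:pytest\.mark\.skip|unittest\.skip)", re.MULTILINE
    ),
}


def known_bad_patterns(patch: str) -> list[str]:
    """Names of the known-bad patterns found in a patch."""
    return sorted(name for name, pat in KNOWN_BAD.items() if pat.search(patch))


passing_scores = [20, 45, 49, 50, 72, 88]
print(metr_aligned(passing_scores))

patch = "+    try:\n+        run()\n+    except Exception:\n+        pass\n"
print(known_bad_patterns(patch))
```

A narrow handler such as `except ValueError:` is deliberately not flagged, since only bare or catch-all handlers are treated as suspicious here.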
+## Dataset
+999 patch examples from SWE-bench Verified:
+- **526 test-passing** patches across 2 agents (297 from CoderForge, 229 from OpenHands+O1)
+- **473 test-failing** patches
+- **126 synthetically generated known-bad** patches for validation, counted separately from the 999 above
+- Features extracted for all examples
+## Quick Start
+```python
+from patchjudge.judge import quick_judge
+
+# One-shot evaluation of a single patch
+result = quick_judge(
+    problem_statement="Fix divide by zero in calculate_average",
+    agent_patch="diff --git a/utils.py...",
+    gold_patch="diff --git a/utils.py...",
+    test_passed=True,
+)
+print(f"MergeScore: {result.merge_score}/100")
+print(result.summary())
+```
+## Batch Evaluation
+```bash
+python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
+```
+## Key Research References
+- [PatchDiff](https://arxiv.org/abs/2503.15223) — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
+- [CodeJudgeBench](https://arxiv.org/abs/2507.10535) — "Benchmarking LLM-as-a-Judge for Coding Tasks"
+- [SWE-smith](https://arxiv.org/abs/2504.21798) — "Scaling Data for Software Engineering Agents"
+- [UTBoost](https://arxiv.org/abs/2506.09289) — "Rigorous Evaluation of Coding Agents on SWE-Bench"
+## Data Sources
+- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
+- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
+- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
+- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`
+## License
+MIT