---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - code-evaluation
  - swe-bench
  - llm-judge
  - code-quality
  - merge-score
---

# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents

**PatchJudge** evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.

## The Problem

The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:

- **OpenAI abandoned SWE-bench Verified** because 16.4%+ of test cases are flawed
- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
- **7.8% of "correct" patches are actually wrong** — they pass tests but are incomplete/broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch

## The Solution: MergeScore

PatchJudge scores every patch on **5 dimensions**, then computes a single **MergeScore (0-100)**:

| Dimension | What It Measures | Weight |
|-----------|-----------------|--------|
| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
| **Completeness** | Are edge cases handled? Error handling present? | 20% |
| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
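
A minimal sketch of how the aggregation could work, assuming each dimension is judged on a 0-10 scale and then rescaled to 0-100 (function and key names are illustrative, not the actual `patchjudge` API):

```python
# Illustrative sketch of MergeScore aggregation; assumes per-dimension
# scores on a 0-10 scale. Not the shipped patchjudge implementation.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def merge_score(scores: dict[str, float]) -> float:
    """Weighted average of 0-10 dimension scores, rescaled to 0-100."""
    weighted = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return round(weighted * 10, 1)

# Example: a patch that fixes the issue but is rough around the edges.
print(merge_score({
    "correctness": 7.0,
    "completeness": 4.0,
    "code_quality": 5.0,
    "non_regression_risk": 5.0,
    "merge_readiness": 4.0,
}))  # 52.5
```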

## Architecture

```
Input: {issue_text, patch_diff, test_results, repo_context}
                        │
       ┌────────────────┼────────────────┐
       ▼                ▼                ▼
    Feature         LLM Judge         Score
    Extractor       (structured       Aggregator
    (AST, diff      5-dimension       (weighted avg
    stats)          eval)             → MergeScore)
```

## Components

### 1. Data Loader (`patchjudge/data_loader.py`)

- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
  - **OpenHands+O1**: 499 instances, 229 passed
  - **SWE-bench S3 bucket**: 139 verified agent submissions
- Builds a unified `PatchExample` format (sketched below)
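
A minimal sketch of what that unified record might hold (field names are assumptions for illustration, not the actual `PatchExample` definition):

```python
# Illustrative sketch of the unified record the loader builds; field
# names are assumptions, not the actual PatchExample definition.
from dataclasses import dataclass, field

@dataclass
class PatchExample:
    instance_id: str        # SWE-bench task ID, e.g. "django__django-12345"
    problem_statement: str  # issue text shown to the agent
    agent_patch: str        # unified diff produced by the agent
    gold_patch: str         # reference diff from the merged PR
    test_passed: bool       # did the task's test suite pass?
    source: str             # "coderforge", "o1", "s3", ...
    features: dict = field(default_factory=dict)  # extractor output
```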

### 2. Feature Extractor (`patchjudge/feature_extractor.py`)

- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope; see the sketch after this list)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)
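
For illustration, the diff-statistics step could be as simple as counting files, hunks, and changed lines in the unified diff. This is a sketch of the approach, not the extractor's actual code:

```python
# Sketch of diff statistics over a unified diff string; an assumption
# about the approach, not the shipped extractor.
import re

def diff_stats(patch: str) -> dict:
    """Count files, hunks, and added/removed lines in a unified diff."""
    files = re.findall(r"^diff --git a/(\S+) b/\S+", patch, re.MULTILINE)
    hunks = re.findall(r"^@@ ", patch, re.MULTILINE)
    lines = patch.splitlines()
    added = [l for l in lines if l.startswith("+") and not l.startswith("+++")]
    removed = [l for l in lines if l.startswith("-") and not l.startswith("---")]
    return {
        "files_changed": len(set(files)),
        "hunks": len(hunks),
        "lines_added": len(added),
        "lines_removed": len(removed),
        "touches_tests": any("test" in f for f in files),
    }
```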

### 3. LLM Judge (`patchjudge/judge.py`)

- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic (sketched below)
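
A sketch of what that retry-and-parse loop might look like; `call_judge` is a hypothetical stand-in for the Inference API call, not the module's actual interface:

```python
# Sketch of robust JSON parsing with retries. call_judge is a
# hypothetical helper standing in for the HF Inference API call.
import json
import re

def judge_with_retries(call_judge, prompt: str, max_retries: int = 3) -> dict:
    """Ask the judge for structured scores, retrying on malformed JSON."""
    for _ in range(max_retries):
        raw = call_judge(prompt, temperature=0.1)  # low temp for consistency
        # Models often wrap JSON in prose or code fences; grab the first object.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            continue
        try:
            scores = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if {"correctness", "completeness"} <= scores.keys():
            return scores
    raise ValueError(f"no valid judge JSON after {max_retries} attempts")
```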

### 4. Validation (`patchjudge/validation.py`)

- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling; see the sketch after this list)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis
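
For illustration, the known-bad checks could be simple regexes over a patch's added lines. A sketch under that assumption (the real checks may be richer):

```python
# Sketch of known-bad pattern detection on a patch's added lines.
# Patterns are illustrative, not the module's actual rules.
import re

KNOWN_BAD = {
    "hardcoded_return": re.compile(r"^\+\s*return\s+(True|False|None|-?\d+)\s*$"),
    "broad_except": re.compile(r"^\+\s*except(\s+Exception)?\s*:"),
    "test_disabled": re.compile(r"^\+\s*@(unittest\.skip|pytest\.mark\.skip)"),
}

def bad_patterns(patch: str) -> list[str]:
    """Return the names of suspicious patterns found in added lines."""
    added = [l for l in patch.splitlines() if l.startswith("+")]
    return [name for name, rx in KNOWN_BAD.items()
            if any(rx.search(line) for line in added)]
```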

## Dataset

999 patch examples from SWE-bench Verified:

- **526 test-passing** patches across 2 agents
- **473 test-failing** patches
- **126 synthetically generated known-bad** patches for validation (in addition to the 999)
- Features extracted for all examples

## Evaluation Results (v1)

Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:

### Score Distribution

| Metric | Value |
|--------|-------|
| Mean MergeScore | **50.6/100** |
| Median MergeScore | **49.5/100** |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |

### METR Alignment ✅

- **50% of test-passing patches scored below 50** — exactly matching the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9; test-failing mean: 42.5
- Clear separation between resolved and unresolved patches

### Per-Dimension Averages (0-10 scale)

| Dimension | Mean | Std |
|-----------|------|-----|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |
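
As a consistency check, plugging these per-dimension means into the dimension weights roughly reproduces the overall mean: 0.30·5.8 + 0.20·4.3 + 0.20·5.1 + 0.15·5.2 + 0.15·4.5 = 5.08 on the 0-10 scale, i.e. about 50.8/100, in line with the observed mean MergeScore of 50.6.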

### Per-Agent Comparison

| Agent | Mean MergeScore | Patches |
|-------|----------------|---------|
| CoderForge (Qwen3-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |

### Known-Bad Detection

In earlier testing, the judge correctly identified known-bad patterns:

- **noop patch** (just adds `pass`): 18.5/100
- **broad try/except** patches: flagged as low quality
- **hardcoded returns**: flagged as non-genuine fixes

## Quick Start

```python
from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)

print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```

## Batch Evaluation

```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```

## Key Research References

- [PatchDiff](https://arxiv.org/abs/2503.15223) — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- [CodeJudgeBench](https://arxiv.org/abs/2507.10535) — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- [SWE-smith](https://arxiv.org/abs/2504.21798) — "Scaling Data for Software Engineering Agents"
- [UTBoost](https://arxiv.org/abs/2506.09289) — "Rigorous Evaluation of Coding Agents on SWE-Bench"

## Data Sources

- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`

## License

MIT