---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- code-evaluation
- swe-bench
- llm-judge
- code-quality
- merge-score
---

# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents

**PatchJudge** evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.

## The Problem

The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:

- **OpenAI abandoned SWE-bench Verified** because 16.4%+ of test cases are flawed
- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
- **7.8% of "correct" patches are actually wrong** — they pass tests but are incomplete/broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch

## The Solution: MergeScore

PatchJudge scores every patch on **5 dimensions**, then computes a single **MergeScore (0-100)**:

| Dimension | What It Measures | Weight |
|-----------|------------------|--------|
| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
| **Completeness** | Are edge cases handled? Error handling present? | 20% |
| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |

## Architecture

```
Input: {issue_text, patch_diff, test_results, repo_context}
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
    Feature          LLM Judge          Score
    Extractor       (structured       Aggregator
    (AST, diff      5-dimension      (weighted avg
     stats)            eval)         → MergeScore)
```
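The Score Aggregator stage reduces the five 0-10 dimension scores to the single 0-100 MergeScore. The exact aggregation in `patchjudge` may differ; below is a minimal sketch assuming a plain weighted average of the 0-10 dimension scores rescaled to 0-100, using the weights from the table above. The `DimensionScores` class and `merge_score` function are hypothetical names, not the actual API.

```python
# Illustrative sketch of the Score Aggregator step. Assumes the documented
# weights (30/20/20/15/15) and the 0-10 per-dimension scale; the class and
# function names here are hypothetical, not the real patchjudge interface.
from dataclasses import dataclass

WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,  # higher score assumed to mean lower risk
    "merge_readiness": 0.15,
}

@dataclass
class DimensionScores:
    """Per-dimension scores on a 0-10 scale, as emitted by the LLM judge."""
    correctness: float
    completeness: float
    code_quality: float
    non_regression_risk: float
    merge_readiness: float

def merge_score(scores: DimensionScores) -> float:
    """Weighted average of the 0-10 dimension scores, rescaled to 0-100."""
    return 10 * sum(getattr(scores, dim) * w for dim, w in WEIGHTS.items())

# Plugging in the reported per-dimension means (5.8, 4.3, 5.1, 5.2, 4.5)
# yields roughly 50.7, in the same range as the reported mean MergeScore of 50.6.
print(round(merge_score(DimensionScores(5.8, 4.3, 5.1, 5.2, 4.5)), 1))
```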
## Components

### 1. Data Loader (`patchjudge/data_loader.py`)

- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
  - **OpenHands+O1**: 499 instances, 229 passed
  - **SWE-bench S3 bucket**: 139 verified agent submissions
- Builds unified `PatchExample` format

### 2. Feature Extractor (`patchjudge/feature_extractor.py`)

- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)

### 3. LLM Judge (`patchjudge/judge.py`)

- Uses Qwen2.5-Coder-32B-Instruct via the HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic

### 4. Validation (`patchjudge/validation.py`)

- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis

## Dataset

999 patch examples from SWE-bench Verified, plus synthetically generated known-bad patches for validation:

- **526 test-passing** patches across 2 agents
- **473 test-failing** patches
- **126 synthetically generated known-bad** patches
- Features extracted for all examples

## Evaluation Results (v1)

Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:

### Score Distribution

| Metric | Value |
|--------|-------|
| Mean MergeScore | **50.6/100** |
| Median MergeScore | **49.5/100** |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |

### METR Alignment ✅

- **50% of test-passing patches scored below 50** — matching the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9, test-failing mean: 42.5
- Clear separation between resolved and unresolved patches

### Per-Dimension Averages (0-10 scale)

| Dimension | Mean | Std |
|-----------|------|-----|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |

### Per-Agent Comparison

| Agent | Mean MergeScore | Patches |
|-------|-----------------|---------|
| CoderForge (Qwen3-Coder-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |

### Known-Bad Detection

In earlier testing, the judge correctly identified known-bad patterns:

- **No-op patch** (just adds `pass`): 18.5/100
- **Broad try/except** patches: flagged as low quality
- **Hardcoded returns**: flagged as non-genuine fixes

## Quick Start

```python
from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)
print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```

## Batch Evaluation

```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```

## Key Research References

- [PatchDiff](https://arxiv.org/abs/2503.15223) — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- [CodeJudgeBench](https://arxiv.org/abs/2507.10535) — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- [SWE-smith](https://arxiv.org/abs/2504.21798) — "Scaling Data for Software Engineering Agents"
- [UTBoost](https://arxiv.org/abs/2506.09289) — "Rigorous Evaluation of Coding Agents on SWE-Bench"

## Data Sources

- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`

## License

MIT