---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - code-evaluation
  - swe-bench
  - llm-judge
  - code-quality
  - merge-score
---
# PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents
PatchJudge evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.
## The Problem
The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:
- OpenAI abandoned SWE-bench Verified because 16.4%+ of test cases are flawed
- METR found ~50% of test-passing PRs wouldn't be merged into real codebases
- 7.8% of "correct" patches are actually wrong — they pass tests but are incomplete/broken (PatchDiff, 2503.15223)
- 23% of SWE-bench tasks can be "solved" by a trivial regex patch
## The Solution: MergeScore
PatchJudge scores every patch on 5 dimensions, then computes a single MergeScore (0-100):
| Dimension | What It Measures | Weight |
|---|---|---|
| Correctness | Does the fix address the actual issue, not just the test? | 30% |
| Completeness | Are edge cases handled? Error handling present? | 20% |
| Code Quality | Clean, idiomatic, maintainable, follows conventions? | 20% |
| Non-Regression Risk | Could this break unrelated functionality? | 15% |
| Merge-Readiness | Would a senior engineer approve this PR as-is? | 15% |
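
For concreteness, here is a minimal sketch of the aggregation step, assuming each dimension comes back from the judge on a 0-10 scale (matching the per-dimension results reported below); the function name is illustrative, not the package's actual API:

```python
# Illustrative only: weights mirror the table above.
DIMENSION_WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def aggregate_merge_score(dimension_scores: dict[str, float]) -> float:
    """Combine 0-10 per-dimension scores into a 0-100 MergeScore (weighted average x 10)."""
    weighted = sum(DIMENSION_WEIGHTS[d] * dimension_scores[d] for d in DIMENSION_WEIGHTS)
    return round(weighted * 10, 1)

# Example: a plausible but not clearly merge-ready patch.
print(aggregate_merge_score({
    "correctness": 6, "completeness": 4, "code_quality": 5,
    "non_regression_risk": 5, "merge_readiness": 4,
}))  # -> 49.5
```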
## Architecture

```
Input: {issue_text, patch_diff, test_results, repo_context}
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
    Feature          LLM Judge          Score
    Extractor        (structured        Aggregator
    (AST, diff       5-dimension        (weighted avg
    stats)           eval)              → MergeScore)
```
## Components
### 1. Data Loader (`patchjudge/data_loader.py`)

- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - CoderForge (Qwen3-Coder-32B): 500 instances, 297 passed
  - OpenHands+O1: 499 instances, 229 passed
  - SWE-bench S3 bucket: 139 verified agent submissions
- Builds a unified `PatchExample` format (sketched below)
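
A rough idea of what that unified record could look like; the field names below are illustrative, not necessarily those defined in `patchjudge/data_loader.py`:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a unified patch record; real field names may differ.
@dataclass
class PatchExample:
    instance_id: str        # SWE-bench Verified task id
    source: str             # agent that produced the patch, e.g. "coderforge" or "o1"
    problem_statement: str  # issue text from the benchmark task
    agent_patch: str        # unified diff produced by the agent
    gold_patch: str         # reference patch shipped with SWE-bench Verified
    test_passed: bool       # outcome of the task's test suite
    features: dict = field(default_factory=dict)  # filled in later by the Feature Extractor
```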
### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)
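
The sketch below shows the kind of lightweight, diff-level signals this stage can produce with only the standard library; the actual extractor also performs AST analysis, which is omitted here, and the regexes and field names are assumptions:

```python
import re

# Hypothetical feature extraction over a unified diff; patterns are illustrative.
def extract_features(patch_diff: str, issue_text: str) -> dict:
    added = [l[1:] for l in patch_diff.splitlines() if l.startswith("+") and not l.startswith("+++")]
    removed = [l[1:] for l in patch_diff.splitlines() if l.startswith("-") and not l.startswith("---")]
    files = re.findall(r"^diff --git a/(\S+)", patch_diff, flags=re.MULTILINE)
    issue_terms = set(re.findall(r"[a-zA-Z_]{4,}", issue_text.lower()))
    patch_terms = set(re.findall(r"[a-zA-Z_]{4,}", " ".join(added).lower()))
    return {
        # Diff statistics
        "files_touched": len(files),
        "lines_added": len(added),
        "lines_removed": len(removed),
        "hunks": patch_diff.count("@@") // 2,
        # Code-quality smells in added lines
        "todo_count": sum("TODO" in l or "FIXME" in l for l in added),
        "debug_statements": sum("print(" in l or "pdb.set_trace" in l for l in added),
        # Crude issue-patch alignment via keyword overlap
        "issue_overlap": len(issue_terms & patch_terms) / max(len(issue_terms), 1),
    }
```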
### 3. LLM Judge (`patchjudge/judge.py`)
- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic
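
A hedged sketch of how one might call the judge model and parse its structured output; the use of `huggingface_hub.InferenceClient` and the prompt wording are assumptions here, not a transcript of what `patchjudge/judge.py` actually does:

```python
import json
from huggingface_hub import InferenceClient

client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")

PROMPT = """You are reviewing a code patch for the issue below.
Score it 0-10 on correctness, completeness, code_quality, non_regression_risk, merge_readiness.
Respond with JSON only, e.g.
{{"correctness": 7, "completeness": 5, "code_quality": 6,
  "non_regression_risk": 6, "merge_readiness": 5, "reasoning": "..."}}

Issue:
{issue}

Patch:
{patch}
"""

def judge_patch(issue: str, patch: str, retries: int = 3) -> dict:
    """Query the judge at low temperature and retry until the reply parses as JSON."""
    for _ in range(retries):
        reply = client.chat_completion(
            messages=[{"role": "user", "content": PROMPT.format(issue=issue, patch=patch)}],
            temperature=0.1,
            max_tokens=512,
        )
        text = reply.choices[0].message.content
        try:
            # Extract the outermost JSON object even if the model adds surrounding prose.
            return json.loads(text[text.index("{"): text.rindex("}") + 1])
        except ValueError:
            continue
    raise RuntimeError("Judge did not return valid JSON after retries")
```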
### 4. Validation (`patchjudge/validation.py`)
- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis
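
A sketch of the known-bad pattern checks and the METR-style alignment statistic; the patterns and the 50-point threshold are assumptions about what such a validator might look for, not the actual contents of `patchjudge/validation.py`:

```python
import re

# Illustrative heuristics for patches that pass tests without genuinely fixing the issue.
KNOWN_BAD_PATTERNS = {
    "hardcoded_return": re.compile(r"^\+\s*return\s+(True|False|None|\d+|\"[^\"]*\")\s*$", re.MULTILINE),
    "broad_except": re.compile(r"^\+\s*except(\s+Exception)?\s*:\s*(pass)?\s*$", re.MULTILINE),
    "test_disabled": re.compile(r"^\+\s*@(unittest\.skip|pytest\.mark\.skip)", re.MULTILINE),
}

def known_bad_flags(patch_diff: str) -> list[str]:
    """Names of suspicious patterns found in the patch's added lines."""
    return [name for name, pattern in KNOWN_BAD_PATTERNS.items() if pattern.search(patch_diff)]

def metr_alignment(test_passing_scores: list[float], threshold: float = 50.0) -> float:
    """Fraction of test-passing patches scoring below the merge-worthiness threshold.

    METR's finding suggests this should land near 0.5 on a representative sample.
    """
    return sum(score < threshold for score in test_passing_scores) / max(len(test_passing_scores), 1)
```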
## Dataset
999 patch examples from SWE-bench Verified:
- 526 test-passing patches across 2 agents
- 473 test-failing patches
- 126 synthetically generated known-bad patches for validation
- Features extracted for all examples
## Evaluation Results (v1)
Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:
### Score Distribution
| Metric | Value |
|---|---|
| Mean MergeScore | 50.6/100 |
| Median MergeScore | 49.5/100 |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |
### METR Alignment ✅
- 50% of test-passing patches scored below 50 — exactly matching the METR finding that ~50% of test-passing PRs are not merge-worthy
- Test-passing mean: 50.9, Test-failing mean: 42.5
- Clear separation between resolved and unresolved patches
### Per-Dimension Averages (0-10 scale)
| Dimension | Mean | Std |
|---|---|---|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |
### Per-Agent Comparison
| Agent | Mean MergeScore | Patches |
|---|---|---|
| CoderForge (Qwen3-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |
### Known-Bad Detection

In earlier testing, the judge correctly identified known-bad patterns:

- no-op patch (just adds `pass`): 18.5/100
- broad try/except patches: flagged as low quality
- hardcoded returns: flagged as non-genuine fixes
## Quick Start

```python
from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)

print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```
### Batch Evaluation

```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```
## Key Research References
- PatchDiff — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- CodeJudgeBench — "Benchmarking LLM-as-a-Judge for Coding Tasks"
- SWE-smith — "Scaling Data for Software Engineering Agents"
- UTBoost — "Rigorous Evaluation of Coding Agents on SWE-Bench"
## Data Sources

- Gold patches: `princeton-nlp/SWE-bench_Verified`
- CoderForge agent: `togethercomputer/CoderForge-Preview-32B`
- OpenHands+O1: `AlexCuadron/SWE-Bench-Verified-O1`
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`
## License
MIT