Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,131 @@
---
title: PatchJudge
emoji: ⚖️
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
- code-evaluation
- swe-bench
- llm-judge
- code-quality
- merge-score
---

# PatchJudge: Post-Test Code Quality Scorer for AI Coding Agents

**PatchJudge** evaluates whether AI-generated code patches are actually good, not just whether they pass unit tests.

## The Problem

Most of the industry evaluates coding agents by asking "does the test suite pass?", and that signal is unreliable:
- **OpenAI abandoned SWE-bench Verified** because 16.4%+ of test cases are flawed
- **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
- **7.8% of "correct" patches are actually wrong**: they pass tests but are incomplete or broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
- **23% of SWE-bench tasks** can be "solved" by a trivial regex patch

## The Solution: MergeScore

PatchJudge scores every patch on **5 dimensions**, then computes a single **MergeScore (0-100)**:

| Dimension | What It Measures | Weight |
|-----------|-----------------|--------|
| **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
| **Completeness** | Are edge cases handled? Error handling present? | 20% |
| **Code Quality** | Clean, idiomatic, maintainable, follows conventions? | 20% |
| **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
| **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
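
As a minimal sketch of how the weights in the table could combine into a single number, the snippet below computes a weighted average and rescales it to 0-100. The 0-10 per-dimension scale, the function name `merge_score`, and the dictionary keys are assumptions made for illustration; they are not taken from the PatchJudge code.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}


def merge_score(scores: dict[str, float], max_per_dim: float = 10.0) -> float:
    """Weighted average of per-dimension scores, rescaled to 0-100.

    Assumes every dimension is scored on the same 0..max_per_dim scale
    (an assumption for this sketch, not a documented PatchJudge detail).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[dim] * scores[dim] / max_per_dim for dim in WEIGHTS)
    return round(100.0 * total, 1)


example = {
    "correctness": 8,
    "completeness": 6,
    "code_quality": 7,
    "non_regression_risk": 9,
    "merge_readiness": 5,
}
print(merge_score(example))  # 71.0
```

Because Correctness carries the largest weight, a patch that passes tests but misses the actual issue loses more points than one with only stylistic problems.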

## Architecture

```
Input: {issue_text, patch_diff, test_results, repo_context}
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
      Feature         LLM Judge          Score
     Extractor       (structured       Aggregator
     (AST, diff      5-dimension      (weighted avg
       stats)           eval)         → MergeScore)
```

## Components

### 1. Data Loader (`patchjudge/data_loader.py`)
- Loads SWE-bench Verified (500 gold-standard tasks)
- Collects agent patches from multiple sources:
  - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
  - **OpenHands+O1**: 499 instances, 229 passed
  - **SWE-bench S3 bucket**: 139 verified agent submissions
- Builds unified `PatchExample` format
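
The README names a unified `PatchExample` format but does not list its fields. The sketch below shows one plausible shape, assuming the fields mirror the Quick Start arguments and the architecture inputs. Every field name here, and the helper `pass_rate`, is hypothetical; the real class may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PatchExample:
    """Hypothetical unified record for one agent patch (fields are assumptions)."""

    instance_id: str        # SWE-bench task identifier
    agent_name: str         # which source produced the patch
    problem_statement: str  # issue text
    agent_patch: str        # unified diff produced by the agent
    gold_patch: str         # reference diff from SWE-bench Verified
    test_passed: bool       # whether the task's tests passed
    features: dict = field(default_factory=dict)  # filled in later by the extractor


def pass_rate(examples: list[PatchExample]) -> float:
    """Fraction of examples whose tests passed (0.0 for an empty list)."""
    if not examples:
        return 0.0
    return sum(e.test_passed for e in examples) / len(examples)
```

Keeping one record type per patch means the loader can merge sources with different raw layouts before any scoring happens.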

### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
- AST-based analysis for Python patches
- Diff statistics (files, lines, hunks, scope)
- Issue-patch alignment via keyword matching
- Code quality signals (TODOs, hardcoded values, debug statements)
- Risk assessment (core file modifications, scope analysis)
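
To make the diff statistics and code quality signals concrete, here is a minimal sketch that counts files, hunks, and changed lines in a unified diff and flags TODOs and debug statements on added lines. The function names and regex patterns are assumptions for illustration, not the extractor's actual rules, and the line classification is simplified (for example, a removed line whose content begins with `--` would be mistaken for a file header).

```python
import re

# Illustrative patterns only; the real extractor's rules are not shown in this README.
SIGNAL_PATTERNS = {
    "todo": re.compile(r"\b(?:TODO|FIXME)\b"),
    "debug": re.compile(r"\b(?:print|breakpoint)\(|pdb\.set_trace\("),
}


def added_lines(patch: str) -> list[str]:
    """Return the text of lines added by a unified diff, without the leading '+'."""
    return [
        line[1:]
        for line in patch.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def diff_stats(patch: str) -> dict[str, int]:
    """Count files, hunks, and added/removed lines in a unified diff."""
    stats = {"files": 0, "hunks": 0, "added": 0, "removed": 0}
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            stats["files"] += 1
        elif line.startswith("@@"):
            stats["hunks"] += 1
        elif line.startswith("+") and not line.startswith("+++"):
            stats["added"] += 1
        elif line.startswith("-") and not line.startswith("---"):
            stats["removed"] += 1
    return stats


def quality_signals(patch: str) -> dict[str, int]:
    """Count added lines that match each signal pattern."""
    lines = added_lines(patch)
    return {
        name: sum(1 for line in lines if pattern.search(line))
        for name, pattern in SIGNAL_PATTERNS.items()
    }


SAMPLE = """\
diff --git a/utils.py b/utils.py
--- a/utils.py
+++ b/utils.py
@@ -1,3 +1,6 @@
 def calculate_average(xs):
-    return sum(xs) / len(xs)
+    # TODO: handle non-numeric input
+    if not xs:
+        return 0
+    print("debug", xs)
+    return sum(xs) / len(xs)
"""

print(diff_stats(SAMPLE))       # {'files': 1, 'hunks': 1, 'added': 5, 'removed': 1}
print(quality_signals(SAMPLE))  # {'todo': 1, 'debug': 1}
```

Only added lines are scanned for signals, so a patch is not penalized for TODOs that already existed in the file.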

### 3. LLM Judge (`patchjudge/judge.py`)
- Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
- Structured JSON output with reasoning per dimension
- Temperature 0.1 for scoring consistency
- Robust JSON parsing with retry logic
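
The "robust JSON parsing with retry logic" step could look roughly like the sketch below. It assumes the model reply may wrap the JSON object in prose or a code fence, and that a failed parse is handled by simply calling the model again. The names `extract_json`, `judge_with_retry`, the `call_model` callable, and the dimension keys are all hypothetical; the actual inference call is left out.

```python
import json
import re
from typing import Callable

# Assumed key names for the five dimensions; the real schema may differ.
DIMENSIONS = (
    "correctness",
    "completeness",
    "code_quality",
    "non_regression_risk",
    "merge_readiness",
)


def extract_json(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model reply."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))  # raises json.JSONDecodeError on bad JSON
    missing = [dim for dim in DIMENSIONS if dim not in data]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return data


def judge_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until its reply parses, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return extract_json(call_model(prompt))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
    raise RuntimeError(f"no valid judgment after {max_attempts} attempts") from last_error
```

Passing the model call in as a plain callable keeps the parsing logic testable without network access.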

### 4. Validation (`patchjudge/validation.py`)
- METR alignment check (~50% of test-passing patches should score below 50)
- Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
- Resolved vs. unresolved separation analysis
- Per-dimension statistical analysis

## Dataset

999 patch examples from SWE-bench Verified:
- **526 test-passing** patches across 2 agents
- **473 test-failing** patches
- **126 synthetically generated known-bad** patches for validation
- Features extracted for all examples

## Quick Start

```python
from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)

print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
```

## Batch Evaluation

```bash
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
```

## Key Research References

- [PatchDiff](https://arxiv.org/abs/2503.15223): "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
- [CodeJudgeBench](https://arxiv.org/abs/2507.10535): "Benchmarking LLM-as-a-Judge for Coding Tasks"
- [SWE-smith](https://arxiv.org/abs/2504.21798): "Scaling Data for Software Engineering Agents"
- [UTBoost](https://arxiv.org/abs/2506.09289): "Rigorous Evaluation of Coding Agents on SWE-Bench"

## Data Sources

- Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
- CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
- OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
- SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`

## License

MIT