VD10 committed on
Commit 9b4b049 · verified · 1 Parent(s): 42c4979

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +131 -0
README.md ADDED
@@ -0,0 +1,131 @@
+ ---
+ title: PatchJudge
+ emoji: ⚖️
+ colorFrom: purple
+ colorTo: blue
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ license: mit
+ tags:
+ - code-evaluation
+ - swe-bench
+ - llm-judge
+ - code-quality
+ - merge-score
+ ---
+
+ # PatchJudge: Post-Test Code Quality Scorer for AI Coding Agents
+
+ **PatchJudge** evaluates whether AI-generated code patches are actually good, not just whether they pass unit tests.
+
+ ## The Problem
+
+ Coding agents are almost always evaluated by a single question, "does the test suite pass?", and that signal is unreliable:
+ - **OpenAI abandoned SWE-bench Verified** because at least 16.4% of test cases are flawed
+ - **METR found ~50% of test-passing PRs** wouldn't be merged into real codebases
+ - **7.8% of "correct" patches are actually wrong**: they pass tests but are incomplete or broken ([PatchDiff, 2503.15223](https://arxiv.org/abs/2503.15223))
+ - **23% of SWE-bench tasks** can be "solved" by a trivial regex patch
+
+ ## The Solution: MergeScore
+
+ PatchJudge scores every patch on **5 dimensions**, then combines them into a single **MergeScore (0-100)** as a weighted average:
+
+ | Dimension | What It Measures | Weight |
+ |-----------|-----------------|--------|
+ | **Correctness** | Does the fix address the actual issue, not just the test? | 30% |
+ | **Completeness** | Are edge cases handled? Is error handling present? | 20% |
+ | **Code Quality** | Is the code clean, idiomatic, maintainable, and consistent with conventions? | 20% |
+ | **Non-Regression Risk** | Could this break unrelated functionality? | 15% |
+ | **Merge-Readiness** | Would a senior engineer approve this PR as-is? | 15% |
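
As a rough illustration of the aggregation step, the sketch below computes a weighted average using the weights from the table. It assumes each dimension is scored on a 0-100 scale, which this README does not state. The function `merge_score` and the dictionary keys are hypothetical names, not PatchJudge's actual API.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}


def merge_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed to be on a 0-100 scale)."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    # Clamp so that out-of-range inputs cannot push the result outside 0-100.
    return round(max(0.0, min(100.0, total)), 1)


# Hypothetical per-dimension scores for one patch.
example = {
    "correctness": 80,
    "completeness": 60,
    "code_quality": 70,
    "non_regression_risk": 90,
    "merge_readiness": 50,
}
print(merge_score(example))  # 24 + 12 + 14 + 13.5 + 7.5 = 71.0
```

Because Correctness carries the largest weight, a patch that only games the tests is pulled down even if its style scores are high.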
+
+ ## Architecture
+
+ ```
+ Input: {issue_text, patch_diff, test_results, repo_context}
+
+             ┌────────────────┼────────────────┐
+             ▼                ▼                ▼
+          Feature         LLM Judge          Score
+         Extractor       (structured      Aggregator
+        (AST, diff       5-dimension     (weighted avg
+          stats)            eval)        → MergeScore)
+ ```
+
+ ## Components
+
+ ### 1. Data Loader (`patchjudge/data_loader.py`)
+ - Loads SWE-bench Verified (500 gold-standard tasks)
+ - Collects agent patches from multiple sources:
+   - **CoderForge** (Qwen3-Coder-32B): 500 instances, 297 passed
+   - **OpenHands+O1**: 499 instances, 229 passed
+   - **SWE-bench S3 bucket**: 139 verified agent submissions
+ - Builds a unified `PatchExample` format
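
To make the idea of a unified record concrete, here is a minimal sketch of what such a container might look like. The fields `problem_statement`, `agent_patch`, `gold_patch`, and `test_passed` mirror the keyword arguments of `quick_judge` in the Quick Start; `instance_id`, `source`, and the helper `pass_rate` are assumptions for illustration, and the real `PatchExample` may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchExample:
    """Hypothetical unified record; the real class may use different fields."""

    instance_id: str        # task identifier (assumed field)
    source: str             # which agent produced the patch (assumed field)
    problem_statement: str  # issue text
    agent_patch: str        # unified diff produced by the agent
    gold_patch: str         # reference diff from the benchmark
    test_passed: bool       # whether the agent patch passed the tests


def pass_rate(examples: list[PatchExample], source: str) -> float:
    """Fraction of one source's patches that passed the tests."""
    subset = [ex for ex in examples if ex.source == source]
    if not subset:
        raise ValueError(f"no examples for source {source!r}")
    return sum(ex.test_passed for ex in subset) / len(subset)


# Tiny made-up sample, only to show the record in use.
examples = [
    PatchExample("task-1", "coderforge", "Fix crash", "diff A", "diff G", True),
    PatchExample("task-2", "coderforge", "Fix typo", "diff B", "diff H", False),
    PatchExample("task-1", "o1", "Fix crash", "diff C", "diff G", True),
]
print(pass_rate(examples, "coderforge"))  # 0.5
```

Applied to the counts listed above, the same calculation gives 297/500 = 59.4% for CoderForge and 229/499 ≈ 45.9% for OpenHands+O1.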
+
+ ### 2. Feature Extractor (`patchjudge/feature_extractor.py`)
+ - AST-based analysis for Python patches
+ - Diff statistics (files, lines, hunks, scope)
+ - Issue-patch alignment via keyword matching
+ - Code quality signals (TODOs, hardcoded values, debug statements)
+ - Risk assessment (core file modifications, scope analysis)
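
The diff statistics and quality signals listed above could be gathered with a single pass over a unified diff. The sketch below is one simple way to do it, not the project's actual extractor: `DiffStats`, `diff_stats`, and the two regular expressions are hypothetical, and the line-prefix heuristics ignore edge cases such as added lines whose content itself begins with `++`.

```python
import re
from dataclasses import dataclass

TODO_RE = re.compile(r"\b(TODO|FIXME|XXX)\b")
DEBUG_RE = re.compile(r"\b(print|breakpoint|pdb\.set_trace)\s*\(")


@dataclass
class DiffStats:
    files: int
    hunks: int
    added: int
    removed: int
    todo_markers: int
    debug_statements: int


def diff_stats(patch: str) -> DiffStats:
    """Count files, hunks, changed lines, and simple quality signals in a unified diff."""
    stats = DiffStats(0, 0, 0, 0, 0, 0)
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            stats.files += 1
        elif line.startswith("@@"):
            stats.hunks += 1
        elif line.startswith("+") and not line.startswith("+++"):
            stats.added += 1
            body = line[1:]
            # Quality signals are only checked on lines the patch adds.
            if TODO_RE.search(body):
                stats.todo_markers += 1
            if DEBUG_RE.search(body):
                stats.debug_statements += 1
        elif line.startswith("-") and not line.startswith("---"):
            stats.removed += 1
    return stats


SAMPLE = """\
diff --git a/utils.py b/utils.py
--- a/utils.py
+++ b/utils.py
@@ -1,2 +1,6 @@
 def calculate_average(values):
-    return sum(values) / len(values)
+    # TODO: decide what an empty input should return
+    if not values:
+        return 0.0
+    print("debug", values)
+    return sum(values) / len(values)
"""

print(diff_stats(SAMPLE))
```

For the sample patch this reports one file, one hunk, five added lines, one removed line, one TODO marker, and one debug statement.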
+
+ ### 3. LLM Judge (`patchjudge/judge.py`)
+ - Uses Qwen2.5-Coder-32B-Instruct via the Hugging Face Inference API
+ - Structured JSON output with reasoning per dimension
+ - Temperature 0.1 for scoring consistency
+ - Robust JSON parsing with retry logic
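
One plausible shape for the JSON parsing and retry logic is sketched below. It deliberately takes the model call as a plain callable instead of naming any inference client, so nothing here reflects the real Hugging Face API usage in `judge.py`. The dimension keys, the `score`/`reasoning` schema, and the function names are all assumptions.

```python
import json
import re
from typing import Callable

# Assumed key names for the five dimensions.
DIMENSIONS = (
    "correctness",
    "completeness",
    "code_quality",
    "non_regression_risk",
    "merge_readiness",
)


def extract_json(reply: str) -> dict:
    """Pull a JSON object out of a model reply that may contain extra prose."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    missing = [name for name in DIMENSIONS if name not in data]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return data


def judge_with_retry(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until its reply parses, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return extract_json(call_model(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error


# Stand-in for a real model: the first reply is malformed, the second is valid.
good = {name: {"score": 7, "reasoning": "placeholder"} for name in DIMENSIONS}
replies = iter(["Here you go: {broken", "Verdict: " + json.dumps(good) + " Done."])
result = judge_with_retry(lambda prompt: next(replies), "judge this patch")
print(result["correctness"]["score"])  # 7
```

Validating that every dimension is present before returning means a truncated reply triggers a retry instead of silently producing a partial score.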
+
+ ### 4. Validation (`patchjudge/validation.py`)
+ - METR alignment check (~50% of test-passing patches should score below 50)
+ - Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
+ - Resolved vs. unresolved separation analysis
+ - Per-dimension statistical analysis
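
Two of the checks above are easy to illustrate with the standard library. The sketch below detects one known-bad pattern (a broad `except`) using the `ast` module, and computes the share of scores below a threshold for the METR alignment check. Both helpers are hypothetical and operate on patched source code and a plain list of scores, which may not match how `validation.py` is organized.

```python
import ast

BROAD_NAMES = {"Exception", "BaseException"}


def has_broad_except(source: str) -> bool:
    """True if the code has a bare `except:` or catches Exception/BaseException."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if node.type is None:  # bare `except:`
                return True
            if isinstance(node.type, ast.Name) and node.type.id in BROAD_NAMES:
                return True
    return False


def fraction_below(scores: list[float], threshold: float = 50.0) -> float:
    """Share of scores strictly below the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score < threshold for score in scores) / len(scores)


SWALLOWS_ERRORS = """\
try:
    result = 1 / 0
except Exception:
    pass
"""

NARROW_HANDLER = """\
try:
    value = int("a")
except ValueError:
    value = 0
"""

print(has_broad_except(SWALLOWS_ERRORS))  # True
print(has_broad_except(NARROW_HANDLER))   # False
print(fraction_below([20, 45, 60, 80]))   # 0.5
```

Under the METR alignment target, `fraction_below` applied to the MergeScores of test-passing patches would be expected to land near 0.5.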
+
+ ## Dataset
+
+ 999 agent patch examples from SWE-bench Verified (500 from CoderForge, 499 from OpenHands+O1):
+ - **526 test-passing** patches across the two agents
+ - **473 test-failing** patches
+ - **126 synthetically generated known-bad** patches for validation (on top of the 999 agent patches)
+ - Features extracted for all examples
+
+ ## Quick Start
+
+ ```python
+ from patchjudge.judge import PatchJudge, quick_judge
+
+ # One-shot evaluation of a single patch
+ result = quick_judge(
+     problem_statement="Fix divide by zero in calculate_average",
+     agent_patch="diff --git a/utils.py...",  # agent's unified diff (truncated here)
+     gold_patch="diff --git a/utils.py...",   # reference unified diff (truncated here)
+     test_passed=True,
+ )
+
+ print(f"MergeScore: {result.merge_score}/100")
+ print(result.summary())
+ ```
+
+ ## Batch Evaluation
+
+ ```bash
+ python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
+ ```
+
+ ## Key Research References
+
+ - [PatchDiff](https://arxiv.org/abs/2503.15223): "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
+ - [CodeJudgeBench](https://arxiv.org/abs/2507.10535): "Benchmarking LLM-as-a-Judge for Coding Tasks"
+ - [SWE-smith](https://arxiv.org/abs/2504.21798): "Scaling Data for Software Engineering Agents"
+ - [UTBoost](https://arxiv.org/abs/2506.09289): "Rigorous Evaluation of Coding Agents on SWE-Bench"
+
+ ## Data Sources
+
+ - Gold patches: [princeton-nlp/SWE-bench_Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified)
+ - CoderForge agent: [togethercomputer/CoderForge-Preview-32B](https://huggingface.co/datasets/togethercomputer/CoderForge-Preview-32B-SWE-Bench-Verified-Evaluation-trajectories)
+ - OpenHands+O1: [AlexCuadron/SWE-Bench-Verified-O1](https://huggingface.co/datasets/AlexCuadron/SWE-Bench-Verified-O1-native-tool-calling-reasoning-high-results)
+ - SWE-bench S3 submissions: 139 agents via `s3://swe-bench-submissions/verified/`
+
+ ## License
+
+ MIT