======================================================================
PatchJudge Validation Report
======================================================================
📊 Dataset: 160 examples
📈 Score Distribution:
Mean: 22.8
Median: 0.0
Std: 26.9
Score Histogram (examples per 10-point bin):
0-10: ████████████████████████████████████████████████████████████████████████████████████████ (88)
10-20: (0)
20-30: ████████ (8)
30-40: ██████ (6)
40-50: ███████████████████████ (23)
50-60: ███████████████ (15)
60-70: ██████████████ (14)
70-80: █████ (5)
80-90: █ (1)
90-100: (0)
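The bars above can be rebuilt from the bin counts alone. The following is a minimal sketch, assuming one block character per example (which matches the shorter bars shown); the name `render_histogram` is illustrative and not part of PatchJudge.

```python
# Bin counts copied from the report, lowest bin first (0-10 ... 90-100).
COUNTS = [88, 0, 8, 6, 23, 15, 14, 5, 1, 0]


def render_histogram(counts, width=10):
    """Return one text line per bin: label, bar of blocks, count."""
    lines = []
    for i, n in enumerate(counts):
        parts = [f"{i * width}-{(i + 1) * width}:"]
        if n:
            parts.append("█" * n)  # assumption: one block per example
        parts.append(f"({n})")
        lines.append(" ".join(parts))
    return lines


if __name__ == "__main__":
    print("\n".join(render_histogram(COUNTS)))
    print("total examples:", sum(COUNTS))
```

The counts sum to 160, which agrees with the dataset size stated at the top of the report.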
🎯 METR Alignment:
Test-passing patches below 50.0: 65.0%
⚠️ Too harsh: too many test-passing patches scored as not merge-worthy
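A figure like the 65.0% above is the share of test-passing patches whose score falls under the threshold. The sketch below shows one way to compute it. The record fields `score` and `passed`, the strict `<` comparison, and the sample data are all assumptions made for illustration, not PatchJudge's actual schema or data.

```python
def fraction_below(records, threshold=50.0):
    """Share of test-passing records scoring strictly below threshold."""
    passing = [r["score"] for r in records if r["passed"]]
    if not passing:
        return 0.0
    return sum(s < threshold for s in passing) / len(passing)


# Hypothetical records, not taken from the PatchJudge dataset.
sample = [
    {"score": 81.0, "passed": True},
    {"score": 42.0, "passed": True},
    {"score": 0.0, "passed": False},
    {"score": 55.0, "passed": True},
    {"score": 12.5, "passed": True},
]

if __name__ == "__main__":
    print(f"Test-passing patches below 50.0: {fraction_below(sample):.1%}")
```

Failing patches are excluded from the denominator, so a low score on a failing patch does not affect this metric.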
🔀 Resolved vs Unresolved Separation:
Mean score (resolved): 35.6
Mean score (unresolved): 1.4
Separation: +34.2
Correlation: 1.000
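The separation figure is the difference between the two group means. As a consistency check, the overall mean of 22.8 over 160 examples also constrains how many patches fall in each group. This sketch uses only numbers printed in this report. The group sizes it yields are inferred from rounded means, not values the report states.

```python
N = 160                 # dataset size
MEAN_ALL = 22.8         # overall mean score
MEAN_RESOLVED = 35.6    # mean score (resolved)
MEAN_UNRESOLVED = 1.4   # mean score (unresolved)

separation = MEAN_RESOLVED - MEAN_UNRESOLVED

# The overall mean is the size-weighted average of the group means:
#   MEAN_ALL * N = MEAN_RESOLVED * r + MEAN_UNRESOLVED * (N - r)
# Solving for r gives the implied number of resolved examples.
implied_resolved = (MEAN_ALL - MEAN_UNRESOLVED) * N / separation

if __name__ == "__main__":
    print(f"Separation: {separation:+.1f}")
    print(f"Implied resolved count: {implied_resolved:.1f}")
```

With these inputs the estimate comes out near 100 resolved and 60 unresolved examples, subject to rounding in the reported means.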
🚨 Known-Bad Pattern Detection:
Detected: 50/50 (100.0%)
✅ Good detection rate
📐 Per-Dimension Scores:
correctness: mean=2.7 std=3.1 [0-9]
completeness: mean=2.0 std=2.3 [0-6]
code_quality: mean=2.4 std=2.9 [0-9]
non_regression_risk: mean=2.4 std=2.9 [0-9]
merge_readiness: mean=1.7 std=2.2 [0-8]
🏴 Most Common Flags:
9x partial_fix
8x missing_edge_cases
7x Limited edge case coverage
6x not_production_ready
6x missing_edge_case_handling
5x Style violations
5x minimal_test_coverage
4x style_violations
4x Fundamentally flawed approach
4x Poor code quality
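Several flags in the list above differ only in casing and spacing, for example "Style violations" and "style_violations". The sketch below shows one possible normalization pass over the counts shown. The `normalize` helper is hypothetical and not part of PatchJudge.

```python
from collections import Counter

# Flag counts copied from the "Most Common Flags" list.
FLAGS = {
    "partial_fix": 9,
    "missing_edge_cases": 8,
    "Limited edge case coverage": 7,
    "not_production_ready": 6,
    "missing_edge_case_handling": 6,
    "Style violations": 5,
    "minimal_test_coverage": 5,
    "style_violations": 4,
    "Fundamentally flawed approach": 4,
    "Poor code quality": 4,
}


def normalize(flag):
    """Lowercase the flag and join its words with underscores."""
    return "_".join(flag.lower().split())


merged = Counter()
for flag, n in FLAGS.items():
    merged[normalize(flag)] += n

if __name__ == "__main__":
    for flag, n in merged.most_common():
        print(f"{n}x {flag}")
```

This merges only spelling variants. Semantically similar flags such as `missing_edge_cases` and `missing_edge_case_handling` stay separate and would need an explicit alias map to combine.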
⭐ Top 3 Patches:
80.5 django__django-11066 (CoderForge-Qwen3-32B, PASS)
75.0 django__django-11999 (CoderForge-Qwen3-32B, PASS)
74.0 django__django-12143 (CoderForge-Qwen3-32B, PASS)
💀 Bottom 3 Patches:
0.0 pydata__xarray-3151 (OpenHands-O1-reasoning-high, FAIL)
0.0 django__django-14792 (OpenHands-O1-reasoning-high, FAIL)
0.0 django__django-11848 (OpenHands-O1-reasoning-high, FAIL)
======================================================================