======================================================================
PatchJudge Validation Report
======================================================================

📊 Dataset: 160 examples

📈 Score Distribution:
  Mean:   22.8
  Median: 0.0
  Std:    26.9

  Score Distribution:
      0-10: ████████████████████████████████████████████████████████████████████████████████████████ (88)
     10-20:  (0)
     20-30: ████████ (8)
     30-40: ██████ (6)
     40-50: ███████████████████████ (23)
     50-60: ███████████████ (15)
     60-70: ██████████████ (14)
     70-80: █████ (5)
     80-90: █ (1)
    90-100:  (0)

🎯 METR Alignment:
  Test-passing patches below 50.0: 65.0%
  ⚠️ Too harsh: scoring too many patches as not merge-worthy

🔀 Resolved vs Unresolved Separation:
  Mean score (resolved):   35.6
  Mean score (unresolved): 1.4
  Separation:              +34.2
  Correlation:             1.000

🚨 Known-Bad Pattern Detection:
  Detected: 50/50 (100.0%)
  ✅ Good detection rate

📐 Per-Dimension Scores:
  correctness:          mean=2.7  std=3.1  [0-9]
  completeness:         mean=2.0  std=2.3  [0-6]
  code_quality:         mean=2.4  std=2.9  [0-9]
  non_regression_risk:  mean=2.4  std=2.9  [0-9]
  merge_readiness:      mean=1.7  std=2.2  [0-8]

🏴 Most Common Flags:
  9x partial_fix
  8x missing_edge_cases
  7x Limited edge case coverage
  6x not_production_ready
  6x missing_edge_case_handling
  5x Style violations
  5x minimal_test_coverage
  4x style_violations
  4x Fundamentally flawed approach
  4x Poor code quality

⭐ Top 3 Patches:
  80.5  django__django-11066 (CoderForge-Qwen3-32B, PASS)
  75.0  django__django-11999 (CoderForge-Qwen3-32B, PASS)
  74.0  django__django-12143 (CoderForge-Qwen3-32B, PASS)

💀 Bottom 3 Patches:
  0.0  pydata__xarray-3151 (OpenHands-O1-reasoning-high, FAIL)
  0.0  django__django-14792 (OpenHands-O1-reasoning-high, FAIL)
  0.0  django__django-11848 (OpenHands-O1-reasoning-high, FAIL)

======================================================================
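
Note on how the summary figures relate: the separation reported above is the difference of the two group means (35.6 - 1.4 = 34.2), and the METR alignment figure is the share of test-passing patches scoring below the 50.0 threshold. The sketch below shows one way such figures could be derived from per-example records. It is not PatchJudge's actual code: the function name `summarize`, the `(score, resolved)` record shape, and the bucket boundary convention (a score of exactly 10 falls in 10-20) are all assumptions made for illustration.

```python
from statistics import mean


def summarize(records, threshold=50.0, bucket_width=10):
    """Hypothetical helper: summarize (score, resolved) pairs.

    score is assumed to lie in [0, 100]; resolved is True when the
    patch passed its tests. Not taken from PatchJudge itself.
    """
    records = list(records)
    resolved = [score for score, ok in records if ok]
    unresolved = [score for score, ok in records if not ok]

    mean_resolved = mean(resolved) if resolved else 0.0
    mean_unresolved = mean(unresolved) if unresolved else 0.0

    # Histogram with fixed-width buckets; a score of 100 is folded
    # into the last bucket (90-100) rather than opening an 11th one.
    n_buckets = 100 // bucket_width
    histogram = [0] * n_buckets
    for score, _ in records:
        idx = min(int(score // bucket_width), n_buckets - 1)
        histogram[idx] += 1

    # Share of test-passing patches scored below the threshold.
    below = sum(1 for score in resolved if score < threshold)
    passing_below = below / len(resolved) if resolved else 0.0

    return {
        "mean_resolved": mean_resolved,
        "mean_unresolved": mean_unresolved,
        "separation": mean_resolved - mean_unresolved,
        "passing_below_threshold": passing_below,
        "histogram": histogram,
    }


if __name__ == "__main__":
    # Toy data, not drawn from the report above.
    toy = [(80.5, True), (40.0, True), (0.0, False), (10.0, False)]
    print(summarize(toy))
```

Keeping the threshold and bucket width as parameters makes it easy to check how sensitive the "too harsh" verdict is to the 50.0 cutoff.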