======================================================================
  PatchJudge Validation Report
======================================================================

📊 Dataset: 160 examples

📈 Score Distribution:
  Mean:   22.8
  Median: 0.0
  Std:    26.9

  Histogram (score range: patch count):
       0-10: ████████████████████████████████████████████████████████████████████████████████████████ (88)
      10-20:  (0)
      20-30: ████████ (8)
      30-40: ██████ (6)
      40-50: ███████████████████████ (23)
      50-60: ███████████████ (15)
      60-70: ██████████████ (14)
      70-80: █████ (5)
      80-90: █ (1)
     90-100:  (0)
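
A minimal sketch of how a histogram like the one above could be produced from raw scores. The function names and the bucket-edge convention (a score of exactly 10 falls into 10-20, a score of 100 into 90-100) are assumptions; the report does not say how PatchJudge assigns boundary values.

```python
def bucket_counts(scores, width=10, top=100):
    """Count scores per [lo, lo + width) bucket (edge convention is an assumption)."""
    n_buckets = top // width
    counts = [0] * n_buckets
    for s in scores:
        # Clamp so that a score equal to `top` lands in the last bucket.
        idx = min(int(s // width), n_buckets - 1)
        counts[idx] += 1
    return counts


def render(counts, width=10):
    """Render one text row per bucket: right-aligned label, bar, then the count."""
    rows = []
    for i, c in enumerate(counts):
        label = f"{i * width}-{(i + 1) * width}"
        rows.append(f"{label:>11}: {'█' * c} ({c})")
    return rows


if __name__ == "__main__":
    for row in render(bucket_counts([0, 0, 25.5, 80.5, 100])):
        print(row)
```

One bar character per patch keeps the bar length equal to the count, which is how the rows above read (for example, 8 characters for a count of 8).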

🎯 METR Alignment:
  Test-passing patches below 50.0: 65.0%
  ⚠️ Too harsh — scoring too many patches as not merge-worthy
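
The alignment figure above reads as the share of test-passing patches that score under the 50.0 cutoff. A sketch of that calculation, assuming parallel lists of scores and pass/fail flags (the function name and input layout are hypothetical, and the cutoff that triggers the "Too harsh" warning is not shown in the report):

```python
def percent_passing_below(scores, passed, threshold=50.0):
    """Percentage of test-passing patches whose score is strictly below `threshold`."""
    passing_scores = [s for s, ok in zip(scores, passed) if ok]
    if not passing_scores:
        return 0.0  # no test-passing patches; nothing to measure
    below = sum(1 for s in passing_scores if s < threshold)
    return 100.0 * below / len(passing_scores)


if __name__ == "__main__":
    # Two passing patches, one of them under the threshold.
    print(percent_passing_below([10, 60, 40, 0], [True, True, False, False]))
```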

🔀 Resolved vs Unresolved Separation:
  Mean score (resolved):   35.6
  Mean score (unresolved): 1.4
  Separation:              +34.2
  Correlation:             1.000
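
The separation value is the difference of the two group means (35.6 - 1.4 = 34.2). The report does not say how the correlation is computed. The sketch below assumes a point-biserial correlation, that is, Pearson's r between each patch's score and a 0/1 resolved label; the function names are hypothetical. Under that definition r reaches exactly 1.0 only when every resolved patch shares one score and every unresolved patch shares another, whereas a Pearson correlation taken over just two points, such as the two group means, is always ±1.

```python
from statistics import mean, pstdev


def separation(scores, resolved):
    """Mean score of resolved patches minus mean score of unresolved patches."""
    res = [s for s, r in zip(scores, resolved) if r]
    unres = [s for s, r in zip(scores, resolved) if not r]
    return mean(res) - mean(unres)


def point_biserial(scores, resolved):
    """Pearson's r between scores and a 0/1 resolved label (assumed definition)."""
    labels = [1.0 if r else 0.0 for r in resolved]
    ms, ml = mean(scores), mean(labels)
    cov = mean([(s - ms) * (l - ml) for s, l in zip(scores, labels)])
    denom = pstdev(scores) * pstdev(labels)
    return cov / denom if denom else 0.0  # undefined for constant input; 0.0 by convention


if __name__ == "__main__":
    scores = [80, 60, 0, 0]
    resolved = [True, True, False, False]
    print(separation(scores, resolved), round(point_biserial(scores, resolved), 3))
```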

🚨 Known-Bad Pattern Detection:
  Detected: 50/50 (100.0%)
  ✅ Good detection rate

📐 Per-Dimension Scores:
                correctness: mean=2.7  std=3.1  [0-9]
               completeness: mean=2.0  std=2.3  [0-6]
               code_quality: mean=2.4  std=2.9  [0-9]
        non_regression_risk: mean=2.4  std=2.9  [0-9]
            merge_readiness: mean=1.7  std=2.2  [0-8]

🏴 Most Common Flags:
     9x  partial_fix
     8x  missing_edge_cases
     7x  Limited edge case coverage
     6x  not_production_ready
     6x  missing_edge_case_handling
     5x  Style violations
     5x  minimal_test_coverage
     4x  style_violations
     4x  Fundamentally flawed approach
     4x  Poor code quality

⭐ Top 3 Patches:
   80.5  django__django-11066  (CoderForge-Qwen3-32B, PASS)
   75.0  django__django-11999  (CoderForge-Qwen3-32B, PASS)
   74.0  django__django-12143  (CoderForge-Qwen3-32B, PASS)

💀 Bottom 3 Patches:
    0.0  pydata__xarray-3151  (OpenHands-O1-reasoning-high, FAIL)
    0.0  django__django-14792  (OpenHands-O1-reasoning-high, FAIL)
    0.0  django__django-11848  (OpenHands-O1-reasoning-high, FAIL)

======================================================================