VD10 committed on
Commit 7c47838 · verified · 1 Parent(s): 6c78dce

Upload data/validation_report.txt with huggingface_hub

Files changed (1)
  1. data/validation_report.txt +67 -0
data/validation_report.txt ADDED
@@ -0,0 +1,67 @@
+ ======================================================================
+ PatchJudge Validation Report
+ ======================================================================
+
+ 📊 Dataset: 160 examples
+
+ 📈 Score Distribution:
+ Mean: 22.8
+ Median: 0.0
+ Std: 26.9
+
+ Score Distribution:
+ 0-10: ████████████████████████████████████████████████████████████████████████████████████████ (88)
+ 10-20: (0)
+ 20-30: ████████ (8)
+ 30-40: ██████ (6)
+ 40-50: ███████████████████████ (23)
+ 50-60: ███████████████ (15)
+ 60-70: ██████████████ (14)
+ 70-80: █████ (5)
+ 80-90: █ (1)
+ 90-100: (0)
+
+ 🎯 METR Alignment:
+ Test-passing patches below 50.0: 65.0%
+ ⚠️ Too harsh — scoring too many patches as not merge-worthy
+
+ 🔀 Resolved vs Unresolved Separation:
+ Mean score (resolved): 35.6
+ Mean score (unresolved): 1.4
+ Separation: +34.2
+ Correlation: 1.000
+
+ 🚨 Known-Bad Pattern Detection:
+ Detected: 50/50 (100.0%)
+ ✅ Good detection rate
+
+ 📐 Per-Dimension Scores:
+ correctness: mean=2.7 std=3.1 [0-9]
+ completeness: mean=2.0 std=2.3 [0-6]
+ code_quality: mean=2.4 std=2.9 [0-9]
+ non_regression_risk: mean=2.4 std=2.9 [0-9]
+ merge_readiness: mean=1.7 std=2.2 [0-8]
+
+ 🏴 Most Common Flags:
+ 9x partial_fix
+ 8x missing_edge_cases
+ 7x Limited edge case coverage
+ 6x not_production_ready
+ 6x missing_edge_case_handling
+ 5x Style violations
+ 5x minimal_test_coverage
+ 4x style_violations
+ 4x Fundamentally flawed approach
+ 4x Poor code quality
+
+ ⭐ Top 3 Patches:
+ 80.5 django__django-11066 (CoderForge-Qwen3-32B, PASS)
+ 75.0 django__django-11999 (CoderForge-Qwen3-32B, PASS)
+ 74.0 django__django-12143 (CoderForge-Qwen3-32B, PASS)
+
+ 💀 Bottom 3 Patches:
+ 0.0 pydata__xarray-3151 (OpenHands-O1-reasoning-high, FAIL)
+ 0.0 django__django-14792 (OpenHands-O1-reasoning-high, FAIL)
+ 0.0 django__django-11848 (OpenHands-O1-reasoning-high, FAIL)
+
+ ======================================================================
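
The "Resolved vs Unresolved Separation" and "METR Alignment" figures in the report can in principle be recomputed from per-patch scores and pass/fail labels. The sketch below is an assumption about how such numbers might be derived, not PatchJudge's actual code: `pearson` and `separation_stats` are hypothetical helpers, the toy scores are invented for illustration, and only the 50.0 threshold is taken from the report. It also shows a property that may be worth checking against a reported correlation of exactly 1.000: a Pearson correlation computed over just the two group means is always ±1, while the per-patch (point-biserial) value is lower whenever scores vary within a group.

```python
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def separation_stats(scores, resolved, threshold=50.0):
    """Summarise how scores separate resolved from unresolved patches.

    scores    -- per-patch scores (hypothetical input, 0-100 scale)
    resolved  -- per-patch booleans, True if the patch passed its tests
    threshold -- score below which a patch counts as not merge-worthy
    """
    passing = [s for s, r in zip(scores, resolved) if r]
    failing = [s for s, r in zip(scores, resolved) if not r]
    below = sum(1 for s in passing if s < threshold)
    return {
        "mean_resolved": mean(passing),
        "mean_unresolved": mean(failing),
        "separation": mean(passing) - mean(failing),
        "passing_below_threshold_pct": 100.0 * below / len(passing),
        # Point-biserial correlation: Pearson against a 0/1 label.
        "correlation": pearson(scores, [1.0 if r else 0.0 for r in resolved]),
    }


# Toy data, invented for illustration only (not taken from the report).
scores = [80.0, 40.0, 60.0, 20.0, 0.0, 0.0, 10.0, 0.0]
resolved = [True, True, True, True, False, False, False, False]

stats = separation_stats(scores, resolved)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")

# Correlating only the two group means against their labels uses two
# points, so the result is +1 or -1 regardless of within-group spread.
two_point = pearson(
    [stats["mean_resolved"], stats["mean_unresolved"]], [1.0, 0.0]
)
print(f"correlation over group means only: {two_point:.3f}")
```

Computing the correlation per patch rather than per group keeps the statistic sensitive to within-group spread, which is the information the histogram above suggests is present among the test-passing patches.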