	======================================================================
	PatchJudge Validation Report
	======================================================================

	📊 Dataset: 160 examples

	📈 Score Summary:
	Mean: 22.8
	Median: 0.0
	Std: 26.9

	Score Distribution:
	0-10: ████████████████████████████████████████████████████████████████████████████████████████ (88)
	10-20: (0)
	20-30: ████████ (8)
	30-40: ██████ (6)
	40-50: ███████████████████████ (23)
	50-60: ███████████████ (15)
	60-70: ██████████████ (14)
	70-80: █████ (5)
	80-90: █ (1)
	90-100: (0)
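
The bin labels above share their edges (10 appears in both "0-10" and "10-20"), so the report does not say which bin a boundary score falls into. A minimal sketch of one way to produce such a histogram, assuming half-open bins [lo, hi), scores in the range 0 to 100, and a score of exactly 100 folded into the last bin. The function names are hypothetical and are not taken from PatchJudge.

```python
from collections import Counter


def bin_label(score, width=10, top=100):
    """Label of the half-open bin [lo, lo + width) that contains score."""
    # Assumption: a score equal to `top` is folded into the last bin.
    lo = min(int(score // width) * width, top - width)
    return f"{lo}-{lo + width}"


def text_histogram(scores, width=10, top=100):
    """Return one text line per bin, with one block character per example."""
    counts = Counter(bin_label(s, width, top) for s in scores)
    lines = []
    for lo in range(0, top, width):
        label = f"{lo}-{lo + width}"
        n = counts.get(label, 0)
        # Empty bins are printed without a bar, matching the layout above.
        lines.append(f"{label}: {'█' * n} ({n})" if n else f"{label}: (0)")
    return lines
```

Under this convention a score of 10.0 is counted in "10-20", not in "0-10".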

	🎯 METR Alignment:
	Test-passing patches below 50.0: 65.0%
	⚠️ Too harsh: too many test-passing patches are scored as not merge-worthy
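
The percentage above is the share of test-passing patches whose score falls under the 50.0 threshold. A minimal sketch of that calculation, assuming each example is available as a (score, passed_tests) pair. That record layout and the function name are hypothetical.

```python
def share_below(records, threshold=50.0):
    """Percentage of test-passing patches scoring strictly below threshold.

    records: iterable of (score, passed_tests) pairs (hypothetical layout).
    """
    passing = [score for score, passed in records if passed]
    if not passing:
        return float("nan")  # undefined when no patch passes its tests
    below = sum(1 for score in passing if score < threshold)
    return 100.0 * below / len(passing)
```

Patches that fail their tests are ignored, and a patch scoring exactly at the threshold is not counted as below it.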

	🔀 Resolved vs Unresolved Separation:
	Mean score (resolved): 35.6
	Mean score (unresolved): 1.4
	Separation: +34.2
	Correlation: 1.000
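
A correlation of exactly 1.000 would require every resolved patch to share one score and every unresolved patch to share another. The reported spread (Std: 26.9) and the histogram rule that out. As a rough cross-check, the sketch below recomputes the value from the summary figures in this report, assuming "Correlation" means the point-biserial correlation between resolved status and score. The report does not state how the value is computed, so that reading is an assumption.

```python
import math

# Figures copied from this report.
n_total = 160
mean_all, std_all = 22.8, 26.9
mean_resolved, mean_unresolved = 35.6, 1.4

# The overall mean is a weighted mix of the two group means:
#   mean_all = p * mean_resolved + (1 - p) * mean_unresolved
# Solving for p gives the implied share of resolved patches.
p = (mean_all - mean_unresolved) / (mean_resolved - mean_unresolved)
n_resolved = round(p * n_total)

# Point-biserial correlation from group means and the overall std.
# The formula expects the population std; if the report uses the sample
# std, the result shifts by well under one percent at n = 160.
r_pb = (mean_resolved - mean_unresolved) / std_all * math.sqrt(p * (1 - p))

print(f"implied resolved patches: {n_resolved} of {n_total}")
print(f"point-biserial correlation: {r_pb:.2f}")
```

With the rounded inputs above, the result is roughly 0.6, not 1.000, and the implied number of resolved patches is about 100.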

	🚨 Known-Bad Pattern Detection:
	Detected: 50/50 (100.0%)
	✅ Good detection rate

	📐 Per-Dimension Scores:
	correctness: mean=2.7 std=3.1 [0-9]
	completeness: mean=2.0 std=2.3 [0-6]
	code_quality: mean=2.4 std=2.9 [0-9]
	non_regression_risk: mean=2.4 std=2.9 [0-9]
	merge_readiness: mean=1.7 std=2.2 [0-8]

	🏴 Most Common Flags:
	9x partial_fix
	8x missing_edge_cases
	7x Limited edge case coverage
	6x not_production_ready
	6x missing_edge_case_handling
	5x Style violations
	5x minimal_test_coverage
	4x style_violations
	4x Fundamentally flawed approach
	4x Poor code quality
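
Several flags above differ only in spelling style, for example "Style violations" and "style_violations", which splits their counts. A minimal sketch of label normalization before counting, assuming flags arrive as free-text strings. The function names are hypothetical and are not taken from PatchJudge.

```python
import re
from collections import Counter


def normalize_flag(flag):
    """Lowercase a flag and collapse non-alphanumeric runs to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", flag.lower()).strip("_")


def merge_flag_counts(counts):
    """Sum the counts of flags that normalize to the same label."""
    merged = Counter()
    for flag, n in counts.items():
        merged[normalize_flag(flag)] += n
    return merged


# Counts copied from the list above (top 10 only).
reported = {
    "partial_fix": 9,
    "missing_edge_cases": 8,
    "Limited edge case coverage": 7,
    "not_production_ready": 6,
    "missing_edge_case_handling": 6,
    "Style violations": 5,
    "minimal_test_coverage": 5,
    "style_violations": 4,
    "Fundamentally flawed approach": 4,
    "Poor code quality": 4,
}
merged = merge_flag_counts(reported)
```

Here merged["style_violations"] is 9. Flags that are similar in meaning but worded differently, such as "missing_edge_cases" and "Limited edge case coverage", stay separate and would need an explicit alias map. Because only the ten most common flags are listed, merged counts are lower bounds.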

	⭐ Top 3 Patches:
	80.5 django__django-11066 (CoderForge-Qwen3-32B, PASS)
	75.0 django__django-11999 (CoderForge-Qwen3-32B, PASS)
	74.0 django__django-12143 (CoderForge-Qwen3-32B, PASS)

	💀 Bottom 3 Patches:
	0.0 pydata__xarray-3151 (OpenHands-O1-reasoning-high, FAIL)
	0.0 django__django-14792 (OpenHands-O1-reasoning-high, FAIL)
	0.0 django__django-11848 (OpenHands-O1-reasoning-high, FAIL)

	======================================================================