
# κ ablation table — calibration v1

Headline metric per dimension: groundedness → AC1, relevance → AC1, completeness → κ. AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold labels are prevalence-skewed enough to make Cohen's κ degenerate: groundedness (29 items labeled 0, 1 labeled 1) and relevance (29 labeled 2, 1 labeled 1). Both metrics show ≥0.95 raw agreement on those rows, but Cohen's κ collapses toward 0 because its chance-agreement term Pe approaches 1 under heavy skew. Completeness keeps Cohen's κ: its gold distribution (23 labeled 2, 5 labeled 1) is balanced enough for κ to behave normally.
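The degeneracy described above is easy to reproduce. A minimal sketch of both statistics for two raters (function names are illustrative, not from the benchmark's codebase): with labels skewed like the groundedness gold, Cohen's κ can be exactly 0 at 29/30 raw agreement while AC1 stays high.

```python
from collections import Counter

def cohen_kappa(a, b):
    # Pe is the product of each rater's own marginals, so it inflates under skew.
    n = len(a)
    pa = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (pa - pe) / (1 - pe) if pe < 1 else float("nan")

def gwet_ac1(a, b):
    # Pe uses averaged marginals pi_k via sum pi_k * (1 - pi_k) / (K - 1),
    # which stays small when one category dominates.
    n = len(a)
    pa = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    cats = set(ca) | set(cb)
    pis = [(ca[k] + cb[k]) / (2 * n) for k in cats]
    pe = sum(p * (1 - p) for p in pis) / (len(cats) - 1)
    return (pa - pe) / (1 - pe)

# Skewed like the groundedness gold: one rater flips a single item to 1.
a = [0] * 30
b = [0] * 29 + [1]
print(cohen_kappa(a, b))  # 0.0, despite 29/30 raw agreement
print(gwet_ac1(a, b))     # ≈0.966
```

Note that κ here is not "low agreement" in any useful sense; it is the chance-correction term eating nearly all of the observed agreement, which is exactly why the skewed rows report AC1 instead.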

| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |
|---|---|---|---|---|---|---|
| baseline | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
| baseline | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| baseline | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% | |
| baseline_no_abstain | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
| baseline_no_abstain | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| baseline_no_abstain | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% | |
| baseline_no_anchors | completeness | κ | 0.623 (-0.054, 1.000) | 26 | 0.0% | |
| baseline_no_anchors | groundedness | AC1 | 0.953 (0.834, 1.000) | 24 | 7.7% | |
| baseline_no_anchors | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% | |
| baseline_no_cot | completeness | κ | 1.000 (1.000, 1.000) | 24 | 7.7% | |
| baseline_no_cot | groundedness | AC1 | 0.897 (0.707, 1.000) | 23 | 11.5% | |
| baseline_no_cot | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% | |
| jury_kappa_weighted | completeness | κ | 0.014 (-0.077, 0.112) | 26 | 0.0% | |
| jury_kappa_weighted | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| jury_kappa_weighted | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% | |
| jury_kappa_weighted_v1_1 | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
| jury_kappa_weighted_v1_1 | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| jury_kappa_weighted_v1_1 | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% | |
| permute | completeness | κ | 0.506 (-0.061, 1.000) | 26 | 0.0% | |
| permute | groundedness | AC1 | 1.000 (1.000, 1.000) | 25 | 3.8% | |
| permute | relevance | AC1 | 0.966 (0.890, 1.000) | 30 | 0.0% | |
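The table reports 95% CIs but does not say how they were computed; a common choice for agreement statistics on small N is a nonparametric percentile bootstrap over items. A minimal sketch under that assumption (the resampling scheme, seed, and `n_boot` are illustrative, not the pipeline's actual settings), shown here with raw agreement as the statistic:

```python
import random

def raw_agreement(a, b):
    # Fraction of items on which the two raters give the same label.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def bootstrap_ci(a, b, stat, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample items with replacement, recompute the
    # statistic each time, and take the alpha/2 and 1-alpha/2 quantiles.
    rng = random.Random(seed)
    n = len(a)
    vals = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        vals.append(stat([a[i] for i in idx], [b[i] for i in idx]))
    vals.sort()
    lo = vals[int((alpha / 2) * len(vals))]
    hi = vals[min(len(vals) - 1, int((1 - alpha / 2) * len(vals)))]
    return lo, hi

# Illustrative labels only, not rows from the table above.
a = [0, 0, 1, 1, 0, 1, 0, 0]
b = [0, 1, 1, 1, 0, 0, 0, 0]
print(bootstrap_ci(a, b, raw_agreement))
```

Skew-driven degenerate resamples (e.g. a bootstrap sample with a single label class) would need explicit handling for κ; raw agreement sidesteps that, which is why it is used for the sketch.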