agentbench / docs /_generated /kappa_table.md
Nomearod's picture
calibrate(jury): v1.1+v1.1.1 — fix weighting bugs; recency-position paraphrase clause
ab0e054
# κ ablation table — calibration v1
Headline metric per dimension: **groundedness → AC1**, **relevance → AC1**, **completeness → κ**. AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold is prevalence-skewed enough to make Cohen's κ degenerate (groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); both metrics produce ≥0.95 raw agreement on those rows but Cohen's κ collapses to ≈0 because Pe approaches 1. Completeness uses Cohen's κ — its gold (23×`2`/5×`1`) is balanced enough for κ to behave normally.
| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |
|---|---|---|---|---|---|---|
| baseline | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
| baseline | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| baseline | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% | |
| baseline_no_abstain | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
| baseline_no_abstain | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| baseline_no_abstain | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% | |
| baseline_no_anchors | completeness | κ | 0.623 (-0.054, 1.000) | 26 | 0.0% | |
| baseline_no_anchors | groundedness | AC1 | 0.953 (0.834, 1.000) | 24 | 7.7% | |
| baseline_no_anchors | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% | |
| baseline_no_cot | completeness | κ | 1.000 (1.000, 1.000) | 24 | 7.7% | |
| baseline_no_cot | groundedness | AC1 | 0.897 (0.707, 1.000) | 23 | 11.5% | |
| baseline_no_cot | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% | |
| jury_kappa_weighted | completeness | κ | 0.014 (-0.077, 0.112) | 26 | 0.0% | |
| jury_kappa_weighted | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| jury_kappa_weighted | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% | |
| jury_kappa_weighted_v1_1 | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
| jury_kappa_weighted_v1_1 | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
| jury_kappa_weighted_v1_1 | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% | |
| permute | completeness | κ | 0.506 (-0.061, 1.000) | 26 | 0.0% | |
| permute | groundedness | AC1 | 1.000 (1.000, 1.000) | 25 | 3.8% | |
| permute | relevance | AC1 | 0.966 (0.890, 1.000) | 30 | 0.0% | |