Nomearod/agentbench

Nomearod committed 21 days ago

Commit

4158bba

2 Parent(s): 4161c3e 2d9ce3a

Merge remote-tracking branch 'origin/main' into hf-deploy

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.github/workflows/ci.yaml +6 -0
.gitignore +7 -0
DECISIONS.md +701 -0
Makefile +16 -1
README.md +17 -4
agent_bench/core/config.py +3 -0
agent_bench/core/provider.py +22 -6
agent_bench/evaluation/calibration/__init__.py +9 -0
agent_bench/evaluation/calibration/metrics.py +173 -0
agent_bench/evaluation/calibration/report.py +325 -0
agent_bench/evaluation/datasets/calibration_v1.json +158 -0
agent_bench/evaluation/datasets/tech_docs_golden.json +255 -57
agent_bench/evaluation/harness.py +62 -16
agent_bench/evaluation/judges/__init__.py +25 -0
agent_bench/evaluation/judges/base.py +628 -0
agent_bench/evaluation/judges/citation_faithfulness.py +188 -0
agent_bench/evaluation/judges/completeness.py +62 -0
agent_bench/evaluation/judges/groundedness.py +57 -0
agent_bench/evaluation/judges/relevance.py +48 -0
agent_bench/evaluation/metrics.py +9 -85
agent_bench/evaluation/report.py +14 -5
agent_bench/evaluation/rubrics/citation_faithfulness.md +57 -0
agent_bench/evaluation/rubrics/completeness.md +71 -0
agent_bench/evaluation/rubrics/groundedness.md +142 -0
agent_bench/evaluation/rubrics/relevance.md +74 -0
agent_bench/evaluation/variance/__init__.py +9 -0
agent_bench/evaluation/variance/jury.py +181 -0
agent_bench/evaluation/variance/rubric_permute.py +109 -0
agent_bench/serving/static/index.html +235 -0
configs/calibration/rows/baseline.yaml +14 -0
configs/calibration/rows/baseline_no_abstain.yaml +14 -0
configs/calibration/rows/baseline_no_anchors.yaml +13 -0
configs/calibration/rows/baseline_no_cot.yaml +13 -0
configs/calibration/rows/jury_kappa_weighted.yaml +23 -0
configs/calibration/rows/permute.yaml +14 -0
docs/_generated/kappa_table.md +27 -0
docs/judge-design.md +687 -0
docs/plans/2026-05-04-judge-layer-v1-design.md +613 -0
docs/plans/2026-05-04-judge-layer-v1-implementation.md +0 -0
measurements/2026-05-04-judge-calibration-labels.jsonl +90 -0
measurements/2026-05-05-judge-rubric-opus-stress.jsonl +90 -0
measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl +5 -0
measurements/2026-05-06-4a-gpt4o-full-probe.jsonl +5 -0
measurements/2026-05-06-gpt4o-extraction-reasoning-split.md +162 -0
measurements/README.md +1 -0
pyproject.toml +2 -0
results/calibration_v1_judge_baseline.json +0 -0
results/calibration_v1_judge_baseline_no_abstain.json +0 -0
results/calibration_v1_judge_baseline_no_anchors.json +0 -0
results/calibration_v1_judge_baseline_no_cot.json +2115 -0

.github/workflows/ci.yaml CHANGED

@@ -9,6 +9,12 @@ on:
 jobs:
   test:
     runs-on: ubuntu-latest
+    # Explicit empty env: prevents accidental dependency on injected
+    # secrets. Tests use MockProvider and require no API keys; if a
+    # future test imports a provider that needs a key, it will fail
+    # in CI and in any contributor fork the same way (no silent
+    # divergence based on whether secrets are present).
+    env: {}
     steps:
       - uses: actions/checkout@v4

.gitignore CHANGED

@@ -24,6 +24,13 @@ venv/
 logs/
 *.jsonl
+# Evidence-bearing measurement artifacts referenced from DECISIONS.md.
+# Narrow exception to the *.jsonl ignore above. Add new measurement files
+# explicitly here so the audit-trail intent stays opt-in.
+!measurements/*.jsonl
+# Calibration jury/permute sidecars (per-member detail for κ ablation table).
+!results/*.jsonl
 # Opaque binary artifacts — no PDFs in the repo today, and any that
 # appear here are almost always local reference material (downloaded
 # papers, vendor docs) that should not be committed. If a PDF ever

DECISIONS.md CHANGED

@@ -2116,3 +2116,704 @@ the actual container filesystem would have caught it pre-deploy.
 Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
 build infrastructure) but is the right long-term mitigation for this
 class of bug.
+## LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call
+The continuous-score single-call judges in `agent_bench/evaluation/metrics.py`
+(`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted
+and replaced by the per-dimension Judge layer at
+`agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle.
+**Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`.
+**Why this is a supersession, not a refactor.** The new layer differs from
+the old on six axes: discrete-anchored scale (vs continuous 0–1),
+reasoning-before-score JSON ordering (vs score-first), per-dimension
+judges (vs combined faithfulness/correctness), full provenance per call
+(judge_id + rubric_version + system_output_hash + prompt_seed; old had
+none), composable variance wrappers (rubric_permute, jury — old was
+single-call), and an intentional abstain-vs-raise discipline (vs silent
+`None` from a bare `except Exception`).
+**Evidence backing the supersession claim** — the calibration κ table
+quantifies the new layer's agreement with hand-labels across 6 ablation
+rows (baseline + 3 variance ablations + permute + 2-judge jury). The
+files defending this entry's claim, by file path:
+- `measurements/2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3
+  dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the
+  literature ceiling). Lands in Phase 10.
+- `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`,
+  `_baseline_no_anchors.json`, `_baseline_no_abstain.json`,
+  `_permute.json`, `_jury_kappa_weighted.json` — per-row predictions.
+  Land in Phase 11.
+- `docs/_generated/kappa_table.md` — generated κ ablation table copy-
+  pasted into the writeup. Lands in Phase 11.
+- `docs/judge-design.md` — interpretive writeup with the closing
+  "when NOT to use LLM-judge" position. Lands in Phase 12.
+**Config-knob preservation.** `evaluation.judge_provider` is unchanged
+across all 5 YAML configs; new `evaluation.judge_dimensions` field
+defaults to the three v1 dimensions. Zero user-facing config migration.
+**Out of scope (v1.1+).** Mistral self-hosted as the third jury member,
+Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
+prompt optimization, citation_faithfulness in the default
+judge_dimensions, AC2 sympy-derived parity tests.
+## Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran — 2026-05-05
+The Opus stress-test pass over the 30 calibration items × 3 dimensions
+disagreed with the single-rater human gold on **22 of 30 groundedness
+items** (8/30 agreement). Relevance and completeness agreed at 28/30 and
+25/30 respectively. The groundedness disagreement is consistent in
+direction — every disagreed-on item is `human=1, opus=0` — and has a
+single root cause.
+**Root cause: reference-scope drift between rubric author and labeler.**
+`agent_bench/evaluation/rubrics/groundedness.md` defines the reference
+scope as the gold snippets attached to each item:
+> The judge sees only the gold snippets — not the retrieved chunks. A
+> claim that happens to be true in the world but is not entailed by the
+> snippets fails groundedness.
+The single-rater notes on the disagreed-on items describe checking
+against the broader documentation, not against `source_snippets`:
+"supported by the corpus", "supported by the docs", "supported by the
+provided dependency snippet". For items like `k8s_006` the gold snippet
+is one sentence ("A ConfigMap is an API object used to store
+non-confidential data in key-value pairs"), while the agent's answer
+correctly synthesizes seven or eight additional claims from the full
+`k8s_configmap.md`. Those claims are true in the world and well-supported
+by the full doc, but **not entailed by the one snippet**. Opus applied
+the strict-snippet rubric; the human rater applied a corpus-supported
+rubric.
+**Why this blocks `make calibrate` against the current gold.** The κ
+ablation compares Haiku and GPT-4o-mini judges against the human gold.
+A judge that correctly applies the strict-snippet rubric will disagree
+with miscalibrated gold; a judge that's too lenient will agree. The
+ablation rewards leniency and punishes rigor — the opposite of the
+intended measurement. This is the same failure mode codified earlier in
+this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune
+compensation when the measurement is wrong, not the intended effect.
+**Why the rubric stays as written, not relaxed to "corpus-supported".**
+Strict-snippet groundedness measures *RAG behavior*: did the agent
+synthesize from what it retrieved? Corpus-supported groundedness
+measures *LLM general knowledge passing through a RAG harness*: did the
+agent happen to be correct? The first is what this benchmark is for;
+the second is what `agent_bench/evaluation/metrics.py` measured before
+supersession. Relaxing the rubric to "corpus-supported" would silently
+re-introduce the failure mode the supersession entry above just removed.
+**Decision — three-step correction lands before `make calibrate` runs:**
+1. **Rubric clarification commit on `agent_bench/evaluation/rubrics/groundedness.md`.**
+   Add an explicit reference-scope line and one anchored example
+   contrasting "supported by the snippet" vs "true in the world but
+   not in the snippet". Audit-trail requirement: the v1.1 writeup will
+   cite "rubric clarified between v1.0 and v1.1", and the git history
+   needs to back that claim.
+2. **Re-label the 22 disagreed-on groundedness items** in
+   `measurements/2026-05-04-judge-calibration-labels.jsonl` against the
+   clarified rubric, snippet-only. **Do not mechanically copy Opus's
+   labels.** The labels remain the human single-rater's; what changes is
+   the rubric being applied. Mechanical copy would turn the κ table
+   into "judge vs Opus", which is not what the writeup claims it
+   measures.
+3. **Recompute `make calibrate` against the corrected gold** and emit
+   `docs/_generated/kappa_table.md` from the v1.1 labels.
+**Evidence files for the v1.1 writeup section:**
+- `measurements/2026-05-05-judge-rubric-opus-stress.jsonl` — 90 Opus
+  labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min
+  org rate limit, ~$0.20, ~14 min wall, zero infra-abstains).
+- `measurements/2026-05-04-judge-calibration-labels.jsonl` — original
+  v1.0 single-rater gold; will be diffed against v1.1 corrected gold to
+  quantify the re-label delta.
+- `agent_bench/evaluation/rubrics/groundedness.md` — pre/post diff is
+  the rubric clarification.
+**Pre-labeling observations also worth recording for the writeup
+methodology section:**
+- `q021` (fastapi · calculation) answered the CORS preflight question
+  correctly (600 / 60 = 10 minutes) with `sources: []` and
+  `ranked_sources: []` — the agent did the arithmetic without retrieval
+  and emitted an answer consistent with the snippet without having
+  retrieved it. Methodologically interesting for the
+  citation-faithfulness story (Block 2.7) if it ships: an answer can be
+  correct without being grounded-by-citation.
+- `q025` (fastapi · multi_hop) answer was truncated mid-token by the
+  orchestrator's max_tokens limit. The labels reflect what the system
+  produced, not a mentally-patched complete version. The completeness
+  rubric does not currently anchor "truncated response" as a level —
+  v1.1 rubric work should add an anchor.
+- Several K8s items embed external knowledge that's correct but not in
+  the snippet phrasing (`k8s_017` mentions exit-code-0 for init-container
+  success; `k8s_009` describes Roles vs ClusterRoles by their semantics).
+  The clarified groundedness rubric should pick **strict** on this case
+  (claim must be supportable by the retrieved spans, not just consistent
+  with them) and the anchored example should show that ruling.
+**Methodology framing for the writeup.** The Opus stress-test was added
+specifically to catch hand-labeled-gold fragility before the κ table is
+published. It caught it. The writeup's calibration section should
+disclose the rubric clarification, quantify the re-label delta on
+groundedness, and report κ against the v1.1 corrected gold — that is a
+more credible story than a first-try clean κ table would have been.
+**Outcome — 2026-05-05 calibrate run on v1.1 gold.** All 6 ablation rows
+ran cleanly after three coupled production-code fixes that landed on the
+same branch as the rubric clarification: (1) markdown fence stripping in
+`agent_bench/evaluation/judges/base.py::_strip_markdown_fence` because
+Haiku 4.5 wraps JSON output in ` ```json ... ``` `, (2) `max_tokens`
+512 → 1024 because v1.1 anchored examples elicit longer model reasoning,
+(3) calibration runner v1.0 omitted `item_id` from prediction records;
+fixed in v1.1 with backfill of the 6 already-written row files via
+`hash → item_id` map (no re-spend). Probe-one-cell-before-sweep saved a
+fourth $0.50 wasted run after the fence-strip change — the methodology
+note in `feedback_judge_probe_before_sweep.md` was earned by this
+session's two failed full-row attempts that paid ~$1.15 for unparseable
+output before the diagnosis converged.
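The fence-stripping fix in (1) can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `strip_markdown_fence` and the regular expression are assumptions, and the real `_strip_markdown_fence` in `base.py` is not shown in this diff excerpt and may handle more cases.

```python
import json
import re

# Hypothetical pattern: an optional language tag after the opening fence,
# then the payload, then a closing fence. DOTALL lets the payload span lines.
_FENCE_RE = re.compile(r"^\s*```[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def strip_markdown_fence(text: str) -> str:
    """Return the fenced payload, or the stripped text if no fence wraps it."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


# A judge reply wrapped the way the entry describes for Haiku 4.5.
raw = '```json\n{"reasoning": "covers both points", "score": 2}\n```'
parsed = json.loads(strip_markdown_fence(raw))
```

Unfenced replies pass through unchanged apart from whitespace trimming, so the same parser path can serve judges that do and do not wrap their JSON.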
+The κ table at `docs/_generated/kappa_table.md` (regenerated on
+2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for
+completeness — see report.py `_DIM_METRIC`) shows three findings
+that the writeup interprets rather than reports verbatim:
+**v1.1 finding 1 — relevance is not "judges fail" territory.**
+Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the
+29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00.
+AC1 is the load-bearing statistic on relevance and groundedness; both
+metrics agree on completeness where the gold (23×2 / 5×1) is balanced.
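The prevalence degeneracy in finding 1 can be reproduced with a small sketch. Assumptions, none of which come from the repository: the function names, a 3-category scale (0, 1, 2), and a hypothetical judge that scores every item `2`. The gold distribution (29×2 + 1×1) is the one stated above.

```python
from collections import Counter


def cohen_kappa(gold, pred):
    """Cohen's kappa: chance agreement from the product of the two marginals.

    Undefined when chance agreement is exactly 1; this sketch does not guard it.
    """
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    g_counts, p_counts = Counter(gold), Counter(pred)
    p_e = sum(g_counts[k] * p_counts[k] for k in set(gold) | set(pred)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def gwet_ac1(gold, pred, categories):
    """Gwet's AC1: chance agreement from the averaged marginals pi_k."""
    n = len(gold)
    q = len(categories)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    g_counts, p_counts = Counter(gold), Counter(pred)
    pi = [(g_counts[k] + p_counts[k]) / (2 * n) for k in categories]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)


gold = [2] * 29 + [1]   # relevance gold distribution from the entry
pred = [2] * 30         # hypothetical judge: always scores 2

kappa = cohen_kappa(gold, pred)               # 0.0 despite 29/30 raw agreement
ac1 = gwet_ac1(gold, pred, categories=(0, 1, 2))  # about 0.966
```

With one rater constant, Cohen's chance agreement equals the observed agreement, so κ collapses to zero regardless of how high raw agreement is; AC1's chance term stays small under the same skew.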
+**v1.1 finding 2 — `no_cot completeness` agreement is real, not
+selective abstain.** AC1 = κ = 1.000 at n=24. The 2 absent cells
+(`q021`, `k8s_012`) are infrastructure abstains (provider rate-limit
+retry exhaustion), both gold=`2`, neither in baseline's disagreement
+set. On the 24 scored cells, all 4 baseline-with-CoT disagreements
+(3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to
+agreement when CoT is removed. The interview-relevant claim is the
+*opposite* of the conventional CoT-helps story: CoT-before-score on
+3-point completeness lets the judge over-emphasize partial coverage
+and rationalize `1` when the human gold sides with the holistic
+"covers the points" reading.
+**v1.1 finding 3 — `jury_kappa_weighted` underperformed baseline on
+completeness, with a precise mechanism.** Per-member analysis from
+`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`:
+Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%;
+gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw
+26.9% — systematically harsh on the 3-point scale, almost never
+scoring `2`. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9%: its raw
+agreement matches gpt-4o-mini alone exactly because the jury verdict
+reduces to gpt-4o-mini's verdict on every disputed cell.
+The mechanism is *missing-weight + round-down* compounding, not
+weighted voting in the usual sense.
+`scripts/run_calibration.py::_load_weights_from_baseline` is a documented v1 stub that returns
+weight = 1.0 for every judge_id present in baseline. baseline.json
+contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini
+gets 1.0 from `jury.py`'s missing-key fallback (with a logged
+`jury_missing_weight_fallback_to_one` warning per call). Equal
+weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean
+of 1.5; the `_discretize_mean` rule is `frac > 0.5 → ceil else floor`,
+and `0.5 > 0.5` is false, so 1.5 floors to 1. gpt-4o-mini's verdict
+wins every disputed cell. The v1 design doc's risks subsection listed
+"jury κ worse than the better individual judge — (a) kappa-weighting
+wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired
+*both* branches simultaneously: branch (a) because the weighting is a
+stub returning equal weights, and branch (b) because round-down at
+exact 0.5 ties hands the verdict to the lower-scoring member.
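The missing-weight plus round-down interaction described above can be traced in a few lines. This is a sketch under stated assumptions: `discretize_mean` and `weighted_verdict` are illustrative names, and only the rounding rule (`frac > 0.5 → ceil else floor`) and the fallback-to-1.0 behaviour are taken from the text; the actual `jury.py` code is not shown in this excerpt.

```python
import math


def discretize_mean(mean: float) -> int:
    """Ceil only when the fractional part is strictly greater than 0.5."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def weighted_verdict(scores: dict[str, int], weights: dict[str, float]) -> int:
    """v1 behaviour as described: a member absent from `weights` gets 1.0."""
    total = sum(weights.get(judge, 1.0) for judge in scores)
    mean = sum(s * weights.get(judge, 1.0) for judge, s in scores.items()) / total
    return discretize_mean(mean)


# The stub yields a weight only for the judge present in the baseline file.
stub_weights = {"haiku": 1.0}
disputed_cell = {"haiku": 2, "gpt-4o-mini": 1}

# Mean is (2*1.0 + 1*1.0) / 2.0 = 1.5; frac is 0.5, not > 0.5, so it floors.
verdict = weighted_verdict(disputed_cell, stub_weights)  # 1
```

The lower-scoring member wins the tie without ever having been assigned a weight, which is why the fallback amplifies rather than suppresses it.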
+The deeper structural point is that weighting alone cannot rescue a
+systematically miscalibrated member. Even held-out validation that
+correctly assigned gpt-4o-mini's true low weight on completeness
+would still let it dominate disputed ties unless its weight were
+driven near zero — and at that point exclusion is more honest than
+near-zero inclusion. The conservative-on-binary "ties to lower" rule
+also doesn't transfer cleanly to ordinal scales: on completeness,
+"conservative" means scoring *toward incomplete*, which is precisely
+the direction of gpt-4o-mini's bias.
+**v1.2 fix list (four items, expanding the earlier two-item list):**
+1. **Held-out jury weights.** Replace the
+   `_load_weights_from_baseline` stub with a real κ-derived
+   computation, evaluated on a *held-out validation set* — not the
+   same calibration row whose κ is being measured against the gold.
+   Closes the circular-weighting hole.
+2. **Symmetric member coverage in the weights source.** Missing-member
+   fallback to weight = 1.0 amplifies an unweighted member rather than
+   suppressing it. Either every jury member must have a weight in the
+   source file or the run must abort. The
+   `jury_missing_weight_fallback_to_one` warning fired loudly on every
+   call this run; in v1.2 it should be a hard error.
+3. **Per-dimension member exclusion when individual κ falls below a
+   threshold.** gpt-4o-mini at κ = 0.020 on completeness should not be
+   in the completeness jury at all. Weights below a floor (suggested
+   κ < 0.2) should be treated as exclusion, not as small-weight
+   inclusion. Held-out validation fixes circular weighting; it does
+   not fix systematic member bias.
+4. **Per-dimension tie-break rule.** v1's `_discretize_mean` rule
+   (ties to lower) was selected for conservative behavior on binary
+   scales, where "conservative" means scoring 0 on uncertainty. On
+   3-point completeness, "conservative" means scoring toward
+   *incomplete*, which interacts badly with member miscalibration.
+   v1.2 should select the tie-break rule per-dimension based on the
+   rubric's conservative direction, not globally.
+**Evidence files:** `docs/_generated/kappa_table.md` (regenerated with
+AC1 for groundedness/relevance, κ for completeness);
+`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`
+(per-member sidecar where the gpt-4o-mini completeness bias is
+visible per item); `results/calibration_v1_judge_baseline.json`
+(weights source — note the absence of any gpt-4o-mini-2024-07-18
+entries, which is why the missing-weight fallback fires).
+## v1.1 jury rescue — sharpened diagnostic + pre-committed A+B success criteria
+**Date:** 2026-05-06. **Status:** in-flight; this entry is the pre-experiment
+contract that pins down what counts as success before the re-aggregation
+runs, so the outcome can't be negotiated post-hoc.
+**Sharpened diagnostic — extraction-vs-reasoning split, not just "model is
+biased".** Re-reading the per-member sidecar (item-level, not aggregate)
+on the gpt-4o-mini completeness disputes shows a more specific failure
+mode than "harsh on 3-point". On the three representative gold=2 / Haiku=2
+/ gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's `evidence_quotes`
+field correctly extracts the paraphrased coverage from the agent answer
+— and then its `reasoning` field denies that those very quotes constitute
+coverage. k8s_002 is the cleanest instance: the model quotes the strings
+"declarative updates" and "sticky identity" into evidence, then writes
+"the answer does not explicitly mention 'declarative updates' and 'sticky
+identity'". The score follows the reasoning, not the evidence. The
+mechanism is that the model's *post-extraction reasoning step* applies a
+literal-string-match standard to the answer text while the rubric
+requires "paraphrase allowed" — i.e., the structured-output discipline
+forced an extraction step that the reasoning step then contradicted on
+autopilot. This is a known failure mode in chain-of-thought judges and
+shows up more in smaller models because the reasoning step has less
+capacity to integrate the rubric's instruction with the literal-text
+comparison the model is running by default. The artifact for the writeup
+is `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` (three
+side-by-side reasoning + evidence_quotes excerpts).
+**Pragmatic v1.1 weights-source decision.** The v1.2 fix-list above
+specifies a held-out validation set for jury weights — methodologically
+clean but requires either splitting N=30 (loses statistical power on
+both halves) or labeling more items (eats interview prep time). v1.1
+chooses pragmatic: weights computed from the same calibration set used
+for κ reporting, with the circularity flagged in the writeup. Reason:
+(a) the alternative is splitting N=30, (b) the per-member κ values used
+as weights are internally consistent, (c) v1.2 will use a held-out 20-
+item set. The writeup will contain a sentence acknowledging the
+circularity rather than hiding it.
+**v1.1 elevated fix-list (subset of the v1.2 list above).** Item 2
+(symmetric coverage / hard-error) is elevated unconditionally. Item 1
+(real κ-derived weights) is elevated in pragmatic form (same set with
+circularity caveat). Items 3 (per-dimension exclusion) and 4 (per-
+dimension tie-break) remain v1.2 unless B's outcome forces them up.
+**Pre-committed B success criteria.** Plan B is "re-aggregate the existing
+164 member-rows in `calibration_v1_judge_jury_kappa_weighted_members.jsonl`
+with corrected κ-derived weights, no new API spend." The outcome maps
+deterministically to one of three predefined responses, picked *before*
+B runs:
+- **Outcome 1 — jury κ on completeness exceeds Haiku-baseline κ by ≥
+  0.05** (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story:
+  "v1's weights-source bug masked correct aggregation; once both bugs
+  (asymmetric coverage + missing-weight fallback) are fixed, the jury
+  improves on baseline. Per-dimension exclusion remains a v1.2 design
+  pattern but is not needed at v1.1." This is the strong story.
+- **Outcome 2 — jury κ within ±0.05 of Haiku-baseline** (i.e., 0.366 ≤
+  jury κ ≤ 0.466). Writeup story: "weights-source fix recovers parity
+  but the jury isn't doing meaningful work on completeness — gpt-4o-
+  mini's near-zero weight makes it effectively excluded by aggregation.
+  This is *soft exclusion via weighting*; v1.2 will make exclusion
+  explicit." Defensible but less clean.
+- **Outcome 3 — jury κ falls below Haiku-baseline κ by >0.05** (i.e.,
+  jury κ < 0.366). Writeup story: "weights-source fix is necessary but
+  not sufficient; even at near-zero weight gpt-4o-mini's verdict tips
+  disputed (1, 2) ties due to the round-down rule. v1.1 escalates to
+  per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1.
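The three pre-committed outcomes amount to a deterministic mapping from the jury κ to a response, which could be sketched as below. The function name and defaults are illustrative. One assumption is flagged: the text lists κ = 0.466 under both Outcome 1 and Outcome 2, and this sketch resolves that boundary in favour of Outcome 1.

```python
def classify_b_outcome(
    jury_kappa: float,
    baseline_kappa: float = 0.416,  # Haiku-alone completeness kappa from the entry
    margin: float = 0.05,
) -> int:
    """Map a jury kappa to pre-committed outcome 1, 2, or 3."""
    delta = jury_kappa - baseline_kappa
    if delta >= margin:
        return 1  # jury beats baseline by at least the margin
    if delta < -margin:
        return 3  # jury falls below baseline by more than the margin
    return 2      # within the margin: parity


# Hypothetical values, one per branch (not measured results).
examples = {k: classify_b_outcome(k) for k in (0.50, 0.40, 0.30)}
```

Values that sit exactly on a boundary are sensitive to floating-point rounding of the subtraction, so a real implementation would probably compare rounded values or use an explicit tolerance.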
+**Why the predefined-criteria framing matters.** "I ran B, looked at the
+number, decided it was good enough" is the same data with a weaker frame
+than "I predefined the success criteria before running the experiment, B
+landed at outcome X, which mapped to predefined response Y". The latter
+demonstrates evaluation maturity in the writeup; the former invites
+post-hoc reading of the outcome.
+**B outcome — 2026-05-06.** Plan B re-aggregated the existing 164 sidecar
+rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on
+completeness; clipped at 0 from raw κ values). Result: **jury κ on
+completeness = 0.416**, exactly matching Haiku-baseline. Δ = 0.000;
+maps to **Outcome 2 (soft exclusion via weighting)**. Per the
+pre-committed response, v1.1 stops here and writes up; per-dimension
+member exclusion (item C / v1.2 fix #3) is not escalated to v1.1.
+Mechanism, validated empirically: a disputed cell (Haiku=2, gpt=1)
+with corrected weights aggregates as
+`(2 × 0.416 + 1 × 0.020) / 0.436 = 1.954`. The fractional part (0.954)
+exceeds 0.5, so the rounding rule ceils to 2, giving the correct
+verdict. v1's two compounding bugs (the asymmetric source returning
+weight=1.0 for Haiku and the missing-key fallback returning 1.0 for
+gpt-4o-mini) jointly forced equal weights. Under the same rounding
+rule, equal weights produced `(2 × 1 + 1 × 1) / 2 = 1.5`, whose
+fractional part is exactly 0.5 and so does not exceed 0.5; the mean
+floored to 1 and gpt's verdict won every disputed cell. The bug fixes
+recover the right verdict purely mechanically; no judge model behavior
+changes.
+The empirical reading: the weighting is *not doing meaningful work* —
+gpt-4o-mini's near-zero weight effectively excludes it on completeness,
+and the jury's κ matches Haiku-alone exactly because Haiku's verdict
+wins every disputed cell. This is "soft exclusion via weighting"; v1.2's
+explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the
+exclusion visible in the jury config rather than emergent from κ-derived
+weight collapse.
+**v1.1 code changes (this commit):**
+- `agent_bench/evaluation/variance/jury.py` — silent missing-weight
+  fallback to 1.0 → hard `ValueError`. Two existing tests that asserted
+  the old contract (`test_kappa_weighted_reasoning_reports_applied_weights_not_dict`,
+  `test_kappa_weighted_logs_warning_on_missing_weight`) updated to
+  assert the new contract.
+- `scripts/run_calibration.py::_load_weights_from_baseline` →
+  `_compute_kappa_weights` — replaces the v1 stub with real per-judge
+  Cohen's κ on the dimension; hard-errors when any expected member is
+  missing from the source. Clips κ < 0 to weight = 0 (soft exclusion).
+- `configs/calibration/rows/jury_kappa_weighted.yaml` — `weights_source`
+  re-pointed from `calibration_v1_judge_baseline.json` (Haiku-only,
+  asymmetric coverage) to
+  `calibration_v1_judge_jury_kappa_weighted_members.jsonl` (both judges,
+  same calibration set with documented circularity).
+- `tests/scripts/test_run_calibration_dispatch.py` — two new tests cover
+  `_compute_kappa_weights`: (a) computes real κ (high-agreement judge →
+  weight=1.0, chance-agreement judge → 0); (b) hard-errors on
+  asymmetric source coverage.
+- `results/calibration_v1_judge_jury_kappa_weighted_v1_1.json` — new
+  predictions row produced by re-aggregating the existing sidecar
+  offline (no API spend; via `scripts/_dev/reaggregate_jury_v1_1.py`).
+  `docs/_generated/kappa_table.md` regenerated with this row alongside
+  the broken v1 row, giving the writeup a clean before/after diff
+  (completeness: 0.014 → 0.416, n=26).
+- `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` — the
+  three side-by-side reasoning + evidence_quotes excerpts (q006 /
+  k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split
+  diagnostic finding.
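The v1.1 contract in the list above (κ-derived weights, negative κ clipped to zero, hard error on a missing member) can be sketched as follows. All names here are hypothetical; the real `_compute_kappa_weights` and `jury.py` are not shown in this excerpt. The zero-total guard is an added assumption, since the text does not say what happens when every weight clips to zero.

```python
import math


def kappa_weights(member_kappas: dict[str, float], expected: list[str]) -> dict[str, float]:
    """Turn per-judge kappa into weights; fail loudly on asymmetric coverage."""
    missing = [m for m in expected if m not in member_kappas]
    if missing:
        raise ValueError(f"no kappa available for jury members: {missing}")
    # Clip negative kappa to 0 (soft exclusion).
    return {m: max(0.0, member_kappas[m]) for m in expected}


def jury_score(scores: dict[str, int], weights: dict[str, float]) -> int:
    """Weighted mean, then ceil only when the fractional part is > 0.5."""
    missing = [j for j in scores if j not in weights]
    if missing:
        raise ValueError(f"missing weight for jury members: {missing}")
    total = sum(weights[j] for j in scores)
    if total == 0:
        raise ValueError("all member weights are zero")  # assumption, see lead-in
    mean = sum(scores[j] * weights[j] for j in scores) / total
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


members = ["haiku", "gpt-4o-mini"]
weights = kappa_weights({"haiku": 0.416, "gpt-4o-mini": 0.020}, members)

# (2*0.416 + 1*0.020) / 0.436 = 1.954 -> frac 0.954 > 0.5 -> ceil -> 2
verdict = jury_score({"haiku": 2, "gpt-4o-mini": 1}, weights)
```

Raising in both functions means an incomplete weights source stops the run at load time, and a jury call with an unexpected member stops at aggregation time, rather than either case degrading into equal weights.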
+The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1
+(pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2.
+## Plan 3A — recency-positioned paraphrase instruction (pre-committed criteria)
+**Date:** 2026-05-06. **Status:** in-flight; this entry pins down the
+hypothesis and success criteria before the experiment runs.
+**Hypothesis sharpened by the 1A direction-of-bias finding.** GPT-4o-
+mini's completeness disagreements are 17/19 gold=2/pred=1 with zero
+up-mistakes across 26 items spanning two corpora — direction-aware noise,
+not balanced random labeling. The model is consistently applying *some*
+rule stricter than the rubric requires. The hypothesis under test: that
+stricter rule is "literal-string match required, paraphrase doesn't
+count," and the bias is fixable by recency-positioning the rubric's
+"paraphrase allowed" instruction adjacent to the commit-to-score
+decision instead of leaving it 500+ tokens upstream in the rubric body.
+**The intervention is positional, not lexical.** The current
+`CompletenessJudge` prompt (`agent_bench/evaluation/judges/completeness.py`)
+sends the rubric body, then the gold reference, then the system answer,
+then a one-line "Score this answer..." instruction immediately followed
+by the JSON schema clause. The rubric body's "paraphrase allowed" clause
+appears in the introductory paragraphs, hundreds of tokens before the
+score decision. The intervention adds one sentence between the system
+answer and the score instruction:
+> *"Note: a paraphrase that captures the same meaning as a gold-answer
+> point counts as covered. Score on content equivalence, not surface
+> form."*
+This is the recency-positioning hypothesis: the model loses the
+paraphrase conditioning across the rubric anchors and the reasoning
+step. Restating the instruction adjacent to the score decision tests
+whether the bias is positionally correctable.
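A sketch of the positional intervention, assuming the prompt is assembled from ordered sections. The function name, the section labels, and the placeholder strings are hypothetical: the source elides the full wording of the "Score this answer..." instruction and the JSON schema clause, so both are left as placeholders rather than guessed. Only the note text is quoted from the entry.

```python
PARAPHRASE_NOTE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)

# Placeholders: the real wording is elided in the source.
SCORE_INSTRUCTION = "Score this answer..."
JSON_SCHEMA_CLAUSE = "<JSON schema clause>"


def build_completeness_prompt(
    rubric: str, gold_reference: str, system_answer: str, *, recency_note: bool = True
) -> str:
    """Assemble rubric -> gold -> answer -> (note) -> score instruction."""
    parts = [
        rubric,
        f"Gold reference:\n{gold_reference}",
        f"System answer:\n{system_answer}",
    ]
    if recency_note:
        # The only change under test: restate the instruction next to the decision.
        parts.append(PARAPHRASE_NOTE)
    parts.append(f"{SCORE_INSTRUCTION} {JSON_SCHEMA_CLAUSE}")
    return "\n\n".join(parts)


treated = build_completeness_prompt("RUBRIC BODY", "gold points", "agent answer")
control = build_completeness_prompt("RUBRIC BODY", "gold points", "agent answer", recency_note=False)
```

Keeping the note behind a flag makes the control and treated prompts differ by exactly one sentence, which is what lets a shift in scores be attributed to position rather than to other prompt edits.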
+**Selected 5 disputed items** (representative of the gold=2 / Haiku=2 /
+gpt=1 pattern across both corpora): `q006`, `q011`, `k8s_002`, `k8s_006`,
+`k8s_018`. All five are pure paraphrase-coverage cases (the system
+answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored
+1 with the extraction-vs-reasoning split documented in
+`measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`).
+**Pre-committed 3A success criteria.**
+- **Fixed (≥3/5 shift from 1 → 2):** Recency-positioning is sufficient.
+  Re-run GPT-4o-mini on all 26 scored completeness items with the corrected
+  prompt, recompute κ, update the writeup table. Story: "rubric-
+  engineering matters more than judge model choice for ordinal scales —
+  recency-positioning the paraphrase instruction recovered N% of
+  disputed items." The completeness story becomes actionable, not
+  diagnostic-only.
+- **Partially fixed (1–2/5 shift):** Inconclusive at N=5 (the
+  binomial-significance line is ~3+). Re-run on all 26 scored completeness
+  items (~$0.20) to get a clean number; write up whatever the full-26 says.
+- **Not fixed (0/5 shift):** The instruction is being received and
+  ignored — the model can't act on it under reasoning load. Escalate
+  to 4A (GPT-4o full on the same 5 items) to verify the small-model-
+  specific claim. Story: "repositioning the paraphrase instruction
+  adjacent to the score decision did not shift any of 5 disputed items;
+  GPT-4o handled the same prompts. The bias is small-model-specific,
+  not prompt-fixable."
+The 3/5 threshold is the binomial-significance line at this N — random
+shifting under the null produces 0 or 1 changes most of the time. Pre-
+committing avoids the "2 shifted, that's kind of a fix" negotiation.
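The "0 or 1 changes most of the time" claim can be checked with a binomial tail, given a null per-item shift rate. The entry does not state that rate, so the values of `p` below are assumptions chosen only to illustrate how the 3/5 line behaves.

```python
from math import comb


def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Probability of seeing >= 3 shifts out of 5 by chance, under assumed null rates.
tail_p10 = binom_tail(5, 3, 0.10)  # about 0.0086
tail_p20 = binom_tail(5, 3, 0.20)  # about 0.0579

# Probability of 0 or 1 shifts at the lower assumed rate.
at_most_one_p10 = 1 - binom_tail(5, 2, 0.10)  # about 0.9185
```

The threshold is only as strong as the assumed null rate: at a 10% per-item rate three shifts would be rare, while at 20% they occur by chance in roughly 6% of probes.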
+**On the 1A relevance finding — confirmed.** Both judges essentially
+correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ
+degeneracy is structural under 29/30 prevalence at class-2; AC1 +
+raw agreement is the right reporting. No further investigation on
+relevance. Writeup paragraph is one short sentence: prevalence-induced
+degeneracy → AC1 is load-bearing.
+## Plan 3A — outcome on the 5-item probe + full-26 re-run (v1.1.1)
+**Date:** 2026-05-06. **Status:** complete; the v1.1.1 prompt is now
+permanent in `agent_bench/evaluation/judges/completeness.py`.
+**3A 5-item probe:** 3/5 disputed items shifted 1 → 2 (q006, q011,
+k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. This met the
+pre-committed threshold (≥3/5 → "fixed"), so the protocol triggered the
+full-26 re-run on gpt-4o-mini only (Haiku held as control so that the
+v1.1 → v1.1.1 delta is cleanly attributable to the intervention's effect
+on the affected judge).
+**Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):**
+|                              | n  | raw    | κ      | AC1    |
+|------------------------------|----|--------|--------|--------|
+| v1.1   gpt-4o-mini           | 26 | 26.9%  | +0.020 | +0.006 |
+| **v1.1.1 gpt-4o-mini**       | 28 | **42.9%** | **+0.000** | **+0.232** |
+| v1.1   Haiku (control)       | 26 | 84.6%  | +0.416 | +0.792 |
+**Per-item delta (v1.1 → v1.1.1):** 7 items shifted up (all 1 → 2),
+0 shifted down, 19 unchanged. Of the 7 up-shifts: 6 are correct (gold=2
+items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016,
+k8s_017, q006), 1 is a regression (k8s_025: gold=1, was correctly pred=1
+in v1.1, now over-credited at pred=2). Net per-item correctness delta:
++5 items.
+**Cohen's κ is misleading on this comparison.** v1.1.1 raw agreement
+rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from
+0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0 — slightly
+*lower* than v1.1's 0.020. The mechanism is prevalence-rebalancing in
+the marginals: gpt-4o-mini's pred distribution shifted from `{0:2, 1:19,
+2:5}` (concentrated at 1) to `{0:4, 1:12, 2:12}` (more balanced, closer
+to gold's `{1:5, 2:23}` over n=28). Cohen's κ = `(P_o - P_e)/(1 - P_e)`;
+when marginals become more diverse, P_e (chance agreement) rises in
+lockstep with P_o (observed agreement), and κ deflates. AC1 uses
+prevalence-robust chance correction (`P_e = (1/(q-1)) Σ pi_k(1-pi_k)`)
+and reads the actual signal.
+This is the same trap that motivated AC1 over κ on the relevance and
+groundedness rows of the original κ table, surfacing here at a
+different distribution boundary. The κ table footer already explains
+why per-dimension metric selection matters; v1.1.1's outcome
+demonstrates the trap *induced by the intervention itself*.
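
A sketch, not part of the commit: the v1.1.1 row can be reproduced from the marginals quoted above. It assumes 12 of 28 agreements (what 42.9% at n=28 implies) and uses the two chance-correction formulas as written in the text. The function name is hypothetical; exact fractions avoid float noise.

```python
from fractions import Fraction


def kappa_and_ac1(agreements: int, gold_counts: dict, pred_counts: dict):
    """Return (Cohen's kappa, Gwet's AC1) from agreement count and marginals."""
    n = sum(gold_counts.values())
    if n != sum(pred_counts.values()):
        raise ValueError("gold and pred marginals must cover the same items")
    classes = sorted(set(gold_counts) | set(pred_counts))
    q = len(classes)
    p_o = Fraction(agreements, n)
    gold = {k: Fraction(gold_counts.get(k, 0), n) for k in classes}
    pred = {k: Fraction(pred_counts.get(k, 0), n) for k in classes}
    # Cohen: chance agreement is the product of the two raters' marginals.
    p_e_kappa = sum(gold[k] * pred[k] for k in classes)
    # Gwet AC1: chance agreement from the mean marginal per class.
    pi = {k: (gold[k] + pred[k]) / 2 for k in classes}
    p_e_ac1 = sum(pi[k] * (1 - pi[k]) for k in classes) / (q - 1)
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return kappa, ac1


# Marginals quoted in DECISIONS.md for v1.1.1 gpt-4o-mini completeness.
kappa, ac1 = kappa_and_ac1(
    agreements=12,                      # ASSUMPTION: 42.9% of n=28
    gold_counts={1: 5, 2: 23},
    pred_counts={0: 4, 1: 12, 2: 12},
)
print(float(kappa), round(float(ac1), 3))  # 0.0 0.232
```

Because gold never uses class 0 and the predictions put 12/28 on each of classes 1 and 2, Cohen's P_e works out to exactly 12/28, the same as P_o, so κ is exactly zero while AC1 lands on the reported +0.232.
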
+**Effect on the jury aggregate.** With κ-derived weights and gpt-4o-
+mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury
+verdict on completeness is now mathematically equivalent to Haiku-alone
+on every item (gpt's contribution is multiplied by zero). Jury κ stays
+at 0.416, identical to v1.1's corrected aggregate. The intervention's
+per-member improvement is *invisible at the jury level* under this
+weighting scheme.
+**Methodological consequence — v1.2 fix-list addition.** The v1.2 fix-
+list now expands by one item:
+5. **Prevalence-robust weights for prevalence-skewed dimensions.**
+   v1.1's `_compute_kappa_weights` uses Cohen's κ for every dimension,
+   which has a *self-defeating property* on prevalence-skewed gold:
+   improving a member can lower its weight even as it gets more
+   accurate.
+   **Mechanism.** Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where
+   `P_e = Σ_k P(gold=k) × P(pred=k)` is the chance-agreement term
+   computed from the marginal distributions. P_e is *not* invariant to
+   the predictor's marginal distribution — when a member's predictions
+   become more diverse (less concentrated at one class), P_e *rises*
+   as the marginals approach gold's marginals. Concretely: when an
+   intervention moves a member's pred distribution from concentrated-
+   at-one-class toward gold's distribution, P_o and P_e rise together
+   in lockstep. The numerator `P_o - P_e` stays small, and κ deflates
+   even as raw accuracy improves. This is the same prevalence-induced
+   degeneracy that motivated AC1 over κ on relevance/groundedness rows
+   in the κ table — it surfaces in jury weighting at any
+   distribution-shifting intervention's boundary.
+   **Empirically observed in v1.1.1.** The recency-positioning
+   intervention shifted gpt-4o-mini completeness pred dist from
+   `{0:2, 1:19, 2:5}` to `{0:4, 1:12, 2:12}`, closer to gold's
+   `{1:5, 2:23}` over n=28. Per-cell raw agreement 26.9% → 42.9%.
+   AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×).
+   Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o
+   ≈ 0.43. v1.1's `_compute_kappa_weights` clips the new κ at zero,
+   producing weight = 0 — and the jury aggregate loses access to a
+   member that was empirically improved. The intervention's per-
+   member improvement is invisible at the jury level under κ-weighting.
+   **Architectural decomposition for v1.2.** The right separation:
+   - **Per-dimension metric for κ table reporting** (already in v1.1
+     via `agent_bench/evaluation/calibration/report.py::_DIM_METRIC`).
+   - **Per-dimension weight metric for jury aggregation** (new in
+     v1.2, reuses `_DIM_METRIC`). Use κ where the gold's prevalence
+     supports it, AC1 where κ degenerates. Same lookup, same per-
+     dimension policy at both reporting and weighting layers.
+   - **Per-dimension membership as explicit configuration override**
+     for members that are structurally inappropriate (v1.2 fix #3,
+     unchanged) — distinct from "low score on the chosen metric,"
+     which is handled by the weight floor.
+   **Why this is non-obvious.** A reader's first instinct is that
+   "weight by κ" is a sensible default — κ is *the* standard inter-
+   rater statistic. The self-defeating property is invisible until
+   you observe a real intervention that shifts marginals; in static
+   conditions (no intervention, fixed prompts), the κ-weight choice
+   is benign. The v1.1.1 outcome is the first time the agent-bench
+   calibration set has produced an intervention-induced marginal
+   shift on the same gold; the failure mode wouldn't have been
+   visible in v1.0's static calibration sweep.
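
A sketch, not part of the commit: it shows how the metric chosen for weighting changes gpt-4o-mini's share of the completeness jury. It does not reproduce `_compute_kappa_weights`. Clip-at-zero comes from the text above; normalising to sum 1 and the equal-weight fallback are assumptions, and `clipped_weights` is a hypothetical name.

```python
def clipped_weights(scores: dict[str, float], floor: float = 0.0) -> dict[str, float]:
    """Clip each member's agreement score at `floor`, then normalise to sum 1."""
    clipped = {member: max(score, floor) for member, score in scores.items()}
    total = sum(clipped.values())
    if total == 0:
        # ASSUMPTION: fall back to equal weights when every member clips to zero.
        return {member: 1.0 / len(clipped) for member in clipped}
    return {member: value / total for member, value in clipped.items()}


# Per-member completeness scores quoted in DECISIONS.md.
kappa_scores = {"claude-haiku": 0.416, "gpt-4o-mini": 0.000}
ac1_scores = {"claude-haiku": 0.792, "gpt-4o-mini": 0.232}

weights_by_kappa = clipped_weights(kappa_scores)
weights_by_ac1 = clipped_weights(ac1_scores)

print(weights_by_kappa)  # gpt-4o-mini contributes nothing
print(weights_by_ac1)    # gpt-4o-mini keeps roughly a fifth of the vote
```

Under κ weighting the jury collapses to Haiku alone; under AC1 weighting the same two members split the vote roughly 77/23, which is the behaviour the proposed v1.2 per-dimension weight metric is meant to recover.
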
+**v1.1.1 code changes (this commit):**
+- `agent_bench/evaluation/judges/completeness.py` — adds
+  `PARAPHRASE_RECENCY_CLAUSE` constant, inserted between the system
+  answer and the score instruction. Comment cites the 3A probe.
+- `tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt`
+  — extends to assert the recency clause appears AND is positioned
+  between the answer and the score instruction (position is load-
+  bearing, not just lexical inclusion).
+- `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl`
+  — merged sidecar: v1.1 groundedness/relevance rows (unchanged
+  judges) + fresh v1.1.1 gpt-4o-mini completeness rows + v1.1 Haiku
+  completeness rows.
+- `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl` — the
+  5-item probe artifact with reasoning + evidence_quotes for each.
+- `scripts/_dev/probe_3a_paraphrase_recency.py`,
+  `scripts/_dev/rerun_completeness_v1_1_1.py` — reproducers; not
+  part of the production calibration runner.
+**No changes to the κ table.** The jury aggregate κ on completeness is
+unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy
+described above; adding a `jury_kappa_weighted_v1_1_1` row with
+identical numbers would be visual noise. The v1.1.1 finding lives in
+the writeup body, not the table — the per-member AC1 improvement
+(0.006 → 0.232) is the headline number, surfaced as a separate
+paragraph next to the κ table rather than inside it.
+**Total spend through Plan 3A:** $0.0013 (3A probe) + $0.0075 (full-26
+re-run) = $0.0088.
+## Plan 4A — GPT-4o (full) on the v1.1.1 residual
+**Date:** 2026-05-06. **Status:** complete. Run after the writeup-
+framing review surfaced that v1.1.1's "fixed" verdict was overclaim-
+prone — 5/19 items were recovered, 14 remained unchanged and
+uncharacterized. 4A was originally scoped as conditional on 3A *not*
+being fixed (per the predefined sequencing rule), but became valuable
+as a *post-3A* diagnostic to characterize the residual: is it small-
+model-specific or rubric-under-specified?
+**Scope.** GPT-4o (`gpt-4o-2024-08-06`) on 5 of the 14 v1.1.1-unchanged
+items: `k8s_006`, `k8s_018`, `q011`, `q012`, `k8s_001`. Same v1.1.1
+production prompt (paraphrase recency clause active). The first two
+(k8s_006, k8s_018) are the items that didn't shift in the original 3A
+5-item probe — we have gpt-4o-mini's reasoning on those items *with*
+the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt
+varying only the model. q011, q012, k8s_001 cover the broader
+fastapi/k8s residual surface (k8s_001 also a Haiku miscall — 4A
+checks whether GPT-4o agrees with gold or with Haiku).
+**Result: 5/5 correct.** All 5 items scored 2 by GPT-4o, matching gold
+exactly. Cost: $0.0011 reported (caveat: pricing config falls back to
+gpt-4o-mini rates for unlisted models, so actual cost is closer to
+$0.005–0.01 — the reported number under-reports by ~5–10×).
+**Sharpened mechanism — criteria-invention, not just literal-match.**
+The original 3-example artifact (q006, k8s_002, k8s_018) was framed
+as gpt-4o-mini "applying a literal-string-match standard" while
+correctly extracting paraphrased coverage into evidence_quotes. 4A's
+side-by-side reasoning on `k8s_018` shows a distinct second mechanism:
+  - **gpt-4o-mini (v1.1.1, score 1):** "It mentions some key points
+    from the reference... but does not explicitly state that the new
+    fields in `autoscaling/v2` are preserved as annotations when using
+    `autoscaling/v1`, nor does it mention the need to use
+    `autoscaling/v2` directly for memory or custom metric scaling for
+    a Deployment or StatefulSet."
+  - **gpt-4o (4A, score 2):** "The answer covers all the key points
+    from the reference. It mentions that the current stable version is
+    autoscaling/v2, which supports scaling on memory and custom
+    metrics, similar to the reference. It also notes that
+    autoscaling/v1 only supports CPU-based scaling, aligning with the
+    reference's points."
+The reference for k8s_018 specifies three points: (1) autoscaling/v2
+is the current stable API, (2) it adds memory metrics support beyond
+v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's
+reasoning step *invents additional criteria* the reference does not
+require ("preserved as annotations when using autoscaling/v1," "use
+autoscaling/v2 directly for ... a Deployment or StatefulSet") and then
+deducts against them, scoring 1. GPT-4o reads the reference's three
+points and scores against exactly those, scoring 2.
+This is a *capacity* finding distinct from the paraphrase-recency
+finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing
+it toward paraphrase semantics, manufactures additional gold criteria
+during scoring that aren't in the reference. Recency-positioning the
+"paraphrase allowed" clause doesn't address this — the bias isn't
+"missed paraphrase," it's "invented extra requirements." Two failure
+modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces.
+**Implication for v1.2.** With 5/5 confirmed, v1.2 fix #3 (per-
+dimension membership) gets clean empirical support: gpt-4o-mini is
+the wrong tool for 3-point completeness with paraphrase semantics, and
+no amount of prompt engineering on this rubric is going to bridge the
+capacity gap. The right v1.2 path is one of:
+  - **Exclude gpt-4o-mini from completeness scoring** (per-dim
+    membership; jury reduces to single-judge Haiku on completeness;
+    explicit and visible in config).
+  - **Replace gpt-4o-mini with GPT-4o on completeness** (per-dim
+    judge selection; jury keeps two members but the second is a
+    frontier-class model on the dimension that needs it).
+Both are defensible v1.2 designs. The choice depends on cost
+budget: gpt-4o is ~10× the per-call cost of gpt-4o-mini. At
+agent-bench's calibration-set scale (~30 items per row), even gpt-4o
+is trivially cheap; for production deployment evaluating thousands
+of agent outputs, the cost trade-off matters more.
+**4A artifact:** `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl`
+(per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs
+with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for
+the side-by-side analysis above).
+**Updated honest framing for the writeup.** "v1.1.1 addressed one
+identified failure mode (paraphrase-instruction-loss across reasoning,
+recovered 5/19 disputed items via positional change). 4A confirmed the
+residual 14 are a distinct failure mode (capacity-limited criteria
+invention during the reasoning step) — GPT-4o handles all 5 sampled
+residuals at the same v1.1.1 prompt, so the failure is small-model-
+specific rather than rubric-limited. v1.2 fix #3 (per-dimension judge
+membership / model selection) is the right escalation; the rubric
+itself doesn't need changes."
+**Total session spend:** $0.0099 reported (~$0.013–0.018 actual after
+gpt-4o pricing correction).

Makefile CHANGED

@@ -1,6 +1,6 @@
 PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
-.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
 install:
 	$(PYTHON) -m pip install -e ".[dev]"
@@ -34,6 +34,21 @@ benchmark:
 evaluate-langchain:
 	$(PYTHON) scripts/run_langchain_eval.py --provider openai
 docker:
 	docker-compose -f docker/docker-compose.yaml up --build

 PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
+.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain calibrate evaluate-judges docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
 install:
 	$(PYTHON) -m pip install -e ".[dev]"
 evaluate-langchain:
 	$(PYTHON) scripts/run_langchain_eval.py --provider openai
+calibrate:  ## Run full calibration pipeline (system outputs → all rows → strict κ table). Costs ~$2 in API calls.
+	$(PYTHON) scripts/run_calibration.py generate-outputs
+	@for cfg in configs/calibration/rows/*.yaml; do \
+		echo "==> running judges for $$cfg"; \
+		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+	done
+	$(PYTHON) scripts/run_calibration.py build-table --strict
+evaluate-judges:  ## Re-run all rows + build-table against existing system_outputs (no regeneration). Costs ~$1.
+	@for cfg in configs/calibration/rows/*.yaml; do \
+		echo "==> running judges for $$cfg"; \
+		$(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
+	done
+	$(PYTHON) scripts/run_calibration.py build-table --strict
 docker:
 	docker-compose -f docker/docker-compose.yaml up --build

README.md CHANGED

@@ -15,7 +15,7 @@ app_port: 7860
 Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
-`444 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
 ## Benchmark Results
@@ -249,7 +249,7 @@ security:
 - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
 - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
 - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
-- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 444 deterministic tests with mock providers
 <details><summary>API Reference</summary>
@@ -311,12 +311,25 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
 ## Testing
 ```bash
-make test    # 444 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
 All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
 ## Design Decisions
 See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
@@ -334,4 +347,4 @@ See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF
 | **PII redaction** | None | None | Regex + optional NER |
 | **Output validation** | None | None | PII leakage + URL + blocklist |
 | **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
-| Tests | 97 | 205 | 288 |

 Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
+`443 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
 ## Benchmark Results
 - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
 - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
 - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
+- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 443 deterministic tests with mock providers
 <details><summary>API Reference</summary>
 ## Testing
 ```bash
+make test    # 523 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
 All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
+### Targets that cost money
+These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
+| Target | Requires API key | Approximate cost | What it produces |
+|---|---|---|---|
+| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). |
+| `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
+| `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
+| `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
+Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`).
 ## Design Decisions
 See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
 | **PII redaction** | None | None | Regex + optional NER |
 | **Output validation** | None | None | PII leakage + URL + blocklist |
 | **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
+| Tests | 97 | 205 | 443 |

agent_bench/core/config.py CHANGED

@@ -88,6 +88,9 @@ class MemoryConfig(BaseModel):
 class EvaluationConfig(BaseModel):
     judge_provider: str = "openai"
     golden_dataset: str = "agent_bench/evaluation/datasets/tech_docs_golden.json"
 _VALID_TIERS = {"heuristic", "classifier"}

 class EvaluationConfig(BaseModel):
     judge_provider: str = "openai"
     golden_dataset: str = "agent_bench/evaluation/datasets/tech_docs_golden.json"
+    # New in judge-layer v1: which dimensions to score with L2 LLM judges.
+    # citation_faithfulness is opt-in v1 (default-on v1.1).
+    judge_dimensions: list[str] = ["groundedness", "relevance", "completeness"]
 _VALID_TIERS = {"heuristic", "classifier"}

agent_bench/core/provider.py CHANGED

@@ -192,9 +192,17 @@ class MockProvider(LLMProvider):
 class OpenAIProvider(LLMProvider):
-    """OpenAI API provider pinned to a dated gpt-4o-mini snapshot."""
-    def __init__(self, config: AppConfig | None = None) -> None:
         try:
             from openai import AsyncOpenAI
         except ImportError as e:
@@ -205,7 +213,7 @@ class OpenAIProvider(LLMProvider):
         self.config = config or load_config()
         api_key = os.environ.get("OPENAI_API_KEY", "")
         self.client = AsyncOpenAI(api_key=api_key)
-        self.model = "gpt-4o-mini-2024-07-18"
         model_pricing = self.config.provider.models.get(self.model)
         self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.15
         self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.60
@@ -410,9 +418,17 @@ def format_messages_anthropic(
 class AnthropicProvider(LLMProvider):
-    """Anthropic Claude provider."""
-    def __init__(self, config: AppConfig | None = None) -> None:
         try:
             from anthropic import AsyncAnthropic
         except ImportError as e:
@@ -425,7 +441,7 @@ class AnthropicProvider(LLMProvider):
         self.config = config or load_config()
         api_key = os.environ.get("ANTHROPIC_API_KEY", "")
         self.client = AsyncAnthropic(api_key=api_key)
-        self.model = "claude-haiku-4-5-20251001"
         model_pricing = self.config.provider.models.get(self.model)
         self._input_cost = (
             model_pricing.input_cost_per_mtok if model_pricing else 0.80

 class OpenAIProvider(LLMProvider):
+    """OpenAI API provider pinned to a dated gpt-4o-mini snapshot.
+    The ``model`` parameter overrides the default pin (used by the
+    calibration runner so a row config's ``model_id`` is what actually
+    gets called — without an override, ``judge_id`` would be a label
+    that disagrees with the API request, breaking provenance).
+    """
+    def __init__(
+        self, config: AppConfig | None = None, *, model: str | None = None
+    ) -> None:
         try:
             from openai import AsyncOpenAI
         except ImportError as e:
         self.config = config or load_config()
         api_key = os.environ.get("OPENAI_API_KEY", "")
         self.client = AsyncOpenAI(api_key=api_key)
+        self.model = model or "gpt-4o-mini-2024-07-18"
         model_pricing = self.config.provider.models.get(self.model)
         self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.15
         self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.60
 class AnthropicProvider(LLMProvider):
+    """Anthropic Claude provider.
+    The ``model`` parameter overrides the default pin (used by the
+    calibration runner so a row config's ``model_id`` is what actually
+    gets called — without an override, ``judge_id`` would be a label
+    that disagrees with the API request, breaking provenance).
+    """
+    def __init__(
+        self, config: AppConfig | None = None, *, model: str | None = None
+    ) -> None:
         try:
             from anthropic import AsyncAnthropic
         except ImportError as e:
         self.config = config or load_config()
         api_key = os.environ.get("ANTHROPIC_API_KEY", "")
         self.client = AsyncAnthropic(api_key=api_key)
+        self.model = model or "claude-haiku-4-5-20251001"
         model_pricing = self.config.provider.models.get(self.model)
         self._input_cost = (
             model_pricing.input_cost_per_mtok if model_pricing else 0.80
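
A sketch, not part of the commit: with the new `model=` override, a model id missing from the pricing table is costed at the hard-coded fallback rates. That is the under-reporting the Plan 4A entry in DECISIONS.md notes for `gpt-4o-2024-08-06`. The `Pricing` class, `estimate_cost` helper, and token counts are hypothetical; only the 0.15 / 0.60 fallback rates come from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_cost_per_mtok: float
    output_cost_per_mtok: float


# Fallback rates copied from the OpenAIProvider hunk (gpt-4o-mini pricing).
FALLBACK = Pricing(input_cost_per_mtok=0.15, output_cost_per_mtok=0.60)


def estimate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    table: dict[str, Pricing],
) -> tuple[float, bool]:
    """Return (cost in USD, used_fallback) for one call."""
    pricing = table.get(model)
    used_fallback = pricing is None
    if pricing is None:
        pricing = FALLBACK
    cost = (
        input_tokens * pricing.input_cost_per_mtok
        + output_tokens * pricing.output_cost_per_mtok
    ) / 1_000_000
    return cost, used_fallback


# Hypothetical pricing table that lists only the default pin.
table = {"gpt-4o-mini-2024-07-18": Pricing(0.15, 0.60)}
cost, used_fallback = estimate_cost("gpt-4o-2024-08-06", 10_000, 1_000, table)
print(cost, used_fallback)
```

Surfacing a `used_fallback` flag like this, or logging a warning, is one possible way to make a reported cost visibly provisional when a row config names a model the pricing config does not list.
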

agent_bench/evaluation/calibration/__init__.py ADDED

	@@ -0,0 +1,9 @@

+"""Hand-rolled inter-rater agreement metrics + calibration report generator."""
+from agent_bench.evaluation.calibration.metrics import (
+    bootstrap_ci,
+    cohen_kappa,
+    gwets_ac2,
+)
+__all__ = ["bootstrap_ci", "cohen_kappa", "gwets_ac2"]

agent_bench/evaluation/calibration/metrics.py ADDED

	@@ -0,0 +1,173 @@

+"""Hand-rolled Cohen's kappa, Gwet's AC2, bootstrap CI.
+Hand-rolled (not sklearn) for two reasons:
+1. agent-bench's identity is "built from primitives" — adding sklearn
+   for one function (and transitively numpy + scipy + threadpoolctl +
+   joblib) contradicts that.
+2. The hand-roll demonstrates formula understanding in a way that
+   sklearn.metrics.cohen_kappa_score does not.
+Fixture-tested against sklearn run *outside* the project venv —
+see tests/evaluation/test_calibration_metrics.py and
+scripts/_dev/generate_kappa_fixtures.py.
+"""
+from __future__ import annotations
+import random
+from collections.abc import Callable
+from typing import Literal
+def cohen_kappa(
+    y1: list,
+    y2: list,
+    weights: Literal[None, "linear", "quadratic"] = None,
+) -> float:
+    """Cohen's κ = (P_o - P_e) / (1 - P_e).
+    Supports unweighted, linear-weighted, and quadratic-weighted variants
+    for ordinal scales. y1 and y2 must be parallel lists of label values
+    (int or str). Both must have the same length.
+    """
+    if len(y1) != len(y2):
+        raise ValueError(
+            f"y1 and y2 must have same length; got {len(y1)} vs {len(y2)}"
+        )
+    if not y1:
+        raise ValueError("Empty input — kappa undefined")
+    labels = sorted({*y1, *y2}, key=str)
+    k = len(labels)
+    label_idx = {lab: i for i, lab in enumerate(labels)}
+    cm = [[0] * k for _ in range(k)]
+    for a, b in zip(y1, y2):
+        cm[label_idx[a]][label_idx[b]] += 1
+    n = len(y1)
+    if weights is None:
+        w = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
+    elif weights == "linear":
+        if k <= 1:
+            w = [[1.0]]
+        else:
+            w = [
+                [1.0 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)
+            ]
+    elif weights == "quadratic":
+        if k <= 1:
+            w = [[1.0]]
+        else:
+            w = [
+                [1.0 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)
+            ]
+    else:
+        raise ValueError(f"Invalid weights {weights!r}")
+    p_o = sum(w[i][j] * cm[i][j] for i in range(k) for j in range(k)) / n
+    row_marg = [sum(cm[i][j] for j in range(k)) / n for i in range(k)]
+    col_marg = [sum(cm[i][j] for i in range(k)) / n for j in range(k)]
+    p_e = sum(
+        w[i][j] * row_marg[i] * col_marg[j] for i in range(k) for j in range(k)
+    )
+    if p_e >= 1.0:
+        return 1.0
+    return (p_o - p_e) / (1.0 - p_e)
+def gwets_ac2(
+    y1: list,
+    y2: list,
+    weights: Literal[None] = None,
+) -> float:
+    """Gwet's AC1 — chance-corrected agreement using mean marginals.
+    AC1 = (P_o - P_e) / (1 - P_e)
+    where P_e = (1/(q-1)) * Σ pi_k * (1 - pi_k)
+    and pi_k is the mean marginal probability for category k.
+    Despite the function name, v1 only supports the *unweighted* (AC1)
+    formula. The weighted AC2 variant has multiple inconsistent definitions
+    in the literature (Gwet 2008 vs Gwet 2014); without a sklearn analogue
+    to cross-check against (sklearn ships κ but not AC1/AC2), shipping a
+    weighted formula without a fixture is a methodology hazard. Pass
+    weights=None or omit; passing 'linear' or 'quadratic' raises
+    NotImplementedError. Fix the formula + fixture in v1.1 (out of scope
+    per the design's Out-of-Scope section).
+    """
+    if weights is not None:
+        raise NotImplementedError(
+            "Weighted Gwet's AC2 is not implemented in v1. The unweighted "
+            "AC1 formula is correct and tested; the weighted variant has "
+            "literature inconsistency that needs a pinned fixture before "
+            "shipping. Pass weights=None or use cohen_kappa(weights=...)."
+        )
+    if len(y1) != len(y2):
+        raise ValueError("y1 and y2 length mismatch")
+    if not y1:
+        raise ValueError("Empty input")
+    labels = sorted({*y1, *y2}, key=str)
+    k = len(labels)
+    label_idx = {lab: i for i, lab in enumerate(labels)}
+    cm = [[0] * k for _ in range(k)]
+    for a, b in zip(y1, y2):
+        cm[label_idx[a]][label_idx[b]] += 1
+    n = len(y1)
+    p_o = sum(cm[i][i] for i in range(k)) / n  # diagonal sum (unweighted)
+    row_marg = [sum(cm[i][j] for j in range(k)) / n for i in range(k)]
+    col_marg = [sum(cm[i][j] for i in range(k)) / n for j in range(k)]
+    pi = [(row_marg[i] + col_marg[i]) / 2 for i in range(k)]
+    if k <= 1:
+        return 1.0
+    # AC1 chance term: (1/(q-1)) * Σ pi_k * (1 - pi_k)
+    p_e_ac1 = sum(pi[i] * (1 - pi[i]) for i in range(k)) / (k - 1)
+    if p_e_ac1 >= 1.0:
+        return 1.0
+    return (p_o - p_e_ac1) / (1.0 - p_e_ac1)
+def bootstrap_ci(
+    y1: list,
+    y2: list,
+    metric_fn: Callable[[list, list], float],
+    n_iter: int = 1000,
+    ci: float = 0.95,
+    seed: int = 42,
+) -> tuple[float, float, float]:
+    """Bootstrap confidence interval for an inter-rater metric.
+    Returns (point_estimate, ci_lo, ci_hi). Resamples with replacement
+    n_iter times and takes the (1-ci)/2 and (1+ci)/2 percentiles.
+    """
+    if len(y1) != len(y2):
+        raise ValueError("length mismatch")
+    n = len(y1)
+    rng = random.Random(seed)
+    point = metric_fn(y1, y2)
+    samples: list[float] = []
+    for _ in range(n_iter):
+        idx = [rng.randrange(n) for _ in range(n)]
+        s1 = [y1[i] for i in idx]
+        s2 = [y2[i] for i in idx]
+        try:
+            samples.append(metric_fn(s1, s2))
+        except (ValueError, ZeroDivisionError):
+            # Degenerate resample (e.g., all one label) — skip
+            continue
+    samples.sort()
+    if not samples:
+        return point, point, point
+    lo_idx = int(((1 - ci) / 2) * len(samples))
+    hi_idx = int(((1 + ci) / 2) * len(samples)) - 1
+    return point, samples[lo_idx], samples[hi_idx]
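
A sketch, not part of the commit: a toy case showing how the unweighted κ and AC1 formulas in this file diverge under heavy class skew. The gold mix (29 items at class 2, one at class 1) mirrors the relevance prevalence described in the commit; the judge that answers 2 on every item is hypothetical, and `agreement_stats` is a condensed stand-in rather than the module's API.

```python
from collections import Counter


def agreement_stats(gold: list, pred: list) -> tuple[float, float, float]:
    """Return (raw agreement, Cohen's kappa, Gwet's AC1), all unweighted."""
    n = len(gold)
    classes = sorted(set(gold) | set(pred))
    q = len(classes)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_m, pred_m = Counter(gold), Counter(pred)
    # Cohen: product of the two marginals per class.
    p_e_kappa = sum((gold_m[k] / n) * (pred_m[k] / n) for k in classes)
    # Gwet AC1: mean marginal per class.
    pi = [(gold_m[k] / n + pred_m[k] / n) / 2 for k in classes]
    p_e_ac1 = sum(x * (1 - x) for x in pi) / (q - 1)
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return p_o, kappa, ac1


gold = [2] * 29 + [1]   # skewed gold, as on the relevance dimension
pred = [2] * 30         # HYPOTHETICAL judge: always predicts class 2
raw, kappa, ac1 = agreement_stats(gold, pred)
print(f"raw={raw:.3f} kappa={kappa:.3f} ac1={ac1:.3f}")
```

Raw agreement is 29/30, yet κ comes out at zero because chance agreement under Cohen's formula is also 29/30; AC1 stays close to the raw figure.
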

agent_bench/evaluation/calibration/report.py ADDED

	@@ -0,0 +1,325 @@

+"""generate_kappa_table — joins predictions ⋈ labels by (item_id, dimension,
+system_output_hash); computes per-row κ + bootstrap CI + abstain breakdown;
+emits markdown table at docs/_generated/kappa_table.md.
+"""
+from __future__ import annotations
+import glob as _glob
+import json
+from collections import defaultdict
+from collections.abc import Callable
+from pathlib import Path
+import structlog
+from agent_bench.evaluation.calibration.metrics import (
+    bootstrap_ci,
+    cohen_kappa,
+    gwets_ac2,
+)
+from agent_bench.evaluation.judges.base import (
+    ABSTAIN_REASON_OUT_OF_RANGE,
+    ABSTAIN_REASON_PROVIDER_EXHAUSTED,
+    ABSTAIN_REASON_SCHEMA_PARSE,
+)
+logger = structlog.get_logger()
+ABSTAIN_THRESHOLD = 0.20  # strictly greater than fires the flag
+# Per-dimension headline metric. Cohen's κ degenerates under the prevalence
+# imbalance produced by the v1.1 strict-snippet groundedness rubric (1×score=1,
+# ~25×score=0) and by the inherent skew on relevance (29×score=2, 1×score=1):
+# both Po and Pe approach 1.0, the formula collapses to ~0/0, and the rendered
+# κ reads as 0.000 even when raw agreement is >95%. Gwet's AC1 (gwets_ac2 with
+# weights=None per metrics.py) uses mean marginals and stays informative under
+# imbalance. Completeness has a more balanced gold (23×2, 5×1, 2×Unknown) so
+# Cohen's κ is the conventional choice there. The metric per dim is rendered
+# explicitly in the footer so a writeup reader sees the methodology choice.
+# Type annotation prevents a mypy 1.20.x INTERNAL ERROR triggered by the
+# tuple-unpack of `_DIM_METRIC.get(dim, default)` further down. Without it
+# mypy fails to infer the metric_fn callable signature consistently across
+# the dict literal and the fallback default, and crashes with no real
+# user-facing type error to fix.
+_MetricFn = Callable[[list, list], float]
+_DIM_METRIC: dict[str, tuple[str, _MetricFn]] = {
+    "groundedness": ("AC1", gwets_ac2),
+    "relevance": ("AC1", gwets_ac2),
+    "completeness": ("κ", cohen_kappa),
+}
+# Filename marker for jury / permute sidecar files. Any prediction file whose
+# basename contains this token is per-member detail, not aggregate predictions,
+# and is excluded from the κ table. Pinned here so a future extension change
+# (jsonl → json) is caught at the contract site rather than at report time.
+_SIDECAR_BASENAME_MARKER = "_members."
+def _classify_abstain(reasoning: str) -> str:
+    if reasoning.startswith(ABSTAIN_REASON_PROVIDER_EXHAUSTED):
+        return "provider_exhausted"
+    if reasoning.startswith(ABSTAIN_REASON_SCHEMA_PARSE):
+        return "schema_parse"
+    if reasoning.startswith(ABSTAIN_REASON_OUT_OF_RANGE):
+        return "out_of_range"
+    return "genuine"
+def generate_kappa_table(
+    *,
+    predictions_glob: str,
+    labels_path: str,
+    output_path: str,
+    strict: bool = False,
+) -> None:
+    """Aggregate predictions across rows + dimensions into one markdown table.
+    On hash mismatch: ALWAYS raises (both modes), with the first item's
+    expected/actual hashes plus the full mismatched-id list.
+    On missing prediction or label: WARN+exclude in default mode; RAISE in strict.
+    On undefined κ: render '—' with a footnote (both modes).
+    On abstain rate > 20%: render κ + footnote with cause breakdown (both modes).
+    """
+    labels: list[dict] = []
+    for line in Path(labels_path).read_text().splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        labels.append(json.loads(line))
+    label_by_key: dict[tuple[str, str], dict] = {
+        (label_rec["item_id"], label_rec["dimension"]): label_rec
+        for label_rec in labels
+    }
+    pred_files = sorted(_glob.glob(predictions_glob))
+    if not pred_files:
+        raise ValueError(f"No prediction files matched: {predictions_glob}")
+    rows: list[dict] = []
+    for pf in pred_files:
+        # Skip sidecars (per-member detail, not aggregate predictions).
+        # Match the basename marker, not a specific extension, so a future
+        # jsonl → json migration of jury._DEFAULT_SIDECAR_TEMPLATE doesn't
+        # silently start contaminating the κ table.
+        if _SIDECAR_BASENAME_MARKER in Path(pf).name:
+            continue
+        row_label = (
+            Path(pf).stem.replace("calibration_v1_judge_", "")
+        )
+        preds = json.loads(Path(pf).read_text())
+        # Hash-mismatch detection (always raises)
+        mismatches: list[tuple[str, str, str]] = []
+        for p in preds:
+            key = (p["item_id"], p["dimension"])
+            if key in label_by_key:
+                expected = label_by_key[key]["system_output_hash"]
+                actual = p["system_output_hash"]
+                if expected != actual:
+                    mismatches.append((p["item_id"], expected, actual))
+        if mismatches:
+            first_id, first_exp, first_act = mismatches[0]
+            raise ValueError(
+                f"Hash mismatch in {pf}: item {first_id!r} "
+                f"label.system_output_hash={first_exp!r} but "
+                f"prediction.system_output_hash={first_act!r}. "
+                f"Full mismatched-id list ({len(mismatches)}): "
+                f"{[m[0] for m in mismatches]}. "
+                f"Labels are stale relative to predictions — regenerate one or "
+                f"the other so hashes align."
+            )
+        preds_by_dim: dict[str, list[dict]] = defaultdict(list)
+        for p in preds:
+            preds_by_dim[p["dimension"]].append(p)
+        labels_by_dim: dict[str, list[dict]] = defaultdict(list)
+        for label_rec in labels:
+            labels_by_dim[label_rec["dimension"]].append(label_rec)
+        for dim in sorted(preds_by_dim.keys()):
+            # Resolve dimension's headline metric once per dim, instead of
+            # tuple-unpacking _DIM_METRIC.get(...) at each use site below.
+            # The repeated unpack pattern triggered a mypy 1.19+ INTERNAL
+            # ERROR; one resolution call here is also less code.
+            metric_name, metric_fn = _DIM_METRIC.get(
+                dim, ("κ", cohen_kappa)
+            )
+            preds_d = {p["item_id"]: p for p in preds_by_dim[dim]}
+            labs_d = {
+                label_rec["item_id"]: label_rec
+                for label_rec in labels_by_dim.get(dim, [])
+            }
+            common = sorted(set(preds_d) & set(labs_d))
+            missing_pred = sorted(set(labs_d) - set(preds_d))
+            missing_lab = sorted(set(preds_d) - set(labs_d))
+            if missing_pred or missing_lab:
+                msg = (
+                    f"row={row_label} dim={dim} "
+                    f"missing_predictions={missing_pred} "
+                    f"missing_labels={missing_lab}"
+                )
+                if strict:
+                    raise ValueError(f"strict mode: missing items: {msg}")
+                logger.warning("calibration_report_missing", message=msg)
+            y_pred: list = []
+            y_lab: list = []
+            abstains = 0
+            abstain_causes: dict[str, int] = {
+                "provider_exhausted": 0,
+                "schema_parse": 0,
+                "out_of_range": 0,
+                "genuine": 0,
+            }
+            for iid in common:
+                p = preds_d[iid]
+                label_rec = labs_d[iid]
+                if p["score"] == "Unknown" or label_rec["score"] == "Unknown":
+                    abstains += 1
+                    if p["score"] == "Unknown":
+                        abstain_causes[
+                            _classify_abstain(p.get("reasoning", ""))
+                        ] += 1
+                    continue
+                y_pred.append(int(p["score"]))
+                y_lab.append(int(label_rec["score"]))
+            n_eligible = len(y_pred)
+            abstain_rate = abstains / max(len(common), 1)
+            if n_eligible < 3:
+                rows.append(
+                    {
+                        "row": row_label,
+                        "dim": dim,
+                        "metric": metric_name,
+                        "kappa": None,
+                        "ci_lo": None,
+                        "ci_hi": None,
+                        "n_eligible": n_eligible,
+                        "abstains": abstains,
+                        "abstain_rate": abstain_rate,
+                        "abstain_causes": abstain_causes,
+                        "footnote": (
+                            f"{metric_name} undefined: insufficient "
+                            f"agreement-eligible items (N={n_eligible})"
+                        ),
+                    }
+                )
+                continue
+            try:
+                kappa = metric_fn(y_lab, y_pred)
+                point, lo, hi = bootstrap_ci(
+                    y_lab, y_pred, metric_fn, n_iter=1000, seed=42
+                )
+            except (ValueError, ZeroDivisionError):
+                rows.append(
+                    {
+                        "row": row_label,
+                        "dim": dim,
+                        "metric": metric_name,
+                        "kappa": None,
+                        "ci_lo": None,
+                        "ci_hi": None,
+                        "n_eligible": n_eligible,
+                        "abstains": abstains,
+                        "abstain_rate": abstain_rate,
+                        "abstain_causes": abstain_causes,
+                        "footnote": (
+                            f"{metric_name} undefined: insufficient "
+                            f"variance after exclusion"
+                        ),
+                    }
+                )
+                continue
+            # Detect the degenerate case where labels and predictions are each
+            # constant. If they share one category, P_e=1 and metrics.py clamps
+            # kappa to 1.0, but with no variance the result is statistically
+            # meaningless, so the row is rendered as undefined.
+            if len(set(y_lab)) <= 1 and len(set(y_pred)) <= 1:
+                rows.append(
+                    {
+                        "row": row_label,
+                        "dim": dim,
+                        "metric": metric_name,
+                        "kappa": None,
+                        "ci_lo": None,
+                        "ci_hi": None,
+                        "n_eligible": n_eligible,
+                        "abstains": abstains,
+                        "abstain_rate": abstain_rate,
+                        "abstain_causes": abstain_causes,
+                        "footnote": (
+                            f"{metric_name} undefined: all labels and "
+                            f"predictions in a single category (no variance "
+                            f"to measure)"
+                        ),
+                    }
+                )
+                continue
+            footnote = ""
+            if abstain_rate > ABSTAIN_THRESHOLD:
+                breakdown = ", ".join(
+                    f"{int(100 * v / abstains)}% {k.replace('_', ' ')}"
+                    for k, v in abstain_causes.items()
+                    if v > 0
+                )
+                footnote = (
+                    f"{metric_name} computed on N={n_eligible} of "
+                    f"{len(common)} items; high abstain rate "
+                    f"({100 * abstain_rate:.1f}% — breakdown: {breakdown}) "
+                    f"suggests rubric ambiguity."
+                )
+            rows.append(
+                {
+                    "row": row_label,
+                    "dim": dim,
+                    "metric": metric_name,
+                    "kappa": kappa,
+                    "ci_lo": lo,
+                    "ci_hi": hi,
+                    "n_eligible": n_eligible,
+                    "abstains": abstains,
+                    "abstain_rate": abstain_rate,
+                    "abstain_causes": abstain_causes,
+                    "footnote": footnote,
+                }
+            )
+    out = ["# κ ablation table — calibration v1\n"]
+    out.append(
+        "Headline metric per dimension: " + ", ".join(
+            f"**{d} → {m}**" for d, (m, _) in _DIM_METRIC.items()
+        ) + ". "
+        "AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold "
+        "is prevalence-skewed enough to make Cohen's κ degenerate "
+        "(groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); raw agreement "
+        "is ≥0.95 on those rows, but Cohen's κ collapses to ≈0 because "
+        "Pe approaches 1. Completeness uses Cohen's κ because its "
+        "gold (23×`2`/5×`1`) is balanced enough for κ to behave normally."
+    )
+    out.append("")
+    out.append("| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |")
+    out.append("|---|---|---|---|---|---|---|")
+    for r in rows:
+        if r["kappa"] is None:
+            kcell = " — "
+        else:
+            kcell = f"{r['kappa']:.3f} ({r['ci_lo']:.3f}, {r['ci_hi']:.3f})"
+        rate = f"{100 * r['abstain_rate']:.1f}%"
+        out.append(
+            f"| {r['row']} | {r['dim']} | {r['metric']} | {kcell} | "
+            f"{r['n_eligible']} | {rate} | {r['footnote']} |"
+        )
+    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
+    Path(output_path).write_text("\n".join(out) + "\n")
+    logger.info("kappa_table_written", path=output_path, rows=len(rows))
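
The comments in `report.py` argue that Cohen's κ collapses under heavy prevalence skew while Gwet's AC1 stays informative. The sketch below reproduces that effect on a toy case shaped like the relevance gold (29 items scored 2, one scored 1) with a judge that always answers 2. The functions `cohen_kappa_sketch` and `gwet_ac1_sketch` are hypothetical stand-ins written from the textbook formulas. They are not the `cohen_kappa` / `gwets_ac2` implementations in `metrics.py`, whose clamping and weighting behavior may differ.

```python
from collections import Counter


def observed_agreement(a: list, b: list) -> float:
    """Po: fraction of items on which both raters give the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa_sketch(a: list, b: list) -> float:
    """Unweighted Cohen's kappa: Pe is the product of per-rater marginals."""
    n = len(a)
    po = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in sorted(set(a) | set(b)))
    if pe == 1.0:
        raise ValueError("kappa undefined: chance agreement is 1")
    return (po - pe) / (1 - pe)


def gwet_ac1_sketch(a: list, b: list, categories=None) -> float:
    """Unweighted Gwet's AC1: Pe is built from the mean marginals.

    Assumption: if `categories` is not given, the observed scores are used.
    """
    n = len(a)
    cats = sorted(set(categories) if categories is not None else set(a) | set(b))
    q = len(cats)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")
    po = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # pi_k: average of the two raters' marginal proportions for category k.
    pis = [(ca[k] / n + cb[k] / n) / 2 for k in cats]
    pe = sum(p * (1 - p) for p in pis) / (q - 1)
    return (po - pe) / (1 - pe)


gold = [2] * 29 + [1]   # skewed gold, as in the relevance dimension
judge = [2] * 30        # judge never uses the rare category

po = observed_agreement(gold, judge)
kappa = cohen_kappa_sketch(gold, judge)
ac1 = gwet_ac1_sketch(gold, judge)
```

On this input raw agreement is 29/30, Cohen's κ comes out at 0 because Po and Pe coincide, and AC1 lands at about 0.97. That is the pattern the per-dimension metric choice is meant to guard against.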

agent_bench/evaluation/datasets/calibration_v1.json ADDED

	@@ -0,0 +1,158 @@

+{
+  "version": "v1",
+  "system_config_git_sha": "3a2ed359eb16437cf95987b1fca47281a37fb74c",
+  "sample_seed": 20260504,
+  "notes": "30-item stratified calibration set per the design doc. Spare slots filled from K8s simple_w_condition and multi_hop (typically highest-variance R@5 strata).",
+  "items": [
+    {
+      "id": "q021",
+      "corpus": "fastapi",
+      "stratum": "calculation"
+    },
+    {
+      "id": "q010",
+      "corpus": "fastapi",
+      "stratum": "out_of_scope"
+    },
+    {
+      "id": "q027",
+      "corpus": "fastapi",
+      "stratum": "out_of_scope"
+    },
+    {
+      "id": "q006",
+      "corpus": "fastapi",
+      "stratum": "retrieval"
+    },
+    {
+      "id": "q011",
+      "corpus": "fastapi",
+      "stratum": "retrieval"
+    },
+    {
+      "id": "q012",
+      "corpus": "fastapi",
+      "stratum": "retrieval"
+    },
+    {
+      "id": "q023",
+      "corpus": "fastapi",
+      "stratum": "retrieval"
+    },
+    {
+      "id": "q025",
+      "corpus": "fastapi",
+      "stratum": "retrieval"
+    },
+    {
+      "id": "k8s_002",
+      "corpus": "k8s",
+      "stratum": "comparison"
+    },
+    {
+      "id": "k8s_014",
+      "corpus": "k8s",
+      "stratum": "comparison"
+    },
+    {
+      "id": "k8s_016",
+      "corpus": "k8s",
+      "stratum": "comparison"
+    },
+    {
+      "id": "k8s_004",
+      "corpus": "k8s",
+      "stratum": "false_premise"
+    },
+    {
+      "id": "k8s_022",
+      "corpus": "k8s",
+      "stratum": "false_premise"
+    },
+    {
+      "id": "k8s_024",
+      "corpus": "k8s",
+      "stratum": "false_premise"
+    },
+    {
+      "id": "k8s_003",
+      "corpus": "k8s",
+      "stratum": "multi_hop"
+    },
+    {
+      "id": "k8s_017",
+      "corpus": "k8s",
+      "stratum": "multi_hop"
+    },
+    {
+      "id": "k8s_018",
+      "corpus": "k8s",
+      "stratum": "multi_hop"
+    },
+    {
+      "id": "k8s_019",
+      "corpus": "k8s",
+      "stratum": "multi_hop"
+    },
+    {
+      "id": "k8s_025",
+      "corpus": "k8s",
+      "stratum": "set"
+    },
+    {
+      "id": "k8s_001",
+      "corpus": "k8s",
+      "stratum": "simple"
+    },
+    {
+      "id": "k8s_006",
+      "corpus": "k8s",
+      "stratum": "simple"
+    },
+    {
+      "id": "k8s_007",
+      "corpus": "k8s",
+      "stratum": "simple"
+    },
+    {
+      "id": "k8s_009",
+      "corpus": "k8s",
+      "stratum": "simple"
+    },
+    {
+      "id": "k8s_005",
+      "corpus": "k8s",
+      "stratum": "simple_w_condition"
+    },
+    {
+      "id": "k8s_012",
+      "corpus": "k8s",
+      "stratum": "simple_w_condition"
+    },
+    {
+      "id": "k8s_013",
+      "corpus": "k8s",
+      "stratum": "simple_w_condition"
+    },
+    {
+      "id": "k8s_015",
+      "corpus": "k8s",
+      "stratum": "spare_comparison"
+    },
+    {
+      "id": "k8s_023",
+      "corpus": "k8s",
+      "stratum": "spare_false_premise"
+    },
+    {
+      "id": "k8s_020",
+      "corpus": "k8s",
+      "stratum": "spare_multi_hop"
+    },
+    {
+      "id": "k8s_011",
+      "corpus": "k8s",
+      "stratum": "spare_simple_w_condition"
+    }
+  ]
+}
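
A calibration manifest like the one above is easy to sanity-check before any judge runs. The sketch below tallies items per corpus and per stratum and rejects duplicate ids. The helper name `summarize_calibration_set` is hypothetical and is not part of `agent_bench`. Only the `items`, `id`, `corpus`, and `stratum` field names come from the file.

```python
from collections import Counter


def summarize_calibration_set(doc: dict) -> dict:
    """Count items per corpus and stratum; fail loudly on duplicate ids."""
    items = doc["items"]
    id_counts = Counter(item["id"] for item in items)
    duplicates = sorted(i for i, c in id_counts.items() if c > 1)
    if duplicates:
        raise ValueError(f"duplicate item ids: {duplicates}")
    return {
        "n_items": len(items),
        "by_corpus": dict(Counter(item["corpus"] for item in items)),
        "by_stratum": dict(Counter(item["stratum"] for item in items)),
    }


# Tiny inline sample using the same field names as the dataset above.
sample = {
    "version": "v1",
    "items": [
        {"id": "q021", "corpus": "fastapi", "stratum": "calculation"},
        {"id": "k8s_002", "corpus": "k8s", "stratum": "comparison"},
        {"id": "k8s_014", "corpus": "k8s", "stratum": "comparison"},
    ],
}
summary = summarize_calibration_set(sample)
```

Run against the full file (for example via `json.loads(Path(...).read_text())`), the same function would confirm that the item count matches the 30 items described in `notes`.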

agent_bench/evaluation/datasets/tech_docs_golden.json CHANGED

@@ -2,8 +2,15 @@
     {
         "id": "q001",
         "question": "How do you define a path parameter in FastAPI?",
-        "expected_answer_keywords": ["curly braces", "path", "function parameter", "URL"],
-        "expected_sources": ["fastapi_path_params.md"],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
@@ -12,8 +19,15 @@
     {
         "id": "q002",
         "question": "What is the default page size for pagination in FastAPI and what is the maximum allowed?",
-        "expected_answer_keywords": ["20", "100", "default", "maximum"],
-        "expected_sources": ["fastapi_pagination.md"],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
@@ -22,8 +36,15 @@
     {
         "id": "q003",
         "question": "How does FastAPI handle CORS and what is the default max_age for preflight caching?",
-        "expected_answer_keywords": ["CORSMiddleware", "600", "seconds", "preflight"],
-        "expected_sources": ["fastapi_middleware.md"],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
@@ -32,8 +53,14 @@
     {
         "id": "q004",
         "question": "What algorithm and expiry time does the FastAPI security example use for JWT tokens?",
-        "expected_answer_keywords": ["HS256", "30", "minutes"],
-        "expected_sources": ["fastapi_security.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
@@ -42,8 +69,15 @@
     {
         "id": "q005",
         "question": "What is the recommended formula for calculating the number of Gunicorn workers for a FastAPI deployment?",
-        "expected_answer_keywords": ["2", "CPU", "cores", "1"],
-        "expected_sources": ["fastapi_deployment.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
@@ -52,18 +86,35 @@
     {
         "id": "q006",
         "question": "How does dependency caching work in FastAPI, and how can you disable it?",
-        "expected_answer_keywords": ["cache", "once", "use_cache", "False"],
-        "expected_sources": ["fastapi_dependencies.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
-        "reference_answer": "FastAPI caches dependency results so each dependency is called only once per request. Caching can be disabled by setting use_cache=False in the Depends() call."
     },
     {
         "id": "q007",
         "question": "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?",
-        "expected_answer_keywords": ["500", "334", "ceil", "pages"],
-        "expected_sources": ["fastapi_pagination.md"],
         "category": "calculation",
         "difficulty": "medium",
         "requires_calculator": true,
@@ -72,7 +123,11 @@
     {
         "id": "q008",
         "question": "Does FastAPI support automatic Kubernetes deployment?",
-        "expected_answer_keywords": ["not", "does not contain", "no information"],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
@@ -82,7 +137,11 @@
     {
         "id": "q009",
         "question": "How does FastAPI integrate with Apache Kafka for event streaming?",
-        "expected_answer_keywords": ["not", "does not contain", "no information"],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
@@ -92,38 +151,69 @@
     {
         "id": "q010",
         "question": "Can FastAPI generate GraphQL schemas natively?",
-        "expected_answer_keywords": ["not", "does not contain", "no information"],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
         "requires_calculator": false,
-        "reference_answer": ""
     },
     {
         "id": "q011",
         "question": "What is the default Swagger UI endpoint in FastAPI?",
-        "expected_answer_keywords": ["/docs", "Swagger", "interactive"],
-        "expected_sources": ["fastapi_openapi.md"],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
-        "reference_answer": "The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface."
     },
     {
         "id": "q012",
         "question": "How do you raise an HTTP error in a FastAPI route handler?",
-        "expected_answer_keywords": ["HTTPException", "status_code", "detail"],
-        "expected_sources": ["fastapi_error_handling.md"],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
-        "reference_answer": "You raise an HTTP error in FastAPI by raising an HTTPException with a status_code and a detail message describing the error."
     },
     {
         "id": "q013",
         "question": "How do you define a request body in FastAPI?",
-        "expected_answer_keywords": ["Pydantic", "BaseModel", "JSON"],
-        "expected_sources": ["fastapi_request_body.md"],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
@@ -132,8 +222,14 @@
     {
         "id": "q014",
         "question": "What testing tools does FastAPI use, and what class provides the test client?",
-        "expected_answer_keywords": ["TestClient", "pytest", "Starlette"],
-        "expected_sources": ["fastapi_testing.md"],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
@@ -142,8 +238,15 @@
     {
         "id": "q015",
         "question": "How does FastAPI manage application configuration and environment variables?",
-        "expected_answer_keywords": ["BaseSettings", "pydantic", "env", "environment"],
-        "expected_sources": ["fastapi_configuration.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
@@ -152,8 +255,15 @@
     {
         "id": "q016",
         "question": "What is the minimum response size for GZip compression middleware in FastAPI, and how do you enable it?",
-        "expected_answer_keywords": ["500", "bytes", "GZipMiddleware", "minimum_size"],
-        "expected_sources": ["fastapi_middleware.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
@@ -162,8 +272,15 @@
     {
         "id": "q017",
         "question": "How do yield dependencies work in FastAPI and what is the maximum number supported per request?",
-        "expected_answer_keywords": ["yield", "cleanup", "finally", "32"],
-        "expected_sources": ["fastapi_dependencies.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
@@ -172,8 +289,15 @@
     {
         "id": "q018",
         "question": "What are the three documentation endpoints FastAPI exposes by default and what OpenAPI version does it use?",
-        "expected_answer_keywords": ["/docs", "/redoc", "/openapi.json", "3.1"],
-        "expected_sources": ["fastapi_openapi.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
@@ -182,8 +306,15 @@
     {
         "id": "q019",
         "question": "How does FastAPI handle WebSocket connections, and what must be called before sending data?",
-        "expected_answer_keywords": ["accept", "WebSocket", "send", "receive"],
-        "expected_sources": ["fastapi_websockets.md"],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
@@ -192,8 +323,16 @@
     {
         "id": "q020",
         "question": "For a server with 4 CPU cores, how many Gunicorn workers should be configured using the recommended formula?",
-        "expected_answer_keywords": ["9", "workers", "2", "CPU", "1"],
-        "expected_sources": ["fastapi_deployment.md"],
         "category": "calculation",
         "difficulty": "medium",
         "requires_calculator": true,
@@ -202,18 +341,35 @@
     {
         "id": "q021",
         "question": "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?",
-        "expected_answer_keywords": ["10", "minutes"],
-        "expected_sources": ["fastapi_middleware.md"],
         "category": "calculation",
         "difficulty": "easy",
         "requires_calculator": true,
-        "reference_answer": "With a CORS max_age of 600 seconds, the browser caches preflight results for 10 minutes (600 / 60 = 10)."
     },
     {
         "id": "q022",
         "question": "How do route ordering and dependency injection interact when building a secure FastAPI application with scoped endpoints?",
-        "expected_answer_keywords": ["order", "Depends", "Security", "scopes"],
-        "expected_sources": ["fastapi_path_params.md", "fastapi_dependencies.md", "fastapi_security.md"],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
@@ -222,18 +378,40 @@
     {
         "id": "q023",
         "question": "How would you set up a FastAPI application with custom error handling, CORS middleware, and structured testing including dependency overrides?",
-        "expected_answer_keywords": ["HTTPException", "CORSMiddleware", "TestClient", "override"],
-        "expected_sources": ["fastapi_error_handling.md", "fastapi_middleware.md", "fastapi_testing.md"],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
-        "reference_answer": "Custom error handling is set up by raising HTTPException or registering exception handlers, CORS is configured by adding CORSMiddleware with allowed origins, and testing uses TestClient with app.dependency_overrides to replace dependencies during tests."
     },
     {
         "id": "q024",
         "question": "Explain how to deploy a FastAPI app with Docker using Gunicorn workers, health checks, and environment-based configuration via Pydantic Settings.",
-        "expected_answer_keywords": ["Docker", "Gunicorn", "health", "BaseSettings", "env"],
-        "expected_sources": ["fastapi_deployment.md", "fastapi_configuration.md"],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
@@ -242,17 +420,32 @@
     {
         "id": "q025",
         "question": "How would you build a paginated API with cursor-based navigation, response model validation, and background task processing for analytics logging?",
-        "expected_answer_keywords": ["cursor", "response_model", "BackgroundTasks"],
-        "expected_sources": ["fastapi_pagination.md", "fastapi_response_model.md", "fastapi_background_tasks.md"],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
-        "reference_answer": "Cursor-based pagination uses an opaque cursor token for navigation instead of page numbers. Response models are validated using the response_model parameter on route decorators, and analytics logging is handled asynchronously via FastAPI's BackgroundTasks dependency."
     },
     {
         "id": "q026",
         "question": "Does FastAPI have built-in support for database migrations like Alembic?",
-        "expected_answer_keywords": ["not", "does not contain", "no information"],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
@@ -262,11 +455,16 @@
     {
         "id": "q027",
         "question": "How does FastAPI handle automatic load balancing across multiple servers?",
-        "expected_answer_keywords": ["not", "does not contain", "no information"],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
         "requires_calculator": false,
-        "reference_answer": ""
     }
 ]

     {
         "id": "q001",
         "question": "How do you define a path parameter in FastAPI?",
+        "expected_answer_keywords": [
+            "curly braces",
+            "path",
+            "function parameter",
+            "URL"
+        ],
+        "expected_sources": [
+            "fastapi_path_params.md"
+        ],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
     {
         "id": "q002",
         "question": "What is the default page size for pagination in FastAPI and what is the maximum allowed?",
+        "expected_answer_keywords": [
+            "20",
+            "100",
+            "default",
+            "maximum"
+        ],
+        "expected_sources": [
+            "fastapi_pagination.md"
+        ],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
     {
         "id": "q003",
         "question": "How does FastAPI handle CORS and what is the default max_age for preflight caching?",
+        "expected_answer_keywords": [
+            "CORSMiddleware",
+            "600",
+            "seconds",
+            "preflight"
+        ],
+        "expected_sources": [
+            "fastapi_middleware.md"
+        ],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
     {
         "id": "q004",
         "question": "What algorithm and expiry time does the FastAPI security example use for JWT tokens?",
+        "expected_answer_keywords": [
+            "HS256",
+            "30",
+            "minutes"
+        ],
+        "expected_sources": [
+            "fastapi_security.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
     {
         "id": "q005",
         "question": "What is the recommended formula for calculating the number of Gunicorn workers for a FastAPI deployment?",
+        "expected_answer_keywords": [
+            "2",
+            "CPU",
+            "cores",
+            "1"
+        ],
+        "expected_sources": [
+            "fastapi_deployment.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
     {
         "id": "q006",
         "question": "How does dependency caching work in FastAPI, and how can you disable it?",
+        "expected_answer_keywords": [
+            "cache",
+            "once",
+            "use_cache",
+            "False"
+        ],
+        "expected_sources": [
+            "fastapi_dependencies.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
+        "reference_answer": "FastAPI caches dependency results so each dependency is called only once per request. Caching can be disabled by setting use_cache=False in the Depends() call.",
+        "source_snippets": [
+            "By default, if the same dependency is used multiple times within a single request (e.g., both a route and a sub-dependency use `Depends(get_db)`), FastAPI caches the result and calls the dependency only once. To disable caching and force a fresh call each time, use `Depends(get_db, use_cache=False)`."
+        ]
     },
     {
         "id": "q007",
         "question": "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?",
+        "expected_answer_keywords": [
+            "500",
+            "334",
+            "ceil",
+            "pages"
+        ],
+        "expected_sources": [
+            "fastapi_pagination.md"
+        ],
         "category": "calculation",
         "difficulty": "medium",
         "requires_calculator": true,
     {
         "id": "q008",
         "question": "Does FastAPI support automatic Kubernetes deployment?",
+        "expected_answer_keywords": [
+            "not",
+            "does not contain",
+            "no information"
+        ],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
     {
         "id": "q009",
         "question": "How does FastAPI integrate with Apache Kafka for event streaming?",
+        "expected_answer_keywords": [
+            "not",
+            "does not contain",
+            "no information"
+        ],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
     {
         "id": "q010",
         "question": "Can FastAPI generate GraphQL schemas natively?",
+        "expected_answer_keywords": [
+            "not",
+            "does not contain",
+            "no information"
+        ],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
         "requires_calculator": false,
+        "reference_answer": "",
+        "source_snippets": []
     },
     {
         "id": "q011",
         "question": "What is the default Swagger UI endpoint in FastAPI?",
+        "expected_answer_keywords": [
+            "/docs",
+            "Swagger",
+            "interactive"
+        ],
+        "expected_sources": [
+            "fastapi_openapi.md"
+        ],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
+        "reference_answer": "The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface.",
+        "source_snippets": [
+            "| `/docs`          | Swagger UI -- interactive API explorer           |",
+            "Every FastAPI application exposes three documentation-related endpoints by default:"
+        ]
     },
     {
         "id": "q012",
         "question": "How do you raise an HTTP error in a FastAPI route handler?",
+        "expected_answer_keywords": [
+            "HTTPException",
+            "status_code",
+            "detail"
+        ],
+        "expected_sources": [
+            "fastapi_error_handling.md"
+        ],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
+        "reference_answer": "You raise an HTTP error in FastAPI by raising an HTTPException with a status_code and a detail message describing the error.",
+        "source_snippets": [
+            "The `HTTPException` class is the primary way to return error responses from route handlers:",
+            "When raised, `HTTPException` immediately terminates request processing and returns the specified status code and detail message. The `detail` parameter can be a string, list, or dictionary -- FastAPI serializes it to JSON automatically."
+        ]
     },
     {
         "id": "q013",
         "question": "How do you define a request body in FastAPI?",
+        "expected_answer_keywords": [
+            "Pydantic",
+            "BaseModel",
+            "JSON"
+        ],
+        "expected_sources": [
+            "fastapi_request_body.md"
+        ],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
     {
         "id": "q014",
         "question": "What testing tools does FastAPI use, and what class provides the test client?",
+        "expected_answer_keywords": [
+            "TestClient",
+            "pytest",
+            "Starlette"
+        ],
+        "expected_sources": [
+            "fastapi_testing.md"
+        ],
         "category": "retrieval",
         "difficulty": "easy",
         "requires_calculator": false,
     {
         "id": "q015",
         "question": "How does FastAPI manage application configuration and environment variables?",
+        "expected_answer_keywords": [
+            "BaseSettings",
+            "pydantic",
+            "env",
+            "environment"
+        ],
+        "expected_sources": [
+            "fastapi_configuration.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
     {
         "id": "q016",
         "question": "What is the minimum response size for GZip compression middleware in FastAPI, and how do you enable it?",
+        "expected_answer_keywords": [
+            "500",
+            "bytes",
+            "GZipMiddleware",
+            "minimum_size"
+        ],
+        "expected_sources": [
+            "fastapi_middleware.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
     {
         "id": "q017",
         "question": "How do yield dependencies work in FastAPI and what is the maximum number supported per request?",
+        "expected_answer_keywords": [
+            "yield",
+            "cleanup",
+            "finally",
+            "32"
+        ],
+        "expected_sources": [
+            "fastapi_dependencies.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
     {
         "id": "q018",
         "question": "What are the three documentation endpoints FastAPI exposes by default and what OpenAPI version does it use?",
+        "expected_answer_keywords": [
+            "/docs",
+            "/redoc",
+            "/openapi.json",
+            "3.1"
+        ],
+        "expected_sources": [
+            "fastapi_openapi.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
     {
         "id": "q019",
         "question": "How does FastAPI handle WebSocket connections, and what must be called before sending data?",
+        "expected_answer_keywords": [
+            "accept",
+            "WebSocket",
+            "send",
+            "receive"
+        ],
+        "expected_sources": [
+            "fastapi_websockets.md"
+        ],
         "category": "retrieval",
         "difficulty": "medium",
         "requires_calculator": false,
     {
         "id": "q020",
         "question": "For a server with 4 CPU cores, how many Gunicorn workers should be configured using the recommended formula?",
+        "expected_answer_keywords": [
+            "9",
+            "workers",
+            "2",
+            "CPU",
+            "1"
+        ],
+        "expected_sources": [
+            "fastapi_deployment.md"
+        ],
         "category": "calculation",
         "difficulty": "medium",
         "requires_calculator": true,
     {
         "id": "q021",
         "question": "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?",
+        "expected_answer_keywords": [
+            "10",
+            "minutes"
+        ],
+        "expected_sources": [
+            "fastapi_middleware.md"
+        ],
         "category": "calculation",
         "difficulty": "easy",
         "requires_calculator": true,
+        "reference_answer": "With a CORS max_age of 600 seconds, the browser caches preflight results for 10 minutes (600 / 60 = 10).",
+        "source_snippets": [
+            "| `max_age`            | `600`   | Seconds the browser caches preflight results       |"
+        ]
     },
     {
         "id": "q022",
         "question": "How do route ordering and dependency injection interact when building a secure FastAPI application with scoped endpoints?",
+        "expected_answer_keywords": [
+            "order",
+            "Depends",
+            "Security",
+            "scopes"
+        ],
+        "expected_sources": [
+            "fastapi_path_params.md",
+            "fastapi_dependencies.md",
+            "fastapi_security.md"
+        ],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
     {
         "id": "q023",
         "question": "How would you set up a FastAPI application with custom error handling, CORS middleware, and structured testing including dependency overrides?",
+        "expected_answer_keywords": [
+            "HTTPException",
+            "CORSMiddleware",
+            "TestClient",
+            "override"
+        ],
+        "expected_sources": [
+            "fastapi_error_handling.md",
+            "fastapi_middleware.md",
+            "fastapi_testing.md"
+        ],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
+        "reference_answer": "Custom error handling is set up by raising HTTPException or registering exception handlers, CORS is configured by adding CORSMiddleware with allowed origins, and testing uses TestClient with app.dependency_overrides to replace dependencies during tests.",
+        "source_snippets": [
+            "The `HTTPException` class is the primary way to return error responses from route handlers:",
+            "Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette:"
+        ]
     },
     {
         "id": "q024",
         "question": "Explain how to deploy a FastAPI app with Docker using Gunicorn workers, health checks, and environment-based configuration via Pydantic Settings.",
+        "expected_answer_keywords": [
+            "Docker",
+            "Gunicorn",
+            "health",
+            "BaseSettings",
+            "env"
+        ],
+        "expected_sources": [
+            "fastapi_deployment.md",
+            "fastapi_configuration.md"
+        ],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
     {
         "id": "q025",
         "question": "How would you build a paginated API with cursor-based navigation, response model validation, and background task processing for analytics logging?",
+        "expected_answer_keywords": [
+            "cursor",
+            "response_model",
+            "BackgroundTasks"
+        ],
+        "expected_sources": [
+            "fastapi_pagination.md",
+            "fastapi_response_model.md",
+            "fastapi_background_tasks.md"
+        ],
         "category": "retrieval",
         "difficulty": "hard",
         "requires_calculator": false,
+        "reference_answer": "Cursor-based pagination uses an opaque cursor token for navigation instead of page numbers. Response models are validated using the response_model parameter on route decorators, and analytics logging is handled asynchronously via FastAPI's BackgroundTasks dependency.",
+        "source_snippets": [
+            "Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets:"
+        ]
     },
     {
         "id": "q026",
         "question": "Does FastAPI have built-in support for database migrations like Alembic?",
+        "expected_answer_keywords": [
+            "not",
+            "does not contain",
+            "no information"
+        ],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
     {
         "id": "q027",
         "question": "How does FastAPI handle automatic load balancing across multiple servers?",
+        "expected_answer_keywords": [
+            "not",
+            "does not contain",
+            "no information"
+        ],
         "expected_sources": [],
         "category": "out_of_scope",
         "difficulty": "easy",
         "requires_calculator": false,
+        "reference_answer": "",
+        "source_snippets": []
     }
 ]
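
The expected keywords for the calculation items above (q007, q020, q021) can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the commit: the helper names `page_count`, `gunicorn_workers`, and `seconds_to_minutes` are hypothetical, and the worker formula `2 * cores + 1` is an assumption inferred from the q020 keywords ("2", "CPU", "1", "9"), not quoted from `fastapi_deployment.md`.

```python
import math


def page_count(total_items: int, page_size: int) -> int:
    """Number of pages needed to list total_items at page_size per page."""
    return math.ceil(total_items / page_size)


def gunicorn_workers(cpu_cores: int) -> int:
    """Assumed formula (inferred from the q020 keywords): 2 * cores + 1."""
    return 2 * cpu_cores + 1


def seconds_to_minutes(seconds: int) -> int:
    """Whole minutes in the given number of seconds."""
    return seconds // 60


if __name__ == "__main__":
    print(page_count(10_000, 20))   # q007, page size 20
    print(page_count(10_000, 30))   # q007, page size 30 (last page is partial)
    print(gunicorn_workers(4))      # q020
    print(seconds_to_minutes(600))  # q021
```

The partial last page is why q007 lists "ceil" among its keywords: 10,000 / 30 is not an integer, so the count rounds up.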

agent_bench/evaluation/harness.py CHANGED

@@ -8,8 +8,13 @@ from pathlib import Path
 from pydantic import BaseModel, Field
 from agent_bench.agents.orchestrator import Orchestrator
 from agent_bench.core.provider import LLMProvider
 from agent_bench.core.types import TokenUsage
 from agent_bench.evaluation.metrics import (
     calculator_used_when_expected,
     citation_accuracy,
@@ -21,6 +26,18 @@ from agent_bench.evaluation.metrics import (
     tool_call_count,
 )
 class GoldenQuestion(BaseModel):
     id: str
@@ -70,9 +87,13 @@ class EvalResult(BaseModel):
     # Raw answer for reporting
     answer: str = ""
     retrieved_sources: list[str] = []
-    # LLM judge (None if not run)
-    faithfulness: float | None = None
-    correctness: float | None = None
 def load_golden_dataset(path: str | Path) -> list[GoldenQuestion]:
@@ -149,21 +170,46 @@ async def run_evaluation(
             retrieved_sources=ranked_sources,
         )
-        # Optional LLM judge
-        if judge_provider is not None and q.category != "out_of_scope":
-            from agent_bench.evaluation.metrics import answer_correctness, answer_faithfulness
-            result.faithfulness = await answer_faithfulness(
-                answer=agent_response.answer,
-                source_chunks=agent_response.source_chunks,
-                judge_provider=judge_provider,
-            )
-            if q.reference_answer:
-                result.correctness = await answer_correctness(
-                    answer=agent_response.answer,
-                    reference_answer=q.reference_answer,
                     judge_provider=judge_provider,
                 )
         results.append(result)

 from pydantic import BaseModel, Field
 from agent_bench.agents.orchestrator import Orchestrator
+from agent_bench.core.config import load_config
 from agent_bench.core.provider import LLMProvider
 from agent_bench.core.types import TokenUsage
+from agent_bench.evaluation.judges.base import Rubric, ScoreResult
+from agent_bench.evaluation.judges.completeness import CompletenessJudge
+from agent_bench.evaluation.judges.groundedness import GroundednessJudge
+from agent_bench.evaluation.judges.relevance import RelevanceJudge
 from agent_bench.evaluation.metrics import (
     calculator_used_when_expected,
     citation_accuracy,
     tool_call_count,
 )
+# Annotating the values as type[Judge] would lose concrete-class info and
+# trigger mypy's "cannot instantiate abstract class" error at the dispatch
+# site below. The dict's runtime values are concrete, instantiable
+# subclasses; the explicit union annotation below preserves that information.
+_JUDGE_CLASS_BY_DIMENSION: dict[
+    str, type[GroundednessJudge] | type[RelevanceJudge] | type[CompletenessJudge]
+] = {
+    "groundedness": GroundednessJudge,
+    "relevance": RelevanceJudge,
+    "completeness": CompletenessJudge,
+}
 class GoldenQuestion(BaseModel):
     id: str
     # Raw answer for reporting
     answer: str = ""
     retrieved_sources: list[str] = []
+    # New in judge-layer v1: per-dimension judge scores. Empty when no
+    # judge_provider is configured. With a provider, OOS items receive
+    # relevance only (refusal-vs-engagement is the L2 signal worth
+    # measuring); reference-based dimensions (groundedness, completeness)
+    # are skipped on OOS. Completeness is also skipped when
+    # reference_answer is empty regardless of category.
+    judge_scores: dict[str, ScoreResult] = Field(default_factory=dict)
 def load_golden_dataset(path: str | Path) -> list[GoldenQuestion]:
             retrieved_sources=ranked_sources,
         )
+        # Optional L2 LLM-judge layer (per-dimension; gated per-dim).
+        #
+        # OOS items get relevance scoring (a non-refusal answer to an OOS
+        # question is exactly what relevance is designed to detect — the
+        # rubric's "refusal that ignores the question" example covers this
+        # case). Groundedness and completeness are skipped on OOS because
+        # neither has a meaningful reference (no source_snippets, no
+        # reference_answer for OOS items).
+        #
+        # This per-dimension gating matches the calibration runner's
+        # behavior so the κ table's distribution of scored items lines up
+        # with what the production harness produces. Diverging gates would
+        # mean the calibration κ for relevance was estimated on items the
+        # production harness never sees, breaking the supersession's
+        # empirical backing.
+        if judge_provider is not None:
+            cfg = load_config()
+            rubric_dir = Path(__file__).resolve().parent / "rubrics"
+            is_oos = q.category == "out_of_scope"
+            for dim in cfg.evaluation.judge_dimensions:
+                if dim not in _JUDGE_CLASS_BY_DIMENSION:
+                    continue  # citation_faithfulness opt-in; not in default loop
+                # Per-dimension OOS gating: skip reference-based dimensions
+                # (groundedness, completeness) on OOS; allow relevance.
+                if is_oos and dim != "relevance":
+                    continue
+                # CompletenessJudge is reference-based on q.reference_answer;
+                # scoring an empty reference is guaranteed-noisy and burns
+                # tokens. Pre-supersession code had the same gate (correctness
+                # was conditional on reference_answer being non-empty).
+                if dim == "completeness" and not q.reference_answer:
+                    continue
+                rubric = Rubric.from_markdown_file(rubric_dir / f"{dim}.md")
+                judge = _JUDGE_CLASS_BY_DIMENSION[dim](
                     judge_provider=judge_provider,
+                    rubric=rubric,
+                    model_id=getattr(judge_provider, "model", "unknown"),
                 )
+                score_result = await judge.score(q, agent_response)
+                result.judge_scores[dim] = score_result
         results.append(result)
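
The per-dimension gating in the new `run_evaluation` loop can be read as a pure function from (configured dimensions, category, reference answer) to the list of dimensions that get scored. The sketch below restates that logic in isolation so it can be tested without a provider. The function name `dimensions_to_score` and the `KNOWN_DIMENSIONS` tuple are hypothetical and do not appear in the commit.

```python
# Dimensions the default loop knows how to dispatch (mirrors the keys of
# _JUDGE_CLASS_BY_DIMENSION in the diff above; citation_faithfulness is opt-in).
KNOWN_DIMENSIONS = ("groundedness", "relevance", "completeness")


def dimensions_to_score(
    configured: list[str],
    category: str,
    reference_answer: str,
) -> list[str]:
    """Return the configured dimensions that pass the harness gates."""
    is_oos = category == "out_of_scope"
    selected: list[str] = []
    for dim in configured:
        if dim not in KNOWN_DIMENSIONS:
            continue  # not dispatched by the default loop
        if is_oos and dim != "relevance":
            continue  # reference-based dimensions are skipped on OOS items
        if dim == "completeness" and not reference_answer:
            continue  # nothing to compare the answer against
        selected.append(dim)
    return selected


if __name__ == "__main__":
    all_dims = ["groundedness", "relevance", "completeness", "citation_faithfulness"]
    print(dimensions_to_score(all_dims, "out_of_scope", ""))
    print(dimensions_to_score(all_dims, "retrieval", ""))
    print(dimensions_to_score(all_dims, "retrieval", "Some reference answer."))
```

Keeping the gate as a pure function would also make it easy to share between the production harness and the calibration runner, which the comments in the diff say must stay in step.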

agent_bench/evaluation/judges/__init__.py ADDED

	@@ -0,0 +1,25 @@

+"""Discrete-scale per-dimension LLM judges with anchored rubrics."""
+from agent_bench.evaluation.judges.base import (
+    ABSTAIN_REASON_GENUINE,
+    ABSTAIN_REASON_OUT_OF_RANGE,
+    ABSTAIN_REASON_PROVIDER_EXHAUSTED,
+    ABSTAIN_REASON_SCHEMA_PARSE,
+    Judge,
+    MockJudge,
+    Rubric,
+    RubricLevel,
+    ScoreResult,
+)
+__all__ = [
+    "ABSTAIN_REASON_GENUINE",
+    "ABSTAIN_REASON_OUT_OF_RANGE",
+    "ABSTAIN_REASON_PROVIDER_EXHAUSTED",
+    "ABSTAIN_REASON_SCHEMA_PARSE",
+    "Judge",
+    "MockJudge",
+    "Rubric",
+    "RubricLevel",
+    "ScoreResult",
+]
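
The four exported `ABSTAIN_REASON_*` constants are prefixes on `ScoreResult.reasoning`, and `base.py` says the calibration report matches against them for a four-way breakdown. The sketch below shows one way such a classifier could look. It is an assumption, not the report's actual code: `classify_abstain` and `abstain_breakdown` are hypothetical names, and the constant values are copied from `base.py` in this commit so the block runs on its own. Because the genuine-abstain sentinel is the empty string, and every string starts with `""`, it has to be the fallback rather than one of the prefix checks.

```python
from collections import Counter

# Values copied from agent_bench/evaluation/judges/base.py in this commit.
ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: "
ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "

# Hypothetical labels for the breakdown buckets.
_FAILURE_PREFIXES = {
    "provider_exhausted": ABSTAIN_REASON_PROVIDER_EXHAUSTED,
    "schema_parse": ABSTAIN_REASON_SCHEMA_PARSE,
    "out_of_range": ABSTAIN_REASON_OUT_OF_RANGE,
}


def classify_abstain(reasoning: str) -> str:
    """Map an abstained result's reasoning string to a breakdown bucket."""
    for label, prefix in _FAILURE_PREFIXES.items():
        if reasoning.startswith(prefix):
            return label
    # No failure prefix matched: treat as a genuine, rubric-allowed abstain.
    return "genuine"


def abstain_breakdown(reasonings: list[str]) -> Counter[str]:
    """Count abstains per bucket for a list of reasoning strings."""
    return Counter(classify_abstain(r) for r in reasonings)


if __name__ == "__main__":
    sample = [
        "judge_call_failed_after_retry: timeout",
        "schema_parse_failed_after_retry: not JSON",
        "",
    ]
    print(abstain_breakdown(sample))
```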

agent_bench/evaluation/judges/base.py ADDED

	@@ -0,0 +1,628 @@

+"""Judge ABC, ScoreResult, Rubric, MockJudge, abstain-reason constants.
+The Judge layer supersedes the continuous-scale answer_faithfulness /
+answer_correctness functions in agent_bench/evaluation/metrics.py. See
+docs/plans/2026-05-04-judge-layer-v1-design.md for the supersession
+rationale and the six-axis comparison table.
+"""
+from __future__ import annotations
+import hashlib
+import json as _json
+import random
+import re
+import time
+from abc import ABC, abstractmethod
+from pathlib import Path
+from typing import TYPE_CHECKING, Literal, Self
+import structlog
+import yaml
+from pydantic import BaseModel, Field
+from agent_bench.core.provider import (
+    ProviderRateLimitError,
+    ProviderTimeoutError,
+)
+from agent_bench.core.types import Message, Role
+if TYPE_CHECKING:
+    from agent_bench.agents.orchestrator import AgentResponse
+    from agent_bench.core.provider import LLMProvider
+    from agent_bench.evaluation.harness import GoldenQuestion
+logger = structlog.get_logger()
+# --- Abstain-reason constants ---
+#
+# Failure-as-abstain ScoreResults carry a reasoning string with one of
+# these prefixes. The calibration report pattern-matches against these
+# constants for the four-way breakdown in the >20% abstain-rate flag.
+# Genuine model abstain (rubric-allowed) uses the empty-string sentinel.
+ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: "
+ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
+ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "
+ABSTAIN_REASON_GENUINE = ""
+class ScoreResult(BaseModel):
+    """One judge call's result. Self-contained provenance — no run
+    metadata cross-reference needed for κ aggregation.
+    Field order matters: reasoning + evidence_quotes come BEFORE score
+    in both Pydantic field order and the JSON schema sent to the model,
+    so the score conditions on the reasoning rather than being
+    post-hoc rationalized.
+    """
+    # Reasoning-first ordering — load-bearing for the JSON schema
+    reasoning: str
+    evidence_quotes: list[str] = Field(default_factory=list)
+    score: int | Literal["Unknown"]
+    # Provenance
+    judge_id: str
+    rubric_version: str
+    prompt_seed: int = 0
+    system_output_hash: str
+    # Operations
+    cost_usd: float
+    latency_ms: float
+    @property
+    def abstained(self) -> bool:
+        return self.score == "Unknown"
+_FENCE_PATTERN = re.compile(r"^```[^\n]*\n.*?^```\n?", re.MULTILINE | re.DOTALL)
+def _mask_code_fences(text: str) -> str:
+    """Replace fenced code blocks (``` ... ```) with same-length whitespace,
+    preserving newlines so character offsets align with the original. Used by
+    the rubric loader to skip fenced ``## Score N`` literals when scanning
+    for structural level headers.
+    """
+    def _replace(match: re.Match[str]) -> str:
+        return "".join("\n" if c == "\n" else " " for c in match.group(0))
+    return _FENCE_PATTERN.sub(_replace, text)
+class RubricLevel(BaseModel):
+    """One score level in a rubric, with anchored examples.
+    Parsed from markdown sections under `## Score N` headers. The
+    `examples` list contains the H3 sub-sections (`### Example X`)
+    each with a thinking-trace explanation of why that output got
+    that score.
+    """
+    score: int
+    description: str
+    examples: list[str]  # raw markdown of `### Example` sections
+class Rubric(BaseModel):
+    """A scoring rubric loaded from a markdown file with YAML frontmatter.
+    Construction validates aggressively: scale ∈ {binary, three_point},
+    levels arity matches scale, every level has at least one anchored
+    example. ValidationError raises with file path + field path so a
+    Day-1 rubric typo doesn't surface as a Day-2 judge.score crash with
+    API budget already spent.
+    """
+    dimension: Literal[
+        "groundedness", "relevance", "completeness", "citation_faithfulness"
+    ]
+    scale: Literal["binary", "three_point"]
+    reference_based: bool
+    abstain_allowed: bool
+    levels: list[RubricLevel]
+    body_markdown: str
+    @property
+    def source_hash(self) -> str:
+        """SHA-256 of the canonical body. Immutable per file content,
+        independent of git state. Used as ScoreResult.rubric_version.
+        """
+        return hashlib.sha256(self.body_markdown.encode("utf-8")).hexdigest()
+    @classmethod
+    def from_markdown_file(cls, path: Path | str) -> Self:
+        path = Path(path)
+        body = path.read_text(encoding="utf-8")
+        # Parse YAML frontmatter delimited by --- ... ---
+        fm_match = re.match(r"^---\n(.+?)\n---\n(.*)$", body, re.DOTALL)
+        if not fm_match:
+            raise ValueError(
+                f"Rubric {path.name}: missing YAML frontmatter "
+                f"(expected --- ... --- block at top of file)"
+            )
+        try:
+            frontmatter = yaml.safe_load(fm_match.group(1)) or {}
+        except yaml.YAMLError as e:
+            raise ValueError(
+                f"Rubric {path.name}: frontmatter YAML parse error: {e}"
+            ) from e
+        required = {"dimension", "scale", "reference_based", "abstain_allowed"}
+        missing = required - frontmatter.keys()
+        if missing:
+            raise ValueError(
+                f"Rubric {path.name}: frontmatter missing fields: {sorted(missing)}"
+            )
+        scale = frontmatter["scale"]
+        if scale not in ("binary", "three_point"):
+            raise ValueError(
+                f"Rubric {path.name}: invalid scale {scale!r}; "
+                f"must be 'binary' or 'three_point'"
+            )
+        # Parse levels by ## Score N headers. Mask fenced code blocks first
+        # so a literal "## Score N" inside an example's code fence is not
+        # interpreted as a structural level header. The mask preserves
+        # character offsets (replacing non-newline chars with spaces) so we
+        # can slice the original `body_no_fm` at the masked-text header
+        # positions to recover level bodies with their fenced content intact.
+        body_no_fm = fm_match.group(2)
+        masked_body = _mask_code_fences(body_no_fm)
+        header_pattern = re.compile(r"^## Score (\d+)\n", re.MULTILINE)
+        header_matches = list(header_pattern.finditer(masked_body))
+        raw_levels: list[tuple[int, str]] = []
+        for i, m in enumerate(header_matches):
+            start = m.end()
+            end = (
+                header_matches[i + 1].start()
+                if i + 1 < len(header_matches)
+                else len(body_no_fm)
+            )
+            raw_levels.append((int(m.group(1)), body_no_fm[start:end]))
+        expected_arity = 2 if scale == "binary" else 3
+        if len(raw_levels) != expected_arity:
+            raise ValueError(
+                f"Rubric {path.name}: arity mismatch — scale {scale!r} "
+                f"requires {expected_arity} levels, found {len(raw_levels)}"
+            )
+        # Parse examples (### Example) per level
+        levels: list[RubricLevel] = []
+        for score, level_body in raw_levels:
+            example_pattern = re.compile(
+                r"^### (Example .+?)\n(.*?)(?=^### |\Z)", re.MULTILINE | re.DOTALL
+            )
+            examples = [m.group(0) for m in example_pattern.finditer(level_body)]
+            if not examples:
+                raise ValueError(
+                    f"Rubric {path.name}: level Score {score} has no "
+                    f"anchored example (expected at least one ### Example header)"
+                )
+            description = level_body.split("###", 1)[0].strip()
+            levels.append(
+                RubricLevel(score=score, description=description, examples=examples)
+            )
+        return cls(
+            dimension=frontmatter["dimension"],
+            scale=scale,
+            reference_based=bool(frontmatter["reference_based"]),
+            abstain_allowed=bool(frontmatter["abstain_allowed"]),
+            levels=levels,
+            body_markdown=body,
+        )
+    def render_prompt(self, *, level_permutation_seed: int = 0) -> str:
+        """Render the rubric body for inclusion in a judge prompt.
+        If level_permutation_seed is nonzero, levels are reordered
+        deterministically using a seeded PRNG. seed=0 returns the canonical order.
+        """
+        if level_permutation_seed == 0:
+            return self.body_markdown
+        rng = random.Random(level_permutation_seed)
+        permuted_levels = list(self.levels)
+        rng.shuffle(permuted_levels)
+        # Reconstruct: keep frontmatter + intro paragraphs intact;
+        # reorder the ## Score N sections.
+        fm_match = re.match(r"^(---\n.+?\n---\n)(.*)$", self.body_markdown, re.DOTALL)
+        if not fm_match:
+            return self.body_markdown  # defensive — should never happen post-construction
+        head = fm_match.group(1)
+        rest = fm_match.group(2)
+        intro = re.split(r"^## Score ", rest, maxsplit=1, flags=re.MULTILINE)[0]
+        permuted_body = head + intro + "\n".join(
+            f"## Score {lvl.score}\n{lvl.description}\n" + "\n".join(lvl.examples)
+            for lvl in permuted_levels
+        )
+        return permuted_body
+    def strip_anchors(self) -> Self:
+        """Return a new Rubric with anchored examples removed from every
+        level (and a regenerated body_markdown that omits the ``### Example``
+        sections). Used by the calibration runner's `use_anchors=false`
+        ablation row to measure the contribution of anchored examples.
+        source_hash naturally diverges because body_markdown changes — so
+        ScoreResults from the stripped rubric carry a different
+        rubric_version, and the calibration report can bucket them
+        correctly without requiring a separate provenance field.
+        """
+        fm_match = re.match(r"^(---\n.+?\n---\n)(.*)$", self.body_markdown, re.DOTALL)
+        head = fm_match.group(1) if fm_match else ""
+        rest = fm_match.group(2) if fm_match else self.body_markdown
+        intro = re.split(r"^## Score ", rest, maxsplit=1, flags=re.MULTILINE)[0]
+        # Render each level with its description but no examples.
+        stripped_body = head + intro + "\n".join(
+            f"## Score {lvl.score}\n{lvl.description}\n" for lvl in self.levels
+        )
+        stripped_levels = [
+            RubricLevel(score=lvl.score, description=lvl.description, examples=[])
+            for lvl in self.levels
+        ]
+        return type(self)(
+            dimension=self.dimension,
+            scale=self.scale,
+            reference_based=self.reference_based,
+            abstain_allowed=self.abstain_allowed,
+            levels=stripped_levels,
+            body_markdown=stripped_body,
+        )
+class Judge(ABC):
+    """Per-dimension LLM judge. Concrete subclasses implement score()
+    for one rubric dimension; they are thin (~30 lines) and not
+    factored against a shared base method (see design doc for why).
+    Three calibration knobs are accepted at construction so the
+    calibration runner can run baseline-vs-ablation rows from the same
+    code path without monkey-patching:
+    - ``use_cot`` (default True) — when False, the JSON schema requested
+      from the model omits the ``reasoning`` and ``evidence_quotes``
+      fields, ablating the chain-of-thought-before-score discipline.
+    - ``abstain_allowed_override`` (default None) — when set, overrides
+      the rubric's ``abstain_allowed`` flag for this judge's calls. Used
+      by the ``baseline_no_abstain`` ablation row.
+    - The ``use_anchors`` knob is implemented by passing a stripped
+      rubric (via ``Rubric.strip_anchors()``) at construction time, not
+      via a separate flag here — that way ScoreResult.rubric_version
+      naturally distinguishes anchored vs stripped variants.
+    """
+    def __init__(
+        self,
+        judge_provider: "LLMProvider",
+        rubric: Rubric,
+        model_id: str,
+        *,
+        use_cot: bool = True,
+        abstain_allowed_override: bool | None = None,
+    ) -> None:
+        self.judge_provider = judge_provider
+        self.rubric = rubric
+        self.model_id = model_id
+        self.use_cot = use_cot
+        self.abstain_allowed_override = abstain_allowed_override
+        # judge_id format: ``{model_id}_{dimension}`` — load-bearing for
+        # the calibration report's per-judge κ breakdown. Ablation knobs
+        # do NOT enter the judge_id; the row label + ScoreResult.
+        # rubric_version (which differs for stripped anchors) carry that
+        # signal. This keeps the per-judge bucketing stable across
+        # baseline + ablation rows for the same model.
+        self.judge_id = f"{model_id}_{rubric.dimension}"
+    @property
+    def effective_abstain_allowed(self) -> bool:
+        """Whether abstain is permitted for this judge's calls; the
+        override (when set) takes precedence over the rubric's flag.
+        """
+        if self.abstain_allowed_override is not None:
+            return self.abstain_allowed_override
+        return self.rubric.abstain_allowed
+    def _json_schema_clause(self, valid_scores_str: str) -> str:
+        """Render the trailing JSON-schema instruction for the prompt.
+        With ``use_cot=True`` (default) the schema asks for reasoning
+        and evidence_quotes before the score, so the model's response
+        conditions the score on the reasoning. With ``use_cot=False``
+        only the score field is requested — used for the ``no_cot``
+        ablation row.
+        """
+        if self.use_cot:
+            return (
+                f'JSON object: {{"reasoning": "...", '
+                f'"evidence_quotes": [...], "score": {valid_scores_str}}}.'
+            )
+        return f'JSON object: {{"score": {valid_scores_str}}}.'
+    @abstractmethod
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+        *,
+        prompt_seed: int = 0,
+    ) -> ScoreResult:
+        """Score one (item, output) pair against this judge's rubric.
+        Returns a ScoreResult whose system_output_hash is computed from
+        (item.id, output.answer, sorted(output.sources)). Retryable failures
+        map to abstain via the abstain-reason constants; non-retryable
+        provider errors are raised (they indicate a caller bug, not noise).
+        """
+        ...
+class MockJudge(Judge):
+    """Pre-baked-verdict judge for deterministic tests. No API calls.
+    Constructor takes verdicts: dict[item_id, ScoreResult]. score()
+    raises LookupError on missing keys — never returns a default —
+    so test fixtures are self-checking. A separate fixture-validation
+    test (test_mockjudge_coverage.py) walks item.id across all goldens
+    and asserts every MockJudge instance has coverage for the items
+    its tests reference.
+    Mirrors the MockProvider pattern at agent_bench/core/provider.py.
+    """
+    def __init__(self, verdicts: dict[str, ScoreResult]) -> None:
+        # MockJudge needs no provider or rubric, so it skips
+        # Judge.__init__ and sets placeholder attributes directly.
+        self.judge_provider = None  # type: ignore[assignment]
+        self.rubric = None  # type: ignore[assignment]
+        self.model_id = "mock"
+        self.judge_id = "mock_judge"
+        self._verdicts = verdicts
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+        *,
+        prompt_seed: int = 0,
+    ) -> ScoreResult:
+        if item.id not in self._verdicts:
+            raise LookupError(
+                f"MockJudge has no pre-baked verdict for item_id {item.id!r}; "
+                f"available: {sorted(self._verdicts.keys())[:5]}"
+                + (" ..." if len(self._verdicts) > 5 else "")
+            )
+        return self._verdicts[item.id]
+# --- _call_judge_with_retry helper ---
+_STRICT_REPROMPT_SUFFIX = (
+    "\n\nSTRICT FORMATTING NOTE: respond ONLY with a JSON object matching "
+    "the schema; reasoning first, then evidence_quotes, then score. "
+    "Do not wrap the JSON in a markdown code fence."
+)
+_MARKDOWN_FENCE_RE = re.compile(r"^\s*```(?:json|JSON)?\s*\n(.*?)\n```\s*$", re.DOTALL)
+def _strip_markdown_fence(text: str) -> str:
+    """Strip a leading/trailing ```json ... ``` markdown fence if present.
+    Some chat models wrap structured JSON in a markdown code fence even
+    when the prompt asks for a bare JSON object. The judge parser uses
+    json.loads on the raw content, which fails at char 0 on the literal
+    backtick. This helper unwraps the fence so the parse can proceed.
+    No-op when no fence is present: the text is returned unchanged.
+    """
+    m = _MARKDOWN_FENCE_RE.match(text.strip())
+    return m.group(1) if m else text
+async def _call_judge_with_retry(
+    *,
+    provider: "LLMProvider",
+    prompt: str,
+    valid_scores: set[int],
+    judge_id: str,
+    rubric_version: str,
+    prompt_seed: int,
+    system_output_hash: str,
+    item_id: str,
+    abstain_allowed: bool = True,
+    max_tokens: int = 1024,
+) -> ScoreResult:
+    """Send prompt to provider; one retry with strict reprompt on
+    schema-parse / score-out-of-range; abstain on persistent failure
+    or provider exhaustion. Re-raises unknown exceptions (caller bugs).
+    max_tokens defaults to 1024 (was 512 pre-v1.1). The v1.1 groundedness
+    rubric ships with calibration anchors whose verbose thinking traces
+    elicit longer model reasoning in turn; 512 truncated the JSON
+    response mid-reasoning and caused 78/82 schema_parse_failed
+    abstains in the first run after the rubric clarification. 1024 leaves
+    enough headroom; bump again if a future rubric revision pushes
+    reasoning longer.
+    """
+    accumulated_cost = 0.0
+    accumulated_latency = 0.0
+    for attempt in range(2):  # 2 = original + one retry
+        send_prompt = prompt if attempt == 0 else prompt + _STRICT_REPROMPT_SUFFIX
+        start = time.perf_counter()
+        try:
+            response = await provider.complete(
+                [Message(role=Role.USER, content=send_prompt)],
+                temperature=0.0,
+                max_tokens=max_tokens,
+            )
+        except (ProviderRateLimitError, ProviderTimeoutError) as e:
+            return ScoreResult(
+                reasoning=f"{ABSTAIN_REASON_PROVIDER_EXHAUSTED}{type(e).__name__}: {e}",
+                evidence_quotes=[],
+                score="Unknown",
+                judge_id=judge_id,
+                rubric_version=rubric_version,
+                prompt_seed=prompt_seed,
+                system_output_hash=system_output_hash,
+                cost_usd=accumulated_cost,
+                latency_ms=accumulated_latency + (time.perf_counter() - start) * 1000,
+            )
+        # Other exceptions (caller bugs like 401, 400) propagate.
+        accumulated_cost += response.usage.estimated_cost_usd
+        accumulated_latency += (time.perf_counter() - start) * 1000
+        last_raw = response.content[:300]
+        # Parse — reasoning and evidence_quotes are optional so judges
+        # configured with use_cot=False (which prompt for {"score": ...}
+        # only) don't fail parsing on the missing key.
+        #
+        # Some models (observed on Haiku 4.5 under the v1.1 rubric) wrap
+        # their JSON in a ```json ... ``` markdown fence. Strip the fence
+        # before parsing rather than abstaining on a response whose JSON
+        # is valid and merely wrapped in a fence.
+        content = _strip_markdown_fence(response.content)
+        try:
+            data = _json.loads(content)
+            reasoning = str(data.get("reasoning", ""))
+            evidence_quotes = list(data.get("evidence_quotes", []))
+            raw_score = data["score"]
+        except (_json.JSONDecodeError, KeyError, TypeError, AttributeError) as e:
+            # AttributeError: valid JSON that is not an object (e.g. a list).
+            cause = ABSTAIN_REASON_SCHEMA_PARSE
+            if attempt == 0:
+                logger.warning(
+                    "judge_first_attempt_failure",
+                    judge_id=judge_id,
+                    item_id=item_id,
+                    provider=type(provider).__name__,
+                    failure_cause=cause,
+                    attempt_index=1,
+                )
+                continue
+            return ScoreResult(
+                reasoning=f"{cause}raw={last_raw!r} parse_error={e}",
+                evidence_quotes=[],
+                score="Unknown",
+                judge_id=judge_id,
+                rubric_version=rubric_version,
+                prompt_seed=prompt_seed,
+                system_output_hash=system_output_hash,
+                cost_usd=accumulated_cost,
+                latency_ms=accumulated_latency,
+            )
+        # Score validation
+        if raw_score == "Unknown":
+            if not abstain_allowed:
+                cause = ABSTAIN_REASON_OUT_OF_RANGE
+                if attempt == 0:
+                    logger.warning(
+                        "judge_first_attempt_failure",
+                        judge_id=judge_id,
+                        item_id=item_id,
+                        provider=type(provider).__name__,
+                        failure_cause=cause,
+                        attempt_index=1,
+                    )
+                    continue
+                return ScoreResult(
+                    reasoning=(
+                        f"{cause}model returned 'Unknown' but rubric "
+                        f"abstain_allowed=False"
+                    ),
+                    evidence_quotes=[],
+                    score="Unknown",
+                    judge_id=judge_id,
+                    rubric_version=rubric_version,
+                    prompt_seed=prompt_seed,
+                    system_output_hash=system_output_hash,
+                    cost_usd=accumulated_cost,
+                    latency_ms=accumulated_latency,
+                )
+            # Genuine abstain — no prefix, no retry
+            return ScoreResult(
+                reasoning=reasoning,
+                evidence_quotes=evidence_quotes,
+                score="Unknown",
+                judge_id=judge_id,
+                rubric_version=rubric_version,
+                prompt_seed=prompt_seed,
+                system_output_hash=system_output_hash,
+                cost_usd=accumulated_cost,
+                latency_ms=accumulated_latency,
+            )
+        try:
+            score_int = int(raw_score)
+        except (ValueError, TypeError):
+            cause = ABSTAIN_REASON_OUT_OF_RANGE
+            if attempt == 0:
+                logger.warning(
+                    "judge_first_attempt_failure",
+                    judge_id=judge_id,
+                    item_id=item_id,
+                    provider=type(provider).__name__,
+                    failure_cause=cause,
+                    attempt_index=1,
+                )
+                continue
+            return ScoreResult(
+                reasoning=f"{cause}non-int score: {raw_score!r}",
+                evidence_quotes=[],
+                score="Unknown",
+                judge_id=judge_id,
+                rubric_version=rubric_version,
+                prompt_seed=prompt_seed,
+                system_output_hash=system_output_hash,
+                cost_usd=accumulated_cost,
+                latency_ms=accumulated_latency,
+            )
+        if score_int not in valid_scores:
+            cause = ABSTAIN_REASON_OUT_OF_RANGE
+            if attempt == 0:
+                logger.warning(
+                    "judge_first_attempt_failure",
+                    judge_id=judge_id,
+                    item_id=item_id,
+                    provider=type(provider).__name__,
+                    failure_cause=cause,
+                    attempt_index=1,
+                )
+                continue
+            return ScoreResult(
+                reasoning=(
+                    f"{cause}model returned {score_int}, valid levels "
+                    f"{sorted(valid_scores)}"
+                ),
+                evidence_quotes=[],
+                score="Unknown",
+                judge_id=judge_id,
+                rubric_version=rubric_version,
+                prompt_seed=prompt_seed,
+                system_output_hash=system_output_hash,
+                cost_usd=accumulated_cost,
+                latency_ms=accumulated_latency,
+            )
+        # Success
+        return ScoreResult(
+            reasoning=reasoning,
+            evidence_quotes=evidence_quotes,
+            score=score_int,
+            judge_id=judge_id,
+            rubric_version=rubric_version,
+            prompt_seed=prompt_seed,
+            system_output_hash=system_output_hash,
+            cost_usd=accumulated_cost,
+            latency_ms=accumulated_latency,
+        )
+    raise RuntimeError("_call_judge_with_retry: unreachable code path")
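
Review note: the parse-and-validate path in `_call_judge_with_retry` above can be sketched in isolation. This is a minimal sketch, not the PR's implementation: `parse_judge_response` is a hypothetical helper that collapses every failure to `"Unknown"` and leaves out the retry, the cost and latency accounting, and the `ScoreResult` construction. The fence pattern is the one from the diff.

```python
from __future__ import annotations

import json
import re

# Built from pieces only so this snippet never contains a literal fence;
# the resulting pattern is the same as _MARKDOWN_FENCE_RE in the diff.
TICKS = "`" * 3
FENCE_RE = re.compile(
    r"^\s*" + TICKS + r"(?:json|JSON)?\s*\n(.*?)\n" + TICKS + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fence(text: str) -> str:
    """Unwrap a fenced JSON payload; return text unchanged if unfenced."""
    m = FENCE_RE.match(text.strip())
    return m.group(1) if m else text


def parse_judge_response(raw: str, valid_scores: set[int]) -> int | str:
    """Hypothetical helper: return a valid int score or "Unknown"."""
    try:
        data = json.loads(strip_markdown_fence(raw))
    except json.JSONDecodeError:
        return "Unknown"  # schema-parse failure
    if not isinstance(data, dict) or "score" not in data:
        return "Unknown"  # valid JSON, wrong shape
    raw_score = data["score"]
    if raw_score == "Unknown":
        return "Unknown"  # abstain reported by the model itself
    try:
        score = int(raw_score)
    except (ValueError, TypeError):
        return "Unknown"  # non-integer score
    return score if score in valid_scores else "Unknown"


fenced = TICKS + 'json\n{"reasoning": "ok", "score": 1}\n' + TICKS
print(parse_judge_response(fenced, {0, 1}))          # → 1
print(parse_judge_response('{"score": 5}', {0, 1}))  # → Unknown
```

In the real helper the failure branches are not equivalent: the first failed attempt triggers the strict reprompt, and only a second failure becomes an abstain carrying a cause prefix in `reasoning`.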

agent_bench/evaluation/judges/citation_faithfulness.py ADDED

	@@ -0,0 +1,188 @@

+"""CitationFaithfulnessJudge — binary, per-(claim,citation) all-or-nothing."""
+from __future__ import annotations
+import re
+from typing import TYPE_CHECKING, Literal
+import structlog
+from agent_bench.evaluation.judges.base import (
+    Judge,
+    ScoreResult,
+    _call_judge_with_retry,
+)
+from agent_bench.evaluation.judges.groundedness import _system_output_hash
+if TYPE_CHECKING:
+    from agent_bench.agents.orchestrator import AgentResponse
+    from agent_bench.evaluation.harness import GoldenQuestion
+logger = structlog.get_logger()
+_CITATION_PATTERN = re.compile(r"\[source:\s*([^\]]+)\]")
+def _extract_claims_with_citations(answer: str) -> list[tuple[str, str]]:
+    """Return list of (claim_text, cited_source) pairs.
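+    A "claim" is the sentence (including its terminating punctuation)
+    immediately preceding a [source:] citation, so the tag is assumed
+    to follow the terminator ("Claim. [source: x]"). If the tag comes
+    before the terminator ("Claim [source: x]."), any earlier sentence
+    is selected instead. Prior citation tags inside `before` are
+    stripped so multi-citation answers yield clean claim strings.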
+    """
+    pairs: list[tuple[str, str]] = []
+    for match in _CITATION_PATTERN.finditer(answer):
+        cited = match.group(1).strip()
+        before = answer[: match.start()]
+        # Strip prior [source:...] tags so they don't pollute the claim
+        before_clean = _CITATION_PATTERN.sub("", before)
+        last_end = max(
+            before_clean.rfind("."), before_clean.rfind("!"), before_clean.rfind("?")
+        )
+        if last_end >= 0:
+            prev_end = max(
+                before_clean.rfind(".", 0, last_end),
+                before_clean.rfind("!", 0, last_end),
+                before_clean.rfind("?", 0, last_end),
+            )
+            claim = before_clean[prev_end + 1 : last_end + 1].strip()
+        else:
+            claim = before_clean.strip()
+        pairs.append((claim, cited))
+    return pairs
+class CitationFaithfulnessJudge(Judge):
+    """Aggregates per-(claim, citation) judgments into one item-level
+    binary ScoreResult. Per-pair detail is in evidence_quotes.
+    All-or-nothing aggregation: any unfaithful citation → score 0.
+    The rubric documents the rule explicitly.
+    """
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+        *,
+        prompt_seed: int = 0,
+    ) -> ScoreResult:
+        pairs = _extract_claims_with_citations(output.answer)
+        # Map cited source name to its retrieved chunk text via output.source_chunks
+        # (assumes index alignment with output.sources, matching harness
+        # convention). If the same source appears multiple times in the
+        # sources list with distinct chunks (legitimate when multiple
+        # retrievals match the same doc), `setdefault` keeps only the first
+        # — every "[source: X]" claim then evaluates against that one chunk,
+        # a false-failure risk. Warn so the operator notices.
+        source_names = [s.source for s in output.sources]
+        if len(set(source_names)) < len(source_names):
+            from collections import Counter
+            duplicates = sorted(
+                name for name, n in Counter(source_names).items() if n > 1
+            )
+            logger.warning(
+                "citation_faithfulness_lossy_source_lookup",
+                item_id=item.id,
+                duplicate_source_names=duplicates,
+                detail=(
+                    "source name appears multiple times in output.sources "
+                    "with distinct chunks; only the first chunk will be "
+                    "associated with the name during citation evaluation."
+                ),
+            )
+        source_to_chunk: dict[str, str] = {}
+        for src_ref, chunk in zip(output.sources, output.source_chunks):
+            source_to_chunk.setdefault(src_ref.source, chunk)
+        per_pair_results: list[ScoreResult] = []
+        sys_hash = _system_output_hash(
+            item.id, output.answer, [s.source for s in output.sources]
+        )
+        if not pairs:
+            return ScoreResult(
+                reasoning="no_citations_in_answer",
+                evidence_quotes=[],
+                score=1,
+                judge_id=self.judge_id,
+                rubric_version=self.rubric.source_hash,
+                prompt_seed=prompt_seed,
+                system_output_hash=sys_hash,
+                cost_usd=0.0,
+                latency_ms=0.0,
+            )
+        accumulated_cost = 0.0
+        accumulated_latency = 0.0
+        any_unfaithful = False
+        for claim, cited in pairs:
+            # Empty claim → leading-citation case (e.g., answer starts with
+            # "[source: a.md] ..." with no prior content). There is no claim
+            # to evaluate against the chunk; the well-defined verdict is
+            # vacuously faithful. Skip the API call; record a synthetic
+            # ScoreResult so per-pair detail still appears in evidence_quotes.
+            if not claim:
+                per_pair_results.append(
+                    ScoreResult(
+                        reasoning="empty_claim_vacuously_faithful",
+                        evidence_quotes=[],
+                        score=1,
+                        judge_id=self.judge_id,
+                        rubric_version=self.rubric.source_hash,
+                        prompt_seed=prompt_seed,
+                        system_output_hash=sys_hash,
+                        cost_usd=0.0,
+                        latency_ms=0.0,
+                    )
+                )
+                continue
+            chunk = source_to_chunk.get(cited, "")
+            schema_clause = self._json_schema_clause('0 or 1 or "Unknown"')
+            prompt = (
+                f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
+                f"---\n\n"
+                f"## Claim (from agent's answer)\n{claim}\n\n"
+                f"## Cited chunk content\n{chunk}\n\n"
+                f"Does the cited chunk support the claim? Respond with ONLY a "
+                f"{schema_clause}"
+            )
+            sub_result = await _call_judge_with_retry(
+                provider=self.judge_provider,
+                prompt=prompt,
+                valid_scores={0, 1},
+                judge_id=self.judge_id,
+                rubric_version=self.rubric.source_hash,
+                prompt_seed=prompt_seed,
+                system_output_hash=sys_hash,
+                item_id=f"{item.id}::{cited}",
+                abstain_allowed=self.effective_abstain_allowed,
+            )
+            per_pair_results.append(sub_result)
+            accumulated_cost += sub_result.cost_usd
+            accumulated_latency += sub_result.latency_ms
+            if sub_result.score == 0:
+                any_unfaithful = True
+        aggregate_score: int | Literal["Unknown"] = 0 if any_unfaithful else 1
+        # Any sub-call abstain → propagate Unknown, even when another pair
+        # scored 0 (consistent with strict-quorum)
+        if any(r.abstained for r in per_pair_results):
+            aggregate_score = "Unknown"
+        return ScoreResult(
+            reasoning=(
+                f"all_or_nothing aggregate over {len(per_pair_results)} (claim, citation) pairs; "
+                f"unfaithful={sum(1 for r in per_pair_results if r.score == 0)}, "
+                f"abstained={sum(1 for r in per_pair_results if r.abstained)}"
+            ),
+            evidence_quotes=[r.reasoning[:120] for r in per_pair_results],
+            score=aggregate_score,
+            judge_id=self.judge_id,
+            rubric_version=self.rubric.source_hash,
+            prompt_seed=prompt_seed,
+            system_output_hash=sys_hash,
+            cost_usd=accumulated_cost,
+            latency_ms=accumulated_latency,
+        )
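
Review note: the claim-extraction rule in `_extract_claims_with_citations` above is easy to exercise on its own. The sketch below is a condensed restatement of that function (same regex, same sentence-boundary logic, names without the leading underscore), intended only to show which text ends up paired with which citation. The sample answers are made up for illustration.

```python
from __future__ import annotations

import re

CITATION_PATTERN = re.compile(r"\[source:\s*([^\]]+)\]")


def extract_claims_with_citations(answer: str) -> list[tuple[str, str]]:
    """Pair each [source: X] tag with the sentence that ends before it."""
    pairs: list[tuple[str, str]] = []
    for match in CITATION_PATTERN.finditer(answer):
        cited = match.group(1).strip()
        # Drop earlier tags so dots in file names are not read as sentence ends.
        before = CITATION_PATTERN.sub("", answer[: match.start()])
        last_end = max(before.rfind(c) for c in ".!?")
        if last_end >= 0:
            prev_end = max(before.rfind(c, 0, last_end) for c in ".!?")
            claim = before[prev_end + 1 : last_end + 1].strip()
        else:
            claim = before.strip()
        pairs.append((claim, cited))
    return pairs


# Tag placed after the terminating punctuation: each claim is isolated.
after_period = (
    "The default port is 8080. [source: config.md] "
    "Pods get ordinals. [source: sts.md]"
)
print(extract_claims_with_citations(after_period))

# Tag placed before the terminating punctuation: the earlier sentence wins.
before_period = "Intro. Port is 8080 [source: a.md]."
print(extract_claims_with_citations(before_period))
```

The second call returns `[("Intro.", "a.md")]`, which suggests the rule depends on the agent placing citations after the sentence terminator. A leading tag such as `"[source: a.md] Hello."` yields an empty claim, which the judge treats as vacuously faithful.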

agent_bench/evaluation/judges/completeness.py ADDED

	@@ -0,0 +1,62 @@

+"""CompletenessJudge — three-point, reference-based on item.reference_answer."""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+from agent_bench.evaluation.judges.base import (
+    Judge,
+    ScoreResult,
+    _call_judge_with_retry,
+)
+from agent_bench.evaluation.judges.groundedness import _system_output_hash
+if TYPE_CHECKING:
+    from agent_bench.agents.orchestrator import AgentResponse
+    from agent_bench.evaluation.harness import GoldenQuestion
+# v1.1.1: restates the rubric's "paraphrase allowed" semantics late in
+# the prompt, where recency helps. Motivated by the 3A probe (3/5
+# disputed items shifted 1→2 on gpt-4o-mini), which indicated that
+# gpt-4o-mini's downward bias on 3-point completeness could be corrected
+# by prompt position rather than being intrinsic to the model. The clause
+# appears immediately before the score instruction so the conditioning
+# isn't lost across the rubric body and the reasoning step. See
+# DECISIONS "Plan 3A" entry.
+PARAPHRASE_RECENCY_CLAUSE = (
+    "Note: a paraphrase that captures the same meaning as a gold-answer "
+    "point counts as covered. Score on content equivalence, not surface form."
+)
+class CompletenessJudge(Judge):
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+        *,
+        prompt_seed: int = 0,
+    ) -> ScoreResult:
+        schema_clause = self._json_schema_clause('0 or 1 or 2 or "Unknown"')
+        prompt = (
+            f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
+            f"---\n\n"
+            f"## Reference answer (gold)\n{item.reference_answer}\n\n"
+            f"## Answer to score\n{output.answer}\n\n"
+            f"{PARAPHRASE_RECENCY_CLAUSE}\n\n"
+            f"Score this answer against the rubric above. Respond with ONLY a "
+            f"{schema_clause}"
+        )
+        return await _call_judge_with_retry(
+            provider=self.judge_provider,
+            prompt=prompt,
+            valid_scores={0, 1, 2},
+            judge_id=self.judge_id,
+            rubric_version=self.rubric.source_hash,
+            prompt_seed=prompt_seed,
+            system_output_hash=_system_output_hash(
+                item.id, output.answer, [s.source for s in output.sources]
+            ),
+            item_id=item.id,
+            abstain_allowed=self.effective_abstain_allowed,
+        )
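
Review note: the prompt layout used by `CompletenessJudge.score` above, together with the two shapes of the JSON-schema clause, can be reproduced without a provider or rubric object. In this sketch `rubric_text` is a placeholder for whatever `rubric.render_prompt(...)` returns, and both function names are hypothetical standalone versions of the methods in the diff.

```python
PARAPHRASE_RECENCY_CLAUSE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)


def json_schema_clause(valid_scores_str: str, use_cot: bool = True) -> str:
    """Standalone version of Judge._json_schema_clause from the diff."""
    if use_cot:
        return (
            f'JSON object: {{"reasoning": "...", '
            f'"evidence_quotes": [...], "score": {valid_scores_str}}}.'
        )
    return f'JSON object: {{"score": {valid_scores_str}}}.'


def build_completeness_prompt(
    rubric_text: str, reference: str, answer: str, use_cot: bool = True
) -> str:
    """Assemble the prompt in the same order as CompletenessJudge.score."""
    schema_clause = json_schema_clause('0 or 1 or 2 or "Unknown"', use_cot)
    return (
        f"{rubric_text}\n\n"
        f"---\n\n"
        f"## Reference answer (gold)\n{reference}\n\n"
        f"## Answer to score\n{answer}\n\n"
        f"{PARAPHRASE_RECENCY_CLAUSE}\n\n"
        f"Score this answer against the rubric above. Respond with ONLY a "
        f"{schema_clause}"
    )


prompt = build_completeness_prompt(
    "RUBRIC BODY (placeholder)", "gold answer", "agent answer", use_cot=False
)
print(prompt.splitlines()[-1])
```

With `use_cot=False` the final line asks for the score field only; with the default it asks for reasoning and evidence quotes first, so the score is generated after the reasoning.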

agent_bench/evaluation/judges/groundedness.py ADDED

	@@ -0,0 +1,57 @@

+"""GroundednessJudge — binary, reference-based on item.source_snippets."""
+from __future__ import annotations
+import hashlib
+from typing import TYPE_CHECKING
+from agent_bench.evaluation.judges.base import (
+    Judge,
+    ScoreResult,
+    _call_judge_with_retry,
+)
+if TYPE_CHECKING:
+    from agent_bench.agents.orchestrator import AgentResponse
+    from agent_bench.evaluation.harness import GoldenQuestion
+def _system_output_hash(item_id: str, answer: str, sources: list[str]) -> str:
+    sorted_sources = sorted(sources)
+    canonical = f"{item_id}\x00{answer}\x00{','.join(sorted_sources)}"
+    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
+class GroundednessJudge(Judge):
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+        *,
+        prompt_seed: int = 0,
+    ) -> ScoreResult:
+        snippets_block = "\n".join(
+            f"[{i + 1}] {s}" for i, s in enumerate(item.source_snippets)
+        )
+        schema_clause = self._json_schema_clause('0 or 1 or "Unknown"')
+        prompt = (
+            f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
+            f"---\n\n"
+            f"## Gold source snippets\n{snippets_block}\n\n"
+            f"## Answer to score\n{output.answer}\n\n"
+            f"Score this answer against the rubric above. Respond with ONLY a "
+            f"{schema_clause}"
+        )
+        return await _call_judge_with_retry(
+            provider=self.judge_provider,
+            prompt=prompt,
+            valid_scores={0, 1},
+            judge_id=self.judge_id,
+            rubric_version=self.rubric.source_hash,
+            prompt_seed=prompt_seed,
+            system_output_hash=_system_output_hash(
+                item.id, output.answer, [s.source for s in output.sources]
+            ),
+            item_id=item.id,
+            abstain_allowed=self.effective_abstain_allowed,
+        )
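
Review note: `_system_output_hash` above sorts the source names before hashing, so the hash identifies an (item, answer, source set) triple regardless of retrieval order. A small check of that property, using a copy of the function and made-up inputs:

```python
import hashlib


def system_output_hash(item_id: str, answer: str, sources: list[str]) -> str:
    """Copy of _system_output_hash from the diff, for illustration."""
    sorted_sources = sorted(sources)
    canonical = f"{item_id}\x00{answer}\x00{','.join(sorted_sources)}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = system_output_hash("q1", "Port is 8080.", ["b.md", "a.md"])
b = system_output_hash("q1", "Port is 8080.", ["a.md", "b.md"])
c = system_output_hash("q1", "Port is 8081.", ["a.md", "b.md"])

print(a == b)  # → True  (source order does not matter)
print(a == c)  # → False (a different answer changes the hash)
print(len(a))  # → 64    (hex-encoded SHA-256)
```

The NUL separators keep the three fields from running into each other; source names are joined with commas, so two source lists that differ only in where a comma falls inside a name would hash the same.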

agent_bench/evaluation/judges/relevance.py ADDED

	@@ -0,0 +1,48 @@

+"""RelevanceJudge — three-point, reference-free."""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+from agent_bench.evaluation.judges.base import (
+    Judge,
+    ScoreResult,
+    _call_judge_with_retry,
+)
+from agent_bench.evaluation.judges.groundedness import _system_output_hash
+if TYPE_CHECKING:
+    from agent_bench.agents.orchestrator import AgentResponse
+    from agent_bench.evaluation.harness import GoldenQuestion
+class RelevanceJudge(Judge):
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+        *,
+        prompt_seed: int = 0,
+    ) -> ScoreResult:
+        schema_clause = self._json_schema_clause('0 or 1 or 2 or "Unknown"')
+        prompt = (
+            f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
+            f"---\n\n"
+            f"## Question\n{item.question}\n\n"
+            f"## Answer to score\n{output.answer}\n\n"
+            f"Score this answer against the rubric above. Respond with ONLY a "
+            f"{schema_clause}"
+        )
+        return await _call_judge_with_retry(
+            provider=self.judge_provider,
+            prompt=prompt,
+            valid_scores={0, 1, 2},
+            judge_id=self.judge_id,
+            rubric_version=self.rubric.source_hash,
+            prompt_seed=prompt_seed,
+            system_output_hash=_system_output_hash(
+                item.id, output.answer, [s.source for s in output.sources]
+            ),
+            item_id=item.id,
+            abstain_allowed=self.effective_abstain_allowed,
+        )

agent_bench/evaluation/metrics.py CHANGED

@@ -1,15 +1,19 @@
-"""Deterministic and LLM-judge evaluation metrics."""
 from __future__ import annotations
-import json
 import re
 import structlog
 from agent_bench.agents.orchestrator import AgentResponse
-from agent_bench.core.provider import LLMProvider
-from agent_bench.core.types import Message, Role
 logger = structlog.get_logger()
@@ -125,84 +129,4 @@ def calculator_used_when_expected(
     return "calculator" in response.tools_used
-# --- LLM-judge metrics (costs money, manual) ---
-_FAITHFULNESS_PROMPT = """\
-You are evaluating whether an AI assistant's answer \
-is fully supported by the provided source passages.
-Source passages:
-{chunks}
-Answer to evaluate:
-{answer}
-Score the answer's faithfulness to the sources from 0.0 to 1.0:
-- 1.0: Every claim is directly supported by the sources
-- 0.5: Some claims are supported, others are extrapolated
-- 0.0: The answer contradicts or is entirely unsupported
-Respond with ONLY a JSON object:
-{{"score": 0.8, "reasoning": "brief explanation"}}"""
-_CORRECTNESS_PROMPT = """\
-You are evaluating whether an AI assistant's answer \
-is factually correct compared to a reference answer.
-Reference answer:
-{reference}
-Answer to evaluate:
-{answer}
-Score correctness from 0.0 to 1.0:
-- 1.0: All key facts match the reference
-- 0.5: Some facts are correct, some are missing or wrong
-- 0.0: The answer is factually incorrect
-Respond with ONLY a JSON object:
-{{"score": 0.8, "reasoning": "brief explanation"}}"""
-async def answer_faithfulness(
-    answer: str,
-    source_chunks: list[str],
-    judge_provider: LLMProvider,
-) -> float | None:
-    """LLM-judged: is the answer supported by the sources? 0.0-1.0."""
-    chunks_text = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(source_chunks))
-    prompt = _FAITHFULNESS_PROMPT.format(chunks=chunks_text, answer=answer)
-    return await _judge_call(prompt, judge_provider)
-async def answer_correctness(
-    answer: str,
-    reference_answer: str,
-    judge_provider: LLMProvider,
-) -> float | None:
-    """LLM-judged: is the answer factually correct vs reference? 0.0-1.0."""
-    prompt = _CORRECTNESS_PROMPT.format(reference=reference_answer, answer=answer)
-    return await _judge_call(prompt, judge_provider)
-async def _judge_call(prompt: str, provider: LLMProvider) -> float | None:
-    """Make a judge call and parse the JSON response."""
-    try:
-        response = await provider.complete(
-            [Message(role=Role.USER, content=prompt)],
-            temperature=0.0,
-            max_tokens=256,
-        )
-        data = json.loads(response.content)
-        score = float(data["score"])
-        reasoning = data.get("reasoning", "")
-        logger.info("llm_judge_result", score=score, reasoning=reasoning)
-        return max(0.0, min(1.0, score))
-    except (json.JSONDecodeError, KeyError, ValueError, TypeError) as e:
-        logger.warning("llm_judge_parse_error", error=str(e), raw=response.content[:200])
-        return None
-    except Exception as e:
-        logger.error("llm_judge_call_error", error=str(e))
-        return None

+"""Deterministic evaluation metrics.
+The continuous-scale LLM-judge functions (answer_faithfulness,
+answer_correctness, _judge_call) were removed in the judge-layer v1
+supersession. The replacement lives at agent_bench/evaluation/judges/
+as discrete-anchored, per-dimension judges with κ-validated calibration.
+See docs/plans/2026-05-04-judge-layer-v1-design.md for the rationale.
+"""
 from __future__ import annotations
 import re
 import structlog
 from agent_bench.agents.orchestrator import AgentResponse
 logger = structlog.get_logger()
     return "calculator" in response.tools_used
+# LLM-judge metrics moved to agent_bench/evaluation/judges/ in judge-layer v1.

agent_bench/evaluation/report.py CHANGED

@@ -52,9 +52,18 @@ def generate_report(
     total_cost = sum(r.tokens_used.estimated_cost_usd for r in results)
     avg_cost = total_cost / max(len(results), 1)
-    # Optional faithfulness
-    faith_scores = [r.faithfulness for r in positive if r.faithfulness is not None]
-    avg_faith = _safe_avg(faith_scores) if faith_scores else None
     lines.append("| Metric | Value |")
     lines.append("|--------|-------|")
@@ -65,8 +74,8 @@ def generate_report(
     lines.append(f"| Citation Accuracy | {avg_citation:.2f} |")
     lines.append(f"| Grounded Refusal Rate | {refusal_rate}/{len(negative)} |")
     lines.append(f"| Calculator Accuracy | {calc_correct}/{len(calc_qs)} |")
-    if avg_faith is not None:
-        lines.append(f"| Answer Faithfulness (LLM) | {avg_faith:.2f} |")
     lines.append(f"| Latency p50 | {p50:,.0f} ms |")
     lines.append(f"| Latency p95 | {p95:,.0f} ms |")
     lines.append(f"| Cost per query | ${avg_cost:.4f} |")

     total_cost = sum(r.tokens_used.estimated_cost_usd for r in results)
     avg_cost = total_cost / max(len(results), 1)
+    # Optional groundedness (replaces continuous faithfulness in v1).
+    # Discrete-anchored binary 0/1; abstain ('Unknown' score) is excluded
+    # from the average. The float() cast narrows ScoreResult.score from
+    # `int | Literal["Unknown"]` to float for _safe_avg — abstained=False
+    # already guarantees the value is int but mypy doesn't propagate that.
+    grounded_scores: list[float] = [
+        float(r.judge_scores["groundedness"].score)  # type: ignore[arg-type]
+        for r in positive
+        if "groundedness" in r.judge_scores
+        and not r.judge_scores["groundedness"].abstained
+    ]
+    avg_grounded = _safe_avg(grounded_scores) if grounded_scores else None
     lines.append("| Metric | Value |")
     lines.append("|--------|-------|")
     lines.append(f"| Citation Accuracy | {avg_citation:.2f} |")
     lines.append(f"| Grounded Refusal Rate | {refusal_rate}/{len(negative)} |")
     lines.append(f"| Calculator Accuracy | {calc_correct}/{len(calc_qs)} |")
+    if avg_grounded is not None:
+        lines.append(f"| Answer Groundedness (LLM judge) | {avg_grounded:.2f} |")
     lines.append(f"| Latency p50 | {p50:,.0f} ms |")
     lines.append(f"| Latency p95 | {p95:,.0f} ms |")
     lines.append(f"| Cost per query | ${avg_cost:.4f} |")
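
Review note: the groundedness average added to `generate_report` above skips results that have no groundedness verdict and results whose judge abstained. The sketch below illustrates that filter under two assumptions that the diff does not show: `abstained` is true exactly when the score is `"Unknown"`, and `_safe_avg` is a plain arithmetic mean. `FakeScore` and `average_groundedness` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Union


@dataclass
class FakeScore:
    """Hypothetical stand-in for ScoreResult with only the fields used here."""

    score: Union[int, Literal["Unknown"]]

    @property
    def abstained(self) -> bool:
        # Assumption: an abstain is represented by the "Unknown" score.
        return self.score == "Unknown"


def average_groundedness(rows: list[dict[str, FakeScore]]) -> float | None:
    """Mean of binary groundedness scores, ignoring missing and abstained."""
    scores = [
        float(js["groundedness"].score)
        for js in rows
        if "groundedness" in js and not js["groundedness"].abstained
    ]
    return sum(scores) / len(scores) if scores else None


rows = [
    {"groundedness": FakeScore(1)},
    {"groundedness": FakeScore(0)},
    {"groundedness": FakeScore("Unknown")},  # abstain: excluded
    {},                                      # judge not run: excluded
    {"groundedness": FakeScore(1)},
]
avg = average_groundedness(rows)
print(f"{avg:.2f}")  # → 0.67
```

Because abstains leave the denominator, a run with many abstains can report a high average over very few items; the report row does not show how many items contributed.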

agent_bench/evaluation/rubrics/citation_faithfulness.md ADDED

	@@ -0,0 +1,57 @@

+---
+dimension: citation_faithfulness
+scale: binary
+reference_based: true
+abstain_allowed: true
+---
+# Citation faithfulness (binary, all-or-nothing aggregation per item)
+For each [source: X.md] citation in the answer, is the cited chunk's
+content actually relevant to the claim it supports? This is stricter
+than the deterministic citation_accuracy metric, which only checks
+that the cited chunk_id appears in the retrieved set — citation
+faithfulness checks the **relevance** of the chunk to the claim.
+**Aggregation rule (item-level):** any unfaithful citation in the
+answer → item score = 0. A single bad citation in a multi-citation
+answer is a real failure that all-or-nothing surfaces; treating it as
+partial would obscure the failure mode.
+## Score 0
+The cited chunk's content does not support the adjacent claim.
+### Example A — citation drift
+Claim: "The default port is 8080."
+Cited chunk content: "The dashboard supports OAuth and SAML authentication."
+Score=0 because the chunk talks about authentication, not the port.
+The citation is misleading even though the claim happens to be true.
+### Example B — wrong topic citation
+Claim: "StatefulSet pods get ordinal indices."
+Cited chunk content: "Deployments support rolling updates with maxSurge and maxUnavailable parameters."
+Score=0 — the cited chunk is about Deployments, not StatefulSets.
+The citation does not support the claim about StatefulSet ordinals.
+## Score 1
+The cited chunk's content directly supports the adjacent claim.
+### Example C — single accurate citation
+Claim: "The default port is 8080."
+Cited chunk content: "The dashboard listens on port 8080 by default."
+Score=1.
+### Example D — paraphrase-supported citation
+Claim: "Each pod has a stable hostname."
+Cited chunk content: "StatefulSet pods receive hostnames derived from the StatefulSet name plus their ordinal, and these hostnames persist across reschedules."
+Score=1 — the chunk supports the claim via paraphrase.
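
A minimal sketch of the item-level aggregation rule in the rubric above, assuming per-citation verdicts are already available as 0/1 integers. The function name `aggregate_item` is hypothetical and does not come from the repository. Returning `"Unknown"` for an answer with no citations is also an assumption: the rubric allows abstention but does not spell out that case.

```python
from typing import Literal, Union

Score = Union[int, Literal["Unknown"]]


def aggregate_item(citation_verdicts: list[int]) -> Score:
    """All-or-nothing: a single unfaithful citation (0) zeroes the item."""
    if not citation_verdicts:
        # Assumption (not in the rubric): with no citations there is
        # nothing to judge, so abstain instead of letting all([]) score 1.
        return "Unknown"
    return 1 if all(v == 1 for v in citation_verdicts) else 0
```

The explicit empty-list branch matters because `all([])` is `True` in Python, which would otherwise score a citation-free answer as fully faithful.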

agent_bench/evaluation/rubrics/completeness.md ADDED

	@@ -0,0 +1,71 @@

+---
+dimension: completeness
+scale: three_point
+reference_based: true
+abstain_allowed: true
+---
+# Completeness (three-point)
+Score how much of the gold reference answer is covered by the agent's
+answer. This is reference-based — the judge sees the gold reference
+and the agent's answer; score on **coverage of facts** in the
+reference, not on additional facts the agent may have included.
+The judge does not penalize the agent for adding correct extra detail
+(that's a separate concern). Score only on what fraction of the
+reference's points are present.
+## Score 0
+None of the reference's key points are present in the answer.
+### Example A — answer addresses different facts
+Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage."
+Answer: "Kubernetes uses YAML manifests to declare resources."
+Score=0 — none of the three reference points (ordinal, hostname, storage) appear.
+### Example B — refusal that covers nothing
+Reference: "The default port is 8080."
+Answer: "I cannot find that information."
+Score=0 — the reference's single point (port=8080) is not in the answer.
+## Score 1
+Some but not all of the reference's points are present.
+### Example C — partial coverage
+Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage."
+Answer: "StatefulSet pods get ordinal indices."
+Score=1 — one of three points covered.
+### Example D — half a comparison
+Reference: "Deployments manage stateless replicas; StatefulSets manage stateful pods with stable identities."
+Answer: "Deployments manage stateless replicas with rolling updates."
+Score=1 — Deployment side covered, StatefulSet side missing.
+## Score 2
+All of the reference's key points are present (paraphrase allowed).
+### Example E — full coverage with paraphrase
+Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage."
+Answer: "Each pod gets an ordinal number, a stable DNS name, and storage that survives restarts."
+Score=2 — all three points covered with paraphrase.
+### Example F — full coverage of single-fact reference
+Reference: "The default port is 8080."
+Answer: "Port 8080."
+Score=2 — the only reference point is covered.
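
A minimal sketch of the three-point mapping described in the rubric above, assuming the reference has already been split into key points and the covered ones counted. The function name `completeness_level` and its arguments are hypothetical. In the rubric, the LLM judge does the counting implicitly.

```python
def completeness_level(covered: int, total: int) -> int:
    """Map coverage of reference points to the 0/1/2 completeness scale."""
    if total <= 0:
        raise ValueError("reference must contain at least one key point")
    if not 0 <= covered <= total:
        raise ValueError("covered must be between 0 and total")
    if covered == 0:
        return 0  # none of the reference points present
    if covered == total:
        return 2  # every reference point present (paraphrase allowed)
    return 1  # some but not all points present
```

Under this mapping a single-fact reference can only score 0 or 2, which matches Examples B and F.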

agent_bench/evaluation/rubrics/groundedness.md ADDED

	@@ -0,0 +1,142 @@

+---
+dimension: groundedness
+scale: binary
+reference_based: true
+abstain_allowed: true
+---
+# Groundedness (binary)
+Score whether **every claim** in the agent's answer is entailed by the gold
+source snippets attached to this item. Paraphrase is allowed; what matters
+is content equivalence, not surface form.
+## Reference scope (strict, clarified in v1.1)
+Reference scope is the **gold snippets only**, not the broader corpus, not
+the retrieved chunks, not the LLM's general domain knowledge. A claim that
+is factually correct in the world but not entailed by the snippets **must
+score 0**. The "must" forecloses the "well, mostly grounded" reading: a
+single ungrounded claim in an otherwise solid answer fails the binary
+rubric.
+The strict-entailment posture is a methodological choice. A claim that is
+correct because the model happened to know it isn't grounded — it's lucky.
+Strict-snippet groundedness measures *retrieval-grounded behavior*, not
+LLM general knowledge passing through a RAG harness.
+## Trivial inference is entailment
+Some surface-form variations of a snippet's content are entailment, not
+new claims. The test is **content equivalence**, not surface form:
+- **Paraphrase.** "X causes Y" ↔ "Y is caused by X".
+- **Unit conversion.** "600 seconds" ↔ "10 minutes".
+- **Syntactic variation.** Pluralization, tense, voice, declarative ↔ imperative.
+- **Canonical name of the snippet's concept.** When the snippet describes
+  a field, header, or API element by configuration syntax (e.g., a
+  `max_age` table row), the canonical name (`Access-Control-Max-Age` HTTP
+  header) is the same content in different surface form. This is a
+  separate carve-out from pure paraphrase: it admits domain knowledge
+  tightly bound to the snippet's referent.
+> **v1.2 debt.** The trivial-inference clause, especially the
+> canonical-name carve-out, is the concession in this strict rubric most
+> likely to require revision in v1.2. If labelers find themselves
+> applying it broadly to rescue answers from a score of 0, the clause is
+> too permissive and should be tightened.
+**When to abstain (`"Unknown"`)**: if the answer is a refusal ("I don't
+know" / "not in the documentation") and there is nothing to ground,
+abstain rather than scoring 1.
+## Score 0
+At least one claim in the answer is not entailed by any snippet, after
+applying the trivial-inference clause.
+### Example A — calibration anchor `k8s_006` (dramatic over-extension)
+Question: "What is a ConfigMap in Kubernetes and what kind of data should you store in it?"
+Snippet: "A ConfigMap is an API object used to store non-confidential data in key-value pairs."
+Answer (excerpted): The agent gives a comprehensive multi-section answer
+covering (i) the definition, (ii) three consumption methods (env vars,
+command-line args, volumes), (iii) a warning not to store
+passwords/tokens/certificates, (iv) a recommendation to use Secrets
+instead, and (v) details about `data` and `binaryData` fields.
+Thinking trace: Score = 0. Only the definition (i) is entailed by the
+snippet. Claims (ii)–(v) are factually correct against the underlying
+`k8s_configmap.md` doc, but **none are entailed by the one-sentence
+snippet**. The snippet does not describe consumption methods, security
+guidance, or schema fields. The strict-conjunction rule applies: even
+though most of the answer is well-supported by the broader corpus, the
+gold-snippet scope is what the rubric measures, and the answer goes
+dramatically beyond it.
+### Example B — calibration anchor `q006` (subtle embellishment)
+Question: "How does dependency caching work in FastAPI, and how can you disable it?"
+Snippet: "By default, if the same dependency is used multiple times within
+a single request (e.g., both a route and a sub-dependency use
+`Depends(get_db)`), FastAPI caches the result and calls the dependency
+only once. To disable caching and force a fresh call each time, use
+`Depends(get_db, use_cache=False)`."
+Answer (excerpted): Most of the answer is entailed by the snippet — the
+default-caching behavior, the once-per-request semantics, the
+`use_cache=False` disable mechanism, and an example. But the answer also
+states that caching is "particularly useful for expensive operations like
+database connections."
+Thinking trace: Score = 0. The "particularly useful for expensive
+operations like database connections" claim is **not entailed** by the
+snippet. The snippet describes *what* caching does (a mechanism); the
+claim adds *why* it is useful (a use-case argument the snippet does not
+make). The example uses `Depends(get_db)` as an identifier, not as a
+normative argument that caching is useful for database connections. This
+is the subtle case: most of the answer is good, but a single embellishment
+that imports general LLM knowledge ("expensive operations like database
+connections") fails strict snippet entailment. This is the boundary case
+the rubric exists to catch — easier to mis-label than `k8s_006` because
+the failure is one claim in the middle of a mostly-grounded answer.
+## Score 1
+Every claim in the answer is entailed by at least one snippet, including
+trivial inferences (paraphrase, unit conversion, syntactic variation,
+canonical-name substitution).
+### Example C — calibration anchor `q021` (trivial inference + canonical name)
+Question: "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?"
+Snippet (table row): `| max_age | 600 | Seconds the browser caches preflight results |`
+Answer: "If the CORS `max_age` is 600 seconds, the browser caches
+preflight results for 10 minutes. The `max_age` directive in the
+`Access-Control-Max-Age` header specifies the maximum time (in seconds)
+that the results of a preflight request can be cached. So 600 seconds
+÷ 60 seconds/minute = 10 minutes."
+Thinking trace: Score = 1. Every claim is entailed under the
+trivial-inference clause:
+- "max_age=600 → 10 minutes" — unit conversion.
+- "Access-Control-Max-Age header" — canonical-name carve-out: the snippet
+  describes `max_age` by its configuration syntax; the answer uses the
+  canonical HTTP header name for the same field.
+- "specifies the maximum time (in seconds)" — paraphrase of "Seconds the
+  browser caches preflight results".
+- "600 ÷ 60 = 10 minutes" — arithmetic, the same trivial-inference class
+  as unit conversion.
+The canonical-name carve-out is doing the heaviest lifting in this
+example. Without it, "Access-Control-Max-Age" would be ungrounded
+(domain knowledge not in the snippet text). With it, the answer is a
+clean strict-snippet pass. This is exactly the risk the v1.2-debt note
+above flags: if many future labels are rescued to score 1 via
+canonical-name appeals, the clause is over-rescuing and should be tightened.

agent_bench/evaluation/rubrics/relevance.md ADDED

	@@ -0,0 +1,74 @@

+---
+dimension: relevance
+scale: three_point
+reference_based: false
+abstain_allowed: true
+---
+# Relevance (three-point)
+Does the agent's answer address the user's question? This is reference-free
+— the judge sees only the question and the answer, not gold snippets or a
+reference answer. Score the topic-match, not the truth-value.
+## Score 0
+Off-topic. The answer addresses a different question, is unintelligible,
+or is a refusal that does not engage with the question's premise.
+### Example A — wrong topic
+Question: "How do I deploy to Kubernetes?"
+Answer: "Python virtual environments isolate dependencies between projects."
+Score=0 — the answer is about Python venvs, not Kubernetes deployment.
+### Example B — refusal that ignores the question
+Question: "What's the default replica count for a StatefulSet?"
+Answer: "I cannot help with that request."
+Score=0 — the refusal does not engage with the StatefulSet topic. A
+proper grounded refusal ("the documentation does not specify a default
+replica count for StatefulSets") would score higher.
+## Score 1
+Partially relevant. The answer touches the question's topic but misses
+the core ask, or addresses a related-but-different question.
+### Example C — adjacent but off-target
+Question: "How do I deploy a StatefulSet?"
+Answer: "Kubernetes runs containerized workloads on a cluster of nodes."
+Score=1 because it's about Kubernetes but doesn't address StatefulSet
+deployment specifically.
+### Example D — answers a sibling question
+Question: "What's the difference between Deployment and StatefulSet?"
+Answer: "A Deployment manages stateless replicas with rolling updates."
+Score=1 because it describes Deployment but doesn't compare it to
+StatefulSet — only half the question is addressed.
+## Score 2
+Directly addresses the question's core ask.
+### Example E — on-target single-fact answer
+Question: "What's the default port for kubelet?"
+Answer: "Port 10250."
+Score=2 because it directly answers the question.
+### Example F — on-target comparison
+Question: "What's the difference between Deployment and StatefulSet?"
+Answer: "Deployments manage stateless, interchangeable pods with rolling
+updates; StatefulSets manage stateful pods with stable identities,
+ordered rollouts, and persistent per-pod storage."
+Score=2 — both sides of the comparison are addressed.

agent_bench/evaluation/variance/__init__.py ADDED

	@@ -0,0 +1,9 @@

+"""Variance-control wrappers around Judge instances."""
+from agent_bench.evaluation.variance.jury import Jury, jury
+from agent_bench.evaluation.variance.rubric_permute import (
+    PermutedJudge,
+    rubric_permute,
+)
+__all__ = ["Jury", "PermutedJudge", "jury", "rubric_permute"]

agent_bench/evaluation/variance/jury.py ADDED

	@@ -0,0 +1,181 @@

+"""Jury — multi-judge aggregator with strict-quorum default and sidecar."""
+from __future__ import annotations
+import asyncio
+from pathlib import Path
+from typing import TYPE_CHECKING, Literal
+import structlog
+from agent_bench.evaluation.judges.base import Judge, ScoreResult
+from agent_bench.evaluation.variance.rubric_permute import _aggregate_scores
+if TYPE_CHECKING:
+    from agent_bench.agents.orchestrator import AgentResponse
+    from agent_bench.evaluation.harness import GoldenQuestion
+_DEFAULT_SIDECAR_TEMPLATE = "results/calibration_v1_judge_{aggregation}_members.jsonl"
+logger = structlog.get_logger()
+def _discretize_mean(mean: float, scale: str) -> int:
+    """Discretize a float mean to a discrete level per scale, ties → lower.
+    Mirrors `_aggregate_scores`'s policy without going through int(round()),
+    which would invoke Python's banker's rounding and silently violate the
+    tie-breaking contract.
+    """
+    if scale == "binary":
+        return 1 if mean > 0.5 else 0
+    floor = int(mean)
+    frac = mean - floor
+    return floor + 1 if frac > 0.5 else floor
+class Jury:
+    """Aggregates a list of Judge instances into one ScoreResult per item.
+    Strict quorum default (quorum = len(judges)): any member abstain →
+    aggregate abstain. The parameter exists in v1 so v1.1's 3-judge jury
+    can shift to quorum=2 (majority) without rearchitecting failure
+    semantics.
+    Per-member ScoreResults are always written to the sidecar (successes
+    and failure-as-abstains alike). A non-retryable provider exception in
+    any member propagates immediately; asyncio.gather does not cancel the
+    sibling tasks, which keep running unless cancelled explicitly.
+    """
+    def __init__(
+        self,
+        judges: list[Judge],
+        aggregation: Literal["mean", "kappa_weighted"],
+        weights: dict[str, float] | None = None,
+        quorum: int | None = None,
+        sidecar_path: Path | str | None = None,
+    ) -> None:
+        if not judges:
+            raise ValueError("jury requires at least one judge")
+        if aggregation == "kappa_weighted" and not weights:
+            raise ValueError(
+                "kappa_weighted aggregation requires explicit weights "
+                "(computed offline on calibration set; not at jury construction)"
+            )
+        self.judges = judges
+        self.aggregation = aggregation
+        self.weights = weights or {}
+        self.quorum = quorum if quorum is not None else len(judges)
+        self.sidecar_path = (
+            Path(sidecar_path)
+            if sidecar_path is not None
+            else Path(_DEFAULT_SIDECAR_TEMPLATE.format(aggregation=aggregation))
+        )
+        self.judge_id = f"jury_v1_{aggregation}"
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+    ) -> ScoreResult:
+        # return_exceptions=False → first exception propagates to the caller.
+        # gather does NOT cancel the sibling tasks; they keep running.
+        member_results: list[ScoreResult] = await asyncio.gather(
+            *[j.score(item, output) for j in self.judges],
+            return_exceptions=False,
+        )
+        # Sidecar (append; one line per member per call)
+        self.sidecar_path.parent.mkdir(parents=True, exist_ok=True)
+        with self.sidecar_path.open("a", encoding="utf-8") as f:
+            for r in member_results:
+                f.write(r.model_dump_json() + "\n")
+        successful = [r for r in member_results if not r.abstained]
+        sys_hash = member_results[0].system_output_hash
+        if len(successful) < self.quorum:
+            return ScoreResult(
+                reasoning=(
+                    f"jury_below_quorum: {len(successful)}/{len(self.judges)} "
+                    f"members succeeded; required {self.quorum}"
+                ),
+                evidence_quotes=[],
+                score="Unknown",
+                judge_id=self.judge_id,
+                rubric_version=member_results[0].rubric_version,
+                prompt_seed=0,
+                system_output_hash=sys_hash,
+                cost_usd=sum(r.cost_usd for r in member_results),
+                latency_ms=max(r.latency_ms for r in member_results),
+            )
+        # Aggregate over successful members
+        scores = [int(r.score) for r in successful]
+        scale = self.judges[0].rubric.scale
+        applied_weights: list[float] = []
+        if self.aggregation == "mean":
+            agg = _aggregate_scores(scores, scale)
+        else:  # kappa_weighted
+            # Weight successful members by judge_id. v1.1: missing weight is
+            # a hard error (was a silent fallback to 1.0 in v1, which let an
+            # asymmetric weights source amplify the unweighted member rather
+            # than suppressing it — see the v1.1 jury-rescue entry in
+            # DECISIONS.md for the calibration evidence).
+            missing = [r.judge_id for r in successful if r.judge_id not in self.weights]
+            if missing:
+                raise ValueError(
+                    f"jury kappa_weighted: weights dict missing entries for "
+                    f"member judge_ids {sorted(set(missing))}. Configured "
+                    f"weights cover {sorted(self.weights.keys())}. "
+                    f"v1.1 requires symmetric coverage — every jury member "
+                    f"must have an explicit weight in the source. The v1 "
+                    f"silent fallback to 1.0 was a documented contract "
+                    f"violation that masked the source's asymmetric coverage."
+                )
+            for r in successful:
+                applied_weights.append(self.weights[r.judge_id])
+            weighted_sum = sum(s * w for s, w in zip(scores, applied_weights))
+            weight_total = sum(applied_weights)
+            weighted_mean = (
+                weighted_sum / weight_total if weight_total > 0 else 0.0
+            )
+            # Discretize via the shared ties-to-lower policy (NOT int(round())
+            # which uses banker's rounding and would diverge from the `mean`
+            # path on half-integer aggregates).
+            agg = _discretize_mean(weighted_mean, scale)
+        # Reasoning string reports the per-member weights actually applied
+        # (not the constructor's dict). Since v1.1 a missing weight raises
+        # above, so the applied list aligns one-to-one with the successful
+        # members; the dict may also hold weights for members that abstained.
+        weights_str = applied_weights if self.aggregation == "kappa_weighted" else "n/a"
+        return ScoreResult(
+            reasoning=(
+                f"jury_{self.aggregation}: "
+                f"members={[r.score for r in successful]}, "
+                f"weights={weights_str}"
+            ),
+            evidence_quotes=[],
+            score=agg,
+            judge_id=self.judge_id,
+            rubric_version=member_results[0].rubric_version,
+            prompt_seed=0,
+            system_output_hash=sys_hash,
+            cost_usd=sum(r.cost_usd for r in member_results),
+            latency_ms=max(r.latency_ms for r in member_results),
+        )
+def jury(
+    judges: list[Judge],
+    aggregation: Literal["mean", "kappa_weighted"],
+    weights: dict[str, float] | None = None,
+    quorum: int | None = None,
+    sidecar_path: Path | str | None = None,
+) -> Jury:
+    return Jury(
+        judges=judges,
+        aggregation=aggregation,
+        weights=weights,
+        quorum=quorum,
+        sidecar_path=sidecar_path,
+    )
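
A small standalone sketch, separate from the repository code, of how `asyncio.gather` behaves with `return_exceptions=False`. The first exception propagates to the awaiter, but sibling awaitables keep running unless they are cancelled explicitly. The coroutine names (`fails_fast`, `slow_member`) and the `gather_then_cancel` helper are hypothetical and only illustrate the standard-library behavior.

```python
import asyncio

finished: list[str] = []


async def fails_fast() -> None:
    await asyncio.sleep(0.01)
    raise RuntimeError("provider error")


async def slow_member(name: str) -> None:
    await asyncio.sleep(0.05)
    finished.append(name)


async def plain_gather() -> None:
    """gather propagates the error but leaves the sibling running."""
    try:
        await asyncio.gather(fails_fast(), slow_member("plain"))
    except RuntimeError:
        pass
    await asyncio.sleep(0.1)  # the sibling still completes in the background


async def gather_then_cancel() -> None:
    """Cancel the siblings by hand when any member raises."""
    tasks = [
        asyncio.create_task(fails_fast()),
        asyncio.create_task(slow_member("cancelled")),
    ]
    try:
        await asyncio.gather(*tasks)
    except RuntimeError:
        for t in tasks:
            t.cancel()
        # Let the cancellations settle; exceptions come back as results.
        await asyncio.gather(*tasks, return_exceptions=True)
    await asyncio.sleep(0.1)


asyncio.run(plain_gather())
asyncio.run(gather_then_cancel())
```

After both runs, only the sibling from `plain_gather` has recorded itself in `finished`; the one in `gather_then_cancel` was cancelled before it could. If prompt cancellation of in-flight judge calls matters (for cost, say), an explicit cancel step along these lines is one option.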

agent_bench/evaluation/variance/rubric_permute.py ADDED

	@@ -0,0 +1,109 @@

+"""rubric_permute — runs the same judge with permuted rubric levels and aggregates."""
+from __future__ import annotations
+from pathlib import Path
+from typing import TYPE_CHECKING, Literal
+from agent_bench.evaluation.judges.base import Judge, ScoreResult
+if TYPE_CHECKING:
+    from agent_bench.agents.orchestrator import AgentResponse
+    from agent_bench.evaluation.harness import GoldenQuestion
+def _aggregate_scores(scores: list[int], scale: str) -> int:
+    """Discretize aggregated score per scale.
+    Binary: threshold 0.5 with ties → 0 (conservative).
+    Three-point: round to nearest with ties → lower level (conservative).
+    """
+    mean = sum(scores) / len(scores)
+    if scale == "binary":
+        return 1 if mean > 0.5 else 0
+    floor = int(mean)
+    frac = mean - floor
+    if frac > 0.5:
+        return floor + 1
+    return floor
+class PermutedJudge:
+    """Wraps a Judge; runs N permutations with different prompt_seeds.
+    Aggregation:
+    - Any abstain in any permutation → aggregate score = "Unknown".
+    - Otherwise, discretize the per-permutation scores per scale.
+    Per-permutation ScoreResults are written to the sidecar JSONL on
+    every score() call (one batch per call, append-mode JSONL across calls).
+    """
+    def __init__(
+        self,
+        judge: Judge,
+        n: int = 2,
+        seeds: list[int] | None = None,
+        sidecar_path: Path | str | None = None,
+    ) -> None:
+        self.judge = judge
+        self.n = n
+        self.seeds = seeds if seeds is not None else list(range(1, n + 1))
+        if len(self.seeds) != n:
+            raise ValueError(f"seeds length {len(self.seeds)} != n {n}")
+        self.sidecar_path = Path(sidecar_path) if sidecar_path else None
+        self.judge_id = f"{judge.judge_id}_perm{n}"
+    async def score(
+        self,
+        item: "GoldenQuestion",
+        output: "AgentResponse",
+    ) -> ScoreResult:
+        per_perm_results: list[ScoreResult] = []
+        for seed in self.seeds:
+            r = await self.judge.score(item, output, prompt_seed=seed)
+            per_perm_results.append(r)
+        if self.sidecar_path is not None:
+            self.sidecar_path.parent.mkdir(parents=True, exist_ok=True)
+            with self.sidecar_path.open("a", encoding="utf-8") as f:
+                for r in per_perm_results:
+                    f.write(r.model_dump_json() + "\n")
+        any_abstain = any(r.abstained for r in per_perm_results)
+        if any_abstain:
+            score: int | Literal["Unknown"] = "Unknown"
+            reasoning = (
+                f"any_abstain_propagated: "
+                f"{sum(1 for r in per_perm_results if r.abstained)}/{self.n} "
+                f"permutations abstained"
+            )
+        else:
+            score = _aggregate_scores(
+                [int(r.score) for r in per_perm_results],
+                self.judge.rubric.scale,
+            )
+            reasoning = (
+                f"perm_mean over {self.n} seeds: {[r.score for r in per_perm_results]}"
+            )
+        return ScoreResult(
+            reasoning=reasoning,
+            evidence_quotes=[],
+            score=score,
+            judge_id=self.judge_id,
+            rubric_version=self.judge.rubric.source_hash,
+            prompt_seed=0,
+            system_output_hash=per_perm_results[0].system_output_hash,
+            cost_usd=sum(r.cost_usd for r in per_perm_results),
+            latency_ms=sum(r.latency_ms for r in per_perm_results),
+        )
+def rubric_permute(
+    judge: Judge,
+    n: int = 2,
+    seeds: list[int] | None = None,
+    sidecar_path: Path | str | None = None,
+) -> PermutedJudge:
+    return PermutedJudge(judge=judge, n=n, seeds=seeds, sidecar_path=sidecar_path)
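
A standalone sketch of why the ties-to-lower policy in `_aggregate_scores` avoids the built-in `round()`. The helper `ties_to_lower` is a hypothetical restatement of the three-point branch, not an import from the repository, and it assumes non-negative scores as the rubric scales use.

```python
def ties_to_lower(mean: float) -> int:
    """Round to the nearest level; an exact half goes to the lower level."""
    floor = int(mean)  # scores are non-negative, so int() acts as floor
    return floor + 1 if (mean - floor) > 0.5 else floor


def mean_of(scores: list[int]) -> float:
    return sum(scores) / len(scores)


# Built-in round() uses round-half-to-even, so exact halves are not
# consistently rounded down.
builtin = [round(0.5), round(1.5)]

# The ties-to-lower policy always resolves an exact half downward.
conservative = [ties_to_lower(0.5), ties_to_lower(1.5)]

# Two permutations scoring 1 and 2 on a three-point scale: mean 1.5.
split_vote = ties_to_lower(mean_of([1, 2]))

# Three of four permutations scoring 2: mean 1.75 rounds up.
clear_majority = ties_to_lower(mean_of([2, 2, 2, 1]))
```

With the default `n=2`, a split vote is always an exact half, so the tie-breaking rule decides every disagreement between the two permutations.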

agent_bench/serving/static/index.html CHANGED

@@ -721,6 +721,141 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
   border: 1px solid var(--rule-2); background: var(--paper); color: var(--ink);
 }
 /* ── Responsive ────────────────────────────── */
 @media (max-width: 880px){
   .grid{grid-template-columns: 1fr}
@@ -742,6 +877,7 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
   <div class="wordmark">agent-bench</div>
   <nav>
     <a href="#demo">Demo</a>
     <a href="#findings">Findings</a>
     <a href="#log">Log</a>
     <a href="https://github.com/tyy0811/agent-bench" target="_blank" rel="noopener">GitHub ↗</a>
@@ -933,6 +1069,62 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
   </div>
 </section>
 <!-- Findings -->
 <section class="section" id="findings">
   <div class="section-head">
@@ -1028,6 +1220,49 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
   </div>
 </section>
 <!-- Footer -->
 <footer class="footer">
   <div class="who">agent-bench · MIT · built by Jane Yeung · Munich</div>

   border: 1px solid var(--rule-2); background: var(--paper); color: var(--ink);
 }
+/* ── Evaluation harness section ───────────── */
+.harness-intro{
+  display: grid; grid-template-columns: 1.2fr 1fr; gap: 32px;
+  align-items: end; margin-bottom: 28px;
+}
+.harness-intro p{color: var(--ink-2); max-width: 56ch}
+.harness-intro .sig{
+  font-family: var(--font-mono); font-size: 0.74rem; color: var(--ink-3);
+  display: flex; flex-direction: column; gap: 4px;
+}
+.harness-intro .sig b{color: var(--ink); font-weight: 600}
+.rubric-grid{
+  display: grid; grid-template-columns: repeat(4, 1fr);
+  gap: 1px; background: var(--rule);
+  border: 1px solid var(--rule);
+}
+.rubric{
+  background: var(--paper); padding: 22px 20px;
+  display: flex; flex-direction: column; gap: 10px;
+}
+.rubric .dim{
+  font-family: var(--font-mono); font-size: 0.7rem; font-weight: 600;
+  letter-spacing: 0.1em; text-transform: uppercase; color: var(--ink);
+}
+.rubric .scale{
+  font-family: var(--font-mono); font-size: 0.7rem; color: var(--ink-3);
+  display: flex; gap: 8px; flex-wrap: wrap;
+}
+.rubric .scale span{border: 1px solid var(--rule); padding: 1px 6px}
+.rubric .scale span.on{border-color: var(--ink); color: var(--ink)}
+.rubric .desc{
+  font-size: 0.86rem; color: var(--ink-2); line-height: 1.5;
+}
+.rubric .anchor{
+  font-family: var(--font-mono); font-size: 0.72rem;
+  border-left: 2px solid var(--rule-2); padding: 8px 10px;
+  background: var(--paper-2); color: var(--ink-2); line-height: 1.5;
+  margin-top: auto;
+}
+.rubric .anchor b{color: var(--ink); font-weight: 600; font-size: 0.7rem; letter-spacing: 0.06em}
+/* Compact one-row κ summary that lives above Findings (deep table is in appendix) */
+.kappa-summary{
+  margin-top: 22px; border: 1px solid var(--rule);
+  padding: 14px 18px;
+  display: flex; flex-direction: column; gap: 10px;
+  background: var(--paper-2);
+}
+.kappa-summary .ks-head{
+  font-family: var(--font-mono); font-size: 0.7rem; font-weight: 600;
+  letter-spacing: 0.1em; text-transform: uppercase; color: var(--ink-3);
+}
+.kappa-summary .ks-head .ks-sub{
+  letter-spacing: 0.04em; text-transform: none; color: var(--ink-3);
+  font-weight: 400; margin-left: 4px;
+}
+.kappa-summary .ks-row{
+  display: flex; flex-wrap: wrap; align-items: baseline; gap: 22px;
+  font-family: var(--font-mono); font-size: 0.85rem;
+  font-feature-settings: "tnum","zero";
+}
+.kappa-summary .ks-stat{display: flex; align-items: baseline; gap: 8px}
+.kappa-summary .ks-stat .k{color: var(--ink-3); font-size: 0.78rem}
+.kappa-summary .ks-stat .v{color: var(--ink); font-weight: 600}
+.kappa-summary .ks-stat .v.win{color: var(--ok)}
+.kappa-summary .ks-link{
+  margin-left: auto; font-size: 0.78rem; color: var(--ink-2);
+  border-bottom: 1px solid var(--rule-2);
+}
+.kappa-summary .ks-link:hover{color: var(--ink); border-color: var(--ink)}
+.kappa-wrap{
+  margin-top: 28px; border: 1px solid var(--rule);
+  display: grid; grid-template-columns: 1.4fr 1fr;
+}
+.kappa-table{
+  border-right: 1px solid var(--rule);
+  padding: 22px 24px;
+}
+.kappa-table h4{
+  font-family: var(--font-mono); font-size: 0.72rem; font-weight: 600;
+  letter-spacing: 0.12em; text-transform: uppercase; color: var(--ink-3);
+  margin-bottom: 14px;
+}
+.kappa-table table{width: 100%; border-collapse: collapse; font-family: var(--font-mono); font-size: 0.78rem}
+.kappa-table th, .kappa-table td{
+  text-align: left; padding: 7px 10px; border-bottom: 1px solid var(--rule);
+  font-feature-settings: "tnum","zero";
+}
+.kappa-table th{
+  font-weight: 600; color: var(--ink-3); font-size: 0.68rem;
+  letter-spacing: 0.08em; text-transform: uppercase;
+}
+.kappa-table td.num{text-align: right; color: var(--ink)}
+.kappa-table td.num.win{color: var(--ok); font-weight: 600}
+.kappa-table tr.config-row td{background: var(--paper)}
+.kappa-table tr:last-child td{border-bottom: none}
+.kappa-note{
+  font-family: var(--font-ui); font-size: 0.78rem; color: var(--ink-3);
+  margin-top: 10px; line-height: 1.5; max-width: 60ch;
+}
+.variance{
+  padding: 22px 24px;
+  display: flex; flex-direction: column; gap: 14px;
+  background: var(--paper-2);
+}
+.variance h4{
+  font-family: var(--font-mono); font-size: 0.72rem; font-weight: 600;
+  letter-spacing: 0.12em; text-transform: uppercase; color: var(--ink-3);
+}
+.variance .v-row{
+  display: flex; flex-direction: column; gap: 4px;
+  padding: 12px 14px; background: var(--paper); border: 1px solid var(--rule);
+}
+.variance .v-row .name{
+  font-family: var(--font-mono); font-size: 0.82rem; font-weight: 600; color: var(--ink);
+}
+.variance .v-row .name code{
+  font-family: var(--font-mono); font-size: 0.78rem; color: var(--accent-ink);
+  background: var(--accent-soft); padding: 1px 5px;
+}
+.variance .v-row .why{
+  font-size: 0.82rem; color: var(--ink-2); line-height: 1.5;
+}
+/* Harness responsive overrides — collapse rubric grid + κ split at narrower viewport */
+@media (max-width: 1000px){
+  .rubric-grid{grid-template-columns: repeat(2, 1fr)}
+  .kappa-wrap{grid-template-columns: 1fr}
+  .kappa-table{border-right: none; border-bottom: 1px solid var(--rule)}
+  .harness-intro{grid-template-columns: 1fr; gap: 16px}
+}
 /* ── Responsive ────────────────────────────── */
 @media (max-width: 880px){
   .grid{grid-template-columns: 1fr}
   <div class="wordmark">agent-bench</div>
   <nav>
     <a href="#demo">Demo</a>
+    <a href="#harness">Harness</a>
     <a href="#findings">Findings</a>
     <a href="#log">Log</a>
     <a href="https://github.com/tyy0811/agent-bench" target="_blank" rel="noopener">GitHub ↗</a>
   </div>
 </section>
+<!-- Evaluation harness (LLM-as-judge methodology) -->
+<section class="section" id="harness">
+  <div class="section-head">
+    <h2>How we grade it</h2>
+    <span class="sub">4 anchored rubrics · LLM-as-judge · κ-calibrated against human labels</span>
+  </div>
+  <div class="harness-intro">
+    <p class="deck">Benchmark numbers are only as good as the grader. Each answer is scored by an LLM judge against an anchored markdown rubric — strict scope, fixed scale, abstain-allowed — and the judges themselves are calibrated against human labels on a held-out set before they're trusted on the main run.</p>
+    <div class="sig">
+      <span><b>30</b> calibration items · human-labeled</span>
+      <span><b>v1.1</b> rubric · sha-pinned per result</span>
+      <span>headline metric: <b>Cohen's κ</b> · <b>Gwet's AC1</b> on prevalence-skewed dims</span>
+    </div>
+  </div>
+  <!-- Rubric cards -->
+  <div class="rubric-grid">
+    <div class="rubric">
+      <div class="dim">Groundedness</div>
+      <div class="scale"><span class="on">0</span><span class="on">1</span><span>abstain</span></div>
+      <div class="desc">Every claim must be entailed by gold snippets. A claim that's correct in the world but not in the snippets scores 0 — strict-snippet measures retrieval-grounded behavior, not LLM general knowledge passing through.</div>
+      <div class="anchor"><b>ANCHOR · q006</b><br>Answer adds "particularly useful for expensive operations like database connections" — not in snippet → 0.</div>
+    </div>
+    <div class="rubric">
+      <div class="dim">Relevance</div>
+      <div class="scale"><span class="on">0</span><span class="on">1</span><span class="on">2</span><span>abstain</span></div>
+      <div class="desc">Reference-free. Does the answer address the user's question? Score the topic-match, not the truth-value. A refusal that doesn't engage with the premise scores 0.</div>
+      <div class="anchor"><b>ANCHOR</b><br>Q: "How do I deploy to Kubernetes?"<br>A: "Python virtual environments isolate dependencies." → 0.</div>
+    </div>
+    <div class="rubric">
+      <div class="dim">Completeness</div>
+      <div class="scale"><span class="on">0</span><span class="on">1</span><span class="on">2</span><span>abstain</span></div>
+      <div class="desc">Reference-based against gold answer. Score coverage of the reference's key points only — extra correct detail isn't penalized here.</div>
+      <div class="anchor"><b>ANCHOR</b><br>Reference covers ordinal, hostname, storage. Answer covers ordinal, hostname only → 1.</div>
+    </div>
+    <div class="rubric">
+      <div class="dim">Citation faithfulness</div>
+      <div class="scale"><span class="on">0</span><span class="on">1</span><span>abstain</span></div>
+      <div class="desc">For every <code>[source: X.md]</code> in the answer, does the cited chunk actually support the claim next to it? <b>All-or-nothing</b> per item — one bad citation fails the whole answer.</div>
+      <div class="anchor"><b>ANCHOR</b><br>Claim: "default port is 8080." Cited chunk: about OAuth and SAML auth → 0 (citation drift).</div>
+    </div>
+  </div>
+  <!-- Compact κ summary → deep methodology lives in the appendix below the log -->
+  <div class="kappa-summary">
+    <div class="ks-head">Inter-rater agreement vs. human labels <span class="ks-sub">(calibration v1, baseline)</span></div>
+    <div class="ks-row">
+      <div class="ks-stat"><span class="k">groundedness</span><span class="v win">AC1 = 1.000</span></div>
+      <div class="ks-stat"><span class="k">relevance</span><span class="v win">AC1 = 0.964</span></div>
+      <div class="ks-stat"><span class="k">completeness</span><span class="v">κ = 0.416</span></div>
+      <a class="ks-link" href="#harness-appendix">Full table + variance hardening ↓</a>
+    </div>
+  </div>
+</section>
 <!-- Findings -->
 <section class="section" id="findings">
   <div class="section-head">
   </div>
 </section>
+<!-- Methodology appendix — deep dive that was demoted from the main flow -->
+<section class="section" id="harness-appendix">
+  <div class="section-head">
+    <h2>Methodology appendix</h2>
+    <span class="sub">κ ablations · variance hardening · abstain semantics</span>
+  </div>
+  <div class="kappa-wrap">
+    <div class="kappa-table">
+      <h4>κ ablation table · calibration v1</h4>
+      <table>
+        <thead>
+          <tr><th>Configuration</th><th>Groundedness<br><span style="font-weight:400">AC1</span></th><th>Relevance<br><span style="font-weight:400">AC1</span></th><th>Completeness<br><span style="font-weight:400">κ</span></th></tr>
+        </thead>
+        <tbody>
+          <tr><td>baseline (v1.1, anchors, CoT)</td><td class="num win">1.000</td><td class="num win">0.964</td><td class="num">0.416</td></tr>
+          <tr><td>baseline · no anchors</td><td class="num">0.953</td><td class="num">0.964</td><td class="num">0.623</td></tr>
+          <tr><td>baseline · no CoT</td><td class="num">0.897</td><td class="num">0.963</td><td class="num win">1.000</td></tr>
+          <tr><td>permute (n=2 seeds)</td><td class="num win">1.000</td><td class="num">0.966</td><td class="num">0.506</td></tr>
+          <tr><td>jury · κ-weighted (haiku + gpt-4o-mini)</td><td class="num win">1.000</td><td class="num win">1.000</td><td class="num">0.416</td></tr>
+        </tbody>
+      </table>
+      <p class="kappa-note"><b>Reading this:</b> groundedness and relevance gold are prevalence-skewed (29×<code>0</code> / 1×<code>1</code> and 29×<code>2</code> / 1×<code>1</code> respectively), which makes Cohen's κ degenerate to ≈0 even at 95%+ raw agreement. AC1 is the right metric there. Completeness gold is balanced enough (23×<code>2</code> / 5×<code>1</code>) for κ to behave normally. The <b>no-CoT κ=1.000</b> looks like a win but comes with a 7.7% abstain rate on completeness and an 11.5% abstain rate on groundedness, where AC1 drops to 0.897; the headline is the baseline row.</p>
+    </div>
+    <div class="variance">
+      <h4>Variance hardening</h4>
+      <div class="v-row">
+        <div class="name"><code>PermutedJudge</code> · level-order permutation</div>
+        <div class="why">Wrap a judge with n=2 prompt-seed permutations of the rubric's level order; aggregate by mean. Catches judges whose verdict flips when the "Score 0" anchor moves above "Score 2": a presentation-order artifact, not a content disagreement.</div>
+      </div>
+      <div class="v-row">
+        <div class="name"><code>Jury</code> · κ-weighted multi-judge aggregation</div>
+        <div class="why">Run the same item through claude-haiku-4-5 and gpt-4o-mini, weight each judge's vote by its calibration κ, abstain if any member abstains. Surfaces single-model bias without flattening to majority-rule, and keeps abstain as a first-class outcome.</div>
+      </div>
+      <div class="v-row">
+        <div class="name">Abstain semantics · <code>"Unknown"</code> sentinel</div>
+        <div class="why">Schema-parse failures retry once, then abstain with a typed prefix; rubric-allowed model abstains use the empty-string sentinel. The metric drops the item rather than pretending it scored 0, which shows up in the abstain-rate column of the full κ table.</div>
+      </div>
+    </div>
+  </div>
+</section>
 <!-- Footer -->
 <footer class="footer">
   <div class="who">agent-bench · MIT · built by Jane Yeung · Munich</div>

configs/calibration/rows/baseline.yaml ADDED Viewed

	@@ -0,0 +1,14 @@

+# Baseline: single Claude-Haiku judge per dimension, all variance controls on.
+# CoT is implicit (the rubric prompts ask for reasoning before score).
+# Anchors come from the rubric files. Abstain comes from rubric.abstain_allowed=true.
+label: baseline
+provider: anthropic
+model_id: claude-haiku-4-5-20251001
+dimensions: [groundedness, relevance, completeness]
+strategy: single
+options:
+  use_cot: true
+  use_anchors: true
+  abstain_allowed: true
+output_path: results/calibration_v1_judge_baseline.json

configs/calibration/rows/baseline_no_abstain.yaml ADDED Viewed

	@@ -0,0 +1,14 @@

+# Ablation: rubric.abstain_allowed forced false at scoring time. Measures
+# the contribution of the abstain option. Out-of-range schema violations
+# (model returns "Unknown" anyway) abstain via ABSTAIN_REASON_OUT_OF_RANGE.
+label: baseline_no_abstain
+provider: anthropic
+model_id: claude-haiku-4-5-20251001
+dimensions: [groundedness, relevance, completeness]
+strategy: single
+options:
+  use_cot: true
+  use_anchors: true
+  abstain_allowed: false
+output_path: results/calibration_v1_judge_baseline_no_abstain.json

configs/calibration/rows/baseline_no_anchors.yaml ADDED Viewed

	@@ -0,0 +1,13 @@

+# Ablation: rubric anchored examples stripped from the prompt; only the
+# level descriptions are sent. Measures the contribution of anchored examples.
+label: baseline_no_anchors
+provider: anthropic
+model_id: claude-haiku-4-5-20251001
+dimensions: [groundedness, relevance, completeness]
+strategy: single
+options:
+  use_cot: true
+  use_anchors: false
+  abstain_allowed: true
+output_path: results/calibration_v1_judge_baseline_no_anchors.json

configs/calibration/rows/baseline_no_cot.yaml ADDED Viewed

	@@ -0,0 +1,13 @@

+# Ablation: same as baseline but the judge prompt does NOT request reasoning
+# before the score. Used to measure the contribution of CoT-before-score.
+label: baseline_no_cot
+provider: anthropic
+model_id: claude-haiku-4-5-20251001
+dimensions: [groundedness, relevance, completeness]
+strategy: single
+options:
+  use_cot: false
+  use_anchors: true
+  abstain_allowed: true
+output_path: results/calibration_v1_judge_baseline_no_cot.json

configs/calibration/rows/jury_kappa_weighted.yaml ADDED Viewed

	@@ -0,0 +1,23 @@

+# 2-judge jury: Claude-Haiku + gpt-4o-mini, kappa-weighted aggregation.
+# Strict quorum default (any member abstain → jury abstain).
+#
+# v1.1: weights are computed by `_compute_kappa_weights` from the prior
+# jury-run sidecar (which has predictions from BOTH members), not the
+# baseline.json (Haiku-only). v1's pointer at baseline.json was the
+# asymmetric-coverage bug — see DECISIONS "v1.1 jury rescue" entry.
+# This is pragmatic-circular: weights are derived from the same
+# calibration set used for κ reporting; v1.2 will use a held-out set.
+label: jury_kappa_weighted
+strategy: jury
+aggregation: kappa_weighted
+quorum: null  # null = strict default (= len(judges) = 2)
+members:
+  - provider: anthropic
+    model_id: claude-haiku-4-5-20251001
+  - provider: openai
+    model_id: gpt-4o-mini-2024-07-18
+dimensions: [groundedness, relevance, completeness]
+weights_source: results/calibration_v1_judge_jury_kappa_weighted_members.jsonl
+output_path: results/calibration_v1_judge_jury_kappa_weighted.json
+sidecar_path: results/calibration_v1_judge_jury_kappa_weighted_members.jsonl

configs/calibration/rows/permute.yaml ADDED Viewed

	@@ -0,0 +1,14 @@

+# Rubric permutation: N=2 seeded prompt-level permutations per item,
+# mean-aggregated. Per-permutation results land in the sidecar JSONL.
+label: permute
+provider: anthropic
+model_id: claude-haiku-4-5-20251001
+dimensions: [groundedness, relevance, completeness]
+strategy: rubric_permute
+options:
+  n_permutations: 2
+  seeds: [1, 2]
+  abstain_allowed: true
+output_path: results/calibration_v1_judge_permute.json
+sidecar_path: results/calibration_v1_judge_permute_members.jsonl

docs/_generated/kappa_table.md ADDED Viewed

	@@ -0,0 +1,27 @@

+# κ ablation table — calibration v1
+Headline metric per dimension: **groundedness → AC1**, **relevance → AC1**, **completeness → κ**. AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold is prevalence-skewed enough to make Cohen's κ degenerate (groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); raw agreement on those rows is ≥0.95 for both dimensions, but Cohen's κ collapses to ≈0 because the chance-agreement term P_e approaches 1. Completeness uses Cohen's κ because its gold (23×`2`/5×`1`) is balanced enough for κ to behave normally.
+| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |
+|---|---|---|---|---|---|---|
+| baseline | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% |  |
+| baseline | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% |  |
+| baseline | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% |  |
+| baseline_no_abstain | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% |  |
+| baseline_no_abstain | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% |  |
+| baseline_no_abstain | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% |  |
+| baseline_no_anchors | completeness | κ | 0.623 (-0.054, 1.000) | 26 | 0.0% |  |
+| baseline_no_anchors | groundedness | AC1 | 0.953 (0.834, 1.000) | 24 | 7.7% |  |
+| baseline_no_anchors | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% |  |
+| baseline_no_cot | completeness | κ | 1.000 (1.000, 1.000) | 24 | 7.7% |  |
+| baseline_no_cot | groundedness | AC1 | 0.897 (0.707, 1.000) | 23 | 11.5% |  |
+| baseline_no_cot | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% |  |
+| jury_kappa_weighted | completeness | κ | 0.014 (-0.077, 0.112) | 26 | 0.0% |  |
+| jury_kappa_weighted | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% |  |
+| jury_kappa_weighted | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% |  |
+| jury_kappa_weighted_v1_1 | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% |  |
+| jury_kappa_weighted_v1_1 | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% |  |
+| jury_kappa_weighted_v1_1 | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% |  |
+| permute | completeness | κ | 0.506 (-0.061, 1.000) | 26 | 0.0% |  |
+| permute | groundedness | AC1 | 1.000 (1.000, 1.000) | 25 | 3.8% |  |
+| permute | relevance | AC1 | 0.966 (0.890, 1.000) | 30 | 0.0% |  |

docs/judge-design.md ADDED Viewed

	@@ -0,0 +1,687 @@

+# Judge Layer — calibration writeup (v1.1.1)
+## TL;DR
+The v1 deliverable is a per-dimension LLM-judge layer (groundedness,
+relevance, completeness) with anchored discrete rubrics, abstain
+support, rubric permutation as a variance control, and a 2-judge
+kappa-weighted jury. It supersedes the previous continuous-score
+single-call judges. v1 was validated against a 30-item hand-labeled
+calibration set spanning two corpora (FastAPI + Kubernetes); the
+calibration surfaced six findings organized below as a methodology
+arc rather than a flat ablation table. The interpretive headline:
+- The shared retrieval stack does the heavy lifting on retrieval
+  metrics (P@5, R@5, KHR vary < 0.12 across all four custom/LangChain
+  × OpenAI/Anthropic configurations); the judge layer's value is in
+  *measuring* the orchestrator's grounded-citation behavior, not in
+  driving it.
+- Calibration caught a published-rubric drift between human-grader
+  and rubric-as-written (22/30 disagreements at v1.0); rubric
+  clarification + re-labeling brought v1.1 inter-rater agreement to
+  29/30 on groundedness.
+- The 2-judge jury under v1's weighting pipeline fired both branches
+  of the design doc's tracked risk simultaneously: the weights-source
+  was a stub and the missing-weight fallback to 1.0 silently
+  amplified an unweighted member. v1.1 fixed both; the corrected
+  jury matches the calibrated single-judge baseline (κ 0.014 → 0.416
+  on completeness, no API spend).
+- A second-order finding the v1 design didn't anticipate: small
+  models on 3-point ordinal scales with paraphrase semantics exhibit
+  *at least two* distinct failure modes — one rubric-positional and
+  prompt-engineering-fixable, one capacity-limited and only
+  addressable by model selection. The 4A A/B against GPT-4o (full)
+  is the empirical separator.
+- A methodological observation that's the deepest finding of the
+  calibration: Cohen's κ as a jury weight has a self-defeating
+  property under intervention-induced marginal shifts. AC1 reads the
+  signal correctly. v1.2 fix-list addresses this.
+The closing position is *when not to use LLM-judge*: 3-point ordinal
+scoring with paraphrase semantics is at the boundary where mid-tier
+models (gpt-4o-mini class) exhibit capacity limits independent of
+prompt engineering, and the right architectural choice is
+per-dimension judge selection rather than further prompt iteration.
+---
+## 1. Methodology arc
+The findings below are ordered as the calibration produced them, not
+re-ordered for clarity. Each one has its own supporting evidence
+file; the κ table at `docs/_generated/kappa_table.md` is the
+quantitative summary; `DECISIONS.md` carries the per-decision
+rationale that informs but doesn't repeat the writeup.
+### 1.1 Rubric drift caught by frontier-model stress-test
+The v1.0 hand-labeled calibration set (30 items, single-rater) was run
+through a 90-cell Opus-4 stress-test
+(`measurements/2026-05-05-judge-rubric-opus-stress.jsonl`, $0.20)
+against the published rubrics. The test surfaced a 22/30 disagreement
+on groundedness, high enough to indicate one of three things: (a) the
+rubric was wrong, (b) the labels were wrong, or (c) Opus was wrong.
+Investigation localized the cause to a *scope mismatch* between the
+rubric and the human-grader's labeling procedure. The groundedness
+rubric scopes entailment to the *retrieval snippets* — a specific
+binary check: every claim in the agent's answer must be entailed by
+at least one retrieved snippet. The human grader had instead been
+checking against *corpus documents* (which the snippets are drawn
+from but which contain additional context). Under the
+corpus-supported reading, claims like "useful for expensive operations
+like database connections" pass; under the strict-snippets-only
+reading, they fail.
+The fix: the rubric was clarified with an explicit "must score 0"
+reference-scope sentence, a trivial-inference clause with a
+canonical-name carve-out (e.g., the snippet says "FastAPI's
+`HTTPException`" and the answer says "the `HTTPException` class" —
+that's still grounded), and three calibration anchors covering the
+boundary cases (`q006` subtle embellishment, `k8s_006` dramatic
+over-extension, `q021` trivial-inference positive).
+Twenty-two v1.0 labels were flipped against the strict rubric. v1.1
+inter-rater agreement on groundedness rose to **29/30**. The methodology
+note: *the rubric's reference scope was load-bearing for the dimension
+to measure retrieval-grounded behavior rather than LLM general
+knowledge*; relaxing it would have re-introduced the failure mode the
+supersession was designed to remove.
+**Why this matters for the writeup:** the strict-snippet groundedness
+rubric is the v1 deliverable's identity. The benchmark claim is *zero
+hallucinated citations on all API provider configurations*, and that
+claim is only meaningful under strict scope. Stress-testing the rubric
+against a frontier model before publication is the cheap intervention
+that catches the labeling-vs-rubric drift before the artifact ships.
+### 1.2 CoT-before-score asymmetry across dimensions (tangent — see appendix)
+The `baseline_no_cot` ablation row reached κ = 1.000 on completeness
+— counterintuitive given the conventional CoT-helps-judging story —
+but at n = 24 (vs n = 26 for `baseline`), and the no_cot row's
+groundedness AC1 falls from 1.000 to 0.897, so the finding is real
+but doesn't drive v1.1 design choices. The longer treatment, with the
+n = 24 caveat stated explicitly, is in **Appendix B — CoT-before-score
+by dimension**.
+### 1.3 v1 jury bug — two compounding weight-pipeline bugs
+The v1 design doc's risks subsection listed *"jury κ worse than the
+better individual judge — (a) kappa-weighting wrong, or (b) worse
+judge drags mean"* as a tracked risk. The v1.0 calibration fired both
+branches simultaneously.
+The κ table row `jury_kappa_weighted` reads κ = 0.014 on
+completeness, vs the single-judge `baseline` (Haiku) at κ = 0.416 —
+a 30× regression. Per-member analysis from
+`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`:
+| Member | n | raw% | κ | AC1 |
+|---|---|---|---|---|
+| Haiku 4.5 alone (gold ⋈ pred) | 26 | 84.6% | +0.416 | +0.792 |
+| gpt-4o-mini-2024-07-18 alone | 26 | 26.9% | +0.020 | +0.006 |
+| Jury aggregate (v1) | 26 | 26.9% | +0.014 | +0.016 |
+The jury aggregate matches gpt-4o-mini almost exactly. The mechanism
+is not "weighted voting in the usual sense" but *missing-weight +
+tie-break compounding*:
+- `scripts/run_calibration.py::_load_weights_from_baseline` was a
+  documented v1 stub returning `1.0` for every judge_id present in
+  `baseline.json`. `baseline.json` contains only Haiku predictions
+  (the baseline ablation is single-judge), so Haiku got `1.0` from
+  the stub.
+- gpt-4o-mini was not in the baseline file — its judge_id never
+  appears there. v1's `Jury.score` had a fallback policy of
+  `weights.get(judge_id, 1.0)` with a `logger.warning` for visibility.
+  gpt-4o-mini got `1.0` from this fallback.
+- Equal weights make a disputed (Haiku=2, gpt=1) cell aggregate as
+  `(2 × 1 + 1 × 1) / 2 = 1.5`. The discretization rule
+  (`_aggregate_scores`'s policy, mirrored in `_discretize_mean`) is
+  *ties to lower*: `frac > 0.5 → ceil else floor`, and `0.5 > 0.5` is
+  false, so 1.5 floors to 1. gpt-4o-mini's verdict wins every
+  disputed cell.
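The compounding can be reproduced in a few lines. This is a minimal sketch, assuming the aggregation is a weighted mean followed by the ties-to-lower rule quoted above; `discretize_mean` and `aggregate` are hypothetical names for illustration, not the repo's `_discretize_mean` or `_aggregate_scores`.

```python
import math


def discretize_mean(mean: float) -> int:
    """Ties-to-lower rounding: ceil only when the fractional part exceeds 0.5."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def aggregate(scores: dict[str, int], weights: dict[str, float]) -> int:
    """Weighted mean of member scores, discretized with ties to lower.

    Mirrors the v1 fallback described above: a judge that is missing
    from `weights` silently gets 1.0.
    """
    total = sum(weights.get(judge, 1.0) for judge in scores)
    mean = sum(s * weights.get(judge, 1.0) for judge, s in scores.items()) / total
    return discretize_mean(mean)


# Haiku carries the stub weight; gpt-4o-mini is absent and falls back to 1.0.
disputed = {"haiku": 2, "gpt-4o-mini": 1}
v1_weights = {"haiku": 1.0}
print(aggregate(disputed, v1_weights))  # 1
```

With equal weights every (2, 1) split lands on exactly 1.5, so the tie rule, not the weighting, decides the cell.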
+The deeper structural point: weighting alone cannot rescue a
+systematically miscalibrated member. Even held-out validation that
+correctly assigned gpt-4o-mini's true low weight on completeness
+would still let it dominate disputed ties unless its weight were
+driven near zero — and at that point exclusion is more honest than
+near-zero inclusion.
+**v1.1 fix.** Two coordinated code changes plus a config re-point
+(single bundled commit, see `ab0e054`):
+- `agent_bench/evaluation/variance/jury.py`: missing-weight fallback
+  to `1.0` → hard `ValueError`. v1.1 requires symmetric coverage in
+  the weights source.
+- `scripts/run_calibration.py::_load_weights_from_baseline` →
+  `_compute_kappa_weights`: replaces the stub with real per-judge
+  Cohen's κ on the dimension. Negative κ clipped to 0 (soft exclusion
+  via weight). Hard-errors when any expected member is missing from
+  the source.
+- Configuration: `weights_source` re-pointed from
+  `calibration_v1_judge_baseline.json` (Haiku-only, asymmetric) to
+  `calibration_v1_judge_jury_kappa_weighted_members.jsonl` (sidecar
+  from a prior jury run; both judges present). The source has
+  documented circularity — weights are computed from the same
+  calibration set used for κ reporting; v1.2 will use a held-out
+  validation set.
+**Re-aggregation (no API spend).** Re-running the existing 164
+sidecar rows with κ-derived weights (Haiku 0.416, gpt-4o-mini 0.020):
+| | n | raw% | κ |
+|---|---|---|---|
+| Jury (v1.0, broken)            | 26 | 26.9% | +0.014 |
+| Jury (v1.1, corrected weights) | 26 | 84.6% | **+0.416** |
+| Haiku-baseline (control)       | 26 | 84.6% | +0.416 |
+The corrected jury matches the Haiku-baseline κ exactly. The
+mechanism: with corrected weights, a disputed (Haiku=2, gpt=1) cell
+aggregates as `(2 × 0.416 + 1 × 0.020) / 0.436 = 1.954`, frac 0.954 >
+0.5, ceil to 2. Haiku's verdict wins. gpt-4o-mini's near-zero weight
+correctly suppresses its verdict.
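The corrected path can be sketched the same way. This is an illustration under stated assumptions, not the repo's `_compute_kappa_weights` or `Jury.score`: `kappa_weights` and `aggregate` are hypothetical names, and the all-zero-weights guard is an assumption added so the sketch cannot divide by zero.

```python
import math


def kappa_weights(kappas: dict[str, float], expected: list[str]) -> dict[str, float]:
    """Turn per-judge kappa into weights: clip negatives to 0, fail on gaps."""
    missing = [judge for judge in expected if judge not in kappas]
    if missing:
        raise ValueError(f"no calibration kappa for: {missing}")
    return {judge: max(0.0, kappas[judge]) for judge in expected}


def aggregate(scores: dict[str, int], weights: dict[str, float]) -> int:
    """Weighted mean, then ties-to-lower: ceil only if the fraction > 0.5."""
    total = sum(weights[judge] for judge in scores)
    if total == 0:
        # Assumption for this sketch only; the source does not say how
        # the repo handles a jury whose weights are all zero.
        raise ValueError("all member weights are zero")
    mean = sum(s * weights[judge] for judge, s in scores.items()) / total
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


members = ["haiku", "gpt-4o-mini"]
weights = kappa_weights({"haiku": 0.416, "gpt-4o-mini": 0.020}, members)
print(aggregate({"haiku": 2, "gpt-4o-mini": 1}, weights))  # 2
```

Indexing `weights[judge]` directly, rather than using `.get(judge, 1.0)`, is what turns asymmetric coverage into a loud failure instead of a silent default.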
+This is the **pre-committed Outcome 2** from the v1.1 jury-rescue
+plan: jury matches baseline within ±0.05 → "soft exclusion via
+weighting." The weighting suppresses the biased member to
+near-irrelevance; the jury isn't *worse* than baseline, but it isn't
+*doing meaningful work* either. The intervention is necessary but
+not sufficient — the jury's value-add over single-judge depends on
+the second judge being calibrated, which on completeness it isn't.
+### 1.4 v1.1.1 prompt-positional intervention — one of two failure modes
+The next investigation localized *why* gpt-4o-mini was so badly
+miscalibrated on completeness. Confusion-matrix analysis (1A in the
+investigation plan) on the existing sidecar showed:
+- **17 of 19 disagreements** are gold=2/pred=1 (one-step-down)
+- 1 is gold=2/pred=0, 1 is gold=1/pred=0
+- **0 disagreements** are pred > gold
+This is direction-aware structure, not balanced random labeling. The
+probability of producing 19 same-direction disagreements by chance
+under a balanced labeler is ~2⁻¹⁹. The bias is structural and
+reproducible; gpt-4o-mini *consistently applies* a stricter standard
+than the rubric specifies.
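The 2⁻¹⁹ figure is the one-sided sign-test probability that all 19 disagreements fall below gold when each is assumed equally likely to land above or below it. A quick check under that assumed null:

```python
from math import comb

n_disagreements = 19
n_below_gold = 19  # 17 + 1 + 1 from the confusion-matrix breakdown above

# One-sided binomial tail: P(X >= n_below_gold) with p = 0.5.
tail_count = sum(comb(n_disagreements, k) for k in range(n_below_gold, n_disagreements + 1))
p_one_sided = tail_count / 2**n_disagreements
print(f"{p_one_sided:.2e}")  # 1.91e-06
```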
+Reading the per-item reasoning surfaced an **extraction-vs-reasoning
+split**: gpt-4o-mini's `evidence_quotes` field correctly extracts the
+paraphrased coverage from the agent's answer, and then its `reasoning`
+field denies that those quotes constitute coverage. The cleanest
+example is `k8s_002` (Deployment vs StatefulSet) — gpt's
+`evidence_quotes` literally contain the strings `"declarative
+updates"` and `"sticky identity"`, while its `reasoning` says "the
+answer does not explicitly mention 'declarative updates' and 'sticky
+identity'." The score follows the reasoning, not the evidence. (Two
+more examples are in
+`measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`.)
+The *hypothesis* that follows: the model loses track of the rubric's
+"paraphrase allowed" instruction as it works through the rubric body,
+the gold reference, the system answer, and its own reasoning step.
+By the time it commits to a score, the literal-string-match standard
+has displaced the rubric's permissive one. The *intervention*,
+**recency-positioning** the paraphrase clause adjacent to the score
+instruction, tests this:
+```
+{rubric body}
+---
+## Reference answer (gold)
+{reference}
+## Answer to score
+{system_answer}
+Note: a paraphrase that captures the same meaning as a gold-answer
+point counts as covered. Score on content equivalence, not surface
+form.
+Score this answer against the rubric above. Respond with ONLY a {schema}.
+```
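The template above could be assembled along these lines. This is a sketch only: `build_completeness_prompt` and `PARAPHRASE_NOTE` are hypothetical names, and the placeholder arguments stand in for the real rubric, reference, and schema text.

```python
PARAPHRASE_NOTE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer\n"
    "point counts as covered. Score on content equivalence, not surface\n"
    "form."
)


def build_completeness_prompt(rubric_body: str, reference: str,
                              system_answer: str, schema: str) -> str:
    """Assemble the judge prompt with the paraphrase clause placed
    directly before the score instruction."""
    return "\n".join([
        rubric_body,
        "---",
        "## Reference answer (gold)",
        reference,
        "## Answer to score",
        system_answer,
        PARAPHRASE_NOTE,
        f"Score this answer against the rubric above. Respond with ONLY a {schema}.",
    ])


prompt = build_completeness_prompt(
    rubric_body="(rubric text)",
    reference="(gold answer)",
    system_answer="(agent answer)",
    schema="JSON object",  # placeholder; the real schema string is not shown here
)
print(prompt)
```

Keeping the note as a separate constant makes the A/B explicit: the v1.1 and v1.1.1 prompts differ only by whether that one element is in the list.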
+**3A 5-item probe** (`q006`, `q011`, `k8s_002`, `k8s_006`, `k8s_018`,
+$0.0013): 3/5 disputed items shifted 1 → 2, at the
+binomial-significance threshold per the pre-committed criteria. The protocol
+triggered the full-26 re-run on gpt-4o-mini only (Haiku held as
+control to make the v1.1 → v1.1.1 delta cleanly attributable).
+**Full-26 re-run** (`scripts/_dev/rerun_completeness_v1_1_1.py`,
+$0.0075):
+| | n | raw% | κ | AC1 |
+|---|---|---|---|---|
+| v1.1   gpt-4o-mini | 26 | 26.9% | +0.020 | +0.006 |
+| **v1.1.1 gpt-4o-mini** | 28 | **42.9%** | **+0.000** | **+0.232** |
+| v1.1   Haiku (control) | 26 | 84.6% | +0.416 | +0.792 |
+7 items shifted up (6 correct: gold=2/pred=1 → gold=2/pred=2 on
+`q006`, `k8s_002`, `k8s_013`, `k8s_015`, `k8s_016`, `k8s_017`; 1
+regression: `k8s_025` over-credited gold=1/pred=2). Net per-item
+correctness delta: +5 items.
+**Cohen's κ flat-lined** despite a 38× AC1 improvement and +16pp raw
+agreement. This is the κ-as-weight degeneracy — section 1.6 below
+covers the mechanism.
+The intervention is real and partial: 5/19 disputed items recovered
+via prompt positioning. 14 disagreements remained uncharacterized
+after this step.
+### 1.5 4A residual characterization — model-class-specific
+Framing the v1.1.1 result as "fixed" would overstate it (5/19 is a
+partial fix, not a complete one). The right diagnostic for the
+residual was the originally-deferred 4A: run a frontier-class model
+on 5 of the 14 unchanged items at the same v1.1.1 prompt, and see
+whether the residual is small-model-specific or a sign that the
+rubric is under-specified.
+**4A** (`gpt-4o-2024-08-06`, items `k8s_006`, `k8s_018`, `q011`,
+`q012`, `k8s_001`, $0.005–0.01): **5/5 scored correctly** — every
+item that gpt-4o-mini got wrong at the v1.1.1 prompt, GPT-4o got
+right at the same prompt. Clean A/B at fixed prompt varying only
+the model.
+The cleanest side-by-side is `k8s_018` (autoscaling/v2 vs v1). The
+reference specifies three points: stable API version, memory metrics
+support, custom metrics support. Both models receive the same
+prompt:
+- **gpt-4o-mini (score 1):** "It mentions some key points from the
+  reference, including the stable version of `autoscaling/v2`,
+  support for custom metrics, and memory metrics, but it does not
+  explicitly state that the new fields in `autoscaling/v2` are
+  preserved as annotations when using `autoscaling/v1`, nor does it
+  mention the need to use `autoscaling/v2` directly for memory or
+  custom metric scaling for a Deployment or StatefulSet."
+- **gpt-4o (score 2):** "The answer covers all the key points from
+  the reference. It mentions that the current stable version is
+  autoscaling/v2, which supports scaling on memory and custom
+  metrics, similar to the reference. It also notes that
+  autoscaling/v1 only supports CPU-based scaling, aligning with the
+  reference's points."
+gpt-4o-mini's reasoning step **invents additional gold-criteria the
+reference doesn't require** — "preserved as annotations," "use v2
+directly for a Deployment or StatefulSet" — and deducts against
+them. gpt-4o reads the reference's three points and scores against
+exactly those. This is a **second, distinct failure mode** from the
+1.4 finding:
+- **Failure mode A (rubric-positional):** literal-match regression
+  on paraphrased coverage. *Fixable* by recency-positioning the
+  paraphrase clause. Recovers 5/19 items. (Section 1.4.)
+- **Failure mode B (capacity-limited):** criteria-invention during
+  the reasoning step — the model manufactures additional gold
+  criteria the reference never specified, then deducts against them.
+  *Not fixable* by the same prompt; demonstrably absent in gpt-4o.
+  (This section.)
+The v1.1.1 prompt addresses A but not B. B is what 4A characterizes.
+### 1.6 κ-as-weight degeneracy — methodological observation
+> **This section is the writeup's deepest finding.** The methodology
+> arc 1.1–1.5 leads here: an intervention that improved a judge
+> member at the per-cell level (raw 26.9% → 42.9%, AC1 0.006 → 0.232)
+> was *silently excluded* from the jury aggregate by the weighting
+> metric itself. The mechanism below generalizes beyond the v1.1.1
+> instance and is what motivates v1.2 fix #5.
+The v1.1.1 gpt-4o-mini result reveals a property of Cohen's κ as a
+jury weight that the v1 design didn't anticipate: κ has a
+**self-defeating property** under intervention-induced marginal
+shifts. An intervention that improves a member can *lower* its weight
+even as the member gets more accurate.
+**Mechanism.** Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where
+`P_e = Σ_k P(gold=k) × P(pred=k)`. P_e is *not* invariant to the
+predictor's marginal distribution. When a member's predictions
+become more diverse — closer to gold's marginals — P_e rises in
+lockstep with P_o. The numerator stays small, and κ deflates even
+as raw accuracy improves.
+**Empirical instance.** v1.1 gpt-4o-mini completeness pred dist:
+`{0:2, 1:19, 2:5}` (concentrated at 1). v1.1.1 dist: `{0:4, 1:12,
+2:12}` (more diverse, closer to gold's `{1:5, 2:23}`). Per-cell raw
+accuracy 26.9% → 42.9%. AC1 (Gwet 2008, prevalence-robust):
+0.006 → 0.232 (38×). Cohen's κ: 0.020 → 0.000.
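The numbers above can be recomputed from marginals alone, since both statistics need only the two marginal distributions and the count of agreeing items. In this sketch the v1.1 gold marginals `{1: 3, 2: 23}` and the agreement counts (7 of 26, 12 of 28) are derived from the confusion breakdown in 1.4 and the reported raw agreement; they are not read from the results files. The function names are illustrative and are not the repo's `metrics.py` API.

```python
def cohen_kappa(gold: dict[int, int], pred: dict[int, int], n_agree: int) -> float:
    """Cohen's kappa from marginal counts and the number of agreeing items."""
    n = sum(gold.values())
    p_o = n_agree / n
    labels = set(gold) | set(pred)
    p_e = sum(gold.get(k, 0) * pred.get(k, 0) for k in labels) / n**2
    return (p_o - p_e) / (1 - p_e)


def gwet_ac1(gold: dict[int, int], pred: dict[int, int], n_agree: int,
             n_categories: int = 3) -> float:
    """Gwet's AC1 (unweighted) from the same inputs."""
    n = sum(gold.values())
    p_o = n_agree / n
    # pi_k: average share of label k across the two raters.
    pi = [(gold.get(k, 0) + pred.get(k, 0)) / (2 * n) for k in range(n_categories)]
    p_e = sum(p * (1 - p) for p in pi) / (n_categories - 1)
    return (p_o - p_e) / (1 - p_e)


v1_1 = dict(gold={1: 3, 2: 23}, pred={0: 2, 1: 19, 2: 5}, n_agree=7)
v1_1_1 = dict(gold={1: 5, 2: 23}, pred={0: 4, 1: 12, 2: 12}, n_agree=12)

for name, run in [("v1.1", v1_1), ("v1.1.1", v1_1_1)]:
    print(name, f"kappa={cohen_kappa(**run):.3f}", f"AC1={gwet_ac1(**run):.3f}")
```

For v1.1.1 the chance term is `(5 × 12 + 23 × 12) / 28² = 336/784 = 12/28`, which equals the observed agreement exactly, so κ lands at zero even though 12 of 28 items now agree instead of 7 of 26.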
+`_compute_kappa_weights` clips κ < 0 to weight = 0. v1.1.1's
+gpt-4o-mini κ = 0.000 → weight = 0.000 → contribution to jury
+verdict is multiplied by zero. The improved member is invisible at
+the aggregate level. **The κ table doesn't move at v1.1.1** despite
+a real per-member improvement; the visible artifact disagrees with
+the per-judge measurement.
+Why this is non-obvious: in static conditions (no intervention,
+fixed prompts), κ as weight is a sensible default. The
+self-defeating property is invisible until you observe a real
+intervention that shifts marginals. v1.0's calibration sweep
+couldn't surface it because nothing was changing the marginals;
+v1.1.1's intervention is the first time the calibration set has
+produced an intervention-induced marginal shift.
+The same prevalence trap is what motivates AC1 over κ on the
+relevance and groundedness *reporting* rows of the κ table. The
+v1.1.1 finding is that the same trap also affects κ when used as a
+*weight*, with worse consequences: a reporting-degenerate κ is just
+visually surprising; a weighting-degenerate κ silently excludes a
+correctly-improved member from the aggregate.
+**Implication.** The v1.2 fix-list (section 3) splits weighting and
+reporting cleanly: per-dimension weight metric reusing the
+`_DIM_METRIC` mapping already used for reporting. AC1 where κ
+degenerates; κ where the gold's prevalence supports it.
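One way to picture the split is a weight lookup keyed by dimension. This is a hypothetical sketch: `REPORTING_METRIC` copies the per-dimension metrics named in the κ table header rather than the repo's actual `_DIM_METRIC`, and `member_weight` is an invented helper. Which metric v1.2 assigns to completeness for weighting is not settled by this section, so both variants are shown.

```python
# Reporting metric per dimension, as stated in the kappa table header.
REPORTING_METRIC = {"groundedness": "ac1", "relevance": "ac1", "completeness": "kappa"}


def member_weight(dimension: str, agreement: dict[str, float],
                  dim_metric: dict[str, str] = REPORTING_METRIC) -> float:
    """Weight a jury member by the metric mapped to the dimension; clip negatives to 0."""
    return max(0.0, agreement[dim_metric[dimension]])


# v1.1.1 gpt-4o-mini on completeness, figures from section 1.4.
gpt_mini = {"kappa": 0.000, "ac1": 0.232}

print(member_weight("completeness", gpt_mini))  # 0.0
ac1_for_completeness = {**REPORTING_METRIC, "completeness": "ac1"}
print(member_weight("completeness", gpt_mini, ac1_for_completeness))  # 0.232
```

Passing the mapping as an argument keeps the weighting decision explicit in config rather than emergent from whichever metric the reporting code happens to use.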
+---
+## 2. Position statement — when not to use LLM-judge
+The combined findings support a sharper position than "small models
+are bad at completeness." Two distinct failure modes were surfaced
+on the same dimension, and they have different intervention classes:
+|                    | Failure mode A (1.4) | Failure mode B (1.5) |
+|--------------------|----------------------|----------------------|
+| Mechanism          | Literal-match regression on paraphrased coverage | Criteria-invention during reasoning |
+| Diagnostic         | 1A confusion matrix (17/19 disagreements one-step-down) | 4A A/B against gpt-4o (5/5 model-class swap fixes) |
+| Intervention class | Rubric-positional prompt engineering | Model selection |
+| Outcome            | Recovers 5/19 items | Recovers all 5 sampled at the same prompt |
+The v1.1.1 prompt-positional fix exhausts what prompt engineering
+can do on this rubric: the recency clause directs the model to
+paraphrase semantics, and that's the only failure mode the
+intervention can address. Iterating further on prompt design to
+address criteria-invention would either (a) need a longer prompt
+that re-explains the rubric's score levels in the score-decision
+adjacency — which would cost tokens and likely confuse smaller
+models more — or (b) require rubric simplification (binary instead
+of 3-point), which is a v1.2 design change, not a tuning change.
+**The structural answer for v1.2 is per-dimension judge selection.**
+3-point ordinal completeness with paraphrase semantics is at the
+boundary where mid-tier models exhibit capacity limits independent
+of prompt engineering. Two defensible v1.2 paths:
+1. **Exclude gpt-4o-mini from completeness scoring.** Per-dimension
+   judge membership; jury reduces to single-judge Haiku on
+   completeness; explicit and visible in the jury config (not
+   emergent from κ-weight collapse).
+2. **Replace gpt-4o-mini with GPT-4o on completeness.** Per-
+   dimension judge selection; jury keeps two members; the second is
+   a frontier-class model on the dimension that needs it.
+The choice depends on cost budget. agent-bench's calibration scale
+(~30 items × per-row × dimension-count) is trivially cheap on either
+model; production deployment evaluating thousands of agent outputs
+makes the trade-off material. For v1.2 the calibration cost
+difference between the two paths is on the order of $0.15 per full
+calibration sweep — well below the threshold where cost should
+constrain the choice.
+The honest interview answer to *"did you fix gpt-4o-mini on
+completeness?"* is **no, deliberately**: the GPT-4o A/B showed the
+residual bias is model-class-specific. The fix isn't another prompt
+intervention; it's per-dimension judge selection. v1.1.1
+demonstrated that rubric-engineering can address one of two failure
+modes; the second one is what model choice is for.
+**This generalizes beyond the specific dimension as a hypothesis the
+v1 data is consistent with, not a claim the v1 data establishes.**
+The empirical scope is narrow: 3-point ordinal × paraphrase ×
+completeness, n = 26–28 items, one mid-tier model (gpt-4o-mini)
+tested against one frontier model (gpt-4o) at the same prompt.
+Within that scope, the combination of (multi-class discrimination) ×
+(paraphrase tolerance) × (reasoning-induced elaboration latitude) is
+at the capacity boundary where mid-tier models manufacture failure
+modes that look like they should be prompt-tunable but aren't. Within
+the same scope, the data supports frontier-class models on those
+dimensions and mid-tier models on binary or strict-match dimensions,
+where the mid-tier model shows no deficit (groundedness AC1 = 1.000,
+relevance AC1 = 1.000 on the same gpt-4o-mini that fails on
+completeness).
+Whether this generalizes to other ordinal arities (4-point, 5-point),
+other mid-tier models (Mistral, Sonnet, Gemini-Flash), or other
+dimensions with paraphrase tolerance is *open* and worth replication
+in v1.2. The v1 data is one mid-tier vs one frontier on one
+dimension; the broader categorical claim ("don't use mid-tier on any
+ordinal-with-paraphrase task") needs replication across model
+families and ordinal arities before it's defensible as a general
+recommendation.
+---
+## 3. v1.2 fix-list with empirical justification
+Five items, ordered by methodology depth. Items 1–4 are escalations
+of known v1 risks the calibration confirmed; item 5 is the new
+finding from the v1.1.1 + 4A investigation.
+### 3.1 Held-out jury weights
+**v1 state.** v1.1 weights are computed on the same calibration set
+used for κ reporting (circular). The pragmatic choice was driven by
+N = 30 — splitting into a held-out subset would lose statistical
+power on both halves.
+**v1.2 fix.** A held-out 20-item validation set used solely for
+jury-weight estimation; the 30-item calibration set retained for κ
+reporting. Items selected by stratification across (corpus, gold-
+class) so the validation set reflects the calibration set's
+prevalence distribution.
+**Empirical justification.** v1.1's circular weighting is documented
+honestly (DECISIONS "v1.1 jury rescue" entry); a held-out set would
+make the jury-weight numbers reproducible across calibration-set
+revisions without reintroducing the circularity.
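The stratified selection described for the held-out set can be sketched as proportional allocation with largest-remainder rounding. This is an illustration under assumptions, not the planned v1.2 implementation: the candidate pool, the `corpus` and `gold_class` field names, and the `stratified_sample` helper are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, k, key, seed=0):
    """Pick k items so each stratum keeps roughly its share of the pool."""
    if k > len(items):
        raise ValueError("cannot sample more items than the pool contains")
    rng = random.Random(seed)  # seeded so the selection is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    n = len(items)
    exact = {s: k * len(members) / n for s, members in strata.items()}
    quota = {s: int(e) for s, e in exact.items()}
    # Hand the slots lost to truncation to the largest fractional remainders.
    leftover = k - sum(quota.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    picked = []
    for s, members in strata.items():
        picked.extend(rng.sample(members, quota[s]))
    return picked
```

A call such as `stratified_sample(pool, 20, key=lambda it: (it["corpus"], it["gold_class"]))` would keep the (corpus, gold-class) proportions of the pool, up to rounding.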
+### 3.2 Symmetric coverage / hard-error on missing weights — DONE in v1.1
+The v1 silent fallback to `1.0` was the second of the two compounding
+bugs in section 1.3. v1.1 made this a hard `ValueError` per
+DECISIONS commit `ab0e054`. Listed here for completeness; closed.
+### 3.3 Per-dimension judge membership
+**v1 state.** Jury config declares members globally across all
+dimensions (`configs/calibration/rows/jury_kappa_weighted.yaml`).
+Weights are per-(member, dimension) but membership is per-jury.
+**v1.2 fix.** Membership declared per-dimension in the jury config:
+```yaml
+jury:
+  groundedness:
+    - haiku
+    - gpt-4o-mini
+  relevance:
+    - haiku
+    - gpt-4o-mini
+  completeness:
+    - haiku            # gpt-4o-mini excluded; see writeup §1.5 + 4A
+```
+The exclusion is *visible* in the config, with a comment pointing
+to the rationale. Not buried in code logic.
+**Empirical justification.** 4A (writeup §1.5): GPT-4o handles 5/5
+of the v1.1.1-residual items at the same prompt; gpt-4o-mini's
+residual bias is model-class-specific (criteria-invention during
+reasoning). v1.1's κ-as-weight handles this by collapsing the
+member's weight to 0; v1.2 makes the exclusion explicit.
+### 3.4 Per-dimension tie-break rule
+`_discretize_mean` currently uses *ties to lower* (`floor + 1 if frac
+> 0.5 else floor`) globally — selected for conservative behavior on
+binary scales where "score 0 on uncertainty" matches the conservative
+direction (hallucination, off-topic). v1.2 flips this per-dimension:
+on 3-point completeness, "conservative" means scoring toward
+*incomplete*, which is the wrong default given member miscalibration
+already biases toward 1.
+**This fix is independent of §3.5; even with correct AC1-weighted
+aggregation, the global ties-to-lower default mis-handles ordinal
+scales where the conservative direction differs from binary scales'
+conservative direction.** Per-dimension tie-break is the *structural*
+fix for ordinal asymmetry; per-dimension weight metric in §3.5 is the
+*distributional* fix for prevalence-induced κ degeneracy. Different
+defects, different fixes.
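A per-dimension tie-break could look roughly like the sketch below. The `TIE_BREAK` table and the `discretize_mean` signature are assumptions for illustration. Only the ties-to-lower expression (`floor + 1 if frac > 0.5 else floor`) comes from the text above, and the choice of "upper" for completeness is one reading of the proposed flip.

```python
import math

# Hypothetical policy table: v1 behaves as "lower" for every dimension.
TIE_BREAK = {
    "groundedness": "lower",
    "relevance": "lower",
    "completeness": "upper",
}


def discretize_mean(mean: float, dimension: str) -> int:
    """Round a mean score to a discrete level, resolving exact .5 ties per dimension."""
    floor = math.floor(mean)
    frac = mean - floor
    if TIE_BREAK[dimension] == "upper":
        return floor + 1 if frac >= 0.5 else floor
    return floor + 1 if frac > 0.5 else floor  # v1 rule: ties go to the lower level
```

Only exact ties change: a mean of 1.5 resolves to 2 under "upper" and to 1 under "lower", while 1.25 and 1.75 round the same way under both policies.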
+### 3.5 Per-dimension weight metric (NEW from v1.1.1)
+**v1 state.** `_compute_kappa_weights` uses Cohen's κ for every
+dimension. Section 1.6 demonstrated that κ has a self-defeating
+property under intervention-induced marginal shifts — an
+intervention that improves a member can lower its weight to zero,
+silently excluding it from the aggregate.
+**v1.2 fix.** Per-dimension weight metric reusing the `_DIM_METRIC`
+mapping already used in
+`agent_bench/evaluation/calibration/report.py`. Use AC1 (Gwet 2008)
+where the dimension's gold prevalence makes κ degenerate;
+κ where the gold's prevalence supports it. Same lookup, same per-
+dimension policy at both reporting and weighting layers.
+**Empirical justification.** v1.1.1's gpt-4o-mini intervention
+(writeup §1.4 + 1.6): raw 26.9% → 42.9%, AC1 0.006 → 0.232 (38×),
+κ 0.020 → 0.000. v1.1's `_compute_kappa_weights` clips the new κ at
+zero, weight = 0, member silently excluded from the aggregate. AC1
+as weight would have given the v1.1.1-improved member a non-zero
+contribution proportional to its actual reliability, surfacing the
+intervention's per-member improvement in the jury aggregate.
+This is the writeup's deepest finding. The interaction between
+Cohen's κ and prevalence-induced marginal skew is well-documented in
+the κ-reporting literature — Gwet (2008) introduced AC1 specifically
+to address it, and the κ table at `docs/_generated/kappa_table.md`
+already uses AC1 over κ on relevance and groundedness for that
+reason. *What's underexplored, to the author's knowledge,* is the
+specific case where κ is used as a jury *weight* rather than as a
+reporting statistic, and where an intervention shifts the predictor's
+marginals while the gold's marginals stay fixed. v1.2's per-dimension
+weight metric addresses this case structurally.
+---
+## 4. Closing position
+The v1 calibration set — 30 hand-labeled items, two corpora, three
+dimensions — was small enough that every finding above lived inside
+single-digit item counts on the disputed surface. The fact that the
+calibration produced six *separable* findings rather than one or two
+flat κ numbers is itself a signal about evaluation design: a
+calibration set sized to support stratified ablation (rubric × CoT ×
+abstain × jury × prompt-positional × model-class) returns more per
+item than a larger flat set used only for headline-κ reporting.
+The methodology arc the calibration produced is reproducible from
+the artifacts on disk:
+- `docs/_generated/kappa_table.md` — the headline κ table, joined
+  on `(item_id, dimension)` from
+  `results/calibration_v1_judge_*.json` ⋈
+  `measurements/2026-05-04-judge-calibration-labels.jsonl`. v1.1
+  jury-rescue row visible at `jury_kappa_weighted_v1_1` (κ = 0.416,
+  vs `jury_kappa_weighted` at κ = 0.014).
+- `measurements/2026-05-05-judge-rubric-opus-stress.jsonl` — Opus-4
+  stress-test that surfaced the rubric drift (§1.1).
+- `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` —
+  three side-by-side reasoning + evidence_quotes excerpts
+  demonstrating the literal-match regression mechanism (§1.4).
+- `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl` — the
+  5-item probe artifact for the prompt-positional intervention
+  (§1.4).
+- `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl` — GPT-4o A/B
+  on the v1.1.1 residual; the empirical separator between the two
+  failure modes (§1.5).
+- `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl`
+  — merged sidecar (v1.1 unchanged dims + v1.1.1 fresh gpt-4o-mini
+  completeness rows). The data behind the per-member numbers in §1.4.
+- `DECISIONS.md` — per-decision rationale for v1.1, v1.1.1, 3A, 4A.
+**Total session API spend:** ~$0.014–0.019. v1.1 introduced no API
+spend (re-aggregated existing predictions). v1.1.1 spent $0.0088 on
+the prompt-positional intervention (5-item probe + 30-item full re-
+run). 4A spent $0.005–0.01 on the diagnostic A/B.
+**The v1 deliverable's position on when not to use LLM-judge:** mid-
+tier models (gpt-4o-mini class) on 3-point ordinal scales with
+paraphrase semantics exhibit capacity limits independent of prompt
+engineering. The right architectural choice is per-dimension judge
+selection, not iterative prompt tuning. Two defensible v1.2 paths
+are listed in §2 (config shape in §3.3); the empirical evidence supports either one. The
+choice between them depends on the cost of frontier inference at
+production scale, which is a separate v1.2 decision.
+---
+## Appendix A — reproducer index
+| Script | What it does | Cost |
+|---|---|---|
+| `scripts/_dev/reaggregate_jury_v1_1.py` | Re-aggregates the existing 164 sidecar rows with κ-derived weights; produces v1.1-corrected jury verdicts. Mirrors the production `Jury.score` aggregation logic offline. | $0.00 |
+| `scripts/_dev/probe_3a_paraphrase_recency.py` | 5-item probe of the prompt-positional intervention on disputed completeness items; tests whether recency-positioning the paraphrase clause shifts gpt-4o-mini's verdicts. | $0.0013 |
+| `scripts/_dev/rerun_completeness_v1_1_1.py` | Full-26 re-run of gpt-4o-mini completeness with the v1.1.1 production prompt. Haiku held as control. | $0.0075 |
+| `scripts/_dev/probe_4a_gpt4o_full.py` | GPT-4o (full) A/B on 5 of the 14 v1.1.1-unchanged items at the same v1.1.1 prompt. Diagnostic for whether the residual is small-model-specific or rubric-under-specified. | $0.005–0.01 |
+The production calibration runner (`scripts/run_calibration.py`) is
+not in this list because it produces the headline κ table from the
+canonical row configs; the `_dev` scripts above are one-off
+diagnostics that produce the writeup's interpretive evidence.
+---
+## Appendix B — CoT-before-score by dimension
+The `baseline_no_cot` ablation row (`use_cot=false`, schema requests
+only the `score` field; reasoning + evidence_quotes omitted) shows a
+per-dimension asymmetry that's interesting on its own but didn't
+drive v1.1 design choices. Pulled out of the body to keep the
+methodology arc focused on the v1.1 → v1.1.1 → 4A path.
+| Dimension | baseline (CoT) | baseline_no_cot |
+|---|---|---|
+| completeness | κ = 0.416 (n = 26) | **κ = 1.000** (n = 24) |
+| groundedness | AC1 = 1.000 (n = 26) | AC1 = 0.897 (n = 23) |
+| relevance | AC1 = 0.964 (n = 29) | AC1 = 0.963 (n = 28) |
+**Counterintuitive headline on completeness.** With CoT, the judge's
+reasoning step over-emphasizes partial coverage and rationalizes
+score = 1 ("the answer covers most of the points but misses
+detail X") even when the gold's holistic reading is "covers the
+points." Without CoT, the judge commits to a verdict against the
+rubric directly, and the verdict aligns with the holistic reading.
+The mechanism generalizes specifically to *ordinal scales with
+permissive semantics* — where reasoning-induced elaboration can
+manufacture grounds for downward verdicts.
+**The n = 24 caveat.** `baseline_no_cot` excludes 2 cells (`q021`,
+`k8s_012`) due to provider rate-limit retry exhaustion. Both were
+gold = 2; neither was in `baseline`'s disagreement set. So the
+agreement *isn't* selective in the misleading sense (the abstain set
+isn't disproportionately drawn from `baseline`'s mistakes), but the
+n = 24 vs n = 26 comparison is asymmetric across rows, and the
+κ = 1.000 number is partly an abstain-exclusion artifact rather than
+a pure counterfactual against `baseline`. The point estimate is real;
+the bootstrap CI is wider than the table cell suggests.
+**Why this didn't drive v1.1 design.** The no_cot row's groundedness
+AC1 falls from 1.000 to 0.897 — meaningfully worse on the dimension
+where CoT *does* help. Across dimensions: CoT helps on groundedness,
+hurts on completeness, neutral on relevance. The right path is
+*per-dimension* CoT selection (independent of v1.2 fix-list items
+3.1–3.5; tracked separately as a v1.2 follow-up). Not included in
+the §3 fix-list because the empirical evidence is partial (n = 24
+caveat) and the asymmetric effect across dimensions makes a single
+global change incorrect.
+**Interview-readiness note.** A reader probing the κ table will see
+the no_cot row's completeness κ = 1.000 and ask. The honest answer
+is "interesting tangent, see appendix B, didn't change v1.1 design
+choices because the asymmetry across dimensions doesn't support a
+global flip." That answer is defensible because the appendix is
+honest about the n = 24 caveat; it would not be defensible if the
+body claimed CoT-before-score was load-bearing for v1's design.

docs/plans/2026-05-04-judge-layer-v1-design.md ADDED

	@@ -0,0 +1,613 @@

+# Judge Layer v1 — Design Document
+**Date:** 2026-05-04
+**Status:** Approved — ready for implementation
+**Author:** Jane Yeung
+**Scope:** v1 of a discrete-scale, per-dimension LLM-judge layer with a κ-validated 2-judge jury and a 30-item hand-labeled calibration set. Supersedes the existing continuous-scale `answer_faithfulness` / `answer_correctness` judges. Mistral self-hosted 3rd judge, Langfuse self-host, dual-pass intra-rater calibration, and DSPy/GEPA prompt optimization are explicitly v1.1+.
+---
+## Goal
+Replace the existing single-call, continuous-score, no-abstain LLM-judge implementation in `agent_bench/evaluation/metrics.py` with a per-dimension judge layer that supports anchored discrete rubrics, abstain, evidence quotes, judge identity, rubric versioning, and variance-controlled aggregation (rubric permutation, jury). Validate the new layer against a 30-item hand-labeled calibration set with Cohen's κ and bootstrap CIs. Produce a κ ablation table that quantifies the contribution of each variance control (anchored rubric, abstain option, rubric permutation, 2-judge jury) on top of the single-judge baseline.
+The deliverable is the merged PR. The interpretive artifact is `judge-design.md` (a separate writeup file, not this design doc), which presents the κ table, the methodology, and the closing position on when *not* to use LLM-judge. It is drafted on the third day of the v1 scope window and sourced from the calibration runs produced by this design.
+## Non-Goals
+- **3rd-judge Mistral self-hosted via Modal.** Modal serving substrate exists from PR #8; deferring the third judge to v1.1 keeps the v1 jury at 2 members and the inference cost at the API-only floor.
+- **Multi-seed self-consistency** (T=0 ensemble across seeds). Variance control via rubric permutation only in v1.
+- **DSPy / GEPA / MIPROv2 prompt optimization.** Rubrics are hand-authored with anchored examples; automated optimization is v1.1+.
+- **Length-bias study, bypass tests, full pass^k sweep.** Out-of-scope for v1.
+- **Langfuse self-host integration.** Position paragraph in writeup §10 instead.
+- **Dual-pass intra-rater calibration.** v1 cites the UK AISI bio/chem ceiling (κ ~0.8) as the literature reference; v1.1 may add intra-rater κ as an empirical ceiling.
+- **Synthetic-anchor calibration set** (frontier-model-as-anchor). Methodologically delicate; v1.1+ if pursued.
+- **Backward-compatible Optional fields on `EvalResult`.** Hard cut: `EvalResult.faithfulness` and `EvalResult.correctness` are removed. Existing run artifacts in `results/*.json` will not deserialize against the new schema; this is acceptable because those artifacts are documentation-of-history (read by humans), not inputs to live code, and none of the README's published numbers depend on the removed fields.
+## Architecture
+### Three-layer evaluation hierarchy
+| Layer | What | Where | Cost | When |
+|---|---|---|---|---|
+| **L1 — Deterministic** | retrieval P@k/R@k, KHR, source_presence, grounded_refusal, citation_accuracy, calculator_used | `agent_bench/evaluation/metrics.py` (existing, untouched) | $0, CI-safe | every harness run |
+| **L2 — LLM-judge** | per-dimension judges (groundedness, relevance, completeness; +citation_faithfulness opt-in), 2-judge jury, variance-controlled | `agent_bench/evaluation/{judges,rubrics,variance}/` (new) | ~$0.001–0.005/query | optional (`evaluation.judge_provider` set + `evaluation.judge_dimensions` non-empty) |
+| **L3 — Human** | calibration set hand-labels (30 items × 3 dimensions) | `measurements/2026-05-04-judge-calibration-labels.jsonl` (new, hand-authored) | manual, one-time | once; locked |
+L3 wraps L2 via the κ table; L1 wraps L2 by handling the cases regex can see (citation accuracy is the canonical example — v1 keeps the existing deterministic check; the writeup's §6 argues this is the right cut even after L2 exists).
+### Module layout
+Four new sibling subpackages under `agent_bench/evaluation/`. They are siblings, not nested under a single `judging/` parent, because the file tree should make the L1/L2/L3 hierarchy legible and `calibration/` is L3 evaluation infrastructure that *uses* `judges/`, not a sub-concern of judging.
+```
+agent_bench/evaluation/
+  harness.py             # MIGRATED — drop inline _judge_call; plug in jury
+  metrics.py             # KEEP deterministic; DELETE answer_faithfulness/answer_correctness/_judge_call/_FAITHFULNESS_PROMPT/_CORRECTNESS_PROMPT
+  report.py              # existing
+  datasets/
+    tech_docs_golden.json    # existing — 8 items get source_snippets added (calibration subset only)
+    k8s_golden.json          # existing
+    k8s_golden_pilot.json    # existing
+    calibration_v1.json      # NEW — 30 stratified item IDs, version field, system_config_git_sha
+  judges/                # NEW
+    __init__.py
+    base.py              # Judge ABC, ScoreResult, Rubric loader, MockJudge, abstain-reason constants
+    groundedness.py
+    relevance.py
+    completeness.py
+    citation_faithfulness.py    # opt-in v1; default-on v1.1
+  rubrics/               # NEW (markdown)
+    groundedness.md
+    relevance.md
+    completeness.md
+    citation_faithfulness.md
+  variance/              # NEW
+    __init__.py
+    rubric_permute.py    # wraps Judge; permutes rubric levels; aggregates
+    jury.py              # multi-judge aggregation: mean | kappa_weighted; quorum
+  calibration/           # NEW
+    __init__.py
+    metrics.py           # cohen_kappa (linear/quadratic), gwets_ac2, bootstrap_ci — hand-rolled
+    report.py            # markdown table generator → docs/_generated/kappa_table.md
+tests/evaluation/        # NEW directory (precedent: tests/test_langchain_baseline/)
+  __init__.py
+  test_judges.py
+  test_rubric_loading.py
+  test_calibration_metrics.py
+  test_jury_aggregation.py
+  test_calibration_report.py
+  test_harness_migration.py
+  test_mockjudge_coverage.py
+```
+### Supersession of existing judges (dedicated subsection)
+The new `Judge` ABC fully supersedes `answer_faithfulness`, `answer_correctness`, and `_judge_call` in `agent_bench/evaluation/metrics.py:167-208`. The old code is **deleted** (no deprecation cycle). The supersession changes six axes:
+| Axis | Old (`_judge_call`) | New (`Judge` ABC) |
+|---|---|---|
+| **Scale** | continuous 0.0–1.0, no anchors | discrete (binary or 3-point) with rubric-anchored examples per level |
+| **Reasoning placement in JSON** | `{"score": …, "reasoning": …}` — score first | `{reasoning, evidence_quotes, score}` — score conditions on reasoning |
+| **Granularity** | combined "faithfulness" / "correctness" | per-dimension (groundedness / relevance / completeness; citation_faithfulness opt-in) |
+| **Versioning** | none — judge_id, rubric, prompt all unrecorded | `judge_id`, `rubric_version` (SHA-256 of rubric file content), `prompt_seed`, `system_output_hash` traceable in every `ScoreResult` |
+| **Variance control** | single call only | composable wrappers (`rubric_permute`, `jury`) |
+| **Failure mode** | bare `except Exception` returns `None`; harness silently drops | intentional: `"Unknown"` abstain on rubric/model noise (with structured-prefix reason); raise on caller bugs (see Error Handling) |
+**Config knob preservation.** `evaluation.judge_provider` YAML field stays (5 configs reference it; `core/config.py:89`). New judges accept `judge_provider: LLMProvider` matching the existing harness signature pattern. Zero user-facing config migration. New `evaluation.judge_dimensions: list[str]` field (default `["groundedness", "relevance", "completeness"]`); `citation_faithfulness` is opt-in v1, default-on v1.1, decoupling the citation deterministic-vs-LLM head-to-head from the harness migration.
+**Coupled artifact updates** (in scope of the judge PR):
+- `docs/DESIGN.md:346-356, 395` — rewrite §"LLM-judge metrics (costs money, manual)" to point at this design doc and `judge-design.md` (the writeup).
+- `DECISIONS.md` — append one supersession entry. Entry references file paths explicitly: `measurements/2026-05-04-judge-calibration-labels.jsonl`, the relevant `results/calibration_v1_judge_*.json` files, and the κ table file path. References by file path, not abstract claim — the supersession is defended by the calibration data, not by description.
+- `measurements/README.md` — append one row pointing at the new calibration-labels file (otherwise it orphans next to the cold-start logs).
+- `README.md` — add a "Targets that cost money" subheading (separate concern; see the README cost-disclosure obligation under Testing).
+### Dependency direction
+Judge → Rubric (filesystem markdown loader) → existing `LLMProvider` ABC at `agent_bench/core/provider.py`. **No new external runtime dependencies.** Cohen's κ, Gwet's AC2, and bootstrap CI are hand-rolled (rationale in `calibration/metrics.py` under Components). scikit-learn is *not* added to the project; sklearn appears only in dev tooling under `scripts/_dev/` (see the sklearn fixture pattern under Testing).
+## Components
+### Rubric (the spec object)
+```python
+class Rubric(BaseModel):
+    dimension: Literal["groundedness", "relevance", "completeness", "citation_faithfulness"]
+    scale: Literal["binary", "three_point"]
+    reference_based: bool
+    abstain_allowed: bool
+    levels: list[RubricLevel]   # parsed from markdown sections
+    body_markdown: str           # full file contents
+    @property
+    def source_hash(self) -> str:
+        # SHA-256 of body_markdown — immutable per file content, independent of git
+        ...
+    def render_prompt(self, *, level_permutation_seed: int = 0) -> str:
+        # if seed > 0, permute self.levels deterministically using PRNG(seed)
+        ...
+```
+**Two-hash provenance.** `source_hash` (SHA-256 of canonical body) is immutable per rubric file; `prompt_seed` (per-call int, 0 = no permutation) is recorded on the call. κ aggregation groups by `source_hash`; ScoreResults with the same `source_hash` and different `prompt_seed` are agreement-eligible against the same label. Both fields appear in every `ScoreResult` so records are self-contained.
+Loader reads markdown with YAML frontmatter (matching repo convention). Anchored examples are parsed by section header pattern (`## Score 0`, `## Score 1`, …) so level-permutation rewrites the prompt by reordering sections.
+**Construction validates aggressively** (see Rubric construction validation under Error Handling): scale ∈ {binary, three_point}, levels arity matches scale, every level has at least one anchored example with thinking-trace explanation, frontmatter has all required fields. ValidationError raises with file path + field path. Failing at rubric construction (Day 1) is much cheaper than failing on first `judge.score` call (Day 2 with API budget already spent).
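The two provenance mechanics described above (a content hash over the rubric body and a seeded reordering of the `## Score N` sections) can be sketched with the standard library. The helper names, the UTF-8 encoding choice, and the use of `random.Random(seed).shuffle` are assumptions for illustration; the design only specifies "SHA-256 of body" and "deterministic using PRNG(seed)".

```python
import hashlib
import random
import re

# Zero-width split point in front of every "## Score N" header line.
_LEVEL_HEADER = re.compile(r"^(?=## Score \d+)", re.MULTILINE)


def source_hash(body_markdown: str) -> str:
    """SHA-256 hex digest of the rubric body (UTF-8 is an assumed encoding)."""
    return hashlib.sha256(body_markdown.encode("utf-8")).hexdigest()


def split_levels(body_markdown: str) -> tuple[str, list[str]]:
    """Return (preamble, level sections), each section starting at its header."""
    parts = _LEVEL_HEADER.split(body_markdown)
    return parts[0], parts[1:]


def render_prompt(body_markdown: str, level_permutation_seed: int = 0) -> str:
    """Seed 0 keeps the file order; any seed > 0 reorders sections reproducibly."""
    preamble, levels = split_levels(body_markdown)
    if level_permutation_seed > 0:
        random.Random(level_permutation_seed).shuffle(levels)
    return preamble + "".join(levels)
```

Because the hash is taken over the file body rather than the rendered prompt, every permutation of one rubric shares a single `source_hash` and differs only in `prompt_seed`.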
+### ScoreResult (per-call record)
+```python
+class ScoreResult(BaseModel):
+    # Reasoning-first ordering — matters for Pydantic field order
+    # AND for the JSON schema sent to the model
+    reasoning: str
+    evidence_quotes: list[str] = Field(default_factory=list)
+    score: int | Literal["Unknown"]
+    # Provenance (self-contained — no run-metadata cross-reference needed)
+    judge_id: str              # f"{model_id}_{dimension}", e.g. "claude-haiku-4-5_groundedness"
+    rubric_version: str        # = Rubric.source_hash
+    prompt_seed: int = 0
+    system_output_hash: str    # SHA-256 of canonical (item.id, output.answer, sorted(output.sources))
+    # Operations
+    cost_usd: float
+    latency_ms: float
+    @property
+    def abstained(self) -> bool:
+        return self.score == "Unknown"
+```
+`score` is `int | Literal["Unknown"]` (not `int | None`) so abstain is structurally distinct from "we don't have a value yet" — the silent-`None` failure mode that the old `_judge_call` exhibited becomes impossible.
+`system_output_hash` is the cross-run-aggregation guard: scores are agreement-eligible iff `(item.id, dimension, system_output_hash)` match. Any mismatch between labels and predictions raises in the calibration report (see Calibration report failure modes under Error Handling).
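One way to compute a hash with the stated inputs is to serialize a canonical tuple and digest it. This is a sketch under assumptions: the JSON canonicalization and the standalone function signature are hypothetical, and the design only fixes the inputs `(item.id, output.answer, sorted(output.sources))` and the SHA-256 digest.

```python
import hashlib
import json


def system_output_hash(item_id: str, answer: str, sources: list[str]) -> str:
    """Digest of the canonical (item id, answer, sorted sources) triple."""
    canonical = json.dumps(
        [item_id, answer, sorted(sources)],
        ensure_ascii=False,
        separators=(",", ":"),  # fixed separators keep the byte form stable
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting the sources makes the hash insensitive to retrieval order, so two runs that produced the same answer from the same source set remain agreement-eligible.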
+### Judge ABC + concrete judges
+```python
+class Judge(ABC):
+    def __init__(self, judge_provider: LLMProvider, rubric: Rubric, model_id: str):
+        self.judge_provider = judge_provider
+        self.rubric = rubric
+        self.model_id = model_id
+        self.judge_id = f"{model_id}_{rubric.dimension}"
+    @abstractmethod
+    async def score(
+        self,
+        item: GoldenQuestion,
+        output: AgentResponse,
+        *,
+        prompt_seed: int = 0,
+    ) -> ScoreResult: ...
+```
+Concrete judges (`GroundednessJudge`, `RelevanceJudge`, `CompletenessJudge`, `CitationFaithfulnessJudge`) are thin per-dimension classes (~30 lines each), no shared base method. Factoring the prompt-assembly into a base method is rejected: at 3–4 judges of 30 lines each, each is more readable in full than as a delta against a base, and a shared base creates a future trap where dimension-specific logic creeps into the base via `if self.dimension == ...` branches.
+**Per-judge input expectations** (matters for the FastAPI snippet-authoring scope):
+| Judge | Reads from `item` | Reads from `output` |
+|---|---|---|
+| `GroundednessJudge` | `source_snippets` (the 8 FastAPI calibration items get hand-snippeted; see FastAPI snippet authoring under Calibration Methodology) | `answer` |
+| `RelevanceJudge` | `question` only | `answer` |
+| `CompletenessJudge` | `reference_answer` | `answer` |
+| `CitationFaithfulnessJudge` | `source_chunk_ids` + retrieved-chunk text | `answer` (parsed for claims + citations) |
+`CitationFaithfulnessJudge` returns one aggregate `ScoreResult` per item (preserving ABC polymorphism), with per-pair (claim, citation) detail in `evidence_quotes`. Aggregation rule for binary: **all-or-nothing** — any unfaithful citation → score=0. The rule is documented explicitly in `rubrics/citation_faithfulness.md`.
+### MockJudge
+Same shape as `Judge`; constructor takes `verdicts: dict[str, ScoreResult]` keyed by `item.id`. Returns the pre-baked verdict on `score()`, no API call. **Raises `LookupError` on missing keys** — never returns a default — so test fixtures are self-checking. A separate fixture-validation test (`test_mockjudge_coverage.py`) walks `item.id` across all goldens and asserts every MockJudge instance has coverage for items its tests reference. Two-layer defense against the rename-breaking-tests failure mode. Mirrors the `MockProvider` pattern at `agent_bench/core/provider.py:118`.
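The fail-loud lookup and the coverage check can be sketched as follows. This is a simplified illustration, not the `MockJudge` in `judges/base.py`: verdicts are plain dicts rather than `ScoreResult` objects, `score` takes an item id directly, and `missing_coverage` is a hypothetical helper standing in for the fixture-validation test.

```python
import asyncio


class MockJudge:
    """Returns pre-baked verdicts and never falls back to a default."""

    def __init__(self, verdicts: dict[str, dict]):
        self._verdicts = dict(verdicts)

    async def score(self, item_id: str) -> dict:
        if item_id not in self._verdicts:
            raise LookupError(f"MockJudge has no verdict for item {item_id!r}")
        return self._verdicts[item_id]

    def covers(self, item_id: str) -> bool:
        return item_id in self._verdicts


def missing_coverage(judge: MockJudge, referenced_ids: list[str]) -> list[str]:
    """Item ids a test references but the fixture does not provide."""
    return [item_id for item_id in referenced_ids if not judge.covers(item_id)]
```

A renamed golden item then surfaces twice: once in the coverage check before any test body runs, and again as a `LookupError` if a test reaches `score()` anyway.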
+### rubric_permute (variance wrapper)
+```python
+def rubric_permute(judge: Judge, n: int = 2, seeds: list[int] | None = None) -> PermutedJudge: ...
+```
+`PermutedJudge.score(item, output)` runs `judge.score(item, output, prompt_seed=s)` for each `s` in `seeds` (default `[1, 2]`), aggregates:
+- Binary: majority (n=2 → tie-break to lower score, more conservative)
+- Three-point: mean, rounded to nearest level **with ties broken downward** (e.g., 1.5 → 1, 0.5 → 0); same conservative principle as the binary tie-break
+- **Any abstain → "Unknown"** (any sample, not all): the whole point of rubric permutation is to surface whether judge behavior depends on prompt structure; averaging an abstain away with a confident sample defeats the technique. At N=2, "all abstain" essentially never fires, making it a silent aggressive default. "Any abstain → Unknown" is the conservative choice that preserves the variance signal.
+Returns one `ScoreResult` with `judge_id = f"{judge.judge_id}_perm{n}"`, `prompt_seed=0` on the aggregate. Per-permutation results are written to a sidecar JSONL (same pattern as the jury subsection below) for traceability.
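The aggregation rules above reduce to a small function once scores are plain values. The sketch below is an assumption-labeled illustration: `aggregate_permutations` is a hypothetical name, and it treats binary majority and three-point rounding with one shared rule, since a binary majority with ties to the lower score is the same as a mean thresholded at 0.5.

```python
import math

UNKNOWN = "Unknown"


def aggregate_permutations(scores):
    """Combine per-permutation scores into one verdict.

    Any abstain makes the aggregate abstain; otherwise the mean is rounded
    to the nearest level with exact ties going to the lower level.
    """
    if not scores:
        raise ValueError("need at least one permutation result")
    if any(s == UNKNOWN for s in scores):
        return UNKNOWN
    mean = sum(scores) / len(scores)
    floor = math.floor(mean)
    return floor + 1 if mean - floor > 0.5 else floor
```

At the default of two permutations, a split verdict always resolves to the lower score, and a single abstain is enough to return `"Unknown"`.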
+### jury (multi-judge aggregator)
+```python
+def jury(
+    judges: list[Judge],
+    aggregation: Literal["mean", "kappa_weighted"],
+    weights: dict[str, float] | None = None,         # required if kappa_weighted
+    quorum: int | None = None,                       # default: len(judges) — strict
+    sidecar_path: str | None = None,                 # default: results/calibration_v1_judge_{aggregation}_members.jsonl
+) -> Jury: ...
+```
+`Jury.score(item, output)` runs `asyncio.gather(*[j.score(item, output) for j in judges], return_exceptions=False)` inside a jury-level try/except, so the first non-retryable exception propagates immediately and caller bugs fail fast. `gather` does not cancel sibling tasks on its own when one raises, so the jury-level handler has to cancel them explicitly. Per-member ScoreResults are always written to the sidecar (successes and failure-as-abstains alike). Aggregate behavior:
+1. Count `successful_members = sum(1 for r in member_results if not r.abstained)`.
+2. If `successful_members < quorum`: aggregate = `ScoreResult(score="Unknown", reasoning=f"jury_below_quorum: {successful_members}/{len(judges)} members succeeded; required {quorum}", ...)`.
+3. Else: aggregate using `aggregation` strategy over the successful members' scores. **Discretization rule (same as `rubric_permute`):** binary scores threshold at 0.5 with ties → 0; three-point scores round to nearest with ties → lower level. Discretization happens at the aggregation step, before the κ join — Cohen's κ requires both inputs discrete.
+**Strict quorum default for v1.** `quorum=N` (= `len(judges)`) at v1's 2-judge jury means any member abstain → jury abstain. Tolerant defaults at N=2 are silent single-judge in jury clothing. The parameter exists in v1 so v1.1's 3-judge jury can shift to `quorum=2` (majority) without rearchitecting failure semantics.
+`kappa_weighted` requires explicit `weights` injection — computed offline once on the calibration set, *not* at jury construction (would be circular).
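The quorum check and discretization steps above might look roughly like the following. This is a sketch, not the `Jury` code: `jury_aggregate` is a hypothetical helper, member results are reduced to bare scores keyed by judge id, and `kappa_weighted` is assumed to mean a weighted mean using the injected weights.

```python
import math
from typing import Optional, Union

Score = Union[int, str]  # int level, or "Unknown" for an abstain (assumption)


def jury_aggregate(
    member_scores: dict[str, Score],
    scale: str,  # "binary" or "three_point"
    quorum: Optional[int] = None,
    weights: Optional[dict[str, float]] = None,
) -> Score:
    """Apply strict-by-default quorum, then a (weighted) mean, then discretize."""
    required = len(member_scores) if quorum is None else quorum
    ok = {j: s for j, s in member_scores.items() if s != "Unknown"}
    if len(ok) < required:
        return "Unknown"  # jury_below_quorum
    # Plain mean when no weights are injected; otherwise weight each member.
    w = {j: 1.0 for j in ok} if weights is None else {j: weights[j] for j in ok}
    total = sum(w.values())
    if total <= 0:
        raise ValueError("weights of successful members must sum to a positive value")
    mean = sum(w[j] * ok[j] for j in ok) / total
    if scale == "binary":
        return 1 if mean > 0.5 else 0  # tie at 0.5 goes to 0
    return math.ceil(mean - 0.5)  # nearest level, ties go to the lower level
```

At the v1 default (`quorum=None` on a 2-judge jury), a single member abstain makes the aggregate `"Unknown"`; passing `quorum=2` to a 3-judge jury tolerates one abstain.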
+### calibration/metrics.py (hand-rolled κ + bootstrap)
+```python
+def cohen_kappa(
+    y1: list[int | str], y2: list[int | str],
+    weights: Literal[None, "linear", "quadratic"] = None,
+) -> float
+def gwets_ac2(
+    y1: list[int | str], y2: list[int | str],
+    weights: Literal[None, "linear", "quadratic"] = None,
+) -> float
+def bootstrap_ci(
+    y1: list, y2: list, metric_fn: Callable[[list, list], float],
+    n_iter: int = 1000, ci: float = 0.95, seed: int = 42,
+) -> tuple[float, float, float]   # (point_estimate, ci_lo, ci_hi)
+```
+**Hand-rolled, not sklearn.** Adding scikit-learn for one function (and transitively numpy + scipy + threadpoolctl + joblib) contradicts agent-bench's "built from primitives" identity. The hand-roll also serves the writeup: `(P_o − P_e) / (1 − P_e)` with explicit `P_e` computation demonstrates formula understanding in a way that an `sklearn.metrics.cohen_kappa_score` import does not. Fixture-tested against sklearn run *outside* the project venv (see the sklearn fixture pattern under Testing).
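For reference, the unweighted formula can be written out as below. This is a sketch of the arithmetic only; `cohen_kappa_unweighted` is a hypothetical name, and the linear and quadratic weightings in the signature above are not reproduced.

```python
from collections import Counter


def cohen_kappa_unweighted(y1: list, y2: list) -> float:
    """Cohen's kappa: (P_o - P_e) / (1 - P_e) for two raters over the same items."""
    if len(y1) != len(y2) or not y1:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y1)
    # P_o: observed proportion of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(y1, y2)) / n
    # P_e: agreement expected by chance from each rater's marginal label counts.
    c1, c2 = Counter(y1), Counter(y2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa undefined: both raters used one identical label")
    return (p_o - p_e) / (1.0 - p_e)
```

Worked by hand for `y1 = [1, 1, 0, 0]` and `y2 = [1, 0, 0, 0]`: `P_o = 3/4`, `P_e = (2·1 + 2·3) / 16 = 1/2`, so κ = `(0.75 − 0.5) / (1 − 0.5) = 0.5`.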
+**Abstain handling in κ.** Excluded pairwise — if either side abstains on item *i*, item *i* drops from that κ calculation. Standard treatment (Tu et al. 2024, *Beyond Correlation*); abstain as "I don't know" is neither agreement nor disagreement. Abstain count per dimension is reported separately by the calibration report (see `calibration/report.py` below).
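Pairwise exclusion could be implemented as a small pre-filter applied before the κ call. The helper name `drop_abstains` is an assumption for illustration; the `"Unknown"` sentinel matches the score value used in the labels file.

```python
def drop_abstains(y1: list, y2: list, abstain: str = "Unknown") -> tuple[list, list, int]:
    """Drop item i from both lists if either side abstained on item i."""
    if len(y1) != len(y2):
        raise ValueError("inputs must be of equal length")
    kept = [(a, b) for a, b in zip(y1, y2) if a != abstain and b != abstain]
    n_dropped = len(y1) - len(kept)
    # Return the filtered pair of lists plus the count to report separately.
    return [a for a, _ in kept], [b for _, b in kept], n_dropped
```

The returned count is what a report would surface as the abstain tally, so the excluded items stay visible rather than silently shrinking N.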
+**Gwet's AC2 deferred from headline numbers.** AC2 is implemented in v1 but the published numbers in the v1 writeup come from κ only; AC2 fixture-test rigor (sympy-derived intermediate steps, not arithmetic-derived) is v1.1 work. Hand-computed AC2 fixtures in v1 cover three inspection-verifiable cases (perfect agreement, perfect disagreement, mid-range).
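The `bootstrap_ci` signature listed earlier in this subsection admits a straightforward percentile-bootstrap reading, sketched below with the standard library only. The percentile method, the index-based quantile lookup, and the choice to skip degenerate resamples (where the metric raises) are all assumptions of this sketch, not confirmed details of the project's implementation.

```python
import random
from typing import Callable


def bootstrap_ci(
    y1: list, y2: list, metric_fn: Callable[[list, list], float],
    n_iter: int = 1000, ci: float = 0.95, seed: int = 42,
) -> tuple[float, float, float]:
    """Percentile bootstrap over paired items: (point_estimate, ci_lo, ci_hi)."""
    if len(y1) != len(y2) or not y1:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # local RNG so the result is reproducible
    n = len(y1)
    point = metric_fn(y1, y2)
    stats = []
    for _ in range(n_iter):
        # Resample item indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(metric_fn([y1[i] for i in idx], [y2[i] for i in idx]))
        except (ValueError, ZeroDivisionError):
            continue  # assumption: skip resamples where the metric is undefined
    if not stats:
        raise ValueError("metric undefined on every resample")
    stats.sort()
    alpha = (1.0 - ci) / 2.0
    lo = stats[int(alpha * (len(stats) - 1))]
    hi = stats[int((1.0 - alpha) * (len(stats) - 1))]
    return point, lo, hi
```

Because the generator is seeded locally, two calls with the same inputs return identical tuples, which is the property a seed-reproducibility test would check.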
+### calibration/report.py
+One function: `generate_kappa_table(predictions_glob, labels_path, output_path, *, strict: bool = False)` → writes `docs/_generated/kappa_table.md`. Idempotent. Joins predictions ⋈ labels on `(item_id, dimension, system_output_hash)`; raises on hash mismatch (collect-all, error includes first-item expected/actual hashes plus full mismatched-id list). Computes per-config, per-dimension κ + bootstrap CI + abstain rate; flags rows whose abstain rate is **strictly greater than** 20% with a footnote (`"κ computed on N=X of 30 items; high abstain rate (Y% — breakdown: Z% schema parse, W% genuine abstain) suggests rubric ambiguity"`).
+**Two modes for missing predictions/labels:**
+- Default: WARN-and-exclude (Day-2 development loop — partial coverage is real interim state).
+- `--strict`: RAISE on any missing prediction/label (final-artifact path; `make calibrate` invokes this; the writeup is by-construction produced from `--strict` output).
+The κ table is copy-pasted into the writeup at draft time, not include-by-reference — the writeup is a frozen v1 artifact and copy-paste lets the writeup add inline annotations to specific cells.
+## Data Flow
+### Production harness run (existing, migrated)
+```
+golden file → load_golden_dataset() → list[GoldenQuestion]
+  → for each item, parallel:
+      orchestrator.run() → AgentResponse
+      compute L1 metrics (existing — untouched)
+      if judge_provider is not None and item.category != "out_of_scope":
+          system_output_hash = hash(item.id, response.answer, sorted(response.sources))
+          for each Judge in evaluation.judge_dimensions:
+              ScoreResult = await judge.score(item, response)
+          attach to EvalResult.judge_scores: dict[str, ScoreResult]
+  → write results/{run_label}.json
+```
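The `hash(item.id, response.answer, sorted(response.sources))` step in the flow above is pseudocode. One plausible realization, assuming SHA-256 over a canonical JSON payload (consistent with the 64-hex-character hashes in the labels file, but the exact serialization is an assumption), is:

```python
import hashlib
import json


def system_output_hash(item_id: str, answer: str, sources: list[str]) -> str:
    """Deterministic digest of one system output; source order does not matter."""
    payload = json.dumps(
        {"item_id": item_id, "answer": answer, "sources": sorted(sources)},
        sort_keys=True,          # stable key order
        ensure_ascii=False,      # hash the text as written
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Python's built-in `hash()` would not work here, since string hashing is randomized per process; a content digest is stable across runs.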
+**Migration delta** at `agent_bench/evaluation/harness.py:153-166`:
+- DELETE inline import of `answer_faithfulness, answer_correctness`
+- DELETE `result.faithfulness = ...` and `result.correctness = ...` assignments
+- ADD: load configured judges from `evaluation.judge_dimensions` config; build with existing `judge_provider`
+- ADD: `result.judge_scores: dict[str, ScoreResult]` field on `EvalResult`
+- KEEP: `if judge_provider is not None and q.category != "out_of_scope"` gate (out-of-scope items still bypass L2; refusal is deterministic)
+- KEEP: `evaluation.judge_provider` YAML field (5 configs reference it)
+### Calibration run (new)
+```
+calibration_v1.json (30 IDs + version + system_config_git_sha)
+  → filter k8s_golden.json + tech_docs_golden.json → 30 GoldenQuestions
+Step A (once, frozen): generate system outputs
+  → orchestrator.run() with frozen config for each item
+  → write results/calibration_v1_system_outputs.json
+     (each record includes system_output_hash, item_id, answer, sources, source_chunks, citations)
+Step B (manual): hand-label
+  → labeling notebook reads system_outputs file, injects system_output_hash automatically
+  → for each (item, dimension), human authors score + notes
+  → append to measurements/2026-05-04-judge-calibration-labels.jsonl
+     {item_id, dimension, score | "Unknown", abstained, notes, label_timestamp, system_output_hash}
+Step C (per ablation row): score with judges
+  → load row config from configs/calibration/rows/{label}.yaml
+  → load system_outputs file (frozen)
+  → for each item, judge.score(item, output) per row's judge configuration
+  → write results/calibration_v1_judge_{row_label}.json
+     and (jury rows) results/calibration_v1_judge_jury_{aggregation}_members.jsonl
+Step D (κ table):
+  → calibration/report.generate_kappa_table(strict=True for final artifact)
+  → join predictions ⋈ labels on (item_id, dimension, system_output_hash); raise on mismatch
+  → exclude pairs where either side abstains
+  → cohen_kappa + bootstrap_ci + abstain_rate per (config, dimension)
+  → write docs/_generated/kappa_table.md
+```
+**Hash propagation through labels** is intentional: labels carry `system_output_hash` because they are tied to specific outputs. If `system_outputs` are ever regenerated (config change, retry), labels become stale and the κ join raises loudly. This eliminates the cross-run aggregation bug class.
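A collect-all join with raise-on-mismatch could be sketched as follows. `join_on_hash` is a hypothetical name, records are plain dicts with the field names from the labels JSONL, and treating the label's hash as the "expected" value is an assumption of this sketch.

```python
def join_on_hash(predictions: list[dict], labels: list[dict]) -> list[tuple[dict, dict]]:
    """Pair predictions with labels; raise once, listing every stale item."""
    label_index = {(lab["item_id"], lab["dimension"]): lab for lab in labels}
    pairs, mismatches = [], []
    for pred in predictions:
        lab = label_index.get((pred["item_id"], pred["dimension"]))
        if lab is None:
            continue  # missing label: a separate WARN / --strict concern
        if lab["system_output_hash"] != pred["system_output_hash"]:
            mismatches.append(
                (pred["item_id"], lab["system_output_hash"], pred["system_output_hash"])
            )
        else:
            pairs.append((pred, lab))
    if mismatches:
        # Collect-all: report the first item in detail plus the full id list.
        first_id, expected, actual = mismatches[0]
        ids = sorted({m[0] for m in mismatches})
        raise ValueError(
            f"system_output_hash mismatch on {len(ids)} item(s); "
            f"first: {first_id} expected={expected} actual={actual}; all ids: {ids}"
        )
    return pairs
```

Collecting every mismatch before raising means one rerun of the labeling or scoring step can address all stale items at once.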
+### Concurrency
+- **Within an item, across judges (jury):** `asyncio.gather` over `judges`; existing provider rate-limit/retry kicks in.
+- **Across items in a calibration row:** `asyncio.gather` with semaphore, default concurrency=5, configurable via CLI flag with config-field fallback. **Resolved value logged at run start** so artifacts capture which concurrency was used.
+- **Across rows of the ablation:** rows run sequentially. Each row writes its predictions file before the next starts — partial progress survives interruption.
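The across-items bullet describes a standard semaphore-bounded `asyncio.gather`. A minimal sketch, with `score_all` and `score_one` as hypothetical names standing in for the row runner and the per-item judge call:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def score_all(
    items: list[Any],
    score_one: Callable[[Any], Awaitable[Any]],
    concurrency: int = 5,
) -> list[Any]:
    """Score every item with at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(item: Any) -> Any:
        async with sem:  # blocks once `concurrency` calls are active
            return await score_one(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(bounded(item) for item in items))
```

Results come back in item order regardless of completion order, which keeps the predictions file deterministic for a fixed set of judge responses.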
+### New scripts and Makefile targets
+```
+scripts/
+  evaluate.py             # existing — full-corpus harness runs
+  run_calibration.py      # NEW — orchestrates Steps A, C, D
+                          #   subcommands: generate-outputs | run-judges --row-config=<path> | build-table [--strict]
+                          # Step B (labeling) is manual — done in a notebook
+configs/calibration/rows/  # NEW — one YAML per ablation row (config-file-per-row)
+  baseline.yaml
+  baseline_no_cot.yaml
+  baseline_no_anchors.yaml
+  baseline_no_abstain.yaml
+  permute.yaml
+  jury_kappa_weighted.yaml
+Makefile:
+  calibrate             # runs full pipeline: generate-outputs → run-judges (all rows) → build-table --strict
+  evaluate-judges       # runs run-judges + build-table against existing system_outputs (no regeneration)
+```
+Row configs are independently versioned reproducible artifacts in the PR. `run-judges` is a generic runner taking `--row-config=<path>`; the script does not own the row inventory. Discovering a bug in row 4 means fixing row 4's config and rerunning rows 4-6 without touching 1-3.
+### Failure modes eliminated by this design
+| Bug class | Eliminated by |
+|---|---|
+| Cross-run aggregation (run-A outputs scored against run-B labels) | `system_output_hash` join with raise-on-mismatch |
+| Stale labels after system re-run | Same |
+| MockJudge silently passing tests with renamed item IDs | `LookupError` on missing keys + fixture-validation test |
+| Single-call judge bias hidden | Rubric permutation surfaces it via abstain propagation |
+| Per-judge κ unrecoverable from jury aggregate | Sidecar JSONL with deterministic path |
+| Partial progress lost on Step C interruption | One predictions file per row, written sequentially |
+| Schema parse failures silently dropped (old `_judge_call` `None`) | Discrete abstain-with-prefix; abstain rate flagged at >20% |
+| Final writeup citing N=28 while prose claims N=30 | `--strict` mode for final-artifact build; default warns |
+## Error Handling
+### Failure taxonomy at L2
+| Category | Source | Where caught | Decision |
+|---|---|---|---|
+| Provider retryable (rate limit, timeout, network) | Infra | Existing `LLMProvider` retry/backoff | Bubbles up only on retry exhaustion |
+| Provider exhausted (retries exhausted) | Infra | `Judge.score` | Abstain with `ABSTAIN_REASON_PROVIDER_EXHAUSTED` |
+| Provider non-retryable (401, 400) | Caller misconfig | `Judge.score`; jury cancels siblings | **Raise** — bug, not noise |
+| Schema parse error | Model glitch or broken prompt | `Judge.score` | Abstain after one strict-reprompt retry; `ABSTAIN_REASON_SCHEMA_PARSE` |
+| Score out of range | Model glitch | `Judge.score` | Abstain after one strict-reprompt retry; `ABSTAIN_REASON_OUT_OF_RANGE` |
+| Genuine model abstain (rubric allows) | Model judgment | `Judge.score` | Abstain with empty-prefix sentinel (`ABSTAIN_REASON_GENUINE` = `""`) |
+| Hash mismatch on κ join | Stale labels | `calibration/report.py` | Raise after collect-all; first-item expected/actual hashes in message |
+### The abstain-vs-raise discipline
+**One retry with strict reprompt** on schema parse / score out of range. Original prompt's formatting instructions are augmented at the end with a recency-positioned reminder: `STRICT FORMATTING NOTE: respond ONLY with a JSON object matching the schema; reasoning first, then evidence_quotes, then score`. If second attempt also fails, abstain with structured-prefix reason. **Exactly one retry** — zero retries throws away signal that recovers cheaply; N>1 retries silently mask systematic schema breaks.
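The exactly-one-retry rule can be sketched as a two-pass loop. This is a simplified, synchronous illustration: `score_with_one_retry` and `call_model` are hypothetical names, the result is a plain dict rather than a `ScoreResult`, and genuine rubric-allowed abstains and provider errors are not modeled.

```python
import json
from typing import Callable

STRICT_NOTE = (
    "STRICT FORMATTING NOTE: respond ONLY with a JSON object matching the schema; "
    "reasoning first, then evidence_quotes, then score"
)
ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "


def score_with_one_retry(
    call_model: Callable[[str], str], prompt: str, valid_scores: tuple
) -> dict:
    """Try once, reprompt strictly once, then abstain with a prefixed reason."""
    attempt_prompt = prompt
    failure = ""
    for _ in range(2):  # original attempt plus exactly one retry
        raw = call_model(attempt_prompt)
        try:
            parsed = json.loads(raw)
            score = parsed["score"]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            failure = ABSTAIN_REASON_SCHEMA_PARSE + repr(exc)
        else:
            if score in valid_scores:
                return {"score": score, "abstained": False,
                        "reasoning": parsed.get("reasoning", "")}
            failure = ABSTAIN_REASON_OUT_OF_RANGE + repr(score)
        # Recency-positioned reminder appended for the second attempt.
        attempt_prompt = prompt + "\n\n" + STRICT_NOTE
    return {"score": "Unknown", "abstained": True, "reasoning": failure}
```

The abstain reason records the cause of the final failed attempt, so a schema failure followed by an out-of-range score is reported as out-of-range.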
+**Failure-reason prefixes as constants** in `judges/base.py`:
+```python
+ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: "
+ABSTAIN_REASON_SCHEMA_PARSE       = "schema_parse_failed_after_retry: "
+ABSTAIN_REASON_OUT_OF_RANGE       = "score_out_of_range_after_retry: "
+ABSTAIN_REASON_GENUINE            = ""   # empty-prefix sentinel for rubric-allowed abstain
+```
+Calibration report imports + pattern-matches against typed constants for the four-way abstain-cause breakdown in the >20% threshold flag.
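One subtlety in matching against these constants: every string starts with the empty string, so the genuine-abstain sentinel has to be the fallback rather than one of the prefixes tested. A sketch of the classification (the helper names `abstain_cause` and `abstain_breakdown` are assumptions, and the constants are repeated here only to keep the example self-contained):

```python
from collections import Counter

ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: "
ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "
ABSTAIN_REASON_GENUINE = ""

# Only the non-empty prefixes are tested; "" would match any reasoning string.
_PREFIXED_CAUSES = (
    ABSTAIN_REASON_PROVIDER_EXHAUSTED,
    ABSTAIN_REASON_SCHEMA_PARSE,
    ABSTAIN_REASON_OUT_OF_RANGE,
)


def abstain_cause(reasoning: str) -> str:
    """Map an abstained result's reasoning to one of the four cause constants."""
    for prefix in _PREFIXED_CAUSES:
        if reasoning.startswith(prefix):
            return prefix
    return ABSTAIN_REASON_GENUINE


def abstain_breakdown(reasonings: list[str]) -> Counter:
    """Count abstains per cause for the report footnote."""
    return Counter(abstain_cause(r) for r in reasonings)
```

In this sketch a jury-level `jury_below_quorum` reason would fall through to the genuine bucket; a real report would presumably handle that case separately.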
+### First-attempt-failure log schema (fires on success-after-retry too)
+WARN-level structured log line, fixed key set, no schema drift. Uses `structlog` matching repo precedent at `agent_bench/evaluation/metrics.py:14` (`logger = structlog.get_logger()`):
+```python
+logger.warning(
+    "judge_first_attempt_failure",
+    judge_id=self.judge_id,
+    item_id=item.id,
+    provider=type(self.judge_provider).__name__,
+    failure_cause=ABSTAIN_REASON_SCHEMA_PARSE,  # one of the four constants
+    attempt_index=1,
+)
+```
+Fires on first-attempt failure regardless of whether the second attempt succeeds. The "first failed, second succeeded" branch is the most analytically interesting case — it tells you the reprompt is doing work rather than just consuming budget. Without this log, that branch is invisible.
+### Jury partial-failure (quorum)
+Per the jury subsection above: strict quorum default; per-member ScoreResults always written to sidecar; aggregate is `score="Unknown"` with `jury_below_quorum` reason if `successful_members < quorum`. Provider non-retryable in any member → jury raises immediately and cancels the sibling tasks (the `return_exceptions=False` + try/except pattern, *not* `return_exceptions=True` + inspection). The two look identical to a careless reader, but only the former surfaces the exception before every member has finished. Note that `asyncio.gather` does not cancel the remaining awaitables on its own when one raises, so the `except` handler must cancel them explicitly.
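Since `asyncio.gather` with `return_exceptions=False` propagates the first exception but leaves the other awaitables running, fail-fast cancellation needs an explicit handler. A sketch of that pattern, where `gather_fail_fast` and `NonRetryableProviderError` are hypothetical names and not the project's own:

```python
import asyncio
from typing import Any, Coroutine


class NonRetryableProviderError(Exception):
    """Hypothetical stand-in for a 401/400-style provider error."""


async def gather_fail_fast(coros: list[Coroutine[Any, Any, Any]]) -> list[Any]:
    """Run coroutines concurrently; on the first error, cancel the rest and re-raise."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        return await asyncio.gather(*tasks)
    except Exception:
        # gather has already propagated the first exception, but the
        # sibling tasks are still running: cancel them explicitly.
        for task in tasks:
            task.cancel()
        # Wait for the cancellations to settle without raising again.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise
```

On Python 3.11+, `asyncio.TaskGroup` provides sibling cancellation natively, though it wraps failures in an `ExceptionGroup`, which changes what the caller has to catch.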
+### Permutation wrapper failure
+Per the `rubric_permute` subsection above: any-permutation abstain → aggregate abstain. Per-permutation results written to sidecar.
+### Rubric construction validation
+`Rubric.from_markdown_file()` validates aggressively: scale ∈ {binary, three_point}, levels arity matches scale, every level has at least one anchored example with thinking-trace explanation, frontmatter has all required fields. ValidationError raises with file path + field path. Validation discipline is named explicitly in the spec because the alternative ("validate lazily on first score call") is the kind of thing that creeps in if not specified — and a malformed-rubric error on Day 2 after API budget has been spent is materially worse than a malformed-rubric error on Day 1.
+### Calibration report failure modes
+| Condition | Default behavior | `--strict` behavior |
+|---|---|---|
+| Hash mismatch | Raise after collect-all (first item expected/actual + full id list) — **applies to both modes; never warn** | Same |
+| Missing prediction (label exists, no prediction for `(item_id, dim)`) | WARN; exclude from κ; coverage row in footer | RAISE |
+| Missing label (prediction exists, no label) | WARN; exclude; coverage row in footer | RAISE |
+| κ undefined (insufficient variance after exclusion, or N<3 agreement-eligible) | Render `"—"` with footnote — **applies to both modes** | Same |
+| Abstain rate > 20% (strictly greater) | Render κ + footnote with cause breakdown — **applies to both modes** | Same |
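The strictly-greater-than boundary in the last row is easy to get wrong with `>=`. A small sketch using integer arithmetic (the function name `abstain_flag` is an assumption):

```python
def abstain_flag(n_abstain: int, n_total: int) -> bool:
    """True when the abstain rate is strictly greater than 20%."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # n_abstain / n_total > 1/5, rearranged to avoid division entirely.
    return n_abstain * 5 > n_total
```

At N=30 this means 6 abstains (exactly 20%) do not trigger the footnote and 7 abstains (about 23.3%) do, matching the boundary cases named in the test inventory.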
+## Testing
+### File layout
+Seven new files under `tests/evaluation/` matching the new module subpackages (listed in the test inventory below). Existing `tests/test_evaluation.py` stays at top level (precedent: `tests/test_langchain_baseline/`); the existing file's faithfulness/correctness assertions are dropped, but the file is not renamed (preserves git blame).
+### sklearn fixture pattern (κ parity tests)
+Four-part discipline:
+1. **Generation script** at `scripts/_dev/generate_kappa_fixtures.py` — committed; `_dev` prefix marks as not-runtime. Imports sklearn; documented to run from a venv outside the project. **Action item:** verify `_dev/*` is excluded from ruff/mypy via `pyproject.toml` (currently no `extend-exclude` set; add as part of this PR).
+2. **Inline constants** in `test_calibration_metrics.py` — `SKLEARN_KAPPA_FIXTURES: dict[str, float]` and `SKLEARN_KAPPA_INPUTS: dict[str, dict]`. Locality preserved, type-checked.
+3. **Version-pinned comment header** — `# Fixtures generated against scikit-learn==1.5.2 cohen_kappa_score on 2026-05-04` with regeneration instructions. Drift detection if sklearn behavior changes in a future version.
+4. **Load-bearing comment** — `# DO NOT add scikit-learn to the project's dependencies — these constants are the contract.` Prevents the well-meaning future contributor from "fixing" tests by importing sklearn at runtime.
+**Cross-check CI test:** the generation script writes its inputs to a JSON sidecar under `tests/evaluation/fixtures/sklearn_kappa_inputs.json`; a CI test asserts `SKLEARN_KAPPA_INPUTS` matches that JSON. Catches the "updated CASES list, forgot to regenerate" failure mode at CI time. Five lines of test code.
+**No sklearn parity for AC2 in v1.** sklearn doesn't have AC2; pulling `irrCAC` reintroduces the dependency problem one level over. Three hand-computed AC2 cases (perfect agreement, perfect disagreement, mid-range) where the formula reduces to inspection-verifiable values. v1.1 may add sympy-derived AC2 fixtures (script under `scripts/_dev/generate_ac2_fixtures.py` with sympy as dev-only dep, sympy intermediate steps printed for audit). v1.1 spec line: *"AC2 hand-computed fixtures are sympy-derived not arithmetic-derived; verification requires reading the sympy intermediate output, not just inspecting the test."*
+### Test inventory (~35 tests total)
+| File | Tests | Notes |
+|---|---|---|
+| `test_judges.py` | ~7 | ABC contract, MockJudge round-trip + LookupError, ScoreResult validation, abstain-with-prefix (parameterized over 3 causes), raise on non-retryable, first-attempt-failure log fires |
+| `test_rubric_loading.py` | ~6 | Construction validation (parameterized over 4 invalid cases), source_hash determinism, source_hash changes with content, permutation seed reproducibility, permutation changes prompt |
+| `test_calibration_metrics.py` | ~7 | 3 hand-computed κ cases + 3 sklearn-fixture parity + 1 bootstrap-CI seed reproducibility |
+| `test_jury_aggregation.py` | ~5 | mean, kappa_weighted, strict-quorum-abstain, sidecar capture, cancel-on-non-retryable |
+| `test_calibration_report.py` | ~6 | hash-mismatch with first-item detail, --strict raise, default WARN, undefined-κ dash, abstain-flag boundary 6/30 (does not fire) and 7/30 (fires), abstain breakdown by cause |
+| `test_harness_migration.py` | ~3 | judge_scores populated when configured, out_of_scope skipped, judge_provider config preserved |
+| `test_mockjudge_coverage.py` | ~1 | item.id walk across all goldens |
+| **Total** | **~35** | |
+The original "~15–20" estimate was made before the Error Handling section was designed. Designing error handling and not expanding the test count is the inconsistency: the abstain-cause logic is the highest-stakes-when-silently-wrong piece of the project (wrong abstain semantics → quietly wrong κ in the published report). If Day 3 budget runs short, the cuttable margin is `test_harness_migration.py` (integration-y, failures show up loudly); the metric-correctness and judge-failure-handling tests do not get cut.
+### Discipline conventions
+- Mocked providers everywhere. Zero network calls in CI. `MockProvider` for the underlying LLM; `MockJudge` for tests that need pre-baked verdicts.
+- `pytest-asyncio` (`asyncio_mode = "auto"` already set) for async tests.
+- Hand-computed κ cases include worked-out arithmetic in a comment block so a reader can verify the formula without running the test.
+- Larger reusable fixtures live under `tests/evaluation/fixtures/`; one-off small fixtures stay inline.
+### CI scope
+- All ~35 new tests run in `make test` in the existing GitHub Actions workflow. No new workflow files.
+- `make lint` covers new modules (ruff + mypy).
+- `make calibrate` and `make evaluate-judges` are **not** run in CI — they require API keys and burn budget. Manual invocation only.
+- **GitHub Actions config** explicitly omits provider keys via an empty `env:` block, preventing the "PR worked in upstream because secret was injected; fails in contributor's fork because no secret" failure mode.
+### README cost-disclosure obligation (separate from spec)
+`README.md` gets a "Targets that cost money" subheading with a four-column table (target, requires API key, approximate cost, what it produces). Not part of the spec body — a doc obligation owed to anyone running `make help` who shouldn't have to read the spec to know that `make calibrate` costs ~$2.
+## Calibration Methodology
+### Stratified sampling (30 items)
+Stratification across the actual 52 golden items (FastAPI 27 + K8s 25):
+FastAPI uses `category` as the stratification axis (the only typing in `tech_docs_golden.json`); K8s uses `question_type` (the CRAG 8-type taxonomy in `k8s_golden.json`). The 2 K8s items with `category: out_of_scope` are subsumed within their question_type stratum (most are within `false_premise`); they are not a separate K8s stratum.
+| Stratum | Available | Sampled |
+|---|---|---|
+| FastAPI retrieval | 19 | 5 |
+| FastAPI calculation | 3 | 1 |
+| FastAPI out-of-scope | 5 | 2 |
+| K8s simple | 6 | 4 |
+| K8s simple_w_condition | 4 | 3 |
+| K8s comparison | 4 | 3 |
+| K8s multi_hop | 6 | 4 |
+| K8s false_premise | 4 | 3 |
+| K8s set | 1 | 1 |
+| **Subtotal stratified** | **52** | **26** |
+| Spare slots (filled from highest-variance R@5 strata) | — | 4 |
+| **Total** | — | **30** |
+The K8s `time_sensitive=True` flag is an overlay attribute, not an exclusive stratum — 2 K8s items carry the flag and are sampled incidentally based on the question_type they belong to. The flag does not constrain sampling.
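The per-stratum draw in the table could be made reproducible with a seeded sampler along these lines. This is only a sketch: the document does not say how items were drawn, so the seeded `random.Random`, the function name `sample_stratified`, and the stratum keys are assumptions, and the four spare slots (filled from highest-variance R@5 strata) are not modeled.

```python
import random

# Counts taken from the stratification table: (available, sampled) per stratum.
AVAILABLE = {
    "fastapi_retrieval": 19, "fastapi_calculation": 3, "fastapi_out_of_scope": 5,
    "k8s_simple": 6, "k8s_simple_w_condition": 4, "k8s_comparison": 4,
    "k8s_multi_hop": 6, "k8s_false_premise": 4, "k8s_set": 1,
}
QUOTAS = {
    "fastapi_retrieval": 5, "fastapi_calculation": 1, "fastapi_out_of_scope": 2,
    "k8s_simple": 4, "k8s_simple_w_condition": 3, "k8s_comparison": 3,
    "k8s_multi_hop": 4, "k8s_false_premise": 3, "k8s_set": 1,
}


def sample_stratified(
    items_by_stratum: dict[str, list[str]], quotas: dict[str, int], seed: int = 0
) -> list[str]:
    """Draw quotas[s] item ids from each stratum, deterministically for a seed."""
    rng = random.Random(seed)
    chosen: list[str] = []
    for stratum in sorted(quotas):  # fixed iteration order for reproducibility
        pool = sorted(items_by_stratum[stratum])
        k = quotas[stratum]
        if k > len(pool):
            raise ValueError(f"stratum {stratum!r}: quota {k} exceeds pool {len(pool)}")
        chosen.extend(rng.sample(pool, k))
    return chosen
```

The quotas sum to the 26 stratified items; adding the 4 spare slots gives the 30-item calibration set.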
+**OOS items in calibration.** The 2 FastAPI items with `category: out_of_scope` (and however many of the sampled K8s false_premise items also carry `category: out_of_scope` — at most 2, since K8s has 2 OOS items total) follow the production harness gate: L2 judges are **skipped** for items where `category == "out_of_scope"` (the existing gate at `harness.py:153`). OOS items are still in the calibration set so that L1's `grounded_refusal` is exercised on the same items that produced labels. The κ-eligible item count per dimension is therefore at most 28 (30 minus the 2 FastAPI OOS) and possibly 26 (if both K8s OOS items get sampled into the K8s false_premise stratum); the writeup's κ table reports the actual N per row. This is the right cut because OOS handling is L1's job (deterministic refusal check) — judging "groundedness of a refusal" is methodologically incoherent (nothing to ground against).
+IDs locked in `agent_bench/evaluation/datasets/calibration_v1.json` with `version: "v1"` field and `system_config_git_sha: <commit>` (the git SHA of the commit producing `results/calibration_v1_system_outputs.json`; the field name carries the limitation, and v1.1 may add `system_config_resolved_hash` for stricter reproducibility).
+### FastAPI snippet authoring (calibration set only)
+The 8 FastAPI items in the calibration set get hand-snippeted before labeling begins. Snippets are **verbatim spans** from `data/tech_docs/`, not paraphrases — same convention as the existing K8s `source_snippets`. **Scope discipline:** only the 8 calibration items, not the full 27-item FastAPI golden. The remaining 19 FastAPI items can be backfilled in v1.1.
+If a verbatim span supporting the gold answer cannot be found, the gold answer is itself underspecified and the item is removed from the calibration set (replaced from the spare-slot stratum).
+Slots into Day 1 between sampling and labeling; ~30 min of additional work; Day 1 budget shifts from 8h to 8.5h.
+### Hand-labeling rules
+- Score by the rubric, not by intuition. If the rubric and intuition disagree, fix the rubric *after* the labeling pass — do not change the labels mid-pass.
+- Genuine uncertainty → `abstained: true` with note. Abstains are signal.
+- Track time per item; >2 minutes → rubric ambiguity, note it.
+- **No AI assistance on label values.** AI may help with the labeling notebook, JSONL formatting, schema validation. Label values are hand-authored.
+### Opus stress-test (rubric ambiguity assist)
+After hand-labeling, Claude Opus labels the same 30 items × 3 dimensions blind to the human labels. Disagreements are flagged as `rubric_ambiguous` for v1.1 rubric revision. **Labels are not changed.** The Opus output is a rubric-quality signal, not a ground-truth substitute. ~20 minutes of work; methodological texture for the writeup's calibration section.
+## Implementation Sequencing Notes
+### Rubric authoring order
+Write the **groundedness rubric first**, alone. Dry-fit it against 3–4 calibration items to test operationalizability before authoring the other two. *Then* write relevance and completeness using whatever pattern worked for groundedness. This converts rubric authoring from "three parallel risky tasks" into "one risky task plus two near-mechanical replications," compressing realistic time variance and reducing spillover risk. The dry-fit step is what makes the tactic load-bearing: if groundedness turns out to be ill-shaped, you know after one rubric, not after three.
+### Contingency cuts (priority order)
+If scope pressure forces cuts:
+1. Drop the citation deterministic-vs-LLM head-to-head section of the writeup (this section was already a stretch goal).
+2. Drop the per-judge individual κ table — keep only the variance ablation.
+3. Reduce the variance ablation to 4 rows (baseline → CoT → rubric+abstain → 2-judge jury), skipping rubric-permute.
+4. Reduce calibration set to 20 items if labeling has slipped — cite literature ceiling more heavily.
+**Do not cut:** the writeup itself, the κ numbers, the rubric files, the closing position-statement paragraph (when NOT to use LLM-judge). Those are non-negotiable.
+## Acceptance Gates
+Two gates with different scopes. The code PR is reviewable and mergeable independently of the writeup; coupling them creates an artificial blocker.
+### PR-open gate (required to merge `feat/judge-layer-v1`)
+- All ~35 new tests pass; full `make test` suite green; `make lint` clean.
+- `make calibrate` (which invokes `build-table --strict`) runs end-to-end from a clean checkout (with API keys) and produces `docs/_generated/kappa_table.md`.
+- `agent_bench/evaluation/metrics.py` no longer contains `answer_faithfulness`, `answer_correctness`, `_judge_call`, `_FAITHFULNESS_PROMPT`, or `_CORRECTNESS_PROMPT`.
+- `agent_bench/evaluation/harness.py` no longer imports the deleted functions; new judges populate `EvalResult.judge_scores`.
+- `evaluation.judge_provider` YAML field still functions (regression test).
+- DECISIONS.md has the supersession entry referencing file paths explicitly.
+- `docs/DESIGN.md` §"LLM-judge metrics" is rewritten to point at this design doc and `judge-design.md`.
+- `measurements/README.md` has the new row.
+- `README.md` has the "Targets that cost money" subheading.
+- `pyproject.toml` excludes `scripts/_dev/*` from ruff/mypy if not already excluded.
+- GitHub Actions workflow has an explicit empty `env:` block on the test job (verified to be documentation of existing behavior, not a behavior change — current workflow has no `env:` block and tests already run without provider keys via MockProvider).
+### v1-completion gate (lags PR merge by 1–2 days)
+The writeup is interview material, not a PR-merge dependency. It is produced from the merged PR's calibration runs and is committed separately.
+- `judge-design.md` (the writeup, separate file at `docs/judge-design.md`) is drafted with the κ ablation table copy-pasted in from `docs/_generated/kappa_table.md`.
+- DECISIONS supersession entry's file-path references resolve (the calibration-labels JSONL and the relevant `results/calibration_v1_judge_*.json` files exist on `main` post-merge).
+## Out of Scope (v1.1+)
+- 3rd judge (Mistral self-hosted via Modal) and quorum=2 default for the 3-judge jury.
+- Multi-seed self-consistency (T=0 ensemble) on top of rubric permutation.
+- DSPy / GEPA / MIPROv2 prompt optimization for rubric refinement.
+- Length-bias study, bypass tests, full pass^k sweep.
+- Langfuse self-host integration (judge call traces, cost dashboards).
+- Dual-pass intra-rater calibration (4–6 day calendar gap; replaces literature ceiling with measured intra-rater κ in the writeup).
+- Synthetic-anchor calibration set scaling (frontier-model-as-anchor on 200 items).
+- AC2 sympy-derived parity tests (sympy as dev-only dep; intermediate steps printed for audit).
+- Backfill `source_snippets` for the remaining 19 FastAPI golden items (only the 8 calibration items get snippets in v1).
+- `system_config_resolved_hash` (canonical serialization of resolved config) added alongside `system_config_git_sha` for stricter reproducibility across noise commits.
+- Citation faithfulness default-on (currently opt-in v1; `judge_dimensions` default extends to include it in v1.1).
+## Risks
+| Risk | Mitigation |
+|---|---|
+| Day 1 rubric authoring overflows 2.5h budget | The rubric-authoring sequencing tactic (Implementation Sequencing Notes) compresses variance; if all three rubrics need full 2.5h each, fall back to the Contingency cuts subsection |
+| Bootstrap CI half-width >0.15 at N=30 (κ values not defensibly distinct between rows) | Note in writeup; reduces strength of comparative claims but doesn't invalidate the table |
+| Jury κ worse than the better individual judge (kappa-weighting wrong, or worse judge drags mean) | Sanity-check before final table; possible switch to trimmed mean; sidecar JSONL preserves per-judge data either way |
+| Schema parse failures spike >20% on one dimension (rubric-prompt mismatch) | Abstain-rate flag surfaces in the report; fix prompt or rubric, rerun affected row only (config-file-per-row makes this cheap) |
+| Hand-labeling time exceeds 2h budget | Reduce to 20-item subset (contingency cut #4); cite literature ceiling more heavily in writeup |
+| Branch state at start (in-flight `docs/readme-test-count` README diff) | Land that 4-line PR first (~5 min — README test-count only; the previously-pending Option A DECISIONS entries and the warmup-penalty addendum already landed via commit `6409a40` on 2026-04-22, so they are not on the docs-PR critical path); branch `feat/judge-layer-v1` off updated main |
+---
+**End of design document.** Implementation plan to follow in `docs/plans/2026-05-04-judge-layer-v1-implementation.md` (produced via the `writing-plans` skill).

docs/plans/2026-05-04-judge-layer-v1-implementation.md ADDED

The diff for this file is too large to render. See raw diff

measurements/2026-05-04-judge-calibration-labels.jsonl ADDED

	@@ -0,0 +1,90 @@

+{"item_id": "q021", "dimension": "groundedness", "score": 1, "abstained": false, "notes": "600 seconds and preflight caching are supported; conversion is arithmetic", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
+{"item_id": "q021", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers the minutes conversion", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
+{"item_id": "q021", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers 600/60 = 10 minutes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
+{"item_id": "q010", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer makes unsupported GraphQL and library claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
+{"item_id": "q010", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly addresses whether native GraphQL schema generation exists", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
+{"item_id": "q010", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "reference answer is empty/missing for completeness", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
+{"item_id": "q027", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer makes unsupported load-balancing claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
+{"item_id": "q027", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly addresses FastAPI load balancing", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
+{"item_id": "q027", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "reference answer is empty/missing for completeness", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
+{"item_id": "q006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "v1.0 -> v1.1 flip: claim 'particularly useful for expensive operations like database connections' adds a use-case argument the snippet does not make. Snippet's get_db is an identifier in the example, not a use-case claim. Other claims entailed; this one is general LLM knowledge.", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
+{"item_id": "q006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers caching behavior and disabling mechanism", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
+{"item_id": "q006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers once-per-request cache and use_cache=False", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
+{"item_id": "q011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported claims about other endpoints and customization beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
+{"item_id": "q011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly gives the default Swagger UI endpoint", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
+{"item_id": "q011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers /docs and interactive documentation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
+{"item_id": "q012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported headers, response format, inheritance, and custom-handler claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
+{"item_id": "q012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains raising HTTPException in a route", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
+{"item_id": "q012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers HTTPException with status_code and detail", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
+{"item_id": "q023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "large code sample adds many unsupported implementation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
+{"item_id": "q023", "dimension": "relevance", "score": 1, "abstained": false, "notes": "on-topic but truncated before testing and dependency overrides", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
+{"item_id": "q023", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers error handling and CORS but misses TestClient/dependency_overrides", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
+{"item_id": "q025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "only cursor pagination is supported; response_model/background task claims are unsupported by snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
+{"item_id": "q025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses pagination, validation, and analytics logging", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
+{"item_id": "q025", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers cursor navigation, response_model, and BackgroundTasks", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
+{"item_id": "k8s_002", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported naming, storage, ordering, examples, and YAML details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
+{"item_id": "k8s_002", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares StatefulSet and Deployment use cases", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
+{"item_id": "k8s_002", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers stateless Deployment vs sticky StatefulSet identity and when to use each", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
+{"item_id": "k8s_014", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported port range, cloud-provider, production, and allocation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
+{"item_id": "k8s_014", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares NodePort and LoadBalancer Services", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
+{"item_id": "k8s_014", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers node IP/static port versus external load balancer and relationship", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
+{"item_id": "k8s_016", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported scheduler, nodeAffinity, and nodeName implementation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
+{"item_id": "k8s_016", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers Deployment vs DaemonSet scheduling difference", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
+{"item_id": "k8s_016", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers fixed replicas versus one copy on all or selected nodes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
+{"item_id": "k8s_004", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer gives unsupported Jaeger configuration guidance", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
+{"item_id": "k8s_004", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses Jaeger sidecar injection setup", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
+{"item_id": "k8s_004", "dimension": "completeness", "score": 1, "abstained": false, "notes": "notes corpus lacks Jaeger docs but fails to refuse as required", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
+{"item_id": "k8s_022", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported CEL and admission-controller deny alternatives", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
+{"item_id": "k8s_022", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers RBAC deny-rule question", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
+{"item_id": "k8s_022", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers no deny rules and not granting delete permission", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
+{"item_id": "k8s_024", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer gives unsupported Envoy ADS configuration", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
+{"item_id": "k8s_024", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses Envoy ADS sidecar configuration", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
+{"item_id": "k8s_024", "dimension": "completeness", "score": 1, "abstained": false, "notes": "notes corpus lacks Envoy ADS docs but fails to refuse as required", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
+{"item_id": "k8s_003", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported ingress-controller, EndpointSlice, kube-proxy, and DNAT details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
+{"item_id": "k8s_003", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains external HTTP traffic flow", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
+{"item_id": "k8s_003", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Ingress to Service to Pod routing and selector/load-balancing role", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
+{"item_id": "k8s_017", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "snippets do not support sequential order, retry policy, or lifecycle details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
+{"item_id": "k8s_017", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains init-container startup order", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
+{"item_id": "k8s_017", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers sequential init containers, completion before app containers, and failure retry", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
+{"item_id": "k8s_018", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported autoscaling/v2, memory/custom metric, and v1 comparison details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
+{"item_id": "k8s_018", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly identifies the HPA API version to use", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
+{"item_id": "k8s_018", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers autoscaling/v2 and why it supports memory/custom metrics", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
+{"item_id": "k8s_019", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "source only defines ConfigMap; mechanisms and update behavior are unsupported", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
+{"item_id": "k8s_019", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains how ConfigMap values reach Pods", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
+{"item_id": "k8s_019", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers env vars, volume mounts, and update behavior", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
+{"item_id": "k8s_025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported NodePort, ExternalIPs, Ingress, and Gateway claims beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
+{"item_id": "k8s_025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly lists Service exposure options", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
+{"item_id": "k8s_025", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers NodePort/LoadBalancer and ClusterIP/Ingress but misses ExternalName", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
+{"item_id": "k8s_001", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported sticky identity, rescheduling, headless service, and policy details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
+{"item_id": "k8s_001", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers StatefulSet Pod identity guarantees", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
+{"item_id": "k8s_001", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers ordinal, network identity, stable storage, and sticky identity", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
+{"item_id": "k8s_006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported consumption mechanisms and Secret guidance beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
+{"item_id": "k8s_006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains ConfigMap purpose and data type", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
+{"item_id": "k8s_006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers non-confidential key-value config and not storing secrets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
+{"item_id": "k8s_007", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported completion modes and configuration details beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
+{"item_id": "k8s_007", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains what Jobs do and completion criteria", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
+{"item_id": "k8s_007", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Pod creation, successful completions, retries, and completion state", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
+{"item_id": "k8s_009", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "source snippet names the four kinds but not the detailed role/binding explanations", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
+{"item_id": "k8s_009", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly names and explains the four RBAC object kinds", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
+{"item_id": "k8s_009", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Role, ClusterRole, RoleBinding, and ClusterRoleBinding with scope/use", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
+{"item_id": "k8s_005", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported warn-mode and workload-resource behavior beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
+{"item_id": "k8s_005", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares enforce and warn modes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
+{"item_id": "k8s_005", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers stable PSA, enforce rejection, warn allowance, and combined modes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
+{"item_id": "k8s_012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported default medium, memory accounting, size, and performance claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
+{"item_id": "k8s_012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares default emptyDir and Memory medium", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
+{"item_id": "k8s_012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers node-backed default, tmpfs Memory, speed, and memory-limit accounting", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
+{"item_id": "k8s_013", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported kill, Service traffic, grace-period, and best-practice details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
+{"item_id": "k8s_013", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares failed liveness and readiness probes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
+{"item_id": "k8s_013", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers liveness restart and readiness removal from traffic without restart", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
+{"item_id": "k8s_015", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported idempotency and deadline details beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
+{"item_id": "k8s_015", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares Job and CronJob usage", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
+{"item_id": "k8s_015", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers one-off task versus recurring cron-scheduled Jobs", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
+{"item_id": "k8s_023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "single snippet does not support bypass, host-network, or trusted-workload details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
+{"item_id": "k8s_023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers what privileged profile enforces", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
+{"item_id": "k8s_023", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers no restrictions, unrestricted policy, and bypassing isolation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
+{"item_id": "k8s_020", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported ingress/egress isolation rules beyond terse snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
+{"item_id": "k8s_020", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains non-isolated baseline and NetworkPolicy isolation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
+{"item_id": "k8s_020", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers baseline and allowed traffic but omits CNI enforcement point", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
+{"item_id": "k8s_011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported update semantics and traffic details beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}
+{"item_id": "k8s_011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares Secret volume and environment-variable consumption", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}
+{"item_id": "k8s_011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers both modes and update behavior difference", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}

measurements/2026-05-05-judge-rubric-opus-stress.jsonl ADDED

	@@ -0,0 +1,90 @@

+{"item_id": "k8s_001", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that StatefulSet Pods have a unique identity consisting of an ordinal, stable network identity, and stable storage. The answer makes many claims not supported by this single snippet: 'sticky identity' terminology, comparison to Deployments, podManagementPolicy field with OrderedReady and Parallel options, requirement of a Headless Service, and use cases like databases. None of these specific claims are entailed by the snippet provided.", "label_timestamp": "2026-05-05T14:29:41Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021464, "_latency_ms": 3804.864978999831}
+{"item_id": "k8s_001", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by describing the identity guarantees StatefulSets provide to Pods, including sticky identity, persistent identifiers, uniqueness/ordering, and network identity via headless service. It covers the core ask comprehensively.", "label_timestamp": "2026-05-05T14:34:20Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020344, "_latency_ms": 3798.0725439847447}
+{"item_id": "k8s_001", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has key points: (1) ordinal index, (2) stable network identity, (3) stable persistent storage, (4) identity sticks across rescheduling, (5) contrast with Deployment's interchangeable Pods. The answer covers sticky identity across rescheduling, network identity, and the contrast with Deployments. However, it does not explicitly mention ordinal index (only vaguely mentions 'ordering') and does not mention stable persistent storage at all. So partial coverage.", "label_timestamp": "2026-05-05T14:38:40Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0022176000000000006, "_latency_ms": 6259.8209959687665}
+{"item_id": "k8s_002", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains many claims that go far beyond the gold snippets. The snippets only support: (1) Deployment manages Pods for stateless workloads, and (2) StatefulSet maintains sticky identity per Pod. The answer adds numerous unsupported claims: ordinal-based naming (web-0, web-1), volumeClaimTemplates, headless Service requirement, ordered sequential operations, specific examples like MySQL/Cassandra/Kafka, YAML configurations, persistent volumes per Pod, etc. None of these specifics are entailed by the two brief snippets provided.", "label_timestamp": "2026-05-05T14:29:51Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0031912, "_latency_ms": 5066.343065002002}
+{"item_id": "k8s_002", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it compares StatefulSet and Deployment across multiple dimensions (pod identity, naming, network, storage, ordering) and explicitly explains when to prefer each with examples. This matches Example F's on-target comparison pattern.", "label_timestamp": "2026-05-05T14:34:29Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0027416000000000003, "_latency_ms": 4008.2699899794534}
+{"item_id": "k8s_002", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has two main parts: (1) Deployment manages stateless, interchangeable replicas with declarative updates, and (2) StatefulSet provides sticky identity with stable network IDs, stable persistent storage, and ordered deployment/scaling. The answer covers both: Deployment is described as managing stateless apps with interchangeable Pods, and StatefulSet is described with stable hostnames, persistent storage per Pod, ordered operations, and unique identity.", "label_timestamp": "2026-05-05T14:38:51Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0031176000000000003, "_latency_ms": 5786.756471032277}
+{"item_id": "k8s_003", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that Ingress exposes HTTP/HTTPS routes from outside the cluster, and that Services target Pods usually via a selector. The answer makes many claims not supported by the snippets: kube-proxy details, EndpointSlices, iptables, DNAT, ClusterIP virtual IP, load balancing mechanics, session affinity, etc. None of these are in the gold snippets.", "label_timestamp": "2026-05-05T14:30:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002524, "_latency_ms": 3977.3940969607793}
+{"item_id": "k8s_003", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the full traffic path from Ingress through Service to Pod, covering all three layers requested. It explains the Ingress controller's role at the edge, the Service layer with ClusterIP and EndpointSlices, kube-proxy's routing function, and final Pod delivery via DNAT.", "label_timestamp": "2026-05-05T14:34:38Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002372, "_latency_ms": 3949.3474710034207}
+{"item_id": "k8s_003", "dimension": "completeness", "score": 2, "abstained": false, "notes": "Reference points: (1) Ingress exposes HTTP/HTTPS routes from outside and maps to backend Services based on rules - covered. (2) Service is abstraction defining logical set of endpoints (Pods) using selector - covered. (3) Load-balances traffic across Pods - covered. (4) Service delivers traffic to container port each Pod exposes - covered.", "label_timestamp": "2026-05-05T14:39:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028088, "_latency_ms": 4416.859550983645}
+{"item_id": "k8s_004", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold source snippets are empty, meaning there are no snippets to ground any claims against. The answer provides substantive technical claims about Jaeger sidecar injection (labels, annotations, ports, etc.) that cannot be verified against the (empty) snippets. The answer also acknowledges the docs don't contain Jaeger-specific info but then proceeds to make many specific claims anyway, so this is not a pure refusal/abstain.", "label_timestamp": "2026-05-05T14:30:08Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002128, "_latency_ms": 3805.78425899148}
+{"item_id": "k8s_004", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses how to enable Jaeger sidecar injection in a Kubernetes Deployment, providing the relevant label/annotation methods, namespace-level injection, configuration options, and verification steps. This is on-target for the question's core ask.", "label_timestamp": "2026-05-05T14:34:46Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020648000000000003, "_latency_ms": 2984.175974968821}
+{"item_id": "k8s_004", "dimension": "completeness", "score": 0, "abstained": false, "notes": "The reference answer's key point is that the agent should refuse to answer because Jaeger is not covered in the Kubernetes documentation corpus. The agent's answer does the opposite - it provides detailed instructions on Jaeger sidecar injection rather than refusing. While it briefly notes 'the specific search results don't contain Jaeger-specific documentation,' it then proceeds to give a full answer anyway, which contradicts the reference's point about refusing and citing zero sources.", "label_timestamp": "2026-05-05T14:39:11Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0022992, "_latency_ms": 5866.323757043574}
+{"item_id": "k8s_005", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only contain two minimal pieces of information: that policy violations cause pod rejection, and a feature state stable in v1.25. The answer makes many claims that are not supported by these snippets, including: warn mode triggers user-facing warnings, enforce mode not being applied to workload resources while warn is, the specific label syntax, and the configuration examples. Only the claim about enforce mode rejecting pods is supported by snippet [1].", "label_timestamp": "2026-05-05T14:30:18Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020616000000000002, "_latency_ms": 4882.2048010188155}
+{"item_id": "k8s_005", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question about how Pod Security Admission behaves differently in enforce vs warn modes. It provides a clear comparison covering pod rejection behavior, user feedback, and applicability to workload resources, which are the key behavioral differences between these modes.", "label_timestamp": "2026-05-05T14:34:55Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020016, "_latency_ms": 3597.780067997519}
+{"item_id": "k8s_005", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points: enforce mode rejects pods on violation, warn mode allows pods but triggers user-facing warning, and modes can be combined on the same namespace. Namespace-level application is also mentioned via labels. The stable since v1.25 detail is missing but is contextual rather than a core behavioral point about enforce vs warn.", "label_timestamp": "2026-05-05T14:39:21Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002044, "_latency_ms": 4796.786619001068}
+{"item_id": "k8s_006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only supports that a ConfigMap is an API object used to store non-confidential data in key-value pairs. The answer makes many additional claims not supported by the snippet: the three ways pods consume ConfigMaps (env vars, command-line args, volumes), the warning about not storing passwords/API keys/tokens/certificates, the recommendation to use Secrets, and the existence of `data` and `binaryData` fields. None of these are entailed by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:27Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020984000000000003, "_latency_ms": 3795.695087988861}
+{"item_id": "k8s_006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly defines what a ConfigMap is (a Kubernetes API object for storing non-confidential data in key-value pairs) and explicitly addresses what kind of data should be stored in it (non-sensitive configuration data like settings, parameters, feature flags), while also clarifying what should not be stored. Both parts of the question are addressed on-target.", "label_timestamp": "2026-05-05T14:35:04Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002028, "_latency_ms": 4578.002344002016}
+{"item_id": "k8s_006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) ConfigMap is an API object storing non-confidential data in key-value pairs, (2) used for application configuration that doesn't need to be secret, and (3) confidential data like passwords/tokens should be in Secrets instead.", "label_timestamp": "2026-05-05T14:39:30Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0020288, "_latency_ms": 3426.676850998774}
+{"item_id": "k8s_007", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that a Job creates Pods and retries until a specified number successfully terminate. The answer makes many additional claims not supported by the snippet: NonIndexed vs Indexed completion modes, .spec.completionMode, completion indices 0 to N-1, annotations/labels like batch.kubernetes.io/job-completion-index, JOB_COMPLETION_INDEX env variable, .spec.parallelism, .spec.backoffLimit, and the 'Completed' state transition. None of these are entailed by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:36Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0023552, "_latency_ms": 3761.0972960246727}
+{"item_id": "k8s_007", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: what a Kubernetes Job does (runs finite/batch tasks to completion, creates Pods that run once) and how it decides the task is complete (based on completion mode, .spec.completions, NonIndexed vs Indexed). Both core asks are covered on-target.", "label_timestamp": "2026-05-05T14:35:13Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019904000000000002, "_latency_ms": 3753.890769032296}
+{"item_id": "k8s_007", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has these key points: (1) Job creates one or more Pods, (2) retries execution until specified number successfully terminate, (3) tracks successful completions, (4) complete when specified number reached, (5) deleting a Job cleans up the Pods it created. The answer covers points 1 (creates one or more Pods), 2 (implied via backoffLimit retries), 3 and 4 (tracks completions and completes when .spec.completions reached). However, point 5 (deleting a Job cleans up the Pods) is not mentioned in the answer.", "label_timestamp": "2026-05-05T14:39:40Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0024712, "_latency_ms": 5021.353780000936}
+{"item_id": "k8s_009", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that the RBAC API declares four kinds: Role, ClusterRole, RoleBinding, and ClusterRoleBinding. The answer adds many descriptive claims about each (e.g., 'Defines a set of permissions within a specific namespace', 'cluster-scoped rather than namespace-scoped', 'Grants the permissions defined in a Role to a user or set of users', 'connects a Role to subjects', etc.) that are not supported by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:45Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021512, "_latency_ms": 3805.1239320193417}
+{"item_id": "k8s_009", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly lists all four RBAC object kinds (Role, ClusterRole, RoleBinding, ClusterRoleBinding) and explains what each does, fully addressing the question's core ask.", "label_timestamp": "2026-05-05T14:35:21Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001976, "_latency_ms": 2915.8728439942934}
+{"item_id": "k8s_009", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: the four object kinds (Role, ClusterRole, RoleBinding, ClusterRoleBinding), that Roles/ClusterRoles contain permission rules while bindings grant them to users/groups/service accounts, and the namespace vs cluster scope distinction.", "label_timestamp": "2026-05-05T14:39:48Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002052, "_latency_ms": 3570.147737977095}
+{"item_id": "k8s_011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippet only states that Secrets can be mounted as data volumes or exposed (presumably as environment variables). It does not contain any of the detailed claims about automatic updates, subPath behavior, character restrictions, pod restart requirements, or eventual consistency. The answer makes many specific claims that are not supported by the single truncated snippet provided.", "label_timestamp": "2026-05-05T14:30:53Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019776, "_latency_ms": 3797.7397789945826}
+{"item_id": "k8s_011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question by comparing the two methods of consuming Secrets in Pods. It covers update behavior, restart requirements, naming constraints, and subPath limitations, providing a clear comparison of both approaches.", "label_timestamp": "2026-05-05T14:35:29Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019824, "_latency_ms": 3311.2139879958704}
+{"item_id": "k8s_011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) Secrets can be mounted as a data volume, (2) Secrets can be exposed as environment variables, (3) mounted volumes receive in-place updates when the Secret changes, and (4) environment variables are set at Pod start and don't update. The 'each key becomes a file' point is implicitly covered by 'Secrets are exposed as files in the mounted directory'.", "label_timestamp": "2026-05-05T14:39:57Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023824000000000002, "_latency_ms": 4128.535017021932}
+{"item_id": "k8s_012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that setting emptyDir.medium to 'Memory' causes Kubernetes to mount a tmpfs. The answer makes many additional claims not supported by the snippet: that default emptyDir is stored on node-backing media, that memory-backed files count against container memory limits, sizing behavior when no size is specified, cost/size comparisons, and other resource considerations. None of these are entailed by the single provided snippet.", "label_timestamp": "2026-05-05T14:31:02Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021024, "_latency_ms": 4015.2714860159904}
+{"item_id": "k8s_012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core comparison between default emptyDir and Memory-medium emptyDir, covering storage location, performance, memory accounting, and size limits for both cases.", "label_timestamp": "2026-05-05T14:35:37Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017872, "_latency_ms": 2831.339330004994}
+{"item_id": "k8s_012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) default emptyDir is stored on whatever medium backs the node (disk, SSD, or network storage), (2) setting medium to 'Memory' mounts a tmpfs/RAM-backed filesystem, and (3) tmpfs is fast but counts against the container's memory limit. The answer covers all three: it states the default storage location verbatim, describes Memory as tmpfs/RAM-backed, notes it's very fast, and explicitly states files count against the container's memory limit.", "label_timestamp": "2026-05-05T14:40:08Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023808, "_latency_ms": 5550.627671997063}
+{"item_id": "k8s_013", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that liveness probes are used to know when to restart a container, and readiness probes are used to know when a container is ready to accept traffic. The answer makes many additional claims not supported by the snippets: terminationGracePeriodSeconds behavior, deadlock examples, Pod Ready condition being set to false, kubelet continuing to run more probes periodically, best practice recommendations about shared endpoints and failureThreshold settings, etc. These claims may be true but are not entailed by the provided snippets.", "label_timestamp": "2026-05-05T14:31:12Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002268, "_latency_ms": 4528.785168018658}
+{"item_id": "k8s_013", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question by explaining how the kubelet responds differently to failing liveness vs readiness probes. It clearly states liveness failures cause container restart while readiness failures remove the pod from service endpoints without killing the container.", "label_timestamp": "2026-05-05T14:35:46Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019464, "_latency_ms": 4366.952612996101}
+{"item_id": "k8s_013", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has key points: (1) liveness failure \u2192 kubelet restarts container, (2) purpose is to recover from wedged state like deadlock, (3) readiness failure \u2192 container not restarted, (4) Pod marked not-ready and removed from Service load balancers. The answer covers all of these: liveness kills/restarts container, mentions deadlocks as use case, readiness keeps container running, and pod won't receive traffic from Services.", "label_timestamp": "2026-05-05T14:40:18Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023168000000000004, "_latency_ms": 4956.465922994539}
+{"item_id": "k8s_014", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only support two basic claims: NodePort exposes the Service on each Node, and LoadBalancer exposes the Service externally using an external load balancer. The answer makes many additional claims not supported by the snippets, including the static port range (30000-32767), the access pattern <NodeIP>:<NodePort>, the cluster IP creation, the relationship that LoadBalancer is built on NodePort, the spec.allocateLoadBalancerNodePorts flag, and production/development suitability claims. None of these are entailed by the two minimal snippets provided.", "label_timestamp": "2026-05-05T14:31:21Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002252, "_latency_ms": 4352.603528008331}
+{"item_id": "k8s_014", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core ask by comparing NodePort and LoadBalancer Service types, covering both sides with their characteristics, use cases, and relationship.", "label_timestamp": "2026-05-05T14:35:54Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017527999999999999, "_latency_ms": 2996.093010995537}
+{"item_id": "k8s_014", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) NodePort exposes Service on each Node's IP at a static port - explicitly stated; (2) LoadBalancer exposes externally using an external load balancer - explicitly stated; (3) Kubernetes doesn't directly provide the load balancer, requires cloud provider integration - explicitly stated; (4) LoadBalancer typically implemented on top of NodePort - explicitly stated.", "label_timestamp": "2026-05-05T14:40:26Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023512, "_latency_ms": 3537.0634549763054}
+{"item_id": "k8s_015", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains many claims not supported by the snippets. The snippets only state that Jobs are one-off tasks that run to completion and that CronJobs start Jobs on a repeating schedule. The answer adds unsupported claims about idempotency requirements, exactly-once semantics, two Jobs being created for a single schedule, startingDeadlineSeconds, and deadline handling behavior. None of these are in the snippets.", "label_timestamp": "2026-05-05T14:31:31Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021944, "_latency_ms": 4788.899898994714}
+{"item_id": "k8s_015", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: how CronJobs differ from Jobs (execution, scheduling, use case) and when to use one over the other. It also adds relevant considerations about idempotency and deadlines.", "label_timestamp": "2026-05-05T14:36:02Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001912, "_latency_ms": 2846.5557509916835}
+{"item_id": "k8s_015", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) Job runs a one-off task to completion, creating Pods and retrying until success; (2) CronJob creates Jobs on a repeating cron schedule for recurring actions like backups; (3) use Job for single batch run, CronJob for recurring schedule. The answer covers all three: Job runs once to completion for batch tasks, CronJob runs on cron schedule for recurring tasks like backups/reports, and explicitly states use Job for one-time and CronJob for recurring. The 'creates Pods and retries' detail is not explicitly mentioned but the run-to-completion concept is covered.", "label_timestamp": "2026-05-05T14:40:38Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0026271999999999997, "_latency_ms": 6194.046579999849}
+{"item_id": "k8s_016", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that a Deployment manages Pods for stateless workloads and that a DaemonSet ensures all/some Nodes run a copy of a Pod. The answer makes many additional claims not supported by the snippets: details about the Kubernetes scheduler, nodeAffinity rules added by the DaemonSet controller, the .spec.nodeName field being set, specific replica distribution examples, and common use cases like logging agents, monitoring daemons, web servers. None of these specifics are entailed by the two brief snippets provided.", "label_timestamp": "2026-05-05T14:31:51Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002292, "_latency_ms": 14661.54205496423}
+{"item_id": "k8s_016", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core scheduling difference between Deployment and DaemonSet, explaining that Deployments run a specified number of replicas distributed across nodes via the scheduler, while DaemonSets ensure one Pod per eligible node. Both sides of the comparison are covered.", "label_timestamp": "2026-05-05T14:36:11Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0018928, "_latency_ms": 3361.026384984143}
+{"item_id": "k8s_016", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has two key points: (1) Deployment schedules a fixed/configured number of replicas independent of node count, and (2) DaemonSet runs a Pod on each (matching) node, so replica count scales with nodes. The answer covers both: it states Deployment manages a specified number of replicas placed on any eligible nodes, and DaemonSet ensures all eligible nodes run one copy of a Pod. While it doesn't explicitly say 'as nodes are added DaemonSet Pods are added with them,' it conveys that the count is tied to eligible nodes.", "label_timestamp": "2026-05-05T14:40:50Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025031999999999997, "_latency_ms": 7288.97923103068}
+{"item_id": "k8s_017", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that init containers are like regular containers and run before the main application container. The answer makes many additional claims (sequential execution, retry behavior with restartPolicy, OnFailure handling, networking/storage readiness, one-time execution, exit code 0 requirement, etc.) that are not supported by the provided snippets.", "label_timestamp": "2026-05-05T14:32:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019376, "_latency_ms": 3801.133704953827}
+{"item_id": "k8s_017", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: the order of init and application container execution, and the guarantees Kubernetes provides. It explains sequential init container execution, blocking of app containers, and includes related guarantees about retries and restart policy.", "label_timestamp": "2026-05-05T14:36:19Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001876, "_latency_ms": 3559.6700820024125}
+{"item_id": "k8s_017", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has four key points: (1) init containers run one at a time in defined order, (2) each must complete before next starts, (3) regular containers start only after all init containers terminate successfully, (4) on failure, Pod restarts per restartPolicy and init sequence begins again, (5) suitable for one-time setup. The answer covers sequential ordering, completion requirement, app containers starting after init completion, and restartPolicy retry behavior. The 'one-time setup' purpose is implied at the end ('initialization tasks complete reliably before your application containers begin running').", "label_timestamp": "2026-05-05T14:41:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025688, "_latency_ms": 5247.3236820078455}
+{"item_id": "k8s_018", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippets only mention that HorizontalPodAutoscaler automatically updates a workload resource and reference a current stable version. They do not support specific claims about autoscaling/v2 being stable since v1.23, support for custom/memory/external metrics, multiple metrics evaluation, scale-up/scale-down behaviors, or that v1 only supports CPU. Nearly every claim in the answer is unsupported by the snippets.", "label_timestamp": "2026-05-05T14:32:09Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020632, "_latency_ms": 4089.8927800008096}
+{"item_id": "k8s_018", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it specifies the API version (autoscaling/v2) and explains why (stable support for custom metrics, memory metrics, multiple metrics, advanced features). Both the 'which' and 'why' components are covered.", "label_timestamp": "2026-05-05T14:36:28Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020032, "_latency_ms": 4004.737361043226}
+{"item_id": "k8s_018", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has these key points: (1) current stable HPA API is autoscaling/v2, (2) it adds memory and custom metrics beyond CPU-only autoscaling/v1, (3) new fields are preserved as annotations when working with autoscaling/v1, (4) use autoscaling/v2 for memory/custom metric scaling. The answer covers points 1, 2, and 4 clearly. Point 3 about annotation preservation in v1 is not mentioned. This is partial coverage.", "label_timestamp": "2026-05-05T14:41:09Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002352, "_latency_ms": 4204.742238041945}
+{"item_id": "k8s_019", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that a ConfigMap is an API object for storing non-confidential key-value pairs. The answer makes many claims about four mechanisms (env vars individual/bulk, volume mounts, API access), update behaviors, subPath limitations, kubelet handling, etc. None of these claims are supported by the single provided snippet.", "label_timestamp": "2026-05-05T14:32:18Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002272, "_latency_ms": 4105.534160975367}
+{"item_id": "k8s_019", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by listing and explaining the mechanisms Kubernetes provides to make ConfigMap values available to a Pod: individual env vars, bulk env vars, volume mounts, and direct API access. It includes examples and notes on update behavior, fully covering the core ask.", "label_timestamp": "2026-05-05T14:36:37Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0023304000000000003, "_latency_ms": 3380.863350990694}
+{"item_id": "k8s_019", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) env variables expose keys, (2) volume mount makes keys into files, (3) volume-mounted data updates in place while env vars require pod restart. The answer covers all three: it describes env vars, volume mounts with keys as filenames, and explicitly notes env vars require restart while volume mounts support dynamic updates.", "label_timestamp": "2026-05-05T14:41:19Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025688, "_latency_ms": 4229.396597947925}
+{"item_id": "k8s_020", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippets only contain two fragments: 'non-isolated' and 'namespaceSelector'. The answer makes many detailed claims about ingress/egress defaults, policyTypes, podSelector, connections from the Pod's own node, etc. While 'non-isolated' supports the default claim, the vast majority of claims (policyTypes values, podSelector mechanism, node connection allowance, egress rule behavior) are not supported by the minimal snippets provided.", "label_timestamp": "2026-05-05T14:32:26Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020512, "_latency_ms": 3578.6442419630475}
+{"item_id": "k8s_020", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it explains that by default Pods are non-isolated (and thus not subject to NetworkPolicy filtering in the restrictive sense), and it explains how applying a NetworkPolicy that selects a Pod changes the baseline by making the Pod isolated for the specified direction(s), allowing only explicitly permitted traffic.", "label_timestamp": "2026-05-05T14:36:45Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020232, "_latency_ms": 3397.536567004863}
+{"item_id": "k8s_020", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has four key points: (1) Pods are non-isolated by default, (2) a Pod becomes isolated when a NetworkPolicy selects it via podSelector, (3) only traffic allowed by the union of NetworkPolicies is permitted, (4) policies can target ingress/egress/both, and (5) the CNI plugin enforces the policy, not Kubernetes itself. The answer covers points 1, 2, 3, and 4 clearly. However, it does not mention that the CNI plugin is what enforces the policy.", "label_timestamp": "2026-05-05T14:41:28Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025376000000000005, "_latency_ms": 4806.188436981756}
+{"item_id": "k8s_022", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only supports the claim that RBAC is purely additive with no deny rules. The answer makes many additional claims about CEL-based AuthorizationPolicy (with specific API version and schema), ValidatingAdmissionPolicy syntax, Kubernetes 1.26+ version requirements, and specific YAML structures that are not supported by the single snippet provided.", "label_timestamp": "2026-05-05T14:32:35Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0023096, "_latency_ms": 3849.2560360464267}
+{"item_id": "k8s_022", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the user's question about writing an RBAC deny rule for blocking pod deletion. It correctly explains that RBAC doesn't support explicit deny rules and provides alternative approaches including not granting delete permission, admission controllers, etc. This directly engages with the core ask.", "label_timestamp": "2026-05-05T14:36:53Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0022656000000000004, "_latency_ms": 3391.0853110137396}
+{"item_id": "k8s_022", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference's key points are: (1) RBAC does not support deny rules, (2) rules are purely additive, (3) to block an action, simply don't grant the permission. The answer covers all of these: it states RBAC does not have explicit deny rules, is allow-based only, and recommends not granting the delete verb as the primary approach.", "label_timestamp": "2026-05-05T14:41:37Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0024144, "_latency_ms": 3988.3979110163637}
+{"item_id": "k8s_023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The only snippet provided is 'Unrestricted policy', which is a very minimal piece of information. The answer makes multiple specific claims: that the privileged profile is 'entirely unrestricted', that pods can 'bypass typical container isolation mechanisms', that pods can access the node's host network, and that it's aimed at system/infrastructure-level workloads managed by privileged trusted users. While the snippet's phrase 'Unrestricted policy' loosely supports the 'unrestricted' claim, the specific claims about bypassing container isolation, host network access, and the intended audience (system/infrastructure workloads, trusted users) are not entailed by the snippet.", "label_timestamp": "2026-05-05T14:32:45Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002084, "_latency_ms": 4419.9781229835935}
+{"item_id": "k8s_023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The question asks which container-isolation restrictions the 'privileged' profile enforces. The answer directly addresses this by stating that the privileged profile enforces no restrictions and is unrestricted by design, which is a direct on-topic answer to the core ask.", "label_timestamp": "2026-05-05T14:37:01Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017008000000000001, "_latency_ms": 2984.6201190375723}
+{"item_id": "k8s_023", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) privileged enforces none/absence of restrictions, (2) purposely-open and entirely unrestricted, (3) bypasses typical container isolation mechanisms with host network example. The only minor omission is the explicit recommendation to use baseline or restricted for actual isolation, but the answer does contrast with these more restrictive profiles.", "label_timestamp": "2026-05-05T14:41:47Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0018664, "_latency_ms": 4233.887287962716}
+{"item_id": "k8s_024", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold source snippets section is empty - there are no snippets provided to ground any claims against. The answer makes numerous technical claims about Envoy ADS configuration, bootstrap settings, Kubernetes deployment, and control plane setup, none of which can be supported by the (nonexistent) snippets. The answer itself acknowledges the documentation doesn't contain this information but then proceeds to provide unsupported claims rather than abstaining. Since the answer is not a pure refusal/abstention but instead provides substantive unsupported claims, it fails groundedness.", "label_timestamp": "2026-05-05T14:32:55Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0027576000000000002, "_latency_ms": 5302.7043039910495}
+{"item_id": "k8s_024", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question about configuring Envoy xDS ADS for sidecar proxies in a Kubernetes Deployment. It provides bootstrap configuration with ADS settings, a Kubernetes Deployment manifest with the Envoy sidecar, and discusses the control plane setup. Both aspects of the question (ADS configuration and Kubernetes Deployment integration) are covered.", "label_timestamp": "2026-05-05T14:37:11Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002908, "_latency_ms": 4470.350661023986}
+{"item_id": "k8s_024", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference states the correct answer is to refuse and cite zero sources because the corpus doesn't cover Envoy/xDS/ADS. The agent does acknowledge that the documentation doesn't contain this information ('The documentation corpus doesn't appear to contain specific Envoy ADS configuration documentation'), which partially covers the reference point. However, instead of refusing, it then provides extensive configuration guidance, which goes against the 'refuse' aspect of the reference. The acknowledgment of absence is present but the refusal stance is not maintained.", "label_timestamp": "2026-05-05T14:41:56Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0029904000000000003, "_latency_ms": 4743.537011963781}
+{"item_id": "k8s_025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only describe ClusterIP ('Exposes the Service on a cluster-internal IP') and LoadBalancer ('Exposes the Service externally using an external load balancer'). The answer makes many claims about NodePort, ExternalIPs, Ingress, Gateway, and details about LoadBalancer (cloud provider, stable external IP) that are not supported by the snippets. Multiple unsupported claims fail the strict conjunction rule.", "label_timestamp": "2026-05-05T14:33:05Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020808000000000003, "_latency_ms": 4765.863787964918}
+{"item_id": "k8s_025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by listing Kubernetes Service types that expose applications externally (NodePort, LoadBalancer) and adds ExternalIPs as additional context, while clarifying ClusterIP is internal-only. This directly addresses the core ask.", "label_timestamp": "2026-05-05T14:37:19Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017576000000000002, "_latency_ms": 3343.788483995013}
+{"item_id": "k8s_025", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference identifies three external Service types: NodePort, LoadBalancer, and ExternalName, plus mentions ClusterIP as default and Ingress as alternative. The answer covers NodePort, LoadBalancer, ClusterIP as default, and Ingress. However, it does not mention ExternalName (instead it discusses ExternalIPs, which is a different concept). So one of the three external-exposing types is missing.", "label_timestamp": "2026-05-05T14:42:06Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0021791999999999996, "_latency_ms": 4295.65988300601}
+{"item_id": "q006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer's core claims about default caching within a single request and using use_cache=False to disable it are supported by the snippet. However, the answer adds claims not supported: 'particularly useful for expensive operations like database connections' is an embellishment not in the snippet. More critically, the example shows using use_cache=False on two separate Depends calls in the same handler and claims this 'ensures that the dependency function is called twice' \u2014 but the snippet only describes caching when the same dependency is used multiple times (e.g., route + sub-dependency). The performance improvement claim ('improves performance by avoiding redundant function calls') is a reasonable inference but not explicitly stated. The strict-conjunction rule means unsupported claims fail the rubric.", "label_timestamp": "2026-05-05T14:28:25Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0026408, "_latency_ms": 6825.492414005566}
+{"item_id": "q006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: how dependency caching works in FastAPI (per-request caching of dependency results) and how to disable it (using use_cache=False). Both core asks are covered with explanation and example.", "label_timestamp": "2026-05-05T14:33:13Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001804, "_latency_ms": 3368.1467549758963}
+{"item_id": "q006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers both reference points: (1) FastAPI caches dependency results so each dependency is called only once per request, and (2) caching can be disabled by setting use_cache=False in Depends().", "label_timestamp": "2026-05-05T14:37:29Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.001668, "_latency_ms": 5226.763973012567}
+{"item_id": "q010", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "No gold snippets are provided to support any of the claims in the answer. The answer makes multiple specific claims (FastAPI generates OpenAPI 3.1.0, no native GraphQL support, specific third-party libraries like Strawberry/Graphene/Ariadne) that cannot be verified against the empty snippet set.", "label_timestamp": "2026-05-05T14:28:33Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0016656000000000002, "_latency_ms": 2784.843849018216}
+{"item_id": "q010", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question of whether FastAPI can natively generate GraphQL schemas, clearly stating it cannot, and provides additional context about alternatives.", "label_timestamp": "2026-05-05T14:33:21Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0016496000000000002, "_latency_ms": 2810.719419037923}
+{"item_id": "q010", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "Without a gold reference provided, I cannot assess coverage of specific reference points against the answer.", "label_timestamp": "2026-05-05T14:37:38Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0012887999999999999, "_latency_ms": 3803.2577409758233}
+{"item_id": "q011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only confirm that `/docs` is the Swagger UI endpoint and that FastAPI exposes three documentation-related endpoints by default. The answer makes many additional claims not supported by the snippets: the specific names `/redoc` and `/openapi.json`, that ReDoc is the alternative viewer, that `/openapi.json` returns the raw OpenAPI schema, and the customization/disabling code with parameters like `docs_url`, `redoc_url`, `openapi_url`. None of these specifics are entailed by the two snippets provided.", "label_timestamp": "2026-05-05T14:28:43Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021912, "_latency_ms": 5105.2629159530625}
+{"item_id": "q011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by stating that the default Swagger UI endpoint in FastAPI is /docs, and provides additional helpful context about other documentation endpoints and customization.", "label_timestamp": "2026-05-05T14:33:29Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0015224000000000001, "_latency_ms": 3036.487515026238}
+{"item_id": "q011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference states two points: (1) the default Swagger UI endpoint is /docs, and (2) it provides interactive API documentation. The answer covers both, explicitly stating /docs is the endpoint and describing it as an interactive API explorer.", "label_timestamp": "2026-05-05T14:37:46Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0016392, "_latency_ms": 2931.246972992085}
+{"item_id": "q012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains several claims not supported by the snippets. The snippets only mention HTTPException as the primary way to return errors, that it terminates processing, returns status code and detail, and that detail can be string/list/dict serialized to JSON. Unsupported claims include: the 'headers' parameter, the example with custom headers, the default JSON response format showing {\"detail\": \"Item not found\"}, that HTTPException inherits from Starlette's HTTPException, and that you can create custom exception handlers. These are not entailed by the gold snippets.", "label_timestamp": "2026-05-05T14:28:52Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.00242, "_latency_ms": 4274.66034895042}
+{"item_id": "q012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses how to raise an HTTP error in a FastAPI route handler by explaining the use of HTTPException, providing code examples, parameters, and additional context. This is a direct, on-target response to the question's core ask.", "label_timestamp": "2026-05-05T14:33:37Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0018336000000000001, "_latency_ms": 3302.3087149485946}
+{"item_id": "q012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) raise HTTPException, (2) with a status_code, (3) and a detail message. The answer covers all three explicitly, showing the raise statement, status_code parameter, and detail parameter.", "label_timestamp": "2026-05-05T14:37:55Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0017992000000000001, "_latency_ms": 4265.608784975484}
+{"item_id": "q021", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet states max_age=600 seconds means the browser caches preflight results. The answer's core claim that 600 seconds = 10 minutes is a math conversion supported by the snippet's value. However, the answer adds claims about the 'Access-Control-Max-Age' header name and that max_age is a 'directive' specifying maximum cache time of preflight requests \u2014 these specifics (header name in particular) are not in the snippet. The snippet only mentions 'max_age' as a config with value 600 and that browser caches preflight results.", "label_timestamp": "2026-05-05T14:29:03Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0018016, "_latency_ms": 5443.063500977587}
+{"item_id": "q021", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by converting 600 seconds to 10 minutes and explaining the calculation.", "label_timestamp": "2026-05-05T14:33:45Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0012775999999999998, "_latency_ms": 2611.7517079692334}
+{"item_id": "q021", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference states that with max_age 600 seconds, the browser caches preflight results for 10 minutes. The answer covers exactly this: 10 minutes caching, with the 600/60=10 calculation shown.", "label_timestamp": "2026-05-05T14:38:03Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0014024, "_latency_ms": 2987.084314983804}
+{"item_id": "q023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer makes many specific claims that go far beyond the two snippets provided. The snippets only state that HTTPException is the primary way to return error responses and that CORS is configured using CORSMiddleware from Starlette. The answer includes extensive code with specific parameters (allow_credentials, allow_methods, expose_headers, max_age), custom exception handlers with @app.exception_handler decorator syntax, RequestValidationError handling, dependency injection patterns, and many other details not supported by the snippets.", "label_timestamp": "2026-05-05T14:29:12Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.00298, "_latency_ms": 4431.315872003324}
+{"item_id": "q023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses all three parts of the question: custom error handling (custom exception classes and handlers), CORS middleware (CORSMiddleware setup), and begins setting up dependency injection for testing. Although the answer is truncated mid-code, it clearly engages with the core ask on-topic.", "label_timestamp": "2026-05-05T14:33:53Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0024936000000000003, "_latency_ms": 3396.7526520136744}
+{"item_id": "q023", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has three key points: (1) custom error handling via HTTPException/exception handlers, (2) CORS via CORSMiddleware with allowed origins, (3) testing with TestClient and app.dependency_overrides. The answer covers exception handlers extensively (point 1) and CORSMiddleware with allow_origins (point 2). However, the answer is cut off and does not cover testing with TestClient or app.dependency_overrides (point 3 missing).", "label_timestamp": "2026-05-05T14:38:13Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028344, "_latency_ms": 4287.58555999957}
+{"item_id": "q025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The only snippet provided describes cursor-based pagination as using an opaque token pointing to the last item in the previous page, to avoid performance issues with large offsets. The answer makes many claims far beyond this: specific Pydantic models, field constraints (min_length, max_length, le=100), base64 encoding of cursors, analytics logging structure, FastAPI implementation details, etc. None of these specifics are supported by the single snippet. While the general concept of cursor pagination is mentioned, the vast majority of the answer's claims (validation rules, base64 encoding scheme, analytics fields, etc.) are not entailed by the snippet.", "label_timestamp": "2026-05-05T14:29:23Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0030064000000000002, "_latency_ms": 5922.551519994158}
+{"item_id": "q025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses all three components of the question: cursor-based pagination (with encode/decode utilities and cursor query params), response model validation (Pydantic models with Field validators), and background task processing for analytics logging (FastAPI BackgroundTasks with analytics logger). Although the code is cut off at the end, the answer substantively engages with the core ask.", "label_timestamp": "2026-05-05T14:34:02Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002676, "_latency_ms": 4254.6321749687195}
+{"item_id": "q025", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all three reference points: (1) cursor-based pagination with opaque cursor tokens (shown via base64 encoding/decoding utilities and cursor parameter), (2) response_model parameter on route decorators for validation (shown in @app.get with response_model=PaginatedResponse[Item]), and (3) analytics logging handled asynchronously via FastAPI's BackgroundTasks (imported and used as a parameter).", "label_timestamp": "2026-05-05T14:38:22Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028224, "_latency_ms": 3933.714357961435}
+{"item_id": "q027", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "There are no gold source snippets provided. With no snippets, no claims in the answer can be grounded/supported. The answer makes multiple substantive claims (about Uvicorn workers, Nginx/HAProxy, Kubernetes, statelessness, etc.) that cannot be verified against any snippet. This is not a refusal answer, so abstain doesn't apply.", "label_timestamp": "2026-05-05T14:29:32Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019168000000000002, "_latency_ms": 3519.5088270120323}
+{"item_id": "q027", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The question asks how FastAPI handles automatic load balancing across multiple servers. The answer directly addresses this by clarifying that FastAPI does not provide built-in load balancing, explains what it does provide (worker processes, ASGI foundation), and describes how load balancing is actually achieved via external tools. This directly engages with the question's core ask.", "label_timestamp": "2026-05-05T14:34:11Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020432000000000002, "_latency_ms": 3767.282517044805}
+{"item_id": "q027", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "No gold reference answer was provided to compare against. Without reference points to check coverage, the score cannot be determined.", "label_timestamp": "2026-05-05T14:38:29Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.001504, "_latency_ms": 2372.651186946314}
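
Reviewer note: the records above share a flat schema (`item_id`, `dimension`, `score`, `abstained`), and abstained rows carry the string `"Unknown"` instead of a numeric score. The sketch below shows one way such a file could be summarized per dimension while keeping abstentions out of the mean. It is a minimal illustration, not the repository's own reporting code: the `summarize` helper and the `SAMPLE` rows are hypothetical, and the sample rows are abbreviated to the four fields the helper reads.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Return {dimension: (mean_score_or_None, n_scored, n_abstained)}.

    Hypothetical helper: accepts raw JSONL lines, optionally prefixed
    with the "+" marker that a diff view adds to each added line.
    """
    scores = defaultdict(list)
    abstained = defaultdict(int)
    for raw in lines:
        line = raw.strip()
        if line.startswith("+"):  # tolerate diff "+" markers
            line = line[1:]
        if not line:
            continue
        rec = json.loads(line)
        dim = rec["dimension"]
        score = rec.get("score")
        # Abstained rows store "Unknown"; count them separately.
        if rec.get("abstained") or not isinstance(score, (int, float)):
            abstained[dim] += 1
            continue
        scores[dim].append(score)
    summary = {}
    for dim in sorted(set(scores) | set(abstained)):
        vals = scores.get(dim, [])
        mean = sum(vals) / len(vals) if vals else None
        summary[dim] = (mean, len(vals), abstained.get(dim, 0))
    return summary


# Abbreviated sample rows (assumption: only the fields used above).
SAMPLE = [
    '+{"item_id": "q027", "dimension": "groundedness", "score": 0, "abstained": false}',
    '+{"item_id": "q027", "dimension": "relevance", "score": 2, "abstained": false}',
    '+{"item_id": "q027", "dimension": "completeness", "score": "Unknown", "abstained": true}',
    '+{"item_id": "q025", "dimension": "completeness", "score": 2, "abstained": false}',
]

if __name__ == "__main__":
    for dim, (mean, n, n_abs) in summarize(SAMPLE).items():
        print(f"{dim}: mean={mean} scored={n} abstained={n_abs}")
```

Counting abstentions separately, rather than coercing `"Unknown"` to 0, avoids pulling the per-dimension mean down for items such as q010 and q027 that have no gold reference.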

measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl ADDED

	@@ -0,0 +1,5 @@

+{"item_id": "q006", "prior": 1, "new": 2, "gold": 2, "reasoning": "The answer covers the key points from the reference about how FastAPI caches dependency results and how to disable caching. It explains that dependencies are called only once per request and provides the method to disable caching using `use_cache=False`. All essential points from the reference are present, albeit with additional detail and explanation.", "evidence_quotes": ["FastAPI implements automatic caching of dependency results within a single request.", "To disable caching and force a fresh call each time a dependency is needed, use the `use_cache` parameter set to `False`."], "shifted_up": true}
+{"item_id": "q011", "prior": 1, "new": 2, "gold": 2, "reasoning": "The answer covers the key point from the reference about the default Swagger UI endpoint being /docs. However, it also includes additional details about the availability of the endpoint, other documentation endpoints, and customization options, which are not part of the reference. Since the main point about the default endpoint is clearly stated, the answer is considered to have full coverage of the reference's key point.", "evidence_quotes": ["/docs"], "shifted_up": true}
+{"item_id": "k8s_002", "prior": 1, "new": 2, "gold": 2, "reasoning": "The answer provides a detailed comparison between Deployments and StatefulSets, covering all key points from the reference answer. It explains that a Deployment manages interchangeable Pods for stateless applications and provides declarative updates, while a StatefulSet maintains a sticky identity for each Pod, including stable network identifiers, stable persistent storage, and ordered deployment/scaling. All aspects of the reference are addressed, including the use cases for each type. Therefore, the answer fully covers the reference points.", "evidence_quotes": ["A Deployment manages a set of Pods for an application workload that does not maintain state and provides declarative updates; its Pods are interchangeable replicas.", "A StatefulSet, by contrast, maintains a sticky identity for each of its Pods \u2014 stable network identifiers, stable persistent storage, and ordered deployment/scaling."], "shifted_up": true}
+{"item_id": "k8s_006", "prior": 1, "new": 1, "gold": 2, "reasoning": "The answer covers the key point that a ConfigMap is an API object used to store non-confidential data in key-value pairs, which is a direct match to the gold reference. However, it does not mention that ConfigMaps are intended for application configuration that does not need to be kept secret, nor does it mention that confidential data should be stored in a Secret instead of a ConfigMap. Therefore, while one key point is fully covered, the second key point is missing.", "evidence_quotes": ["A ConfigMap is a Kubernetes API object used to store non-confidential data in key-value pairs."], "shifted_up": false}
+{"item_id": "k8s_018", "prior": 1, "new": 1, "gold": 2, "reasoning": "The answer covers several key points from the reference, including the current stable version of the HorizontalPodAutoscaler API (autoscaling/v2), the support for scaling on memory and custom metrics, and the distinction between autoscaling/v1 and autoscaling/v2. However, it does not explicitly mention that the new fields in autoscaling/v2 are preserved as annotations when working with autoscaling/v1, which is a key point in the reference. Therefore, while it covers most of the reference's points, it does not cover all of them.", "evidence_quotes": ["The `autoscaling/v2` API version is the current stable version (stable since Kubernetes v1.23) that includes support for scaling on **custom metrics**.", "The `autoscaling/v2` API version includes support for scaling on **memory metrics**, which is not available in the older `autoscaling/v1` version.", "With `autoscaling/v2`, you can specify multiple metrics for a HorizontalPodAutoscaler to scale on.", "When you create a HorizontalPodAutoscaler for scaling on custom or memory metrics, you must use `apiVersion: autoscaling/v2` in your manifest."], "shifted_up": false}

measurements/2026-05-06-4a-gpt4o-full-probe.jsonl ADDED

	@@ -0,0 +1,5 @@

+{"reasoning": "The answer covers all the key points from the reference. It mentions that a ConfigMap is used to store non-confidential data in key-value pairs, which aligns with the reference's description. It also states that confidential data should be stored in a Secret, not a ConfigMap, which matches the reference's point about where to store sensitive information.", "evidence_quotes": ["A ConfigMap is a Kubernetes API object used to store non-confidential data in key-value pairs.", "Do NOT store sensitive or confidential data in ConfigMaps", "For sensitive data, you should use Kubernetes Secrets instead"], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "cost_usd": 0.00022004999999999998, "latency_ms": 3570.2048700186424, "item_id": "k8s_006", "mini_prior_score": 1, "gold_score": 2}
+{"reasoning": "The answer covers all the key points from the reference. It mentions that the current stable version is autoscaling/v2, which supports scaling on memory and custom metrics, similar to the reference. It also notes that autoscaling/v1 only supports CPU-based scaling, aligning with the reference's point about using autoscaling/v2 for memory or custom metric scaling.", "evidence_quotes": ["The `autoscaling/v2` API version is the current stable version", "includes support for scaling on **custom metrics**", "includes support for scaling on **memory metrics**", "Only supports CPU-based scaling"], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "cost_usd": 0.00023295, "latency_ms": 1422.9646890307777, "item_id": "k8s_018", "mini_prior_score": 1, "gold_score": 2}
+{"reasoning": "The agent's answer covers the key point from the reference answer, which is that the default Swagger UI endpoint in FastAPI is `/docs`. The additional details provided by the agent do not affect the completeness score as they do not detract from the coverage of the reference's key point.", "evidence_quotes": ["The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface.", "The default Swagger UI endpoint in FastAPI is **`/docs`**."], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "cost_usd": 0.0001923, "latency_ms": 1622.3357539856806, "item_id": "q011", "mini_prior_score": 1, "gold_score": 2}
+{"reasoning": "The agent's answer covers all the key points from the reference answer. The reference states that an HTTP error in FastAPI is raised by using an HTTPException with a status_code and a detail message. The agent's answer explains the use of HTTPException, including the status_code and detail parameters, and provides examples that illustrate these points. Therefore, the agent's answer fully covers the reference's key points.", "evidence_quotes": ["The primary way to raise HTTP errors in FastAPI is using the `HTTPException` class:", "- **status_code** (required): The HTTP status code to return (e.g., 404, 403, 400)", "- **detail** (optional): The error message/details. Can be a string, list, or dictionary - FastAPI automatically serializes it to JSON"], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "cost_usd": 0.0002484, "latency_ms": 2396.0261089960113, "item_id": "q012", "mini_prior_score": 1, "gold_score": 2}
+{"reasoning": "The agent's answer covers all the key points from the reference answer. It mentions the unique identity composed of an ordinal index, stable network identity, and persistent storage, which are the core components of the reference's identity description. Additionally, it explains the persistence of identity across rescheduling, similar to the reference's explanation of identity sticking to each Pod.", "evidence_quotes": ["StatefulSets maintain a **sticky identity** for each of its Pods.", "Each Pod has a **persistent identifier** that remains consistent even if the Pod is rescheduled or restarted.", "StatefulSets require a **Headless Service** to be responsible for the network identity of the Pods."], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "cost_usd": 0.00023145, "latency_ms": 2257.6226279488765, "item_id": "k8s_001", "mini_prior_score": 1, "gold_score": 2}

measurements/2026-05-06-gpt4o-extraction-reasoning-split.md ADDED

	@@ -0,0 +1,162 @@

+# GPT-4o-mini extraction-vs-reasoning split — three calibration items
+This artifact documents the specific failure mode behind v1.1 finding 3 (jury
+κ regression on completeness): on each of the three representative
+disputed cells (gold=2, Haiku=2, gpt-4o-mini=1), gpt-4o-mini's
+`evidence_quotes` correctly extracts the paraphrased coverage from the
+agent's answer, and then its `reasoning` field denies that those quotes
+constitute coverage. The score follows the reasoning, not the evidence.
+The mechanism is a *post-extraction reasoning regression*: the
+structured-output discipline forces the model to commit to an extraction
+step (which it does correctly), and then the reasoning step applies a
+literal-string-match standard to the answer text, even though the
+rubric says "paraphrase allowed". The pattern appears to be more
+pronounced in smaller models, plausibly because the reasoning step has
+less capacity to reconcile the rubric's "paraphrase allowed"
+instruction with the literal-text comparison the model falls back on
+by default.
+Source: `results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`
+filtered to `judge_id="gpt-4o-mini-2024-07-18_completeness"` joined to
+gold labels in `measurements/2026-05-04-judge-calibration-labels.jsonl`.
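A minimal sketch of the filter-and-join described above, not taken from the repository. It assumes each member row carries `judge_id`, `item_id`, and `score`, and each gold row carries `item_id`, `dimension`, and `score`; the gold-label schema is not shown in this diff, so those field names are an assumption, and the inline rows are illustrative stand-ins for the two files (`demo_001` is a hypothetical item).

```python
import json


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def disputed_cells(member_rows, gold_rows, judge_id, dimension="completeness"):
    """Return rows where the given judge's score differs from the gold score."""
    # Index gold scores by item_id for the dimension of interest.
    gold = {r["item_id"]: r["score"] for r in gold_rows if r.get("dimension") == dimension}
    disputed = []
    for row in member_rows:
        if row.get("judge_id") != judge_id:
            continue  # keep only the judge under inspection
        gold_score = gold.get(row["item_id"])
        if gold_score is None or row["score"] == gold_score:
            continue  # no gold label for this item, or the judge agrees with gold
        disputed.append({"item_id": row["item_id"], "judge": row["score"], "gold": gold_score})
    return disputed


# Illustrative stand-ins for the members file and the gold-label file.
MEMBERS = """
{"judge_id": "gpt-4o-mini-2024-07-18_completeness", "item_id": "q006", "score": 1}
{"judge_id": "gpt-4o-mini-2024-07-18_completeness", "item_id": "k8s_002", "score": 1}
{"judge_id": "gpt-4o-mini-2024-07-18_completeness", "item_id": "demo_001", "score": 2}
{"judge_id": "claude-haiku-4-5-20251001_completeness", "item_id": "q006", "score": 2}
"""
GOLD = """
{"item_id": "q006", "dimension": "completeness", "score": 2}
{"item_id": "k8s_002", "dimension": "completeness", "score": 2}
{"item_id": "demo_001", "dimension": "completeness", "score": 2}
"""

cells = disputed_cells(
    load_jsonl(MEMBERS), load_jsonl(GOLD), "gpt-4o-mini-2024-07-18_completeness"
)
print([c["item_id"] for c in cells])
```

Joining on `item_id` alone works here because the judge id already encodes the dimension; if one judge id spanned several dimensions, the join key would need to include `dimension` on both sides.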
+---
+## Item q006 — FastAPI dependency caching (gold=2, Haiku=2, gpt=1)
+**Reference (gold) answer:** "FastAPI caches dependency results so each
+dependency is called only once per request; caching can be disabled by
+setting `use_cache=False` in the `Depends()` call."
+### gpt-4o-mini evidence_quotes (extraction step — correct)
+```
+[
+  "FastAPI implements automatic caching of dependency results within a single request.",
+  "To disable caching and force a fresh call each time a dependency is needed, use the `use_cache` parameter set to `False`."
+]
+```
+Both gold points are present in the extracted quotes: (1) per-request
+caching, (2) `use_cache=False` mechanism.
+### gpt-4o-mini reasoning (literal-match regression)
+> "The answer provides a detailed explanation of how dependency caching
+> works in FastAPI and how to disable it, but it does not explicitly
+> mention that caching is enabled by default and that it can be disabled
+> by setting `use_cache=False` in the `Depends()` call. While it covers
+> the concept of caching and provides an example, it does not directly
+> address the key points from the reference answer."
+**The reasoning denies what the evidence_quotes confirm.** The model has
+the strings "automatic caching of dependency results within a single
+request" and "use the `use_cache` parameter set to `False`" in its own
+extracted evidence, then writes "does not explicitly mention that caching
+is enabled by default and that it can be disabled by setting
+`use_cache=False`". Score=1 follows the reasoning.
+---
+## Item k8s_002 — Deployment vs StatefulSet (gold=2, Haiku=2, gpt=1)
+**Reference (gold) answer:** "Deployment manages stateless pods with
+declarative updates; StatefulSet maintains sticky identity for each pod
+— stable network identifiers, stable persistent storage, ordered
+operations."
+### gpt-4o-mini evidence_quotes (extraction step — correct)
+```
+[
+  "A Deployment manages a set of Pods for an application workload that does not maintain state and provides declarative updates; its Pods are interchangeable replicas.",
+  "A StatefulSet, by contrast, maintains a sticky identity for each of its Pods — stable network identifiers, stable persistent storage, and ordered deployment/scaling."
+]
+```
+The evidence_quotes literally contain the strings `"declarative updates"`
+and `"sticky identity"` — the two reference phrases gpt's reasoning will
+claim are missing.
+### gpt-4o-mini reasoning (literal-match regression)
+> "The answer provides a detailed comparison between Deployments and
+> StatefulSets, covering key differences such as pod identity, naming,
+> network identity, storage, ordering, and use cases. However, it does
+> not explicitly mention the points about 'declarative updates' and
+> 'sticky identity' for StatefulSets as stated in the reference."
+**This is the cleanest evidence-vs-reasoning contradiction in the
+calibration set.** The model quotes "declarative updates" and "sticky
+identity" into evidence, then writes "does not explicitly mention
+'declarative updates' and 'sticky identity'". The score field is then
+1, following the reasoning.
+---
+## Item k8s_018 — autoscaling/v2 vs v1 (gold=2, Haiku=2, gpt=1)
+**Reference (gold) answer:** "autoscaling/v2 is the current stable API
+version; supports memory metrics beyond CPU-only autoscaling/v1; supports
+custom metrics."
+### gpt-4o-mini evidence_quotes (extraction step — correct)
+```
+[
+  "The `autoscaling/v2` API version is the current stable version (stable since Kubernetes v1.23) that includes support for scaling on **custom metrics**.",
+  "The `autoscaling/v2` API version includes support for scaling on **memory metrics**, which is not available in the older `autoscaling/v1` version.",
+  "With `autoscaling/v2`, you can specify multiple metrics for a HorizontalPodAutoscaler to scale on."
+]
+```
+All three reference points are extracted: stable version, memory metrics,
+custom metrics.
+### gpt-4o-mini reasoning (literal-match regression)
+> "The answer covers some key points from the reference but does not
+> fully address all of them. It mentions the stable version of
+> `autoscaling/v2`, support for custom metrics, and memory metrics, but
+> it does not explicitly state that the new fields in `autoscaling/v2`
+> are preserved as annotations when using `autoscaling/v1`, nor does it
+> mention the need to use `autoscaling/v2` directly for memory or custom
+> metric scaling for a Deployment or StatefulSet. Therefore, it
+> partially covers the reference points."
+**Same pattern, plus a deduction for a point the reference does not require.** The reasoning
+acknowledges the three reference points are covered ("It mentions the
+stable version of `autoscaling/v2`, support for custom metrics, and
+memory metrics") and then deducts for points the *reference does not
+require* ("does not explicitly state that the new fields in
+`autoscaling/v2` are preserved as annotations when using
+`autoscaling/v1`"). The reference (per the gold annotation) requires
+three points; gpt's reasoning invents a fourth and penalizes for it.
+---
+## Why this matters for the writeup
+This isn't "GPT-4o-mini is bad at completeness." It's a sharper claim:
+*the structured-output discipline forces correct extraction, but the
+reasoning step regresses to a literal-match standard the rubric does
+not specify*. That regression is dimension-specific (groundedness AC1 =
+1.000, relevance AC1 = 1.000 on the same model) — it surfaces only on
+the 3-point ordinal scale where "paraphrase allowed" is load-bearing.
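For readers unfamiliar with the statistic, here is a small sketch of a two-rater agreement coefficient, assuming that "AC1" in this note refers to Gwet's AC1 (the note does not define it). The function name and the toy label lists are hypothetical; they are not the calibration data.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1 for two raters over the same items (nominal categories)."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_a = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # pi_q: average of the two raters' marginal proportions for category q.
    counts = Counter(rater_a) + Counter(rater_b)
    pis = [counts[c] / (2 * n) for c in categories]
    # Chance agreement under Gwet's model.
    p_e = sum(pi * (1 - pi) for pi in pis) / (len(categories) - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy binary labels: perfect agreement, including a fully saturated case.
mixed = gwet_ac1([1, 1, 0, 1], [1, 1, 0, 1], categories=[0, 1])
saturated = gwet_ac1([1, 1, 1, 1], [1, 1, 1, 1], categories=[0, 1])
chance = gwet_ac1([1, 0, 1, 0], [1, 1, 0, 0], categories=[0, 1])
print(mixed, saturated, chance)
```

Unlike Cohen's κ, whose chance term makes it undefined when both raters use a single category, this coefficient stays defined on a saturated scale. Under the stated assumption, that would explain how a value of 1.000 can be reported for saturated relevance.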
+Two consequences for evaluation framework design:
+1. **Per-dimension judge selection matters more than per-judge selection.**
+   gpt-4o-mini is fine for binary groundedness and saturated relevance;
+   it's miscalibrated for paraphrase-tolerant ordinal completeness. v1's
+   global "include in jury" decision flattens this.
+2. **A judge's `reasoning` field can contradict its `evidence_quotes`
+   field, and the score follows the reasoning.** Structured-output APIs
+   validate the shape of the response, not semantic agreement between
+   its fields, so internal consistency depends on the model's
+   capability, which varies across model sizes and dimensions. v1.2
+   diagnostics should include an internal-consistency check (does the
+   reasoning's score-direction match what the evidence_quotes would
+   support?) as an additional signal beyond raw κ.
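One crude way to approximate the internal-consistency check proposed in point 2 is sketched below. It is a heuristic illustration, not the project's v1.2 implementation: the function name, the denial regex, and the quoted-phrase regex are all assumptions, and it only catches the case where the reasoning names a quoted phrase as missing while the same phrase sits in the judge's own `evidence_quotes`.

```python
import re

# Phrases that signal the reasoning is claiming something is absent (assumed wording).
_DENIAL = re.compile(r"does not (?:explicitly )?(?:mention|state|address)", re.IGNORECASE)
# Phrases the reasoning puts in single quotes or backticks.
_QUOTED = re.compile(r"'([^']+)'|`([^`]+)`")


def contradicted_phrases(reasoning, evidence_quotes):
    """Return phrases the reasoning calls missing that occur in the evidence quotes."""
    evidence = " ".join(evidence_quotes).lower()
    hits = []
    # Split into sentences so a denial only applies to phrases that follow it.
    for sentence in re.split(r"(?<=[.!?])\s+", reasoning):
        denial = _DENIAL.search(sentence)
        if denial is None:
            continue
        for match in _QUOTED.finditer(sentence[denial.end():]):
            phrase = (match.group(1) or match.group(2)).lower()
            if phrase in evidence:
                hits.append(phrase)
    return hits


# The k8s_002 record quoted earlier in this note.
reasoning = (
    "The answer provides a detailed comparison between Deployments and "
    "StatefulSets, covering key differences such as pod identity, naming, "
    "network identity, storage, ordering, and use cases. However, it does "
    "not explicitly mention the points about 'declarative updates' and "
    "'sticky identity' for StatefulSets as stated in the reference."
)
evidence_quotes = [
    "A Deployment manages a set of Pods for an application workload that does "
    "not maintain state and provides declarative updates; its Pods are "
    "interchangeable replicas.",
    "A StatefulSet, by contrast, maintains a sticky identity for each of its "
    "Pods \u2014 stable network identifiers, stable persistent storage, and "
    "ordered deployment/scaling.",
]

flags = contradicted_phrases(reasoning, evidence_quotes)
print(flags)
```

A literal substring test like this flags the k8s_002 contradiction but would miss q006, where the denial is phrased as a paraphrase rather than a quoted string. A fuller check would probably need a second model call or an entailment step.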

measurements/README.md CHANGED

@@ -12,3 +12,4 @@ Naming: `YYYY-MM-DD-<topic>-<variant>.log`
 Current entries:
 - `2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log` — HF Spaces cold-start samples N=1..3. Backs the DECISIONS.md entry "Cold-start gate fired — assumption falsified, fix deferred to v1.1 at the right cause."

+- `2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3 dimensions hand-labels (single rater) for the κ ablation table in `docs/_generated/kappa_table.md` and the writeup at `docs/judge-design.md`. Backs the DECISIONS.md entry "LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call". Lands in Phase 10 (manual labeling).

pyproject.toml CHANGED

@@ -54,6 +54,7 @@ testpaths = ["tests"]
 [tool.ruff]
 target-version = "py311"
 line-length = 100
+extend-exclude = ["scripts/_dev"]
 [tool.ruff.lint]
 select = ["E", "F", "I", "N", "W"]
@@ -62,3 +63,4 @@ select = ["E", "F", "I", "N", "W"]
 python_version = "3.11"
 warn_return_any = true
 warn_unused_configs = true
+exclude = ["scripts/_dev/"]

results/calibration_v1_judge_baseline.json ADDED

The diff for this file is too large to render. See raw diff

results/calibration_v1_judge_baseline_no_abstain.json ADDED

The diff for this file is too large to render. See raw diff

results/calibration_v1_judge_baseline_no_anchors.json ADDED

The diff for this file is too large to render. See raw diff

results/calibration_v1_judge_baseline_no_cot.json ADDED

	@@ -0,0 +1,2115 @@

+[
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer contains a claim not entailed by the snippet. The snippet describes the caching mechanism (what it does) but does not state that caching 'is particularly useful for expensive operations like database connections.' This is an embellishment importing general LLM knowledge about why caching is useful, similar to Example B in the rubric. The snippet uses `Depends(get_db)` as an identifier in its example, not as a normative argument that caching is useful for database connections. All other claims in the answer are entailed: the default caching behavior, the once-per-request semantics, the mechanism of reusing cached values, the `use_cache=False` disable mechanism, and the code example demonstrating the disable feature. However, the single ungrounded claim about usefulness for expensive operations fails the strict binary rubric.",
+    "evidence_quotes": [
+      "Snippet: 'By default, if the same dependency is used multiple times within a single request (e.g., both a route and a sub-dependency use `Depends(get_db)`), FastAPI caches the result and calls the dependency only once.'",
+      "Answer claim: 'This improves performance by avoiding redundant function calls and is particularly useful for expensive operations like database connections.'",
+      "The 'particularly useful for expensive operations like database connections' portion is not entailed by the snippet."
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.005529600000000001,
+    "latency_ms": 5959.3987130210735,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims that need to be evaluated against the gold snippets:\n\n1. 'The default Swagger UI endpoint in FastAPI is `/docs`' - Snippet [1] directly states this.\n2. 'This endpoint is automatically available in every FastAPI application with zero configuration required' - Snippet [2] states 'Every FastAPI application exposes three documentation-related endpoints by default', which entails this claim.\n3. 'When you navigate to http://localhost:8000/docs... you'll see an interactive API explorer powered by Swagger UI' - Snippet [1] describes `/docs` as 'Swagger UI -- interactive API explorer', which entails this.\n4. 'FastAPI also provides two other documentation endpoints by default: `/redoc` and `/openapi.json`' - Snippet [2] states there are 'three documentation-related endpoints by default', and the answer identifies three total endpoints. However, the snippets do not explicitly name `/redoc` and `/openapi.json` as the other two endpoints.\n5. The code examples showing customization of `docs_url`, `redoc_url`, and `openapi_url` parameters - These are not entailed by the snippets provided.\n6. The code example showing how to disable endpoints by setting them to `None` - This is not entailed by the snippets provided.\n\nThe critical issue is that while Snippet [2] confirms there are three documentation endpoints by default, it does not specify what those three endpoints are. The answer names `/redoc` and `/openapi.json` as the other two, but this is not entailed by the snippets. Additionally, the customization examples and the ability to disable endpoints are not mentioned in the gold snippets at all.\n\nUnder the strict reference scope rule, claims that are factually correct but not entailed by the snippets must score 0. The answer goes beyond what the snippets explicitly state regarding the specific endpoint names and the customization/disabling capabilities.",
+    "evidence_quotes": [
+      "Snippet [1]: '/docs | Swagger UI -- interactive API explorer'",
+      "Snippet [2]: 'Every FastAPI application exposes three documentation-related endpoints by default'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.006808,
+    "latency_ms": 11671.607129916083,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "Evaluating each major claim in the answer against the gold snippets:\n\n1. 'HTTPException is the primary way to raise HTTP errors' - ENTAILED by snippet [1].\n2. 'Raising HTTPException immediately terminates request processing and returns specified status code and detail' - ENTAILED by snippet [2].\n3. 'detail parameter can be a string, list, or dictionary' - ENTAILED by snippet [2].\n4. 'FastAPI serializes detail to JSON automatically' - ENTAILED by snippet [2].\n5. 'status_code is required parameter' - NOT ENTAILED. Snippets do not specify which parameters are required vs optional.\n6. 'detail is optional parameter' - NOT ENTAILED. Snippets do not specify optionality.\n7. 'headers parameter (optional)' - NOT ENTAILED. Snippets make no mention of a headers parameter.\n8. 'Custom headers example with X-Custom-Header' - NOT ENTAILED. Snippets do not mention headers capability.\n9. 'Response is automatically formatted as JSON with detail field' - ENTAILED by snippet [2] (serializes to JSON).\n10. 'HTTPException inherits from Starlette's HTTPException' - NOT ENTAILED. Snippets do not mention Starlette or inheritance.\n11. 'Can create custom exception handlers' - NOT ENTAILED. Snippets do not mention custom exception handlers.\n\nThe answer contains multiple ungrounded claims: the headers parameter and its usage, parameter optionality/requirement specifications, the Starlette inheritance detail, and custom exception handlers. Under the strict binary rubric, even one ungrounded claim requires a score of 0.",
+    "evidence_quotes": [
+      "Snippet [1]: 'The HTTPException class is the primary way to return error responses from route handlers'",
+      "Snippet [2]: 'When raised, HTTPException immediately terminates request processing and returns the specified status code and detail message. The detail parameter can be a string, list, or dictionary -- FastAPI serializes it to JSON automatically.'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.006624,
+    "latency_ms": 7320.906443987042,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0016296000000000001,
+    "latency_ms": 806.1586309922859,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer provides a comprehensive FastAPI setup guide covering error handling, CORS, and dependency injection. However, examining against the gold snippets: Snippet [1] states 'The `HTTPException` class is the primary way to return error responses from route handlers.' Snippet [2] states 'Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette.' The answer makes numerous claims beyond what is entailed by these snippets: (1) Custom exception classes (ItemNotFoundException, DatabaseError) and their handlers are presented as a primary pattern, but the snippet specifies HTTPException as 'the primary way'; (2) Custom exception handlers for RequestValidationError are shown without snippet support; (3) Detailed CORS configuration parameters (allow_origins, allow_credentials, allow_methods, allow_headers, expose_headers, max_age) are provided, but the snippet only states that CORS 'is configured using CORSMiddleware from Starlette' without detailing these parameters; (4) The entire dependency injection section is completely unsupported by the snippets; (5) Database class implementation, Pydantic models, and route handler examples are all beyond snippet scope. While the answer is factually correct FastAPI documentation, it dramatically exceeds what is entailed by the two minimal snippets provided. The strict-conjunction rule applies: even though most claims are correct in general FastAPI knowledge, they are not entailed by the gold snippets.",
+    "evidence_quotes": [
+      "Snippet [1]: 'The `HTTPException` class is the primary way to return error responses from route handlers'",
+      "Snippet [2]: 'Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette'",
+      "Answer claims: Custom exception classes, RequestValidationError handlers, detailed CORS parameters (allow_origins, allow_credentials, allow_methods, allow_headers, expose_headers, max_age), dependency injection patterns, database classes, Pydantic models, and route implementations"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.007796000000000001,
+    "latency_ms": 9348.714338964783,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer provides a comprehensive implementation of a paginated API with cursor-based pagination, validation, and analytics. However, the gold snippet only states: 'Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets.' The answer goes far beyond this single sentence by providing: (1) complete Pydantic response models with validation, (2) cursor encoding/decoding utilities using base64, (3) analytics logging with background tasks, (4) a complete FastAPI endpoint implementation with mock database, and (5) detailed field specifications and configurations. While these implementations are factually correct and follow best practices, they are NOT entailed by the gold snippet. The snippet only describes what cursor-based pagination is and why it's useful (avoiding performance degradation). The answer adds extensive implementation details, specific technologies (Pydantic, base64 encoding, logging, BackgroundTasks), and architectural patterns that are not mentioned in or entailed by the snippet. This is similar to the k8s_006 calibration anchor: the answer dramatically extends beyond what the gold snippet covers.",
+    "evidence_quotes": [
+      "Gold snippet: 'Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets.'",
+      "Answer claims not in snippet: Complete Pydantic BaseModel definitions, Generic type parameters, Field validators with constraints, base64 encoding/decoding implementation, analytics logging with JSON serialization, FastAPI endpoint decorators, BackgroundTasks integration, mock database with 100 items, datetime handling, and specific HTTP query parameter configurations."
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.007131199999999999,
+    "latency_ms": 8201.74567901995,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0018592,
+    "latency_ms": 16725.520576001145,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes numerous claims beyond what is entailed by the two gold snippets provided. The snippets state: (1) A Deployment manages Pods for stateless workloads, and (2) A StatefulSet maintains sticky identity for each Pod. The answer goes far beyond these minimal claims to include: ordinal-based naming conventions, stable hostnames, persistent volumes via volumeClaimTemplates, ordered scaling/updates, specific use cases (databases, message queues, etc.), YAML configuration examples, headless service requirements, and detailed comparisons across multiple dimensions. While these claims are factually correct in Kubernetes documentation, they are NOT entailed by the two provided snippets. The snippets only establish the core distinction (stateless vs. stateful identity), not the mechanisms, naming schemes, storage patterns, ordering semantics, or use-case guidance that comprise most of the answer. This is analogous to the k8s_006 calibration anchor: the answer dramatically extends beyond the gold-snippet scope into general Kubernetes knowledge not present in the reference material.",
+    "evidence_quotes": [
+      "Snippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'",
+      "Snippet [2]: 'Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods'",
+      "Answer claims not in snippets: 'Ordinal-based names (e.g., web-0, web-1, web-2)', 'Stable hostname maintained across rescheduling', 'Persistent volumes per Pod via volumeClaimTemplates', 'Ordered, sequential operations (0\u21921\u21922)', 'Headless Service Required', detailed YAML examples, specific use cases like 'Databases (MySQL, PostgreSQL, MongoDB)'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.0074048,
+    "latency_ms": 20093.030811985955,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011CajszTeiZPK5rrayb3438'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.0031856,
+    "latency_ms": 51021.941915969364,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims about Pod Security Admission's enforce and warn modes. Evaluating against the gold snippets:\n\nSnippet [1] states: 'Policy violations will cause the pod to be rejected'\nSnippet [2] only provides feature state information (Kubernetes v1.25 [stable])\n\nThe answer's key claims:\n1. 'Policy violations will cause the pod to be rejected' in enforce mode - ENTAILED by snippet [1]\n2. 'Policy violations are otherwise allowed' in warn mode - NOT ENTAILED by snippets\n3. 'Pod creation fails silently' in enforce mode - NOT ENTAILED by snippets\n4. 'Policy violations trigger a user-facing warning' in warn mode - NOT ENTAILED by snippets\n5. Claims about workload resources not being applied in enforce mode - NOT ENTAILED by snippets\n6. Claims about workload resources being applied in warn mode - NOT ENTAILED by snippets\n7. Namespace labeling examples and configuration details - NOT ENTAILED by snippets\n8. Claims about independent configuration - NOT ENTAILED by snippets\n\nThe gold snippets provide only minimal information: that policy violations cause pod rejection (in some mode, presumably enforce), and a feature state marker. The answer goes far beyond this, making numerous claims about warn mode behavior, workload resource handling, configuration examples, and practical implications that are not entailed by the provided snippets. While these claims may be factually correct in Kubernetes documentation, they are not grounded in the gold snippets provided. The strict-entailment rule requires that every claim be entailed by the gold snippets only.",
+    "evidence_quotes": [
+      "Snippet [1]: 'Policy violations will cause the pod to be rejected'",
+      "Snippet [2]: 'FEATURE STATE: Kubernetes v1.25 [stable]'",
+      "Answer claims about warn mode, workload resources, configuration examples, and practical implications are not present in or entailed by the gold snippets"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0065352000000000006,
+    "latency_ms": 36792.009823955595,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states only: 'A ConfigMap is an API object used to store non-confidential data in key-value pairs.' The answer makes numerous claims beyond this single sentence: (1) three consumption methods (environment variables, command-line arguments, volumes), (2) specific examples of what to store (application configuration settings, environment-specific parameters, configuration files, feature flags), (3) what NOT to store (passwords, API keys, tokens, certificates), (4) recommendation to use Kubernetes Secrets instead, and (5) details about `data` and `binaryData` fields. While these claims are factually correct in Kubernetes documentation, they are NOT entailed by the gold snippet provided. The snippet only defines what a ConfigMap is and that it stores non-confidential data in key-value pairs. It does not describe consumption methods, specific use cases, security warnings, or schema fields. This is analogous to calibration anchor k8s_006: the answer dramatically extends beyond the snippet's scope. The strict-conjunction rule applies\u2014a single ungrounded claim fails the binary rubric, and this answer contains multiple ungrounded claims.",
+    "evidence_quotes": [
+      "Gold snippet: 'A ConfigMap is an API object used to store non-confidential data in key-value pairs.' Answer claims not in snippet: 'Pods can consume ConfigMaps in three ways: Environment variables, Command-line arguments, Configuration files in a volume'; 'ConfigMaps are designed for non-confidential, non-sensitive data, such as: Application configuration settings, Environment-specific parameters, Configuration files, Feature flags'; 'Do NOT store sensitive or confidential data in ConfigMaps, such as: Passwords, API keys, Tokens, Certificates'; 'you should use Kubernetes Secrets instead'; 'ConfigMaps have `data` and `binaryData` fields'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.006216,
+    "latency_ms": 16998.984744946938,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states: 'A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate.' The answer makes numerous claims that go far beyond what is entailed by this single snippet. Specifically: (1) The answer describes Jobs as 'finite or batch tasks to completion' \u2014 not entailed by the snippet. (2) The answer explains completion modes (NonIndexed vs Indexed) with detailed specifications about `.spec.completionMode`, `.spec.completions`, and how each mode determines completion \u2014 none of this is mentioned in the snippet. (3) The answer describes how completion indices are made available through pod annotations, labels, and environment variables \u2014 not entailed. (4) The answer explains `.spec.parallelism` and `.spec.backoffLimit` parameters \u2014 not entailed. (5) The answer states Jobs are 'useful for running tasks that need to run once and then stop, unlike Deployments which run continuously' \u2014 a use-case comparison not in the snippet. While the answer is factually correct about Kubernetes Jobs, the strict-snippet scope rule is clear: claims not entailed by the gold snippet must score 0, even if they are correct in the broader documentation or world knowledge. This answer dramatically extends beyond the single sentence provided, similar to the k8s_006 calibration anchor. The only claim entailed by the snippet is that a Job 'creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate,' which appears in the opening paragraph. Everything else fails strict entailment.",
+    "evidence_quotes": [
+      "Gold snippet: 'A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate'",
+      "Answer claims about completion modes, indexed vs non-indexed, .spec.completionMode, environment variables, parallelism, backoffLimit \u2014 none present in snippet"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0067864,
+    "latency_ms": 12302.207177970558,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states: 'The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole, RoleBinding and ClusterRoleBinding.' The answer correctly identifies these four object types. However, the answer goes significantly beyond what is entailed by the snippet. The snippet only declares that these four kinds exist; it does not describe their purposes, scopes, or relationships. The answer makes numerous claims not entailed by the snippet: (1) Role 'defines a set of permissions within a specific namespace' and 'contains rules that represent a set of permissions'; (2) ClusterRole is 'cluster-scoped rather than namespace-scoped' and 'can be used to grant permissions across the entire cluster'; (3) RoleBinding 'grants the permissions defined in a Role' and 'connects a Role to subjects'; (4) ClusterRoleBinding 'grants the permissions defined in a ClusterRole at the cluster level'; (5) the distinction that 'Roles and ClusterRoles define what permissions are available, while RoleBindings and ClusterRoleBindings define who has those permissions'; (6) the namespace vs cluster-scoped distinction and its implications. While these claims are factually correct in Kubernetes documentation, they are not entailed by the single-sentence snippet provided. The snippet only names the four objects; it does not describe their functionality, scope, or relationships. This is analogous to calibration anchor k8s_006, where the answer dramatically extends beyond what the snippet entails.",
+    "evidence_quotes": [
+      "Gold snippet: 'The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole, RoleBinding and ClusterRoleBinding'",
+      "Answer claims about Role: 'Defines a set of permissions within a specific namespace. It contains rules that represent a set of permissions.' \u2014 NOT in snippet",
+      "Answer claims about ClusterRole: 'Similar to a Role, but is cluster-scoped rather than namespace-scoped. It can be used to grant permissions across the entire cluster.' \u2014 NOT in snippet",
+      "Answer claims about RoleBinding: 'Grants the permissions defined in a Role to a user or set of users within a specific namespace. It connects a Role to subjects' \u2014 NOT in snippet",
+      "Answer claims about ClusterRoleBinding: 'Similar to a RoleBinding, but grants the permissions defined in a ClusterRole at the cluster level.' \u2014 NOT in snippet",
+      "Answer meta-claim: 'Roles and ClusterRoles define what permissions are available, while RoleBindings and ClusterRoleBindings define who has those permissions.' \u2014 NOT in snippet"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0070079999999999995,
+    "latency_ms": 10670.279069978278,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet provided is incomplete: '[1] Secrets can be mounted as data volumes or exposed as'. It cuts off mid-sentence and provides no substantive information about the behavioral differences between these two approaches. The answer makes numerous specific claims about automatic updates, dynamic refresh, subPath mount limitations, static environment variables, pod restart requirements, character restrictions, and use-case recommendations. None of these claims are entailed by the incomplete snippet fragment, which merely states that Secrets *can* be mounted or exposed without elaborating on how they differ or what their characteristics are. Under strict reference scope (gold snippets only) and the binary rubric requiring every claim to be entailed, this answer fails because the snippet provides no content to ground any of the detailed behavioral claims made.",
+    "evidence_quotes": [
+      "Gold snippet: '[1] Secrets can be mounted as data volumes or exposed as' (incomplete/truncated)"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0053408,
+    "latency_ms": 11916.670492035337,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states only: 'If you set the `emptyDir.medium` field to `\"Memory\"`, Kubernetes mounts a tmpfs'. This is a single factual claim about what happens when medium is set to Memory. The answer makes numerous claims beyond this snippet: (1) default emptyDir storage location and behavior, (2) performance characteristics of both modes, (3) memory impact and counting against container limits, (4) size limit specifications and node allocatable memory, (5) resource considerations about memory cost and capacity. While many of these claims may be factually correct in Kubernetes documentation, they are NOT entailed by the provided gold snippet. The snippet only establishes that setting medium to 'Memory' results in tmpfs mounting. Claims about default behavior, memory accounting, size limits, and resource considerations go dramatically beyond what the snippet entails. This is analogous to the k8s_006 calibration anchor where the answer extends far beyond the single-sentence snippet provided.",
+    "evidence_quotes": [
+      "Gold snippet: 'If you set the `emptyDir.medium` field to `\"Memory\"`, Kubernetes mounts a tmpfs'",
+      "Answer claims not in snippet: storage location for default emptyDir, performance characteristics, memory impact and counting against limits, size limit behavior, node allocatable memory sizing, resource considerations"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.005747200000000001,
+    "latency_ms": 30587.856588943396,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims about liveness and readiness probes. Checking against the gold snippets:\n\nSnippet [1]: 'uses liveness probes to know when to restart a container'\nSnippet [2]: 'uses readiness probes to know when a container is ready to start accepting traffic'\n\nThe answer claims:\n1. Liveness probe failure \u2192 kubelet kills container and restarts it\n2. Readiness probe failure \u2192 kubelet continues running container, doesn't kill it\n3. Readiness probe failure \u2192 pod won't receive traffic from Services\n4. Liveness probe is for detecting unrecoverable failures\n5. Readiness probe is for detecting temporary unavailability\n6. Best practice details about using both probes together with same endpoint, higher failureThreshold for liveness\n\nSnippet [1] only states that liveness probes are used 'to know when to restart a container' - it entails that liveness probes trigger restarts.\n\nSnippet [2] only states that readiness probes are used 'to know when a container is ready to start accepting traffic' - it entails that readiness probes relate to traffic acceptance.\n\nHowever, the answer makes many claims NOT entailed by these minimal snippets:\n- The specific behavior that readiness probe failure keeps the container running (not stated in snippet [2])\n- The detail about Pod's Ready condition being set to false\n- The use-case explanations (unrecoverable failures, temporary unavailability)\n- The best practice section about using both together, same endpoint, failureThreshold settings\n- The terminationGracePeriodSeconds detail\n\nWhile these claims are factually correct in Kubernetes documentation, they are NOT entailed by the two provided snippets. The snippets only state the basic purpose of each probe type, not the detailed behavioral differences or best practices described in the answer.\n\nUnder strict reference scope (gold snippets only), this answer goes far beyond what is entailed.",
+    "evidence_quotes": [
+      "Snippet [1]: 'uses liveness probes to know when to restart a container'",
+      "Snippet [2]: 'uses readiness probes to know when a container is ready to start accepting traffic'",
+      "Answer claims: 'The kubelet kills the container and restarts it', 'The kubelet continues running the container', 'The pod will not receive traffic from Kubernetes Services', 'Use case: Detecting unrecoverable application failures', 'Use case: Indicating that a container is temporarily unable to handle traffic', 'A common pattern is to use both probes together'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.007291200000000001,
+    "latency_ms": 26607.242761005182,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes numerous claims beyond what is entailed by the two gold snippets. The snippets only state: (1) NodePort 'Exposes the Service on each Node' and (2) LoadBalancer 'Exposes the Service externally using an external load balancer.' The answer adds many ungrounded claims including: the NodePort range (30000-32767), that NodePort creates a cluster IP internally, that external traffic accesses via <NodeIP>:<NodePort>, that LoadBalancer requires cloud provider integration, that LoadBalancer creates a NodePort internally, the optional `spec.allocateLoadBalancerNodePorts: false` configuration, and various use-case recommendations (development/testing vs production). While these claims may be factually correct in Kubernetes documentation, they are not entailed by the gold snippets provided. The strict-reference-scope rule requires that claims be entailed by the gold snippets only, not by broader domain knowledge or the underlying documentation. This is analogous to calibration anchor k8s_006, where correct information beyond the snippet scope scores 0.",
+    "evidence_quotes": [
+      "Gold snippet [1]: 'Exposes the Service on each Node'",
+      "Gold snippet [2]: 'Exposes the Service externally using an external load balancer'",
+      "Answer claims not in snippets: NodePort range (30000-32767), cluster IP creation, <NodeIP>:<NodePort> access pattern, cloud provider integration requirement, LoadBalancer creates NodePort internally, spec.allocateLoadBalancerNodePorts configuration, development vs production suitability"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0065256,
+    "latency_ms": 11835.180599999148,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims that must be checked against the two gold snippets: [1] 'Jobs represent one-off tasks that run to completion and then stop' and [2] 'A CronJob starts one-time Jobs on a repeating schedule'. The answer's core claims are: (1) Jobs run once when created; (2) CronJobs run on a repeating schedule; (3) Jobs are for one-time batch tasks; (4) CronJobs are for recurring tasks; (5) CronJobs create Job objects approximately once per scheduled execution; (6) Kubernetes doesn't guarantee exactly-once semantics; (7) Two Jobs might be created for a single schedule; (8) No Job might be created for a scheduled time; (9) Jobs created by CronJobs should be idempotent; (10) You can set startingDeadlineSeconds on a CronJob; (11) startingDeadlineSeconds controls what happens if a scheduled Job is missed. Claims 1-4 are directly entailed by the snippets (paraphrase of 'one-off tasks' and 'repeating schedule'). However, claims 5-11 regarding exactly-once semantics, idempotency requirements, the possibility of multiple or zero Job creations, and the startingDeadlineSeconds mechanism are NOT entailed by the gold snippets. These are factually correct Kubernetes behaviors but are not present in the provided reference material. The strict-entailment rubric requires that every claim be entailed by the gold snippets only. The answer goes well beyond what the snippets support, similar to the k8s_006 calibration anchor where correct information from the broader documentation fails because it's not in the gold snippet scope.",
+    "evidence_quotes": [
+      "Gold snippet [1]: 'Jobs represent one-off tasks that run to completion and then stop'",
+      "Gold snippet [2]: 'A CronJob starts one-time Jobs on a repeating schedule'",
+      "Answer claims not in snippets: 'Kubernetes doesn't guarantee exactly-once semantics', 'Two Jobs might be created for a single schedule', 'No Job might be created for a scheduled time', 'Jobs created by CronJobs should be idempotent', 'You can set startingDeadlineSeconds on a CronJob'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.007357600000000001,
+    "latency_ms": 27444.345601077657,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims about Deployments and DaemonSets. Checking against the gold snippets:\n\nSnippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'\nSnippet [2]: 'DaemonSet ensures that all (or some) Nodes run a copy of a Pod'\n\nClaims in the answer:\n1. Deployment manages 'specified number of Pod replicas' - entailed by [1] ('manages a set of Pods')\n2. Deployment Pods run on 'any eligible nodes' - NOT entailed by snippet [1]\n3. Deployment 'uses the standard Kubernetes scheduler' - NOT entailed by snippets\n4. Deployment 'Pods are distributed across nodes based on availability and scheduling policies' - NOT entailed by snippets\n5. DaemonSet 'ensures that all eligible nodes (or a subset based on node selectors/affinity) run exactly one copy' - partially entailed by [2] ('all (or some) Nodes run a copy of a Pod'), though [2] doesn't specify 'exactly one'\n6. DaemonSet controller 'creates a Pod for each eligible node and adds nodeAffinity rules' - NOT entailed by [2]\n7. DaemonSet uses '.spec.nodeName field' - NOT entailed by [2]\n8. Use cases section - NOT entailed by snippets\n\nThe answer goes far beyond what the snippets entail. While some core concepts are present (Deployment manages Pods, DaemonSet runs on nodes), the answer adds substantial detail about scheduling mechanisms, nodeAffinity, nodeName fields, and use cases that are not in the gold snippets. Under strict snippet entailment, this fails the binary rubric.",
+    "evidence_quotes": [
+      "Snippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'",
+      "Snippet [2]: 'DaemonSet ensures that all (or some) Nodes run a copy of a Pod'",
+      "Answer claims about 'standard Kubernetes scheduler', 'nodeAffinity rules', '.spec.nodeName field', and use cases are not entailed by the snippets"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.006692,
+    "latency_ms": 14147.950275044423,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt33D8HsUdB7oGVqJ1A'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0,
+    "latency_ms": 34940.02001499757,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes multiple specific claims about the `autoscaling/v2` API version and HorizontalPodAutoscaler functionality. However, the gold snippets provided are extremely minimal: [1] states that HorizontalPodAutoscaler 'automatically updates a workload resource' and [2] is incomplete ('The current stable version can be found in the'). Neither snippet entails the detailed claims in the answer about: (a) autoscaling/v2 being the current stable version since Kubernetes v1.23, (b) support for custom metrics, (c) support for memory metrics, (d) the ability to specify multiple metrics, (e) configurable scaling behavior with separate scale-up/scale-down behaviors, (f) autoscaling/v1 only supporting CPU-based scaling, or (g) the comparison between v1 and v2 features. While these claims may be factually correct in the broader Kubernetes documentation, they are not entailed by the gold snippets provided. The strict reference scope is the gold snippets only, and the answer dramatically exceeds what can be inferred from them. This is analogous to the k8s_006 calibration anchor: most of the answer is well-supported by the broader corpus, but the gold-snippet scope is what the rubric measures.",
+    "evidence_quotes": [
+      "Gold snippet [1]: 'HorizontalPodAutoscaler* automatically updates a workload resource'. Gold snippet [2]: 'The current stable version can be found in the' (incomplete). Neither snippet supports claims about API versions, custom metrics, memory metrics, multiple metrics, scaling behaviors, or version comparisons."
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.005912000000000001,
+    "latency_ms": 29024.256383010652,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states only that 'A ConfigMap is an API object used to store non-confidential data in key-value pairs.' The answer provides four detailed mechanisms for consuming ConfigMap values (environment variables individual, environment variables bulk, volume mounts, and Kubernetes API access), along with YAML examples, limitations, update behavior, and a comparison table. None of these consumption methods, technical details, YAML syntax, update semantics, or use-case guidance are entailed by the single-sentence snippet. While these claims are factually correct in Kubernetes documentation, they are not entailed by the gold snippet provided. This is analogous to calibration anchor k8s_006: the answer goes dramatically beyond what the snippet supports, covering consumption methods, schema details, and operational guidance that the snippet does not mention. The strict-conjunction rule applies: even though the answer is well-supported by the broader Kubernetes corpus, the gold-snippet scope is what the rubric measures.",
+    "evidence_quotes": [
+      "Gold snippet: 'A ConfigMap is an API object used to store non-confidential data in key-value pairs'",
+      "Answer claims not in snippet: 'four main mechanisms', 'Environment Variables (Individual)', 'Environment Variables (Bulk)', 'Volume Mounts (Files)', 'Kubernetes API (Direct Access)', YAML examples, 'not updated automatically', 'supports dynamic updates', 'propagation delays', 'subPath volume mounts will not receive ConfigMap updates', comparison table with update behavior"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0060544,
+    "latency_ms": 21910.545678983908,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50 requests per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt3cHxhyPgmbXLYt5AS'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0022536000000000006,
+    "latency_ms": 36177.12139704963,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states: 'purely additive (there are no \"deny\" rules)'. This establishes that standard Kubernetes RBAC is allow-based only with no deny rules. The answer's opening claim that 'Standard Kubernetes RBAC does **not have explicit deny rules**' and 'RBAC is **allow-based only**' are directly entailed by this snippet. However, the answer then introduces Option 2 (CEL-based Authorization Policy) and Option 3 (Admission Controllers) as mechanisms to implement explicit deny rules. These options are presented as valid approaches to achieve deny functionality in Kubernetes. The gold snippet does not mention CEL-based authorization policies, AuthorizationPolicy resources, or admission controllers as ways to implement deny rules. While these may be factually correct in the broader Kubernetes ecosystem, they are not entailed by the gold snippet provided. The snippet only establishes what standard RBAC is (allow-based, no deny rules), not what alternative mechanisms exist to add deny functionality. The answer goes beyond the scope of the gold snippet by introducing solutions not mentioned in the reference material. This violates the strict-entailment requirement that claims must be entailed by the gold snippets only, not by general LLM knowledge.",
+    "evidence_quotes": [
+      "Gold snippet: 'purely additive (there are no \"deny\" rules)'",
+      "Answer claim: 'Option 2: Use CEL-based Authorization Policy (Kubernetes 1.26+)' with detailed YAML examples",
+      "Answer claim: 'Option 3: Use Admission Controllers' with ValidatingAdmissionPolicy examples"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0063864,
+    "latency_ms": 14159.782970964443,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.001712,
+    "latency_ms": 558.5277889622375,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippets provided are minimal: [1] 'Exposes the Service on a cluster-internal IP' and [2] 'Exposes the Service externally using an external load balancer.' Snippet [1] describes ClusterIP (internal exposure), and snippet [2] describes LoadBalancer (external exposure). The answer makes numerous claims about NodePort, LoadBalancer, ExternalIPs, ClusterIP, and Ingress/Gateway resources. Most of these claims are not entailed by the two snippets. Specifically: (1) The detailed description of NodePort (allocating ports on every node, connecting via node IP and port, traffic forwarding) is not mentioned in either snippet. (2) The claim about LoadBalancer 'automatically provisions an external load balancer' and 'provides a stable external IP address' goes beyond snippet [2], which only states it 'exposes the Service externally using an external load balancer.' (3) The entire section on ExternalIPs is not covered by either snippet. (4) The ClusterIP section's claim that it 'only exposes the Service on a cluster-internal IP' is a paraphrase of snippet [1], which is grounded. (5) The claim about using Ingress or Gateway resources to expose ClusterIP is not entailed by either snippet. Under the strict-conjunction rule, a single ungrounded claim fails the binary rubric. This answer contains multiple ungrounded claims beyond what the snippets entail.",
+    "evidence_quotes": [
+      "Snippet [1]: 'Exposes the Service on a cluster-internal IP'",
+      "Snippet [2]: 'Exposes the Service externally using an external load balancer'",
+      "Answer claims about NodePort (allocation, node IP access, traffic forwarding) are not in snippets",
+      "Answer claims about LoadBalancer details (automatic provisioning, stable external IP) exceed snippet [2]",
+      "Answer section on ExternalIPs is entirely absent from snippets",
+      "Answer claims about Ingress/Gateway are not entailed by snippets"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0067864,
+    "latency_ms": 20653.504143992905,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0008928,
+    "latency_ms": 5275.652518961579,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930",
+    "cost_usd": 0.000776,
+    "latency_ms": 613.2173900259659,
+    "item_id": "q010"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0007768,
+    "latency_ms": 4376.481520943344,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0008864000000000001,
+    "latency_ms": 8851.143473002594,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0006552000000000001,
+    "latency_ms": 6161.781317030545,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.001404,
+    "latency_ms": 1018.7896409770474,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.001404,
+    "latency_ms": 22714.352431998122,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e",
+    "cost_usd": 0.0009288,
+    "latency_ms": 4240.290573972743,
+    "item_id": "q027"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.00088,
+    "latency_ms": 4285.718351020478,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.001404,
+    "latency_ms": 6870.535210997332,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.0011152,
+    "latency_ms": 2953.8072769646533,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437",
+    "cost_usd": 0.0010424000000000002,
+    "latency_ms": 4848.448178963736,
+    "item_id": "k8s_004"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0008928,
+    "latency_ms": 10320.249837997835,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0008608000000000001,
+    "latency_ms": 8522.756394988392,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0009432,
+    "latency_ms": 6325.502372987103,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0007928000000000001,
+    "latency_ms": 1811.099338985514,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0009128000000000001,
+    "latency_ms": 768.1542619829997,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.0008704000000000001,
+    "latency_ms": 12236.18174099829,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.000932,
+    "latency_ms": 11616.58075498417,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0009008000000000001,
+    "latency_ms": 4196.989124990068,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.0009376,
+    "latency_ms": 838.445411005523,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0008352,
+    "latency_ms": 5632.905109028798,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0009328000000000001,
+    "latency_ms": 2904.8574669868685,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.000872,
+    "latency_ms": 16631.02817395702,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0011104,
+    "latency_ms": 5025.444047001656,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.000888,
+    "latency_ms": 6382.348418002948,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt6uVe7LRm3cyR9dtBR'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0,
+    "latency_ms": 20767.425035010092,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0007504000000000001,
+    "latency_ms": 5516.492156020831,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt6wc9SEcDRfPrzZBVp'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f",
+    "cost_usd": 0.0,
+    "latency_ms": 18978.80935500143,
+    "item_id": "k8s_024"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0008464000000000001,
+    "latency_ms": 3490.8632279839367,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0008888,
+    "latency_ms": 722.3775110323913,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0007648,
+    "latency_ms": 4397.730973025318,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0008768000000000001,
+    "latency_ms": 1102.155871980358,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt7RiL9hz5pVxt333xL'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0,
+    "latency_ms": 18437.84686899744,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.0014032,
+    "latency_ms": 1967.281456978526,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.0014048,
+    "latency_ms": 4361.171844007913,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0008992000000000001,
+    "latency_ms": 821.1477959994227,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.001436,
+    "latency_ms": 4178.335952979978,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.0011384000000000001,
+    "latency_ms": 5098.598277952988,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0009088000000000001,
+    "latency_ms": 1836.335435975343,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0008712,
+    "latency_ms": 20613.944871001877,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0009632,
+    "latency_ms": 1971.2769520119764,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0008248000000000001,
+    "latency_ms": 5351.545320998412,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0009288,
+    "latency_ms": 2994.747666991316,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt8UJ8fPy6FvgG5MQKo'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.0,
+    "latency_ms": 19476.016786997207,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.0009552,
+    "latency_ms": 574.3700260063633,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0009384,
+    "latency_ms": 5021.697896998376,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.000972,
+    "latency_ms": 776.8337430316024,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0008712,
+    "latency_ms": 4251.321821007878,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0009608,
+    "latency_ms": 4064.9051589425653,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.0008984,
+    "latency_ms": 4741.949952032883,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "The reference answer identifies two main ways to surface ConfigMap data to Pods: (1) exposing specific keys as environment variables, and (2) mounting as a volume. It also specifies a key difference: volume-mounted data updates in place when ConfigMap changes, while environment variables do not update until Pod restart. The agent's answer covers four mechanisms instead of two. Evaluating against the reference's key points: (1) Environment variables as a method - COVERED (both individual and bulk import shown). (2) Volume mounting as a method - COVERED (with detailed example). (3) Environment variables don't update until Pod restart - COVERED (explicitly stated in sections 1 and 2). (4) Volume-mounted data updates in place when ConfigMap changes - COVERED (stated in section 3). The agent adds two additional mechanisms (Kubernetes API direct access) not in the reference, but the rubric explicitly states 'The judge does not penalize the agent for adding correct extra detail.' All four key points from the reference are present in the answer, though paraphrased and expanded with additional context and examples.",
+    "evidence_quotes": [
+      "Reference point 1 - env vars: Agent states 'ConfigMap key-value pairs can be injected as individual environment variables' and 'All key-value pairs from a ConfigMap can be imported as environment variables'",
+      "Reference point 2 - volume mounting: Agent states 'ConfigMap data can be mounted as files in the container's filesystem' with 'Each key in the ConfigMap becomes a filename'",
+      "Reference point 3 - env vars don't auto-update: Agent explicitly states 'ConfigMaps consumed as environment variables are not updated automatically and require a pod restart'",
+      "Reference point 4 - volume updates in place: Agent states 'This mechanism supports dynamic updates when the ConfigMap changes (though there may be propagation delays)'"
+    ],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.004544,
+    "latency_ms": 10436.621424974874,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "Comparing the agent's answer against the reference's key points: (1) 'By default, Pods are non-isolated \u2014 they accept traffic from any source' - The agent covers this explicitly in the 'Default Baseline' section, stating 'By default, all Pods are non-isolated' for both ingress and egress. (2) 'A Pod becomes isolated as soon as any NetworkPolicy in its namespace selects it via podSelector' - The agent covers this, explaining that isolation occurs when a NetworkPolicy selects the Pod via podSelector. (3) 'only traffic explicitly allowed by the union of NetworkPolicies that select that Pod is permitted' - The agent covers this concept, stating 'only the following connections are allowed' and 'only connections explicitly allowed by the rules...are permitted'. (4) 'NetworkPolicy rules can target ingress, egress, or both' - The agent covers this extensively with separate sections on ingress and egress isolation. (5) 'the CNI plugin is what enforces the policy \u2014 Kubernetes itself does not' - The agent does NOT mention the CNI plugin or clarify that Kubernetes itself does not enforce the policy. This is a notable omission of a key technical point from the reference.",
+    "evidence_quotes": [
+      "Reference: 'By default, Pods are non-isolated' | Agent: 'By default, all Pods are non-isolated for both ingress and egress traffic'",
+      "Reference: 'A Pod becomes isolated as soon as any NetworkPolicy in its namespace selects it via podSelector' | Agent: 'A Pod becomes isolated for ingress if there is any NetworkPolicy that selects the Pod (via podSelector)'",
+      "Reference: 'only traffic explicitly allowed by the union of NetworkPolicies that select that Pod is permitted' | Agent: 'only the following connections are allowed' and 'only connections explicitly allowed by the rules'",
+      "Reference: 'NetworkPolicy rules can target ingress, egress, or both' | Agent: Covers this with separate ingress and egress sections",
+      "Reference: 'the CNI plugin is what enforces the policy \u2014 Kubernetes itself does not' | Agent: No mention of CNI plugin or enforcement mechanism"
+    ],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0041456,
+    "latency_ms": 9525.302856054623,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "The reference answer makes one key point: Kubernetes RBAC does not support deny rules; it is purely additive, and the way to prevent an action is to not grant the permission. The answer to score covers this point explicitly in the opening section ('Standard Kubernetes RBAC does not have explicit deny rules. RBAC is allow-based only') and recommends Option 1 (not granting the delete verb) as the primary solution. However, the answer goes significantly beyond the reference by introducing two additional options (CEL-based Authorization Policy and Admission Controllers) that are not mentioned in the reference. While these additions are correct and helpful, the scoring rubric explicitly states: 'The judge does not penalize the agent for adding correct extra detail (that's a separate concern). Score only on what fraction of the reference's points are present.' The single key point from the reference\u2014that RBAC has no deny rules and you must not grant permissions\u2014is fully covered in the answer.",
+    "evidence_quotes": [
+      "Reference: 'Kubernetes RBAC does not support deny rules...there are no deny rules. To prevent a user from deleting Pods you simply do not grant them a Role that contains the delete verb'",
+      "Answer: 'Standard Kubernetes RBAC does not have explicit deny rules. RBAC is allow-based only...The simplest approach is to not grant the delete verb to the user.'"
+    ],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0041928,
+    "latency_ms": 13928.576675010845,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0007776,
+    "latency_ms": 2871.9944750191644,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "The reference answer identifies four key points: (1) NodePort exposes on each Node's IP at a static port, (2) LoadBalancer uses external load balancer from cloud integration, (3) ExternalName maps to external DNS name via CNAME record, and (4) ClusterIP is default/cluster-internal only, with Ingress as alternative. The agent's answer covers: (1) NodePort - fully covered with detailed explanation, (2) LoadBalancer - fully covered, (3) ExternalIPs - this is mentioned but is NOT the same as ExternalName from the reference (ExternalIPs route to cluster nodes; ExternalName maps to external DNS via CNAME), (4) ClusterIP - covered as default/cluster-internal, (5) Ingress - mentioned as alternative. The agent's answer is missing the ExternalName Service type entirely, which is explicitly listed in the reference as one of the three main types that expose applications outside the cluster. The agent substitutes ExternalIPs instead, which is a different mechanism. This represents incomplete coverage of the reference's key points.",
+    "evidence_quotes": [
+      "Reference lists: NodePort, LoadBalancer, ExternalName, and ClusterIP/Ingress alternative",
+      "Agent covers: NodePort (\u2713), LoadBalancer (\u2713), ExternalIPs (\u2717 - not in reference), ClusterIP (\u2713), Ingress (\u2713)",
+      "Missing: ExternalName (maps Service to external DNS name via CNAME record)"
+    ],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0036983999999999997,
+    "latency_ms": 8602.465078001842,
+    "item_id": "k8s_025"
+  }
+]