Spaces:

Nomearod
/

agentbench

Running

Jane Yeung Claude Opus 4.7 (1M context) commited on 26 days ago

Commit

e16544c

unverified ·

1 Parent(s): ee729e0

rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold (#20)

* rubric: clarify groundedness reference scope (snippets-only) for v1.1 gold

The Opus stress-test pass disagreed with the v1.0 single-rater human gold
on 22 of 30 groundedness items, all in the same direction (human=1,
opus=0). The pattern traced to a reference-scope drift: rubric scopes
groundedness to the gold snippets only, but the v1.0 labels were
calibrated against the broader corpus. Strict-snippet entailment is
methodologically load-bearing — it measures retrieval-grounded behavior,
not LLM general knowledge passing through a RAG harness — so the rubric
holds and the labels move.

This commit lands three coupled artifacts:

1. Rubric clarification (agent_bench/evaluation/rubrics/groundedness.md).
Adds an explicit reference-scope section with normative "must score 0",
a trivial-inference clause covering paraphrase + unit conversion +
syntactic variation + canonical-name carve-out, and three calibration
anchors (k8s_006 dramatic over-extension, q006 subtle embellishment,
q021 trivial-inference positive). v1.2-debt sentence flags the
canonical-name carve-out as the most likely revision target.

2. v1.1 corrected gold (measurements/2026-05-04-judge-calibration-labels
.jsonl). 21 of 22 disagreed-on items re-labeled to 0 against the
strict rubric; q021 held at 1 under the trivial-inference clause; q006
flipped to 0 on second-pass strict re-read (claim "particularly useful
for expensive operations like database connections" is added LLM
knowledge, not snippet-entailed). Post-flip distribution: 1/29 on
groundedness, by design — the strict rubric is a hard target.

3. DECISIONS.md entry capturing the methodology and audit trail.

Also tracked: measurements/2026-05-05-judge-rubric-opus-stress.jsonl (the
90-row Opus second-rater pass that surfaced the drift) and
results/calibration_v1_system_outputs.json (the frozen orchestration
output the labels' system_output_hash references). .gitignore narrowed
to opt evidence-bearing measurements/*.jsonl through the *.jsonl
telemetry rule.

Post-merge: make calibrate runs against this v1.1 gold and emits the κ
ablation table. Per the DECISIONS.md entry, AC1 is the load-bearing
statistic for groundedness (Cohen's κ degrades under the 1/29 prevalence
imbalance); both will be reported with the methodology argument.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* judges: strip markdown fence + bump max_tokens for v1.1 rubric

First post-rubric-v1.1 calibrate run had 78/82 cells abstain on
ABSTAIN_REASON_SCHEMA_PARSE in the baseline row. Two coupled
production-code fixes drop that to 0:

1. **Markdown fence stripping.** Haiku 4.5 wraps its JSON output in
```json ... ``` even when the prompt asks for a bare JSON object,
producing responses that start with the literal backtick character.
`json.loads` failed at char 0 with "Expecting value: line 1 column 1".
New `_strip_markdown_fence` helper unwraps the fence (idempotent;
no-op when absent) before the parse. The strict-reprompt suffix also
gets a "do not wrap in a code fence" instruction so the retry path
has a fighting chance even on models that ignore the helper's case.

2. **`max_tokens` 512 -> 1024.** Initial diagnosis was truncation; with
the longer v1.1 rubric eliciting longer model reasoning, 512 tokens
was tight even when the response was well-formed. Fence-strip is the
load-bearing fix, but the bump prevents a future rubric revision
from re-tripping the same wire.

Validation:
- 51/51 rubric+judge unit tests pass.
- Smoke test of `_strip_markdown_fence` covers bare JSON, ```json fence,
``` fence (no language), and whitespace-padded variants.
- Re-run of `run-judges --row-config=baseline.yaml` against the v1.1
gold yields count=82, 0 schema_parse failures, 1 transient
provider-exhausted (down from 77/5 in the broken run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration: item_id fix + κ ablation table run on v1.1 gold

Three coupled changes on top of the rubric clarification:

1. **scripts/run_calibration.py — fix dropped item_id.** Three score_one
callsites (single-strategy, rubric_permute, jury) were serializing
`{"dimension": dim, **result.model_dump()}`, which doesn't include
item_id (ScoreResult has system_output_hash, not item_id). The κ
table loader joins predictions to labels by (item_id, dimension) and
crashed with KeyError. Fix: prepend `"item_id": rec["item_id"]` to
each returned dict. The 6 already-written row files were backfilled
in-place via hash → item_id map from system_outputs.json (492/492
records mapped cleanly, 0 unmapped) so we didn't have to re-spend
~$1 on judge calls.

2. **results/calibration_v1_judge_*.json (6 row files) +
.._members.jsonl (2 sidecars).** All 82 cells per row scored against
the v1.1 gold. Distribution sanity-checked per row; baseline matches
gold within run-to-run noise; jury and permute clean (0 abstains).

3. **docs/_generated/kappa_table.md.** First v1.1 κ ablation. Reads as
a *partial* artifact pending three follow-ups documented in the
DECISIONS.md outcome paragraph: AC1 alongside Cohen's κ on
prevalence-skewed dimensions (relevance, groundedness), abstain-rate
investigation on no_cot completeness (κ=1.000 at n=24 may be
selective abstain), and per-member κ analysis on the jury vs
baseline completeness regression (the realized risk).

4. **.gitignore — `!results/*.jsonl` exception** so the jury and permute
per-member sidecars are tracked alongside the row JSON files. They
are evidence artifacts the κ table joins against, not telemetry.

5. **DECISIONS.md outcome paragraph.** Records the three production-code
fixes that were coupled to the rubric clarification (fence-strip,
max_tokens bump, item_id) so a future reader of the supersession
entry can recover the artifact provenance, plus the three κ-table
reading questions the writeup must address before claiming numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* calibration v1.1: AC1 reporting + jury underperformance finding

- report.py: per-dimension headline metric. Cohen's κ degenerates on
prevalence-skewed dimensions (groundedness 1×1/29×0, relevance
29×2/1×1) where Pe approaches 1; AC1 (Gwet 2008, unweighted) stays
informative. _DIM_METRIC selects AC1 for groundedness + relevance
and Cohen's κ for completeness (gold 23×2/5×1, balanced enough for
κ to behave). Headline footer in the rendered table makes the
per-dim choice explicit so a reader uses the right statistic.

- docs/_generated/kappa_table.md: regenerated against the v1.1
corrected gold. Relevance now reads 0.96–1.00 AC1 (was 0.000 κ
across 5/6 rows — degenerate); groundedness reads 0.90–1.00 AC1.

- DECISIONS.md: appended three v1.1 findings to the supersession
entry's outcome paragraph.
(1) Relevance κ=0 across 5/6 rows is a prevalence artifact, not
judges failing — AC1 makes this honest in the table.
(2) no_cot completeness κ=1.000 is real, not selective abstain.
The 2 absent cells are infrastructure rate-limit abstains; on
the 24 scored cells, all 4 baseline-with-CoT disagreements flip
to agreement when CoT is removed. CoT-before-score on 3-point
completeness lets the judge over-emphasize partial coverage —
opposite of the conventional CoT-helps story.
(3) jury_kappa_weighted underperformed baseline on completeness
with a precise mechanism: per-member analysis of the sidecar
shows gpt-4o-mini-2024-07-18 is systematically harsh on the
3-point scale (κ=0.020 vs Haiku's 0.416). The kappa-weighting
stub at scripts/run_calibration.py::_load_weights_from_baseline
returns weight=1.0 for every judge_id in baseline.json — so
Haiku gets 1.0 from the stub, gpt-4o-mini gets 1.0 from jury.py
missing-key fallback. Equal weights on disputed (2,1) cells
produce mean=1.5; _discretize_mean's frac>0.5-else-floor rule
rounds 1.5 down to 1; gpt-4o-mini's verdict wins every disputed
cell. Both branches of the v1 risks subsection ("kappa-weighting
wrong, or worse judge drags mean") fired simultaneously.

- v1.2 fix-list expanded to four items: held-out jury weights;
symmetric member coverage in weights source (missing-key fallback
is the bug, not the documented behavior); per-dimension member
exclusion when individual κ < threshold (weighting cannot rescue
systematic bias); per-dimension tie-break rule (round-down's
binary-conservative semantics don't transfer to ordinal scales).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(report): lift _DIM_METRIC tuple-unpack to avoid mypy 1.19+ INTERNAL ERROR

CI ran mypy 1.20.2; the previous commit had two `metric_name, _ =
_DIM_METRIC.get(dim, ("κ", cohen_kappa))` tuple-destructuring sites
that triggered mypy's INTERNAL ERROR -- not a real type error, mypy
itself crashed (reproducible locally with mypy 1.19.1). No traceback
was useful even with --show-traceback.

Fix: lift the `metric_name, metric_fn = _DIM_METRIC.get(...)` resolution
out of all use sites and into one call at the top of the `for dim in
sorted(preds_by_dim.keys()):` loop. Same observable behavior (still
falls back to ("κ", cohen_kappa) for an unknown dim, which never
fires in practice but stays defensive). The four downstream branches
just reference the already-bound metric_name and metric_fn now.

Also adds explicit `dict[str, tuple[str, _MetricFn]]` annotation on
_DIM_METRIC. Doesn't fix the crash on its own but documents intent
and lets a future mypy upgrade re-narrow the type cleanly.

Validation:
- mypy agent_bench/ --ignore-missing-imports → "Success: no issues
found in 53 source files"
- pytest tests/evaluation/test_calibration_report.py → 8/8 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (18) hide show

.gitignore +7 -0
DECISIONS.md +214 -0
agent_bench/evaluation/calibration/report.py +68 -16
agent_bench/evaluation/judges/base.py +34 -3
agent_bench/evaluation/rubrics/groundedness.md +117 -44
docs/_generated/kappa_table.md +24 -0
measurements/2026-05-04-judge-calibration-labels.jsonl +90 -0
measurements/2026-05-05-judge-rubric-opus-stress.jsonl +90 -0
results/calibration_v1_judge_baseline.json +0 -0
results/calibration_v1_judge_baseline_no_abstain.json +0 -0
results/calibration_v1_judge_baseline_no_anchors.json +0 -0
results/calibration_v1_judge_baseline_no_cot.json +2115 -0
results/calibration_v1_judge_jury_kappa_weighted.json +1068 -0
results/calibration_v1_judge_jury_kappa_weighted_members.jsonl +0 -0
results/calibration_v1_judge_permute.json +1068 -0
results/calibration_v1_judge_permute_members.jsonl +0 -0
results/calibration_v1_system_outputs.json +0 -0
scripts/run_calibration.py +3 -3

.gitignore CHANGED Viewed

@@ -24,6 +24,13 @@ venv/
 logs/
 *.jsonl
 # Opaque binary artifacts — no PDFs in the repo today, and any that
 # appear here are almost always local reference material (downloaded
 # papers, vendor docs) that should not be committed. If a PDF ever

 logs/
 *.jsonl
+# Evidence-bearing measurement artifacts referenced from DECISIONS.md.
+# Narrow exception to the *.jsonl ignore above. Add new measurement files
+# explicitly here so the audit-trail intent stays opt-in.
+!measurements/*.jsonl
+# Calibration jury/permute sidecars (per-member detail for κ ablation table).
+!results/*.jsonl
 # Opaque binary artifacts — no PDFs in the repo today, and any that
 # appear here are almost always local reference material (downloaded
 # papers, vendor docs) that should not be committed. If a PDF ever

DECISIONS.md CHANGED Viewed

@@ -2160,3 +2160,217 @@ defaults to the three v1 dimensions. Zero user-facing config migration.
 Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
 prompt optimization, citation_faithfulness in the default
 judge_dimensions, AC2 sympy-derived parity tests.

 Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
 prompt optimization, citation_faithfulness in the default
 judge_dimensions, AC2 sympy-derived parity tests.
+## Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran — 2026-05-05
+The Opus stress-test pass over the 30 calibration items × 3 dimensions
+disagreed with the single-rater human gold on **22 of 30 groundedness
+items** (8/30 agreement). Relevance and completeness agreed at 28/30 and
+25/30 respectively. The groundedness disagreement is consistent in
+direction — every disagreed-on item is `human=1, opus=0` — and has a
+single root cause.
+**Root cause: reference-scope drift between rubric author and labeler.**
+`agent_bench/evaluation/rubrics/groundedness.md` defines the reference
+scope as the gold snippets attached to each item:
+> The judge sees only the gold snippets — not the retrieved chunks. A
+> claim that happens to be true in the world but is not entailed by the
+> snippets fails groundedness.
+The single-rater notes on the disagreed-on items describe checking
+against the broader documentation, not against `source_snippets`:
+"supported by the corpus", "supported by the docs", "supported by the
+provided dependency snippet". For items like `k8s_006` the gold snippet
+is one sentence ("A ConfigMap is an API object used to store
+non-confidential data in key-value pairs"), while the agent's answer
+correctly synthesizes seven or eight additional claims from the full
+`k8s_configmap.md`. Those claims are true in the world and well-supported
+by the full doc, but **not entailed by the one snippet**. Opus applied
+the strict-snippet rubric; the human rater applied a corpus-supported
+rubric.
+**Why this blocks `make calibrate` against the current gold.** The κ
+ablation compares Haiku and GPT-4o-mini judges against the human gold.
+A judge that correctly applies the strict-snippet rubric will disagree
+with miscalibrated gold; a judge that's too lenient will agree. The
+ablation rewards leniency and punishes rigor — the opposite of the
+intended measurement. This is the same failure mode codified earlier in
+this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune
+compensation when the measurement is wrong, not the intended effect.
+**Why the rubric stays as written, not relaxed to "corpus-supported".**
+Strict-snippet groundedness measures *RAG behavior*: did the agent
+synthesize from what it retrieved? Corpus-supported groundedness
+measures *LLM general knowledge passing through a RAG harness*: did the
+agent happen to be correct? The first is what this benchmark is for;
+the second is what `agent_bench/evaluation/metrics.py` measured before
+supersession. Relaxing the rubric to "corpus-supported" would silently
+re-introduce the failure mode the supersession entry above just removed.
+**Decision — three-step correction lands before `make calibrate` runs:**
+1. **Rubric clarification commit on `agent_bench/evaluation/rubrics/groundedness.md`.**
+   Add an explicit reference-scope line and one anchored example
+   contrasting "supported by the snippet" vs "true in the world but
+   not in the snippet". Audit-trail requirement: the v1.1 writeup will
+   cite "rubric clarified between v1.0 and v1.1", and the git history
+   needs to back that claim.
+2. **Re-label the 22 disagreed-on groundedness items** in
+   `measurements/2026-05-04-judge-calibration-labels.jsonl` against the
+   clarified rubric, snippet-only. **Do not mechanically copy Opus's
+   labels.** The labels remain the human single-rater's; what changes is
+   the rubric being applied. Mechanical copy would turn the κ table
+   into "judge vs Opus", which is not what the writeup claims it
+   measures.
+3. **Recompute `make calibrate` against the corrected gold** and emit
+   `docs/_generated/kappa_table.md` from the v1.1 labels.
+**Evidence files for the v1.1 writeup section:**
+- `measurements/2026-05-05-judge-rubric-opus-stress.jsonl` — 90 Opus
+  labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min
+  org rate limit, ~$0.20, ~14 min wall, zero infra-abstains).
+- `measurements/2026-05-04-judge-calibration-labels.jsonl` — original
+  v1.0 single-rater gold; will be diffed against v1.1 corrected gold to
+  quantify the re-label delta.
+- `agent_bench/evaluation/rubrics/groundedness.md` — pre/post diff is
+  the rubric clarification.
+**Pre-labeling observations also worth recording for the writeup
+methodology section:**
+- `q021` (fastapi · calculation) answered the CORS preflight question
+  correctly (600 / 60 = 10 minutes) with `sources: []` and
+  `ranked_sources: []` — the agent did the arithmetic without retrieval
+  and emitted an answer consistent with the snippet without having
+  retrieved it. Methodologically interesting for the
+  citation-faithfulness story (Block 2.7) if it ships: an answer can be
+  correct without being grounded-by-citation.
+- `q025` (fastapi · multi_hop) answer was truncated mid-token by the
+  orchestrator's max_tokens limit. The labels reflect what the system
+  produced, not a mentally-patched complete version. The completeness
+  rubric does not currently anchor "truncated response" as a level —
+  v1.1 rubric work should add an anchor.
+- Several K8s items embed external knowledge that's correct but not in
+  the snippet phrasing (`k8s_017` mentions exit-code-0 for init-container
+  success; `k8s_009` describes Roles vs ClusterRoles by their semantics).
+  The clarified groundedness rubric should pick **strict** on this case
+  (claim must be supportable by the retrieved spans, not just consistent
+  with them) and the anchored example should show that ruling.
+**Methodology framing for the writeup.** The Opus stress-test was added
+specifically to catch hand-labeled-gold fragility before the κ table is
+published. It caught it. The writeup's calibration section should
+disclose the rubric clarification, quantify the re-label delta on
+groundedness, and report κ against the v1.1 corrected gold — that is a
+more credible story than a first-try clean κ table would have been.
+**Outcome — 2026-05-05 calibrate run on v1.1 gold.** All 6 ablation rows
+ran cleanly after three coupled production-code fixes that landed on the
+same branch as the rubric clarification: (1) markdown fence stripping in
+`agent_bench/evaluation/judges/base.py::_strip_markdown_fence` because
+Haiku 4.5 wraps JSON output in ` ```json ... ``` `, (2) `max_tokens`
+512 → 1024 because v1.1 anchored examples elicit longer model reasoning,
+(3) calibration runner v1.0 omitted `item_id` from prediction records;
+fixed in v1.1 with backfill of the 6 already-written row files via
+`hash → item_id` map (no re-spend). Probe-one-cell-before-sweep saved a
+fourth $0.50 wasted run after the fence-strip change — the methodology
+note in `feedback_judge_probe_before_sweep.md` was earned by this
+session's two failed full-row attempts that paid ~$1.15 for unparseable
+output before the diagnosis converged.
+The κ table at `docs/_generated/kappa_table.md` (regenerated on
+2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for
+completeness — see report.py `_DIM_METRIC`) shows three findings
+that the writeup interprets rather than reports verbatim:
+**v1.1 finding 1 — relevance is not "judges fail" territory.**
+Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the
+29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00.
+AC1 is the load-bearing statistic on relevance and groundedness; both
+metrics agree on completeness where the gold (23×2 / 5×1) is balanced.
+**v1.1 finding 2 — `no_cot completeness` agreement is real, not
+selective abstain.** AC1 = κ = 1.000 at n=24. The 2 absent cells
+(`q021`, `k8s_012`) are infrastructure abstains (provider rate-limit
+retry exhaustion), both gold=`2`, neither in baseline's disagreement
+set. On the 24 scored cells, all 4 baseline-with-CoT disagreements
+(3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to
+agreement when CoT is removed. The interview-relevant claim is the
+*opposite* of the conventional CoT-helps story: CoT-before-score on
+3-point completeness lets the judge over-emphasize partial coverage
+and rationalize `1` when the human gold sides with the holistic
+"covers the points" reading.
+**v1.1 finding 3 — `jury_kappa_weighted` underperformed baseline on
+completeness, with a precise mechanism.** Per-member analysis from
+`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`:
+Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%;
+gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw
+26.9% — systematically harsh on the 3-point scale, almost never
+scoring `2`. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9% —
+matches gpt-4o-mini alone exactly because the jury verdict reduces
+to gpt-4o-mini's verdict on every disputed cell.
+The mechanism is *missing-weight + round-down* compounding, not
+weighted voting in the usual sense. `scripts/run_calibration.py
+::_load_weights_from_baseline` is a documented v1 stub that returns
+weight = 1.0 for every judge_id present in baseline. baseline.json
+contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini
+gets 1.0 from `jury.py`'s missing-key fallback (with a logged
+`jury_missing_weight_fallback_to_one` warning per call). Equal
+weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean
+of 1.5; the `_discretize_mean` rule is `frac > 0.5 → ceil else floor`,
+and `0.5 > 0.5` is false, so 1.5 floors to 1. gpt-4o-mini's verdict
+wins every disputed cell. The v1 design doc's risks subsection listed
+"jury κ worse than the better individual judge — (a) kappa-weighting
+wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired
+*both* branches simultaneously: branch (a) because the weighting is a
+stub returning equal weights, and branch (b) because round-down at
+exact 0.5 ties hands the verdict to the lower-scoring member.
+The deeper structural point is that weighting alone cannot rescue a
+systematically miscalibrated member. Even held-out validation that
+correctly assigned gpt-4o-mini's true low weight on completeness
+would still let it dominate disputed ties unless its weight were
+driven near zero — and at that point exclusion is more honest than
+near-zero inclusion. The conservative-on-binary "ties to lower" rule
+also doesn't transfer cleanly to ordinal scales: on completeness,
+"conservative" means scoring *toward incomplete*, which is precisely
+the direction of gpt-4o-mini's bias.
+**v1.2 fix list (four items, expanding the earlier two-item list):**
+1. **Held-out jury weights.** Replace the
+   `_load_weights_from_baseline` stub with a real κ-derived
+   computation, evaluated on a *held-out validation set* — not the
+   same calibration row whose κ is being measured against the gold.
+   Closes the circular-weighting hole.
+2. **Symmetric member coverage in the weights source.** Missing-member
+   fallback to weight = 1.0 amplifies an unweighted member rather than
+   suppressing it. Either every jury member must have a weight in the
+   source file or the run must abort. The `jury_missing_weight_
+   fallback_to_one` warning fired loudly on every call this run; in
+   v1.2 it should be a hard error.
+3. **Per-dimension member exclusion when individual κ falls below a
+   threshold.** gpt-4o-mini at κ = 0.020 on completeness should not be
+   in the completeness jury at all. Weights below a floor (suggested
+   κ < 0.2) should be treated as exclusion, not as small-weight
+   inclusion. Held-out validation fixes circular weighting; it does
+   not fix systematic member bias.
+4. **Per-dimension tie-break rule.** v1's `_discretize_mean` rule
+   (ties to lower) was selected for conservative behavior on binary
+   scales, where "conservative" means scoring 0 on uncertainty. On
+   3-point completeness, "conservative" means scoring toward
+   *incomplete*, which interacts badly with member miscalibration.
+   v1.2 should select the tie-break rule per-dimension based on the
+   rubric's conservative direction, not globally.
+**Evidence files:** `docs/_generated/kappa_table.md` (regenerated with
+AC1 for groundedness/relevance, κ for completeness);
+`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`
+(per-member sidecar where the gpt-4o-mini completeness bias is
+visible per item); `results/calibration_v1_judge_baseline.json`
+(weights source — note the absence of any gpt-4o-mini-2024-07-18
+entries, which is why the missing-weight fallback fires).

agent_bench/evaluation/calibration/report.py CHANGED Viewed

@@ -8,11 +8,16 @@ from __future__ import annotations
 import glob as _glob
 import json
 from collections import defaultdict
 from pathlib import Path
 import structlog
-from agent_bench.evaluation.calibration.metrics import bootstrap_ci, cohen_kappa
 from agent_bench.evaluation.judges.base import (
     ABSTAIN_REASON_OUT_OF_RANGE,
     ABSTAIN_REASON_PROVIDER_EXHAUSTED,
@@ -23,6 +28,27 @@ logger = structlog.get_logger()
 ABSTAIN_THRESHOLD = 0.20  # strictly greater than fires the flag
 # Filename marker for jury / permute sidecar files. Any prediction file whose
 # basename contains this token is per-member detail, not aggregate predictions,
 # and is excluded from the κ table. Pinned here so a future extension change
@@ -114,6 +140,14 @@ def generate_kappa_table(
             labels_by_dim[label_rec["dimension"]].append(label_rec)
         for dim in sorted(preds_by_dim.keys()):
             preds_d = {p["item_id"]: p for p in preds_by_dim[dim]}
             labs_d = {
                 label_rec["item_id"]: label_rec
@@ -163,6 +197,7 @@ def generate_kappa_table(
                     {
                         "row": row_label,
                         "dim": dim,
                         "kappa": None,
                         "ci_lo": None,
                         "ci_hi": None,
@@ -171,23 +206,24 @@ def generate_kappa_table(
                         "abstain_rate": abstain_rate,
                         "abstain_causes": abstain_causes,
                         "footnote": (
-                            f"κ undefined: insufficient agreement-eligible "
-                            f"items (N={n_eligible})"
                         ),
                     }
                 )
                 continue
             try:
-                kappa = cohen_kappa(y_lab, y_pred)
                 point, lo, hi = bootstrap_ci(
-                    y_lab, y_pred, cohen_kappa, n_iter=1000, seed=42
                 )
             except (ValueError, ZeroDivisionError):
                 rows.append(
                     {
                         "row": row_label,
                         "dim": dim,
                         "kappa": None,
                         "ci_lo": None,
                         "ci_hi": None,
@@ -196,8 +232,8 @@ def generate_kappa_table(
                         "abstain_rate": abstain_rate,
                         "abstain_causes": abstain_causes,
                         "footnote": (
-                            "κ undefined: insufficient variance after "
-                            "exclusion"
                         ),
                     }
                 )
@@ -211,6 +247,7 @@ def generate_kappa_table(
                     {
                         "row": row_label,
                         "dim": dim,
                         "kappa": None,
                         "ci_lo": None,
                         "ci_hi": None,
@@ -219,8 +256,9 @@ def generate_kappa_table(
                         "abstain_rate": abstain_rate,
                         "abstain_causes": abstain_causes,
                         "footnote": (
-                            "κ undefined: all labels and predictions in a "
-                            "single category (no variance to measure)"
                         ),
                     }
                 )
@@ -234,15 +272,17 @@ def generate_kappa_table(
                     if v > 0
                 )
                 footnote = (
-                    f"κ computed on N={n_eligible} of {len(common)} items; "
-                    f"high abstain rate ({100 * abstain_rate:.1f}% — "
-                    f"breakdown: {breakdown}) suggests rubric ambiguity."
                 )
             rows.append(
                 {
                     "row": row_label,
                     "dim": dim,
                     "kappa": kappa,
                     "ci_lo": lo,
                     "ci_hi": hi,
@@ -255,8 +295,20 @@ def generate_kappa_table(
             )
     out = ["# κ ablation table — calibration v1\n"]
-    out.append("| Row | Dimension | κ (95% CI) | N | Abstain rate | Notes |")
-    out.append("|---|---|---|---|---|---|")
     for r in rows:
         if r["kappa"] is None:
             kcell = " — "
@@ -264,8 +316,8 @@ def generate_kappa_table(
             kcell = f"{r['kappa']:.3f} ({r['ci_lo']:.3f}, {r['ci_hi']:.3f})"
         rate = f"{100 * r['abstain_rate']:.1f}%"
         out.append(
-            f"| {r['row']} | {r['dim']} | {kcell} | {r['n_eligible']} | "
-            f"{rate} | {r['footnote']} |"
         )
     Path(output_path).parent.mkdir(parents=True, exist_ok=True)

 import glob as _glob
 import json
 from collections import defaultdict
+from collections.abc import Callable
 from pathlib import Path
 import structlog
+from agent_bench.evaluation.calibration.metrics import (
+    bootstrap_ci,
+    cohen_kappa,
+    gwets_ac2,
+)
 from agent_bench.evaluation.judges.base import (
     ABSTAIN_REASON_OUT_OF_RANGE,
     ABSTAIN_REASON_PROVIDER_EXHAUSTED,
 ABSTAIN_THRESHOLD = 0.20  # strictly greater than fires the flag
+# Per-dimension headline metric. Cohen's κ degenerates under the prevalence
+# imbalance produced by the v1.1 strict-snippet groundedness rubric (1×score=1,
+# ~25×score=0) and by the inherent skew on relevance (29×score=2, 1×score=1):
+# both Po and Pe approach 1.0, the formula collapses to ~0/0, and the rendered
+# κ reads as 0.000 even when raw agreement is >95%. Gwet's AC1 (gwets_ac2 with
+# weights=None per metrics.py) uses mean marginals and stays informative under
+# imbalance. Completeness has a more balanced gold (23×2, 5×1, 2×Unknown) so
+# Cohen's κ is the conventional choice there. The metric per dim is rendered
+# explicitly in the footer so a writeup reader sees the methodology choice.
+# Type annotation prevents a mypy 1.20.x INTERNAL ERROR triggered by the
+# tuple-unpack of `_DIM_METRIC.get(dim, default)` further down. Without it
+# mypy fails to infer the metric_fn callable signature consistently across
+# the dict literal and the fallback default, and crashes with no real
+# user-facing type error to fix.
+_MetricFn = Callable[[list, list], float]
+_DIM_METRIC: dict[str, tuple[str, _MetricFn]] = {
+    "groundedness": ("AC1", gwets_ac2),
+    "relevance": ("AC1", gwets_ac2),
+    "completeness": ("κ", cohen_kappa),
+}
 # Filename marker for jury / permute sidecar files. Any prediction file whose
 # basename contains this token is per-member detail, not aggregate predictions,
 # and is excluded from the κ table. Pinned here so a future extension change
             labels_by_dim[label_rec["dimension"]].append(label_rec)
         for dim in sorted(preds_by_dim.keys()):
+            # Resolve dimension's headline metric once per dim, instead of
+            # tuple-unpacking _DIM_METRIC.get(...) at each use site below.
+            # The repeated unpack pattern triggered a mypy 1.19+ INTERNAL
+            # ERROR; one resolution call here is also less code.
+            metric_name, metric_fn = _DIM_METRIC.get(
+                dim, ("κ", cohen_kappa)
+            )
             preds_d = {p["item_id"]: p for p in preds_by_dim[dim]}
             labs_d = {
                 label_rec["item_id"]: label_rec
                     {
                         "row": row_label,
                         "dim": dim,
+                        "metric": metric_name,
                         "kappa": None,
                         "ci_lo": None,
                         "ci_hi": None,
                         "abstain_rate": abstain_rate,
                         "abstain_causes": abstain_causes,
                         "footnote": (
+                            f"{metric_name} undefined: insufficient "
+                            f"agreement-eligible items (N={n_eligible})"
                         ),
                     }
                 )
                 continue
             try:
+                kappa = metric_fn(y_lab, y_pred)
                 point, lo, hi = bootstrap_ci(
+                    y_lab, y_pred, metric_fn, n_iter=1000, seed=42
                 )
             except (ValueError, ZeroDivisionError):
                 rows.append(
                     {
                         "row": row_label,
                         "dim": dim,
+                        "metric": metric_name,
                         "kappa": None,
                         "ci_lo": None,
                         "ci_hi": None,
                         "abstain_rate": abstain_rate,
                         "abstain_causes": abstain_causes,
                         "footnote": (
+                            f"{metric_name} undefined: insufficient "
+                            f"variance after exclusion"
                         ),
                     }
                 )
                     {
                         "row": row_label,
                         "dim": dim,
+                        "metric": metric_name,
                         "kappa": None,
                         "ci_lo": None,
                         "ci_hi": None,
                         "abstain_rate": abstain_rate,
                         "abstain_causes": abstain_causes,
                         "footnote": (
+                            f"{metric_name} undefined: all labels and "
+                            f"predictions in a single category (no variance "
+                            f"to measure)"
                         ),
                     }
                 )
                     if v > 0
                 )
                 footnote = (
+                    f"{metric_name} computed on N={n_eligible} of "
+                    f"{len(common)} items; high abstain rate "
+                    f"({100 * abstain_rate:.1f}% — breakdown: {breakdown}) "
+                    f"suggests rubric ambiguity."
                 )
             rows.append(
                 {
                     "row": row_label,
                     "dim": dim,
+                    "metric": metric_name,
                     "kappa": kappa,
                     "ci_lo": lo,
                     "ci_hi": hi,
             )
     out = ["# κ ablation table — calibration v1\n"]
+    out.append(
+        "Headline metric per dimension: " + ", ".join(
+            f"**{d} → {m}**" for d, (m, _) in _DIM_METRIC.items()
+        ) + ". "
+        "AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold "
+        "is prevalence-skewed enough to make Cohen's κ degenerate "
+        "(groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); both metrics "
+        "produce ≥0.95 raw agreement on those rows but Cohen's κ collapses "
+        "to ≈0 because Pe approaches 1. Completeness uses Cohen's κ — its "
+        "gold (23×`2`/5×`1`) is balanced enough for κ to behave normally."
+    )
+    out.append("")
+    out.append("| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |")
+    out.append("|---|---|---|---|---|---|---|")
     for r in rows:
         if r["kappa"] is None:
             kcell = " — "
             kcell = f"{r['kappa']:.3f} ({r['ci_lo']:.3f}, {r['ci_hi']:.3f})"
         rate = f"{100 * r['abstain_rate']:.1f}%"
         out.append(
+            f"| {r['row']} | {r['dim']} | {r['metric']} | {kcell} | "
+            f"{r['n_eligible']} | {rate} | {r['footnote']} |"
         )
     Path(output_path).parent.mkdir(parents=True, exist_ok=True)

agent_bench/evaluation/judges/base.py CHANGED Viewed

@@ -405,10 +405,27 @@ class MockJudge(Judge):
 _STRICT_REPROMPT_SUFFIX = (
     "\n\nSTRICT FORMATTING NOTE: respond ONLY with a JSON object matching "
-    "the schema; reasoning first, then evidence_quotes, then score."
 )
 async def _call_judge_with_retry(
     *,
     provider: "LLMProvider",
@@ -420,11 +437,19 @@ async def _call_judge_with_retry(
     system_output_hash: str,
     item_id: str,
     abstain_allowed: bool = True,
-    max_tokens: int = 512,
 ) -> ScoreResult:
     """Send prompt to provider; one retry with strict reprompt on
     schema-parse / score-out-of-range; abstain on persistent failure
     or provider exhaustion. Re-raises unknown exceptions (caller bugs).
     """
     accumulated_cost = 0.0
     accumulated_latency = 0.0
@@ -458,8 +483,14 @@ async def _call_judge_with_retry(
         # Parse — reasoning and evidence_quotes are optional so judges
         # configured with use_cot=False (which prompt for {"score": ...}
         # only) don't fail parsing on the missing key.
         try:
-            data = _json.loads(response.content)
             reasoning = str(data.get("reasoning", ""))
             evidence_quotes = list(data.get("evidence_quotes", []))
             raw_score = data["score"]

 _STRICT_REPROMPT_SUFFIX = (
     "\n\nSTRICT FORMATTING NOTE: respond ONLY with a JSON object matching "
+    "the schema; reasoning first, then evidence_quotes, then score. "
+    "Do not wrap the JSON in a markdown code fence."
 )
+_MARKDOWN_FENCE_RE = re.compile(r"^\s*```(?:json|JSON)?\s*\n(.*?)\n```\s*$", re.DOTALL)
+def _strip_markdown_fence(text: str) -> str:
+    """Strip a leading/trailing ```json ... ``` markdown fence if present.
+    Some chat models wrap structured JSON in a markdown code fence even
+    when the prompt asks for a bare JSON object. The judge parser uses
+    json.loads on the raw content, which fails at char 0 on the literal
+    backtick. This helper unwraps the fence so the parse can proceed.
+    Idempotent: returns text unchanged if no fence is present.
+    """
+    m = _MARKDOWN_FENCE_RE.match(text.strip())
+    return m.group(1) if m else text
 async def _call_judge_with_retry(
     *,
     provider: "LLMProvider",
     system_output_hash: str,
     item_id: str,
     abstain_allowed: bool = True,
+    max_tokens: int = 1024,
 ) -> ScoreResult:
     """Send prompt to provider; one retry with strict reprompt on
     schema-parse / score-out-of-range; abstain on persistent failure
     or provider exhaustion. Re-raises unknown exceptions (caller bugs).
+    max_tokens defaults to 1024 (was 512 pre-v1.1). The v1.1 groundedness
+    rubric ships with calibration anchors whose verbose thinking traces
+    elicit longer model reasoning in turn; 512 truncated the JSON
+    response mid-reasoning and caused 78/82 schema_parse_failed
+    abstains in the first run after the rubric clarification. 1024 leaves
+    enough headroom; bump again if a future rubric revision pushes
+    reasoning longer.
     """
     accumulated_cost = 0.0
     accumulated_latency = 0.0
         # Parse — reasoning and evidence_quotes are optional so judges
         # configured with use_cot=False (which prompt for {"score": ...}
         # only) don't fail parsing on the missing key.
+        #
+        # Some models (observed on Haiku 4.5 under the v1.1 rubric) wrap
+        # their JSON in a ```json ... ``` markdown fence. Strip the fence
+        # before parsing rather than abstaining on a syntactically valid
+        # but conventionally formatted response.
+        content = _strip_markdown_fence(response.content)
         try:
+            data = _json.loads(content)
             reasoning = str(data.get("reasoning", ""))
             evidence_quotes = list(data.get("evidence_quotes", []))
             raw_score = data["score"]

agent_bench/evaluation/rubrics/groundedness.md CHANGED Viewed

@@ -7,13 +7,44 @@ abstain_allowed: true
 # Groundedness (binary)
-Score whether **every claim** in the agent's answer is directly supported by
-the gold source snippets attached to this item. Paraphrase is allowed; what
-matters is factual entailment.
-The judge sees only the gold snippets — not the retrieved chunks. A claim
-that happens to be true in the world but is not entailed by the snippets
-fails groundedness.
 **When to abstain (`"Unknown"`)**: if the answer is a refusal ("I don't
 know" / "not in the documentation") and there is nothing to ground, score
@@ -21,49 +52,91 @@ abstain rather than 1.
 ## Score 0
-At least one claim in the answer is not supported by any snippet.
-### Example A — answer adds an unsupported claim
-Question: "What's the default port for the dashboard?"
-Snippets: ["The dashboard listens on port 8080 by default."]
-Answer: "The default port is 8080 and TLS is enabled out of the box."
-Score=0 because the TLS claim has no support in the snippet. The strict-
-conjunction rule applies: even a single unsupported claim fails the binary
-groundedness rubric. The grounded portion of the answer doesn't redeem it.
-### Example B — answer paraphrases incorrectly
-Question: "How long do connections idle before timeout?"
-Snippets: ["Idle connections are closed after 30 seconds."]
-Answer: "Connections close after 30 minutes of inactivity."
-Score=0 because the unit is wrong (minutes vs seconds). Paraphrase is
-allowed but factual content must match.
 ## Score 1
-Every claim in the answer is directly supported by at least one snippet.
-### Example C — fully grounded one-fact answer
-Question: "What's the default port?"
-Snippets: ["The dashboard listens on port 8080 by default."]
-Answer: "Port 8080."
-Score=1 because the only claim is the port number, which is in the snippet.
-### Example D — fully grounded multi-claim answer
-Question: "What identity guarantees does a StatefulSet provide?"
-Snippets: [
-  "StatefulSet pods receive an ordinal index from 0 to N-1.",
-  "Each pod gets a stable hostname based on the StatefulSet name and ordinal.",
-  "Storage is persistent across pod restarts and reschedules."
-]
-Answer: "Pods are assigned ordinal indices, stable hostnames derived from
-the StatefulSet name + ordinal, and storage that persists across restarts."
-Score=1 because all three claims (ordinal indices, stable hostnames,
-persistent storage) are each supported by one snippet.

 # Groundedness (binary)
+Score whether **every claim** in the agent's answer is entailed by the gold
+source snippets attached to this item. Paraphrase is allowed; what matters
+is content equivalence, not surface form.
+## Reference scope (strict, clarified in v1.1)
+Reference scope is the **gold snippets only**, not the broader corpus, not
+the retrieved chunks, not the LLM's general domain knowledge. A claim that
+is factually correct in the world but not entailed by the snippets **must
+score 0**. The "must" forecloses the "well, mostly grounded" reading: a
+single ungrounded claim in an otherwise solid answer fails the binary
+rubric.
+The strict-entailment posture is a methodological choice. A claim that is
+correct because the model happened to know it isn't grounded — it's lucky.
+Strict-snippet groundedness measures *retrieval-grounded behavior*, not
+LLM general knowledge passing through a RAG harness.
+## Trivial inference is entailment
+Some surface-form variations of a snippet's content are entailment, not
+new claims. The test is **content equivalence**, not surface form:
+- **Paraphrase.** "X causes Y" ↔ "Y is caused by X".
+- **Unit conversion.** "600 seconds" ↔ "10 minutes".
+- **Syntactic variation.** Pluralization, tense, voice, declarative ↔ imperative.
+- **Canonical name of the snippet's concept.** When the snippet describes
+  a field, header, or API element by configuration syntax (e.g., a
+  `max_age` table row), the canonical name (`Access-Control-Max-Age` HTTP
+  header) is the same content in different surface form. This is a
+  separate carve-out from pure paraphrase: it admits domain knowledge
+  tightly bound to the snippet's referent.
+> **v1.2 debt.** The trivial-inference clause — especially the
+> canonical-name carve-out — is the strictest-rubric concession most
+> likely to require revision in v1.2. If labelers find themselves
+> applying it broadly to rescue answers from score-0, the clause is
+> too permissive and should be tightened.
 **When to abstain (`"Unknown"`)**: if the answer is a refusal ("I don't
 know" / "not in the documentation") and there is nothing to ground, score
 ## Score 0
+At least one claim in the answer is not entailed by any snippet, after
+applying the trivial-inference clause.
+### Example A — calibration anchor `k8s_006` (dramatic over-extension)
+Question: "What is a ConfigMap in Kubernetes and what kind of data should you store in it?"
+Snippet: "A ConfigMap is an API object used to store non-confidential data in key-value pairs."
+Answer (excerpted): The agent gives a comprehensive multi-section answer
+covering (i) the definition, (ii) three consumption methods (env vars,
+command-line args, volumes), (iii) a warning not to store
+passwords/tokens/certificates, (iv) a recommendation to use Secrets
+instead, and (v) details about `data` and `binaryData` fields.
+Thinking trace: Score = 0. Only the definition (i) is entailed by the
+snippet. Claims (ii)–(v) are factually correct against the underlying
+`k8s_configmap.md` doc, but **none are entailed by the one-sentence
+snippet**. The snippet does not describe consumption methods, security
+guidance, or schema fields. The strict-conjunction rule applies: even
+though most of the answer is well-supported by the broader corpus, the
+gold-snippet scope is what the rubric measures, and the answer goes
+dramatically beyond it.
+### Example B — calibration anchor `q006` (subtle embellishment)
+Question: "How does dependency caching work in FastAPI, and how can you disable it?"
+Snippet: "By default, if the same dependency is used multiple times within
+a single request (e.g., both a route and a sub-dependency use
+`Depends(get_db)`), FastAPI caches the result and calls the dependency
+only once. To disable caching and force a fresh call each time, use
+`Depends(get_db, use_cache=False)`."
+Answer (excerpted): Most of the answer is entailed by the snippet — the
+default-caching behavior, the once-per-request semantics, the
+`use_cache=False` disable mechanism, and an example. But the answer also
+states that caching is "particularly useful for expensive operations like
+database connections."
+Thinking trace: Score = 0. The "particularly useful for expensive
+operations like database connections" claim is **not entailed** by the
+snippet. The snippet describes *what* caching does (a mechanism); the
+claim adds *why* it is useful (a use-case argument the snippet does not
+make). The example uses `Depends(get_db)` as an identifier, not as a
+normative argument that caching is useful for database connections. This
+is the subtle case: most of the answer is good, but a single embellishment
+that imports general LLM knowledge ("expensive operations like database
+connections") fails strict snippet entailment. This is the boundary case
+the rubric exists to catch — easier to mis-label than `k8s_006` because
+the failure is one claim in the middle of a mostly-grounded answer.
 ## Score 1
+Every claim in the answer is entailed by at least one snippet, including
+trivial inferences (paraphrase, unit conversion, syntactic variation,
+canonical-name substitution).
+### Example C — calibration anchor `q021` (trivial inference + canonical name)
+Question: "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?"
+Snippet (table row): `| max_age | 600 | Seconds the browser caches preflight results |`
+Answer: "If the CORS `max_age` is 600 seconds, the browser caches
+preflight results for 10 minutes. The `max_age` directive in the
+`Access-Control-Max-Age` header specifies the maximum time (in seconds)
+that the results of a preflight request can be cached. So 600 seconds
+÷ 60 seconds/minute = 10 minutes."
+Thinking trace: Score = 1. Every claim is entailed under the
+trivial-inference clause:
+- "max_age=600 → 10 minutes" — unit conversion.
+- "Access-Control-Max-Age header" — canonical-name carve-out: the snippet
+  describes `max_age` by its configuration syntax; the answer uses the
+  canonical HTTP header name for the same field.
+- "specifies the maximum time (in seconds)" — paraphrase of "Seconds the
+  browser caches preflight results".
+- "600 ÷ 60 = 10 minutes" — arithmetic, the same trivial-inference class
+  as unit conversion.
+The canonical-name carve-out is doing the heaviest lifting in this
+example. Without it, "Access-Control-Max-Age" would be ungrounded
+(domain knowledge not in the snippet text). With it, the answer is a
+clean strict-snippet pass. This is exactly the v1.2-debt sentence above
+— if many future labels rescue score-1 via canonical-name appeals, the
+clause is over-rescuing and should be tightened.

docs/_generated/kappa_table.md ADDED Viewed

	@@ -0,0 +1,24 @@

+# κ ablation table — calibration v1
+Headline metric per dimension: **groundedness → AC1**, **relevance → AC1**, **completeness → κ**. AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold is prevalence-skewed enough to make Cohen's κ degenerate (groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); both metrics produce ≥0.95 raw agreement on those rows but Cohen's κ collapses to ≈0 because Pe approaches 1. Completeness uses Cohen's κ — its gold (23×`2`/5×`1`) is balanced enough for κ to behave normally.
+| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |
+|---|---|---|---|---|---|---|
+| baseline | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% |  |
+| baseline | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% |  |
+| baseline | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% |  |
+| baseline_no_abstain | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% |  |
+| baseline_no_abstain | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% |  |
+| baseline_no_abstain | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% |  |
+| baseline_no_anchors | completeness | κ | 0.623 (-0.054, 1.000) | 26 | 0.0% |  |
+| baseline_no_anchors | groundedness | AC1 | 0.953 (0.834, 1.000) | 24 | 7.7% |  |
+| baseline_no_anchors | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% |  |
+| baseline_no_cot | completeness | κ | 1.000 (1.000, 1.000) | 24 | 7.7% |  |
+| baseline_no_cot | groundedness | AC1 | 0.897 (0.707, 1.000) | 23 | 11.5% |  |
+| baseline_no_cot | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% |  |
+| jury_kappa_weighted | completeness | κ | 0.014 (-0.077, 0.112) | 26 | 0.0% |  |
+| jury_kappa_weighted | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% |  |
+| jury_kappa_weighted | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% |  |
+| permute | completeness | κ | 0.506 (-0.061, 1.000) | 26 | 0.0% |  |
+| permute | groundedness | AC1 | 1.000 (1.000, 1.000) | 25 | 3.8% |  |
+| permute | relevance | AC1 | 0.966 (0.890, 1.000) | 30 | 0.0% |  |

measurements/2026-05-04-judge-calibration-labels.jsonl ADDED Viewed

	@@ -0,0 +1,90 @@

+{"item_id": "q021", "dimension": "groundedness", "score": 1, "abstained": false, "notes": "600 seconds and preflight caching are supported; conversion is arithmetic", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
+{"item_id": "q021", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers the minutes conversion", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
+{"item_id": "q021", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers 600/60 = 10 minutes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
+{"item_id": "q010", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer makes unsupported GraphQL and library claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
+{"item_id": "q010", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly addresses whether native GraphQL schema generation exists", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
+{"item_id": "q010", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "reference answer is empty/missing for completeness", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
+{"item_id": "q027", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer makes unsupported load-balancing claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
+{"item_id": "q027", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly addresses FastAPI load balancing", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
+{"item_id": "q027", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "reference answer is empty/missing for completeness", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
+{"item_id": "q006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "v1.0 -> v1.1 flip: claim 'particularly useful for expensive operations like database connections' adds a use-case argument the snippet does not make. Snippet's get_db is an identifier in the example, not a use-case claim. Other claims entailed; this one is general LLM knowledge.", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
+{"item_id": "q006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers caching behavior and disabling mechanism", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
+{"item_id": "q006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers once-per-request cache and use_cache=False", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
+{"item_id": "q011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported claims about other endpoints and customization beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
+{"item_id": "q011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly gives the default Swagger UI endpoint", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
+{"item_id": "q011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers /docs and interactive documentation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
+{"item_id": "q012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported headers, response format, inheritance, and custom-handler claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
+{"item_id": "q012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains raising HTTPException in a route", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
+{"item_id": "q012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers HTTPException with status_code and detail", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
+{"item_id": "q023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "large code sample adds many unsupported implementation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
+{"item_id": "q023", "dimension": "relevance", "score": 1, "abstained": false, "notes": "on-topic but truncated before testing and dependency overrides", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
+{"item_id": "q023", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers error handling and CORS but misses TestClient/dependency_overrides", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
+{"item_id": "q025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "only cursor pagination is supported; response_model/background task claims are unsupported by snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
+{"item_id": "q025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses pagination, validation, and analytics logging", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
+{"item_id": "q025", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers cursor navigation, response_model, and BackgroundTasks", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
+{"item_id": "k8s_002", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported naming, storage, ordering, examples, and YAML details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
+{"item_id": "k8s_002", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares StatefulSet and Deployment use cases", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
+{"item_id": "k8s_002", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers stateless Deployment vs sticky StatefulSet identity and when to use each", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
+{"item_id": "k8s_014", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported port range, cloud-provider, production, and allocation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
+{"item_id": "k8s_014", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares NodePort and LoadBalancer Services", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
+{"item_id": "k8s_014", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers node IP/static port versus external load balancer and relationship", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
+{"item_id": "k8s_016", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported scheduler, nodeAffinity, and nodeName implementation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
+{"item_id": "k8s_016", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers Deployment vs DaemonSet scheduling difference", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
+{"item_id": "k8s_016", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers fixed replicas versus one copy on all or selected nodes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
+{"item_id": "k8s_004", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer gives unsupported Jaeger configuration guidance", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
+{"item_id": "k8s_004", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses Jaeger sidecar injection setup", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
+{"item_id": "k8s_004", "dimension": "completeness", "score": 1, "abstained": false, "notes": "notes corpus lacks Jaeger docs but fails to refuse as required", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
+{"item_id": "k8s_022", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported CEL and admission-controller deny alternatives", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
+{"item_id": "k8s_022", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers RBAC deny-rule question", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
+{"item_id": "k8s_022", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers no deny rules and not granting delete permission", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
+{"item_id": "k8s_024", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer gives unsupported Envoy ADS configuration", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
+{"item_id": "k8s_024", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses Envoy ADS sidecar configuration", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
+{"item_id": "k8s_024", "dimension": "completeness", "score": 1, "abstained": false, "notes": "notes corpus lacks Envoy ADS docs but fails to refuse as required", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
+{"item_id": "k8s_003", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported ingress-controller, EndpointSlice, kube-proxy, and DNAT details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
+{"item_id": "k8s_003", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains external HTTP traffic flow", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
+{"item_id": "k8s_003", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Ingress to Service to Pod routing and selector/load-balancing role", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
+{"item_id": "k8s_017", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "snippets do not support sequential order, retry policy, or lifecycle details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
+{"item_id": "k8s_017", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains init-container startup order", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
+{"item_id": "k8s_017", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers sequential init containers, completion before app containers, and failure retry", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
+{"item_id": "k8s_018", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported autoscaling/v2, memory/custom metric, and v1 comparison details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
+{"item_id": "k8s_018", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly identifies the HPA API version to use", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
+{"item_id": "k8s_018", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers autoscaling/v2 and why it supports memory/custom metrics", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
+{"item_id": "k8s_019", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "source only defines ConfigMap; mechanisms and update behavior are unsupported", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
+{"item_id": "k8s_019", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains how ConfigMap values reach Pods", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
+{"item_id": "k8s_019", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers env vars, volume mounts, and update behavior", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
+{"item_id": "k8s_025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported NodePort, ExternalIPs, Ingress, and Gateway claims beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
+{"item_id": "k8s_025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly lists Service exposure options", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
+{"item_id": "k8s_025", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers NodePort/LoadBalancer and ClusterIP/Ingress but misses ExternalName", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
+{"item_id": "k8s_001", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported sticky identity, rescheduling, headless service, and policy details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
+{"item_id": "k8s_001", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers StatefulSet Pod identity guarantees", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
+{"item_id": "k8s_001", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers ordinal, network identity, stable storage, and sticky identity", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
+{"item_id": "k8s_006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported consumption mechanisms and Secret guidance beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
+{"item_id": "k8s_006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains ConfigMap purpose and data type", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
+{"item_id": "k8s_006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers non-confidential key-value config and not storing secrets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
+{"item_id": "k8s_007", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported completion modes and configuration details beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
+{"item_id": "k8s_007", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains what Jobs do and completion criteria", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
+{"item_id": "k8s_007", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Pod creation, successful completions, retries, and completion state", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
+{"item_id": "k8s_009", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "source snippet names the four kinds but not the detailed role/binding explanations", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
+{"item_id": "k8s_009", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly names and explains the four RBAC object kinds", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
+{"item_id": "k8s_009", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Role, ClusterRole, RoleBinding, and ClusterRoleBinding with scope/use", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
+{"item_id": "k8s_005", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported warn-mode and workload-resource behavior beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
+{"item_id": "k8s_005", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares enforce and warn modes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
+{"item_id": "k8s_005", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers stable PSA, enforce rejection, warn allowance, and combined modes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
+{"item_id": "k8s_012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported default medium, memory accounting, size, and performance claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
+{"item_id": "k8s_012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares default emptyDir and Memory medium", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
+{"item_id": "k8s_012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers node-backed default, tmpfs Memory, speed, and memory-limit accounting", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
+{"item_id": "k8s_013", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported kill, Service traffic, grace-period, and best-practice details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
+{"item_id": "k8s_013", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares failed liveness and readiness probes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
+{"item_id": "k8s_013", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers liveness restart and readiness removal from traffic without restart", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
+{"item_id": "k8s_015", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported idempotency and deadline details beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
+{"item_id": "k8s_015", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares Job and CronJob usage", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
+{"item_id": "k8s_015", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers one-off task versus recurring cron-scheduled Jobs", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
+{"item_id": "k8s_023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "single snippet does not support bypass, host-network, or trusted-workload details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
+{"item_id": "k8s_023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers what privileged profile enforces", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
+{"item_id": "k8s_023", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers no restrictions, unrestricted policy, and bypassing isolation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
+{"item_id": "k8s_020", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported ingress/egress isolation rules beyond terse snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
+{"item_id": "k8s_020", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains non-isolated baseline and NetworkPolicy isolation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
+{"item_id": "k8s_020", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers baseline and allowed traffic but omits CNI enforcement point", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
+{"item_id": "k8s_011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported update semantics and traffic details beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}
+{"item_id": "k8s_011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares Secret volume and environment-variable consumption", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}
+{"item_id": "k8s_011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers both modes and update behavior difference", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}

measurements/2026-05-05-judge-rubric-opus-stress.jsonl ADDED Viewed

	@@ -0,0 +1,90 @@

+{"item_id": "k8s_001", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that StatefulSet Pods have a unique identity consisting of an ordinal, stable network identity, and stable storage. The answer makes many claims not supported by this single snippet: 'sticky identity' terminology, comparison to Deployments, podManagementPolicy field with OrderedReady and Parallel options, requirement of a Headless Service, and use cases like databases. None of these specific claims are entailed by the snippet provided.", "label_timestamp": "2026-05-05T14:29:41Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021464, "_latency_ms": 3804.864978999831}
+{"item_id": "k8s_001", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by describing the identity guarantees StatefulSets provide to Pods, including sticky identity, persistent identifiers, uniqueness/ordering, and network identity via headless service. It covers the core ask comprehensively.", "label_timestamp": "2026-05-05T14:34:20Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020344, "_latency_ms": 3798.0725439847447}
+{"item_id": "k8s_001", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has key points: (1) ordinal index, (2) stable network identity, (3) stable persistent storage, (4) identity sticks across rescheduling, (5) contrast with Deployment's interchangeable Pods. The answer covers sticky identity across rescheduling, network identity, and the contrast with Deployments. However, it does not explicitly mention ordinal index (only vaguely mentions 'ordering') and does not mention stable persistent storage at all. So partial coverage.", "label_timestamp": "2026-05-05T14:38:40Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0022176000000000006, "_latency_ms": 6259.8209959687665}
+{"item_id": "k8s_002", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains many claims that go far beyond the gold snippets. The snippets only support: (1) Deployment manages Pods for stateless workloads, and (2) StatefulSet maintains sticky identity per Pod. The answer adds numerous unsupported claims: ordinal-based naming (web-0, web-1), volumeClaimTemplates, headless Service requirement, ordered sequential operations, specific examples like MySQL/Cassandra/Kafka, YAML configurations, persistent volumes per Pod, etc. None of these specifics are entailed by the two brief snippets provided.", "label_timestamp": "2026-05-05T14:29:51Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0031912, "_latency_ms": 5066.343065002002}
+{"item_id": "k8s_002", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it compares StatefulSet and Deployment across multiple dimensions (pod identity, naming, network, storage, ordering) and explicitly explains when to prefer each with examples. This matches Example F's on-target comparison pattern.", "label_timestamp": "2026-05-05T14:34:29Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0027416000000000003, "_latency_ms": 4008.2699899794534}
+{"item_id": "k8s_002", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has two main parts: (1) Deployment manages stateless, interchangeable replicas with declarative updates, and (2) StatefulSet provides sticky identity with stable network IDs, stable persistent storage, and ordered deployment/scaling. The answer covers both: Deployment is described as managing stateless apps with interchangeable Pods, and StatefulSet is described with stable hostnames, persistent storage per Pod, ordered operations, and unique identity.", "label_timestamp": "2026-05-05T14:38:51Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0031176000000000003, "_latency_ms": 5786.756471032277}
+{"item_id": "k8s_003", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that Ingress exposes HTTP/HTTPS routes from outside the cluster, and that Services target Pods usually via a selector. The answer makes many claims not supported by the snippets: kube-proxy details, EndpointSlices, iptables, DNAT, ClusterIP virtual IP, load balancing mechanics, session affinity, etc. None of these are in the gold snippets.", "label_timestamp": "2026-05-05T14:30:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002524, "_latency_ms": 3977.3940969607793}
+{"item_id": "k8s_003", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the full traffic path from Ingress through Service to Pod, covering all three layers requested. It explains the Ingress controller's role at the edge, the Service layer with ClusterIP and EndpointSlices, kube-proxy's routing function, and final Pod delivery via DNAT.", "label_timestamp": "2026-05-05T14:34:38Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002372, "_latency_ms": 3949.3474710034207}
+{"item_id": "k8s_003", "dimension": "completeness", "score": 2, "abstained": false, "notes": "Reference points: (1) Ingress exposes HTTP/HTTPS routes from outside and maps to backend Services based on rules - covered. (2) Service is abstraction defining logical set of endpoints (Pods) using selector - covered. (3) Load-balances traffic across Pods - covered. (4) Service delivers traffic to container port each Pod exposes - covered.", "label_timestamp": "2026-05-05T14:39:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028088, "_latency_ms": 4416.859550983645}
+{"item_id": "k8s_004", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold source snippets are empty, meaning there are no snippets to ground any claims against. The answer provides substantive technical claims about Jaeger sidecar injection (labels, annotations, ports, etc.) that cannot be verified against the (empty) snippets. The answer also acknowledges the docs don't contain Jaeger-specific info but then proceeds to make many specific claims anyway, so this is not a pure refusal/abstain.", "label_timestamp": "2026-05-05T14:30:08Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002128, "_latency_ms": 3805.78425899148}
+{"item_id": "k8s_004", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses how to enable Jaeger sidecar injection in a Kubernetes Deployment, providing the relevant label/annotation methods, namespace-level injection, configuration options, and verification steps. This is on-target for the question's core ask.", "label_timestamp": "2026-05-05T14:34:46Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020648000000000003, "_latency_ms": 2984.175974968821}
+{"item_id": "k8s_004", "dimension": "completeness", "score": 0, "abstained": false, "notes": "The reference answer's key point is that the agent should refuse to answer because Jaeger is not covered in the Kubernetes documentation corpus. The agent's answer does the opposite - it provides detailed instructions on Jaeger sidecar injection rather than refusing. While it briefly notes 'the specific search results don't contain Jaeger-specific documentation,' it then proceeds to give a full answer anyway, which contradicts the reference's point about refusing and citing zero sources.", "label_timestamp": "2026-05-05T14:39:11Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0022992, "_latency_ms": 5866.323757043574}
+{"item_id": "k8s_005", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only contain two minimal pieces of information: that policy violations cause pod rejection, and a feature state stable in v1.25. The answer makes many claims that are not supported by these snippets, including: warn mode triggers user-facing warnings, enforce mode not being applied to workload resources while warn is, the specific label syntax, and the configuration examples. Only the claim about enforce mode rejecting pods is supported by snippet [1].", "label_timestamp": "2026-05-05T14:30:18Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020616000000000002, "_latency_ms": 4882.2048010188155}
+{"item_id": "k8s_005", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question about how Pod Security Admission behaves differently in enforce vs warn modes. It provides a clear comparison covering pod rejection behavior, user feedback, and applicability to workload resources, which are the key behavioral differences between these modes.", "label_timestamp": "2026-05-05T14:34:55Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020016, "_latency_ms": 3597.780067997519}
+{"item_id": "k8s_005", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points: enforce mode rejects pods on violation, warn mode allows pods but triggers user-facing warning, and modes can be combined on the same namespace. Namespace-level application is also mentioned via labels. The stable since v1.25 detail is missing but is contextual rather than a core behavioral point about enforce vs warn.", "label_timestamp": "2026-05-05T14:39:21Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002044, "_latency_ms": 4796.786619001068}
+{"item_id": "k8s_006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only supports that a ConfigMap is an API object used to store non-confidential data in key-value pairs. The answer makes many additional claims not supported by the snippet: the three ways pods consume ConfigMaps (env vars, command-line args, volumes), the warning about not storing passwords/API keys/tokens/certificates, the recommendation to use Secrets, and the existence of `data` and `binaryData` fields. None of these are entailed by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:27Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020984000000000003, "_latency_ms": 3795.695087988861}
+{"item_id": "k8s_006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly defines what a ConfigMap is (a Kubernetes API object for storing non-confidential data in key-value pairs) and explicitly addresses what kind of data should be stored in it (non-sensitive configuration data like settings, parameters, feature flags), while also clarifying what should not be stored. Both parts of the question are addressed on-target.", "label_timestamp": "2026-05-05T14:35:04Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002028, "_latency_ms": 4578.002344002016}
+{"item_id": "k8s_006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) ConfigMap is an API object storing non-confidential data in key-value pairs, (2) used for application configuration that doesn't need to be secret, and (3) confidential data like passwords/tokens should be in Secrets instead.", "label_timestamp": "2026-05-05T14:39:30Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0020288, "_latency_ms": 3426.676850998774}
+{"item_id": "k8s_007", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that a Job creates Pods and retries until a specified number successfully terminate. The answer makes many additional claims not supported by the snippet: NonIndexed vs Indexed completion modes, .spec.completionMode, completion indices 0 to N-1, annotations/labels like batch.kubernetes.io/job-completion-index, JOB_COMPLETION_INDEX env variable, .spec.parallelism, .spec.backoffLimit, and the 'Completed' state transition. None of these are entailed by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:36Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0023552, "_latency_ms": 3761.0972960246727}
+{"item_id": "k8s_007", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: what a Kubernetes Job does (runs finite/batch tasks to completion, creates Pods that run once) and how it decides the task is complete (based on completion mode, .spec.completions, NonIndexed vs Indexed). Both core asks are covered on-target.", "label_timestamp": "2026-05-05T14:35:13Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019904000000000002, "_latency_ms": 3753.890769032296}
+{"item_id": "k8s_007", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has these key points: (1) Job creates one or more Pods, (2) retries execution until specified number successfully terminate, (3) tracks successful completions, (4) complete when specified number reached, (5) deleting a Job cleans up the Pods it created. The answer covers points 1 (creates one or more Pods), 2 (implied via backoffLimit retries), 3 and 4 (tracks completions and completes when .spec.completions reached). However, point 5 (deleting a Job cleans up the Pods) is not mentioned in the answer.", "label_timestamp": "2026-05-05T14:39:40Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0024712, "_latency_ms": 5021.353780000936}
+{"item_id": "k8s_009", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that the RBAC API declares four kinds: Role, ClusterRole, RoleBinding, and ClusterRoleBinding. The answer adds many descriptive claims about each (e.g., 'Defines a set of permissions within a specific namespace', 'cluster-scoped rather than namespace-scoped', 'Grants the permissions defined in a Role to a user or set of users', 'connects a Role to subjects', etc.) that are not supported by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:45Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021512, "_latency_ms": 3805.1239320193417}
+{"item_id": "k8s_009", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly lists all four RBAC object kinds (Role, ClusterRole, RoleBinding, ClusterRoleBinding) and explains what each does, fully addressing the question's core ask.", "label_timestamp": "2026-05-05T14:35:21Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001976, "_latency_ms": 2915.8728439942934}
+{"item_id": "k8s_009", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: the four object kinds (Role, ClusterRole, RoleBinding, ClusterRoleBinding), that Roles/ClusterRoles contain permission rules while bindings grant them to users/groups/service accounts, and the namespace vs cluster scope distinction.", "label_timestamp": "2026-05-05T14:39:48Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002052, "_latency_ms": 3570.147737977095}
+{"item_id": "k8s_011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippet only states that Secrets can be mounted as data volumes or exposed (presumably as environment variables). It does not contain any of the detailed claims about automatic updates, subPath behavior, character restrictions, pod restart requirements, or eventual consistency. The answer makes many specific claims that are not supported by the single truncated snippet provided.", "label_timestamp": "2026-05-05T14:30:53Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019776, "_latency_ms": 3797.7397789945826}
+{"item_id": "k8s_011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question by comparing the two methods of consuming Secrets in Pods. It covers update behavior, restart requirements, naming constraints, and subPath limitations, providing a clear comparison of both approaches.", "label_timestamp": "2026-05-05T14:35:29Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019824, "_latency_ms": 3311.2139879958704}
+{"item_id": "k8s_011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) Secrets can be mounted as a data volume, (2) Secrets can be exposed as environment variables, (3) mounted volumes receive in-place updates when the Secret changes, and (4) environment variables are set at Pod start and don't update. The 'each key becomes a file' point is implicitly covered by 'Secrets are exposed as files in the mounted directory'.", "label_timestamp": "2026-05-05T14:39:57Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023824000000000002, "_latency_ms": 4128.535017021932}
+{"item_id": "k8s_012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that setting emptyDir.medium to 'Memory' causes Kubernetes to mount a tmpfs. The answer makes many additional claims not supported by the snippet: that default emptyDir is stored on node-backing media, that memory-backed files count against container memory limits, sizing behavior when no size is specified, cost/size comparisons, and other resource considerations. None of these are entailed by the single provided snippet.", "label_timestamp": "2026-05-05T14:31:02Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021024, "_latency_ms": 4015.2714860159904}
+{"item_id": "k8s_012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core comparison between default emptyDir and Memory-medium emptyDir, covering storage location, performance, memory accounting, and size limits for both cases.", "label_timestamp": "2026-05-05T14:35:37Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017872, "_latency_ms": 2831.339330004994}
+{"item_id": "k8s_012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) default emptyDir is stored on whatever medium backs the node (disk, SSD, or network storage), (2) setting medium to 'Memory' mounts a tmpfs/RAM-backed filesystem, and (3) tmpfs is fast but counts against the container's memory limit. The answer covers all three: it states the default storage location verbatim, describes Memory as tmpfs/RAM-backed, notes it's very fast, and explicitly states files count against the container's memory limit.", "label_timestamp": "2026-05-05T14:40:08Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023808, "_latency_ms": 5550.627671997063}
+{"item_id": "k8s_013", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that liveness probes are used to know when to restart a container, and readiness probes are used to know when a container is ready to accept traffic. The answer makes many additional claims not supported by the snippets: terminationGracePeriodSeconds behavior, deadlock examples, Pod Ready condition being set to false, kubelet continuing to run more probes periodically, best practice recommendations about shared endpoints and failureThreshold settings, etc. These claims may be true but are not entailed by the provided snippets.", "label_timestamp": "2026-05-05T14:31:12Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002268, "_latency_ms": 4528.785168018658}
+{"item_id": "k8s_013", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question by explaining how the kubelet responds differently to failing liveness vs readiness probes. It clearly states liveness failures cause container restart while readiness failures remove the pod from service endpoints without killing the container.", "label_timestamp": "2026-05-05T14:35:46Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019464, "_latency_ms": 4366.952612996101}
+{"item_id": "k8s_013", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has key points: (1) liveness failure \u2192 kubelet restarts container, (2) purpose is to recover from wedged state like deadlock, (3) readiness failure \u2192 container not restarted, (4) Pod marked not-ready and removed from Service load balancers. The answer covers all of these: liveness kills/restarts container, mentions deadlocks as use case, readiness keeps container running, and pod won't receive traffic from Services.", "label_timestamp": "2026-05-05T14:40:18Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023168000000000004, "_latency_ms": 4956.465922994539}
+{"item_id": "k8s_014", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only support two basic claims: NodePort exposes the Service on each Node, and LoadBalancer exposes the Service externally using an external load balancer. The answer makes many additional claims not supported by the snippets, including the static port range (30000-32767), the access pattern <NodeIP>:<NodePort>, the cluster IP creation, the relationship that LoadBalancer is built on NodePort, the spec.allocateLoadBalancerNodePorts flag, and production/development suitability claims. None of these are entailed by the two minimal snippets provided.", "label_timestamp": "2026-05-05T14:31:21Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002252, "_latency_ms": 4352.603528008331}
+{"item_id": "k8s_014", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core ask by comparing NodePort and LoadBalancer Service types, covering both sides with their characteristics, use cases, and relationship.", "label_timestamp": "2026-05-05T14:35:54Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017527999999999999, "_latency_ms": 2996.093010995537}
+{"item_id": "k8s_014", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) NodePort exposes Service on each Node's IP at a static port - explicitly stated; (2) LoadBalancer exposes externally using an external load balancer - explicitly stated; (3) Kubernetes doesn't directly provide the load balancer, requires cloud provider integration - explicitly stated; (4) LoadBalancer typically implemented on top of NodePort - explicitly stated.", "label_timestamp": "2026-05-05T14:40:26Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023512, "_latency_ms": 3537.0634549763054}
+{"item_id": "k8s_015", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains many claims not supported by the snippets. The snippets only state that Jobs are one-off tasks that run to completion and that CronJobs start Jobs on a repeating schedule. The answer adds unsupported claims about idempotency requirements, exactly-once semantics, two Jobs being created for a single schedule, startingDeadlineSeconds, and deadline handling behavior. None of these are in the snippets.", "label_timestamp": "2026-05-05T14:31:31Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021944, "_latency_ms": 4788.899898994714}
+{"item_id": "k8s_015", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: how CronJobs differ from Jobs (execution, scheduling, use case) and when to use one over the other. It also adds relevant considerations about idempotency and deadlines.", "label_timestamp": "2026-05-05T14:36:02Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001912, "_latency_ms": 2846.5557509916835}
+{"item_id": "k8s_015", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) Job runs a one-off task to completion, creating Pods and retrying until success; (2) CronJob creates Jobs on a repeating cron schedule for recurring actions like backups; (3) use Job for single batch run, CronJob for recurring schedule. The answer covers all three: Job runs once to completion for batch tasks, CronJob runs on cron schedule for recurring tasks like backups/reports, and explicitly states use Job for one-time and CronJob for recurring. The 'creates Pods and retries' detail is not explicitly mentioned but the run-to-completion concept is covered.", "label_timestamp": "2026-05-05T14:40:38Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0026271999999999997, "_latency_ms": 6194.046579999849}
+{"item_id": "k8s_016", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that a Deployment manages Pods for stateless workloads and that a DaemonSet ensures all/some Nodes run a copy of a Pod. The answer makes many additional claims not supported by the snippets: details about the Kubernetes scheduler, nodeAffinity rules added by the DaemonSet controller, the .spec.nodeName field being set, specific replica distribution examples, and common use cases like logging agents, monitoring daemons, web servers. None of these specifics are entailed by the two brief snippets provided.", "label_timestamp": "2026-05-05T14:31:51Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002292, "_latency_ms": 14661.54205496423}
+{"item_id": "k8s_016", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core scheduling difference between Deployment and DaemonSet, explaining that Deployments run a specified number of replicas distributed across nodes via the scheduler, while DaemonSets ensure one Pod per eligible node. Both sides of the comparison are covered.", "label_timestamp": "2026-05-05T14:36:11Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0018928, "_latency_ms": 3361.026384984143}
+{"item_id": "k8s_016", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has two key points: (1) Deployment schedules a fixed/configured number of replicas independent of node count, and (2) DaemonSet runs a Pod on each (matching) node, so replica count scales with nodes. The answer covers both: it states Deployment manages a specified number of replicas placed on any eligible nodes, and DaemonSet ensures all eligible nodes run one copy of a Pod. While it doesn't explicitly say 'as nodes are added DaemonSet Pods are added with them,' it conveys that the count is tied to eligible nodes.", "label_timestamp": "2026-05-05T14:40:50Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025031999999999997, "_latency_ms": 7288.97923103068}
+{"item_id": "k8s_017", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that init containers are like regular containers and run before the main application container. The answer makes many additional claims (sequential execution, retry behavior with restartPolicy, OnFailure handling, networking/storage readiness, one-time execution, exit code 0 requirement, etc.) that are not supported by the provided snippets.", "label_timestamp": "2026-05-05T14:32:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019376, "_latency_ms": 3801.133704953827}
+{"item_id": "k8s_017", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: the order of init and application container execution, and the guarantees Kubernetes provides. It explains sequential init container execution, blocking of app containers, and includes related guarantees about retries and restart policy.", "label_timestamp": "2026-05-05T14:36:19Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001876, "_latency_ms": 3559.6700820024125}
+{"item_id": "k8s_017", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has four key points: (1) init containers run one at a time in defined order, (2) each must complete before next starts, (3) regular containers start only after all init containers terminate successfully, (4) on failure, Pod restarts per restartPolicy and init sequence begins again, (5) suitable for one-time setup. The answer covers sequential ordering, completion requirement, app containers starting after init completion, and restartPolicy retry behavior. The 'one-time setup' purpose is implied at the end ('initialization tasks complete reliably before your application containers begin running').", "label_timestamp": "2026-05-05T14:41:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025688, "_latency_ms": 5247.3236820078455}
+{"item_id": "k8s_018", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippets only mention that HorizontalPodAutoscaler automatically updates a workload resource and reference a current stable version. They do not support specific claims about autoscaling/v2 being stable since v1.23, support for custom/memory/external metrics, multiple metrics evaluation, scale-up/scale-down behaviors, or that v1 only supports CPU. Nearly every claim in the answer is unsupported by the snippets.", "label_timestamp": "2026-05-05T14:32:09Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020632, "_latency_ms": 4089.8927800008096}
+{"item_id": "k8s_018", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it specifies the API version (autoscaling/v2) and explains why (stable support for custom metrics, memory metrics, multiple metrics, advanced features). Both the 'which' and 'why' components are covered.", "label_timestamp": "2026-05-05T14:36:28Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020032, "_latency_ms": 4004.737361043226}
+{"item_id": "k8s_018", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has these key points: (1) current stable HPA API is autoscaling/v2, (2) it adds memory and custom metrics beyond CPU-only autoscaling/v1, (3) new fields are preserved as annotations when working with autoscaling/v1, (4) use autoscaling/v2 for memory/custom metric scaling. The answer covers points 1, 2, and 4 clearly. Point 3 about annotation preservation in v1 is not mentioned. This is partial coverage.", "label_timestamp": "2026-05-05T14:41:09Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002352, "_latency_ms": 4204.742238041945}
+{"item_id": "k8s_019", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that a ConfigMap is an API object for storing non-confidential key-value pairs. The answer makes many claims about four mechanisms (env vars individual/bulk, volume mounts, API access), update behaviors, subPath limitations, kubelet handling, etc. None of these claims are supported by the single provided snippet.", "label_timestamp": "2026-05-05T14:32:18Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002272, "_latency_ms": 4105.534160975367}
+{"item_id": "k8s_019", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by listing and explaining the mechanisms Kubernetes provides to make ConfigMap values available to a Pod: individual env vars, bulk env vars, volume mounts, and direct API access. It includes examples and notes on update behavior, fully covering the core ask.", "label_timestamp": "2026-05-05T14:36:37Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0023304000000000003, "_latency_ms": 3380.863350990694}
+{"item_id": "k8s_019", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) env variables expose keys, (2) volume mount makes keys into files, (3) volume-mounted data updates in place while env vars require pod restart. The answer covers all three: it describes env vars, volume mounts with keys as filenames, and explicitly notes env vars require restart while volume mounts support dynamic updates.", "label_timestamp": "2026-05-05T14:41:19Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025688, "_latency_ms": 4229.396597947925}
+{"item_id": "k8s_020", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippets only contain two fragments: 'non-isolated' and 'namespaceSelector'. The answer makes many detailed claims about ingress/egress defaults, policyTypes, podSelector, connections from the Pod's own node, etc. While 'non-isolated' supports the default claim, the vast majority of claims (policyTypes values, podSelector mechanism, node connection allowance, egress rule behavior) are not supported by the minimal snippets provided.", "label_timestamp": "2026-05-05T14:32:26Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020512, "_latency_ms": 3578.6442419630475}
+{"item_id": "k8s_020", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it explains that by default Pods are non-isolated (and thus not subject to NetworkPolicy filtering in the restrictive sense), and it explains how applying a NetworkPolicy that selects a Pod changes the baseline by making the Pod isolated for the specified direction(s), allowing only explicitly permitted traffic.", "label_timestamp": "2026-05-05T14:36:45Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020232, "_latency_ms": 3397.536567004863}
+{"item_id": "k8s_020", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has four key points: (1) Pods are non-isolated by default, (2) a Pod becomes isolated when a NetworkPolicy selects it via podSelector, (3) only traffic allowed by the union of NetworkPolicies is permitted, (4) policies can target ingress/egress/both, and (5) the CNI plugin enforces the policy, not Kubernetes itself. The answer covers points 1, 2, 3, and 4 clearly. However, it does not mention that the CNI plugin is what enforces the policy.", "label_timestamp": "2026-05-05T14:41:28Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025376000000000005, "_latency_ms": 4806.188436981756}
+{"item_id": "k8s_022", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only supports the claim that RBAC is purely additive with no deny rules. The answer makes many additional claims about CEL-based AuthorizationPolicy (with specific API version and schema), ValidatingAdmissionPolicy syntax, Kubernetes 1.26+ version requirements, and specific YAML structures that are not supported by the single snippet provided.", "label_timestamp": "2026-05-05T14:32:35Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0023096, "_latency_ms": 3849.2560360464267}
+{"item_id": "k8s_022", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the user's question about writing an RBAC deny rule for blocking pod deletion. It correctly explains that RBAC doesn't support explicit deny rules and provides alternative approaches including not granting delete permission, admission controllers, etc. This directly engages with the core ask.", "label_timestamp": "2026-05-05T14:36:53Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0022656000000000004, "_latency_ms": 3391.0853110137396}
+{"item_id": "k8s_022", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference's key points are: (1) RBAC does not support deny rules, (2) rules are purely additive, (3) to block an action, simply don't grant the permission. The answer covers all of these: it states RBAC does not have explicit deny rules, is allow-based only, and recommends not granting the delete verb as the primary approach.", "label_timestamp": "2026-05-05T14:41:37Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0024144, "_latency_ms": 3988.3979110163637}
+{"item_id": "k8s_023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The only snippet provided is 'Unrestricted policy', which is a very minimal piece of information. The answer makes multiple specific claims: that the privileged profile is 'entirely unrestricted', that pods can 'bypass typical container isolation mechanisms', that pods can access the node's host network, and that it's aimed at system/infrastructure-level workloads managed by privileged trusted users. While the snippet's phrase 'Unrestricted policy' loosely supports the 'unrestricted' claim, the specific claims about bypassing container isolation, host network access, and the intended audience (system/infrastructure workloads, trusted users) are not entailed by the snippet.", "label_timestamp": "2026-05-05T14:32:45Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002084, "_latency_ms": 4419.9781229835935}
+{"item_id": "k8s_023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The question asks which container-isolation restrictions the 'privileged' profile enforces. The answer directly addresses this by stating that the privileged profile enforces no restrictions and is unrestricted by design, which is a direct on-topic answer to the core ask.", "label_timestamp": "2026-05-05T14:37:01Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017008000000000001, "_latency_ms": 2984.6201190375723}
+{"item_id": "k8s_023", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) privileged enforces none/absence of restrictions, (2) purposely-open and entirely unrestricted, (3) bypasses typical container isolation mechanisms with host network example. The only minor omission is the explicit recommendation to use baseline or restricted for actual isolation, but the answer does contrast with these more restrictive profiles.", "label_timestamp": "2026-05-05T14:41:47Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0018664, "_latency_ms": 4233.887287962716}
+{"item_id": "k8s_024", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold source snippets section is empty - there are no snippets provided to ground any claims against. The answer makes numerous technical claims about Envoy ADS configuration, bootstrap settings, Kubernetes deployment, and control plane setup, none of which can be supported by the (nonexistent) snippets. The answer itself acknowledges the documentation doesn't contain this information but then proceeds to provide unsupported claims rather than abstaining. Since the answer is not a pure refusal/abstention but instead provides substantive unsupported claims, it fails groundedness.", "label_timestamp": "2026-05-05T14:32:55Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0027576000000000002, "_latency_ms": 5302.7043039910495}
+{"item_id": "k8s_024", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question about configuring Envoy xDS ADS for sidecar proxies in a Kubernetes Deployment. It provides bootstrap configuration with ADS settings, a Kubernetes Deployment manifest with the Envoy sidecar, and discusses the control plane setup. Both aspects of the question (ADS configuration and Kubernetes Deployment integration) are covered.", "label_timestamp": "2026-05-05T14:37:11Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002908, "_latency_ms": 4470.350661023986}
+{"item_id": "k8s_024", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference states the correct answer is to refuse and cite zero sources because the corpus doesn't cover Envoy/xDS/ADS. The agent does acknowledge that the documentation doesn't contain this information ('The documentation corpus doesn't appear to contain specific Envoy ADS configuration documentation'), which partially covers the reference point. However, instead of refusing, it then provides extensive configuration guidance, which goes against the 'refuse' aspect of the reference. The acknowledgment of absence is present but the refusal stance is not maintained.", "label_timestamp": "2026-05-05T14:41:56Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0029904000000000003, "_latency_ms": 4743.537011963781}
+{"item_id": "k8s_025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only describe ClusterIP ('Exposes the Service on a cluster-internal IP') and LoadBalancer ('Exposes the Service externally using an external load balancer'). The answer makes many claims about NodePort, ExternalIPs, Ingress, Gateway, and details about LoadBalancer (cloud provider, stable external IP) that are not supported by the snippets. Multiple unsupported claims fail the strict conjunction rule.", "label_timestamp": "2026-05-05T14:33:05Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020808000000000003, "_latency_ms": 4765.863787964918}
+{"item_id": "k8s_025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by listing Kubernetes Service types that expose applications externally (NodePort, LoadBalancer) and adds ExternalIPs as additional context, while clarifying ClusterIP is internal-only. This directly addresses the core ask.", "label_timestamp": "2026-05-05T14:37:19Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017576000000000002, "_latency_ms": 3343.788483995013}
+{"item_id": "k8s_025", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference identifies three external Service types: NodePort, LoadBalancer, and ExternalName, plus mentions ClusterIP as default and Ingress as alternative. The answer covers NodePort, LoadBalancer, ClusterIP as default, and Ingress. However, it does not mention ExternalName (instead it discusses ExternalIPs, which is a different concept). So one of the three external-exposing types is missing.", "label_timestamp": "2026-05-05T14:42:06Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0021791999999999996, "_latency_ms": 4295.65988300601}
+{"item_id": "q006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer's core claims about default caching within a single request and using use_cache=False to disable it are supported by the snippet. However, the answer adds claims not supported: 'particularly useful for expensive operations like database connections' is an embellishment not in the snippet. More critically, the example shows using use_cache=False on two separate Depends calls in the same handler and claims this 'ensures that the dependency function is called twice' \u2014 but the snippet only describes caching when the same dependency is used multiple times (e.g., route + sub-dependency). The performance improvement claim ('improves performance by avoiding redundant function calls') is a reasonable inference but not explicitly stated. The strict-conjunction rule means unsupported claims fail the rubric.", "label_timestamp": "2026-05-05T14:28:25Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0026408, "_latency_ms": 6825.492414005566}
+{"item_id": "q006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: how dependency caching works in FastAPI (per-request caching of dependency results) and how to disable it (using use_cache=False). Both core asks are covered with explanation and example.", "label_timestamp": "2026-05-05T14:33:13Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001804, "_latency_ms": 3368.1467549758963}
+{"item_id": "q006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers both reference points: (1) FastAPI caches dependency results so each dependency is called only once per request, and (2) caching can be disabled by setting use_cache=False in Depends().", "label_timestamp": "2026-05-05T14:37:29Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.001668, "_latency_ms": 5226.763973012567}
+{"item_id": "q010", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "No gold snippets are provided to support any of the claims in the answer. The answer makes multiple specific claims (FastAPI generates OpenAPI 3.1.0, no native GraphQL support, specific third-party libraries like Strawberry/Graphene/Ariadne) that cannot be verified against the empty snippet set.", "label_timestamp": "2026-05-05T14:28:33Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0016656000000000002, "_latency_ms": 2784.843849018216}
+{"item_id": "q010", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question of whether FastAPI can natively generate GraphQL schemas, clearly stating it cannot, and provides additional context about alternatives.", "label_timestamp": "2026-05-05T14:33:21Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0016496000000000002, "_latency_ms": 2810.719419037923}
+{"item_id": "q010", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "Without a gold reference provided, I cannot assess coverage of specific reference points against the answer.", "label_timestamp": "2026-05-05T14:37:38Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0012887999999999999, "_latency_ms": 3803.2577409758233}
+{"item_id": "q011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only confirm that `/docs` is the Swagger UI endpoint and that FastAPI exposes three documentation-related endpoints by default. The answer makes many additional claims not supported by the snippets: the specific names `/redoc` and `/openapi.json`, that ReDoc is the alternative viewer, that `/openapi.json` returns the raw OpenAPI schema, and the customization/disabling code with parameters like `docs_url`, `redoc_url`, `openapi_url`. None of these specifics are entailed by the two snippets provided.", "label_timestamp": "2026-05-05T14:28:43Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021912, "_latency_ms": 5105.2629159530625}
+{"item_id": "q011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by stating that the default Swagger UI endpoint in FastAPI is /docs, and provides additional helpful context about other documentation endpoints and customization.", "label_timestamp": "2026-05-05T14:33:29Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0015224000000000001, "_latency_ms": 3036.487515026238}
+{"item_id": "q011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference states two points: (1) the default Swagger UI endpoint is /docs, and (2) it provides interactive API documentation. The answer covers both, explicitly stating /docs is the endpoint and describing it as an interactive API explorer.", "label_timestamp": "2026-05-05T14:37:46Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0016392, "_latency_ms": 2931.246972992085}
+{"item_id": "q012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains several claims not supported by the snippets. The snippets only mention HTTPException as the primary way to return errors, that it terminates processing, returns status code and detail, and that detail can be string/list/dict serialized to JSON. Unsupported claims include: the 'headers' parameter, the example with custom headers, the default JSON response format showing {\"detail\": \"Item not found\"}, that HTTPException inherits from Starlette's HTTPException, and that you can create custom exception handlers. These are not entailed by the gold snippets.", "label_timestamp": "2026-05-05T14:28:52Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.00242, "_latency_ms": 4274.66034895042}
+{"item_id": "q012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses how to raise an HTTP error in a FastAPI route handler by explaining the use of HTTPException, providing code examples, parameters, and additional context. This is a direct, on-target response to the question's core ask.", "label_timestamp": "2026-05-05T14:33:37Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0018336000000000001, "_latency_ms": 3302.3087149485946}
+{"item_id": "q012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) raise HTTPException, (2) with a status_code, (3) and a detail message. The answer covers all three explicitly, showing the raise statement, status_code parameter, and detail parameter.", "label_timestamp": "2026-05-05T14:37:55Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0017992000000000001, "_latency_ms": 4265.608784975484}
+{"item_id": "q021", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet states max_age=600 seconds means the browser caches preflight results. The answer's core claim that 600 seconds = 10 minutes is a math conversion supported by the snippet's value. However, the answer adds claims about the 'Access-Control-Max-Age' header name and that max_age is a 'directive' specifying maximum cache time of preflight requests \u2014 these specifics (header name in particular) are not in the snippet. The snippet only mentions 'max_age' as a config with value 600 and that browser caches preflight results.", "label_timestamp": "2026-05-05T14:29:03Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0018016, "_latency_ms": 5443.063500977587}
+{"item_id": "q021", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by converting 600 seconds to 10 minutes and explaining the calculation.", "label_timestamp": "2026-05-05T14:33:45Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0012775999999999998, "_latency_ms": 2611.7517079692334}
+{"item_id": "q021", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference states that with max_age 600 seconds, the browser caches preflight results for 10 minutes. The answer covers exactly this: 10 minutes caching, with the 600/60=10 calculation shown.", "label_timestamp": "2026-05-05T14:38:03Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0014024, "_latency_ms": 2987.084314983804}
+{"item_id": "q023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer makes many specific claims that go far beyond the two snippets provided. The snippets only state that HTTPException is the primary way to return error responses and that CORS is configured using CORSMiddleware from Starlette. The answer includes extensive code with specific parameters (allow_credentials, allow_methods, expose_headers, max_age), custom exception handlers with @app.exception_handler decorator syntax, RequestValidationError handling, dependency injection patterns, and many other details not supported by the snippets.", "label_timestamp": "2026-05-05T14:29:12Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.00298, "_latency_ms": 4431.315872003324}
+{"item_id": "q023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses all three parts of the question: custom error handling (custom exception classes and handlers), CORS middleware (CORSMiddleware setup), and begins setting up dependency injection for testing. Although the answer is truncated mid-code, it clearly engages with the core ask on-topic.", "label_timestamp": "2026-05-05T14:33:53Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0024936000000000003, "_latency_ms": 3396.7526520136744}
+{"item_id": "q023", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has three key points: (1) custom error handling via HTTPException/exception handlers, (2) CORS via CORSMiddleware with allowed origins, (3) testing with TestClient and app.dependency_overrides. The answer covers exception handlers extensively (point 1) and CORSMiddleware with allow_origins (point 2). However, the answer is cut off and does not cover testing with TestClient or app.dependency_overrides (point 3 missing).", "label_timestamp": "2026-05-05T14:38:13Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028344, "_latency_ms": 4287.58555999957}
+{"item_id": "q025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The only snippet provided describes cursor-based pagination as using an opaque token pointing to the last item in the previous page, to avoid performance issues with large offsets. The answer makes many claims far beyond this: specific Pydantic models, field constraints (min_length, max_length, le=100), base64 encoding of cursors, analytics logging structure, FastAPI implementation details, etc. None of these specifics are supported by the single snippet. While the general concept of cursor pagination is mentioned, the vast majority of the answer's claims (validation rules, base64 encoding scheme, analytics fields, etc.) are not entailed by the snippet.", "label_timestamp": "2026-05-05T14:29:23Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0030064000000000002, "_latency_ms": 5922.551519994158}
+{"item_id": "q025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses all three components of the question: cursor-based pagination (with encode/decode utilities and cursor query params), response model validation (Pydantic models with Field validators), and background task processing for analytics logging (FastAPI BackgroundTasks with analytics logger). Although the code is cut off at the end, the answer substantively engages with the core ask.", "label_timestamp": "2026-05-05T14:34:02Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002676, "_latency_ms": 4254.6321749687195}
+{"item_id": "q025", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all three reference points: (1) cursor-based pagination with opaque cursor tokens (shown via base64 encoding/decoding utilities and cursor parameter), (2) response_model parameter on route decorators for validation (shown in @app.get with response_model=PaginatedResponse[Item]), and (3) analytics logging handled asynchronously via FastAPI's BackgroundTasks (imported and used as a parameter).", "label_timestamp": "2026-05-05T14:38:22Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028224, "_latency_ms": 3933.714357961435}
+{"item_id": "q027", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "There are no gold source snippets provided. With no snippets, no claims in the answer can be grounded/supported. The answer makes multiple substantive claims (about Uvicorn workers, Nginx/HAProxy, Kubernetes, statelessness, etc.) that cannot be verified against any snippet. This is not a refusal answer, so abstain doesn't apply.", "label_timestamp": "2026-05-05T14:29:32Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019168000000000002, "_latency_ms": 3519.5088270120323}
+{"item_id": "q027", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The question asks how FastAPI handles automatic load balancing across multiple servers. The answer directly addresses this by clarifying that FastAPI does not provide built-in load balancing, explains what it does provide (worker processes, ASGI foundation), and describes how load balancing is actually achieved via external tools. This directly engages with the question's core ask.", "label_timestamp": "2026-05-05T14:34:11Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020432000000000002, "_latency_ms": 3767.282517044805}
+{"item_id": "q027", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "No gold reference answer was provided to compare against. Without reference points to check coverage, the score cannot be determined.", "label_timestamp": "2026-05-05T14:38:29Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.001504, "_latency_ms": 2372.651186946314}

results/calibration_v1_judge_baseline.json ADDED Viewed

The diff for this file is too large to render. See raw diff

results/calibration_v1_judge_baseline_no_abstain.json ADDED Viewed

The diff for this file is too large to render. See raw diff

results/calibration_v1_judge_baseline_no_anchors.json ADDED Viewed

The diff for this file is too large to render. See raw diff

results/calibration_v1_judge_baseline_no_cot.json ADDED Viewed

	@@ -0,0 +1,2115 @@

+[
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer contains a claim not entailed by the snippet. The snippet describes the caching mechanism (what it does) but does not state that caching 'is particularly useful for expensive operations like database connections.' This is an embellishment importing general LLM knowledge about why caching is useful, similar to Example B in the rubric. The snippet uses `Depends(get_db)` as an identifier in its example, not as a normative argument that caching is useful for database connections. All other claims in the answer are entailed: the default caching behavior, the once-per-request semantics, the mechanism of reusing cached values, the `use_cache=False` disable mechanism, and the code example demonstrating the disable feature. However, the single ungrounded claim about usefulness for expensive operations fails the strict binary rubric.",
+    "evidence_quotes": [
+      "Snippet: 'By default, if the same dependency is used multiple times within a single request (e.g., both a route and a sub-dependency use `Depends(get_db)`), FastAPI caches the result and calls the dependency only once.'",
+      "Answer claim: 'This improves performance by avoiding redundant function calls and is particularly useful for expensive operations like database connections.'",
+      "The 'particularly useful for expensive operations like database connections' portion is not entailed by the snippet."
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.005529600000000001,
+    "latency_ms": 5959.3987130210735,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims that need to be evaluated against the gold snippets:\n\n1. 'The default Swagger UI endpoint in FastAPI is `/docs`' - Snippet [1] directly states this.\n2. 'This endpoint is automatically available in every FastAPI application with zero configuration required' - Snippet [2] states 'Every FastAPI application exposes three documentation-related endpoints by default', which entails this claim.\n3. 'When you navigate to http://localhost:8000/docs... you'll see an interactive API explorer powered by Swagger UI' - Snippet [1] describes `/docs` as 'Swagger UI -- interactive API explorer', which entails this.\n4. 'FastAPI also provides two other documentation endpoints by default: `/redoc` and `/openapi.json`' - Snippet [2] states there are 'three documentation-related endpoints by default', and the answer identifies three total endpoints. However, the snippets do not explicitly name `/redoc` and `/openapi.json` as the other two endpoints.\n5. The code examples showing customization of `docs_url`, `redoc_url`, and `openapi_url` parameters - These are not entailed by the snippets provided.\n6. The code example showing how to disable endpoints by setting them to `None` - This is not entailed by the snippets provided.\n\nThe critical issue is that while Snippet [2] confirms there are three documentation endpoints by default, it does not specify what those three endpoints are. The answer names `/redoc` and `/openapi.json` as the other two, but this is not entailed by the snippets. Additionally, the customization examples and the ability to disable endpoints are not mentioned in the gold snippets at all.\n\nUnder the strict reference scope rule, claims that are factually correct but not entailed by the snippets must score 0. The answer goes beyond what the snippets explicitly state regarding the specific endpoint names and the customization/disabling capabilities.",
+    "evidence_quotes": [
+      "Snippet [1]: '/docs | Swagger UI -- interactive API explorer'",
+      "Snippet [2]: 'Every FastAPI application exposes three documentation-related endpoints by default'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.006808,
+    "latency_ms": 11671.607129916083,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "Evaluating each major claim in the answer against the gold snippets:\n\n1. 'HTTPException is the primary way to raise HTTP errors' - ENTAILED by snippet [1].\n2. 'Raising HTTPException immediately terminates request processing and returns specified status code and detail' - ENTAILED by snippet [2].\n3. 'detail parameter can be a string, list, or dictionary' - ENTAILED by snippet [2].\n4. 'FastAPI serializes detail to JSON automatically' - ENTAILED by snippet [2].\n5. 'status_code is required parameter' - NOT ENTAILED. Snippets do not specify which parameters are required vs optional.\n6. 'detail is optional parameter' - NOT ENTAILED. Snippets do not specify optionality.\n7. 'headers parameter (optional)' - NOT ENTAILED. Snippets make no mention of a headers parameter.\n8. 'Custom headers example with X-Custom-Header' - NOT ENTAILED. Snippets do not mention headers capability.\n9. 'Response is automatically formatted as JSON with detail field' - ENTAILED by snippet [2] (serializes to JSON).\n10. 'HTTPException inherits from Starlette's HTTPException' - NOT ENTAILED. Snippets do not mention Starlette or inheritance.\n11. 'Can create custom exception handlers' - NOT ENTAILED. Snippets do not mention custom exception handlers.\n\nThe answer contains multiple ungrounded claims: the headers parameter and its usage, parameter optionality/requirement specifications, the Starlette inheritance detail, and custom exception handlers. Under the strict binary rubric, even one ungrounded claim requires a score of 0.",
+    "evidence_quotes": [
+      "Snippet [1]: 'The HTTPException class is the primary way to return error responses from route handlers'",
+      "Snippet [2]: 'When raised, HTTPException immediately terminates request processing and returns the specified status code and detail message. The detail parameter can be a string, list, or dictionary -- FastAPI serializes it to JSON automatically.'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.006624,
+    "latency_ms": 7320.906443987042,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0016296000000000001,
+    "latency_ms": 806.1586309922859,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer provides a comprehensive FastAPI setup guide covering error handling, CORS, and dependency injection. However, examining against the gold snippets: Snippet [1] states 'The `HTTPException` class is the primary way to return error responses from route handlers.' Snippet [2] states 'Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette.' The answer makes numerous claims beyond what is entailed by these snippets: (1) Custom exception classes (ItemNotFoundException, DatabaseError) and their handlers are presented as a primary pattern, but the snippet specifies HTTPException as 'the primary way'; (2) Custom exception handlers for RequestValidationError are shown without snippet support; (3) Detailed CORS configuration parameters (allow_origins, allow_credentials, allow_methods, allow_headers, expose_headers, max_age) are provided, but the snippet only states that CORS 'is configured using CORSMiddleware from Starlette' without detailing these parameters; (4) The entire dependency injection section is completely unsupported by the snippets; (5) Database class implementation, Pydantic models, and route handler examples are all beyond snippet scope. While the answer is factually correct FastAPI documentation, it dramatically exceeds what is entailed by the two minimal snippets provided. The strict-conjunction rule applies: even though most claims are correct in general FastAPI knowledge, they are not entailed by the gold snippets.",
+    "evidence_quotes": [
+      "Snippet [1]: 'The `HTTPException` class is the primary way to return error responses from route handlers'",
+      "Snippet [2]: 'Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette'",
+      "Answer claims: Custom exception classes, RequestValidationError handlers, detailed CORS parameters (allow_origins, allow_credentials, allow_methods, allow_headers, expose_headers, max_age), dependency injection patterns, database classes, Pydantic models, and route implementations"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.007796000000000001,
+    "latency_ms": 9348.714338964783,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer provides a comprehensive implementation of a paginated API with cursor-based pagination, validation, and analytics. However, the gold snippet only states: 'Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets.' The answer goes far beyond this single sentence by providing: (1) complete Pydantic response models with validation, (2) cursor encoding/decoding utilities using base64, (3) analytics logging with background tasks, (4) a complete FastAPI endpoint implementation with mock database, and (5) detailed field specifications and configurations. While these implementations are factually correct and follow best practices, they are NOT entailed by the gold snippet. The snippet only describes what cursor-based pagination is and why it's useful (avoiding performance degradation). The answer adds extensive implementation details, specific technologies (Pydantic, base64 encoding, logging, BackgroundTasks), and architectural patterns that are not mentioned in or entailed by the snippet. This is similar to the k8s_006 calibration anchor: the answer dramatically extends beyond what the gold snippet covers.",
+    "evidence_quotes": [
+      "Gold snippet: 'Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets.'",
+      "Answer claims not in snippet: Complete Pydantic BaseModel definitions, Generic type parameters, Field validators with constraints, base64 encoding/decoding implementation, analytics logging with JSON serialization, FastAPI endpoint decorators, BackgroundTasks integration, mock database with 100 items, datetime handling, and specific HTTP query parameter configurations."
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.007131199999999999,
+    "latency_ms": 8201.74567901995,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0018592,
+    "latency_ms": 16725.520576001145,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes numerous claims beyond what is entailed by the two gold snippets provided. The snippets state: (1) A Deployment manages Pods for stateless workloads, and (2) A StatefulSet maintains sticky identity for each Pod. The answer goes far beyond these minimal claims to include: ordinal-based naming conventions, stable hostnames, persistent volumes via volumeClaimTemplates, ordered scaling/updates, specific use cases (databases, message queues, etc.), YAML configuration examples, headless service requirements, and detailed comparisons across multiple dimensions. While these claims are factually correct in Kubernetes documentation, they are NOT entailed by the two provided snippets. The snippets only establish the core distinction (stateless vs. stateful identity), not the mechanisms, naming schemes, storage patterns, ordering semantics, or use-case guidance that comprise most of the answer. This is analogous to the k8s_006 calibration anchor: the answer dramatically extends beyond the gold-snippet scope into general Kubernetes knowledge not present in the reference material.",
+    "evidence_quotes": [
+      "Snippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'",
+      "Snippet [2]: 'Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods'",
+      "Answer claims not in snippets: 'Ordinal-based names (e.g., web-0, web-1, web-2)', 'Stable hostname maintained across rescheduling', 'Persistent volumes per Pod via volumeClaimTemplates', 'Ordered, sequential operations (0\u21921\u21922)', 'Headless Service Required', detailed YAML examples, specific use cases like 'Databases (MySQL, PostgreSQL, MongoDB)'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.0074048,
+    "latency_ms": 20093.030811985955,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011CajszTeiZPK5rrayb3438'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.0031856,
+    "latency_ms": 51021.941915969364,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims about Pod Security Admission's enforce and warn modes. Evaluating against the gold snippets:\n\nSnippet [1] states: 'Policy violations will cause the pod to be rejected'\nSnippet [2] only provides feature state information (Kubernetes v1.25 [stable])\n\nThe answer's key claims:\n1. 'Policy violations will cause the pod to be rejected' in enforce mode - ENTAILED by snippet [1]\n2. 'Policy violations are otherwise allowed' in warn mode - NOT ENTAILED by snippets\n3. 'Pod creation fails silently' in enforce mode - NOT ENTAILED by snippets\n4. 'Policy violations trigger a user-facing warning' in warn mode - NOT ENTAILED by snippets\n5. Claims about workload resources not being applied in enforce mode - NOT ENTAILED by snippets\n6. Claims about workload resources being applied in warn mode - NOT ENTAILED by snippets\n7. Namespace labeling examples and configuration details - NOT ENTAILED by snippets\n8. Claims about independent configuration - NOT ENTAILED by snippets\n\nThe gold snippets provide only minimal information: that policy violations cause pod rejection (in some mode, presumably enforce), and a feature state marker. The answer goes far beyond this, making numerous claims about warn mode behavior, workload resource handling, configuration examples, and practical implications that are not entailed by the provided snippets. While these claims may be factually correct in Kubernetes documentation, they are not grounded in the gold snippets provided. The strict-entailment rule requires that every claim be entailed by the gold snippets only.",
+    "evidence_quotes": [
+      "Snippet [1]: 'Policy violations will cause the pod to be rejected'",
+      "Snippet [2]: 'FEATURE STATE: Kubernetes v1.25 [stable]'",
+      "Answer claims about warn mode, workload resources, configuration examples, and practical implications are not present in or entailed by the gold snippets"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0065352000000000006,
+    "latency_ms": 36792.009823955595,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states only: 'A ConfigMap is an API object used to store non-confidential data in key-value pairs.' The answer makes numerous claims beyond this single sentence: (1) three consumption methods (environment variables, command-line arguments, volumes), (2) specific examples of what to store (application configuration settings, environment-specific parameters, configuration files, feature flags), (3) what NOT to store (passwords, API keys, tokens, certificates), (4) recommendation to use Kubernetes Secrets instead, and (5) details about `data` and `binaryData` fields. While these claims are factually correct in Kubernetes documentation, they are NOT entailed by the gold snippet provided. The snippet only defines what a ConfigMap is and that it stores non-confidential data in key-value pairs. It does not describe consumption methods, specific use cases, security warnings, or schema fields. This is analogous to calibration anchor k8s_006: the answer dramatically extends beyond the snippet's scope. The strict-conjunction rule applies\u2014a single ungrounded claim fails the binary rubric, and this answer contains multiple ungrounded claims.",
+    "evidence_quotes": [
+      "G",
+      "o",
+      "l",
+      "d",
+      " ",
+      "s",
+      "n",
+      "i",
+      "p",
+      "p",
+      "e",
+      "t",
+      ":",
+      " ",
+      "'",
+      "A",
+      " ",
+      "C",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "M",
+      "a",
+      "p",
+      " ",
+      "i",
+      "s",
+      " ",
+      "a",
+      "n",
+      " ",
+      "A",
+      "P",
+      "I",
+      " ",
+      "o",
+      "b",
+      "j",
+      "e",
+      "c",
+      "t",
+      " ",
+      "u",
+      "s",
+      "e",
+      "d",
+      " ",
+      "t",
+      "o",
+      " ",
+      "s",
+      "t",
+      "o",
+      "r",
+      "e",
+      " ",
+      "n",
+      "o",
+      "n",
+      "-",
+      "c",
+      "o",
+      "n",
+      "f",
+      "i",
+      "d",
+      "e",
+      "n",
+      "t",
+      "i",
+      "a",
+      "l",
+      " ",
+      "d",
+      "a",
+      "t",
+      "a",
+      " ",
+      "i",
+      "n",
+      " ",
+      "k",
+      "e",
+      "y",
+      "-",
+      "v",
+      "a",
+      "l",
+      "u",
+      "e",
+      " ",
+      "p",
+      "a",
+      "i",
+      "r",
+      "s",
+      ".",
+      "'",
+      " ",
+      "A",
+      "n",
+      "s",
+      "w",
+      "e",
+      "r",
+      " ",
+      "c",
+      "l",
+      "a",
+      "i",
+      "m",
+      "s",
+      " ",
+      "n",
+      "o",
+      "t",
+      " ",
+      "i",
+      "n",
+      " ",
+      "s",
+      "n",
+      "i",
+      "p",
+      "p",
+      "e",
+      "t",
+      ":",
+      " ",
+      "'",
+      "P",
+      "o",
+      "d",
+      "s",
+      " ",
+      "c",
+      "a",
+      "n",
+      " ",
+      "c",
+      "o",
+      "n",
+      "s",
+      "u",
+      "m",
+      "e",
+      " ",
+      "C",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "M",
+      "a",
+      "p",
+      "s",
+      " ",
+      "i",
+      "n",
+      " ",
+      "t",
+      "h",
+      "r",
+      "e",
+      "e",
+      " ",
+      "w",
+      "a",
+      "y",
+      "s",
+      ":",
+      " ",
+      "E",
+      "n",
+      "v",
+      "i",
+      "r",
+      "o",
+      "n",
+      "m",
+      "e",
+      "n",
+      "t",
+      " ",
+      "v",
+      "a",
+      "r",
+      "i",
+      "a",
+      "b",
+      "l",
+      "e",
+      "s",
+      ",",
+      " ",
+      "C",
+      "o",
+      "m",
+      "m",
+      "a",
+      "n",
+      "d",
+      "-",
+      "l",
+      "i",
+      "n",
+      "e",
+      " ",
+      "a",
+      "r",
+      "g",
+      "u",
+      "m",
+      "e",
+      "n",
+      "t",
+      "s",
+      ",",
+      " ",
+      "C",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "u",
+      "r",
+      "a",
+      "t",
+      "i",
+      "o",
+      "n",
+      " ",
+      "f",
+      "i",
+      "l",
+      "e",
+      "s",
+      " ",
+      "i",
+      "n",
+      " ",
+      "a",
+      " ",
+      "v",
+      "o",
+      "l",
+      "u",
+      "m",
+      "e",
+      "'",
+      ";",
+      " ",
+      "'",
+      "C",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "M",
+      "a",
+      "p",
+      "s",
+      " ",
+      "a",
+      "r",
+      "e",
+      " ",
+      "d",
+      "e",
+      "s",
+      "i",
+      "g",
+      "n",
+      "e",
+      "d",
+      " ",
+      "f",
+      "o",
+      "r",
+      " ",
+      "n",
+      "o",
+      "n",
+      "-",
+      "c",
+      "o",
+      "n",
+      "f",
+      "i",
+      "d",
+      "e",
+      "n",
+      "t",
+      "i",
+      "a",
+      "l",
+      ",",
+      " ",
+      "n",
+      "o",
+      "n",
+      "-",
+      "s",
+      "e",
+      "n",
+      "s",
+      "i",
+      "t",
+      "i",
+      "v",
+      "e",
+      " ",
+      "d",
+      "a",
+      "t",
+      "a",
+      ",",
+      " ",
+      "s",
+      "u",
+      "c",
+      "h",
+      " ",
+      "a",
+      "s",
+      ":",
+      " ",
+      "A",
+      "p",
+      "p",
+      "l",
+      "i",
+      "c",
+      "a",
+      "t",
+      "i",
+      "o",
+      "n",
+      " ",
+      "c",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "u",
+      "r",
+      "a",
+      "t",
+      "i",
+      "o",
+      "n",
+      " ",
+      "s",
+      "e",
+      "t",
+      "t",
+      "i",
+      "n",
+      "g",
+      "s",
+      ",",
+      " ",
+      "E",
+      "n",
+      "v",
+      "i",
+      "r",
+      "o",
+      "n",
+      "m",
+      "e",
+      "n",
+      "t",
+      "-",
+      "s",
+      "p",
+      "e",
+      "c",
+      "i",
+      "f",
+      "i",
+      "c",
+      " ",
+      "p",
+      "a",
+      "r",
+      "a",
+      "m",
+      "e",
+      "t",
+      "e",
+      "r",
+      "s",
+      ",",
+      " ",
+      "C",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "u",
+      "r",
+      "a",
+      "t",
+      "i",
+      "o",
+      "n",
+      " ",
+      "f",
+      "i",
+      "l",
+      "e",
+      "s",
+      ",",
+      " ",
+      "F",
+      "e",
+      "a",
+      "t",
+      "u",
+      "r",
+      "e",
+      " ",
+      "f",
+      "l",
+      "a",
+      "g",
+      "s",
+      "'",
+      ";",
+      " ",
+      "'",
+      "D",
+      "o",
+      " ",
+      "N",
+      "O",
+      "T",
+      " ",
+      "s",
+      "t",
+      "o",
+      "r",
+      "e",
+      " ",
+      "s",
+      "e",
+      "n",
+      "s",
+      "i",
+      "t",
+      "i",
+      "v",
+      "e",
+      " ",
+      "o",
+      "r",
+      " ",
+      "c",
+      "o",
+      "n",
+      "f",
+      "i",
+      "d",
+      "e",
+      "n",
+      "t",
+      "i",
+      "a",
+      "l",
+      " ",
+      "d",
+      "a",
+      "t",
+      "a",
+      " ",
+      "i",
+      "n",
+      " ",
+      "C",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "M",
+      "a",
+      "p",
+      "s",
+      ",",
+      " ",
+      "s",
+      "u",
+      "c",
+      "h",
+      " ",
+      "a",
+      "s",
+      ":",
+      " ",
+      "P",
+      "a",
+      "s",
+      "s",
+      "w",
+      "o",
+      "r",
+      "d",
+      "s",
+      ",",
+      " ",
+      "A",
+      "P",
+      "I",
+      " ",
+      "k",
+      "e",
+      "y",
+      "s",
+      ",",
+      " ",
+      "T",
+      "o",
+      "k",
+      "e",
+      "n",
+      "s",
+      ",",
+      " ",
+      "C",
+      "e",
+      "r",
+      "t",
+      "i",
+      "f",
+      "i",
+      "c",
+      "a",
+      "t",
+      "e",
+      "s",
+      "'",
+      ";",
+      " ",
+      "'",
+      "y",
+      "o",
+      "u",
+      " ",
+      "s",
+      "h",
+      "o",
+      "u",
+      "l",
+      "d",
+      " ",
+      "u",
+      "s",
+      "e",
+      " ",
+      "K",
+      "u",
+      "b",
+      "e",
+      "r",
+      "n",
+      "e",
+      "t",
+      "e",
+      "s",
+      " ",
+      "S",
+      "e",
+      "c",
+      "r",
+      "e",
+      "t",
+      "s",
+      " ",
+      "i",
+      "n",
+      "s",
+      "t",
+      "e",
+      "a",
+      "d",
+      "'",
+      ";",
+      " ",
+      "'",
+      "C",
+      "o",
+      "n",
+      "f",
+      "i",
+      "g",
+      "M",
+      "a",
+      "p",
+      "s",
+      " ",
+      "h",
+      "a",
+      "v",
+      "e",
+      " ",
+      "`",
+      "d",
+      "a",
+      "t",
+      "a",
+      "`",
+      " ",
+      "a",
+      "n",
+      "d",
+      " ",
+      "`",
+      "b",
+      "i",
+      "n",
+      "a",
+      "r",
+      "y",
+      "D",
+      "a",
+      "t",
+      "a",
+      "`",
+      " ",
+      "f",
+      "i",
+      "e",
+      "l",
+      "d",
+      "s",
+      "'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.006216,
+    "latency_ms": 16998.984744946938,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states: 'A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate.' The answer makes numerous claims that go far beyond what is entailed by this single snippet. Specifically: (1) The answer describes Jobs as 'finite or batch tasks to completion' \u2014 not entailed by the snippet. (2) The answer explains completion modes (NonIndexed vs Indexed) with detailed specifications about `.spec.completionMode`, `.spec.completions`, and how each mode determines completion \u2014 none of this is mentioned in the snippet. (3) The answer describes how completion indices are made available through pod annotations, labels, and environment variables \u2014 not entailed. (4) The answer explains `.spec.parallelism` and `.spec.backoffLimit` parameters \u2014 not entailed. (5) The answer states Jobs are 'useful for running tasks that need to run once and then stop, unlike Deployments which run continuously' \u2014 a use-case comparison not in the snippet. While the answer is factually correct about Kubernetes Jobs, the strict-snippet scope rule is clear: claims not entailed by the gold snippet must score 0, even if they are correct in the broader documentation or world knowledge. This answer dramatically extends beyond the single sentence provided, similar to the k8s_006 calibration anchor. The only claim entailed by the snippet is that a Job 'creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate,' which appears in the opening paragraph. Everything else fails strict entailment.",
+    "evidence_quotes": [
+      "Gold snippet: 'A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate'",
+      "Answer claims about completion modes, indexed vs non-indexed, .spec.completionMode, environment variables, parallelism, backoffLimit \u2014 none present in snippet"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0067864,
+    "latency_ms": 12302.207177970558,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states: 'The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole, RoleBinding and ClusterRoleBinding.' The answer correctly identifies these four object types. However, the answer goes significantly beyond what is entailed by the snippet. The snippet only declares that these four kinds exist; it does not describe their purposes, scopes, or relationships. The answer makes numerous claims not entailed by the snippet: (1) Role 'defines a set of permissions within a specific namespace' and 'contains rules that represent a set of permissions'; (2) ClusterRole is 'cluster-scoped rather than namespace-scoped' and 'can be used to grant permissions across the entire cluster'; (3) RoleBinding 'grants the permissions defined in a Role' and 'connects a Role to subjects'; (4) ClusterRoleBinding 'grants the permissions defined in a ClusterRole at the cluster level'; (5) the distinction that 'Roles and ClusterRoles define what permissions are available, while RoleBindings and ClusterRoleBindings define who has those permissions'; (6) the namespace vs cluster-scoped distinction and its implications. While these claims are factually correct in Kubernetes documentation, they are not entailed by the single-sentence snippet provided. The snippet only names the four objects; it does not describe their functionality, scope, or relationships. This is analogous to calibration anchor k8s_006, where the answer dramatically extends beyond what the snippet entails.",
+    "evidence_quotes": [
+      "Gold snippet: 'The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole, RoleBinding and ClusterRoleBinding'",
+      "Answer claims about Role: 'Defines a set of permissions within a specific namespace. It contains rules that represent a set of permissions.' \u2014 NOT in snippet",
+      "Answer claims about ClusterRole: 'Similar to a Role, but is cluster-scoped rather than namespace-scoped. It can be used to grant permissions across the entire cluster.' \u2014 NOT in snippet",
+      "Answer claims about RoleBinding: 'Grants the permissions defined in a Role to a user or set of users within a specific namespace. It connects a Role to subjects' \u2014 NOT in snippet",
+      "Answer claims about ClusterRoleBinding: 'Similar to a RoleBinding, but grants the permissions defined in a ClusterRole at the cluster level.' \u2014 NOT in snippet",
+      "Answer meta-claim: 'Roles and ClusterRoles define what permissions are available, while RoleBindings and ClusterRoleBindings define who has those permissions.' \u2014 NOT in snippet"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0070079999999999995,
+    "latency_ms": 10670.279069978278,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet provided is incomplete: '[1] Secrets can be mounted as data volumes or exposed as'. It cuts off mid-sentence and provides no substantive information about the behavioral differences between these two approaches. The answer makes numerous specific claims about automatic updates, dynamic refresh, subPath mount limitations, static environment variables, pod restart requirements, character restrictions, and use-case recommendations. None of these claims are entailed by the incomplete snippet fragment, which merely states that Secrets *can* be mounted or exposed without elaborating on how they differ or what their characteristics are. Under strict reference scope (gold snippets only) and the binary rubric requiring every claim to be entailed, this answer fails because the snippet provides no content to ground any of the detailed behavioral claims made.",
+    "evidence_quotes": [
+      "Gold snippet: '[1] Secrets can be mounted as data volumes or exposed as' (incomplete/truncated)"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0053408,
+    "latency_ms": 11916.670492035337,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states only: 'If you set the `emptyDir.medium` field to `\"Memory\"`, Kubernetes mounts a tmpfs'. This is a single factual claim about what happens when medium is set to Memory. The answer makes numerous claims beyond this snippet: (1) default emptyDir storage location and behavior, (2) performance characteristics of both modes, (3) memory impact and counting against container limits, (4) size limit specifications and node allocatable memory, (5) resource considerations about memory cost and capacity. While many of these claims may be factually correct in Kubernetes documentation, they are NOT entailed by the provided gold snippet. The snippet only establishes that setting medium to 'Memory' results in tmpfs mounting. Claims about default behavior, memory accounting, size limits, and resource considerations go dramatically beyond what the snippet entails. This is analogous to the k8s_006 calibration anchor where the answer extends far beyond the single-sentence snippet provided.",
+    "evidence_quotes": [
+      "Gold snippet: 'If you set the `emptyDir.medium` field to `\"Memory\"`, Kubernetes mounts a tmpfs'",
+      "Answer claims not in snippet: storage location for default emptyDir, performance characteristics, memory impact and counting against limits, size limit behavior, node allocatable memory sizing, resource considerations"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.005747200000000001,
+    "latency_ms": 30587.856588943396,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims about liveness and readiness probes. Checking against the gold snippets:\n\nSnippet [1]: 'uses liveness probes to know when to restart a container'\nSnippet [2]: 'uses readiness probes to know when a container is ready to start accepting traffic'\n\nThe answer claims:\n1. Liveness probe failure \u2192 kubelet kills container and restarts it\n2. Readiness probe failure \u2192 kubelet continues running container, doesn't kill it\n3. Readiness probe failure \u2192 pod won't receive traffic from Services\n4. Liveness probe is for detecting unrecoverable failures\n5. Readiness probe is for detecting temporary unavailability\n6. Best practice details about using both probes together with same endpoint, higher failureThreshold for liveness\n\nSnippet [1] only states that liveness probes are used 'to know when to restart a container' - it entails that liveness probes trigger restarts.\n\nSnippet [2] only states that readiness probes are used 'to know when a container is ready to start accepting traffic' - it entails that readiness probes relate to traffic acceptance.\n\nHowever, the answer makes many claims NOT entailed by these minimal snippets:\n- The specific behavior that readiness probe failure keeps the container running (not stated in snippet [2])\n- The detail about Pod's Ready condition being set to false\n- The use-case explanations (unrecoverable failures, temporary unavailability)\n- The best practice section about using both together, same endpoint, failureThreshold settings\n- The terminationGracePeriodSeconds detail\n\nWhile these claims are factually correct in Kubernetes documentation, they are NOT entailed by the two provided snippets. The snippets only state the basic purpose of each probe type, not the detailed behavioral differences or best practices described in the answer.\n\nUnder strict reference scope (gold snippets only), this answer goes far beyond what is entailed.",
+    "evidence_quotes": [
+      "Snippet [1]: 'uses liveness probes to know when to restart a container'",
+      "Snippet [2]: 'uses readiness probes to know when a container is ready to start accepting traffic'",
+      "Answer claims: 'The kubelet kills the container and restarts it', 'The kubelet continues running the container', 'The pod will not receive traffic from Kubernetes Services', 'Use case: Detecting unrecoverable application failures', 'Use case: Indicating that a container is temporarily unable to handle traffic', 'A common pattern is to use both probes together'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.007291200000000001,
+    "latency_ms": 26607.242761005182,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes numerous claims beyond what is entailed by the two gold snippets. The snippets only state: (1) NodePort 'Exposes the Service on each Node' and (2) LoadBalancer 'Exposes the Service externally using an external load balancer.' The answer adds many ungrounded claims including: the NodePort range (30000-32767), that NodePort creates a cluster IP internally, that external traffic accesses via <NodeIP>:<NodePort>, that LoadBalancer requires cloud provider integration, that LoadBalancer creates a NodePort internally, the optional `spec.allocateLoadBalancerNodePorts: false` configuration, and various use-case recommendations (development/testing vs production). While these claims may be factually correct in Kubernetes documentation, they are not entailed by the gold snippets provided. The strict-reference-scope rule requires that claims be entailed by the gold snippets only, not by broader domain knowledge or the underlying documentation. This is analogous to calibration anchor k8s_006, where correct information beyond the snippet scope scores 0.",
+    "evidence_quotes": [
+      "Gold snippet [1]: 'Exposes the Service on each Node'",
+      "Gold snippet [2]: 'Exposes the Service externally using an external load balancer'",
+      "Answer claims not in snippets: NodePort range (30000-32767), cluster IP creation, <NodeIP>:<NodePort> access pattern, cloud provider integration requirement, LoadBalancer creates NodePort internally, spec.allocateLoadBalancerNodePorts configuration, development vs production suitability"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0065256,
+    "latency_ms": 11835.180599999148,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims that must be checked against the two gold snippets: [1] 'Jobs represent one-off tasks that run to completion and then stop' and [2] 'A CronJob starts one-time Jobs on a repeating schedule'. The answer's core claims are: (1) Jobs run once when created; (2) CronJobs run on a repeating schedule; (3) Jobs are for one-time batch tasks; (4) CronJobs are for recurring tasks; (5) CronJobs create Job objects approximately once per scheduled execution; (6) Kubernetes doesn't guarantee exactly-once semantics; (7) Two Jobs might be created for a single schedule; (8) No Job might be created for a scheduled time; (9) Jobs created by CronJobs should be idempotent; (10) You can set startingDeadlineSeconds on a CronJob; (11) startingDeadlineSeconds controls what happens if a scheduled Job is missed. Claims 1-4 are directly entailed by the snippets (paraphrase of 'one-off tasks' and 'repeating schedule'). However, claims 5-11 regarding exactly-once semantics, idempotency requirements, the possibility of multiple or zero Job creations, and the startingDeadlineSeconds mechanism are NOT entailed by the gold snippets. These are factually correct Kubernetes behaviors but are not present in the provided reference material. The strict-entailment rubric requires that every claim be entailed by the gold snippets only. The answer goes well beyond what the snippets support, similar to the k8s_006 calibration anchor where correct information from the broader documentation fails because it's not in the gold snippet scope.",
+    "evidence_quotes": [
+      "Gold snippet [1]: 'Jobs represent one-off tasks that run to completion and then stop'",
+      "Gold snippet [2]: 'A CronJob starts one-time Jobs on a repeating schedule'",
+      "Answer claims not in snippets: 'Kubernetes doesn't guarantee exactly-once semantics', 'Two Jobs might be created for a single schedule', 'No Job might be created for a scheduled time', 'Jobs created by CronJobs should be idempotent', 'You can set startingDeadlineSeconds on a CronJob'"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.007357600000000001,
+    "latency_ms": 27444.345601077657,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes several claims about Deployments and DaemonSets. Checking against the gold snippets:\n\nSnippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'\nSnippet [2]: 'DaemonSet ensures that all (or some) Nodes run a copy of a Pod'\n\nClaims in the answer:\n1. Deployment manages 'specified number of Pod replicas' - entailed by [1] ('manages a set of Pods')\n2. Deployment Pods run on 'any eligible nodes' - NOT entailed by snippet [1]\n3. Deployment 'uses the standard Kubernetes scheduler' - NOT entailed by snippets\n4. Deployment 'Pods are distributed across nodes based on availability and scheduling policies' - NOT entailed by snippets\n5. DaemonSet 'ensures that all eligible nodes (or a subset based on node selectors/affinity) run exactly one copy' - partially entailed by [2] ('all (or some) Nodes run a copy of a Pod'), though [2] doesn't specify 'exactly one'\n6. DaemonSet controller 'creates a Pod for each eligible node and adds nodeAffinity rules' - NOT entailed by [2]\n7. DaemonSet uses '.spec.nodeName field' - NOT entailed by [2]\n8. Use cases section - NOT entailed by snippets\n\nThe answer goes far beyond what the snippets entail. While some core concepts are present (Deployment manages Pods, DaemonSet runs on nodes), the answer adds substantial detail about scheduling mechanisms, nodeAffinity, nodeName fields, and use cases that are not in the gold snippets. Under strict snippet entailment, this fails the binary rubric.",
+    "evidence_quotes": [
+      "Snippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'",
+      "Snippet [2]: 'DaemonSet ensures that all (or some) Nodes run a copy of a Pod'",
+      "Answer claims about 'standard Kubernetes scheduler', 'nodeAffinity rules', '.spec.nodeName field', and use cases are not entailed by the snippets"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.006692,
+    "latency_ms": 14147.950275044423,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt33D8HsUdB7oGVqJ1A'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0,
+    "latency_ms": 34940.02001499757,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The answer makes multiple specific claims about the `autoscaling/v2` API version and HorizontalPodAutoscaler functionality. However, the gold snippets provided are extremely minimal: [1] states that HorizontalPodAutoscaler 'automatically updates a workload resource' and [2] is incomplete ('The current stable version can be found in the'). Neither snippet entails the detailed claims in the answer about: (a) autoscaling/v2 being the current stable version since Kubernetes v1.23, (b) support for custom metrics, (c) support for memory metrics, (d) the ability to specify multiple metrics, (e) configurable scaling behavior with separate scale-up/scale-down behaviors, (f) autoscaling/v1 only supporting CPU-based scaling, or (g) the comparison between v1 and v2 features. While these claims may be factually correct in the broader Kubernetes documentation, they are not entailed by the gold snippets provided. The strict reference scope is the gold snippets only, and the answer dramatically exceeds what can be inferred from them. This is analogous to the k8s_006 calibration anchor: most of the answer is well-supported by the broader corpus, but the gold-snippet scope is what the rubric measures.",
+    "evidence_quotes": [
+      "G",
+      "o",
+      "l",
+      "d",
+      " ",
+      "s",
+      "n",
+      "i",
+      "p",
+      "p",
+      "e",
+      "t",
+      " ",
+      "[",
+      "1",
+      "]",
+      ":",
+      " ",
+      "'",
+      "H",
+      "o",
+      "r",
+      "i",
+      "z",
+      "o",
+      "n",
+      "t",
+      "a",
+      "l",
+      "P",
+      "o",
+      "d",
+      "A",
+      "u",
+      "t",
+      "o",
+      "s",
+      "c",
+      "a",
+      "l",
+      "e",
+      "r",
+      "*",
+      " ",
+      "a",
+      "u",
+      "t",
+      "o",
+      "m",
+      "a",
+      "t",
+      "i",
+      "c",
+      "a",
+      "l",
+      "l",
+      "y",
+      " ",
+      "u",
+      "p",
+      "d",
+      "a",
+      "t",
+      "e",
+      "s",
+      " ",
+      "a",
+      " ",
+      "w",
+      "o",
+      "r",
+      "k",
+      "l",
+      "o",
+      "a",
+      "d",
+      " ",
+      "r",
+      "e",
+      "s",
+      "o",
+      "u",
+      "r",
+      "c",
+      "e",
+      "'",
+      ".",
+      " ",
+      "G",
+      "o",
+      "l",
+      "d",
+      " ",
+      "s",
+      "n",
+      "i",
+      "p",
+      "p",
+      "e",
+      "t",
+      " ",
+      "[",
+      "2",
+      "]",
+      ":",
+      " ",
+      "'",
+      "T",
+      "h",
+      "e",
+      " ",
+      "c",
+      "u",
+      "r",
+      "r",
+      "e",
+      "n",
+      "t",
+      " ",
+      "s",
+      "t",
+      "a",
+      "b",
+      "l",
+      "e",
+      " ",
+      "v",
+      "e",
+      "r",
+      "s",
+      "i",
+      "o",
+      "n",
+      " ",
+      "c",
+      "a",
+      "n",
+      " ",
+      "b",
+      "e",
+      " ",
+      "f",
+      "o",
+      "u",
+      "n",
+      "d",
+      " ",
+      "i",
+      "n",
+      " ",
+      "t",
+      "h",
+      "e",
+      "'",
+      " ",
+      "(",
+      "i",
+      "n",
+      "c",
+      "o",
+      "m",
+      "p",
+      "l",
+      "e",
+      "t",
+      "e",
+      ")",
+      ".",
+      " ",
+      "N",
+      "e",
+      "i",
+      "t",
+      "h",
+      "e",
+      "r",
+      " ",
+      "s",
+      "n",
+      "i",
+      "p",
+      "p",
+      "e",
+      "t",
+      " ",
+      "s",
+      "u",
+      "p",
+      "p",
+      "o",
+      "r",
+      "t",
+      "s",
+      " ",
+      "c",
+      "l",
+      "a",
+      "i",
+      "m",
+      "s",
+      " ",
+      "a",
+      "b",
+      "o",
+      "u",
+      "t",
+      " ",
+      "A",
+      "P",
+      "I",
+      " ",
+      "v",
+      "e",
+      "r",
+      "s",
+      "i",
+      "o",
+      "n",
+      "s",
+      ",",
+      " ",
+      "c",
+      "u",
+      "s",
+      "t",
+      "o",
+      "m",
+      " ",
+      "m",
+      "e",
+      "t",
+      "r",
+      "i",
+      "c",
+      "s",
+      ",",
+      " ",
+      "m",
+      "e",
+      "m",
+      "o",
+      "r",
+      "y",
+      " ",
+      "m",
+      "e",
+      "t",
+      "r",
+      "i",
+      "c",
+      "s",
+      ",",
+      " ",
+      "m",
+      "u",
+      "l",
+      "t",
+      "i",
+      "p",
+      "l",
+      "e",
+      " ",
+      "m",
+      "e",
+      "t",
+      "r",
+      "i",
+      "c",
+      "s",
+      ",",
+      " ",
+      "s",
+      "c",
+      "a",
+      "l",
+      "i",
+      "n",
+      "g",
+      " ",
+      "b",
+      "e",
+      "h",
+      "a",
+      "v",
+      "i",
+      "o",
+      "r",
+      "s",
+      ",",
+      " ",
+      "o",
+      "r",
+      " ",
+      "v",
+      "e",
+      "r",
+      "s",
+      "i",
+      "o",
+      "n",
+      " ",
+      "c",
+      "o",
+      "m",
+      "p",
+      "a",
+      "r",
+      "i",
+      "s",
+      "o",
+      "n",
+      "s",
+      "."
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.005912000000000001,
+    "latency_ms": 29024.256383010652,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states only that 'A ConfigMap is an API object used to store non-confidential data in key-value pairs.' The answer provides four detailed mechanisms for consuming ConfigMap values (environment variables individual, environment variables bulk, volume mounts, and Kubernetes API access), along with YAML examples, limitations, update behavior, and a comparison table. None of these consumption methods, technical details, YAML syntax, update semantics, or use-case guidance are entailed by the single-sentence snippet. While these claims are factually correct in Kubernetes documentation, they are not entailed by the gold snippet provided. This is analogous to calibration anchor k8s_006: the answer goes dramatically beyond what the snippet supports, covering consumption methods, schema details, and operational guidance that the snippet does not mention. The strict-conjunction rule applies: even though the answer is well-supported by the broader Kubernetes corpus, the gold-snippet scope is what the rubric measures.",
+    "evidence_quotes": [
+      "Gold snippet: 'A ConfigMap is an API object used to store non-confidential data in key-value pairs'",
+      "Answer claims not in snippet: 'four main mechanisms', 'Environment Variables (Individual)', 'Environment Variables (Bulk)', 'Volume Mounts (Files)', 'Kubernetes API (Direct Access)', YAML examples, 'not updated automatically', 'supports dynamic updates', 'propagation delays', 'subPath volume mounts will not receive ConfigMap updates', comparison table with update behavior"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0060544,
+    "latency_ms": 21910.545678983908,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50 requests per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt3cHxhyPgmbXLYt5AS'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0022536000000000006,
+    "latency_ms": 36177.12139704963,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippet states: 'purely additive (there are no \"deny\" rules)'. This establishes that standard Kubernetes RBAC is allow-based only with no deny rules. The answer's opening claim that 'Standard Kubernetes RBAC does **not have explicit deny rules**' and 'RBAC is **allow-based only**' are directly entailed by this snippet. However, the answer then introduces Option 2 (CEL-based Authorization Policy) and Option 3 (Admission Controllers) as mechanisms to implement explicit deny rules. These options are presented as valid approaches to achieve deny functionality in Kubernetes. The gold snippet does not mention CEL-based authorization policies, AuthorizationPolicy resources, or admission controllers as ways to implement deny rules. While these may be factually correct in the broader Kubernetes ecosystem, they are not entailed by the gold snippet provided. The snippet only establishes what standard RBAC is (allow-based, no deny rules), not what alternative mechanisms exist to add deny functionality. The answer goes beyond the scope of the gold snippet by introducing solutions not mentioned in the reference material. This violates the strict-entailment requirement that claims must be entailed by the gold snippets only, not by general LLM knowledge.",
+    "evidence_quotes": [
+      "Gold snippet: 'purely additive (there are no \"deny\" rules)'",
+      "Answer claim: 'Option 2: Use CEL-based Authorization Policy (Kubernetes 1.26+)' with detailed YAML examples",
+      "Answer claim: 'Option 3: Use Admission Controllers' with ValidatingAdmissionPolicy examples"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0063864,
+    "latency_ms": 14159.782970964443,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.001712,
+    "latency_ms": 558.5277889622375,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "The gold snippets provided are minimal: [1] 'Exposes the Service on a cluster-internal IP' and [2] 'Exposes the Service externally using an external load balancer.' Snippet [1] describes ClusterIP (internal exposure), and snippet [2] describes LoadBalancer (external exposure). The answer makes numerous claims about NodePort, LoadBalancer, ExternalIPs, ClusterIP, and Ingress/Gateway resources. Most of these claims are not entailed by the two snippets. Specifically: (1) The detailed description of NodePort (allocating ports on every node, connecting via node IP and port, traffic forwarding) is not mentioned in either snippet. (2) The claim about LoadBalancer 'automatically provisions an external load balancer' and 'provides a stable external IP address' goes beyond snippet [2], which only states it 'exposes the Service externally using an external load balancer.' (3) The entire section on ExternalIPs is not covered by either snippet. (4) The ClusterIP section's claim that it 'only exposes the Service on a cluster-internal IP' is a paraphrase of snippet [1], which is grounded. (5) The claim about using Ingress or Gateway resources to expose ClusterIP is not entailed by either snippet. Under the strict-conjunction rule, a single ungrounded claim fails the binary rubric. This answer contains multiple ungrounded claims beyond what the snippets entail.",
+    "evidence_quotes": [
+      "Snippet [1]: 'Exposes the Service on a cluster-internal IP'",
+      "Snippet [2]: 'Exposes the Service externally using an external load balancer'",
+      "Answer claims about NodePort (allocation, node IP access, traffic forwarding) are not in snippets",
+      "Answer claims about LoadBalancer details (automatic provisioning, stable external IP) exceed snippet [2]",
+      "Answer section on ExternalIPs is entirely absent from snippets",
+      "Answer claims about Ingress/Gateway are not entailed by snippets"
+    ],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0067864,
+    "latency_ms": 20653.504143992905,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0008928,
+    "latency_ms": 5275.652518961579,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930",
+    "cost_usd": 0.000776,
+    "latency_ms": 613.2173900259659,
+    "item_id": "q010"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0007768,
+    "latency_ms": 4376.481520943344,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0008864000000000001,
+    "latency_ms": 8851.143473002594,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0006552000000000001,
+    "latency_ms": 6161.781317030545,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.001404,
+    "latency_ms": 1018.7896409770474,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.001404,
+    "latency_ms": 22714.352431998122,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e",
+    "cost_usd": 0.0009288,
+    "latency_ms": 4240.290573972743,
+    "item_id": "q027"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.00088,
+    "latency_ms": 4285.718351020478,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.001404,
+    "latency_ms": 6870.535210997332,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.0011152,
+    "latency_ms": 2953.8072769646533,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437",
+    "cost_usd": 0.0010424000000000002,
+    "latency_ms": 4848.448178963736,
+    "item_id": "k8s_004"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0008928,
+    "latency_ms": 10320.249837997835,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0008608000000000001,
+    "latency_ms": 8522.756394988392,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0009432,
+    "latency_ms": 6325.502372987103,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0007928000000000001,
+    "latency_ms": 1811.099338985514,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0009128000000000001,
+    "latency_ms": 768.1542619829997,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.0008704000000000001,
+    "latency_ms": 12236.18174099829,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.000932,
+    "latency_ms": 11616.58075498417,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0009008000000000001,
+    "latency_ms": 4196.989124990068,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.0009376,
+    "latency_ms": 838.445411005523,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0008352,
+    "latency_ms": 5632.905109028798,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0009328000000000001,
+    "latency_ms": 2904.8574669868685,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.000872,
+    "latency_ms": 16631.02817395702,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0011104,
+    "latency_ms": 5025.444047001656,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.000888,
+    "latency_ms": 6382.348418002948,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt6uVe7LRm3cyR9dtBR'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0,
+    "latency_ms": 20767.425035010092,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0007504000000000001,
+    "latency_ms": 5516.492156020831,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt6wc9SEcDRfPrzZBVp'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f",
+    "cost_usd": 0.0,
+    "latency_ms": 18978.80935500143,
+    "item_id": "k8s_024"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0008464000000000001,
+    "latency_ms": 3490.8632279839367,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0008888,
+    "latency_ms": 722.3775110323913,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0007648,
+    "latency_ms": 4397.730973025318,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0008768000000000001,
+    "latency_ms": 1102.155871980358,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt7RiL9hz5pVxt333xL'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0,
+    "latency_ms": 18437.84686899744,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.0014032,
+    "latency_ms": 1967.281456978526,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.0014048,
+    "latency_ms": 4361.171844007913,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0008992000000000001,
+    "latency_ms": 821.1477959994227,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.001436,
+    "latency_ms": 4178.335952979978,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.0011384000000000001,
+    "latency_ms": 5098.598277952988,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0009088000000000001,
+    "latency_ms": 1836.335435975343,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0008712,
+    "latency_ms": 20613.944871001877,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0009632,
+    "latency_ms": 1971.2769520119764,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0008248000000000001,
+    "latency_ms": 5351.545320998412,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0009288,
+    "latency_ms": 2994.747666991316,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt8UJ8fPy6FvgG5MQKo'}",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.0,
+    "latency_ms": 19476.016786997207,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.0009552,
+    "latency_ms": 574.3700260063633,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0009384,
+    "latency_ms": 5021.697896998376,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.000972,
+    "latency_ms": 776.8337430316024,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0008712,
+    "latency_ms": 4251.321821007878,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0009608,
+    "latency_ms": 4064.9051589425653,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.0008984,
+    "latency_ms": 4741.949952032883,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "The reference answer identifies two main ways to surface ConfigMap data to Pods: (1) exposing specific keys as environment variables, and (2) mounting as a volume. It also specifies a key difference: volume-mounted data updates in place when ConfigMap changes, while environment variables do not update until Pod restart. The agent's answer covers four mechanisms instead of two. Evaluating against the reference's key points: (1) Environment variables as a method - COVERED (both individual and bulk import shown). (2) Volume mounting as a method - COVERED (with detailed example). (3) Environment variables don't update until Pod restart - COVERED (explicitly stated in sections 1 and 2). (4) Volume-mounted data updates in place when ConfigMap changes - COVERED (stated in section 3). The agent adds two additional mechanisms (Kubernetes API direct access) not in the reference, but the rubric explicitly states 'The judge does not penalize the agent for adding correct extra detail.' All four key points from the reference are present in the answer, though paraphrased and expanded with additional context and examples.",
+    "evidence_quotes": [
+      "Reference point 1 - env vars: Agent states 'ConfigMap key-value pairs can be injected as individual environment variables' and 'All key-value pairs from a ConfigMap can be imported as environment variables'",
+      "Reference point 2 - volume mounting: Agent states 'ConfigMap data can be mounted as files in the container's filesystem' with 'Each key in the ConfigMap becomes a filename'",
+      "Reference point 3 - env vars don't auto-update: Agent explicitly states 'ConfigMaps consumed as environment variables are not updated automatically and require a pod restart'",
+      "Reference point 4 - volume updates in place: Agent states 'This mechanism supports dynamic updates when the ConfigMap changes (though there may be propagation delays)'"
+    ],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.004544,
+    "latency_ms": 10436.621424974874,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "Comparing the agent's answer against the reference's key points: (1) 'By default, Pods are non-isolated \u2014 they accept traffic from any source' - The agent covers this explicitly in the 'Default Baseline' section, stating 'By default, all Pods are non-isolated' for both ingress and egress. (2) 'A Pod becomes isolated as soon as any NetworkPolicy in its namespace selects it via podSelector' - The agent covers this, explaining that isolation occurs when a NetworkPolicy selects the Pod via podSelector. (3) 'only traffic explicitly allowed by the union of NetworkPolicies that select that Pod is permitted' - The agent covers this concept, stating 'only the following connections are allowed' and 'only connections explicitly allowed by the rules...are permitted'. (4) 'NetworkPolicy rules can target ingress, egress, or both' - The agent covers this extensively with separate sections on ingress and egress isolation. (5) 'the CNI plugin is what enforces the policy \u2014 Kubernetes itself does not' - The agent does NOT mention the CNI plugin or clarify that Kubernetes itself does not enforce the policy. This is a notable omission of a key technical point from the reference.",
+    "evidence_quotes": [
+      "Reference: 'By default, Pods are non-isolated' | Agent: 'By default, all Pods are non-isolated for both ingress and egress traffic'",
+      "Reference: 'A Pod becomes isolated as soon as any NetworkPolicy in its namespace selects it via podSelector' | Agent: 'A Pod becomes isolated for ingress if there is any NetworkPolicy that selects the Pod (via podSelector)'",
+      "Reference: 'only traffic explicitly allowed by the union of NetworkPolicies that select that Pod is permitted' | Agent: 'only the following connections are allowed' and 'only connections explicitly allowed by the rules'",
+      "Reference: 'NetworkPolicy rules can target ingress, egress, or both' | Agent: Covers this with separate ingress and egress sections",
+      "Reference: 'the CNI plugin is what enforces the policy \u2014 Kubernetes itself does not' | Agent: No mention of CNI plugin or enforcement mechanism"
+    ],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0041456,
+    "latency_ms": 9525.302856054623,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "The reference answer makes one key point: Kubernetes RBAC does not support deny rules; it is purely additive, and the way to prevent an action is to not grant the permission. The answer to score covers this point explicitly in the opening section ('Standard Kubernetes RBAC does not have explicit deny rules. RBAC is allow-based only') and recommends Option 1 (not granting the delete verb) as the primary solution. However, the answer goes significantly beyond the reference by introducing two additional options (CEL-based Authorization Policy and Admission Controllers) that are not mentioned in the reference. While these additions are correct and helpful, the scoring rubric explicitly states: 'The judge does not penalize the agent for adding correct extra detail (that's a separate concern). Score only on what fraction of the reference's points are present.' The single key point from the reference\u2014that RBAC has no deny rules and you must not grant permissions\u2014is fully covered in the answer.",
+    "evidence_quotes": [
+      "Reference: 'Kubernetes RBAC does not support deny rules...there are no deny rules. To prevent a user from deleting Pods you simply do not grant them a Role that contains the delete verb'",
+      "Answer: 'Standard Kubernetes RBAC does not have explicit deny rules. RBAC is allow-based only...The simplest approach is to not grant the delete verb to the user.'"
+    ],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0041928,
+    "latency_ms": 13928.576675010845,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0007776,
+    "latency_ms": 2871.9944750191644,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "The reference answer identifies four key points: (1) NodePort exposes on each Node's IP at a static port, (2) LoadBalancer uses external load balancer from cloud integration, (3) ExternalName maps to external DNS name via CNAME record, and (4) ClusterIP is default/cluster-internal only, with Ingress as alternative. The agent's answer covers: (1) NodePort - fully covered with detailed explanation, (2) LoadBalancer - fully covered, (3) ExternalIPs - this is mentioned but is NOT the same as ExternalName from the reference (ExternalIPs route to cluster nodes; ExternalName maps to external DNS via CNAME), (4) ClusterIP - covered as default/cluster-internal, (5) Ingress - mentioned as alternative. The agent's answer is missing the ExternalName Service type entirely, which is explicitly listed in the reference as one of the three main types that expose applications outside the cluster. The agent substitutes ExternalIPs instead, which is a different mechanism. This represents incomplete coverage of the reference's key points.",
+    "evidence_quotes": [
+      "Reference lists: NodePort, LoadBalancer, ExternalName, and ClusterIP/Ingress alternative",
+      "Agent covers: NodePort (\u2713), LoadBalancer (\u2713), ExternalIPs (\u2717 - not in reference), ClusterIP (\u2713), Ingress (\u2713)",
+      "Missing: ExternalName (maps Service to external DNS name via CNAME record)"
+    ],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0036983999999999997,
+    "latency_ms": 8602.465078001842,
+    "item_id": "k8s_025"
+  }
+]

results/calibration_v1_judge_jury_kappa_weighted.json ADDED Viewed

	@@ -0,0 +1,1068 @@

+[
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.00334945,
+    "latency_ms": 5274.687575001735,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.00374865,
+    "latency_ms": 6121.622106002178,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0035575000000000003,
+    "latency_ms": 4093.9200639841147,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[1, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.00290715,
+    "latency_ms": 2376.422136032488,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.004488450000000001,
+    "latency_ms": 5196.281513024587,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.0041107999999999995,
+    "latency_ms": 4710.671422013547,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0034953000000000007,
+    "latency_ms": 3891.403643996455,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.004449850000000001,
+    "latency_ms": 5046.223370009102,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.00369115,
+    "latency_ms": 4370.564304001164,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0034766,
+    "latency_ms": 3641.3472470012493,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0034145,
+    "latency_ms": 3931.418454973027,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0037749000000000003,
+    "latency_ms": 4143.7127979588695,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0032585,
+    "latency_ms": 3237.493321998045,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0031490000000000003,
+    "latency_ms": 3059.5294540398754,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.00336595,
+    "latency_ms": 3805.040806008037,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.00348385,
+    "latency_ms": 3761.2991870264523,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0033831499999999997,
+    "latency_ms": 3078.7829870241694,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.0034280499999999998,
+    "latency_ms": 3479.930945963133,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.00355015,
+    "latency_ms": 3683.550757006742,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.003197,
+    "latency_ms": 3072.0106119988486,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.0034425,
+    "latency_ms": 3274.0556380013004,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0034893500000000004,
+    "latency_ms": 4093.8572920276783,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.00330385,
+    "latency_ms": 5700.107204960659,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0037631999999999995,
+    "latency_ms": 3919.9696639552712,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.00314385,
+    "latency_ms": 3100.1224310020916,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "jury_kappa_weighted: members=[0, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.00371725,
+    "latency_ms": 4686.324876034632,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0020414499999999998,
+    "latency_ms": 4048.5086200060323,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930",
+    "cost_usd": 0.00196085,
+    "latency_ms": 7151.058950985316,
+    "item_id": "q010"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.00172915,
+    "latency_ms": 4002.432085981127,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0017475500000000003,
+    "latency_ms": 4597.96305996133,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0014225,
+    "latency_ms": 2652.675523015205,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.00249825,
+    "latency_ms": 5529.739827965386,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.002748,
+    "latency_ms": 6767.519410001114,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e",
+    "cost_usd": 0.00216725,
+    "latency_ms": 4897.304863028694,
+    "item_id": "q027"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0020017,
+    "latency_ms": 4283.594851032831,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.00290155,
+    "latency_ms": 7233.784672047477,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.00267875,
+    "latency_ms": 6703.730974986684,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437",
+    "cost_usd": 0.0021494500000000002,
+    "latency_ms": 4179.513328999747,
+    "item_id": "k8s_004"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.00211195,
+    "latency_ms": 3594.405687006656,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.00209915,
+    "latency_ms": 10237.164919963107,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0020244499999999997,
+    "latency_ms": 3682.6557849999517,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.00199385,
+    "latency_ms": 5115.608006017283,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.00209785,
+    "latency_ms": 4300.788230029866,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.0019103500000000003,
+    "latency_ms": 4295.007671986241,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.00192045,
+    "latency_ms": 2745.3473570058122,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0021632500000000002,
+    "latency_ms": 2785.768751986325,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.0019623,
+    "latency_ms": 2757.5887810089625,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0020022,
+    "latency_ms": 3992.3200099729,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.00220855,
+    "latency_ms": 3684.430745022837,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.0022516000000000003,
+    "latency_ms": 3479.075693991035,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0023306000000000004,
+    "latency_ms": 3675.9421579772606,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0020632000000000003,
+    "latency_ms": 3642.3342850175686,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0027547500000000003,
+    "latency_ms": 8526.330899971072,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0016788500000000004,
+    "latency_ms": 2267.0602449798025,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f",
+    "cost_usd": 0.0027766,
+    "latency_ms": 2918.413867999334,
+    "item_id": "k8s_024"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0019888500000000003,
+    "latency_ms": 3734.172669996042,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.00227215,
+    "latency_ms": 3487.446639977861,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0017535500000000002,
+    "latency_ms": 2375.5902140401304,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.00192865,
+    "latency_ms": 2719.3151200190187,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.00171755,
+    "latency_ms": 4097.853753017262,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[1, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.0028303000000000004,
+    "latency_ms": 3760.74364897795,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 0], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.0026877999999999997,
+    "latency_ms": 3566.6172149940394,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[1, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0022491,
+    "latency_ms": 3487.81520902412,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.0027049500000000002,
+    "latency_ms": 3581.0006340034306,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.00320695,
+    "latency_ms": 3940.6566029647365,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.002138,
+    "latency_ms": 2982.359042041935,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0023091,
+    "latency_ms": 4743.728513014503,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[1, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0026847,
+    "latency_ms": 4049.3851299979724,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0025341499999999998,
+    "latency_ms": 3183.077324007172,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0024284999999999997,
+    "latency_ms": 3271.860691020265,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.0020034,
+    "latency_ms": 2932.2378120268695,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.0026169500000000003,
+    "latency_ms": 3659.647927968763,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.00244685,
+    "latency_ms": 4667.101112019736,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[1, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.005007500000000001,
+    "latency_ms": 6751.217185985297,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.00215265,
+    "latency_ms": 3631.993369956035,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0023618499999999995,
+    "latency_ms": 3078.8671349873766,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.002239,
+    "latency_ms": 5300.237592018675,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0028338,
+    "latency_ms": 4155.337881005835,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0022305,
+    "latency_ms": 2631.397784978617,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0025917,
+    "latency_ms": 3302.9488619649783,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[2, 2], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0020013,
+    "latency_ms": 3276.021789992228,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "jury_kappa_weighted: members=[1, 1], weights=[1.0, 1.0]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "jury_v1_kappa_weighted",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0028785,
+    "latency_ms": 3899.847693974152,
+    "item_id": "k8s_025"
+  }
+]

results/calibration_v1_judge_jury_kappa_weighted_members.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

results/calibration_v1_judge_permute.json ADDED Viewed

	@@ -0,0 +1,1068 @@

+[
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0063424,
+    "latency_ms": 7439.187292999122,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0056928000000000005,
+    "latency_ms": 6489.0113399596885,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0067528,
+    "latency_ms": 7367.937859031372,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [1, 1]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.00506,
+    "latency_ms": 4777.711889939383,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.007204800000000001,
+    "latency_ms": 8124.169834016357,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.00724,
+    "latency_ms": 8143.856438982766,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.006223200000000001,
+    "latency_ms": 6801.365238032304,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.0078816,
+    "latency_ms": 8859.419776010327,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.006928,
+    "latency_ms": 7245.568608981557,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.006192,
+    "latency_ms": 7918.835818942171,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0059008,
+    "latency_ms": 6689.802628941834,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.006663200000000001,
+    "latency_ms": 7172.215738974046,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0059648,
+    "latency_ms": 6344.292242021766,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.006105600000000001,
+    "latency_ms": 6994.907460000832,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.005804,
+    "latency_ms": 6578.727920015808,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.006555999999999999,
+    "latency_ms": 7542.0630279695615,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0068024,
+    "latency_ms": 12256.505016062874,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.0064984,
+    "latency_ms": 7573.051882034633,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0067568,
+    "latency_ms": 7032.9406569944695,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.005955200000000001,
+    "latency_ms": 6565.044349990785,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.0060967999999999994,
+    "latency_ms": 6469.805482018273,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0068792,
+    "latency_ms": 8244.90773503203,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "any_abstain_propagated: 1/2 permutations abstained",
+    "evidence_quotes": [],
+    "score": "Unknown",
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.005476,
+    "latency_ms": 5268.541873956565,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0069792,
+    "latency_ms": 9833.042712009046,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0057008,
+    "latency_ms": 6508.126194996294,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "groundedness",
+    "reasoning": "perm_mean over 2 seeds: [0, 0]",
+    "evidence_quotes": [],
+    "score": 0,
+    "judge_id": "claude-haiku-4-5-20251001_groundedness_perm2",
+    "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.007047200000000001,
+    "latency_ms": 8842.52748900326,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0035704,
+    "latency_ms": 4780.4513680166565,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930",
+    "cost_usd": 0.0030368,
+    "latency_ms": 4354.728399950545,
+    "item_id": "q010"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0029824,
+    "latency_ms": 4003.0857900273986,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0030736,
+    "latency_ms": 6280.778500949964,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.0024831999999999996,
+    "latency_ms": 4095.9463710314594,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.0041568,
+    "latency_ms": 4912.960623041727,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.004944800000000001,
+    "latency_ms": 5731.701809039805,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e",
+    "cost_usd": 0.0036624,
+    "latency_ms": 4685.736557992641,
+    "item_id": "q027"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0035568,
+    "latency_ms": 4733.0767110106535,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.0048568000000000005,
+    "latency_ms": 6143.437234044541,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.0048032000000000005,
+    "latency_ms": 4649.340238014702,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437",
+    "cost_usd": 0.0038656000000000003,
+    "latency_ms": 4954.194063087925,
+    "item_id": "k8s_004"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0039704,
+    "latency_ms": 6946.335693006404,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.0037664,
+    "latency_ms": 5355.748680012766,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0038352,
+    "latency_ms": 5803.420553042088,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0039824000000000005,
+    "latency_ms": 5535.354931023903,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.0035344,
+    "latency_ms": 4430.360327998642,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.0033616,
+    "latency_ms": 6139.458572957665,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.0034048000000000004,
+    "latency_ms": 4505.8067859499715,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0036424000000000005,
+    "latency_ms": 4488.265429972671,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.003672,
+    "latency_ms": 5342.528612993192,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0034592,
+    "latency_ms": 4672.182891983539,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0039904,
+    "latency_ms": 5559.070541989058,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.0040128,
+    "latency_ms": 6146.004260983318,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0041776,
+    "latency_ms": 5320.263453002553,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0037088000000000004,
+    "latency_ms": 4666.479801991954,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.0042776,
+    "latency_ms": 5364.486761041917,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.0030736000000000006,
+    "latency_ms": 5120.415877958294,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f",
+    "cost_usd": 0.0047408,
+    "latency_ms": 5116.924833972007,
+    "item_id": "k8s_024"
+  },
+  {
+    "dimension": "relevance",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_relevance_perm2",
+    "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0035696000000000005,
+    "latency_ms": 5120.317210035864,
+    "item_id": "k8s_025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
+    "cost_usd": 0.0039464,
+    "latency_ms": 5564.143811003305,
+    "item_id": "q006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
+    "cost_usd": 0.0029944000000000004,
+    "latency_ms": 4710.874331940431,
+    "item_id": "q011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
+    "cost_usd": 0.0037184,
+    "latency_ms": 6025.77873901464,
+    "item_id": "q012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
+    "cost_usd": 0.00294,
+    "latency_ms": 4618.280446971767,
+    "item_id": "q021"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [1, 1]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
+    "cost_usd": 0.0051152,
+    "latency_ms": 7776.658726972528,
+    "item_id": "q023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
+    "cost_usd": 0.0049104,
+    "latency_ms": 5650.782933982555,
+    "item_id": "q025"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [1, 1]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
+    "cost_usd": 0.0041232,
+    "latency_ms": 6412.802912935149,
+    "item_id": "k8s_001"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
+    "cost_usd": 0.0048408,
+    "latency_ms": 6034.373104048427,
+    "item_id": "k8s_002"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
+    "cost_usd": 0.004849600000000001,
+    "latency_ms": 7163.871316006407,
+    "item_id": "k8s_003"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
+    "cost_usd": 0.0037984,
+    "latency_ms": 5350.762542919256,
+    "item_id": "k8s_005"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
+    "cost_usd": 0.004139199999999999,
+    "latency_ms": 6397.059483046178,
+    "item_id": "k8s_006"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [1, 1]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
+    "cost_usd": 0.0066528,
+    "latency_ms": 10695.532124955207,
+    "item_id": "k8s_007"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
+    "cost_usd": 0.0045704,
+    "latency_ms": 5527.514518005773,
+    "item_id": "k8s_009"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
+    "cost_usd": 0.004110399999999999,
+    "latency_ms": 5684.126902022399,
+    "item_id": "k8s_011"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
+    "cost_usd": 0.00324,
+    "latency_ms": 4349.584889016114,
+    "item_id": "k8s_012"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
+    "cost_usd": 0.0050112,
+    "latency_ms": 9009.996334963944,
+    "item_id": "k8s_013"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
+    "cost_usd": 0.0037056000000000003,
+    "latency_ms": 4503.466013993602,
+    "item_id": "k8s_014"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
+    "cost_usd": 0.008997600000000001,
+    "latency_ms": 13719.642482115887,
+    "item_id": "k8s_015"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
+    "cost_usd": 0.0036591999999999996,
+    "latency_ms": 5323.825706029311,
+    "item_id": "k8s_016"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
+    "cost_usd": 0.0039024,
+    "latency_ms": 4708.659948024433,
+    "item_id": "k8s_017"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
+    "cost_usd": 0.0039536,
+    "latency_ms": 5732.109216041863,
+    "item_id": "k8s_018"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+    "cost_usd": 0.0050488,
+    "latency_ms": 7046.853430045303,
+    "item_id": "k8s_019"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+    "cost_usd": 0.0040984,
+    "latency_ms": 5682.077436009422,
+    "item_id": "k8s_020"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+    "cost_usd": 0.004965600000000001,
+    "latency_ms": 10217.923904012423,
+    "item_id": "k8s_022"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [2, 2]",
+    "evidence_quotes": [],
+    "score": 2,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+    "cost_usd": 0.003456,
+    "latency_ms": 4900.77328303596,
+    "item_id": "k8s_023"
+  },
+  {
+    "dimension": "completeness",
+    "reasoning": "perm_mean over 2 seeds: [1, 1]",
+    "evidence_quotes": [],
+    "score": 1,
+    "judge_id": "claude-haiku-4-5-20251001_completeness_perm2",
+    "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+    "prompt_seed": 0,
+    "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+    "cost_usd": 0.0043472,
+    "latency_ms": 6555.742309021298,
+    "item_id": "k8s_025"
+  }
+]

results/calibration_v1_judge_permute_members.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

results/calibration_v1_system_outputs.json ADDED Viewed

The diff for this file is too large to render. See raw diff

scripts/run_calibration.py CHANGED Viewed

@@ -330,7 +330,7 @@ async def cmd_run_judges(row_config_path: Path, concurrency: int) -> None:
                     return None
                 item, output = _build_item_and_output(rec)
                 result = await judge.score(item, output)
-                return {"dimension": dim, **result.model_dump()}
         coros = [
             score_one(rec, dim, judge)
@@ -365,7 +365,7 @@ async def cmd_run_judges(row_config_path: Path, concurrency: int) -> None:
                     continue
                 item, output = _build_item_and_output(rec)
                 result = await permuted.score(item, output)
-                all_results.append({"dimension": dim, **result.model_dump()})
     elif strategy == "jury":
         # Same sequential rationale as rubric_permute: jury writes a
@@ -396,7 +396,7 @@ async def cmd_run_judges(row_config_path: Path, concurrency: int) -> None:
                     continue
                 item, output = _build_item_and_output(rec)
                 result = await j.score(item, output)
-                all_results.append({"dimension": dim, **result.model_dump()})
     else:
         raise SystemExit(f"unknown strategy: {strategy}")

                     return None
                 item, output = _build_item_and_output(rec)
                 result = await judge.score(item, output)
+                return {"item_id": rec["item_id"], "dimension": dim, **result.model_dump()}
         coros = [
             score_one(rec, dim, judge)
                     continue
                 item, output = _build_item_and_output(rec)
                 result = await permuted.score(item, output)
+                all_results.append({"item_id": rec["item_id"], "dimension": dim, **result.model_dump()})
     elif strategy == "jury":
         # Same sequential rationale as rubric_permute: jury writes a
                     continue
                 item, output = _build_item_and_output(rec)
                 result = await j.score(item, output)
+                all_results.append({"item_id": rec["item_id"], "dimension": dim, **result.model_dump()})
     else:
         raise SystemExit(f"unknown strategy: {strategy}")