Nomearod committed on
Commit
4158bba
·
2 Parent(s): 4161c3e 2d9ce3a

Merge remote-tracking branch 'origin/main' into hf-deploy

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .github/workflows/ci.yaml +6 -0
  2. .gitignore +7 -0
  3. DECISIONS.md +701 -0
  4. Makefile +16 -1
  5. README.md +17 -4
  6. agent_bench/core/config.py +3 -0
  7. agent_bench/core/provider.py +22 -6
  8. agent_bench/evaluation/calibration/__init__.py +9 -0
  9. agent_bench/evaluation/calibration/metrics.py +173 -0
  10. agent_bench/evaluation/calibration/report.py +325 -0
  11. agent_bench/evaluation/datasets/calibration_v1.json +158 -0
  12. agent_bench/evaluation/datasets/tech_docs_golden.json +255 -57
  13. agent_bench/evaluation/harness.py +62 -16
  14. agent_bench/evaluation/judges/__init__.py +25 -0
  15. agent_bench/evaluation/judges/base.py +628 -0
  16. agent_bench/evaluation/judges/citation_faithfulness.py +188 -0
  17. agent_bench/evaluation/judges/completeness.py +62 -0
  18. agent_bench/evaluation/judges/groundedness.py +57 -0
  19. agent_bench/evaluation/judges/relevance.py +48 -0
  20. agent_bench/evaluation/metrics.py +9 -85
  21. agent_bench/evaluation/report.py +14 -5
  22. agent_bench/evaluation/rubrics/citation_faithfulness.md +57 -0
  23. agent_bench/evaluation/rubrics/completeness.md +71 -0
  24. agent_bench/evaluation/rubrics/groundedness.md +142 -0
  25. agent_bench/evaluation/rubrics/relevance.md +74 -0
  26. agent_bench/evaluation/variance/__init__.py +9 -0
  27. agent_bench/evaluation/variance/jury.py +181 -0
  28. agent_bench/evaluation/variance/rubric_permute.py +109 -0
  29. agent_bench/serving/static/index.html +235 -0
  30. configs/calibration/rows/baseline.yaml +14 -0
  31. configs/calibration/rows/baseline_no_abstain.yaml +14 -0
  32. configs/calibration/rows/baseline_no_anchors.yaml +13 -0
  33. configs/calibration/rows/baseline_no_cot.yaml +13 -0
  34. configs/calibration/rows/jury_kappa_weighted.yaml +23 -0
  35. configs/calibration/rows/permute.yaml +14 -0
  36. docs/_generated/kappa_table.md +27 -0
  37. docs/judge-design.md +687 -0
  38. docs/plans/2026-05-04-judge-layer-v1-design.md +613 -0
  39. docs/plans/2026-05-04-judge-layer-v1-implementation.md +0 -0
  40. measurements/2026-05-04-judge-calibration-labels.jsonl +90 -0
  41. measurements/2026-05-05-judge-rubric-opus-stress.jsonl +90 -0
  42. measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl +5 -0
  43. measurements/2026-05-06-4a-gpt4o-full-probe.jsonl +5 -0
  44. measurements/2026-05-06-gpt4o-extraction-reasoning-split.md +162 -0
  45. measurements/README.md +1 -0
  46. pyproject.toml +2 -0
  47. results/calibration_v1_judge_baseline.json +0 -0
  48. results/calibration_v1_judge_baseline_no_abstain.json +0 -0
  49. results/calibration_v1_judge_baseline_no_anchors.json +0 -0
  50. results/calibration_v1_judge_baseline_no_cot.json +2115 -0
.github/workflows/ci.yaml CHANGED
@@ -9,6 +9,12 @@ on:
9
  jobs:
10
  test:
11
  runs-on: ubuntu-latest
12
+ # Explicit empty env: prevents accidental dependency on injected
13
+ # secrets. Tests use MockProvider and require no API keys; if a
14
+ # future test imports a provider that needs a key, it will fail
15
+ # in CI and in any contributor fork the same way (no silent
16
+ # divergence based on whether secrets are present).
17
+ env: {}
18
  steps:
19
  - uses: actions/checkout@v4
20
 
.gitignore CHANGED
@@ -24,6 +24,13 @@ venv/
24
  logs/
25
  *.jsonl
26
 
27
+ # Evidence-bearing measurement artifacts referenced from DECISIONS.md.
28
+ # Narrow exception to the *.jsonl ignore above. Add new measurement files
29
+ # explicitly here so the audit-trail intent stays opt-in.
30
+ !measurements/*.jsonl
31
+ # Calibration jury/permute sidecars (per-member detail for κ ablation table).
32
+ !results/*.jsonl
33
+
34
  # Opaque binary artifacts — no PDFs in the repo today, and any that
35
  # appear here are almost always local reference material (downloaded
36
  # papers, vendor docs) that should not be committed. If a PDF ever
DECISIONS.md CHANGED
@@ -2116,3 +2116,704 @@ the actual container filesystem would have caught it pre-deploy.
2116
  Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
2117
  build infrastructure) but is the right long-term mitigation for this
2118
  class of bug.
2119
+
2120
+ ## LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call
2121
+
2122
+ The continuous-score single-call judges in `agent_bench/evaluation/metrics.py`
2123
+ (`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted
2124
+ and replaced by the per-dimension Judge layer at
2125
+ `agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle.
2126
+
2127
+ **Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`.
2128
+
2129
+ **Why this is a supersession, not a refactor.** The new layer differs from
2130
+ the old on six axes: discrete-anchored scale (vs continuous 0–1),
2131
+ reasoning-before-score JSON ordering (vs score-first), per-dimension
2132
+ judges (vs combined faithfulness/correctness), full provenance per call
2133
+ (judge_id + rubric_version + system_output_hash + prompt_seed; old had
2134
+ none), composable variance wrappers (rubric_permute, jury — old was
2135
+ single-call), and an intentional abstain-vs-raise discipline (vs silent
2136
+ `None` from a bare `except Exception`).
2137
+
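As a rough illustration of the provenance and ordering axes listed above, one judge call could be recorded as follows. This is a minimal sketch, not the contents of `judges/base.py`: `JudgeVerdict`, `hash_system_output`, the `abstained` flag, and the example values are hypothetical; only the four provenance field names come from this entry.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


def hash_system_output(text: str) -> str:
    """Stable fingerprint of the answer being judged (SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class JudgeVerdict:
    # Hypothetical record shape. Reasoning is declared before score so the
    # serialized record mirrors the reasoning-before-score ordering.
    reasoning: str
    score: int | None        # None only when the judge abstains on purpose
    abstained: bool
    judge_id: str
    rubric_version: str
    system_output_hash: str
    prompt_seed: int


verdict = JudgeVerdict(
    reasoning="Both gold points are covered by paraphrase.",
    score=2,
    abstained=False,
    judge_id="example-judge",   # placeholder id, not a real model name
    rubric_version="v1.1",
    system_output_hash=hash_system_output("example system answer"),
    prompt_seed=0,
)

# asdict() and json.dumps() both preserve field order.
record = json.dumps(asdict(verdict))
```

Keeping the hash and seed on every record is what would let a later backfill or re-aggregation match predictions to items without re-calling the provider.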
2138
+ **Evidence backing the supersession claim** — the calibration κ table
2139
+ quantifies the new layer's agreement with hand-labels across 6 ablation
2140
+ rows (baseline + 3 variance ablations + permute + 2-judge jury). The
2141
+ files defending this entry's claim, by file path:
2142
+
2143
+ - `measurements/2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3
2144
+ dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the
2145
+ literature ceiling). Lands in Phase 10.
2146
+ - `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`,
2147
+ `_baseline_no_anchors.json`, `_baseline_no_abstain.json`,
2148
+ `_permute.json`, `_jury_kappa_weighted.json` — per-row predictions.
2149
+ Land in Phase 11.
2150
+ - `docs/_generated/kappa_table.md` — generated κ ablation table copy-
2151
+ pasted into the writeup. Lands in Phase 11.
2152
+ - `docs/judge-design.md` — interpretive writeup with the closing
2153
+ "when NOT to use LLM-judge" position. Lands in Phase 12.
2154
+
2155
+ **Config-knob preservation.** `evaluation.judge_provider` is unchanged
2156
+ across all 5 YAML configs; new `evaluation.judge_dimensions` field
2157
+ defaults to the three v1 dimensions. Zero user-facing config migration.
2158
+
2159
+ **Out of scope (v1.1+).** Mistral self-hosted as the third jury member,
2160
+ Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
2161
+ prompt optimization, citation_faithfulness in the default
2162
+ judge_dimensions, AC2 sympy-derived parity tests.
2163
+
2164
+ ## Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran — 2026-05-05
2165
+
2166
+ The Opus stress-test pass over the 30 calibration items × 3 dimensions
2167
+ disagreed with the single-rater human gold on **22 of 30 groundedness
2168
+ items** (8/30 agreement). Relevance and completeness agreed at 28/30 and
2169
+ 25/30 respectively. The groundedness disagreement is consistent in
2170
+ direction — every disagreed-on item is `human=1, opus=0` — and has a
2171
+ single root cause.
2172
+
2173
+ **Root cause: reference-scope drift between rubric author and labeler.**
2174
+ `agent_bench/evaluation/rubrics/groundedness.md` defines the reference
2175
+ scope as the gold snippets attached to each item:
2176
+
2177
+ > The judge sees only the gold snippets — not the retrieved chunks. A
2178
+ > claim that happens to be true in the world but is not entailed by the
2179
+ > snippets fails groundedness.
2180
+
2181
+ The single-rater notes on the disagreed-on items describe checking
2182
+ against the broader documentation, not against `source_snippets`:
2183
+ "supported by the corpus", "supported by the docs", "supported by the
2184
+ provided dependency snippet". For items like `k8s_006` the gold snippet
2185
+ is one sentence ("A ConfigMap is an API object used to store
2186
+ non-confidential data in key-value pairs"), while the agent's answer
2187
+ correctly synthesizes seven or eight additional claims from the full
2188
+ `k8s_configmap.md`. Those claims are true in the world and well-supported
2189
+ by the full doc, but **not entailed by the one snippet**. Opus applied
2190
+ the strict-snippet rubric; the human rater applied a corpus-supported
2191
+ rubric.
2192
+
2193
+ **Why this blocks `make calibrate` against the current gold.** The κ
2194
+ ablation compares Haiku and GPT-4o-mini judges against the human gold.
2195
+ A judge that correctly applies the strict-snippet rubric will disagree
2196
+ with miscalibrated gold; a judge that's too lenient will agree. The
2197
+ ablation rewards leniency and punishes rigor — the opposite of the
2198
+ intended measurement. This is the same failure mode codified earlier in
2199
+ this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune
2200
+ compensation when the measurement is wrong, not the intended effect.
2201
+
2202
+ **Why the rubric stays as written, not relaxed to "corpus-supported".**
2203
+ Strict-snippet groundedness measures *RAG behavior*: did the agent
2204
+ synthesize from what it retrieved? Corpus-supported groundedness
2205
+ measures *LLM general knowledge passing through a RAG harness*: did the
2206
+ agent happen to be correct? The first is what this benchmark is for;
2207
+ the second is what `agent_bench/evaluation/metrics.py` measured before
2208
+ supersession. Relaxing the rubric to "corpus-supported" would silently
2209
+ re-introduce the failure mode the supersession entry above just removed.
2210
+
2211
+ **Decision — three-step correction lands before `make calibrate` runs:**
2212
+
2213
+ 1. **Rubric clarification commit on `agent_bench/evaluation/rubrics/groundedness.md`.**
2214
+ Add an explicit reference-scope line and one anchored example
2215
+ contrasting "supported by the snippet" vs "true in the world but
2216
+ not in the snippet". Audit-trail requirement: the v1.1 writeup will
2217
+ cite "rubric clarified between v1.0 and v1.1", and the git history
2218
+ needs to back that claim.
2219
+ 2. **Re-label the 22 disagreed-on groundedness items** in
2220
+ `measurements/2026-05-04-judge-calibration-labels.jsonl` against the
2221
+ clarified rubric, snippet-only. **Do not mechanically copy Opus's
2222
+ labels.** The labels remain the human single-rater's; what changes is
2223
+ the rubric being applied. Mechanical copy would turn the κ table
2224
+ into "judge vs Opus", which is not what the writeup claims it
2225
+ measures.
2226
+ 3. **Recompute `make calibrate` against the corrected gold** and emit
2227
+ `docs/_generated/kappa_table.md` from the v1.1 labels.
2228
+
2229
+ **Evidence files for the v1.1 writeup section:**
2230
+
2231
+ - `measurements/2026-05-05-judge-rubric-opus-stress.jsonl` — 90 Opus
2232
+ labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min
2233
+ org rate limit, ~$0.20, ~14 min wall, zero infra-abstains).
2234
+ - `measurements/2026-05-04-judge-calibration-labels.jsonl` — original
2235
+ v1.0 single-rater gold; will be diffed against v1.1 corrected gold to
2236
+ quantify the re-label delta.
2237
+ - `agent_bench/evaluation/rubrics/groundedness.md` — pre/post diff is
2238
+ the rubric clarification.
2239
+
2240
+ **Pre-labeling observations also worth recording for the writeup
2241
+ methodology section:**
2242
+
2243
+ - `q021` (fastapi · calculation) answered the CORS preflight question
2244
+ correctly (600 / 60 = 10 minutes) with `sources: []` and
2245
+ `ranked_sources: []` — the agent did the arithmetic without retrieval
2246
+ and emitted an answer consistent with the snippet without having
2247
+ retrieved it. Methodologically interesting for the
2248
+ citation-faithfulness story (Block 2.7) if it ships: an answer can be
2249
+ correct without being grounded-by-citation.
2250
+ - `q025` (fastapi · multi_hop) answer was truncated mid-token by the
2251
+ orchestrator's max_tokens limit. The labels reflect what the system
2252
+ produced, not a mentally-patched complete version. The completeness
2253
+ rubric does not currently anchor "truncated response" as a level —
2254
+ v1.1 rubric work should add an anchor.
2255
+ - Several K8s items embed external knowledge that's correct but not in
2256
+ the snippet phrasing (`k8s_017` mentions exit-code-0 for init-container
2257
+ success; `k8s_009` describes Roles vs ClusterRoles by their semantics).
2258
+ The clarified groundedness rubric should pick **strict** on this case
2259
+ (claim must be supportable by the retrieved spans, not just consistent
2260
+ with them) and the anchored example should show that ruling.
2261
+
2262
+ **Methodology framing for the writeup.** The Opus stress-test was added
2263
+ specifically to catch hand-labeled-gold fragility before the κ table is
2264
+ published. It caught it. The writeup's calibration section should
2265
+ disclose the rubric clarification, quantify the re-label delta on
2266
+ groundedness, and report κ against the v1.1 corrected gold — that is a
2267
+ more credible story than a first-try clean κ table would have been.
2268
+
2269
+ **Outcome — 2026-05-05 calibrate run on v1.1 gold.** All 6 ablation rows
2270
+ ran cleanly after three coupled production-code fixes that landed on the
2271
+ same branch as the rubric clarification: (1) markdown fence stripping in
2272
+ `agent_bench/evaluation/judges/base.py::_strip_markdown_fence` because
2273
+ Haiku 4.5 wraps JSON output in ` ```json ... ``` `, (2) `max_tokens`
2274
+ 512 → 1024 because v1.1 anchored examples elicit longer model reasoning,
2275
+ (3) calibration runner v1.0 omitted `item_id` from prediction records;
2276
+ fixed in v1.1 with backfill of the 6 already-written row files via
2277
+ `hash → item_id` map (no re-spend). Probe-one-cell-before-sweep saved a
2278
+ fourth $0.50 wasted run after the fence-strip change — the methodology
2279
+ note in `feedback_judge_probe_before_sweep.md` was earned by this
2280
+ session's two failed full-row attempts that paid ~$1.15 for unparseable
2281
+ output before the diagnosis converged.
2282
+
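The fence-stripping fix in (1) can be sketched as below. This is an assumption-laden illustration of the idea, not the actual `_strip_markdown_fence` in `judges/base.py`: the function name, regex, and sample payload are made up for the example.

```python
import json
import re

# Matches an optional language tag after the opening fence and captures
# everything up to the closing fence. `{3} avoids writing a literal fence.
_FENCE_RE = re.compile(r"^\s*`{3}[\w-]*\s*\n(.*?)\n?\s*`{3}\s*$", re.DOTALL)


def strip_markdown_fence(text: str) -> str:
    """Return the fenced body if the whole reply is one fenced block,
    otherwise return the reply unchanged apart from outer whitespace."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


fence = "`" * 3
fenced_reply = f'{fence}json\n{{"reasoning": "ok", "score": 2}}\n{fence}'
plain_reply = '{"reasoning": "ok", "score": 2}'

parsed_fenced = json.loads(strip_markdown_fence(fenced_reply))
parsed_plain = json.loads(strip_markdown_fence(plain_reply))
```

Running the same helper on one fenced and one unfenced reply is the kind of single-cell probe that is cheap to do before a full sweep.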
2283
+ The κ table at `docs/_generated/kappa_table.md` (regenerated on
2284
+ 2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for
2285
+ completeness — see report.py `_DIM_METRIC`) shows three findings
2286
+ that the writeup interprets rather than reports verbatim:
2287
+
2288
+ **v1.1 finding 1 — relevance is not "judges fail" territory.**
2289
+ Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the
2290
+ 29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00.
2291
+ AC1 is the load-bearing statistic on relevance and groundedness; both
2292
+ metrics agree on completeness where the gold (23×2 / 5×1) is more balanced.
2293
+
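The prevalence degeneracy in finding 1 can be reproduced with a small worked example. The sketch below assumes a hypothetical judge that scores every item `2` against the 29×2 + 1×1 gold, and assumes a 3-category scale (0, 1, 2); the helper names are illustrative and are not the functions in `calibration/metrics.py`.

```python
from collections import Counter


def raw_agreement(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Cohen's kappa for two raters: chance agreement from each rater's marginals."""
    n = len(gold)
    p_o = raw_agreement(gold, pred)
    g_counts, p_counts = Counter(gold), Counter(pred)
    p_e = sum(g_counts[c] * p_counts[c] for c in set(gold) | set(pred)) / (n * n)
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def gwet_ac1(gold, pred, n_categories):
    """Gwet's AC1 for two raters: chance agreement from the averaged marginals."""
    n = len(gold)
    p_o = raw_agreement(gold, pred)
    g_counts, p_counts = Counter(gold), Counter(pred)
    p_e = 0.0
    for c in set(gold) | set(pred):
        pi = (g_counts[c] + p_counts[c]) / (2 * n)
        p_e += pi * (1.0 - pi)
    p_e /= n_categories - 1
    return (p_o - p_e) / (1.0 - p_e)


gold = [2] * 29 + [1]   # relevance gold distribution from this entry
pred = [2] * 30         # hypothetical judge that always scores 2

kappa = cohen_kappa(gold, pred)
ac1 = gwet_ac1(gold, pred, n_categories=3)
agreement = raw_agreement(gold, pred)
```

With one rater's marginals fully concentrated on `2`, Cohen's chance-agreement term equals the observed agreement, so κ collapses to 0 even though only one item in thirty disagrees; AC1 stays close to the raw agreement.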
2294
+ **v1.1 finding 2 — `no_cot completeness` agreement is real, not
2295
+ selective abstain.** AC1 = κ = 1.000 at n=24. The 2 absent cells
2296
+ (`q021`, `k8s_012`) are infrastructure abstains (provider rate-limit
2297
+ retry exhaustion), both gold=`2`, neither in baseline's disagreement
2298
+ set. On the 24 scored cells, all 4 baseline-with-CoT disagreements
2299
+ (3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to
2300
+ agreement when CoT is removed. The interview-relevant claim is the
2301
+ *opposite* of the conventional CoT-helps story: CoT-before-score on
2302
+ 3-point completeness lets the judge over-emphasize partial coverage
2303
+ and rationalize `1` when the human gold sides with the holistic
2304
+ "covers the points" reading.
2305
+
2306
+ **v1.1 finding 3 — `jury_kappa_weighted` underperformed baseline on
2307
+ completeness, with a precise mechanism.** Per-member analysis from
2308
+ `results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`:
2309
+ Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%;
2310
+ gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw
2311
+ 26.9% — systematically harsh on the 3-point scale, almost never
2312
+ scoring `2`. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9% —
2313
+ raw agreement matches gpt-4o-mini alone exactly because the jury verdict reduces
2314
+ to gpt-4o-mini's verdict on every disputed cell.
2315
+
2316
+ The mechanism is *missing-weight + round-down* compounding, not
2317
+ weighted voting in the usual sense.
2318
+ `scripts/run_calibration.py::_load_weights_from_baseline` is a documented
2319
+ v1 stub that returns weight = 1.0 for every judge_id present in baseline. baseline.json
2320
+ contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini
2321
+ gets 1.0 from `jury.py`'s missing-key fallback (with a logged
2322
+ `jury_missing_weight_fallback_to_one` warning per call). Equal
2323
+ weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean
2324
+ of 1.5; the `_discretize_mean` rule is `frac > 0.5 → ceil else floor`,
2325
+ and `0.5 > 0.5` is false, so 1.5 floors to 1. gpt-4o-mini's verdict
2326
+ wins every disputed cell. The v1 design doc's risks subsection listed
2327
+ "jury κ worse than the better individual judge — (a) kappa-weighting
2328
+ wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired
2329
+ *both* branches simultaneously: branch (a) because the weighting is a
2330
+ stub returning equal weights, and branch (b) because round-down at
2331
+ exact 0.5 ties hands the verdict to the lower-scoring member.
2332
+
2333
+ The deeper structural point is that weighting alone cannot rescue a
2334
+ systematically miscalibrated member. Even held-out validation that
2335
+ correctly assigned gpt-4o-mini's true low weight on completeness
2336
+ would still let it dominate disputed ties unless its weight were
2337
+ driven near zero — and at that point exclusion is more honest than
2338
+ near-zero inclusion. The conservative-on-binary "ties to lower" rule
2339
+ also doesn't transfer cleanly to ordinal scales: on completeness,
2340
+ "conservative" means scoring *toward incomplete*, which is precisely
2341
+ the direction of gpt-4o-mini's bias.
2342
+
2343
+ **v1.2 fix list (four items, expanding the earlier two-item list):**
2344
+
2345
+ 1. **Held-out jury weights.** Replace the
2346
+ `_load_weights_from_baseline` stub with a real κ-derived
2347
+ computation, evaluated on a *held-out validation set* — not the
2348
+ same calibration row whose κ is being measured against the gold.
2349
+ Closes the circular-weighting hole.
2350
+ 2. **Symmetric member coverage in the weights source.** Missing-member
2351
+ fallback to weight = 1.0 amplifies an unweighted member rather than
2352
+ suppressing it. Either every jury member must have a weight in the
2353
+ source file or the run must abort. The
2354
+ `jury_missing_weight_fallback_to_one` warning fired loudly on every
2355
+ call this run; in v1.2 it should be a hard error.
2356
+ 3. **Per-dimension member exclusion when individual κ falls below a
2357
+ threshold.** gpt-4o-mini at κ = 0.020 on completeness should not be
2358
+ in the completeness jury at all. Weights below a floor (suggested
2359
+ κ < 0.2) should be treated as exclusion, not as small-weight
2360
+ inclusion. Held-out validation fixes circular weighting; it does
2361
+ not fix systematic member bias.
2362
+ 4. **Per-dimension tie-break rule.** v1's `_discretize_mean` rule
2363
+ (ties to lower) was selected for conservative behavior on binary
2364
+ scales, where "conservative" means scoring 0 on uncertainty. On
2365
+ 3-point completeness, "conservative" means scoring toward
2366
+ *incomplete*, which interacts badly with member miscalibration.
2367
+ v1.2 should select the tie-break rule per-dimension based on the
2368
+ rubric's conservative direction, not globally.
2369
+
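Items 2 and 3 of the list above could look roughly like this in code. It is a sketch under assumptions: `resolve_jury_weights` and its arguments are hypothetical names, and the 0.2 floor is the value suggested in item 3, not a setting that exists in the repo.

```python
def resolve_jury_weights(members, kappa_by_member, floor=0.2):
    """Return weights for the members that stay in the jury on one dimension.

    - A member with no entry in the weights source aborts the run (item 2).
    - A member whose kappa is below `floor` is excluded outright rather
      than included at a small weight (item 3).
    """
    missing = [m for m in members if m not in kappa_by_member]
    if missing:
        raise ValueError(f"no weight for jury member(s): {missing}")

    kept = {m: kappa_by_member[m] for m in members if kappa_by_member[m] >= floor}
    if not kept:
        raise ValueError("every jury member fell below the exclusion floor")
    return kept


# Per-member completeness kappa values reported in this entry.
completeness_kappa = {"haiku": 0.416, "gpt-4o-mini": 0.020}
jury = resolve_jury_weights(["haiku", "gpt-4o-mini"], completeness_kappa)

try:
    resolve_jury_weights(["haiku", "gpt-4o-mini"], {"haiku": 0.416})
    missing_member_rejected = False
except ValueError:
    missing_member_rejected = True
```

Raising instead of defaulting keeps a member from silently entering the vote at full weight, which is the failure this run hit.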
2370
+ **Evidence files:** `docs/_generated/kappa_table.md` (regenerated with
2371
+ AC1 for groundedness/relevance, κ for completeness);
2372
+ `results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`
2373
+ (per-member sidecar where the gpt-4o-mini completeness bias is
2374
+ visible per item); `results/calibration_v1_judge_baseline.json`
2375
+ (weights source — note the absence of any gpt-4o-mini-2024-07-18
2376
+ entries, which is why the missing-weight fallback fires).
2377
+
2378
+ ## v1.1 jury rescue — sharpened diagnostic + pre-committed A+B success criteria
2379
+
2380
+ **Date:** 2026-05-06. **Status:** in-flight; this entry is the pre-experiment
2381
+ contract that pins down what counts as success before the re-aggregation
2382
+ runs, so the outcome can't be negotiated post-hoc.
2383
+
2384
+ **Sharpened diagnostic — extraction-vs-reasoning split, not just "model is
2385
+ biased".** Re-reading the per-member sidecar (item-level, not aggregate)
2386
+ on the gpt-4o-mini completeness disputes shows a more specific failure
2387
+ mode than "harsh on 3-point". On the three representative gold=2 / Haiku=2
2388
+ / gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's `evidence_quotes`
2389
+ field correctly extracts the paraphrased coverage from the agent answer
2390
+ — and then its `reasoning` field denies that those very quotes constitute
2391
+ coverage. k8s_002 is the cleanest instance: the model quotes the strings
2392
+ "declarative updates" and "sticky identity" into evidence, then writes
2393
+ "the answer does not explicitly mention 'declarative updates' and 'sticky
2394
+ identity'". The score follows the reasoning, not the evidence. The
2395
+ mechanism is that the model's *post-extraction reasoning step* applies a
2396
+ literal-string-match standard to the answer text while the rubric
2397
+ requires "paraphrase allowed" — i.e., the structured-output discipline
2398
+ forced an extraction step that the reasoning step then contradicted on
2399
+ autopilot. This is a known failure mode in chain-of-thought judges and
2400
+ shows up more in smaller models because the reasoning step has less
2401
+ capacity to integrate the rubric's instruction with the literal-text
2402
+ comparison the model is running by default. The artifact for the writeup
2403
+ is `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` (three
2404
+ side-by-side reasoning + evidence_quotes excerpts).
2405
+
2406
+ **Pragmatic v1.1 weights-source decision.** The v1.2 fix-list above
2407
+ specifies a held-out validation set for jury weights — methodologically
2408
+ clean but requires either splitting N=30 (loses statistical power on
2409
+ both halves) or labeling more items (eats interview prep time). v1.1
2410
+ chooses pragmatic: weights computed from the same calibration set used
2411
+ for κ reporting, with the circularity flagged in the writeup. Reasons:
2412
+ (a) the alternative is splitting N=30, (b) the per-member κ values used
2413
+ as weights are internally consistent, (c) v1.2 will use a held-out 20-
2414
+ item set. The writeup will contain a sentence acknowledging the
2415
+ circularity rather than hiding it.
2416
+
2417
+ **v1.1 elevated fix-list (subset of the v1.2 list above).** Item 2
2418
+ (symmetric coverage / hard-error) is elevated unconditionally. Item 1
2419
+ (real κ-derived weights) is elevated in pragmatic form (same set with
2420
+ circularity caveat). Items 3 (per-dimension exclusion) and 4 (per-
2421
+ dimension tie-break) remain v1.2 unless B's outcome forces them up.
2422
+
2423
+ **Pre-committed B success criteria.** Plan B is "re-aggregate the existing
2424
+ 164 member-rows in `calibration_v1_judge_jury_kappa_weighted_members.jsonl`
2425
+ with corrected κ-derived weights, no new API spend." The outcome maps
2426
+ deterministically to one of three predefined responses, picked *before*
2427
+ B runs:
2428
+
2429
+ - **Outcome 1 — jury κ on completeness exceeds Haiku-baseline κ by ≥
2430
+ 0.05** (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story:
2431
+ "v1's weights-source bug masked correct aggregation; once both bugs
2432
+ (asymmetric coverage + missing-weight fallback) are fixed, the jury
2433
+ improves on baseline. Per-dimension exclusion remains a v1.2 design
2434
+ pattern but is not needed at v1.1." This is the strong story.
2435
+ - **Outcome 2 — jury κ within ±0.05 of Haiku-baseline** (i.e., 0.366 ≤
2436
+ jury κ < 0.466). Writeup story: "weights-source fix recovers parity
2437
+ but the jury isn't doing meaningful work on completeness — gpt-4o-
2438
+ mini's near-zero weight makes it effectively excluded by aggregation.
2439
+ This is *soft exclusion via weighting*; v1.2 will make exclusion
2440
+ explicit." Defensible but less clean.
2441
+ - **Outcome 3 — jury κ falls below Haiku-baseline κ by >0.05** (i.e.,
2442
+ jury κ < 0.366). Writeup story: "weights-source fix is necessary but
2443
+ not sufficient; even at near-zero weight gpt-4o-mini's verdict tips
2444
+ disputed (1, 2) ties due to the round-down rule. v1.1 escalates to
2445
+ per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1.
2446
+
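One way to make the mapping above mechanical is to encode it before B runs. The sketch below is illustrative: `classify_b_outcome` is a hypothetical helper, the 0.416 baseline and 0.05 margin are the values stated in the criteria, and a gain of exactly the margin is assigned to Outcome 1, following the "≥ 0.05" wording.

```python
HAIKU_BASELINE_KAPPA = 0.416   # Haiku-alone completeness kappa from this entry
MARGIN = 0.05                  # pre-committed margin


def classify_b_outcome(jury_kappa, baseline=HAIKU_BASELINE_KAPPA, margin=MARGIN):
    """Map a jury kappa on completeness to pre-committed outcome 1, 2, or 3."""
    delta = jury_kappa - baseline
    if delta >= margin:
        return 1   # jury improves on baseline
    if delta < -margin:
        return 3   # escalate to per-dimension exclusion
    return 2       # parity: soft exclusion via weighting


outcome_parity = classify_b_outcome(0.416)
outcome_better = classify_b_outcome(0.50)
outcome_worse = classify_b_outcome(0.30)
```

Because every kappa value falls into exactly one branch, the response is fixed by the number alone and cannot be renegotiated after the run.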
2447
+ **Why the predefined-criteria framing matters.** "I ran B, looked at the
2448
+ number, decided it was good enough" is the same data with a weaker frame
2449
+ than "I predefined the success criteria before running the experiment, B
2450
+ landed at outcome X, which mapped to predefined response Y". The latter
2451
+ demonstrates evaluation maturity in the writeup; the former invites
2452
+ post-hoc reading of the outcome.
2453
+
2454
+ **B outcome — 2026-05-06.** Plan B re-aggregated the existing 164 sidecar
2455
+ rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on
2456
+ completeness; clipped at 0 from raw κ values). Result: **jury κ on
2457
+ completeness = 0.416**, exactly matching Haiku-baseline. Δ = 0.000;
2458
+ maps to **Outcome 2 (soft exclusion via weighting)**. Per the
2459
+ pre-committed response, v1.1 stops here and writes up; per-dimension
2460
+ member exclusion (item C / v1.2 fix #3) is not escalated to v1.1.
2461
+
2462
+ Mechanism, validated empirically — a disputed cell (Haiku=2, gpt=1)
2463
+ with corrected weights aggregates as `(2 × 0.416 + 1 × 0.020) / 0.436 =
2464
+ 1.954`. Because frac = 0.954 > 0.5, the round-up rule ceils to 2, giving the
2465
+ correct verdict. v1's two compounding bugs (asymmetric source returning
2466
+ weight=1.0 for Haiku and the missing-key fallback returning 1.0 for gpt-
2467
+ 4o-mini) jointly forced equal weights, and equal-weights with the same
2468
+ round-up rule produced `(2 × 1 + 1 × 1) / 2 = 1.5`, which has frac
2469
+ exactly 0.5 (not > 0.5), and floored to 1 — gpt's verdict winning every
2470
+ disputed cell. The bug fixes recover the right verdict purely
2471
+ mechanically; no judge model behavior changes.
2472
+
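The arithmetic above can be checked with a short reimplementation of the stated rule. This is a sketch written from the description (`frac > 0.5 → ceil else floor`), not a copy of `jury.py`; the function names and the member keys are placeholders.

```python
import math


def discretize_mean(mean: float) -> int:
    """Ceil only when the fractional part is strictly above 0.5; otherwise floor."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def weighted_verdict(scores: dict, weights: dict) -> int:
    """Weighted mean of member scores, discretized with the rule above."""
    total = sum(weights[m] for m in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean = sum(scores[m] * weights[m] for m in scores) / total
    return discretize_mean(mean)


disputed = {"haiku": 2, "gpt-4o-mini": 1}

# v1: both members effectively at weight 1.0, mean is exactly 1.5.
v1_verdict = weighted_verdict(disputed, {"haiku": 1.0, "gpt-4o-mini": 1.0})

# v1.1: kappa-derived weights, mean is about 1.954.
v1_1_verdict = weighted_verdict(disputed, {"haiku": 0.416, "gpt-4o-mini": 0.020})
```

The only input that changes between the two calls is the weight dictionary, which is why the fix needs no new judge calls.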
2473
+ The empirical reading: the weighting is *not doing meaningful work* —
2474
+ gpt-4o-mini's near-zero weight effectively excludes it on completeness,
2475
+ and the jury's κ matches Haiku-alone exactly because Haiku's verdict
2476
+ wins every disputed cell. This is "soft exclusion via weighting"; v1.2's
2477
+ explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the
2478
+ exclusion visible in the jury config rather than emergent from κ-derived
2479
+ weight collapse.
2480
+
2481
+ **v1.1 code changes (this commit):**
2482
+ - `agent_bench/evaluation/variance/jury.py` — silent missing-weight
2483
+ fallback to 1.0 → hard `ValueError`. Two existing tests that asserted
2484
+ the old contract (`test_kappa_weighted_reasoning_reports_applied_weights_not_dict`,
2485
+ `test_kappa_weighted_logs_warning_on_missing_weight`) updated to
2486
+ assert the new contract.
2487
+ - `scripts/run_calibration.py::_load_weights_from_baseline` →
2488
+ `_compute_kappa_weights` — replaces the v1 stub with real per-judge
2489
+ Cohen's κ on the dimension; hard-errors when any expected member is
2490
+ missing from the source. Clips κ < 0 to weight = 0 (soft exclusion).
2491
+ - `configs/calibration/rows/jury_kappa_weighted.yaml` — `weights_source`
2492
+ re-pointed from `calibration_v1_judge_baseline.json` (Haiku-only,
2493
+ asymmetric coverage) to
2494
+ `calibration_v1_judge_jury_kappa_weighted_members.jsonl` (both judges,
2495
+ same calibration set with documented circularity).
2496
+ - `tests/scripts/test_run_calibration_dispatch.py` — two new tests cover
2497
+ `_compute_kappa_weights`: (a) computes real κ (high-agreement judge →
2498
+ weight=1.0, chance-agreement judge → 0); (b) hard-errors on
2499
+ asymmetric source coverage.
2500
+ - `results/calibration_v1_judge_jury_kappa_weighted_v1_1.json` — new
2501
+ predictions row produced by re-aggregating the existing sidecar
2502
+ offline (no API spend; via `scripts/_dev/reaggregate_jury_v1_1.py`).
2503
+ `docs/_generated/kappa_table.md` regenerated with this row alongside
2504
+ the broken v1 row, giving the writeup a clean before/after diff
2505
+ (completeness: 0.014 → 0.416, n=26).
2506
+ - `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` — the
2507
+ three side-by-side reasoning + evidence_quotes excerpts (q006 /
2508
+ k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split
2509
+ diagnostic finding.
2510
+
2511
+ The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1
2512
+ (pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2.
2513
+
2514
+ ## Plan 3A — recency-positioned paraphrase instruction (pre-committed criteria)
2515
+
2516
+ **Date:** 2026-05-06. **Status:** in-flight; this entry pins down the
2517
+ hypothesis and success criteria before the experiment runs.
2518
+
2519
+ **Hypothesis sharpened by the 1A direction-of-bias finding.** GPT-4o-
2520
+ mini's completeness disagreements are 17/19 gold=2/pred=1 with zero
2521
+ up-mistakes across 26 items spanning two corpora — direction-aware noise,
2522
+ not balanced random labeling. The model is consistently applying *some*
2523
+ rule stricter than the rubric requires. The hypothesis under test: that
2524
+ stricter rule is "literal-string match required, paraphrase doesn't
2525
+ count," and the bias is fixable by recency-positioning the rubric's
2526
+ "paraphrase allowed" instruction adjacent to the commit-to-score
2527
+ decision instead of leaving it 500+ tokens upstream in the rubric body.
2528
+
2529
+ **The intervention is positional, not lexical.** The current
2530
+ `CompletenessJudge` prompt (`agent_bench/evaluation/judges/completeness.py`)
2531
+ sends the rubric body, then the gold reference, then the system answer,
2532
+ then a one-line "Score this answer..." instruction immediately followed
2533
+ by the JSON schema clause. The rubric body's "paraphrase allowed" clause
2534
+ appears in the introductory paragraphs, hundreds of tokens before the
2535
+ score decision. The intervention adds one sentence between the system
2536
+ answer and the score instruction:
2537
+
2538
+ > *"Note: a paraphrase that captures the same meaning as a gold-answer
2539
+ > point counts as covered. Score on content equivalence, not surface
2540
+ > form."*
2541
+
2542
+ This is the recency-positioning hypothesis: the model loses the
2543
+ paraphrase conditioning across the rubric anchors and the reasoning
2544
+ step. Restating the instruction adjacent to the score decision tests
2545
+ whether the bias is positionally correctable.
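The positional change described above can be sketched as follows. This is a minimal illustration, not the project's code: `build_completeness_prompt` and the section labels are hypothetical, and the real assembly in `agent_bench/evaluation/judges/completeness.py` may differ. The score instruction is left elided exactly as it is quoted in this entry.

```python
# Hypothetical sketch of the recency-positioned prompt layout.
PARAPHRASE_RECENCY_CLAUSE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)
SCORE_INSTRUCTION = "Score this answer..."  # elided in the source; placeholder only


def build_completeness_prompt(rubric: str, reference: str, answer: str) -> str:
    """Assemble sections in order: rubric, gold reference, system answer,
    recency clause, score instruction."""
    sections = [
        rubric,
        f"Reference answer:\n{reference}",
        f"System answer:\n{answer}",
        PARAPHRASE_RECENCY_CLAUSE,  # restated right before the score decision
        SCORE_INSTRUCTION,
    ]
    return "\n\n".join(sections)


prompt = build_completeness_prompt("RUBRIC BODY", "GOLD", "ANSWER")
```

Because the hypothesis is about position rather than wording, a check on this layout would assert ordering (answer, then clause, then score instruction) and not only that the clause is present.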
2546
+
2547
+ **Selected 5 disputed items** (representative of the gold=2 / Haiku=2 /
2548
+ gpt=1 pattern across both corpora): `q006`, `q011`, `k8s_002`, `k8s_006`,
2549
+ `k8s_018`. All five are pure paraphrase-coverage cases (the system
2550
+ answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored
2551
+ 1 with the extraction-vs-reasoning split documented in
2552
+ `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`).
2553
+
2554
+ **Pre-committed 3A success criteria.**
2555
+
2556
+ - **Fixed (≥3/5 shift from 1 → 2):** Recency-positioning is sufficient.
2557
+ Re-run GPT-4o-mini on the full 26 disputed items with the corrected
2558
+ prompt, recompute κ, update the writeup table. Story: "rubric-
2559
+ engineering matters more than judge model choice for ordinal scales —
2560
+ recency-positioning the paraphrase instruction recovered N% of
2561
+ disputed items." The completeness story becomes actionable, not
2562
+ diagnostic-only.
2563
+ - **Partially fixed (1–2/5 shift):** Inconclusive at N=5 (binomial-
2564
+ significance line is ~3+). Re-run on the full 26 disputed items
2565
+ (~$0.20) to get a clean number; write up whatever the full-26 says.
2566
+ - **Not fixed (0/5 shift):** The instruction is being received and
2567
+ ignored — the model can't act on it under reasoning load. Escalate
2568
+ to 4A (GPT-4o full on the same 5 items) to verify the small-model-
2569
+ specific claim. Story: "repositioning the paraphrase instruction
2570
+ adjacent to the score decision did not shift any of 5 disputed items;
2571
+ GPT-4o handled the same prompts. The bias is small-model-specific,
2572
+ not prompt-fixable."
2573
+
2574
+ The 3/5 threshold is the binomial-significance line at this N — random
2575
+ shifting under the null produces 0 or 1 changes most of the time. Pre-
2576
+ committing avoids the "2 shifted, that's kind of a fix" negotiation.
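The threshold argument can be checked with a binomial tail computation. This entry does not state the per-item flip probability under the null, so `p_null = 0.1` below is purely an illustrative assumption; a different assumed rate moves the line.

```python
from math import comb


def binom_tail(n: int, k_min: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


p_null = 0.1  # assumed chance of a spurious 1 -> 2 shift per item (hypothetical)
for k in range(6):
    print(f"P(X >= {k} | n=5, p={p_null}) = {binom_tail(5, k, p_null):.4f}")
```

Under this assumed rate, two or more shifts out of five is not below the conventional 0.05 level, while three or more is, which is consistent with placing the line at 3/5.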
2577
+
2578
+ **On the 1A relevance finding — confirmed.** Both judges essentially
2579
+ correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ
2580
+ degeneracy is structural under 29/30 prevalence at class-2; AC1 +
2581
+ raw agreement is the right reporting. No further investigation on
2582
+ relevance. Writeup paragraph is one short sentence: prevalence-induced
2583
+ degeneracy → AC1 is load-bearing.
2584
+
2585
+ ## Plan 3A — outcome on the 5-item probe + full-26 re-run (v1.1.1)
2586
+
2587
+ **Date:** 2026-05-06. **Status:** complete; the v1.1.1 prompt is now
2588
+ permanent in `agent_bench/evaluation/judges/completeness.py`.
2589
+
2590
+ **3A 5-item probe:** 3/5 disputed items shifted 1 → 2 (q006, q011,
2591
+ k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. This meets the
2591
+ pre-committed threshold (≥3/5 → "fixed"), so the protocol triggered the
2593
+ full-26 re-run on gpt-4o-mini only (Haiku held as control to make the
2594
+ v1.1 → v1.1.1 delta cleanly attributable to the intervention's effect on
2595
+ the affected judge).
2596
+
2597
+ **Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):**
2598
+
2599
+ | | n | raw | κ | AC1 |
2600
+ |------------------------------|----|--------|--------|--------|
2601
+ | v1.1 gpt-4o-mini | 26 | 26.9% | +0.020 | +0.006 |
2602
+ | **v1.1.1 gpt-4o-mini** | 28 | **42.9%** | **+0.000** | **+0.232** |
2603
+ | v1.1 Haiku (control) | 26 | 84.6% | +0.416 | +0.792 |
2604
+
2605
+ **Per-item delta (v1.1 → v1.1.1):** 7 items shifted up (all 1 → 2),
2606
+ 0 shifted down, 19 unchanged. Of the 7 up-shifts: 6 are correct (gold=2
2607
+ items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016,
2608
+ k8s_017, q006), 1 is a regression (k8s_025: gold=1, was correctly pred=1
2609
+ in v1.1, now over-credited at pred=2). Net per-item correctness delta:
2610
+ +5 items.
2611
+
2612
+ **Cohen's κ is misleading on this comparison.** v1.1.1 raw agreement
2613
+ rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from
2614
+ 0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0 — slightly
2615
+ *lower* than v1.1's 0.020. The mechanism is prevalence-rebalancing in
2616
+ the marginals: gpt-4o-mini's pred distribution shifted from `{0:2, 1:19,
2617
+ 2:5}` (concentrated at 1) to `{0:4, 1:12, 2:12}` (more balanced, closer
2618
+ to gold's `{1:5, 2:23}` over n=28). Cohen's κ = `(P_o - P_e)/(1 - P_e)`;
2619
+ when marginals become more diverse, P_e (chance agreement) rises in
2620
+ lockstep with P_o (observed agreement), and κ deflates. AC1 uses
2621
+ prevalence-robust chance correction (`P_e = (1/(q-1)) Σ pi_k(1-pi_k)`)
2622
+ and reads the actual signal.
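The two chance-correction terms can be recomputed from the marginals quoted above. This is a sketch under one stated assumption: observed agreement is taken as 12 of 28 items, which rounds to the reported 42.9%. The per-item confusion matrix is not needed, because both P_e formulas depend only on the marginals.

```python
# Marginal counts as reported for v1.1.1 (n=28).
gold = {0: 0, 1: 5, 2: 23}
pred = {0: 4, 1: 12, 2: 12}
n = 28
p_o = 12 / n  # assumption: 12/28 agreements, i.e. the reported 42.9%

# Cohen: chance agreement is the product of the two raters' marginals per class.
p_e_kappa = sum((gold[k] / n) * (pred[k] / n) for k in gold)
kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)

# Gwet AC1: chance agreement from the mean marginal pi_k across both raters.
q = len(gold)
pi = {k: (gold[k] + pred[k]) / (2 * n) for k in gold}
p_e_ac1 = sum(pi[k] * (1 - pi[k]) for k in gold) / (q - 1)
ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)

print(f"P_e(kappa)={p_e_kappa:.4f}  P_e(AC1)={p_e_ac1:.4f}")
print(f"|kappa|={abs(kappa):.3f}  AC1={ac1:.3f}")
```

With these inputs Cohen's P_e equals P_o (both 12/28), so κ is zero up to floating-point error, while AC1's P_e is about 0.256 and AC1 comes out near the reported 0.232.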
2623
+
2624
+ This is the same trap that motivated AC1 over κ on the relevance and
2625
+ groundedness rows of the original κ table, surfacing here at a
2626
+ different distribution boundary. The κ table footer already explains
2627
+ why per-dimension metric selection matters; v1.1.1's outcome
2628
+ demonstrates the trap *induced by the intervention itself*.
2629
+
2630
+ **Effect on the jury aggregate.** With κ-derived weights and gpt-4o-
2631
+ mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury
2632
+ verdict on completeness is now mathematically equivalent to Haiku-alone
2633
+ on every item (gpt's contribution is multiplied by zero). Jury κ stays
2634
+ at 0.416, identical to v1.1's corrected aggregate. The intervention's
2635
+ per-member improvement is *invisible at the jury level* under this
2636
+ weighting scheme.
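A minimal sketch of why a zero weight collapses the jury to a single judge. The aggregation rule is not spelled out in this entry, so a weighted mean rounded to the nearest label is assumed here; `clip_weight`, `jury_verdict`, and the member keys are hypothetical names, not the project's API.

```python
def clip_weight(kappa: float) -> float:
    """Soft exclusion: a kappa at or below zero contributes weight 0."""
    return max(0.0, kappa)


def jury_verdict(scores: dict[str, int], weights: dict[str, float]) -> int:
    """Weighted mean of member scores, rounded to the nearest label (assumed rule)."""
    total = sum(weights[m] for m in scores)
    if total == 0:
        raise ValueError("all member weights are zero")
    return round(sum(weights[m] * scores[m] for m in scores) / total)


# Kappa values as reported for completeness: Haiku 0.416, gpt-4o-mini 0.000.
weights = {"haiku": clip_weight(0.416), "gpt-4o-mini": clip_weight(0.000)}

# Two illustrative items where the members disagree in opposite directions.
items = [{"haiku": 2, "gpt-4o-mini": 1}, {"haiku": 1, "gpt-4o-mini": 2}]
verdicts = [jury_verdict(s, weights) for s in items]
```

In both illustrative items the verdict equals the Haiku score, whatever gpt-4o-mini says, because its term is multiplied by zero before normalisation.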
2637
+
2638
+ **Methodological consequence — v1.2 fix-list addition.** The v1.2 fix-
2639
+ list now expands by one item:
2640
+
2641
+ 5. **Prevalence-robust weights for prevalence-skewed dimensions.**
2642
+ v1.1's `_compute_kappa_weights` uses Cohen's κ for every dimension,
2643
+ which has a *self-defeating property* on prevalence-skewed gold:
2644
+ improving a member can lower its weight even as it gets more
2645
+ accurate.
2646
+
2647
+ **Mechanism.** Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where
2648
+ `P_e = Σ_k P(gold=k) × P(pred=k)` is the chance-agreement term
2649
+ computed from the marginal distributions. P_e is *not* invariant to
2650
+ the predictor's marginal distribution — when a member's predictions
2651
+ become more diverse (less concentrated at one class), P_e *rises*
2652
+ as the marginals approach gold's marginals. Concretely: when an
2653
+ intervention moves a member's pred distribution from concentrated-
2654
+ at-one-class toward gold's distribution, P_o and P_e rise together
2655
+ in lockstep. The numerator `P_o - P_e` stays small, and κ deflates
2656
+ even as raw accuracy improves. This is the same prevalence-induced
2657
+ degeneracy that motivated AC1 over κ on relevance/groundedness rows
2658
+ in the κ table — it surfaces in jury weighting at any
2659
+ distribution-shifting intervention's boundary.
2660
+
2661
+ **Empirically observed in v1.1.1.** The recency-positioning
2662
+ intervention shifted gpt-4o-mini completeness pred dist from
2663
+ `{0:2, 1:19, 2:5}` to `{0:4, 1:12, 2:12}`, closer to gold's
2664
+ `{1:5, 2:23}` over n=28. Per-cell raw agreement 26.9% → 42.9%.
2665
+ AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×).
2666
+ Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o
2667
+ ≈ 0.43. v1.1's `_compute_kappa_weights` clips the new κ at zero,
2668
+ producing weight = 0 — and the jury aggregate loses access to a
2669
+ member that was empirically improved. The intervention's per-
2670
+ member improvement is invisible at the jury level under κ-weighting.
2671
+
2672
+ **Architectural decomposition for v1.2.** The right separation:
2673
+ - **Per-dimension metric for κ table reporting** (already in v1.1
2674
+ via `agent_bench/evaluation/calibration/report.py::_DIM_METRIC`).
2675
+ - **Per-dimension weight metric for jury aggregation** (new in
2676
+ v1.2, reuses `_DIM_METRIC`). Use κ where the gold's prevalence
2677
+ supports it, AC1 where κ degenerates. Same lookup, same per-
2678
+ dimension policy at both reporting and weighting layers.
2679
+ - **Per-dimension membership as explicit configuration override**
2680
+ for members that are structurally inappropriate (v1.2 fix #3,
2681
+ unchanged) — distinct from "low score on the chosen metric,"
2682
+ which is handled by the weight floor.
2683
+
2684
+ **Why this is non-obvious.** A reader's first instinct is that
2685
+ "weight by κ" is a sensible default — κ is *the* standard inter-
2686
+ rater statistic. The self-defeating property is invisible until
2687
+ you observe a real intervention that shifts marginals; in static
2688
+ conditions (no intervention, fixed prompts), the κ-weight choice
2689
+ is benign. The v1.1.1 outcome is the first time the agent-bench
2690
+ calibration set has produced an intervention-induced marginal
2691
+ shift on the same gold; the failure mode wouldn't have been
2692
+ visible in v1.0's static calibration sweep.
2693
+
2694
+ **v1.1.1 code changes (this commit):**
2695
+ - `agent_bench/evaluation/judges/completeness.py` — adds
2696
+ `PARAPHRASE_RECENCY_CLAUSE` constant, inserted between the system
2697
+ answer and the score instruction. Comment cites the 3A probe.
2698
+ - `tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt`
2699
+ — extends to assert the recency clause appears AND is positioned
2700
+ between the answer and the score instruction (position is load-
2701
+ bearing, not just lexical inclusion).
2702
+ - `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl`
2703
+ — merged sidecar: v1.1 groundedness/relevance rows (unchanged
2704
+ judges) + fresh v1.1.1 gpt-4o-mini completeness rows + v1.1 Haiku
2705
+ completeness rows.
2706
+ - `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl` — the
2707
+ 5-item probe artifact with reasoning + evidence_quotes for each.
2708
+ - `scripts/_dev/probe_3a_paraphrase_recency.py`,
2709
+ `scripts/_dev/rerun_completeness_v1_1_1.py` — reproducers; not
2710
+ part of the production calibration runner.
2711
+
2712
+ **No changes to the κ table.** The jury aggregate κ on completeness is
2713
+ unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy
2714
+ described above; adding a `jury_kappa_weighted_v1_1_1` row with
2715
+ identical numbers would be visual noise. The v1.1.1 finding lives in
2716
+ the writeup body, not the table — the per-member AC1 improvement
2717
+ (0.006 → 0.232) is the headline number, surfaced as a separate
2718
+ paragraph next to the κ table rather than inside it.
2719
+
2720
+ **Total spend through Plan 3A:** $0.0013 (3A probe) + $0.0075 (full-26
2721
+ re-run) = $0.0088.
2722
+
2723
+ ## Plan 4A — GPT-4o (full) on the v1.1.1 residual
2724
+
2725
+ **Date:** 2026-05-06. **Status:** complete. Run after the writeup-
2726
+ framing review surfaced that v1.1.1's "fixed" verdict was overclaim-
2727
+ prone — 5/19 items were recovered, 14 remained unchanged and
2728
+ uncharacterized. 4A was originally scoped as conditional on 3A *not*
2729
+ being fixed (per the predefined sequencing rule), but became valuable
2730
+ as a *post-3A* diagnostic to characterize the residual: is it small-
2731
+ model-specific or rubric-under-specified?
2732
+
2733
+ **Scope.** GPT-4o (`gpt-4o-2024-08-06`) on 5 of the 14 v1.1.1-unchanged
2734
+ items: `k8s_006`, `k8s_018`, `q011`, `q012`, `k8s_001`. Same v1.1.1
2735
+ production prompt (paraphrase recency clause active). The first two
2736
+ (k8s_006, k8s_018) are the items that didn't shift in the original 3A
2737
+ 5-item probe — we have gpt-4o-mini's reasoning on those items *with*
2738
+ the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt
2739
+ varying only the model. q011, q012, k8s_001 cover the broader
2740
+ fastapi/k8s residual surface (k8s_001 also a Haiku miscall — 4A
2741
+ checks whether GPT-4o agrees with gold or with Haiku).
2742
+
2743
+ **Result: 5/5 correct.** All 5 items scored 2 by GPT-4o, matching gold
2744
+ exactly. Cost: $0.0011 reported (caveat: pricing config falls back to
2745
+ gpt-4o-mini rates for unlisted models, so actual cost is closer to
2746
+ $0.005–0.01 — the reported number under-reports by ~5–10×).
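The under-reporting follows from the pricing fallback pattern visible in `OpenAIProvider.__init__` in this commit (`... if model_pricing else 0.15` / `0.60`). The sketch below mirrors that pattern in isolation; `PRICING`, `reported_cost`, and the token counts are hypothetical, and no actual gpt-4o rate is assumed.

```python
from dataclasses import dataclass


@dataclass
class ModelPricing:
    input_cost_per_mtok: float
    output_cost_per_mtok: float


# Only the default model is listed, matching the situation described above.
PRICING = {"gpt-4o-mini-2024-07-18": ModelPricing(0.15, 0.60)}


def reported_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD from listed rates, else the gpt-4o-mini fallback rates."""
    pricing = PRICING.get(model)
    input_rate = pricing.input_cost_per_mtok if pricing else 0.15
    output_rate = pricing.output_cost_per_mtok if pricing else 0.60
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical token counts: the unlisted model is silently costed at fallback rates.
mini = reported_cost("gpt-4o-mini-2024-07-18", 10_000, 1_000)
full = reported_cost("gpt-4o-2024-08-06", 10_000, 1_000)
```

Both calls return the same figure, so any run on an unlisted, more expensive model reports gpt-4o-mini-level spend until that model is added to the pricing config.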
2747
+
2748
+ **Sharpened mechanism — criteria-invention, not just literal-match.**
2749
+ The original 3-example artifact (q006, k8s_002, k8s_018) was framed
2750
+ as gpt-4o-mini "applying a literal-string-match standard" while
2751
+ correctly extracting paraphrased coverage into evidence_quotes. 4A's
2752
+ side-by-side reasoning on `k8s_018` shows a distinct second mechanism:
2753
+
2754
+ - **gpt-4o-mini (v1.1.1, score 1):** "It mentions some key points
2755
+ from the reference... but does not explicitly state that the new
2756
+ fields in `autoscaling/v2` are preserved as annotations when using
2757
+ `autoscaling/v1`, nor does it mention the need to use
2758
+ `autoscaling/v2` directly for memory or custom metric scaling for
2759
+ a Deployment or StatefulSet."
2760
+ - **gpt-4o (4A, score 2):** "The answer covers all the key points
2761
+ from the reference. It mentions that the current stable version is
2762
+ autoscaling/v2, which supports scaling on memory and custom
2763
+ metrics, similar to the reference. It also notes that
2764
+ autoscaling/v1 only supports CPU-based scaling, aligning with the
2765
+ reference's points."
2766
+
2767
+ The reference for k8s_018 specifies three points: (1) autoscaling/v2
2768
+ is the current stable API, (2) it adds memory metrics support beyond
2769
+ v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's
2770
+ reasoning step *invents additional criteria* the reference does not
2771
+ require ("preserved as annotations when using autoscaling/v1," "use
2772
+ autoscaling/v2 directly for ... a Deployment or StatefulSet") and then
2773
+ deducts against them, scoring 1. GPT-4o reads the reference's three
2774
+ points and scores against exactly those, scoring 2.
2775
+
2776
+ This is a *capacity* finding distinct from the paraphrase-recency
2777
+ finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing
2778
+ it toward paraphrase semantics, manufactures additional gold criteria
2779
+ during scoring that aren't in the reference. Recency-positioning the
2780
+ "paraphrase allowed" clause doesn't address this — the bias isn't
2781
+ "missed paraphrase," it's "invented extra requirements." Two failure
2782
+ modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces.
2783
+
2784
+ **Implication for v1.2.** With 5/5 confirmed, v1.2 fix #3 (per-
2785
+ dimension membership) gets clean empirical support: gpt-4o-mini is
2786
+ the wrong tool for 3-point completeness with paraphrase semantics, and
2787
+ no amount of prompt engineering on this rubric is going to bridge the
2788
+ capacity gap. The right v1.2 path is one of:
2789
+
2790
+ - **Exclude gpt-4o-mini from completeness scoring** (per-dim
2791
+ membership; jury reduces to single-judge Haiku on completeness;
2792
+ explicit and visible in config).
2793
+ - **Replace gpt-4o-mini with GPT-4o on completeness** (per-dim
2794
+ judge selection; jury keeps two members but the second is a
2795
+ frontier-class model on the dimension that needs it).
2796
+
2797
+ Both are defensible v1.2 designs. The choice depends on cost
2798
+ budget — gpt-4o is ~10× the per-call cost of gpt-4o-mini. For
2799
+ agent-bench's calibration set scale (~30 items × per-row), even gpt-
2800
+ 4o is trivially cheap; for production deployment evaluating thousands
2801
+ of agent outputs, the cost trade-off matters more.
2802
+
2803
+ **4A artifact:** `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl`
2804
+ (per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs
2805
+ with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for
2806
+ the side-by-side analysis above).
2807
+
2808
+ **Updated honest framing for the writeup.** "v1.1.1 addressed one
2809
+ identified failure mode (paraphrase-instruction-loss across reasoning,
2810
+ recovered 5/19 disputed items via positional change). 4A confirmed the
2811
+ residual 14 are a distinct failure mode (capacity-limited criteria
2812
+ invention during the reasoning step) — GPT-4o handles all 5 sampled
2813
+ residuals at the same v1.1.1 prompt, so the failure is small-model-
2814
+ specific rather than rubric-limited. v1.2 fix #3 (per-dimension judge
2815
+ membership / model selection) is the right escalation; the rubric
2816
+ itself doesn't need changes."
2817
+
2818
+ **Total session spend:** $0.0099 reported (~$0.013–0.018 actual after
2819
+ gpt-4o pricing correction).
Makefile CHANGED
@@ -1,6 +1,6 @@
1
  PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
2
 
3
- .PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
4
 
5
  install:
6
  $(PYTHON) -m pip install -e ".[dev]"
@@ -34,6 +34,21 @@ benchmark:
34
  evaluate-langchain:
35
  $(PYTHON) scripts/run_langchain_eval.py --provider openai
36
 
37
  docker:
38
  docker-compose -f docker/docker-compose.yaml up --build
39
 
 
1
  PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11
2
 
3
+ .PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain calibrate evaluate-judges docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate
4
 
5
  install:
6
  $(PYTHON) -m pip install -e ".[dev]"
 
34
  evaluate-langchain:
35
  $(PYTHON) scripts/run_langchain_eval.py --provider openai
36
 
37
+ calibrate: ## Run full calibration pipeline (system outputs → all rows → strict κ table). Costs ~$2 in API calls.
38
+ $(PYTHON) scripts/run_calibration.py generate-outputs
39
+ @for cfg in configs/calibration/rows/*.yaml; do \
40
+ echo "==> running judges for $$cfg"; \
41
+ $(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
42
+ done
43
+ $(PYTHON) scripts/run_calibration.py build-table --strict
44
+
45
+ evaluate-judges: ## Re-run all rows + build-table against existing system_outputs (no regeneration). Costs ~$1.
46
+ @for cfg in configs/calibration/rows/*.yaml; do \
47
+ echo "==> running judges for $$cfg"; \
48
+ $(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \
49
+ done
50
+ $(PYTHON) scripts/run_calibration.py build-table --strict
51
+
52
  docker:
53
  docker-compose -f docker/docker-compose.yaml up --build
54
 
README.md CHANGED
@@ -15,7 +15,7 @@ app_port: 7860
15
 
16
  Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
17
 
18
- `444 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
19
 
20
  ## Benchmark Results
21
 
@@ -249,7 +249,7 @@ security:
249
  - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
250
  - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
251
  - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
252
- - **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 444 deterministic tests with mock providers
253
 
254
  <details><summary>API Reference</summary>
255
 
@@ -311,12 +311,25 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
311
  ## Testing
312
 
313
  ```bash
314
- make test # 444 deterministic tests, no API keys needed
315
  make lint # ruff + mypy
316
  ```
317
 
318
  All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
319
 
320
  ## Design Decisions
321
 
322
  See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
@@ -334,4 +347,4 @@ See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF
334
  | **PII redaction** | None | None | Regex + optional NER |
335
  | **Output validation** | None | None | PII leakage + URL + blocklist |
336
  | **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
337
- | Tests | 97 | 205 | 288 |
 
15
 
16
  Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
17
 
18
+ `443 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
19
 
20
  ## Benchmark Results
21
 
 
249
  - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
250
  - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
251
  - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
252
+ - **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 443 deterministic tests with mock providers
253
 
254
  <details><summary>API Reference</summary>
255
 
 
311
  ## Testing
312
 
313
  ```bash
314
+ make test # 523 deterministic tests, no API keys needed
315
  make lint # ruff + mypy
316
  ```
317
 
318
  All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe.
319
 
320
+ ### Targets that cost money
321
+
322
+ These Make targets call paid LLM APIs. Run locally; they are excluded from CI.
323
+
324
+ | Target | Requires API key | Approximate cost | What it produces |
325
+ |---|---|---|---|
326
+ | `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). |
327
+ | `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` |
328
+ | `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) |
329
+ | `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report |
330
+
331
+ Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`).
332
+
333
  ## Design Decisions
334
 
335
  See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more.
 
347
  | **PII redaction** | None | None | Regex + optional NER |
348
  | **Output validation** | None | None | PII leakage + URL + blocklist |
349
  | **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
350
+ | Tests | 97 | 205 | 443 |
agent_bench/core/config.py CHANGED
@@ -88,6 +88,9 @@ class MemoryConfig(BaseModel):
88
  class EvaluationConfig(BaseModel):
89
  judge_provider: str = "openai"
90
  golden_dataset: str = "agent_bench/evaluation/datasets/tech_docs_golden.json"
 
 
 
91
 
92
 
93
  _VALID_TIERS = {"heuristic", "classifier"}
 
88
  class EvaluationConfig(BaseModel):
89
  judge_provider: str = "openai"
90
  golden_dataset: str = "agent_bench/evaluation/datasets/tech_docs_golden.json"
91
+ # New in judge-layer v1: which dimensions to score with L2 LLM judges.
92
+ # citation_faithfulness is opt-in v1 (default-on v1.1).
93
+ judge_dimensions: list[str] = ["groundedness", "relevance", "completeness"]
94
 
95
 
96
  _VALID_TIERS = {"heuristic", "classifier"}
agent_bench/core/provider.py CHANGED
@@ -192,9 +192,17 @@ class MockProvider(LLMProvider):
192
 
193
 
194
  class OpenAIProvider(LLMProvider):
195
- """OpenAI API provider pinned to a dated gpt-4o-mini snapshot."""
196
 
197
- def __init__(self, config: AppConfig | None = None) -> None:
 
 
 
 
 
 
 
 
198
  try:
199
  from openai import AsyncOpenAI
200
  except ImportError as e:
@@ -205,7 +213,7 @@ class OpenAIProvider(LLMProvider):
205
  self.config = config or load_config()
206
  api_key = os.environ.get("OPENAI_API_KEY", "")
207
  self.client = AsyncOpenAI(api_key=api_key)
208
- self.model = "gpt-4o-mini-2024-07-18"
209
  model_pricing = self.config.provider.models.get(self.model)
210
  self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.15
211
  self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.60
@@ -410,9 +418,17 @@ def format_messages_anthropic(
410
 
411
 
412
  class AnthropicProvider(LLMProvider):
413
- """Anthropic Claude provider."""
414
 
415
- def __init__(self, config: AppConfig | None = None) -> None:
 
 
 
 
 
 
 
 
416
  try:
417
  from anthropic import AsyncAnthropic
418
  except ImportError as e:
@@ -425,7 +441,7 @@ class AnthropicProvider(LLMProvider):
425
  self.config = config or load_config()
426
  api_key = os.environ.get("ANTHROPIC_API_KEY", "")
427
  self.client = AsyncAnthropic(api_key=api_key)
428
- self.model = "claude-haiku-4-5-20251001"
429
  model_pricing = self.config.provider.models.get(self.model)
430
  self._input_cost = (
431
  model_pricing.input_cost_per_mtok if model_pricing else 0.80
 
192
 
193
 
194
  class OpenAIProvider(LLMProvider):
195
+ """OpenAI API provider pinned to a dated gpt-4o-mini snapshot.
196
 
197
+ The ``model`` parameter overrides the default pin (used by the
198
+ calibration runner so a row config's ``model_id`` is what actually
199
+ gets called — without an override, ``judge_id`` would be a label
200
+ that disagrees with the API request, breaking provenance).
201
+ """
202
+
203
+ def __init__(
204
+ self, config: AppConfig | None = None, *, model: str | None = None
205
+ ) -> None:
206
  try:
207
  from openai import AsyncOpenAI
208
  except ImportError as e:
 
213
  self.config = config or load_config()
214
  api_key = os.environ.get("OPENAI_API_KEY", "")
215
  self.client = AsyncOpenAI(api_key=api_key)
216
+ self.model = model or "gpt-4o-mini-2024-07-18"
217
  model_pricing = self.config.provider.models.get(self.model)
218
  self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.15
219
  self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.60
 
418
 
419
 
420
  class AnthropicProvider(LLMProvider):
421
+ """Anthropic Claude provider.
422
 
423
+ The ``model`` parameter overrides the default pin (used by the
424
+ calibration runner so a row config's ``model_id`` is what actually
425
+ gets called — without an override, ``judge_id`` would be a label
426
+ that disagrees with the API request, breaking provenance).
427
+ """
428
+
429
+ def __init__(
430
+ self, config: AppConfig | None = None, *, model: str | None = None
431
+ ) -> None:
432
  try:
433
  from anthropic import AsyncAnthropic
434
  except ImportError as e:
 
441
  self.config = config or load_config()
442
  api_key = os.environ.get("ANTHROPIC_API_KEY", "")
443
  self.client = AsyncAnthropic(api_key=api_key)
444
+ self.model = model or "claude-haiku-4-5-20251001"
445
  model_pricing = self.config.provider.models.get(self.model)
446
  self._input_cost = (
447
  model_pricing.input_cost_per_mtok if model_pricing else 0.80
agent_bench/evaluation/calibration/__init__.py ADDED
@@ -0,0 +1,9 @@
1
+ """Hand-rolled inter-rater agreement metrics + calibration report generator."""
2
+
3
+ from agent_bench.evaluation.calibration.metrics import (
4
+ bootstrap_ci,
5
+ cohen_kappa,
6
+ gwets_ac2,
7
+ )
8
+
9
+ __all__ = ["bootstrap_ci", "cohen_kappa", "gwets_ac2"]
agent_bench/evaluation/calibration/metrics.py ADDED
@@ -0,0 +1,173 @@
1
+ """Hand-rolled Cohen's kappa, Gwet's AC1 (exposed as ``gwets_ac2``), bootstrap CI.
2
+
3
+ Hand-rolled (not sklearn) for two reasons:
4
+ 1. agent-bench's identity is "built from primitives" — adding sklearn
5
+ for one function (and transitively numpy + scipy + threadpoolctl +
6
+ joblib) contradicts that.
7
+ 2. The hand-roll demonstrates formula understanding in a way that
8
+ sklearn.metrics.cohen_kappa_score does not.
9
+
10
+ Fixture-tested against sklearn run *outside* the project venv —
11
+ see tests/evaluation/test_calibration_metrics.py and
12
+ scripts/_dev/generate_kappa_fixtures.py.
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import random
18
+ from collections.abc import Callable
19
+ from typing import Literal
20
+
21
+
+ def cohen_kappa(
+     y1: list,
+     y2: list,
+     weights: Literal[None, "linear", "quadratic"] = None,
+ ) -> float:
+     """Cohen's κ = (P_o - P_e) / (1 - P_e).
+
+     Supports unweighted, linear-weighted, and quadratic-weighted variants
+     for ordinal scales. y1 and y2 must be parallel lists of label values
+     (int or str). Both must have the same length.
+     """
+     if len(y1) != len(y2):
+         raise ValueError(
+             f"y1 and y2 must have same length; got {len(y1)} vs {len(y2)}"
+         )
+     if not y1:
+         raise ValueError("Empty input — kappa undefined")
+
+     labels = sorted({*y1, *y2}, key=str)
+     k = len(labels)
+     label_idx = {lab: i for i, lab in enumerate(labels)}
+
+     cm = [[0] * k for _ in range(k)]
+     for a, b in zip(y1, y2):
+         cm[label_idx[a]][label_idx[b]] += 1
+
+     n = len(y1)
+
+     if weights is None:
+         w = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
+     elif weights == "linear":
+         if k <= 1:
+             w = [[1.0]]
+         else:
+             w = [
+                 [1.0 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)
+             ]
+     elif weights == "quadratic":
+         if k <= 1:
+             w = [[1.0]]
+         else:
+             w = [
+                 [1.0 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)
+             ]
+     else:
+         raise ValueError(f"Invalid weights {weights!r}")
+
+     p_o = sum(w[i][j] * cm[i][j] for i in range(k) for j in range(k)) / n
+
+     row_marg = [sum(cm[i][j] for j in range(k)) / n for i in range(k)]
+     col_marg = [sum(cm[i][j] for i in range(k)) / n for j in range(k)]
+
+     p_e = sum(
+         w[i][j] * row_marg[i] * col_marg[j] for i in range(k) for j in range(k)
+     )
+
+     if p_e >= 1.0:
+         return 1.0
+     return (p_o - p_e) / (1.0 - p_e)
+
+
+ def gwets_ac2(
+     y1: list,
+     y2: list,
+     weights: Literal[None] = None,
+ ) -> float:
+     """Gwet's AC1 — chance-corrected agreement using mean marginals.
+
+     AC1 = (P_o - P_e) / (1 - P_e)
+     where P_e = (1/(q-1)) * Σ pi_k * (1 - pi_k)
+     and pi_k is the mean marginal probability for category k.
+
+     Despite the function name, v1 only supports the *unweighted* (AC1)
+     formula. The weighted AC2 variant has multiple inconsistent definitions
+     in the literature (Gwet 2008 vs Gwet 2014); without a sklearn analogue
+     to cross-check against (sklearn ships κ but not AC1/AC2), shipping a
+     weighted formula without a fixture is a methodology hazard. Pass
+     weights=None or omit; passing 'linear' or 'quadratic' raises
+     NotImplementedError. Fix the formula + fixture in v1.1 (out of scope
+     per the design's Out-of-Scope section).
+     """
+     if weights is not None:
+         raise NotImplementedError(
+             "Weighted Gwet's AC2 is not implemented in v1. The unweighted "
+             "AC1 formula is correct and tested; the weighted variant has "
+             "literature inconsistency that needs a pinned fixture before "
+             "shipping. Pass weights=None or use cohen_kappa(weights=...)."
+         )
+     if len(y1) != len(y2):
+         raise ValueError("y1 and y2 length mismatch")
+     if not y1:
+         raise ValueError("Empty input")
+
+     labels = sorted({*y1, *y2}, key=str)
+     k = len(labels)
+     label_idx = {lab: i for i, lab in enumerate(labels)}
+
+     cm = [[0] * k for _ in range(k)]
+     for a, b in zip(y1, y2):
+         cm[label_idx[a]][label_idx[b]] += 1
+     n = len(y1)
+
+     p_o = sum(cm[i][i] for i in range(k)) / n  # diagonal sum (unweighted)
+
+     row_marg = [sum(cm[i][j] for j in range(k)) / n for i in range(k)]
+     col_marg = [sum(cm[i][j] for i in range(k)) / n for j in range(k)]
+     pi = [(row_marg[i] + col_marg[i]) / 2 for i in range(k)]
+
+     if k <= 1:
+         return 1.0
+     # AC1 chance term: (1/(q-1)) * Σ pi_k * (1 - pi_k)
+     p_e_ac1 = sum(pi[i] * (1 - pi[i]) for i in range(k)) / (k - 1)
+
+     if p_e_ac1 >= 1.0:
+         return 1.0
+     return (p_o - p_e_ac1) / (1.0 - p_e_ac1)
+
+
+ def bootstrap_ci(
+     y1: list,
+     y2: list,
+     metric_fn: Callable[[list, list], float],
+     n_iter: int = 1000,
+     ci: float = 0.95,
+     seed: int = 42,
+ ) -> tuple[float, float, float]:
+     """Bootstrap confidence interval for an inter-rater metric.
+
+     Returns (point_estimate, ci_lo, ci_hi). Resamples with replacement
+     n_iter times and takes the (1-ci)/2 and (1+ci)/2 percentiles.
+     """
+     if len(y1) != len(y2):
+         raise ValueError("length mismatch")
+     n = len(y1)
+     rng = random.Random(seed)
+     point = metric_fn(y1, y2)
+     samples: list[float] = []
+     for _ in range(n_iter):
+         idx = [rng.randrange(n) for _ in range(n)]
+         s1 = [y1[i] for i in idx]
+         s2 = [y2[i] for i in idx]
+         try:
+             samples.append(metric_fn(s1, s2))
+         except (ValueError, ZeroDivisionError):
+             # Degenerate resample (e.g., all one label) — skip
+             continue
+     samples.sort()
+     if not samples:
+         return point, point, point
+     lo_idx = int(((1 - ci) / 2) * len(samples))
+     hi_idx = int(((1 + ci) / 2) * len(samples)) - 1
+     return point, samples[lo_idx], samples[hi_idx]
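Reviewer sketch (not part of the commit): the docstrings above give the unweighted κ and AC1 formulas, and the difference between the two chance terms is easiest to see on skewed data. The snippet below is a standalone re-derivation under stated assumptions, not the module's code: `observed_agreement`, `kappa_unweighted`, and `ac1` are illustrative names, and the `gold`/`judge` lists are made up to resemble a heavily skewed 30-item set.

```python
from collections import Counter


def observed_agreement(y1, y2):
    """Fraction of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(y1, y2)) / len(y1)


def kappa_unweighted(y1, y2):
    """Unweighted Cohen's kappa: chance term is the product of the marginals."""
    n = len(y1)
    c1, c2 = Counter(y1), Counter(y2)
    p_o = observed_agreement(y1, y2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(y1) | set(y2))
    return 1.0 if p_e >= 1.0 else (p_o - p_e) / (1.0 - p_e)


def ac1(y1, y2):
    """Gwet's AC1: chance term is built from the mean of the two marginals."""
    n = len(y1)
    labels = set(y1) | set(y2)
    q = len(labels)
    if q <= 1:
        return 1.0
    c1, c2 = Counter(y1), Counter(y2)
    pi = [(c1[lab] + c2[lab]) / (2 * n) for lab in labels]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    p_o = observed_agreement(y1, y2)
    return (p_o - p_e) / (1.0 - p_e)


# Made-up skewed data: gold has 29 items scored 2 and one scored 1;
# the judge scores every item 2.
gold = [2] * 29 + [1]
judge = [2] * 30

raw = observed_agreement(gold, judge)
kappa = kappa_unweighted(gold, judge)
gwet = ac1(gold, judge)
print(f"raw={raw:.3f} kappa={kappa:.3f} AC1={gwet:.3f}")
```

On this input raw agreement is 29/30, κ comes out at 0 because P_o equals P_e, and AC1 stays above 0.96. A separate point that may be worth checking in the weighted κ path: `sorted(..., key=str)` orders labels lexicographically, so an ordinal scale with labels past 9 would place 10 before 2 and change the weight matrix. The 0-2 scales used here are not affected.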
agent_bench/evaluation/calibration/report.py ADDED
@@ -0,0 +1,325 @@
+ """generate_kappa_table — joins predictions ⋈ labels by (item_id, dimension,
+ system_output_hash); computes per-row κ + bootstrap CI + abstain breakdown;
+ emits markdown table at docs/_generated/kappa_table.md.
+ """
+
+ from __future__ import annotations
+
+ import glob as _glob
+ import json
+ from collections import defaultdict
+ from collections.abc import Callable
+ from pathlib import Path
+
+ import structlog
+
+ from agent_bench.evaluation.calibration.metrics import (
+     bootstrap_ci,
+     cohen_kappa,
+     gwets_ac2,
+ )
+ from agent_bench.evaluation.judges.base import (
+     ABSTAIN_REASON_OUT_OF_RANGE,
+     ABSTAIN_REASON_PROVIDER_EXHAUSTED,
+     ABSTAIN_REASON_SCHEMA_PARSE,
+ )
+
+ logger = structlog.get_logger()
+
+ ABSTAIN_THRESHOLD = 0.20  # strictly greater than fires the flag
+
+ # Per-dimension headline metric. Cohen's κ degenerates under the prevalence
+ # imbalance produced by the v1.1 strict-snippet groundedness rubric (1×score=1,
+ # ~25×score=0) and by the inherent skew on relevance (29×score=2, 1×score=1):
+ # both Po and Pe approach 1.0, the formula collapses to ~0/0, and the rendered
+ # κ reads as 0.000 even when raw agreement is >95%. Gwet's AC1 (gwets_ac2 with
+ # weights=None per metrics.py) uses mean marginals and stays informative under
+ # imbalance. Completeness has a more balanced gold (23×2, 5×1, 2×Unknown) so
+ # Cohen's κ is the conventional choice there. The metric per dim is rendered
+ # explicitly in the footer so a writeup reader sees the methodology choice.
+ # Type annotation prevents a mypy 1.20.x INTERNAL ERROR triggered by the
+ # tuple-unpack of `_DIM_METRIC.get(dim, default)` further down. Without it
+ # mypy fails to infer the metric_fn callable signature consistently across
+ # the dict literal and the fallback default, and crashes with no real
+ # user-facing type error to fix.
+ _MetricFn = Callable[[list, list], float]
+ _DIM_METRIC: dict[str, tuple[str, _MetricFn]] = {
+     "groundedness": ("AC1", gwets_ac2),
+     "relevance": ("AC1", gwets_ac2),
+     "completeness": ("κ", cohen_kappa),
+ }
+
+ # Filename marker for jury / permute sidecar files. Any prediction file whose
+ # basename contains this token is per-member detail, not aggregate predictions,
+ # and is excluded from the κ table. Pinned here so a future extension change
+ # (jsonl → json) is caught at the contract site rather than at report time.
+ _SIDECAR_BASENAME_MARKER = "_members."
+
+
+ def _classify_abstain(reasoning: str) -> str:
+     if reasoning.startswith(ABSTAIN_REASON_PROVIDER_EXHAUSTED):
+         return "provider_exhausted"
+     if reasoning.startswith(ABSTAIN_REASON_SCHEMA_PARSE):
+         return "schema_parse"
+     if reasoning.startswith(ABSTAIN_REASON_OUT_OF_RANGE):
+         return "out_of_range"
+     return "genuine"
+
+
+ def generate_kappa_table(
+     *,
+     predictions_glob: str,
+     labels_path: str,
+     output_path: str,
+     strict: bool = False,
+ ) -> None:
+     """Aggregate predictions across rows + dimensions into one markdown table.
+
+     On hash mismatch: ALWAYS raises (both modes), with first-item
+     expected/actual hashes plus full mismatched-id list.
+     On missing prediction or label: WARN+exclude in default mode; RAISE in strict.
+     On undefined κ: render '—' with a footnote (both modes).
+     On abstain rate > 20%: render κ + footnote with cause breakdown (both modes).
+     """
+     labels: list[dict] = []
+     for line in Path(labels_path).read_text().splitlines():
+         line = line.strip()
+         if not line:
+             continue
+         labels.append(json.loads(line))
+
+     label_by_key: dict[tuple[str, str], dict] = {
+         (label_rec["item_id"], label_rec["dimension"]): label_rec
+         for label_rec in labels
+     }
+
+     pred_files = sorted(_glob.glob(predictions_glob))
+     if not pred_files:
+         raise ValueError(f"No prediction files matched: {predictions_glob}")
+
+     rows: list[dict] = []
+     for pf in pred_files:
+         # Skip sidecars (per-member detail, not aggregate predictions).
+         # Match the basename marker, not a specific extension, so a future
+         # jsonl → json migration of jury._DEFAULT_SIDECAR_TEMPLATE doesn't
+         # silently start contaminating the κ table.
+         if _SIDECAR_BASENAME_MARKER in Path(pf).name:
+             continue
+         row_label = (
+             Path(pf).stem.replace("calibration_v1_judge_", "")
+         )
+         preds = json.loads(Path(pf).read_text())
+
+         # Hash-mismatch detection (always raises)
+         mismatches: list[tuple[str, str, str]] = []
+         for p in preds:
+             key = (p["item_id"], p["dimension"])
+             if key in label_by_key:
+                 expected = label_by_key[key]["system_output_hash"]
+                 actual = p["system_output_hash"]
+                 if expected != actual:
+                     mismatches.append((p["item_id"], expected, actual))
+         if mismatches:
+             first_id, first_exp, first_act = mismatches[0]
+             raise ValueError(
+                 f"Hash mismatch in {pf}: item {first_id!r} "
+                 f"label.system_output_hash={first_exp!r} but "
+                 f"prediction.system_output_hash={first_act!r}. "
+                 f"Full mismatched-id list ({len(mismatches)}): "
+                 f"{[m[0] for m in mismatches]}. "
+                 f"Labels are stale relative to predictions — regenerate one or "
+                 f"the other so hashes align."
+             )
+
+         preds_by_dim: dict[str, list[dict]] = defaultdict(list)
+         for p in preds:
+             preds_by_dim[p["dimension"]].append(p)
+
+         labels_by_dim: dict[str, list[dict]] = defaultdict(list)
+         for label_rec in labels:
+             labels_by_dim[label_rec["dimension"]].append(label_rec)
+
+         for dim in sorted(preds_by_dim.keys()):
+             # Resolve dimension's headline metric once per dim, instead of
+             # tuple-unpacking _DIM_METRIC.get(...) at each use site below.
+             # The repeated unpack pattern triggered a mypy 1.19+ INTERNAL
+             # ERROR; one resolution call here is also less code.
+             metric_name, metric_fn = _DIM_METRIC.get(
+                 dim, ("κ", cohen_kappa)
+             )
+
+             preds_d = {p["item_id"]: p for p in preds_by_dim[dim]}
+             labs_d = {
+                 label_rec["item_id"]: label_rec
+                 for label_rec in labels_by_dim.get(dim, [])
+             }
+
+             common = sorted(set(preds_d) & set(labs_d))
+             missing_pred = sorted(set(labs_d) - set(preds_d))
+             missing_lab = sorted(set(preds_d) - set(labs_d))
+             if missing_pred or missing_lab:
+                 msg = (
+                     f"row={row_label} dim={dim} "
+                     f"missing_predictions={missing_pred} "
+                     f"missing_labels={missing_lab}"
+                 )
+                 if strict:
+                     raise ValueError(f"strict mode: missing items: {msg}")
+                 logger.warning("calibration_report_missing", message=msg)
+
+             y_pred: list = []
+             y_lab: list = []
+             abstains = 0
+             abstain_causes: dict[str, int] = {
+                 "provider_exhausted": 0,
+                 "schema_parse": 0,
+                 "out_of_range": 0,
+                 "genuine": 0,
+             }
+             for iid in common:
+                 p = preds_d[iid]
+                 label_rec = labs_d[iid]
+                 if p["score"] == "Unknown" or label_rec["score"] == "Unknown":
+                     abstains += 1
+                     if p["score"] == "Unknown":
+                         abstain_causes[
+                             _classify_abstain(p.get("reasoning", ""))
+                         ] += 1
+                     continue
+                 y_pred.append(int(p["score"]))
+                 y_lab.append(int(label_rec["score"]))
+
+             n_eligible = len(y_pred)
+             abstain_rate = abstains / max(len(common), 1)
+
+             if n_eligible < 3:
+                 rows.append(
+                     {
+                         "row": row_label,
+                         "dim": dim,
+                         "metric": metric_name,
+                         "kappa": None,
+                         "ci_lo": None,
+                         "ci_hi": None,
+                         "n_eligible": n_eligible,
+                         "abstains": abstains,
+                         "abstain_rate": abstain_rate,
+                         "abstain_causes": abstain_causes,
+                         "footnote": (
+                             f"{metric_name} undefined: insufficient "
+                             f"agreement-eligible items (N={n_eligible})"
+                         ),
+                     }
+                 )
+                 continue
+
+             try:
+                 kappa = metric_fn(y_lab, y_pred)
+                 point, lo, hi = bootstrap_ci(
+                     y_lab, y_pred, metric_fn, n_iter=1000, seed=42
+                 )
+             except (ValueError, ZeroDivisionError):
+                 rows.append(
+                     {
+                         "row": row_label,
+                         "dim": dim,
+                         "metric": metric_name,
+                         "kappa": None,
+                         "ci_lo": None,
+                         "ci_hi": None,
+                         "n_eligible": n_eligible,
+                         "abstains": abstains,
+                         "abstain_rate": abstain_rate,
+                         "abstain_causes": abstain_causes,
+                         "footnote": (
+                             f"{metric_name} undefined: insufficient "
+                             f"variance after exclusion"
+                         ),
+                     }
+                 )
+                 continue
+
+             # Detect degenerate κ (perfectly constant labels → P_e=1 → kappa
+             # was clamped to 1.0 in metrics.py, but with no observed
+             # disagreement the result is statistically meaningless)
+             if len(set(y_lab)) <= 1 and len(set(y_pred)) <= 1:
+                 rows.append(
+                     {
+                         "row": row_label,
+                         "dim": dim,
+                         "metric": metric_name,
+                         "kappa": None,
+                         "ci_lo": None,
+                         "ci_hi": None,
+                         "n_eligible": n_eligible,
+                         "abstains": abstains,
+                         "abstain_rate": abstain_rate,
+                         "abstain_causes": abstain_causes,
+                         "footnote": (
+                             f"{metric_name} undefined: all labels and "
+                             f"predictions in a single category (no variance "
+                             f"to measure)"
+                         ),
+                     }
+                 )
+                 continue
+
+             footnote = ""
+             if abstain_rate > ABSTAIN_THRESHOLD:
+                 breakdown = ", ".join(
+                     f"{int(100 * v / abstains)}% {k.replace('_', ' ')}"
+                     for k, v in abstain_causes.items()
+                     if v > 0
+                 )
+                 footnote = (
+                     f"{metric_name} computed on N={n_eligible} of "
+                     f"{len(common)} items; high abstain rate "
+                     f"({100 * abstain_rate:.1f}% — breakdown: {breakdown}) "
+                     f"suggests rubric ambiguity."
+                 )
+
+             rows.append(
+                 {
+                     "row": row_label,
+                     "dim": dim,
+                     "metric": metric_name,
+                     "kappa": kappa,
+                     "ci_lo": lo,
+                     "ci_hi": hi,
+                     "n_eligible": n_eligible,
+                     "abstains": abstains,
+                     "abstain_rate": abstain_rate,
+                     "abstain_causes": abstain_causes,
+                     "footnote": footnote,
+                 }
+             )
+
+     out = ["# κ ablation table — calibration v1\n"]
+     out.append(
+         "Headline metric per dimension: " + ", ".join(
+             f"**{d} → {m}**" for d, (m, _) in _DIM_METRIC.items()
+         ) + ". "
+         "AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold "
+         "is prevalence-skewed enough to make Cohen's κ degenerate "
+         "(groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); both metrics "
+         "produce ≥0.95 raw agreement on those rows but Cohen's κ collapses "
+         "to ≈0 because Pe approaches 1. Completeness uses Cohen's κ — its "
+         "gold (23×`2`/5×`1`) is balanced enough for κ to behave normally."
+     )
+     out.append("")
+     out.append("| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |")
+     out.append("|---|---|---|---|---|---|---|")
+     for r in rows:
+         if r["kappa"] is None:
+             kcell = " — "
+         else:
+             kcell = f"{r['kappa']:.3f} ({r['ci_lo']:.3f}, {r['ci_hi']:.3f})"
+         rate = f"{100 * r['abstain_rate']:.1f}%"
+         out.append(
+             f"| {r['row']} | {r['dim']} | {r['metric']} | {kcell} | "
+             f"{r['n_eligible']} | {rate} | {r['footnote']} |"
+         )
+
+     Path(output_path).parent.mkdir(parents=True, exist_ok=True)
+     Path(output_path).write_text("\n".join(out) + "\n")
+     logger.info("kappa_table_written", path=output_path, rows=len(rows))
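Reviewer sketch (not part of the commit): the report joins predictions to labels on `(item_id, dimension)` and refuses to continue when `system_output_hash` differs. The snippet below is a minimal standalone illustration of that check, assuming records shaped like the fields the report reads. `find_hash_mismatches` is a hypothetical helper name, and the records and hash strings are placeholders rather than real data.

```python
def find_hash_mismatches(preds, labels):
    """Return (item_id, expected_hash, actual_hash) for every stale pair.

    Hypothetical helper: it mirrors only the (item_id, dimension) join and
    the system_output_hash comparison, nothing else from the report.
    """
    label_by_key = {(rec["item_id"], rec["dimension"]): rec for rec in labels}
    mismatches = []
    for p in preds:
        label = label_by_key.get((p["item_id"], p["dimension"]))
        if label is None:
            continue  # unlabeled predictions are a separate (warn/strict) path
        if label["system_output_hash"] != p["system_output_hash"]:
            mismatches.append(
                (p["item_id"], label["system_output_hash"], p["system_output_hash"])
            )
    return mismatches


# Made-up records; the hashes are placeholders, not real digests.
labels = [
    {"item_id": "q021", "dimension": "relevance", "score": 2, "system_output_hash": "aaa"},
    {"item_id": "q010", "dimension": "relevance", "score": 2, "system_output_hash": "bbb"},
]
preds = [
    {"item_id": "q021", "dimension": "relevance", "score": 2, "system_output_hash": "aaa"},
    {"item_id": "q010", "dimension": "relevance", "score": 1, "system_output_hash": "zzz"},
    {"item_id": "q006", "dimension": "relevance", "score": 2, "system_output_hash": "ccc"},
]

stale = find_hash_mismatches(preds, labels)
print(stale)
```

Here only `q010` is reported: `q021` matches, and `q006` has no label, so it falls under the missing-item handling instead of the hash check.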
agent_bench/evaluation/datasets/calibration_v1.json ADDED
@@ -0,0 +1,158 @@
+ {
+   "version": "v1",
+   "system_config_git_sha": "3a2ed359eb16437cf95987b1fca47281a37fb74c",
+   "sample_seed": 20260504,
+   "notes": "30-item stratified calibration set per the design doc. Spare slots filled from K8s simple_w_condition and multi_hop (typically highest-variance R@5 strata).",
+   "items": [
+     {
+       "id": "q021",
+       "corpus": "fastapi",
+       "stratum": "calculation"
+     },
+     {
+       "id": "q010",
+       "corpus": "fastapi",
+       "stratum": "out_of_scope"
+     },
+     {
+       "id": "q027",
+       "corpus": "fastapi",
+       "stratum": "out_of_scope"
+     },
+     {
+       "id": "q006",
+       "corpus": "fastapi",
+       "stratum": "retrieval"
+     },
+     {
+       "id": "q011",
+       "corpus": "fastapi",
+       "stratum": "retrieval"
+     },
+     {
+       "id": "q012",
+       "corpus": "fastapi",
+       "stratum": "retrieval"
+     },
+     {
+       "id": "q023",
+       "corpus": "fastapi",
+       "stratum": "retrieval"
+     },
+     {
+       "id": "q025",
+       "corpus": "fastapi",
+       "stratum": "retrieval"
+     },
+     {
+       "id": "k8s_002",
+       "corpus": "k8s",
+       "stratum": "comparison"
+     },
+     {
+       "id": "k8s_014",
+       "corpus": "k8s",
+       "stratum": "comparison"
+     },
+     {
+       "id": "k8s_016",
+       "corpus": "k8s",
+       "stratum": "comparison"
+     },
+     {
+       "id": "k8s_004",
+       "corpus": "k8s",
+       "stratum": "false_premise"
+     },
+     {
+       "id": "k8s_022",
+       "corpus": "k8s",
+       "stratum": "false_premise"
+     },
+     {
+       "id": "k8s_024",
+       "corpus": "k8s",
+       "stratum": "false_premise"
+     },
+     {
+       "id": "k8s_003",
+       "corpus": "k8s",
+       "stratum": "multi_hop"
+     },
+     {
+       "id": "k8s_017",
+       "corpus": "k8s",
+       "stratum": "multi_hop"
+     },
+     {
+       "id": "k8s_018",
+       "corpus": "k8s",
+       "stratum": "multi_hop"
+     },
+     {
+       "id": "k8s_019",
+       "corpus": "k8s",
+       "stratum": "multi_hop"
+     },
+     {
+       "id": "k8s_025",
+       "corpus": "k8s",
+       "stratum": "set"
+     },
+     {
+       "id": "k8s_001",
+       "corpus": "k8s",
+       "stratum": "simple"
+     },
+     {
+       "id": "k8s_006",
+       "corpus": "k8s",
+       "stratum": "simple"
+     },
+     {
+       "id": "k8s_007",
+       "corpus": "k8s",
+       "stratum": "simple"
+     },
+     {
+       "id": "k8s_009",
+       "corpus": "k8s",
+       "stratum": "simple"
+     },
+     {
+       "id": "k8s_005",
+       "corpus": "k8s",
+       "stratum": "simple_w_condition"
+     },
+     {
+       "id": "k8s_012",
+       "corpus": "k8s",
+       "stratum": "simple_w_condition"
+     },
+     {
+       "id": "k8s_013",
+       "corpus": "k8s",
+       "stratum": "simple_w_condition"
+     },
+     {
+       "id": "k8s_015",
+       "corpus": "k8s",
+       "stratum": "spare_comparison"
+     },
+     {
+       "id": "k8s_023",
+       "corpus": "k8s",
+       "stratum": "spare_false_premise"
+     },
+     {
+       "id": "k8s_020",
+       "corpus": "k8s",
+       "stratum": "spare_multi_hop"
+     },
+     {
+       "id": "k8s_011",
+       "corpus": "k8s",
+       "stratum": "spare_simple_w_condition"
+     }
+   ]
+ }
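Reviewer sketch (not part of the commit): one quick way to sanity-check the stratification is to tally items per corpus and per stratum. The snippet below does that on a four-item excerpt in the same shape as the file. Folding `spare_*` slots back into their base stratum is an assumption about what the prefix means, and `base_stratum` is a hypothetical helper name.

```python
import json
from collections import Counter

# Excerpt in the same shape as calibration_v1.json (four of the 30 items).
excerpt = json.loads("""
{
  "version": "v1",
  "items": [
    {"id": "q021", "corpus": "fastapi", "stratum": "calculation"},
    {"id": "q010", "corpus": "fastapi", "stratum": "out_of_scope"},
    {"id": "k8s_003", "corpus": "k8s", "stratum": "multi_hop"},
    {"id": "k8s_020", "corpus": "k8s", "stratum": "spare_multi_hop"}
  ]
}
""")

SPARE_PREFIX = "spare_"


def base_stratum(name):
    """Fold a spare_* slot into the stratum it names (assumed convention)."""
    return name[len(SPARE_PREFIX):] if name.startswith(SPARE_PREFIX) else name


per_corpus = Counter(item["corpus"] for item in excerpt["items"])
per_stratum = Counter(base_stratum(item["stratum"]) for item in excerpt["items"])

print(dict(per_corpus))
print(dict(per_stratum))
```

Run against the full file instead of the excerpt, the same tally should show 8 `fastapi` and 22 `k8s` items, matching the 30-item count in `notes`.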
agent_bench/evaluation/datasets/tech_docs_golden.json CHANGED
@@ -2,8 +2,15 @@
2
  {
3
  "id": "q001",
4
  "question": "How do you define a path parameter in FastAPI?",
5
- "expected_answer_keywords": ["curly braces", "path", "function parameter", "URL"],
6
- "expected_sources": ["fastapi_path_params.md"],
 
 
 
 
 
 
 
7
  "category": "retrieval",
8
  "difficulty": "easy",
9
  "requires_calculator": false,
@@ -12,8 +19,15 @@
12
  {
13
  "id": "q002",
14
  "question": "What is the default page size for pagination in FastAPI and what is the maximum allowed?",
15
- "expected_answer_keywords": ["20", "100", "default", "maximum"],
16
- "expected_sources": ["fastapi_pagination.md"],
 
 
 
 
 
 
 
17
  "category": "retrieval",
18
  "difficulty": "easy",
19
  "requires_calculator": false,
@@ -22,8 +36,15 @@
22
  {
23
  "id": "q003",
24
  "question": "How does FastAPI handle CORS and what is the default max_age for preflight caching?",
25
- "expected_answer_keywords": ["CORSMiddleware", "600", "seconds", "preflight"],
26
- "expected_sources": ["fastapi_middleware.md"],
 
 
 
 
 
 
 
27
  "category": "retrieval",
28
  "difficulty": "easy",
29
  "requires_calculator": false,
@@ -32,8 +53,14 @@
32
  {
33
  "id": "q004",
34
  "question": "What algorithm and expiry time does the FastAPI security example use for JWT tokens?",
35
- "expected_answer_keywords": ["HS256", "30", "minutes"],
36
- "expected_sources": ["fastapi_security.md"],
 
 
 
 
 
 
37
  "category": "retrieval",
38
  "difficulty": "medium",
39
  "requires_calculator": false,
@@ -42,8 +69,15 @@
42
  {
43
  "id": "q005",
44
  "question": "What is the recommended formula for calculating the number of Gunicorn workers for a FastAPI deployment?",
45
- "expected_answer_keywords": ["2", "CPU", "cores", "1"],
46
- "expected_sources": ["fastapi_deployment.md"],
 
 
 
 
 
 
 
47
  "category": "retrieval",
48
  "difficulty": "medium",
49
  "requires_calculator": false,
@@ -52,18 +86,35 @@
52
  {
53
  "id": "q006",
54
  "question": "How does dependency caching work in FastAPI, and how can you disable it?",
55
- "expected_answer_keywords": ["cache", "once", "use_cache", "False"],
56
- "expected_sources": ["fastapi_dependencies.md"],
 
 
 
 
 
 
 
57
  "category": "retrieval",
58
  "difficulty": "medium",
59
  "requires_calculator": false,
60
- "reference_answer": "FastAPI caches dependency results so each dependency is called only once per request. Caching can be disabled by setting use_cache=False in the Depends() call."
 
 
 
61
  },
62
  {
63
  "id": "q007",
64
  "question": "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?",
65
- "expected_answer_keywords": ["500", "334", "ceil", "pages"],
66
- "expected_sources": ["fastapi_pagination.md"],
 
 
 
 
 
 
 
67
  "category": "calculation",
68
  "difficulty": "medium",
69
  "requires_calculator": true,
@@ -72,7 +123,11 @@
72
  {
73
  "id": "q008",
74
  "question": "Does FastAPI support automatic Kubernetes deployment?",
75
- "expected_answer_keywords": ["not", "does not contain", "no information"],
 
 
 
 
76
  "expected_sources": [],
77
  "category": "out_of_scope",
78
  "difficulty": "easy",
@@ -82,7 +137,11 @@
82
  {
83
  "id": "q009",
84
  "question": "How does FastAPI integrate with Apache Kafka for event streaming?",
85
- "expected_answer_keywords": ["not", "does not contain", "no information"],
 
 
 
 
86
  "expected_sources": [],
87
  "category": "out_of_scope",
88
  "difficulty": "easy",
@@ -92,38 +151,69 @@
92
  {
93
  "id": "q010",
94
  "question": "Can FastAPI generate GraphQL schemas natively?",
95
- "expected_answer_keywords": ["not", "does not contain", "no information"],
 
 
 
 
96
  "expected_sources": [],
97
  "category": "out_of_scope",
98
  "difficulty": "easy",
99
  "requires_calculator": false,
100
- "reference_answer": ""
 
101
  },
102
  {
103
  "id": "q011",
104
  "question": "What is the default Swagger UI endpoint in FastAPI?",
105
- "expected_answer_keywords": ["/docs", "Swagger", "interactive"],
106
- "expected_sources": ["fastapi_openapi.md"],
 
 
 
 
 
 
107
  "category": "retrieval",
108
  "difficulty": "easy",
109
  "requires_calculator": false,
110
- "reference_answer": "The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface."
 
 
 
 
111
  },
112
  {
113
  "id": "q012",
114
  "question": "How do you raise an HTTP error in a FastAPI route handler?",
115
- "expected_answer_keywords": ["HTTPException", "status_code", "detail"],
116
- "expected_sources": ["fastapi_error_handling.md"],
 
 
 
 
 
 
117
  "category": "retrieval",
118
  "difficulty": "easy",
119
  "requires_calculator": false,
120
- "reference_answer": "You raise an HTTP error in FastAPI by raising an HTTPException with a status_code and a detail message describing the error."
 
 
 
 
121
  },
122
  {
123
  "id": "q013",
124
  "question": "How do you define a request body in FastAPI?",
125
- "expected_answer_keywords": ["Pydantic", "BaseModel", "JSON"],
126
- "expected_sources": ["fastapi_request_body.md"],
 
 
 
 
 
 
127
  "category": "retrieval",
128
  "difficulty": "easy",
129
  "requires_calculator": false,
@@ -132,8 +222,14 @@
132
  {
133
  "id": "q014",
134
  "question": "What testing tools does FastAPI use, and what class provides the test client?",
135
- "expected_answer_keywords": ["TestClient", "pytest", "Starlette"],
136
- "expected_sources": ["fastapi_testing.md"],
 
 
 
 
 
 
137
  "category": "retrieval",
138
  "difficulty": "easy",
139
  "requires_calculator": false,
@@ -142,8 +238,15 @@
142
  {
143
  "id": "q015",
144
  "question": "How does FastAPI manage application configuration and environment variables?",
145
- "expected_answer_keywords": ["BaseSettings", "pydantic", "env", "environment"],
146
- "expected_sources": ["fastapi_configuration.md"],
 
 
 
 
 
 
 
147
  "category": "retrieval",
148
  "difficulty": "medium",
149
  "requires_calculator": false,
@@ -152,8 +255,15 @@
152
  {
153
  "id": "q016",
154
  "question": "What is the minimum response size for GZip compression middleware in FastAPI, and how do you enable it?",
155
- "expected_answer_keywords": ["500", "bytes", "GZipMiddleware", "minimum_size"],
156
- "expected_sources": ["fastapi_middleware.md"],
 
 
 
 
 
 
 
157
  "category": "retrieval",
158
  "difficulty": "medium",
159
  "requires_calculator": false,
@@ -162,8 +272,15 @@
162
  {
163
  "id": "q017",
164
  "question": "How do yield dependencies work in FastAPI and what is the maximum number supported per request?",
165
- "expected_answer_keywords": ["yield", "cleanup", "finally", "32"],
166
- "expected_sources": ["fastapi_dependencies.md"],
 
 
 
 
 
 
 
167
  "category": "retrieval",
168
  "difficulty": "medium",
169
  "requires_calculator": false,
@@ -172,8 +289,15 @@
172
  {
173
  "id": "q018",
174
  "question": "What are the three documentation endpoints FastAPI exposes by default and what OpenAPI version does it use?",
175
- "expected_answer_keywords": ["/docs", "/redoc", "/openapi.json", "3.1"],
176
- "expected_sources": ["fastapi_openapi.md"],
 
 
 
 
 
 
 
177
  "category": "retrieval",
178
  "difficulty": "medium",
179
  "requires_calculator": false,
@@ -182,8 +306,15 @@
182
  {
183
  "id": "q019",
184
  "question": "How does FastAPI handle WebSocket connections, and what must be called before sending data?",
185
- "expected_answer_keywords": ["accept", "WebSocket", "send", "receive"],
186
- "expected_sources": ["fastapi_websockets.md"],
 
 
 
 
 
 
 
187
  "category": "retrieval",
188
  "difficulty": "medium",
189
  "requires_calculator": false,
@@ -192,8 +323,16 @@
192
  {
193
  "id": "q020",
194
  "question": "For a server with 4 CPU cores, how many Gunicorn workers should be configured using the recommended formula?",
195
- "expected_answer_keywords": ["9", "workers", "2", "CPU", "1"],
196
- "expected_sources": ["fastapi_deployment.md"],
 
 
 
 
 
 
 
 
197
  "category": "calculation",
198
  "difficulty": "medium",
199
  "requires_calculator": true,
@@ -202,18 +341,35 @@
202
  {
203
  "id": "q021",
204
  "question": "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?",
205
- "expected_answer_keywords": ["10", "minutes"],
206
- "expected_sources": ["fastapi_middleware.md"],
 
 
 
 
 
207
  "category": "calculation",
208
  "difficulty": "easy",
209
  "requires_calculator": true,
210
- "reference_answer": "With a CORS max_age of 600 seconds, the browser caches preflight results for 10 minutes (600 / 60 = 10)."
 
 
 
211
  },
212
  {
213
  "id": "q022",
214
  "question": "How do route ordering and dependency injection interact when building a secure FastAPI application with scoped endpoints?",
215
- "expected_answer_keywords": ["order", "Depends", "Security", "scopes"],
216
- "expected_sources": ["fastapi_path_params.md", "fastapi_dependencies.md", "fastapi_security.md"],
 
 
 
 
 
 
 
 
 
217
  "category": "retrieval",
218
  "difficulty": "hard",
219
  "requires_calculator": false,
@@ -222,18 +378,40 @@
222
  {
223
  "id": "q023",
224
  "question": "How would you set up a FastAPI application with custom error handling, CORS middleware, and structured testing including dependency overrides?",
225
- "expected_answer_keywords": ["HTTPException", "CORSMiddleware", "TestClient", "override"],
226
- "expected_sources": ["fastapi_error_handling.md", "fastapi_middleware.md", "fastapi_testing.md"],
 
 
 
 
 
 
 
 
 
227
  "category": "retrieval",
228
  "difficulty": "hard",
229
  "requires_calculator": false,
230
- "reference_answer": "Custom error handling is set up by raising HTTPException or registering exception handlers, CORS is configured by adding CORSMiddleware with allowed origins, and testing uses TestClient with app.dependency_overrides to replace dependencies during tests."
 
 
 
 
231
  },
232
  {
233
  "id": "q024",
234
  "question": "Explain how to deploy a FastAPI app with Docker using Gunicorn workers, health checks, and environment-based configuration via Pydantic Settings.",
235
- "expected_answer_keywords": ["Docker", "Gunicorn", "health", "BaseSettings", "env"],
236
- "expected_sources": ["fastapi_deployment.md", "fastapi_configuration.md"],
 
 
 
 
 
 
 
 
 
237
  "category": "retrieval",
238
  "difficulty": "hard",
239
  "requires_calculator": false,
@@ -242,17 +420,32 @@
242
  {
243
  "id": "q025",
244
  "question": "How would you build a paginated API with cursor-based navigation, response model validation, and background task processing for analytics logging?",
245
- "expected_answer_keywords": ["cursor", "response_model", "BackgroundTasks"],
246
- "expected_sources": ["fastapi_pagination.md", "fastapi_response_model.md", "fastapi_background_tasks.md"],
 
 
 
 
 
 
 
 
247
  "category": "retrieval",
248
  "difficulty": "hard",
249
  "requires_calculator": false,
250
- "reference_answer": "Cursor-based pagination uses an opaque cursor token for navigation instead of page numbers. Response models are validated using the response_model parameter on route decorators, and analytics logging is handled asynchronously via FastAPI's BackgroundTasks dependency."
 
 
 
251
  },
252
  {
253
  "id": "q026",
254
  "question": "Does FastAPI have built-in support for database migrations like Alembic?",
255
- "expected_answer_keywords": ["not", "does not contain", "no information"],
 
 
 
 
256
  "expected_sources": [],
257
  "category": "out_of_scope",
258
  "difficulty": "easy",
@@ -262,11 +455,16 @@
262
  {
263
  "id": "q027",
264
  "question": "How does FastAPI handle automatic load balancing across multiple servers?",
265
- "expected_answer_keywords": ["not", "does not contain", "no information"],
 
 
 
 
266
  "expected_sources": [],
267
  "category": "out_of_scope",
268
  "difficulty": "easy",
269
  "requires_calculator": false,
270
- "reference_answer": ""
 
271
  }
272
  ]
 
2
  {
3
  "id": "q001",
4
  "question": "How do you define a path parameter in FastAPI?",
5
+ "expected_answer_keywords": [
6
+ "curly braces",
7
+ "path",
8
+ "function parameter",
9
+ "URL"
10
+ ],
11
+ "expected_sources": [
12
+ "fastapi_path_params.md"
13
+ ],
14
  "category": "retrieval",
15
  "difficulty": "easy",
16
  "requires_calculator": false,
 
19
  {
20
  "id": "q002",
21
  "question": "What is the default page size for pagination in FastAPI and what is the maximum allowed?",
22
+ "expected_answer_keywords": [
23
+ "20",
24
+ "100",
25
+ "default",
26
+ "maximum"
27
+ ],
28
+ "expected_sources": [
29
+ "fastapi_pagination.md"
30
+ ],
31
  "category": "retrieval",
32
  "difficulty": "easy",
33
  "requires_calculator": false,
 
36
  {
37
  "id": "q003",
38
  "question": "How does FastAPI handle CORS and what is the default max_age for preflight caching?",
39
+ "expected_answer_keywords": [
40
+ "CORSMiddleware",
41
+ "600",
42
+ "seconds",
43
+ "preflight"
44
+ ],
45
+ "expected_sources": [
46
+ "fastapi_middleware.md"
47
+ ],
48
  "category": "retrieval",
49
  "difficulty": "easy",
50
  "requires_calculator": false,
 
53
  {
54
  "id": "q004",
55
  "question": "What algorithm and expiry time does the FastAPI security example use for JWT tokens?",
56
+ "expected_answer_keywords": [
57
+ "HS256",
58
+ "30",
59
+ "minutes"
60
+ ],
61
+ "expected_sources": [
62
+ "fastapi_security.md"
63
+ ],
64
  "category": "retrieval",
65
  "difficulty": "medium",
66
  "requires_calculator": false,
 
69
  {
70
  "id": "q005",
71
  "question": "What is the recommended formula for calculating the number of Gunicorn workers for a FastAPI deployment?",
72
+ "expected_answer_keywords": [
73
+ "2",
74
+ "CPU",
75
+ "cores",
76
+ "1"
77
+ ],
78
+ "expected_sources": [
79
+ "fastapi_deployment.md"
80
+ ],
81
  "category": "retrieval",
82
  "difficulty": "medium",
83
  "requires_calculator": false,
 
86
  {
87
  "id": "q006",
88
  "question": "How does dependency caching work in FastAPI, and how can you disable it?",
89
+ "expected_answer_keywords": [
90
+ "cache",
91
+ "once",
92
+ "use_cache",
93
+ "False"
94
+ ],
95
+ "expected_sources": [
96
+ "fastapi_dependencies.md"
97
+ ],
98
  "category": "retrieval",
99
  "difficulty": "medium",
100
  "requires_calculator": false,
101
+ "reference_answer": "FastAPI caches dependency results so each dependency is called only once per request. Caching can be disabled by setting use_cache=False in the Depends() call.",
102
+ "source_snippets": [
103
+ "By default, if the same dependency is used multiple times within a single request (e.g., both a route and a sub-dependency use `Depends(get_db)`), FastAPI caches the result and calls the dependency only once. To disable caching and force a fresh call each time, use `Depends(get_db, use_cache=False)`."
104
+ ]
105
  },
106
  {
107
  "id": "q007",
108
  "question": "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?",
109
+ "expected_answer_keywords": [
110
+ "500",
111
+ "334",
112
+ "ceil",
113
+ "pages"
114
+ ],
115
+ "expected_sources": [
116
+ "fastapi_pagination.md"
117
+ ],
118
  "category": "calculation",
119
  "difficulty": "medium",
120
  "requires_calculator": true,
 
123
  {
124
  "id": "q008",
125
  "question": "Does FastAPI support automatic Kubernetes deployment?",
126
+ "expected_answer_keywords": [
127
+ "not",
128
+ "does not contain",
129
+ "no information"
130
+ ],
131
  "expected_sources": [],
132
  "category": "out_of_scope",
133
  "difficulty": "easy",
 
137
  {
138
  "id": "q009",
139
  "question": "How does FastAPI integrate with Apache Kafka for event streaming?",
140
+ "expected_answer_keywords": [
141
+ "not",
142
+ "does not contain",
143
+ "no information"
144
+ ],
145
  "expected_sources": [],
146
  "category": "out_of_scope",
147
  "difficulty": "easy",
 
151
  {
152
  "id": "q010",
153
  "question": "Can FastAPI generate GraphQL schemas natively?",
154
+ "expected_answer_keywords": [
155
+ "not",
156
+ "does not contain",
157
+ "no information"
158
+ ],
159
  "expected_sources": [],
160
  "category": "out_of_scope",
161
  "difficulty": "easy",
162
  "requires_calculator": false,
163
+ "reference_answer": "",
164
+ "source_snippets": []
165
  },
166
  {
167
  "id": "q011",
168
  "question": "What is the default Swagger UI endpoint in FastAPI?",
169
+ "expected_answer_keywords": [
170
+ "/docs",
171
+ "Swagger",
172
+ "interactive"
173
+ ],
174
+ "expected_sources": [
175
+ "fastapi_openapi.md"
176
+ ],
177
  "category": "retrieval",
178
  "difficulty": "easy",
179
  "requires_calculator": false,
180
+ "reference_answer": "The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface.",
181
+ "source_snippets": [
182
+ "| `/docs` | Swagger UI -- interactive API explorer |",
183
+ "Every FastAPI application exposes three documentation-related endpoints by default:"
184
+ ]
185
  },
186
  {
187
  "id": "q012",
188
  "question": "How do you raise an HTTP error in a FastAPI route handler?",
189
+ "expected_answer_keywords": [
190
+ "HTTPException",
191
+ "status_code",
192
+ "detail"
193
+ ],
194
+ "expected_sources": [
195
+ "fastapi_error_handling.md"
196
+ ],
197
  "category": "retrieval",
198
  "difficulty": "easy",
199
  "requires_calculator": false,
200
+ "reference_answer": "You raise an HTTP error in FastAPI by raising an HTTPException with a status_code and a detail message describing the error.",
201
+ "source_snippets": [
202
+ "The `HTTPException` class is the primary way to return error responses from route handlers:",
203
+ "When raised, `HTTPException` immediately terminates request processing and returns the specified status code and detail message. The `detail` parameter can be a string, list, or dictionary -- FastAPI serializes it to JSON automatically."
204
+ ]
205
  },
206
  {
207
  "id": "q013",
208
  "question": "How do you define a request body in FastAPI?",
209
+ "expected_answer_keywords": [
210
+ "Pydantic",
211
+ "BaseModel",
212
+ "JSON"
213
+ ],
214
+ "expected_sources": [
215
+ "fastapi_request_body.md"
216
+ ],
217
  "category": "retrieval",
218
  "difficulty": "easy",
219
  "requires_calculator": false,
 
222
  {
223
  "id": "q014",
224
  "question": "What testing tools does FastAPI use, and what class provides the test client?",
225
+ "expected_answer_keywords": [
226
+ "TestClient",
227
+ "pytest",
228
+ "Starlette"
229
+ ],
230
+ "expected_sources": [
231
+ "fastapi_testing.md"
232
+ ],
233
  "category": "retrieval",
234
  "difficulty": "easy",
235
  "requires_calculator": false,
 
238
  {
239
  "id": "q015",
240
  "question": "How does FastAPI manage application configuration and environment variables?",
241
+ "expected_answer_keywords": [
242
+ "BaseSettings",
243
+ "pydantic",
244
+ "env",
245
+ "environment"
246
+ ],
247
+ "expected_sources": [
248
+ "fastapi_configuration.md"
249
+ ],
250
  "category": "retrieval",
251
  "difficulty": "medium",
252
  "requires_calculator": false,
 
255
  {
256
  "id": "q016",
257
  "question": "What is the minimum response size for GZip compression middleware in FastAPI, and how do you enable it?",
258
+ "expected_answer_keywords": [
259
+ "500",
260
+ "bytes",
261
+ "GZipMiddleware",
262
+ "minimum_size"
263
+ ],
264
+ "expected_sources": [
265
+ "fastapi_middleware.md"
266
+ ],
267
  "category": "retrieval",
268
  "difficulty": "medium",
269
  "requires_calculator": false,
 
272
  {
273
  "id": "q017",
274
  "question": "How do yield dependencies work in FastAPI and what is the maximum number supported per request?",
275
+ "expected_answer_keywords": [
276
+ "yield",
277
+ "cleanup",
278
+ "finally",
279
+ "32"
280
+ ],
281
+ "expected_sources": [
282
+ "fastapi_dependencies.md"
283
+ ],
284
  "category": "retrieval",
285
  "difficulty": "medium",
286
  "requires_calculator": false,
 
289
  {
290
  "id": "q018",
291
  "question": "What are the three documentation endpoints FastAPI exposes by default and what OpenAPI version does it use?",
292
+ "expected_answer_keywords": [
293
+ "/docs",
294
+ "/redoc",
295
+ "/openapi.json",
296
+ "3.1"
297
+ ],
298
+ "expected_sources": [
299
+ "fastapi_openapi.md"
300
+ ],
301
  "category": "retrieval",
302
  "difficulty": "medium",
303
  "requires_calculator": false,
 
306
  {
307
  "id": "q019",
308
  "question": "How does FastAPI handle WebSocket connections, and what must be called before sending data?",
309
+ "expected_answer_keywords": [
310
+ "accept",
311
+ "WebSocket",
312
+ "send",
313
+ "receive"
314
+ ],
315
+ "expected_sources": [
316
+ "fastapi_websockets.md"
317
+ ],
318
  "category": "retrieval",
319
  "difficulty": "medium",
320
  "requires_calculator": false,
 
323
  {
324
  "id": "q020",
325
  "question": "For a server with 4 CPU cores, how many Gunicorn workers should be configured using the recommended formula?",
326
+ "expected_answer_keywords": [
327
+ "9",
328
+ "workers",
329
+ "2",
330
+ "CPU",
331
+ "1"
332
+ ],
333
+ "expected_sources": [
334
+ "fastapi_deployment.md"
335
+ ],
336
  "category": "calculation",
337
  "difficulty": "medium",
338
  "requires_calculator": true,
 
341
  {
342
  "id": "q021",
343
  "question": "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?",
344
+ "expected_answer_keywords": [
345
+ "10",
346
+ "minutes"
347
+ ],
348
+ "expected_sources": [
349
+ "fastapi_middleware.md"
350
+ ],
351
  "category": "calculation",
352
  "difficulty": "easy",
353
  "requires_calculator": true,
354
+ "reference_answer": "With a CORS max_age of 600 seconds, the browser caches preflight results for 10 minutes (600 / 60 = 10).",
355
+ "source_snippets": [
356
+ "| `max_age` | `600` | Seconds the browser caches preflight results |"
357
+ ]
358
  },
359
  {
360
  "id": "q022",
361
  "question": "How do route ordering and dependency injection interact when building a secure FastAPI application with scoped endpoints?",
362
+ "expected_answer_keywords": [
363
+ "order",
364
+ "Depends",
365
+ "Security",
366
+ "scopes"
367
+ ],
368
+ "expected_sources": [
369
+ "fastapi_path_params.md",
370
+ "fastapi_dependencies.md",
371
+ "fastapi_security.md"
372
+ ],
373
  "category": "retrieval",
374
  "difficulty": "hard",
375
  "requires_calculator": false,
 
378
  {
379
  "id": "q023",
380
  "question": "How would you set up a FastAPI application with custom error handling, CORS middleware, and structured testing including dependency overrides?",
381
+ "expected_answer_keywords": [
382
+ "HTTPException",
383
+ "CORSMiddleware",
384
+ "TestClient",
385
+ "override"
386
+ ],
387
+ "expected_sources": [
388
+ "fastapi_error_handling.md",
389
+ "fastapi_middleware.md",
390
+ "fastapi_testing.md"
391
+ ],
392
  "category": "retrieval",
393
  "difficulty": "hard",
394
  "requires_calculator": false,
395
+ "reference_answer": "Custom error handling is set up by raising HTTPException or registering exception handlers, CORS is configured by adding CORSMiddleware with allowed origins, and testing uses TestClient with app.dependency_overrides to replace dependencies during tests.",
396
+ "source_snippets": [
397
+ "The `HTTPException` class is the primary way to return error responses from route handlers:",
398
+ "Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette:"
399
+ ]
400
  },
401
  {
402
  "id": "q024",
403
  "question": "Explain how to deploy a FastAPI app with Docker using Gunicorn workers, health checks, and environment-based configuration via Pydantic Settings.",
404
+ "expected_answer_keywords": [
405
+ "Docker",
406
+ "Gunicorn",
407
+ "health",
408
+ "BaseSettings",
409
+ "env"
410
+ ],
411
+ "expected_sources": [
412
+ "fastapi_deployment.md",
413
+ "fastapi_configuration.md"
414
+ ],
415
  "category": "retrieval",
416
  "difficulty": "hard",
417
  "requires_calculator": false,
 
420
  {
421
  "id": "q025",
422
  "question": "How would you build a paginated API with cursor-based navigation, response model validation, and background task processing for analytics logging?",
423
+ "expected_answer_keywords": [
424
+ "cursor",
425
+ "response_model",
426
+ "BackgroundTasks"
427
+ ],
428
+ "expected_sources": [
429
+ "fastapi_pagination.md",
430
+ "fastapi_response_model.md",
431
+ "fastapi_background_tasks.md"
432
+ ],
433
  "category": "retrieval",
434
  "difficulty": "hard",
435
  "requires_calculator": false,
436
+ "reference_answer": "Cursor-based pagination uses an opaque cursor token for navigation instead of page numbers. Response models are validated using the response_model parameter on route decorators, and analytics logging is handled asynchronously via FastAPI's BackgroundTasks dependency.",
437
+ "source_snippets": [
438
+ "Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets:"
439
+ ]
440
  },
441
  {
442
  "id": "q026",
443
  "question": "Does FastAPI have built-in support for database migrations like Alembic?",
444
+ "expected_answer_keywords": [
445
+ "not",
446
+ "does not contain",
447
+ "no information"
448
+ ],
449
  "expected_sources": [],
450
  "category": "out_of_scope",
451
  "difficulty": "easy",
 
455
  {
456
  "id": "q027",
457
  "question": "How does FastAPI handle automatic load balancing across multiple servers?",
458
+ "expected_answer_keywords": [
459
+ "not",
460
+ "does not contain",
461
+ "no information"
462
+ ],
463
  "expected_sources": [],
464
  "category": "out_of_scope",
465
  "difficulty": "easy",
466
  "requires_calculator": false,
467
+ "reference_answer": "",
468
+ "source_snippets": []
469
  }
470
  ]
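The three `calculation` items in this dataset (q007, q020, q021) carry numeric keywords that can be checked by hand. The following is a minimal sketch of that arithmetic, not code from agent_bench: the function names are hypothetical, and the Gunicorn formula `(2 x CPU cores) + 1` is an assumption inferred from the q005/q020 keywords, since `fastapi_deployment.md` itself is not shown in this diff.

```python
import math


def total_pages(total_items: int, page_size: int) -> int:
    """Pages needed to hold total_items at a given page size (q007)."""
    return math.ceil(total_items / page_size)


def gunicorn_workers(cpu_cores: int) -> int:
    """Assumed formula (2 x cores) + 1, inferred from the q005/q020 keywords."""
    return 2 * cpu_cores + 1


def seconds_to_minutes(seconds: int) -> float:
    """Unit conversion used by q021."""
    return seconds / 60


if __name__ == "__main__":
    print(total_pages(10_000, 20))  # q007, page size 20
    print(total_pages(10_000, 30))  # q007, page size 30
    print(gunicorn_workers(4))      # q020, 4 CPU cores
    print(seconds_to_minutes(600))  # q021, max_age of 600 seconds
```

Under these assumptions the results line up with the `expected_answer_keywords` above: 500 and 334 pages, 9 workers, and 10 minutes.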
agent_bench/evaluation/harness.py CHANGED
@@ -8,8 +8,13 @@ from pathlib import Path
8
  from pydantic import BaseModel, Field
9
 
10
  from agent_bench.agents.orchestrator import Orchestrator
 
11
  from agent_bench.core.provider import LLMProvider
12
  from agent_bench.core.types import TokenUsage
 
 
 
 
13
  from agent_bench.evaluation.metrics import (
14
  calculator_used_when_expected,
15
  citation_accuracy,
@@ -21,6 +26,18 @@ from agent_bench.evaluation.metrics import (
21
  tool_call_count,
22
  )
23
 
24
 
25
  class GoldenQuestion(BaseModel):
26
  id: str
@@ -70,9 +87,13 @@ class EvalResult(BaseModel):
70
  # Raw answer for reporting
71
  answer: str = ""
72
  retrieved_sources: list[str] = []
73
- # LLM judge (None if not run)
74
- faithfulness: float | None = None
75
- correctness: float | None = None
 
 
 
 
76
 
77
 
78
  def load_golden_dataset(path: str | Path) -> list[GoldenQuestion]:
@@ -149,21 +170,46 @@ async def run_evaluation(
149
  retrieved_sources=ranked_sources,
150
  )
151
 
152
- # Optional LLM judge
153
- if judge_provider is not None and q.category != "out_of_scope":
154
- from agent_bench.evaluation.metrics import answer_correctness, answer_faithfulness
155
-
156
- result.faithfulness = await answer_faithfulness(
157
- answer=agent_response.answer,
158
- source_chunks=agent_response.source_chunks,
159
- judge_provider=judge_provider,
160
- )
161
- if q.reference_answer:
162
- result.correctness = await answer_correctness(
163
- answer=agent_response.answer,
164
- reference_answer=q.reference_answer,
165
  judge_provider=judge_provider,
 
 
166
  )
 
 
167
 
168
  results.append(result)
169
 
 
8
  from pydantic import BaseModel, Field
9
 
10
  from agent_bench.agents.orchestrator import Orchestrator
11
+ from agent_bench.core.config import load_config
12
  from agent_bench.core.provider import LLMProvider
13
  from agent_bench.core.types import TokenUsage
14
+ from agent_bench.evaluation.judges.base import Rubric, ScoreResult
15
+ from agent_bench.evaluation.judges.completeness import CompletenessJudge
16
+ from agent_bench.evaluation.judges.groundedness import GroundednessJudge
17
+ from agent_bench.evaluation.judges.relevance import RelevanceJudge
18
  from agent_bench.evaluation.metrics import (
19
  calculator_used_when_expected,
20
  citation_accuracy,
 
26
  tool_call_count,
27
  )
28
 
29
+ # Annotating this as type[Judge] would lose concrete-class info and trigger
30
+ # mypy's "cannot instantiate abstract class" on the dispatch site below.
31
+ # The dict's runtime values are concrete, instantiable subclasses; the
32
+ # explicit union annotation below preserves that information.
33
+ _JUDGE_CLASS_BY_DIMENSION: dict[
34
+ str, type[GroundednessJudge] | type[RelevanceJudge] | type[CompletenessJudge]
35
+ ] = {
36
+ "groundedness": GroundednessJudge,
37
+ "relevance": RelevanceJudge,
38
+ "completeness": CompletenessJudge,
39
+ }
40
+
41
 
42
  class GoldenQuestion(BaseModel):
43
  id: str
 
87
  # Raw answer for reporting
88
  answer: str = ""
89
  retrieved_sources: list[str] = []
90
+ # New in judge-layer v1: per-dimension judge scores. Empty when no
91
+ # judge_provider is configured. With a provider, OOS items receive
92
+ # relevance only (refusal-vs-engagement is the L2 signal worth
93
+ # measuring); reference-based dimensions (groundedness, completeness)
94
+ # are skipped on OOS. Completeness is also skipped when
95
+ # reference_answer is empty regardless of category.
96
+ judge_scores: dict[str, ScoreResult] = Field(default_factory=dict)
97
 
98
 
99
  def load_golden_dataset(path: str | Path) -> list[GoldenQuestion]:
 
170
  retrieved_sources=ranked_sources,
171
  )
172
 
173
+ # Optional L2 LLM-judge layer (per-dimension; gated per-dim).
174
+ #
175
+ # OOS items get relevance scoring (a non-refusal answer to an OOS
176
+ # question is exactly what relevance is designed to detect — the
177
+ # rubric's "refusal that ignores the question" example covers this
178
+ # case). Groundedness and completeness are skipped on OOS because
179
+ # neither has a meaningful reference (no source_snippets, no
180
+ # reference_answer for OOS items).
181
+ #
182
+ # This per-dimension gating matches the calibration runner's
183
+ # behavior so the κ table's distribution of scored items lines up
184
+ # with what the production harness produces. Diverging gates would
185
+ # mean the calibration κ for relevance was estimated on items the
186
+ # production harness never sees, breaking the supersession's
187
+ # empirical backing.
188
+ if judge_provider is not None:
189
+ cfg = load_config()
190
+ rubric_dir = Path(__file__).resolve().parent / "rubrics"
191
+ is_oos = q.category == "out_of_scope"
192
+ for dim in cfg.evaluation.judge_dimensions:
193
+ if dim not in _JUDGE_CLASS_BY_DIMENSION:
194
+ continue # citation_faithfulness opt-in; not in default loop
195
+ # Per-dimension OOS gating: skip reference-based dimensions
196
+ # (groundedness, completeness) on OOS; allow relevance.
197
+ if is_oos and dim != "relevance":
198
+ continue
199
+ # CompletenessJudge is reference-based on q.reference_answer;
200
+ # scoring an empty reference is guaranteed-noisy and burns
201
+ # tokens. Pre-supersession code had the same gate (correctness
202
+ # was conditional on reference_answer being non-empty).
203
+ if dim == "completeness" and not q.reference_answer:
204
+ continue
205
+ rubric = Rubric.from_markdown_file(rubric_dir / f"{dim}.md")
206
+ judge = _JUDGE_CLASS_BY_DIMENSION[dim](
207
  judge_provider=judge_provider,
208
+ rubric=rubric,
209
+ model_id=getattr(judge_provider, "model", "unknown"),
210
  )
211
+ score_result = await judge.score(q, agent_response)
212
+ result.judge_scores[dim] = score_result
213
 
214
  results.append(result)
215
 
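The per-dimension gating in the `run_evaluation` hunk above can be read as a pure function of the question's category, its reference answer, and the configured dimensions. The sketch below restates that logic in isolation; `dimensions_to_score` and `KNOWN_DIMENSIONS` are hypothetical names introduced for illustration, and the real harness performs these checks inline while constructing judges.

```python
# Dimensions the default loop can dispatch; mirrors the keys of
# _JUDGE_CLASS_BY_DIMENSION in the diff (citation_faithfulness is opt-in).
KNOWN_DIMENSIONS = ("groundedness", "relevance", "completeness")


def dimensions_to_score(
    category: str,
    reference_answer: str,
    configured: list[str],
) -> list[str]:
    """Return the configured dimensions that would actually be scored."""
    selected: list[str] = []
    for dim in configured:
        if dim not in KNOWN_DIMENSIONS:
            continue  # not handled by the default loop
        if category == "out_of_scope" and dim != "relevance":
            continue  # reference-based dimensions are skipped on OOS items
        if dim == "completeness" and not reference_answer:
            continue  # nothing to compare the answer against
        selected.append(dim)
    return selected


if __name__ == "__main__":
    dims = ["groundedness", "relevance", "completeness", "citation_faithfulness"]
    print(dimensions_to_score("out_of_scope", "", dims))
    print(dimensions_to_score("retrieval", "some reference", dims))
    print(dimensions_to_score("retrieval", "", dims))
```

Keeping the gate as a standalone function like this would also make it easier to assert that the calibration runner and the production harness select the same dimensions, which is the property the comment in the hunk is concerned with.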
agent_bench/evaluation/judges/__init__.py ADDED
@@ -0,0 +1,25 @@
1
+ """Discrete-scale per-dimension LLM judges with anchored rubrics."""
2
+
3
+ from agent_bench.evaluation.judges.base import (
4
+ ABSTAIN_REASON_GENUINE,
5
+ ABSTAIN_REASON_OUT_OF_RANGE,
6
+ ABSTAIN_REASON_PROVIDER_EXHAUSTED,
7
+ ABSTAIN_REASON_SCHEMA_PARSE,
8
+ Judge,
9
+ MockJudge,
10
+ Rubric,
11
+ RubricLevel,
12
+ ScoreResult,
13
+ )
14
+
15
+ __all__ = [
16
+ "ABSTAIN_REASON_GENUINE",
17
+ "ABSTAIN_REASON_OUT_OF_RANGE",
18
+ "ABSTAIN_REASON_PROVIDER_EXHAUSTED",
19
+ "ABSTAIN_REASON_SCHEMA_PARSE",
20
+ "Judge",
21
+ "MockJudge",
22
+ "Rubric",
23
+ "RubricLevel",
24
+ "ScoreResult",
25
+ ]
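The four `ABSTAIN_REASON_*` constants exported above are string prefixes on `ScoreResult.reasoning`, which a report can use to separate infrastructure failures from genuine model abstentions. The sketch below shows one way such a breakdown might be computed. The prefix values are copied from `base.py` in this commit, but `classify_abstain` and `abstain_breakdown` are hypothetical helpers, and treating "no failure prefix" as a genuine abstain is an assumption about how the empty-string sentinel is meant to be read.

```python
from collections import Counter

# Prefix values copied from agent_bench/evaluation/judges/base.py in this commit.
ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: "
ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "

# Only the non-empty prefixes are matched explicitly: every string starts
# with "", so the genuine-abstain sentinel has to be the fallback case.
_FAILURE_PREFIXES = {
    "provider_exhausted": ABSTAIN_REASON_PROVIDER_EXHAUSTED,
    "schema_parse": ABSTAIN_REASON_SCHEMA_PARSE,
    "out_of_range": ABSTAIN_REASON_OUT_OF_RANGE,
}


def classify_abstain(reasoning: str) -> str:
    """Map an abstained result's reasoning string to one of four buckets."""
    for label, prefix in _FAILURE_PREFIXES.items():
        if reasoning.startswith(prefix):
            return label
    return "genuine"  # assumption: no failure prefix means a model abstain


def abstain_breakdown(reasonings: list[str]) -> Counter:
    """Count abstentions per bucket for a batch of abstained results."""
    return Counter(classify_abstain(r) for r in reasonings)


if __name__ == "__main__":
    sample = [
        "judge_call_failed_after_retry: timeout",
        "schema_parse_failed_after_retry: missing score",
        "The answer cannot be assessed from the provided context.",
    ]
    print(abstain_breakdown(sample))
```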
agent_bench/evaluation/judges/base.py ADDED
@@ -0,0 +1,628 @@
1
+ """Judge ABC, ScoreResult, Rubric, MockJudge, abstain-reason constants.
2
+
3
+ The Judge layer supersedes the continuous-scale answer_faithfulness /
4
+ answer_correctness functions in agent_bench/evaluation/metrics.py. See
5
+ docs/plans/2026-05-04-judge-layer-v1-design.md for the supersession
6
+ rationale and the six-axis comparison table.
7
+ """
8
+
9
+ from __future__ import annotations
10
+
11
+ import hashlib
12
+ import json as _json
13
+ import random
14
+ import re
15
+ import time
16
+ from abc import ABC, abstractmethod
17
+ from pathlib import Path
18
+ from typing import TYPE_CHECKING, Literal, Self
19
+
20
+ import structlog
21
+ import yaml
22
+ from pydantic import BaseModel, Field
23
+
24
+ from agent_bench.core.provider import (
25
+ ProviderRateLimitError,
26
+ ProviderTimeoutError,
27
+ )
28
+ from agent_bench.core.types import Message, Role
29
+
30
+ if TYPE_CHECKING:
31
+ from agent_bench.agents.orchestrator import AgentResponse
32
+ from agent_bench.core.provider import LLMProvider
33
+ from agent_bench.evaluation.harness import GoldenQuestion
34
+
35
+ logger = structlog.get_logger()
36
+
37
+ # --- Abstain-reason constants ---
38
+ #
39
+ # Failure-as-abstain ScoreResults carry a reasoning string with one of
40
+ # these prefixes. The calibration report pattern-matches against these
41
+ # constants for the four-way breakdown in the >20% abstain-rate flag.
42
+ # Genuine model abstain (rubric-allowed) uses the empty-string sentinel.
43
+
44
+ ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: "
45
+ ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
46
+ ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "
47
+ ABSTAIN_REASON_GENUINE = ""
48
+
49
+
50
+ class ScoreResult(BaseModel):
51
+ """One judge call's result. Self-contained provenance — no run
52
+ metadata cross-reference needed for κ aggregation.
53
+
54
+ Field order matters: reasoning + evidence_quotes come BEFORE score
55
+ in both Pydantic field order and the JSON schema sent to the model,
56
+ so the score conditions on the reasoning rather than being
57
+ post-hoc rationalized.
58
+ """
59
+
60
+ # Reasoning-first ordering — load-bearing for the JSON schema
61
+ reasoning: str
62
+ evidence_quotes: list[str] = Field(default_factory=list)
63
+ score: int | Literal["Unknown"]
64
+
65
+ # Provenance
66
+ judge_id: str
67
+ rubric_version: str
68
+ prompt_seed: int = 0
69
+ system_output_hash: str
70
+
71
+ # Operations
72
+ cost_usd: float
73
+ latency_ms: float
74
+
75
+ @property
76
+ def abstained(self) -> bool:
77
+ return self.score == "Unknown"
78
+
79
+
80
+ _FENCE_PATTERN = re.compile(r"^```[^\n]*\n.*?^```\n?", re.MULTILINE | re.DOTALL)
81
+
82
+
83
+ def _mask_code_fences(text: str) -> str:
84
+ """Replace fenced code blocks (``` ... ```) with same-length whitespace,
85
+ preserving newlines so byte offsets align with the original. Used by
86
+ the rubric loader to skip fenced ``## Score N`` literals when scanning
87
+ for structural level headers.
88
+ """
89
+
90
+ def _replace(match: re.Match[str]) -> str:
91
+ return "".join("\n" if c == "\n" else " " for c in match.group(0))
92
+
93
+ return _FENCE_PATTERN.sub(_replace, text)
94
+
95
+
96
+ class RubricLevel(BaseModel):
97
+ """One score level in a rubric, with anchored examples.
98
+
99
+ Parsed from markdown sections under `## Score N` headers. The
100
+ `examples` list contains the H3 sub-sections (`### Example X`)
101
+ each with a thinking-trace explanation of why that output got
102
+ that score.
103
+ """
104
+
105
+ score: int
106
+ description: str
107
+ examples: list[str] # raw markdown of `### Example` sections
108
+
109
+
110
+ class Rubric(BaseModel):
111
+ """A scoring rubric loaded from a markdown file with YAML frontmatter.
112
+
113
+ Construction validates aggressively: scale ∈ {binary, three_point},
114
+ levels arity matches scale, every level has at least one anchored
115
+ example. The loader raises ValueError naming the file and the failing field so a
116
+ Day-1 rubric typo doesn't surface as a Day-2 judge.score crash with
117
+ API budget already spent.
118
+ """
119
+
120
+ dimension: Literal[
121
+ "groundedness", "relevance", "completeness", "citation_faithfulness"
122
+ ]
123
+ scale: Literal["binary", "three_point"]
124
+ reference_based: bool
125
+ abstain_allowed: bool
126
+ levels: list[RubricLevel]
127
+ body_markdown: str
128
+
129
+ @property
130
+ def source_hash(self) -> str:
131
+ """SHA-256 of the canonical body. Immutable per file content,
132
+ independent of git state. Used as ScoreResult.rubric_version.
133
+ """
134
+ return hashlib.sha256(self.body_markdown.encode("utf-8")).hexdigest()
135
+
136
+ @classmethod
137
+ def from_markdown_file(cls, path: Path | str) -> Self:
138
+ path = Path(path)
139
+ body = path.read_text(encoding="utf-8")
140
+
141
+ # Parse YAML frontmatter delimited by --- ... ---
142
+ fm_match = re.match(r"^---\n(.+?)\n---\n(.*)$", body, re.DOTALL)
143
+ if not fm_match:
144
+ raise ValueError(
145
+ f"Rubric {path.name}: missing YAML frontmatter "
146
+ f"(expected --- ... --- block at top of file)"
147
+ )
148
+ try:
149
+ frontmatter = yaml.safe_load(fm_match.group(1)) or {}
150
+ except yaml.YAMLError as e:
151
+ raise ValueError(
152
+ f"Rubric {path.name}: frontmatter YAML parse error: {e}"
153
+ ) from e
154
+
155
+ required = {"dimension", "scale", "reference_based", "abstain_allowed"}
156
+ missing = required - frontmatter.keys()
157
+ if missing:
158
+ raise ValueError(
159
+ f"Rubric {path.name}: frontmatter missing fields: {sorted(missing)}"
160
+ )
161
+
162
+ scale = frontmatter["scale"]
163
+ if scale not in ("binary", "three_point"):
164
+ raise ValueError(
165
+ f"Rubric {path.name}: invalid scale {scale!r}; "
166
+ f"must be 'binary' or 'three_point'"
167
+ )
168
+
169
+ # Parse levels by ## Score N headers. Mask fenced code blocks first
170
+ # so a literal "## Score N" inside an example's code fence is not
171
+ # interpreted as a structural level header. The mask preserves byte
172
+ # offsets (replacing non-newline chars with spaces) so we can slice
173
+ # the original `body_no_fm` at the masked-text header positions to
174
+ # recover level bodies with their fenced content intact.
175
+ body_no_fm = fm_match.group(2)
176
+ masked_body = _mask_code_fences(body_no_fm)
177
+ header_pattern = re.compile(r"^## Score (\d+)\n", re.MULTILINE)
178
+ header_matches = list(header_pattern.finditer(masked_body))
179
+ raw_levels: list[tuple[int, str]] = []
180
+ for i, m in enumerate(header_matches):
181
+ start = m.end()
182
+ end = (
183
+ header_matches[i + 1].start()
184
+ if i + 1 < len(header_matches)
185
+ else len(body_no_fm)
186
+ )
187
+ raw_levels.append((int(m.group(1)), body_no_fm[start:end]))
188
+
189
+ expected_arity = 2 if scale == "binary" else 3
190
+ if len(raw_levels) != expected_arity:
191
+ raise ValueError(
192
+ f"Rubric {path.name}: arity mismatch — scale {scale!r} "
193
+ f"requires {expected_arity} levels, found {len(raw_levels)}"
194
+ )
195
+
196
+ # Parse examples (### Example) per level
197
+ levels: list[RubricLevel] = []
198
+ for score, level_body in raw_levels:
199
+ example_pattern = re.compile(
200
+ r"^### (Example .+?)\n(.*?)(?=^### |\Z)", re.MULTILINE | re.DOTALL
201
+ )
202
+ examples = [m.group(0) for m in example_pattern.finditer(level_body)]
203
+ if not examples:
204
+ raise ValueError(
205
+ f"Rubric {path.name}: level Score {score} has no "
206
+ f"anchored example (expected at least one ### Example header)"
207
+ )
208
+ description = level_body.split("###", 1)[0].strip()
209
+ levels.append(
210
+ RubricLevel(score=score, description=description, examples=examples)
211
+ )
212
+
213
+ return cls(
214
+ dimension=frontmatter["dimension"],
215
+ scale=scale,
216
+ reference_based=bool(frontmatter["reference_based"]),
217
+ abstain_allowed=bool(frontmatter["abstain_allowed"]),
218
+ levels=levels,
219
+ body_markdown=body,
220
+ )
221
+
222
+ def render_prompt(self, *, level_permutation_seed: int = 0) -> str:
223
+ """Render the rubric body for inclusion in a judge prompt.
224
+
225
+ If level_permutation_seed is nonzero, levels are reordered deterministically
226
+ using a seeded PRNG. seed=0 returns the canonical order.
227
+ """
228
+ if level_permutation_seed == 0:
229
+ return self.body_markdown
230
+ rng = random.Random(level_permutation_seed)
231
+ permuted_levels = list(self.levels)
232
+ rng.shuffle(permuted_levels)
233
+ # Reconstruct: keep frontmatter + intro paragraphs intact;
234
+ # reorder the ## Score N sections.
235
+ fm_match = re.match(r"^(---\n.+?\n---\n)(.*)$", self.body_markdown, re.DOTALL)
236
+ if not fm_match:
237
+ return self.body_markdown # defensive — should never happen post-construction
238
+ head = fm_match.group(1)
239
+ rest = fm_match.group(2)
240
+ intro = re.split(r"^## Score ", rest, maxsplit=1, flags=re.MULTILINE)[0]
241
+ permuted_body = head + intro + "\n".join(
242
+ f"## Score {lvl.score}\n{lvl.description}\n" + "\n".join(lvl.examples)
243
+ for lvl in permuted_levels
244
+ )
245
+ return permuted_body
246
+
247
+ def strip_anchors(self) -> Self:
248
+ """Return a new Rubric with anchored examples removed from every
249
+ level (and a regenerated body_markdown that omits the ``### Example``
250
+ sections). Used by the calibration runner's `use_anchors=false`
251
+ ablation row to measure the contribution of anchored examples.
252
+
253
+ source_hash naturally diverges because body_markdown changes — so
254
+ ScoreResults from the stripped rubric carry a different
255
+ rubric_version, and the calibration report can bucket them
256
+ correctly without requiring a separate provenance field.
257
+ """
258
+ fm_match = re.match(r"^(---\n.+?\n---\n)(.*)$", self.body_markdown, re.DOTALL)
259
+ head = fm_match.group(1) if fm_match else ""
260
+ rest = fm_match.group(2) if fm_match else self.body_markdown
261
+ intro = re.split(r"^## Score ", rest, maxsplit=1, flags=re.MULTILINE)[0]
262
+ # Render each level with its description but no examples.
263
+ stripped_body = head + intro + "\n".join(
264
+ f"## Score {lvl.score}\n{lvl.description}\n" for lvl in self.levels
265
+ )
266
+ stripped_levels = [
267
+ RubricLevel(score=lvl.score, description=lvl.description, examples=[])
268
+ for lvl in self.levels
269
+ ]
270
+ return type(self)(
271
+ dimension=self.dimension,
272
+ scale=self.scale,
273
+ reference_based=self.reference_based,
274
+ abstain_allowed=self.abstain_allowed,
275
+ levels=stripped_levels,
276
+ body_markdown=stripped_body,
277
+ )
278
+
279
+
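Review note: the seeded reordering in `render_prompt` can be sketched in isolation. This is a minimal illustration, assuming levels are plain strings; `permute_levels` is a hypothetical helper and not part of the diff. It only shows that seed 0 keeps the canonical order and that any other seed gives a reproducible shuffle.

```python
import random


def permute_levels(levels: list, seed: int = 0) -> list:
    """Return levels in canonical order for seed 0, else a seeded shuffle."""
    if seed == 0:
        return list(levels)
    rng = random.Random(seed)  # private PRNG; global random state is untouched
    permuted = list(levels)
    rng.shuffle(permuted)
    return permuted


levels = ["Score 0", "Score 1", "Score 2"]
# Seed 0 is the canonical order.
assert permute_levels(levels, seed=0) == levels
# The same seed reproduces the same order across calls.
assert permute_levels(levels, seed=7) == permute_levels(levels, seed=7)
# A shuffle only reorders; no level is dropped or duplicated.
assert sorted(permute_levels(levels, seed=7)) == sorted(levels)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the permutation independent of any other seeding in the process.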
280
+ class Judge(ABC):
281
+ """Per-dimension LLM judge. Concrete subclasses implement score()
282
+ for one rubric dimension; they are thin (~30 lines) and not
283
+ factored against a shared base method (see design doc for why).
284
+
285
+ Three calibration knobs are accepted at construction so the
286
+ calibration runner can run baseline-vs-ablation rows from the same
287
+ code path without monkey-patching:
288
+
289
+ - ``use_cot`` (default True) — when False, the JSON schema requested
290
+ from the model omits the ``reasoning`` and ``evidence_quotes``
291
+ fields, ablating the chain-of-thought-before-score discipline.
292
+ - ``abstain_allowed_override`` (default None) — when set, overrides
293
+ the rubric's ``abstain_allowed`` flag for this judge's calls. Used
294
+ by the ``baseline_no_abstain`` ablation row.
295
+ - The ``use_anchors`` knob is implemented by passing a stripped
296
+ rubric (via ``Rubric.strip_anchors()``) at construction time, not
297
+ via a separate flag here — that way ScoreResult.rubric_version
298
+ naturally distinguishes anchored vs stripped variants.
299
+ """
300
+
301
+ def __init__(
302
+ self,
303
+ judge_provider: "LLMProvider",
304
+ rubric: Rubric,
305
+ model_id: str,
306
+ *,
307
+ use_cot: bool = True,
308
+ abstain_allowed_override: bool | None = None,
309
+ ) -> None:
310
+ self.judge_provider = judge_provider
311
+ self.rubric = rubric
312
+ self.model_id = model_id
313
+ self.use_cot = use_cot
314
+ self.abstain_allowed_override = abstain_allowed_override
315
+ # judge_id format: ``{model_id}_{dimension}`` — load-bearing for
316
+ # the calibration report's per-judge κ breakdown. Ablation knobs
317
+ # do NOT enter the judge_id; the row label + ScoreResult.
318
+ # rubric_version (which differs for stripped anchors) carry that
319
+ # signal. This keeps the per-judge bucketing stable across
320
+ # baseline + ablation rows for the same model.
321
+ self.judge_id = f"{model_id}_{rubric.dimension}"
322
+
323
+ @property
324
+ def effective_abstain_allowed(self) -> bool:
325
+ """Whether abstain is permitted for this judge's calls; the
326
+ override (when set) takes precedence over the rubric's flag.
327
+ """
328
+ if self.abstain_allowed_override is not None:
329
+ return self.abstain_allowed_override
330
+ return self.rubric.abstain_allowed
331
+
332
+ def _json_schema_clause(self, valid_scores_str: str) -> str:
333
+ """Render the trailing JSON-schema instruction for the prompt.
334
+
335
+ With ``use_cot=True`` (default) the schema asks for reasoning
336
+ and evidence_quotes before the score, so the model's response
337
+ conditions the score on the reasoning. With ``use_cot=False``
338
+ only the score field is requested — used for the ``no_cot``
339
+ ablation row.
340
+ """
341
+ if self.use_cot:
342
+ return (
343
+ f'JSON object: {{"reasoning": "...", '
344
+ f'"evidence_quotes": [...], "score": {valid_scores_str}}}.'
345
+ )
346
+ return f'JSON object: {{"score": {valid_scores_str}}}.'
347
+
348
+ @abstractmethod
349
+ async def score(
350
+ self,
351
+ item: "GoldenQuestion",
352
+ output: "AgentResponse",
353
+ *,
354
+ prompt_seed: int = 0,
355
+ ) -> ScoreResult:
356
+ """Score one (item, output) pair against this judge's rubric.
357
+
358
+ Returns a ScoreResult whose system_output_hash is computed from
359
+ (item.id, output.answer, sorted(output.sources)). Failures map
360
+ to abstain via the abstain-reason constants; provider non-
361
+ retryable errors raise (caller bug, not noise).
362
+ """
363
+ ...
364
+
365
+
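Review note: the precedence rule behind `effective_abstain_allowed` is small enough to check on its own. The sketch below is a free-function restatement under the assumption that the override is either `None` or a bool; the function signature is illustrative, not the class property itself.

```python
from typing import Optional


def effective_abstain_allowed(rubric_flag: bool, override: Optional[bool]) -> bool:
    # Test `is not None` rather than truthiness, so an explicit False
    # override is honoured instead of falling back to the rubric flag.
    if override is not None:
        return override
    return rubric_flag


assert effective_abstain_allowed(True, None) is True    # no override: rubric wins
assert effective_abstain_allowed(True, False) is False  # explicit False override wins
assert effective_abstain_allowed(False, True) is True   # explicit True override wins
```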
366
+ class MockJudge(Judge):
367
+ """Pre-baked-verdict judge for deterministic tests. No API calls.
368
+
369
+ Constructor takes verdicts: dict[item_id, ScoreResult]. score()
370
+ raises LookupError on missing keys — never returns a default —
371
+ so test fixtures are self-checking. A separate fixture-validation
372
+ test (test_mockjudge_coverage.py) walks item.id across all goldens
373
+ and asserts every MockJudge instance has coverage for the items
374
+ its tests reference.
375
+
376
+ Mirrors the MockProvider pattern at agent_bench/core/provider.py.
377
+ """
378
+
379
+ def __init__(self, verdicts: dict[str, ScoreResult]) -> None:
380
+ # MockJudge needs no provider, rubric, or real model_id; set
381
+ # placeholders directly instead of calling Judge.__init__.
382
+ self.judge_provider = None # type: ignore[assignment]
383
+ self.rubric = None # type: ignore[assignment]
384
+ self.model_id = "mock"
385
+ self.judge_id = "mock_judge"
386
+ self._verdicts = verdicts
387
+
388
+ async def score(
389
+ self,
390
+ item: "GoldenQuestion",
391
+ output: "AgentResponse",
392
+ *,
393
+ prompt_seed: int = 0,
394
+ ) -> ScoreResult:
395
+ if item.id not in self._verdicts:
396
+ raise LookupError(
397
+ f"MockJudge has no pre-baked verdict for item_id {item.id!r}; "
398
+ f"available: {sorted(self._verdicts.keys())[:5]}"
399
+ + (" ..." if len(self._verdicts) > 5 else "")
400
+ )
401
+ return self._verdicts[item.id]
402
+
403
+
404
+ # --- _call_judge_with_retry helper ---
405
+
406
+ _STRICT_REPROMPT_SUFFIX = (
407
+ "\n\nSTRICT FORMATTING NOTE: respond ONLY with a JSON object matching "
408
+ "the schema; reasoning first, then evidence_quotes, then score. "
409
+ "Do not wrap the JSON in a markdown code fence."
410
+ )
411
+
412
+
413
+ _MARKDOWN_FENCE_RE = re.compile(r"^\s*```(?:json|JSON)?\s*\n(.*?)\n```\s*$", re.DOTALL)
414
+
415
+
416
+ def _strip_markdown_fence(text: str) -> str:
417
+ """Strip a leading/trailing ```json ... ``` markdown fence if present.
418
+
419
+ Some chat models wrap structured JSON in a markdown code fence even
420
+ when the prompt asks for a bare JSON object. The judge parser uses
421
+ json.loads on the raw content, which fails at char 0 on the literal
422
+ backtick. This helper unwraps the fence so the parse can proceed.
423
+ Idempotent: returns text unchanged if no fence is present.
424
+ """
425
+ m = _MARKDOWN_FENCE_RE.match(text.strip())
426
+ return m.group(1) if m else text
427
+
428
+
429
+ async def _call_judge_with_retry(
430
+ *,
431
+ provider: "LLMProvider",
432
+ prompt: str,
433
+ valid_scores: set[int],
434
+ judge_id: str,
435
+ rubric_version: str,
436
+ prompt_seed: int,
437
+ system_output_hash: str,
438
+ item_id: str,
439
+ abstain_allowed: bool = True,
440
+ max_tokens: int = 1024,
441
+ ) -> ScoreResult:
442
+ """Send prompt to provider; one retry with strict reprompt on
443
+ schema-parse / score-out-of-range; abstain on persistent failure
444
+ or provider exhaustion. Re-raises unknown exceptions (caller bugs).
445
+
446
+ max_tokens defaults to 1024 (was 512 pre-v1.1). The v1.1 groundedness
447
+ rubric ships with calibration anchors whose verbose thinking traces
448
+ elicit longer model reasoning in turn; 512 truncated the JSON
449
+ response mid-reasoning and caused 78/82 schema_parse_failed
450
+ abstains in the first run after the rubric clarification. 1024 leaves
451
+ enough headroom; bump again if a future rubric revision pushes
452
+ reasoning longer.
453
+ """
454
+ accumulated_cost = 0.0
455
+ accumulated_latency = 0.0
456
+
457
+ for attempt in range(2): # 2 = original + one retry
458
+ send_prompt = prompt if attempt == 0 else prompt + _STRICT_REPROMPT_SUFFIX
459
+ start = time.perf_counter()
460
+ try:
461
+ response = await provider.complete(
462
+ [Message(role=Role.USER, content=send_prompt)],
463
+ temperature=0.0,
464
+ max_tokens=max_tokens,
465
+ )
466
+ except (ProviderRateLimitError, ProviderTimeoutError) as e:
467
+ return ScoreResult(
468
+ reasoning=f"{ABSTAIN_REASON_PROVIDER_EXHAUSTED}{type(e).__name__}: {e}",
469
+ evidence_quotes=[],
470
+ score="Unknown",
471
+ judge_id=judge_id,
472
+ rubric_version=rubric_version,
473
+ prompt_seed=prompt_seed,
474
+ system_output_hash=system_output_hash,
475
+ cost_usd=accumulated_cost,
476
+ latency_ms=accumulated_latency + (time.perf_counter() - start) * 1000,
477
+ )
478
+ # Other exceptions (caller bugs like 401, 400) propagate.
479
+ accumulated_cost += response.usage.estimated_cost_usd
480
+ accumulated_latency += (time.perf_counter() - start) * 1000
481
+ last_raw = response.content[:300]
482
+
483
+ # Parse — reasoning and evidence_quotes are optional so judges
484
+ # configured with use_cot=False (which prompt for {"score": ...}
485
+ # only) don't fail parsing on the missing key.
486
+ #
487
+ # Some models (observed on Haiku 4.5 under the v1.1 rubric) wrap
488
+ # their JSON in a ```json ... ``` markdown fence. Strip the fence
489
+ # before parsing rather than abstaining on a syntactically valid
490
+ # but conventionally formatted response.
491
+ content = _strip_markdown_fence(response.content)
492
+ try:
493
+ data = _json.loads(content)
494
+ reasoning = str(data.get("reasoning", ""))
495
+ evidence_quotes = list(data.get("evidence_quotes", []))
496
+ raw_score = data["score"]
497
+ except (_json.JSONDecodeError, KeyError, TypeError) as e:
498
+ cause = ABSTAIN_REASON_SCHEMA_PARSE
499
+ if attempt == 0:
500
+ logger.warning(
501
+ "judge_first_attempt_failure",
502
+ judge_id=judge_id,
503
+ item_id=item_id,
504
+ provider=type(provider).__name__,
505
+ failure_cause=cause,
506
+ attempt_index=1,
507
+ )
508
+ continue
509
+ return ScoreResult(
510
+ reasoning=f"{cause}raw={last_raw!r} parse_error={e}",
511
+ evidence_quotes=[],
512
+ score="Unknown",
513
+ judge_id=judge_id,
514
+ rubric_version=rubric_version,
515
+ prompt_seed=prompt_seed,
516
+ system_output_hash=system_output_hash,
517
+ cost_usd=accumulated_cost,
518
+ latency_ms=accumulated_latency,
519
+ )
520
+
521
+ # Score validation
522
+ if raw_score == "Unknown":
523
+ if not abstain_allowed:
524
+ cause = ABSTAIN_REASON_OUT_OF_RANGE
525
+ if attempt == 0:
526
+ logger.warning(
527
+ "judge_first_attempt_failure",
528
+ judge_id=judge_id,
529
+ item_id=item_id,
530
+ provider=type(provider).__name__,
531
+ failure_cause=cause,
532
+ attempt_index=1,
533
+ )
534
+ continue
535
+ return ScoreResult(
536
+ reasoning=(
537
+ f"{cause}model returned 'Unknown' but rubric "
538
+ f"abstain_allowed=False"
539
+ ),
540
+ evidence_quotes=[],
541
+ score="Unknown",
542
+ judge_id=judge_id,
543
+ rubric_version=rubric_version,
544
+ prompt_seed=prompt_seed,
545
+ system_output_hash=system_output_hash,
546
+ cost_usd=accumulated_cost,
547
+ latency_ms=accumulated_latency,
548
+ )
549
+ # Genuine abstain — no prefix, no retry
550
+ return ScoreResult(
551
+ reasoning=reasoning,
552
+ evidence_quotes=evidence_quotes,
553
+ score="Unknown",
554
+ judge_id=judge_id,
555
+ rubric_version=rubric_version,
556
+ prompt_seed=prompt_seed,
557
+ system_output_hash=system_output_hash,
558
+ cost_usd=accumulated_cost,
559
+ latency_ms=accumulated_latency,
560
+ )
561
+
562
+ try:
563
+ score_int = int(raw_score)
564
+ except (ValueError, TypeError):
565
+ cause = ABSTAIN_REASON_OUT_OF_RANGE
566
+ if attempt == 0:
567
+ logger.warning(
568
+ "judge_first_attempt_failure",
569
+ judge_id=judge_id,
570
+ item_id=item_id,
571
+ provider=type(provider).__name__,
572
+ failure_cause=cause,
573
+ attempt_index=1,
574
+ )
575
+ continue
576
+ return ScoreResult(
577
+ reasoning=f"{cause}non-int score: {raw_score!r}",
578
+ evidence_quotes=[],
579
+ score="Unknown",
580
+ judge_id=judge_id,
581
+ rubric_version=rubric_version,
582
+ prompt_seed=prompt_seed,
583
+ system_output_hash=system_output_hash,
584
+ cost_usd=accumulated_cost,
585
+ latency_ms=accumulated_latency,
586
+ )
587
+
588
+ if score_int not in valid_scores:
589
+ cause = ABSTAIN_REASON_OUT_OF_RANGE
590
+ if attempt == 0:
591
+ logger.warning(
592
+ "judge_first_attempt_failure",
593
+ judge_id=judge_id,
594
+ item_id=item_id,
595
+ provider=type(provider).__name__,
596
+ failure_cause=cause,
597
+ attempt_index=1,
598
+ )
599
+ continue
600
+ return ScoreResult(
601
+ reasoning=(
602
+ f"{cause}model returned {score_int}, valid levels "
603
+ f"{sorted(valid_scores)}"
604
+ ),
605
+ evidence_quotes=[],
606
+ score="Unknown",
607
+ judge_id=judge_id,
608
+ rubric_version=rubric_version,
609
+ prompt_seed=prompt_seed,
610
+ system_output_hash=system_output_hash,
611
+ cost_usd=accumulated_cost,
612
+ latency_ms=accumulated_latency,
613
+ )
614
+
615
+ # Success
616
+ return ScoreResult(
617
+ reasoning=reasoning,
618
+ evidence_quotes=evidence_quotes,
619
+ score=score_int,
620
+ judge_id=judge_id,
621
+ rubric_version=rubric_version,
622
+ prompt_seed=prompt_seed,
623
+ system_output_hash=system_output_hash,
624
+ cost_usd=accumulated_cost,
625
+ latency_ms=accumulated_latency,
626
+ )
627
+
628
+ raise RuntimeError("_call_judge_with_retry: unreachable code path")
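Review note: the parse and validation path of `_call_judge_with_retry` can be exercised without a provider. The following is a simplified sketch, not the helper itself: `parse_score` is a hypothetical name, it collapses every failure mode to `None` instead of building a `ScoreResult`, and it omits the retry loop and the `abstain_allowed` check. The fence pattern is the one from the diff, with the backticks built programmatically so the snippet stays readable.

```python
import json
import re
from typing import Optional, Union

FENCE = "`" * 3  # three backticks
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json|JSON)?\s*\n(.*?)\n" + FENCE + r"\s*$", re.DOTALL
)


def strip_markdown_fence(text: str) -> str:
    """Unwrap a fenced JSON reply; return text unchanged if no fence is present."""
    m = FENCE_RE.match(text.strip())
    return m.group(1) if m else text


def parse_score(raw: str, valid_scores: set) -> Optional[Union[int, str]]:
    """Return a valid int score, "Unknown", or None when the reply is unusable."""
    try:
        data = json.loads(strip_markdown_fence(raw))
        raw_score = data["score"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # malformed JSON, missing key, or non-object payload
    if raw_score == "Unknown":
        return "Unknown"
    try:
        score = int(raw_score)
    except (ValueError, TypeError):
        return None  # non-integer score
    return score if score in valid_scores else None


fenced = FENCE + 'json\n{"score": 1}\n' + FENCE
assert parse_score(fenced, {0, 1}) == 1
assert parse_score('{"score": 5}', {0, 1}) is None
```

In the real helper each `None` case maps to a retry on the first attempt and to an abstaining `ScoreResult` on the second.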
agent_bench/evaluation/judges/citation_faithfulness.py ADDED
@@ -0,0 +1,188 @@
1
+ """CitationFaithfulnessJudge — binary, per-(claim,citation) all-or-nothing."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import re
6
+ from typing import TYPE_CHECKING, Literal
7
+
8
+ import structlog
9
+
10
+ from agent_bench.evaluation.judges.base import (
11
+ Judge,
12
+ ScoreResult,
13
+ _call_judge_with_retry,
14
+ )
15
+ from agent_bench.evaluation.judges.groundedness import _system_output_hash
16
+
17
+ if TYPE_CHECKING:
18
+ from agent_bench.agents.orchestrator import AgentResponse
19
+ from agent_bench.evaluation.harness import GoldenQuestion
20
+
21
+ logger = structlog.get_logger()
22
+
23
+ _CITATION_PATTERN = re.compile(r"\[source:\s*([^\]]+)\]")
24
+
25
+
26
+ def _extract_claims_with_citations(answer: str) -> list[tuple[str, str]]:
27
+ """Return list of (claim_text, cited_source) pairs.
28
+
29
+ A "claim" is the sentence (including its terminating punctuation)
30
+ immediately preceding a [source:] citation. Prior citation tags
31
+ inside `before` are stripped so multi-citation answers yield clean
32
+ claim strings.
33
+ """
34
+ pairs: list[tuple[str, str]] = []
35
+ for match in _CITATION_PATTERN.finditer(answer):
36
+ cited = match.group(1).strip()
37
+ before = answer[: match.start()]
38
+ # Strip prior [source:...] tags so they don't pollute the claim
39
+ before_clean = _CITATION_PATTERN.sub("", before)
40
+ last_end = max(
41
+ before_clean.rfind("."), before_clean.rfind("!"), before_clean.rfind("?")
42
+ )
43
+ if last_end >= 0:
44
+ prev_end = max(
45
+ before_clean.rfind(".", 0, last_end),
46
+ before_clean.rfind("!", 0, last_end),
47
+ before_clean.rfind("?", 0, last_end),
48
+ )
49
+ claim = before_clean[prev_end + 1 : last_end + 1].strip()
50
+ else:
51
+ claim = before_clean.strip()
52
+ pairs.append((claim, cited))
53
+ return pairs
54
+
55
+
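Review note: a standalone copy of the extraction logic makes its assumptions visible. The sketch below mirrors `_extract_claims_with_citations` as shown in the diff (renamed `extract_claims`, otherwise unchanged) and runs it on two made-up answers. It suggests the function expects the citation tag to follow the sentence's terminating punctuation; when the tag sits before the period, the second citation appears to be paired with the first sentence.

```python
import re

CITATION = re.compile(r"\[source:\s*([^\]]+)\]")


def extract_claims(answer: str) -> list:
    pairs = []
    for match in CITATION.finditer(answer):
        cited = match.group(1).strip()
        # Drop earlier tags so they do not leak into the claim text.
        before = CITATION.sub("", answer[: match.start()])
        last_end = max(before.rfind("."), before.rfind("!"), before.rfind("?"))
        if last_end >= 0:
            prev_end = max(
                before.rfind(".", 0, last_end),
                before.rfind("!", 0, last_end),
                before.rfind("?", 0, last_end),
            )
            claim = before[prev_end + 1 : last_end + 1].strip()
        else:
            claim = before.strip()
        pairs.append((claim, cited))
    return pairs


# Tag after the period: each citation pairs with its own sentence.
after = "Paris is the capital. [source: geo.md] It has a river. [source: city.md]"
assert extract_claims(after) == [
    ("Paris is the capital.", "geo.md"),
    ("It has a river.", "city.md"),
]

# Tag before the period: the second citation picks up the first sentence.
inside = "A is b [source: x.md]. C is d [source: y.md]."
assert extract_claims(inside) == [("A is b", "x.md"), ("A is b .", "y.md")]
```

If agents in this project can emit the tag-before-period style, a test pinning the intended behaviour may be worth adding.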
56
+ class CitationFaithfulnessJudge(Judge):
57
+ """Aggregates per-(claim, citation) judgments into one item-level
58
+ binary ScoreResult. Per-pair detail is in evidence_quotes.
59
+
60
+ All-or-nothing aggregation: any unfaithful citation → score 0.
61
+ The rubric documents the rule explicitly.
62
+ """
63
+
64
+ async def score(
65
+ self,
66
+ item: "GoldenQuestion",
67
+ output: "AgentResponse",
68
+ *,
69
+ prompt_seed: int = 0,
70
+ ) -> ScoreResult:
71
+ pairs = _extract_claims_with_citations(output.answer)
72
+ # Map cited source name to its retrieved chunk text via output.source_chunks
73
+ # (assumes index alignment with output.sources, matching harness
74
+ # convention). If the same source appears multiple times in the
75
+ # sources list with distinct chunks (legitimate when multiple
76
+ # retrievals match the same doc), `setdefault` keeps only the first
77
+ # — every "[source: X]" claim then evaluates against that one chunk,
78
+ # a false-failure risk. Warn so the operator notices.
79
+ source_names = [s.source for s in output.sources]
80
+ if len(set(source_names)) < len(source_names):
81
+ from collections import Counter
82
+
83
+ duplicates = sorted(
84
+ name for name, n in Counter(source_names).items() if n > 1
85
+ )
86
+ logger.warning(
87
+ "citation_faithfulness_lossy_source_lookup",
88
+ item_id=item.id,
89
+ duplicate_source_names=duplicates,
90
+ detail=(
91
+ "source name appears multiple times in output.sources "
92
+ "with distinct chunks; only the first chunk will be "
93
+ "associated with the name during citation evaluation."
94
+ ),
95
+ )
96
+ source_to_chunk: dict[str, str] = {}
97
+ for src_ref, chunk in zip(output.sources, output.source_chunks):
98
+ source_to_chunk.setdefault(src_ref.source, chunk)
99
+
100
+ per_pair_results: list[ScoreResult] = []
101
+ sys_hash = _system_output_hash(
102
+ item.id, output.answer, [s.source for s in output.sources]
103
+ )
104
+
105
+ if not pairs:
106
+ return ScoreResult(
107
+ reasoning="no_citations_in_answer",
108
+ evidence_quotes=[],
109
+ score=1,
110
+ judge_id=self.judge_id,
111
+ rubric_version=self.rubric.source_hash,
112
+ prompt_seed=prompt_seed,
113
+ system_output_hash=sys_hash,
114
+ cost_usd=0.0,
115
+ latency_ms=0.0,
116
+ )
117
+
118
+ accumulated_cost = 0.0
119
+ accumulated_latency = 0.0
120
+ any_unfaithful = False
121
+ for claim, cited in pairs:
122
+ # Empty claim → leading-citation case (e.g., answer starts with
123
+ # "[source: a.md] ..." with no prior content). There is no claim
124
+ # to evaluate against the chunk; the well-defined verdict is
125
+ # vacuously faithful. Skip the API call; record a synthetic
126
+ # ScoreResult so per-pair detail still appears in evidence_quotes.
127
+ if not claim:
128
+ per_pair_results.append(
129
+ ScoreResult(
130
+ reasoning="empty_claim_vacuously_faithful",
131
+ evidence_quotes=[],
132
+ score=1,
133
+ judge_id=self.judge_id,
134
+ rubric_version=self.rubric.source_hash,
135
+ prompt_seed=prompt_seed,
136
+ system_output_hash=sys_hash,
137
+ cost_usd=0.0,
138
+ latency_ms=0.0,
139
+ )
140
+ )
141
+ continue
142
+ chunk = source_to_chunk.get(cited, "")
143
+ schema_clause = self._json_schema_clause('0 or 1 or "Unknown"')
144
+ prompt = (
145
+ f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
146
+ f"---\n\n"
147
+ f"## Claim (from agent's answer)\n{claim}\n\n"
148
+ f"## Cited chunk content\n{chunk}\n\n"
149
+ f"Does the cited chunk support the claim? Respond with ONLY a "
150
+ f"{schema_clause}"
151
+ )
152
+ sub_result = await _call_judge_with_retry(
153
+ provider=self.judge_provider,
154
+ prompt=prompt,
155
+ valid_scores={0, 1},
156
+ judge_id=self.judge_id,
157
+ rubric_version=self.rubric.source_hash,
158
+ prompt_seed=prompt_seed,
159
+ system_output_hash=sys_hash,
160
+ item_id=f"{item.id}::{cited}",
161
+ abstain_allowed=self.effective_abstain_allowed,
162
+ )
163
+ per_pair_results.append(sub_result)
164
+ accumulated_cost += sub_result.cost_usd
165
+ accumulated_latency += sub_result.latency_ms
166
+ if sub_result.score == 0:
167
+ any_unfaithful = True
168
+
169
+ aggregate_score: int | Literal["Unknown"] = 0 if any_unfaithful else 1
170
+ # Any sub-call abstain → propagate Unknown (consistent with strict-quorum)
171
+ if any(r.abstained for r in per_pair_results):
172
+ aggregate_score = "Unknown"
173
+
174
+ return ScoreResult(
175
+ reasoning=(
176
+ f"all_or_nothing aggregate over {len(per_pair_results)} (claim, citation) pairs; "
177
+ f"unfaithful={sum(1 for r in per_pair_results if r.score == 0)}, "
178
+ f"abstained={sum(1 for r in per_pair_results if r.abstained)}"
179
+ ),
180
+ evidence_quotes=[r.reasoning[:120] for r in per_pair_results],
181
+ score=aggregate_score,
182
+ judge_id=self.judge_id,
183
+ rubric_version=self.rubric.source_hash,
184
+ prompt_seed=prompt_seed,
185
+ system_output_hash=sys_hash,
186
+ cost_usd=accumulated_cost,
187
+ latency_ms=accumulated_latency,
188
+ )
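Review note: the aggregation rule in `CitationFaithfulnessJudge.score` reduces to three cases. The sketch below works on bare per-pair scores rather than `ScoreResult` objects, and `aggregate` is a hypothetical name. It follows the order of checks in the diff: no pairs yields 1, any abstain yields "Unknown" even when another pair is unfaithful, otherwise any 0 yields 0.

```python
from typing import Literal, Union

Score = Union[int, Literal["Unknown"]]


def aggregate(pair_scores: list) -> Score:
    if not pair_scores:
        return 1  # no citations in the answer
    if any(s == "Unknown" for s in pair_scores):
        return "Unknown"  # an abstain on any pair overrides the verdict
    return 0 if any(s == 0 for s in pair_scores) else 1


assert aggregate([1, 1]) == 1
assert aggregate([1, 0]) == 0
assert aggregate([0, "Unknown"]) == "Unknown"
```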
agent_bench/evaluation/judges/completeness.py ADDED
@@ -0,0 +1,62 @@
1
+ """CompletenessJudge — three-point, reference-based on item.reference_answer."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import TYPE_CHECKING
6
+
7
+ from agent_bench.evaluation.judges.base import (
8
+ Judge,
9
+ ScoreResult,
10
+ _call_judge_with_retry,
11
+ )
12
+ from agent_bench.evaluation.judges.groundedness import _system_output_hash
13
+
14
+ if TYPE_CHECKING:
15
+ from agent_bench.agents.orchestrator import AgentResponse
16
+ from agent_bench.evaluation.harness import GoldenQuestion
17
+
18
+
19
+ # v1.1.1: recency-positioned restatement of the rubric's "paraphrase
20
+ # allowed" semantics. Earned by the 3A probe (3/5 disputed items shifted
21
+ # 1→2 on gpt-4o-mini) which validated that gpt-4o-mini's directional
22
+ # downward bias on 3-point completeness was prompt-positionally
23
+ # correctable rather than model-intrinsic. The clause appears immediately
24
+ # before the score instruction so the conditioning isn't lost across the
25
+ # rubric body and the reasoning step. See DECISIONS "Plan 3A" entry.
26
+ PARAPHRASE_RECENCY_CLAUSE = (
27
+ "Note: a paraphrase that captures the same meaning as a gold-answer "
28
+ "point counts as covered. Score on content equivalence, not surface form."
29
+ )
30
+
31
+
32
+ class CompletenessJudge(Judge):
33
+ async def score(
34
+ self,
35
+ item: "GoldenQuestion",
36
+ output: "AgentResponse",
37
+ *,
38
+ prompt_seed: int = 0,
39
+ ) -> ScoreResult:
40
+ schema_clause = self._json_schema_clause('0 or 1 or 2 or "Unknown"')
41
+ prompt = (
42
+ f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
43
+ f"---\n\n"
44
+ f"## Reference answer (gold)\n{item.reference_answer}\n\n"
45
+ f"## Answer to score\n{output.answer}\n\n"
46
+ f"{PARAPHRASE_RECENCY_CLAUSE}\n\n"
47
+ f"Score this answer against the rubric above. Respond with ONLY a "
48
+ f"{schema_clause}"
49
+ )
50
+ return await _call_judge_with_retry(
51
+ provider=self.judge_provider,
52
+ prompt=prompt,
53
+ valid_scores={0, 1, 2},
54
+ judge_id=self.judge_id,
55
+ rubric_version=self.rubric.source_hash,
56
+ prompt_seed=prompt_seed,
57
+ system_output_hash=_system_output_hash(
58
+ item.id, output.answer, [s.source for s in output.sources]
59
+ ),
60
+ item_id=item.id,
61
+ abstain_allowed=self.effective_abstain_allowed,
62
+ )
agent_bench/evaluation/judges/groundedness.py ADDED
@@ -0,0 +1,57 @@
1
+ """GroundednessJudge — binary, reference-based on item.source_snippets."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import hashlib
6
+ from typing import TYPE_CHECKING
7
+
8
+ from agent_bench.evaluation.judges.base import (
9
+ Judge,
10
+ ScoreResult,
11
+ _call_judge_with_retry,
12
+ )
13
+
14
+ if TYPE_CHECKING:
15
+ from agent_bench.agents.orchestrator import AgentResponse
16
+ from agent_bench.evaluation.harness import GoldenQuestion
17
+
18
+
19
+ def _system_output_hash(item_id: str, answer: str, sources: list[str]) -> str:
20
+ sorted_sources = sorted(sources)
21
+ canonical = f"{item_id}\x00{answer}\x00{','.join(sorted_sources)}"
22
+ return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
23
+
24
+
25
+ class GroundednessJudge(Judge):
26
+ async def score(
27
+ self,
28
+ item: "GoldenQuestion",
29
+ output: "AgentResponse",
30
+ *,
31
+ prompt_seed: int = 0,
32
+ ) -> ScoreResult:
33
+ snippets_block = "\n".join(
34
+ f"[{i + 1}] {s}" for i, s in enumerate(item.source_snippets)
35
+ )
36
+ schema_clause = self._json_schema_clause('0 or 1 or "Unknown"')
37
+ prompt = (
38
+ f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
39
+ f"---\n\n"
40
+ f"## Gold source snippets\n{snippets_block}\n\n"
41
+ f"## Answer to score\n{output.answer}\n\n"
42
+ f"Score this answer against the rubric above. Respond with ONLY a "
43
+ f"{schema_clause}"
44
+ )
45
+ return await _call_judge_with_retry(
46
+ provider=self.judge_provider,
47
+ prompt=prompt,
48
+ valid_scores={0, 1},
49
+ judge_id=self.judge_id,
50
+ rubric_version=self.rubric.source_hash,
51
+ prompt_seed=prompt_seed,
52
+ system_output_hash=_system_output_hash(
53
+ item.id, output.answer, [s.source for s in output.sources]
54
+ ),
55
+ item_id=item.id,
56
+ abstain_allowed=self.effective_abstain_allowed,
57
+ )
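Review note: `_system_output_hash` is shared by every judge, so its properties are worth spelling out. The sketch below copies the function body from the diff under a public name and checks that source order does not affect the hash. It also shows one caveat that follows from the comma join: a single source name containing a comma hashes the same as two separate names.

```python
import hashlib


def system_output_hash(item_id: str, answer: str, sources: list) -> str:
    # NUL separates the three fields; sources are sorted so order is irrelevant.
    canonical = f"{item_id}\x00{answer}\x00{','.join(sorted(sources))}"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


h1 = system_output_hash("q1", "answer", ["b.md", "a.md"])
h2 = system_output_hash("q1", "answer", ["a.md", "b.md"])
assert h1 == h2                  # order-insensitive over sources
assert len(h1) == 64             # SHA-256 hex digest length
assert h1 != system_output_hash("q1", "answer!", ["a.md", "b.md"])

# Caveat: the comma join cannot tell these two source lists apart.
assert system_output_hash("q", "a", ["a,b"]) == system_output_hash("q", "a", ["a", "b"])
```

Whether the comma collision matters depends on whether source names can contain commas in this corpus, which the diff does not say.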
agent_bench/evaluation/judges/relevance.py ADDED
@@ -0,0 +1,48 @@
1
+ """RelevanceJudge — three-point, reference-free."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import TYPE_CHECKING
6
+
7
+ from agent_bench.evaluation.judges.base import (
8
+ Judge,
9
+ ScoreResult,
10
+ _call_judge_with_retry,
11
+ )
12
+ from agent_bench.evaluation.judges.groundedness import _system_output_hash
13
+
14
+ if TYPE_CHECKING:
15
+ from agent_bench.agents.orchestrator import AgentResponse
16
+ from agent_bench.evaluation.harness import GoldenQuestion
17
+
18
+
19
+ class RelevanceJudge(Judge):
20
+ async def score(
21
+ self,
22
+ item: "GoldenQuestion",
23
+ output: "AgentResponse",
24
+ *,
25
+ prompt_seed: int = 0,
26
+ ) -> ScoreResult:
27
+ schema_clause = self._json_schema_clause('0 or 1 or 2 or "Unknown"')
28
+ prompt = (
29
+ f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n"
30
+ f"---\n\n"
31
+ f"## Question\n{item.question}\n\n"
32
+ f"## Answer to score\n{output.answer}\n\n"
33
+ f"Score this answer against the rubric above. Respond with ONLY a "
34
+ f"{schema_clause}"
35
+ )
36
+ return await _call_judge_with_retry(
37
+ provider=self.judge_provider,
38
+ prompt=prompt,
39
+ valid_scores={0, 1, 2},
40
+ judge_id=self.judge_id,
41
+ rubric_version=self.rubric.source_hash,
42
+ prompt_seed=prompt_seed,
43
+ system_output_hash=_system_output_hash(
44
+ item.id, output.answer, [s.source for s in output.sources]
45
+ ),
46
+ item_id=item.id,
47
+ abstain_allowed=self.effective_abstain_allowed,
48
+ )
agent_bench/evaluation/metrics.py CHANGED
@@ -1,15 +1,19 @@
1
- """Deterministic and LLM-judge evaluation metrics."""
2
 
3
  from __future__ import annotations
4
 
5
- import json
6
  import re
7
 
8
  import structlog
9
 
10
  from agent_bench.agents.orchestrator import AgentResponse
11
- from agent_bench.core.provider import LLMProvider
12
- from agent_bench.core.types import Message, Role
13
 
14
  logger = structlog.get_logger()
15
 
@@ -125,84 +129,4 @@ def calculator_used_when_expected(
125
  return "calculator" in response.tools_used
126
 
127
 
128
- # --- LLM-judge metrics (costs money, manual) ---
129
-
130
- _FAITHFULNESS_PROMPT = """\
131
- You are evaluating whether an AI assistant's answer \
132
- is fully supported by the provided source passages.
133
-
134
- Source passages:
135
- {chunks}
136
-
137
- Answer to evaluate:
138
- {answer}
139
-
140
- Score the answer's faithfulness to the sources from 0.0 to 1.0:
141
- - 1.0: Every claim is directly supported by the sources
142
- - 0.5: Some claims are supported, others are extrapolated
143
- - 0.0: The answer contradicts or is entirely unsupported
144
-
145
- Respond with ONLY a JSON object:
146
- {{"score": 0.8, "reasoning": "brief explanation"}}"""
147
-
148
- _CORRECTNESS_PROMPT = """\
149
- You are evaluating whether an AI assistant's answer \
150
- is factually correct compared to a reference answer.
151
-
152
- Reference answer:
153
- {reference}
154
-
155
- Answer to evaluate:
156
- {answer}
157
-
158
- Score correctness from 0.0 to 1.0:
159
- - 1.0: All key facts match the reference
160
- - 0.5: Some facts are correct, some are missing or wrong
161
- - 0.0: The answer is factually incorrect
162
-
163
- Respond with ONLY a JSON object:
164
- {{"score": 0.8, "reasoning": "brief explanation"}}"""
165
-
166
-
167
- async def answer_faithfulness(
168
- answer: str,
169
- source_chunks: list[str],
170
- judge_provider: LLMProvider,
171
- ) -> float | None:
172
- """LLM-judged: is the answer supported by the sources? 0.0-1.0."""
173
- chunks_text = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(source_chunks))
174
- prompt = _FAITHFULNESS_PROMPT.format(chunks=chunks_text, answer=answer)
175
-
176
- return await _judge_call(prompt, judge_provider)
177
-
178
-
179
- async def answer_correctness(
180
- answer: str,
181
- reference_answer: str,
182
- judge_provider: LLMProvider,
183
- ) -> float | None:
184
- """LLM-judged: is the answer factually correct vs reference? 0.0-1.0."""
185
- prompt = _CORRECTNESS_PROMPT.format(reference=reference_answer, answer=answer)
186
-
187
- return await _judge_call(prompt, judge_provider)
188
-
189
-
190
- async def _judge_call(prompt: str, provider: LLMProvider) -> float | None:
191
- """Make a judge call and parse the JSON response."""
192
- try:
193
- response = await provider.complete(
194
- [Message(role=Role.USER, content=prompt)],
195
- temperature=0.0,
196
- max_tokens=256,
197
- )
198
- data = json.loads(response.content)
199
- score = float(data["score"])
200
- reasoning = data.get("reasoning", "")
201
- logger.info("llm_judge_result", score=score, reasoning=reasoning)
202
- return max(0.0, min(1.0, score))
203
- except (json.JSONDecodeError, KeyError, ValueError, TypeError) as e:
204
- logger.warning("llm_judge_parse_error", error=str(e), raw=response.content[:200])
205
- return None
206
- except Exception as e:
207
- logger.error("llm_judge_call_error", error=str(e))
208
- return None
 
1
+ """Deterministic evaluation metrics.
2
+
3
+ The continuous-scale LLM-judge functions (answer_faithfulness,
4
+ answer_correctness, _judge_call) were removed in the judge-layer v1
5
+ supersession. The replacement lives at agent_bench/evaluation/judges/
6
+ as discrete-anchored, per-dimension judges with κ-validated calibration.
7
+ See docs/plans/2026-05-04-judge-layer-v1-design.md for the rationale.
8
+ """
9
 
10
  from __future__ import annotations
11
 
 
12
  import re
13
 
14
  import structlog
15
 
16
  from agent_bench.agents.orchestrator import AgentResponse
 
 
17
 
18
  logger = structlog.get_logger()
19
 
 
129
  return "calculator" in response.tools_used
130
 
131
 
132
+ # LLM-judge metrics moved to agent_bench/evaluation/judges/ in judge-layer v1.
agent_bench/evaluation/report.py CHANGED
@@ -52,9 +52,18 @@ def generate_report(
52
  total_cost = sum(r.tokens_used.estimated_cost_usd for r in results)
53
  avg_cost = total_cost / max(len(results), 1)
54
 
55
- # Optional faithfulness
56
- faith_scores = [r.faithfulness for r in positive if r.faithfulness is not None]
57
- avg_faith = _safe_avg(faith_scores) if faith_scores else None
58
 
59
  lines.append("| Metric | Value |")
60
  lines.append("|--------|-------|")
@@ -65,8 +74,8 @@ def generate_report(
65
  lines.append(f"| Citation Accuracy | {avg_citation:.2f} |")
66
  lines.append(f"| Grounded Refusal Rate | {refusal_rate}/{len(negative)} |")
67
  lines.append(f"| Calculator Accuracy | {calc_correct}/{len(calc_qs)} |")
68
- if avg_faith is not None:
69
- lines.append(f"| Answer Faithfulness (LLM) | {avg_faith:.2f} |")
70
  lines.append(f"| Latency p50 | {p50:,.0f} ms |")
71
  lines.append(f"| Latency p95 | {p95:,.0f} ms |")
72
  lines.append(f"| Cost per query | ${avg_cost:.4f} |")
 
52
  total_cost = sum(r.tokens_used.estimated_cost_usd for r in results)
53
  avg_cost = total_cost / max(len(results), 1)
54
 
55
+ # Optional groundedness (replaces continuous faithfulness in v1).
56
+ # Discrete-anchored binary 0/1; abstains (score "Unknown") are excluded
57
+ # from the average. float() converts ScoreResult.score, typed
58
+ # `int | Literal["Unknown"]`, for _safe_avg; the abstained=False filter
59
+ # already guarantees an int at runtime, but mypy cannot infer that.
60
+ grounded_scores: list[float] = [
61
+ float(r.judge_scores["groundedness"].score) # type: ignore[arg-type]
62
+ for r in positive
63
+ if "groundedness" in r.judge_scores
64
+ and not r.judge_scores["groundedness"].abstained
65
+ ]
66
+ avg_grounded = _safe_avg(grounded_scores) if grounded_scores else None
67
 
68
  lines.append("| Metric | Value |")
69
  lines.append("|--------|-------|")
 
74
  lines.append(f"| Citation Accuracy | {avg_citation:.2f} |")
75
  lines.append(f"| Grounded Refusal Rate | {refusal_rate}/{len(negative)} |")
76
  lines.append(f"| Calculator Accuracy | {calc_correct}/{len(calc_qs)} |")
77
+ if avg_grounded is not None:
78
+ lines.append(f"| Answer Groundedness (LLM judge) | {avg_grounded:.2f} |")
79
  lines.append(f"| Latency p50 | {p50:,.0f} ms |")
80
  lines.append(f"| Latency p95 | {p95:,.0f} ms |")
81
  lines.append(f"| Cost per query | ${avg_cost:.4f} |")
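Reviewer sketch, not part of the commit: the abstain handling in the report.py hunk above can be illustrated in isolation. `Score` and `average_groundedness` are hypothetical stand-ins for `ScoreResult` and the list comprehension in `generate_report`; only the `score` and `abstained` names come from the diff, and modelling `abstained` as a derived property is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Union


@dataclass
class Score:
    """Hypothetical stand-in for ScoreResult (only the fields used here)."""

    score: Union[int, Literal["Unknown"]]

    @property
    def abstained(self) -> bool:
        # Assumption: an abstain is represented by the "Unknown" sentinel.
        return self.score == "Unknown"


def average_groundedness(scores: list[Score]) -> float | None:
    """Mean of the non-abstained binary scores, or None if none remain."""
    kept = [float(s.score) for s in scores if not s.abstained]
    return sum(kept) / len(kept) if kept else None


if __name__ == "__main__":
    items = [Score(1), Score(0), Score("Unknown"), Score(1)]
    print(average_groundedness(items))
```

Excluding abstains from the denominator means a run with many abstains can report a high average over few items, so the count of kept items may be worth reporting alongside the mean.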
agent_bench/evaluation/rubrics/citation_faithfulness.md ADDED
@@ -0,0 +1,57 @@
1
+ ---
2
+ dimension: citation_faithfulness
3
+ scale: binary
4
+ reference_based: true
5
+ abstain_allowed: true
6
+ ---
7
+
8
+ # Citation faithfulness (binary, all-or-nothing aggregation per item)
9
+
10
+ For each [source: X.md] citation in the answer, is the cited chunk's
11
+ content actually relevant to the claim it supports? This is stricter
12
+ than the deterministic citation_accuracy metric, which only checks
13
+ that the cited chunk_id appears in the retrieved set — citation
14
+ faithfulness checks the **relevance** of the chunk to the claim.
15
+
16
+ **Aggregation rule (item-level):** any unfaithful citation in the
17
+ answer → item score = 0. A single bad citation in a multi-citation
18
+ answer is a real failure that all-or-nothing surfaces; treating it as
19
+ partial would obscure the failure mode.
20
+
21
+ ## Score 0
22
+
23
+ The cited chunk's content does not support the adjacent claim.
24
+
25
+ ### Example A — citation drift
26
+
27
+ Claim: "The default port is 8080."
28
+ Cited chunk content: "The dashboard supports OAuth and SAML authentication."
29
+
30
+ Score=0 because the chunk talks about authentication, not the port.
31
+ The citation is misleading even though the claim happens to be true.
32
+
33
+ ### Example B — wrong topic citation
34
+
35
+ Claim: "StatefulSet pods get ordinal indices."
36
+ Cited chunk content: "Deployments support rolling updates with maxSurge and maxUnavailable parameters."
37
+
38
+ Score=0 — the cited chunk is about Deployments, not StatefulSets.
39
+ The citation does not support the claim about StatefulSet ordinals.
40
+
41
+ ## Score 1
42
+
43
+ The cited chunk's content directly supports the adjacent claim.
44
+
45
+ ### Example C — single accurate citation
46
+
47
+ Claim: "The default port is 8080."
48
+ Cited chunk content: "The dashboard listens on port 8080 by default."
49
+
50
+ Score=1.
51
+
52
+ ### Example D — paraphrase-supported citation
53
+
54
+ Claim: "Each pod has a stable hostname."
55
+ Cited chunk content: "StatefulSet pods receive hostnames derived from the StatefulSet name plus their ordinal, and these hostnames persist across reschedules."
56
+
57
+ Score=1 — the chunk supports the claim via paraphrase.
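Reviewer sketch, not part of the commit: the item-level aggregation rule in this rubric (any unfaithful citation gives item score 0) can be written as a small function. The name `citation_item_score` and the choice to raise on an empty citation list are assumptions; the rubric itself does not say how an answer with no citations is handled.

```python
def citation_item_score(per_citation: list[int]) -> int:
    """All-or-nothing aggregation over per-citation binary verdicts.

    Returns 1 only when every citation was judged faithful (1);
    a single 0 makes the whole item 0.
    """
    if not per_citation:
        # Assumption: the empty case is left to the caller (for example,
        # to abstain); the rubric does not define it.
        raise ValueError("no citations to aggregate")
    if any(v not in (0, 1) for v in per_citation):
        raise ValueError("per-citation verdicts must be 0 or 1")
    return int(all(v == 1 for v in per_citation))


if __name__ == "__main__":
    print(citation_item_score([1, 1, 1]))
    print(citation_item_score([1, 0, 1]))
```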
agent_bench/evaluation/rubrics/completeness.md ADDED
@@ -0,0 +1,71 @@
1
+ ---
2
+ dimension: completeness
3
+ scale: three_point
4
+ reference_based: true
5
+ abstain_allowed: true
6
+ ---
7
+
8
+ # Completeness (three-point)
9
+
10
+ Score how much of the gold reference answer is covered by the agent's
11
+ answer. This is reference-based — the judge sees the gold reference
12
+ and the agent's answer; score on **coverage of facts** in the
13
+ reference, not on additional facts the agent may have included.
14
+
15
+ The judge does not penalize the agent for adding correct extra detail
16
+ (that's a separate concern). Score only on what fraction of the
17
+ reference's points are present.
18
+
19
+ ## Score 0
20
+
21
+ None of the reference's key points are present in the answer.
22
+
23
+ ### Example A — answer addresses different facts
24
+
25
+ Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage."
26
+ Answer: "Kubernetes uses YAML manifests to declare resources."
27
+
28
+ Score=0 — none of the three reference points (ordinal, hostname, storage) appear.
29
+
30
+ ### Example B — refusal that covers nothing
31
+
32
+ Reference: "The default port is 8080."
33
+ Answer: "I cannot find that information."
34
+
35
+ Score=0 — the reference's single point (port=8080) is not in the answer.
36
+
37
+ ## Score 1
38
+
39
+ Some but not all of the reference's points are present.
40
+
41
+ ### Example C — partial coverage
42
+
43
+ Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage."
44
+ Answer: "StatefulSet pods get ordinal indices."
45
+
46
+ Score=1 — one of three points covered.
47
+
48
+ ### Example D — half a comparison
49
+
50
+ Reference: "Deployments manage stateless replicas; StatefulSets manage stateful pods with stable identities."
51
+ Answer: "Deployments manage stateless replicas with rolling updates."
52
+
53
+ Score=1 — Deployment side covered, StatefulSet side missing.
54
+
55
+ ## Score 2
56
+
57
+ All of the reference's key points are present (paraphrase allowed).
58
+
59
+ ### Example E — full coverage with paraphrase
60
+
61
+ Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage."
62
+ Answer: "Each pod gets an ordinal number, a stable DNS name, and storage that survives restarts."
63
+
64
+ Score=2 — all three points covered with paraphrase.
65
+
66
+ ### Example F — full coverage of single-fact reference
67
+
68
+ Reference: "The default port is 8080."
69
+ Answer: "Port 8080."
70
+
71
+ Score=2 — the only reference point is covered.
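Reviewer sketch, not part of the commit: the three-point scale in this rubric maps a coverage count onto levels 0, 1, and 2. The function name `completeness_level` and the idea of counting reference points explicitly are assumptions for illustration; in the commit the judgment is made by an LLM judge reading the rubric, not by this arithmetic.

```python
def completeness_level(covered: int, total: int) -> int:
    """Map 'covered of total reference points' to the rubric's 0/1/2 levels."""
    if total <= 0:
        raise ValueError("reference must contain at least one point")
    if not 0 <= covered <= total:
        raise ValueError("covered must be between 0 and total")
    if covered == 0:
        return 0  # none of the reference's points are present
    if covered == total:
        return 2  # all points present (paraphrase allowed)
    return 1      # some but not all points present


if __name__ == "__main__":
    # Example C: one of three StatefulSet points covered.
    print(completeness_level(1, 3))
    # Example F: the single reference point is covered.
    print(completeness_level(1, 1))
```

Note that a single-fact reference can only score 0 or 2 under this mapping, which matches Examples B and F.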
agent_bench/evaluation/rubrics/groundedness.md ADDED
@@ -0,0 +1,142 @@
1
+ ---
2
+ dimension: groundedness
3
+ scale: binary
4
+ reference_based: true
5
+ abstain_allowed: true
6
+ ---
7
+
8
+ # Groundedness (binary)
9
+
10
+ Score whether **every claim** in the agent's answer is entailed by the gold
11
+ source snippets attached to this item. Paraphrase is allowed; what matters
12
+ is content equivalence, not surface form.
13
+
14
+ ## Reference scope (strict, clarified in v1.1)
15
+
16
+ Reference scope is the **gold snippets only**, not the broader corpus, not
17
+ the retrieved chunks, not the LLM's general domain knowledge. A claim that
18
+ is factually correct in the world but not entailed by the snippets **must
19
+ score 0**. The "must" forecloses the "well, mostly grounded" reading: a
20
+ single ungrounded claim in an otherwise solid answer fails the binary
21
+ rubric.
22
+
23
+ The strict-entailment posture is a methodological choice. A claim that is
24
+ correct because the model happened to know it isn't grounded — it's lucky.
25
+ Strict-snippet groundedness measures *retrieval-grounded behavior*, not
26
+ LLM general knowledge passing through a RAG harness.
27
+
28
+ ## Trivial inference is entailment
29
+
30
+ Some surface-form variations of a snippet's content are entailment, not
31
+ new claims. The test is **content equivalence**, not surface form:
32
+
33
+ - **Paraphrase.** "X causes Y" ↔ "Y is caused by X".
34
+ - **Unit conversion.** "600 seconds" ↔ "10 minutes".
35
+ - **Syntactic variation.** Pluralization, tense, voice, declarative ↔ imperative.
36
+ - **Canonical name of the snippet's concept.** When the snippet describes
37
+ a field, header, or API element by configuration syntax (e.g., a
38
+ `max_age` table row), the canonical name (`Access-Control-Max-Age` HTTP
39
+ header) is the same content in different surface form. This is a
40
+ separate carve-out from pure paraphrase: it admits domain knowledge
41
+ tightly bound to the snippet's referent.
42
+
43
+ > **v1.2 debt.** The trivial-inference clause — especially the
44
+ > canonical-name carve-out — is the strictest-rubric concession most
45
+ > likely to require revision in v1.2. If labelers find themselves
46
+ > applying it broadly to rescue answers from score-0, the clause is
47
+ > too permissive and should be tightened.
48
+
49
+ **When to abstain (`"Unknown"`)**: if the answer is a refusal ("I don't
50
+ know" / "not in the documentation") and there is nothing to ground,
51
+ abstain with `"Unknown"` rather than scoring 1.
52
+
53
+ ## Score 0
54
+
55
+ At least one claim in the answer is not entailed by any snippet, after
56
+ applying the trivial-inference clause.
57
+
58
+ ### Example A — calibration anchor `k8s_006` (dramatic over-extension)
59
+
60
+ Question: "What is a ConfigMap in Kubernetes and what kind of data should you store in it?"
61
+
62
+ Snippet: "A ConfigMap is an API object used to store non-confidential data in key-value pairs."
63
+
64
+ Answer (excerpted): The agent gives a comprehensive multi-section answer
65
+ covering (i) the definition, (ii) three consumption methods (env vars,
66
+ command-line args, volumes), (iii) a warning not to store
67
+ passwords/tokens/certificates, (iv) a recommendation to use Secrets
68
+ instead, and (v) details about `data` and `binaryData` fields.
69
+
70
+ Thinking trace: Score = 0. Only the definition (i) is entailed by the
71
+ snippet. Claims (ii)–(v) are factually correct against the underlying
72
+ `k8s_configmap.md` doc, but **none are entailed by the one-sentence
73
+ snippet**. The snippet does not describe consumption methods, security
74
+ guidance, or schema fields. The strict-conjunction rule applies: even
75
+ though most of the answer is well-supported by the broader corpus, the
76
+ gold-snippet scope is what the rubric measures, and the answer goes
77
+ dramatically beyond it.
78
+
79
+ ### Example B — calibration anchor `q006` (subtle embellishment)
80
+
81
+ Question: "How does dependency caching work in FastAPI, and how can you disable it?"
82
+
83
+ Snippet: "By default, if the same dependency is used multiple times within
84
+ a single request (e.g., both a route and a sub-dependency use
85
+ `Depends(get_db)`), FastAPI caches the result and calls the dependency
86
+ only once. To disable caching and force a fresh call each time, use
87
+ `Depends(get_db, use_cache=False)`."
88
+
89
+ Answer (excerpted): Most of the answer is entailed by the snippet — the
90
+ default-caching behavior, the once-per-request semantics, the
91
+ `use_cache=False` disable mechanism, and an example. But the answer also
92
+ states that caching is "particularly useful for expensive operations like
93
+ database connections."
94
+
95
+ Thinking trace: Score = 0. The "particularly useful for expensive
96
+ operations like database connections" claim is **not entailed** by the
97
+ snippet. The snippet describes *what* caching does (a mechanism); the
98
+ claim adds *why* it is useful (a use-case argument the snippet does not
99
+ make). The example uses `Depends(get_db)` as an identifier, not as a
100
+ normative argument that caching is useful for database connections. This
101
+ is the subtle case: most of the answer is good, but a single embellishment
102
+ that imports general LLM knowledge ("expensive operations like database
103
+ connections") fails strict snippet entailment. This is the boundary case
104
+ the rubric exists to catch — easier to mis-label than `k8s_006` because
105
+ the failure is one claim in the middle of a mostly-grounded answer.
106
+
107
+ ## Score 1
108
+
109
+ Every claim in the answer is entailed by at least one snippet, including
110
+ trivial inferences (paraphrase, unit conversion, syntactic variation,
111
+ canonical-name substitution).
112
+
113
+ ### Example C — calibration anchor `q021` (trivial inference + canonical name)
114
+
115
+ Question: "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?"
116
+
117
+ Snippet (table row): `| max_age | 600 | Seconds the browser caches preflight results |`
118
+
119
+ Answer: "If the CORS `max_age` is 600 seconds, the browser caches
120
+ preflight results for 10 minutes. The `max_age` directive in the
121
+ `Access-Control-Max-Age` header specifies the maximum time (in seconds)
122
+ that the results of a preflight request can be cached. So 600 seconds
123
+ ÷ 60 seconds/minute = 10 minutes."
124
+
125
+ Thinking trace: Score = 1. Every claim is entailed under the
126
+ trivial-inference clause:
127
+
128
+ - "max_age=600 → 10 minutes" — unit conversion.
129
+ - "Access-Control-Max-Age header" — canonical-name carve-out: the snippet
130
+ describes `max_age` by its configuration syntax; the answer uses the
131
+ canonical HTTP header name for the same field.
132
+ - "specifies the maximum time (in seconds)" — paraphrase of "Seconds the
133
+ browser caches preflight results".
134
+ - "600 ÷ 60 = 10 minutes" — arithmetic, the same trivial-inference class
135
+ as unit conversion.
136
+
137
+ The canonical-name carve-out is doing the heaviest lifting in this
138
+ example. Without it, "Access-Control-Max-Age" would be ungrounded
139
+ (domain knowledge not in the snippet text). With it, the answer is a
140
+ clean strict-snippet pass. This is exactly the v1.2-debt sentence above
141
+ — if many future labels rescue score-1 via canonical-name appeals, the
142
+ clause is over-rescuing and should be tightened.
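Reviewer sketch, not part of the commit: the strict-conjunction rule in this rubric reduces to "score 1 only if every claim is entailed, abstain if there is nothing to ground". The function `groundedness_score` and the per-claim boolean list are hypothetical; deciding whether each claim is entailed (including the trivial-inference clause) is the judge's job and is not modelled here.

```python
from __future__ import annotations

from typing import Literal


def groundedness_score(claims_entailed: list[bool]) -> int | Literal["Unknown"]:
    """Binary groundedness under strict snippet entailment.

    claims_entailed holds one flag per claim in the answer, True when the
    claim is entailed by a gold snippet. An empty list models a refusal
    with nothing to ground, which the rubric treats as an abstain.
    """
    if not claims_entailed:
        return "Unknown"
    return 1 if all(claims_entailed) else 0


if __name__ == "__main__":
    # Shaped like anchor k8s_006: only the definition is entailed.
    print(groundedness_score([True, False, False, False, False]))
    # Shaped like anchor q006: one embellishment in a grounded answer.
    print(groundedness_score([True, True, True, True, False]))
    # Shaped like anchor q021: all four claims pass.
    print(groundedness_score([True, True, True, True]))
```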
agent_bench/evaluation/rubrics/relevance.md ADDED
@@ -0,0 +1,74 @@
1
+ ---
2
+ dimension: relevance
3
+ scale: three_point
4
+ reference_based: false
5
+ abstain_allowed: true
6
+ ---
7
+
8
+ # Relevance (three-point)
9
+
10
+ Does the agent's answer address the user's question? This is reference-free
11
+ — the judge sees only the question and the answer, not gold snippets or a
12
+ reference answer. Score the topic-match, not the truth-value.
13
+
14
+ ## Score 0
15
+
16
+ Off-topic. The answer addresses a different question, is unintelligible,
17
+ or is a refusal that does not engage with the question's premise.
18
+
19
+ ### Example A — wrong topic
20
+
21
+ Question: "How do I deploy to Kubernetes?"
22
+ Answer: "Python virtual environments isolate dependencies between projects."
23
+
24
+ Score=0 — the answer is about Python venvs, not Kubernetes deployment.
25
+
26
+ ### Example B — refusal that ignores the question
27
+
28
+ Question: "What's the default replica count for a StatefulSet?"
29
+ Answer: "I cannot help with that request."
30
+
31
+ Score=0 — the refusal does not engage with the StatefulSet topic. A
32
+ proper grounded refusal ("the documentation does not specify a default
33
+ replica count for StatefulSets") would score higher.
34
+
35
+ ## Score 1
36
+
37
+ Partially relevant. The answer touches the question's topic but misses
38
+ the core ask, or addresses a related-but-different question.
39
+
40
+ ### Example C — adjacent but off-target
41
+
42
+ Question: "How do I deploy a StatefulSet?"
43
+ Answer: "Kubernetes runs containerized workloads on a cluster of nodes."
44
+
45
+ Score=1 because it's about Kubernetes but doesn't address StatefulSet
46
+ deployment specifically.
47
+
48
+ ### Example D — answers a sibling question
49
+
50
+ Question: "What's the difference between Deployment and StatefulSet?"
51
+ Answer: "A Deployment manages stateless replicas with rolling updates."
52
+
53
+ Score=1 because it describes Deployment but doesn't compare it to
54
+ StatefulSet — only half the question is addressed.
55
+
56
+ ## Score 2
57
+
58
+ Directly addresses the question's core ask.
59
+
60
+ ### Example E — on-target single-fact answer
61
+
62
+ Question: "What's the default port for kubelet?"
63
+ Answer: "Port 10250."
64
+
65
+ Score=2 because it directly answers the question.
66
+
67
+ ### Example F — on-target comparison
68
+
69
+ Question: "What's the difference between Deployment and StatefulSet?"
70
+ Answer: "Deployments manage stateless, interchangeable pods with rolling
71
+ updates; StatefulSets manage stateful pods with stable identities,
72
+ ordered rollouts, and persistent per-pod storage."
73
+
74
+ Score=2 — both sides of the comparison are addressed.
agent_bench/evaluation/variance/__init__.py ADDED
@@ -0,0 +1,9 @@
1
+ """Variance-control wrappers around Judge instances."""
2
+
3
+ from agent_bench.evaluation.variance.jury import Jury, jury
4
+ from agent_bench.evaluation.variance.rubric_permute import (
5
+ PermutedJudge,
6
+ rubric_permute,
7
+ )
8
+
9
+ __all__ = ["Jury", "PermutedJudge", "jury", "rubric_permute"]
agent_bench/evaluation/variance/jury.py ADDED
@@ -0,0 +1,181 @@
1
+ """Jury — multi-judge aggregator with strict-quorum default and sidecar."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import asyncio
6
+ from pathlib import Path
7
+ from typing import TYPE_CHECKING, Literal
8
+
9
+ import structlog
10
+
11
+ from agent_bench.evaluation.judges.base import Judge, ScoreResult
12
+ from agent_bench.evaluation.variance.rubric_permute import _aggregate_scores
13
+
14
+ if TYPE_CHECKING:
15
+ from agent_bench.agents.orchestrator import AgentResponse
16
+ from agent_bench.evaluation.harness import GoldenQuestion
17
+
18
+ _DEFAULT_SIDECAR_TEMPLATE = "results/calibration_v1_judge_{aggregation}_members.jsonl"
19
+
20
+ logger = structlog.get_logger()
21
+
22
+
23
+ def _discretize_mean(mean: float, scale: str) -> int:
24
+ """Discretize a float mean to a discrete level per scale, ties → lower
25
+ (mirrors `_aggregate_scores`'s policy without going through int(round())
26
+ which would invoke Python's banker's rounding and silently violate the
27
+ tie-breaking contract).
28
+ """
29
+ if scale == "binary":
30
+ return 1 if mean > 0.5 else 0
31
+ floor = int(mean)
32
+ frac = mean - floor
33
+ return floor + 1 if frac > 0.5 else floor
34
+
35
+
36
+ class Jury:
37
+ """Aggregates a list of Judge instances into one ScoreResult per item.
38
+
39
+ Strict quorum default (quorum = len(judges)): any member abstain →
40
+ aggregate abstain. The parameter exists in v1 so v1.1's 3-judge jury
41
+ can shift to quorum=2 (majority) without rearchitecting failure
42
+ semantics.
43
+
44
+ Per-member ScoreResults are always written to the sidecar (successes
45
+ and failure-as-abstains alike). A non-retryable provider exception in
46
+ any member propagates immediately; sibling tasks are not cancelled.
47
+ """
48
+
49
+ def __init__(
50
+ self,
51
+ judges: list[Judge],
52
+ aggregation: Literal["mean", "kappa_weighted"],
53
+ weights: dict[str, float] | None = None,
54
+ quorum: int | None = None,
55
+ sidecar_path: Path | str | None = None,
56
+ ) -> None:
57
+ if not judges:
58
+ raise ValueError("jury requires at least one judge")
59
+ if aggregation == "kappa_weighted" and not weights:
60
+ raise ValueError(
61
+ "kappa_weighted aggregation requires explicit weights "
62
+ "(computed offline on calibration set; not at jury construction)"
63
+ )
64
+ self.judges = judges
65
+ self.aggregation = aggregation
66
+ self.weights = weights or {}
67
+ self.quorum = quorum if quorum is not None else len(judges)
68
+ self.sidecar_path = (
69
+ Path(sidecar_path)
70
+ if sidecar_path is not None
71
+ else Path(_DEFAULT_SIDECAR_TEMPLATE.format(aggregation=aggregation))
72
+ )
73
+ self.judge_id = f"jury_v1_{aggregation}"
74
+
75
+ async def score(
76
+ self,
77
+ item: "GoldenQuestion",
78
+ output: "AgentResponse",
79
+ ) -> ScoreResult:
80
+ # return_exceptions=False → first exception propagates; siblings keep running
81
+ member_results: list[ScoreResult] = await asyncio.gather(
82
+ *[j.score(item, output) for j in self.judges],
83
+ return_exceptions=False,
84
+ )
85
+
86
+ # Sidecar (append; one line per member per call)
87
+ self.sidecar_path.parent.mkdir(parents=True, exist_ok=True)
88
+ with self.sidecar_path.open("a", encoding="utf-8") as f:
89
+ for r in member_results:
90
+ f.write(r.model_dump_json() + "\n")
91
+
92
+ successful = [r for r in member_results if not r.abstained]
93
+ sys_hash = member_results[0].system_output_hash
94
+
95
+ if len(successful) < self.quorum:
96
+ return ScoreResult(
97
+ reasoning=(
98
+ f"jury_below_quorum: {len(successful)}/{len(self.judges)} "
99
+ f"members succeeded; required {self.quorum}"
100
+ ),
101
+ evidence_quotes=[],
102
+ score="Unknown",
103
+ judge_id=self.judge_id,
104
+ rubric_version=member_results[0].rubric_version,
105
+ prompt_seed=0,
106
+ system_output_hash=sys_hash,
107
+ cost_usd=sum(r.cost_usd for r in member_results),
108
+ latency_ms=max(r.latency_ms for r in member_results),
109
+ )
110
+
111
+ # Aggregate over successful members
112
+ scores = [int(r.score) for r in successful]
113
+ scale = self.judges[0].rubric.scale
114
+ applied_weights: list[float] = []
115
+ if self.aggregation == "mean":
116
+ agg = _aggregate_scores(scores, scale)
117
+ else: # kappa_weighted
118
+ # Weight successful members by judge_id. v1.1: missing weight is
119
+ # a hard error (was a silent fallback to 1.0 in v1, which let an
120
+ # asymmetric weights source amplify the unweighted member rather
121
+ # than suppressing it — see the v1.1 jury-rescue entry in
122
+ # DECISIONS.md for the calibration evidence).
123
+ missing = [r.judge_id for r in successful if r.judge_id not in self.weights]
124
+ if missing:
125
+ raise ValueError(
126
+ f"jury kappa_weighted: weights dict missing entries for "
127
+ f"member judge_ids {sorted(set(missing))}. Configured "
128
+ f"weights cover {sorted(self.weights.keys())}. "
129
+ f"v1.1 requires symmetric coverage — every jury member "
130
+ f"must have an explicit weight in the source. The v1 "
131
+ f"silent fallback to 1.0 was a documented contract "
132
+ f"violation that masked the source's asymmetric coverage."
133
+ )
134
+ for r in successful:
135
+ applied_weights.append(self.weights[r.judge_id])
136
+ weighted_sum = sum(s * w for s, w in zip(scores, applied_weights))
137
+ weight_total = sum(applied_weights)
138
+ weighted_mean = (
139
+ weighted_sum / weight_total if weight_total > 0 else 0.0
140
+ )
141
+ # Discretize via the shared ties-to-lower policy (NOT int(round())
142
+ # which uses banker's rounding and would diverge from the `mean`
143
+ # path on half-integer aggregates).
144
+ agg = _discretize_mean(weighted_mean, scale)
145
+
146
+ # Reasoning string reports the per-member weights actually applied,
147
+ # not the constructor's dict: the dict may also hold weights for
148
+ # members that abstained, and since v1.1 a missing weight raises
149
+ # above instead of silently falling back to 1.0.
150
+ weights_str = applied_weights if self.aggregation == "kappa_weighted" else "n/a"
151
+ return ScoreResult(
152
+ reasoning=(
153
+ f"jury_{self.aggregation}: "
154
+ f"members={[r.score for r in successful]}, "
155
+ f"weights={weights_str}"
156
+ ),
157
+ evidence_quotes=[],
158
+ score=agg,
159
+ judge_id=self.judge_id,
160
+ rubric_version=member_results[0].rubric_version,
161
+ prompt_seed=0,
162
+ system_output_hash=sys_hash,
163
+ cost_usd=sum(r.cost_usd for r in member_results),
164
+ latency_ms=max(r.latency_ms for r in member_results),
165
+ )
166
+
167
+
168
+ def jury(
169
+ judges: list[Judge],
170
+ aggregation: Literal["mean", "kappa_weighted"],
171
+ weights: dict[str, float] | None = None,
172
+ quorum: int | None = None,
173
+ sidecar_path: Path | str | None = None,
174
+ ) -> Jury:
175
+ return Jury(
176
+ judges=judges,
177
+ aggregation=aggregation,
178
+ weights=weights,
179
+ quorum=quorum,
180
+ sidecar_path=sidecar_path,
181
+ )
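Reviewer sketch, not part of the commit: `Jury.score` relies on `asyncio.gather(..., return_exceptions=False)`. With that setting the first exception is propagated to the caller, but the other awaitables are not cancelled and continue to run. The coroutines `fails` and `slow` below are hypothetical stand-ins for jury members; the timings are arbitrary.

```python
import asyncio


async def fails() -> None:
    """Stand-in for a member whose provider raises a non-retryable error."""
    await asyncio.sleep(0.01)
    raise RuntimeError("provider error")


async def slow(done: list[str]) -> None:
    """Stand-in for a sibling member that is still in flight."""
    await asyncio.sleep(0.05)
    done.append("finished")


async def main() -> tuple[bool, list[str]]:
    done: list[str] = []
    sibling = asyncio.ensure_future(slow(done))
    try:
        await asyncio.gather(fails(), sibling, return_exceptions=False)
    except RuntimeError:
        pass  # the first exception reaches the caller immediately
    cancelled_after_error = sibling.cancelled()
    await asyncio.sleep(0.1)  # give the sibling time to finish
    return cancelled_after_error, done


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If cancel-on-first-failure is the intended contract, `asyncio.TaskGroup` (Python 3.11+) or explicit `task.cancel()` calls in an `except` block would provide it; that is a design option, not something this commit does.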
agent_bench/evaluation/variance/rubric_permute.py ADDED
@@ -0,0 +1,109 @@
1
+ """rubric_permute — runs the same judge with permuted rubric levels and aggregates."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from pathlib import Path
6
+ from typing import TYPE_CHECKING, Literal
7
+
8
+ from agent_bench.evaluation.judges.base import Judge, ScoreResult
9
+
10
+ if TYPE_CHECKING:
11
+ from agent_bench.agents.orchestrator import AgentResponse
12
+ from agent_bench.evaluation.harness import GoldenQuestion
13
+
14
+
15
+ def _aggregate_scores(scores: list[int], scale: str) -> int:
16
+ """Discretize aggregated score per scale.
17
+
18
+ Binary: threshold 0.5 with ties → 0 (conservative).
19
+ Three-point: round to nearest with ties → lower level (conservative).
20
+ """
21
+ mean = sum(scores) / len(scores)
22
+ if scale == "binary":
23
+ return 1 if mean > 0.5 else 0
24
+ floor = int(mean)
25
+ frac = mean - floor
26
+ if frac > 0.5:
27
+ return floor + 1
28
+ return floor
29
+
30
+
31
+ class PermutedJudge:
32
+ """Wraps a Judge; runs N permutations with different prompt_seeds.
33
+
34
+ Aggregation:
35
+ - Any abstain in any permutation → aggregate score = "Unknown".
36
+ - Otherwise, discretize the per-permutation scores per scale.
37
+
38
+ Per-permutation ScoreResults are written to the sidecar JSONL on
39
+ every score() call (one batch per call, append-mode JSONL across calls).
40
+ """
41
+
42
+ def __init__(
43
+ self,
44
+ judge: Judge,
45
+ n: int = 2,
46
+ seeds: list[int] | None = None,
47
+ sidecar_path: Path | str | None = None,
48
+ ) -> None:
49
+ self.judge = judge
50
+ self.n = n
51
+ self.seeds = seeds if seeds is not None else list(range(1, n + 1))
52
+ if len(self.seeds) != n:
53
+ raise ValueError(f"seeds length {len(self.seeds)} != n {n}")
54
+ self.sidecar_path = Path(sidecar_path) if sidecar_path else None
55
+ self.judge_id = f"{judge.judge_id}_perm{n}"
56
+
57
+ async def score(
58
+ self,
59
+ item: "GoldenQuestion",
60
+ output: "AgentResponse",
61
+ ) -> ScoreResult:
62
+ per_perm_results: list[ScoreResult] = []
63
+ for seed in self.seeds:
64
+ r = await self.judge.score(item, output, prompt_seed=seed)
65
+ per_perm_results.append(r)
66
+
67
+ if self.sidecar_path is not None:
68
+ self.sidecar_path.parent.mkdir(parents=True, exist_ok=True)
69
+ with self.sidecar_path.open("a", encoding="utf-8") as f:
70
+ for r in per_perm_results:
71
+ f.write(r.model_dump_json() + "\n")
72
+
73
+ any_abstain = any(r.abstained for r in per_perm_results)
74
+ if any_abstain:
75
+ score: int | Literal["Unknown"] = "Unknown"
76
+ reasoning = (
77
+ f"any_abstain_propagated: "
78
+ f"{sum(1 for r in per_perm_results if r.abstained)}/{self.n} "
79
+ f"permutations abstained"
80
+ )
81
+ else:
82
+ score = _aggregate_scores(
83
+ [int(r.score) for r in per_perm_results],
84
+ self.judge.rubric.scale,
85
+ )
86
+ reasoning = (
87
+ f"perm_mean over {self.n} seeds: {[r.score for r in per_perm_results]}"
88
+ )
89
+
90
+ return ScoreResult(
91
+ reasoning=reasoning,
92
+ evidence_quotes=[],
93
+ score=score,
94
+ judge_id=self.judge_id,
95
+ rubric_version=self.judge.rubric.source_hash,
96
+ prompt_seed=0,
97
+ system_output_hash=per_perm_results[0].system_output_hash,
98
+ cost_usd=sum(r.cost_usd for r in per_perm_results),
99
+ latency_ms=sum(r.latency_ms for r in per_perm_results),
100
+ )
101
+
102
+
103
+ def rubric_permute(
104
+ judge: Judge,
105
+ n: int = 2,
106
+ seeds: list[int] | None = None,
107
+ sidecar_path: Path | str | None = None,
108
+ ) -> PermutedJudge:
109
+ return PermutedJudge(judge=judge, n=n, seeds=seeds, sidecar_path=sidecar_path)
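Reviewer sketch, not part of the commit: the ties-to-lower policy in `_aggregate_scores` differs from Python's built-in `round()`, which rounds half to even. The helper `discretize_ties_lower` below restates the three-point branch of the diff under a hypothetical name so the two can be compared side by side.

```python
def discretize_ties_lower(mean: float) -> int:
    """Round to the nearest level, sending exact .5 ties to the lower level."""
    floor = int(mean)
    frac = mean - floor
    return floor + 1 if frac > 0.5 else floor


def compare(scores: list[int]) -> tuple[int, int]:
    """Return (ties-to-lower result, built-in round() result) for a score list."""
    mean = sum(scores) / len(scores)
    return discretize_ties_lower(mean), round(mean)


if __name__ == "__main__":
    # Two permutations disagreeing on a three-point scale: mean is 1.5.
    print(compare([1, 2]))
    # Mean 0.5: both policies happen to agree here.
    print(compare([0, 1]))
```

The two policies diverge only on half-integer means whose floor is odd (1.5 on the three-point scale), which is exactly the case the conservative tie-break is meant to pin down.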
agent_bench/serving/static/index.html CHANGED
@@ -721,6 +721,141 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
721
  border: 1px solid var(--rule-2); background: var(--paper); color: var(--ink);
722
  }
723
 
724
  /* ── Responsive ────────────────────────────── */
725
  @media (max-width: 880px){
726
  .grid{grid-template-columns: 1fr}
@@ -742,6 +877,7 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
742
  <div class="wordmark">agent-bench</div>
743
  <nav>
744
  <a href="#demo">Demo</a>
 
745
  <a href="#findings">Findings</a>
746
  <a href="#log">Log</a>
747
  <a href="https://github.com/tyy0811/agent-bench" target="_blank" rel="noopener">GitHub ↗</a>
@@ -933,6 +1069,62 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
933
  </div>
934
  </section>
935
 
936
  <!-- Findings -->
937
  <section class="section" id="findings">
938
  <div class="section-head">
@@ -1028,6 +1220,49 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
1028
  </div>
1029
  </section>
1030
 
1031
  <!-- Footer -->
1032
  <footer class="footer">
1033
  <div class="who">agent-bench · MIT · built by Jane Yeung · Munich</div>
 
721
  border: 1px solid var(--rule-2); background: var(--paper); color: var(--ink);
722
  }
723
 
724
+ /* ── Evaluation harness section ───────────── */
725
+ .harness-intro{
726
+ display: grid; grid-template-columns: 1.2fr 1fr; gap: 32px;
727
+ align-items: end; margin-bottom: 28px;
728
+ }
729
+ .harness-intro p{color: var(--ink-2); max-width: 56ch}
730
+ .harness-intro .sig{
731
+ font-family: var(--font-mono); font-size: 0.74rem; color: var(--ink-3);
732
+ display: flex; flex-direction: column; gap: 4px;
733
+ }
734
+ .harness-intro .sig b{color: var(--ink); font-weight: 600}
735
+
736
+ .rubric-grid{
737
+ display: grid; grid-template-columns: repeat(4, 1fr);
738
+ gap: 1px; background: var(--rule);
739
+ border: 1px solid var(--rule);
740
+ }
741
+ .rubric{
742
+ background: var(--paper); padding: 22px 20px;
743
+ display: flex; flex-direction: column; gap: 10px;
744
+ }
745
+ .rubric .dim{
746
+ font-family: var(--font-mono); font-size: 0.7rem; font-weight: 600;
747
+ letter-spacing: 0.1em; text-transform: uppercase; color: var(--ink);
748
+ }
749
+ .rubric .scale{
750
+ font-family: var(--font-mono); font-size: 0.7rem; color: var(--ink-3);
751
+ display: flex; gap: 8px; flex-wrap: wrap;
752
+ }
753
+ .rubric .scale span{border: 1px solid var(--rule); padding: 1px 6px}
754
+ .rubric .scale span.on{border-color: var(--ink); color: var(--ink)}
755
+ .rubric .desc{
756
+ font-size: 0.86rem; color: var(--ink-2); line-height: 1.5;
757
+ }
758
+ .rubric .anchor{
759
+ font-family: var(--font-mono); font-size: 0.72rem;
760
+ border-left: 2px solid var(--rule-2); padding: 8px 10px;
761
+ background: var(--paper-2); color: var(--ink-2); line-height: 1.5;
762
+ margin-top: auto;
763
+ }
764
+ .rubric .anchor b{color: var(--ink); font-weight: 600; font-size: 0.7rem; letter-spacing: 0.06em}
765
+
766
+ /* Compact one-row κ summary that lives above Findings (deep table is in appendix) */
767
+ .kappa-summary{
768
+ margin-top: 22px; border: 1px solid var(--rule);
769
+ padding: 14px 18px;
770
+ display: flex; flex-direction: column; gap: 10px;
771
+ background: var(--paper-2);
772
+ }
773
+ .kappa-summary .ks-head{
774
+ font-family: var(--font-mono); font-size: 0.7rem; font-weight: 600;
775
+ letter-spacing: 0.1em; text-transform: uppercase; color: var(--ink-3);
776
+ }
777
+ .kappa-summary .ks-head .ks-sub{
778
+ letter-spacing: 0.04em; text-transform: none; color: var(--ink-3);
779
+ font-weight: 400; margin-left: 4px;
780
+ }
781
+ .kappa-summary .ks-row{
782
+ display: flex; flex-wrap: wrap; align-items: baseline; gap: 22px;
783
+ font-family: var(--font-mono); font-size: 0.85rem;
784
+ font-feature-settings: "tnum","zero";
785
+ }
786
+ .kappa-summary .ks-stat{display: flex; align-items: baseline; gap: 8px}
787
+ .kappa-summary .ks-stat .k{color: var(--ink-3); font-size: 0.78rem}
788
+ .kappa-summary .ks-stat .v{color: var(--ink); font-weight: 600}
789
+ .kappa-summary .ks-stat .v.win{color: var(--ok)}
790
+ .kappa-summary .ks-link{
791
+ margin-left: auto; font-size: 0.78rem; color: var(--ink-2);
792
+ border-bottom: 1px solid var(--rule-2);
793
+ }
794
+ .kappa-summary .ks-link:hover{color: var(--ink); border-color: var(--ink)}
795
+
796
+ .kappa-wrap{
797
+ margin-top: 28px; border: 1px solid var(--rule);
798
+ display: grid; grid-template-columns: 1.4fr 1fr;
799
+ }
800
+ .kappa-table{
801
+ border-right: 1px solid var(--rule);
802
+ padding: 22px 24px;
803
+ }
804
+ .kappa-table h4{
805
+ font-family: var(--font-mono); font-size: 0.72rem; font-weight: 600;
806
+ letter-spacing: 0.12em; text-transform: uppercase; color: var(--ink-3);
807
+ margin-bottom: 14px;
808
+ }
809
+ .kappa-table table{width: 100%; border-collapse: collapse; font-family: var(--font-mono); font-size: 0.78rem}
810
+ .kappa-table th, .kappa-table td{
811
+ text-align: left; padding: 7px 10px; border-bottom: 1px solid var(--rule);
812
+ font-feature-settings: "tnum","zero";
813
+ }
814
+ .kappa-table th{
815
+ font-weight: 600; color: var(--ink-3); font-size: 0.68rem;
816
+ letter-spacing: 0.08em; text-transform: uppercase;
817
+ }
818
+ .kappa-table td.num{text-align: right; color: var(--ink)}
819
+ .kappa-table td.num.win{color: var(--ok); font-weight: 600}
820
+ .kappa-table tr.config-row td{background: var(--paper)}
821
+ .kappa-table tr:last-child td{border-bottom: none}
822
+ .kappa-note{
823
+ font-family: var(--font-ui); font-size: 0.78rem; color: var(--ink-3);
824
+ margin-top: 10px; line-height: 1.5; max-width: 60ch;
825
+ }
826
+
827
+ .variance{
828
+ padding: 22px 24px;
829
+ display: flex; flex-direction: column; gap: 14px;
830
+ background: var(--paper-2);
831
+ }
832
+ .variance h4{
833
+ font-family: var(--font-mono); font-size: 0.72rem; font-weight: 600;
834
+ letter-spacing: 0.12em; text-transform: uppercase; color: var(--ink-3);
835
+ }
836
+ .variance .v-row{
837
+ display: flex; flex-direction: column; gap: 4px;
838
+ padding: 12px 14px; background: var(--paper); border: 1px solid var(--rule);
839
+ }
840
+ .variance .v-row .name{
841
+ font-family: var(--font-mono); font-size: 0.82rem; font-weight: 600; color: var(--ink);
842
+ }
843
+ .variance .v-row .name code{
844
+ font-family: var(--font-mono); font-size: 0.78rem; color: var(--accent-ink);
845
+ background: var(--accent-soft); padding: 1px 5px;
846
+ }
847
+ .variance .v-row .why{
848
+ font-size: 0.82rem; color: var(--ink-2); line-height: 1.5;
849
+ }
850
+
851
+ /* Harness responsive overrides — collapse rubric grid + κ split at narrower viewport */
852
+ @media (max-width: 1000px){
853
+ .rubric-grid{grid-template-columns: repeat(2, 1fr)}
854
+ .kappa-wrap{grid-template-columns: 1fr}
855
+ .kappa-table{border-right: none; border-bottom: 1px solid var(--rule)}
856
+ .harness-intro{grid-template-columns: 1fr; gap: 16px}
857
+ }
858
+
859
  /* ── Responsive ────────────────────────────── */
860
  @media (max-width: 880px){
861
  .grid{grid-template-columns: 1fr}
 
877
  <div class="wordmark">agent-bench</div>
878
  <nav>
879
  <a href="#demo">Demo</a>
880
+ <a href="#harness">Harness</a>
881
  <a href="#findings">Findings</a>
882
  <a href="#log">Log</a>
883
  <a href="https://github.com/tyy0811/agent-bench" target="_blank" rel="noopener">GitHub ↗</a>
 
1069
  </div>
1070
  </section>
1071
 
1072
+ <!-- Evaluation harness (LLM-as-judge methodology) -->
1073
+ <section class="section" id="harness">
1074
+ <div class="section-head">
1075
+ <h2>How we grade it</h2>
1076
+ <span class="sub">4 anchored rubrics · LLM-as-judge · κ-calibrated against human labels</span>
1077
+ </div>
1078
+
1079
+ <div class="harness-intro">
1080
+ <p class="deck">Benchmark numbers are only as good as the grader. Each answer is scored by an LLM judge against an anchored markdown rubric — strict scope, fixed scale, abstain-allowed — and the judges themselves are calibrated against human labels on a held-out set before they're trusted on the main run.</p>
1081
+ <div class="sig">
1082
+ <span><b>30</b> calibration items · human-labeled</span>
1083
+ <span><b>v1.1</b> rubric · sha-pinned per result</span>
1084
+ <span>headline metric: <b>Cohen's κ</b> · <b>Gwet's AC1</b> on prevalence-skewed dims</span>
1085
+ </div>
1086
+ </div>
1087
+
1088
+ <!-- Rubric cards -->
1089
+ <div class="rubric-grid">
1090
+ <div class="rubric">
1091
+ <div class="dim">Groundedness</div>
1092
+ <div class="scale"><span class="on">0</span><span class="on">1</span><span>abstain</span></div>
1093
+ <div class="desc">Every claim must be entailed by gold snippets. A claim that's correct in the world but not in the snippets scores 0 — strict-snippet measures retrieval-grounded behavior, not LLM general knowledge passing through.</div>
1094
+ <div class="anchor"><b>ANCHOR · q006</b><br>Answer adds "particularly useful for expensive operations like database connections" — not in snippet → 0.</div>
1095
+ </div>
1096
+ <div class="rubric">
1097
+ <div class="dim">Relevance</div>
1098
+ <div class="scale"><span class="on">0</span><span class="on">1</span><span class="on">2</span><span>abstain</span></div>
1099
+ <div class="desc">Reference-free. Does the answer address the user's question? Score the topic-match, not the truth-value. A refusal that doesn't engage with the premise scores 0.</div>
1100
+ <div class="anchor"><b>ANCHOR</b><br>Q: "How do I deploy to Kubernetes?"<br>A: "Python virtual environments isolate dependencies." → 0.</div>
1101
+ </div>
1102
+ <div class="rubric">
1103
+ <div class="dim">Completeness</div>
1104
+ <div class="scale"><span class="on">0</span><span class="on">1</span><span class="on">2</span><span>abstain</span></div>
1105
+ <div class="desc">Reference-based against gold answer. Score coverage of the reference's key points only — extra correct detail isn't penalized here.</div>
1106
+ <div class="anchor"><b>ANCHOR</b><br>Reference covers ordinal, hostname, storage. Answer covers ordinal, hostname only → 1.</div>
1107
+ </div>
1108
+ <div class="rubric">
1109
+ <div class="dim">Citation faithfulness</div>
1110
+ <div class="scale"><span class="on">0</span><span class="on">1</span><span>abstain</span></div>
1111
+ <div class="desc">For every <code>[source: X.md]</code> in the answer, does the cited chunk actually support the claim next to it? <b>All-or-nothing</b> per item — one bad citation fails the whole answer.</div>
1112
+ <div class="anchor"><b>ANCHOR</b><br>Claim: "default port is 8080." Cited chunk: about OAuth and SAML auth → 0 (citation drift).</div>
1113
+ </div>
1114
+ </div>
1115
+
1116
+ <!-- Compact κ summary → deep methodology lives in the appendix below the log -->
1117
+ <div class="kappa-summary">
1118
+ <div class="ks-head">Inter-rater agreement vs. human labels <span class="ks-sub">(calibration v1, baseline)</span></div>
1119
+ <div class="ks-row">
1120
+ <div class="ks-stat"><span class="k">groundedness</span><span class="v win">AC1 = 1.000</span></div>
1121
+ <div class="ks-stat"><span class="k">relevance</span><span class="v win">AC1 = 0.964</span></div>
1122
+ <div class="ks-stat"><span class="k">completeness</span><span class="v">κ = 0.416</span></div>
1123
+ <a class="ks-link" href="#harness-appendix">Full table + variance hardening ↓</a>
1124
+ </div>
1125
+ </div>
1126
+ </section>
1127
+
1128
  <!-- Findings -->
1129
  <section class="section" id="findings">
1130
  <div class="section-head">
 
1220
  </div>
1221
  </section>
1222
 
1223
+ <!-- Methodology appendix — deep dive that was demoted from the main flow -->
1224
+ <section class="section" id="harness-appendix">
1225
+ <div class="section-head">
1226
+ <h2>Methodology appendix</h2>
1227
+ <span class="sub">κ ablations · variance hardening · abstain semantics</span>
1228
+ </div>
1229
+
1230
+ <div class="kappa-wrap">
1231
+ <div class="kappa-table">
1232
+ <h4>κ ablation table · calibration v1</h4>
1233
+ <table>
1234
+ <thead>
1235
+ <tr><th>Configuration</th><th>Groundedness<br><span style="font-weight:400">AC1</span></th><th>Relevance<br><span style="font-weight:400">AC1</span></th><th>Completeness<br><span style="font-weight:400">κ</span></th></tr>
1236
+ </thead>
1237
+ <tbody>
1238
+ <tr><td>baseline (v1.1, anchors, CoT)</td><td class="num win">1.000</td><td class="num win">0.964</td><td class="num">0.416</td></tr>
1239
+ <tr><td>baseline · no anchors</td><td class="num">0.953</td><td class="num">0.964</td><td class="num">0.623</td></tr>
1240
+ <tr><td>baseline · no CoT</td><td class="num">0.897</td><td class="num">0.963</td><td class="num win">1.000</td></tr>
1241
+ <tr><td>permute (n=2 seeds)</td><td class="num win">1.000</td><td class="num">0.966</td><td class="num">0.506</td></tr>
1242
+ <tr><td>jury · κ-weighted (haiku + gpt-4o-mini)</td><td class="num win">1.000</td><td class="num win">1.000</td><td class="num">0.416</td></tr>
1243
+ </tbody>
1244
+ </table>
1245
+ <p class="kappa-note"><b>Reading this:</b> groundedness and relevance gold are prevalence-skewed (29×<code>0</code> / 1×<code>1</code> and 29×<code>2</code> / 1×<code>1</code> respectively), which makes Cohen's κ degenerate to ≈0 even at 95%+ raw agreement. AC1 is the right metric there. Completeness gold is balanced enough (23×<code>2</code> / 5×<code>1</code>) for κ to behave normally. The <b>no-CoT κ=1.000</b> on completeness looks like a win but comes with a 7.7% abstain rate (11.5% on groundedness); the headline is the baseline row.</p>
1246
+ </div>
1247
+
1248
+ <div class="variance">
1249
+ <h4>Variance hardening</h4>
1250
+ <div class="v-row">
1251
+ <div class="name"><code>PermutedJudge</code> · level-order permutation</div>
1252
+ <div class="why">Wrap a judge with n=2 prompt-seed permutations of the rubric's level order; aggregate by mean. Catches judges whose verdict flips when the "Score 0" anchor moves above "Score 2", which is a presentation-order artifact, not a content disagreement.</div>
1253
+ </div>
1254
+ <div class="v-row">
1255
+ <div class="name"><code>Jury</code> · κ-weighted multi-judge aggregation</div>
1256
+ <div class="why">Run the same item through claude-haiku-4-5 and gpt-4o-mini, weight each judge's vote by its calibration κ, abstain if any member abstains. Surfaces single-model bias without flattening to majority-rule, and keeps abstain as a first-class outcome.</div>
1257
+ </div>
1258
+ <div class="v-row">
1259
+ <div class="name">Abstain semantics · <code>"Unknown"</code> sentinel</div>
1260
+ <div class="why">Schema-parse failures retry once, then abstain with a typed prefix; rubric-allowed model abstains use the empty-string sentinel. The metric drops the item, doesn't pretend it scored 0 — visible in the abstain rate column above.</div>
1261
+ </div>
1262
+ </div>
1263
+ </div>
1264
+ </section>
1265
+
1266
  <!-- Footer -->
1267
  <footer class="footer">
1268
  <div class="who">agent-bench · MIT · built by Jane Yeung · Munich</div>
configs/calibration/rows/baseline.yaml ADDED
@@ -0,0 +1,14 @@
1
+ # Baseline: single Claude-Haiku judge per dimension, all variance controls on.
2
+ # CoT is implicit (the rubric prompts ask for reasoning before score).
3
+ # Anchors come from the rubric files. Abstain comes from rubric.abstain_allowed=true.
4
+
5
+ label: baseline
6
+ provider: anthropic
7
+ model_id: claude-haiku-4-5-20251001
8
+ dimensions: [groundedness, relevance, completeness]
9
+ strategy: single
10
+ options:
11
+ use_cot: true
12
+ use_anchors: true
13
+ abstain_allowed: true
14
+ output_path: results/calibration_v1_judge_baseline.json
configs/calibration/rows/baseline_no_abstain.yaml ADDED
@@ -0,0 +1,14 @@
1
+ # Ablation: rubric.abstain_allowed forced false at scoring time. Measures
2
+ # the contribution of the abstain option. Out-of-range schema violations
3
+ # (model returns "Unknown" anyway) abstain via ABSTAIN_REASON_OUT_OF_RANGE.
4
+
5
+ label: baseline_no_abstain
6
+ provider: anthropic
7
+ model_id: claude-haiku-4-5-20251001
8
+ dimensions: [groundedness, relevance, completeness]
9
+ strategy: single
10
+ options:
11
+ use_cot: true
12
+ use_anchors: true
13
+ abstain_allowed: false
14
+ output_path: results/calibration_v1_judge_baseline_no_abstain.json
configs/calibration/rows/baseline_no_anchors.yaml ADDED
@@ -0,0 +1,13 @@
1
+ # Ablation: rubric anchored examples stripped from the prompt; only the
2
+ # level descriptions are sent. Measures the contribution of anchored examples.
3
+
4
+ label: baseline_no_anchors
5
+ provider: anthropic
6
+ model_id: claude-haiku-4-5-20251001
7
+ dimensions: [groundedness, relevance, completeness]
8
+ strategy: single
9
+ options:
10
+ use_cot: true
11
+ use_anchors: false
12
+ abstain_allowed: true
13
+ output_path: results/calibration_v1_judge_baseline_no_anchors.json
configs/calibration/rows/baseline_no_cot.yaml ADDED
@@ -0,0 +1,13 @@
1
+ # Ablation: same as baseline but the judge prompt does NOT request reasoning
2
+ # before the score. Used to measure the contribution of CoT-before-score.
3
+
4
+ label: baseline_no_cot
5
+ provider: anthropic
6
+ model_id: claude-haiku-4-5-20251001
7
+ dimensions: [groundedness, relevance, completeness]
8
+ strategy: single
9
+ options:
10
+ use_cot: false
11
+ use_anchors: true
12
+ abstain_allowed: true
13
+ output_path: results/calibration_v1_judge_baseline_no_cot.json
configs/calibration/rows/jury_kappa_weighted.yaml ADDED
@@ -0,0 +1,23 @@
1
+ # 2-judge jury: Claude-Haiku + gpt-4o-mini, kappa-weighted aggregation.
2
+ # Strict quorum default (any member abstain → jury abstain).
3
+ #
4
+ # v1.1: weights are computed by `_compute_kappa_weights` from the prior
5
+ # jury-run sidecar (which has predictions from BOTH members), not the
6
+ # baseline.json (Haiku-only). v1's pointer at baseline.json was the
7
+ # asymmetric-coverage bug — see DECISIONS "v1.1 jury rescue" entry.
8
+ # This is pragmatic-circular: weights are derived from the same
9
+ # calibration set used for κ reporting; v1.2 will use a held-out set.
10
+
11
+ label: jury_kappa_weighted
12
+ strategy: jury
13
+ aggregation: kappa_weighted
14
+ quorum: null # null = strict default (= len(judges) = 2)
15
+ members:
16
+ - provider: anthropic
17
+ model_id: claude-haiku-4-5-20251001
18
+ - provider: openai
19
+ model_id: gpt-4o-mini-2024-07-18
20
+ dimensions: [groundedness, relevance, completeness]
21
+ weights_source: results/calibration_v1_judge_jury_kappa_weighted_members.jsonl
22
+ output_path: results/calibration_v1_judge_jury_kappa_weighted.json
23
+ sidecar_path: results/calibration_v1_judge_jury_kappa_weighted_members.jsonl
configs/calibration/rows/permute.yaml ADDED
@@ -0,0 +1,14 @@
1
+ # Rubric permutation: N=2 seeded prompt-level permutations per item, mean-
2
+ # aggregated. Per-permutation results land in the sidecar JSONL.
3
+
4
+ label: permute
5
+ provider: anthropic
6
+ model_id: claude-haiku-4-5-20251001
7
+ dimensions: [groundedness, relevance, completeness]
8
+ strategy: rubric_permute
9
+ options:
10
+ n_permutations: 2
11
+ seeds: [1, 2]
12
+ abstain_allowed: true
13
+ output_path: results/calibration_v1_judge_permute.json
14
+ sidecar_path: results/calibration_v1_judge_permute_members.jsonl
docs/_generated/kappa_table.md ADDED
@@ -0,0 +1,27 @@
1
+ # κ ablation table — calibration v1
2
+
3
+ Headline metric per dimension: **groundedness → AC1**, **relevance → AC1**, **completeness → κ**. AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold is prevalence-skewed enough to make Cohen's κ degenerate (groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); both dimensions reach ≥0.95 raw agreement on those rows, but Cohen's κ collapses to ≈0 because Pe approaches 1. Completeness uses Cohen's κ: its gold (23×`2`/5×`1`) is balanced enough for κ to behave normally.
4
+
5
+ | Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |
6
+ |---|---|---|---|---|---|---|
7
+ | baseline | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
8
+ | baseline | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
9
+ | baseline | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% | |
10
+ | baseline_no_abstain | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
11
+ | baseline_no_abstain | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
12
+ | baseline_no_abstain | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% | |
13
+ | baseline_no_anchors | completeness | κ | 0.623 (-0.054, 1.000) | 26 | 0.0% | |
14
+ | baseline_no_anchors | groundedness | AC1 | 0.953 (0.834, 1.000) | 24 | 7.7% | |
15
+ | baseline_no_anchors | relevance | AC1 | 0.964 (0.885, 1.000) | 29 | 3.3% | |
16
+ | baseline_no_cot | completeness | κ | 1.000 (1.000, 1.000) | 24 | 7.7% | |
17
+ | baseline_no_cot | groundedness | AC1 | 0.897 (0.707, 1.000) | 23 | 11.5% | |
18
+ | baseline_no_cot | relevance | AC1 | 0.963 (0.881, 1.000) | 28 | 6.7% | |
19
+ | jury_kappa_weighted | completeness | κ | 0.014 (-0.077, 0.112) | 26 | 0.0% | |
20
+ | jury_kappa_weighted | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
21
+ | jury_kappa_weighted | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% | |
22
+ | jury_kappa_weighted_v1_1 | completeness | κ | 0.416 (-0.068, 0.866) | 26 | 0.0% | |
23
+ | jury_kappa_weighted_v1_1 | groundedness | AC1 | 1.000 (1.000, 1.000) | 26 | 0.0% | |
24
+ | jury_kappa_weighted_v1_1 | relevance | AC1 | 1.000 (1.000, 1.000) | 30 | 0.0% | |
25
+ | permute | completeness | κ | 0.506 (-0.061, 1.000) | 26 | 0.0% | |
26
+ | permute | groundedness | AC1 | 1.000 (1.000, 1.000) | 25 | 3.8% | |
27
+ | permute | relevance | AC1 | 0.966 (0.890, 1.000) | 30 | 0.0% | |
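The degeneracy described in the headline note can be reproduced in a few lines. The sketch below is illustrative only and is not the code in `agent_bench/evaluation/calibration/metrics.py`: the function names are hypothetical, the judge predictions are invented (a judge that answers `0` on every item), and only the 29×`0` / 1×`1` gold split comes from the note above.

```python
from collections import Counter


def cohen_kappa(gold, pred):
    """Unweighted Cohen's kappa for two equal-length label lists."""
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Chance agreement: product of the two raters' marginals per category.
    p_e = sum(
        (gold_counts[c] / n) * (pred_counts[c] / n)
        for c in set(gold_counts) | set(pred_counts)
    )
    if p_e == 1.0:
        return 0.0  # convention chosen for this sketch (both raters constant)
    return (p_o - p_e) / (1 - p_e)


def gwet_ac1(gold, pred, categories):
    """Unweighted Gwet's AC1 over an explicit list of categories."""
    n = len(gold)
    q = len(categories)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Chance agreement uses the mean of the two marginals per category.
    pi = [(gold_counts[c] + pred_counts[c]) / (2 * n) for c in categories]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)


gold = [0] * 29 + [1]   # prevalence-skewed gold, as on groundedness
pred = [0] * 30         # invented judge: always answers 0
raw_agreement = sum(g == p for g, p in zip(gold, pred)) / len(gold)
kappa = cohen_kappa(gold, pred)
ac1 = gwet_ac1(gold, pred, categories=[0, 1])
print(f"raw={raw_agreement:.3f} kappa={kappa:.3f} ac1={ac1:.3f}")
```

With this made-up judge, raw agreement is 29/30 while κ is 0 and AC1 stays above 0.95, which is the pattern the note describes.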
docs/judge-design.md ADDED
@@ -0,0 +1,687 @@
1
+ # Judge Layer — calibration writeup (v1.1.1)
2
+
3
+ ## TL;DR
4
+
5
+ The v1 deliverable is a per-dimension LLM-judge layer (groundedness,
6
+ relevance, completeness) with anchored discrete rubrics, abstain
7
+ support, rubric permutation as a variance control, and a 2-judge
8
+ kappa-weighted jury. It supersedes the previous continuous-score
9
+ single-call judges. v1 was validated against a 30-item hand-labeled
10
+ calibration set spanning two corpora (FastAPI + Kubernetes); the
11
+ calibration surfaced six findings organized below as a methodology
12
+ arc rather than a flat ablation table. The interpretive headline:
13
+
14
+ - The shared retrieval stack does the heavy lifting on retrieval
15
+ metrics (P@5, R@5, KHR vary < 0.12 across all four custom/LangChain
16
+ × OpenAI/Anthropic configurations); the judge layer's value is in
17
+ *measuring* the orchestrator's grounded-citation behavior, not in
18
+ driving it.
19
+ - Calibration caught a published-rubric drift between human-grader
20
+ and rubric-as-written (22/30 disagreements at v1.0); rubric
21
+ clarification + re-labeling brought v1.1 inter-rater agreement to
22
+ 29/30 on groundedness.
23
+ - The 2-judge jury under v1's weighting pipeline fired both branches
24
+ of the design doc's tracked risk simultaneously: the weights-source
25
+ was a stub and the missing-weight fallback to 1.0 silently
26
+ amplified an unweighted member. v1.1 fixed both; the corrected
27
+ jury matches the calibrated single-judge baseline (κ 0.014 → 0.416
28
+ on completeness, no API spend).
29
+ - A second-order finding the v1 design didn't anticipate: small
30
+ models on 3-point ordinal scales with paraphrase semantics exhibit
31
+ *at least two* distinct failure modes — one rubric-positional and
32
+ prompt-engineering-fixable, one capacity-limited and only
33
+ addressable by model selection. The 4A A/B against GPT-4o (full)
34
+ is the empirical separator.
35
+ - A methodological observation that's the deepest finding of the
36
+ calibration: Cohen's κ as a jury weight has a self-defeating
37
+ property under intervention-induced marginal shifts. AC1 reads the
38
+ signal correctly. v1.2 fix-list addresses this.
39
+
40
+ The closing position is *when not to use LLM-judge*: 3-point ordinal
41
+ scoring with paraphrase semantics is at the boundary where mid-tier
42
+ models (gpt-4o-mini class) exhibit capacity limits independent of
43
+ prompt engineering, and the right architectural choice is per-
44
+ dimension judge selection rather than further prompt iteration.
45
+
46
+ ---
47
+
48
+ ## 1. Methodology arc
49
+
50
+ The findings below are ordered as the calibration produced them, not
51
+ re-ordered for clarity. Each one has its own supporting evidence
52
+ file; the κ table at `docs/_generated/kappa_table.md` is the
53
+ quantitative summary; `DECISIONS.md` carries the per-decision
54
+ rationale that informs but doesn't repeat the writeup.
55
+
56
+ ### 1.1 Rubric drift caught by frontier-model stress-test
57
+
58
+ The v1.0 hand-labeled calibration set (30 items, single-rater) ran
59
+ through a 90-cell Opus-4 stress-test (`measurements/2026-05-05-judge-
60
+ rubric-opus-stress.jsonl`, $0.20) against the published rubrics. The
61
+ test surfaced a 22/30 disagreement on groundedness — high enough to
62
+ indicate one of three things: (a) the rubric was wrong, (b) the
63
+ labels were wrong, (c) Opus was wrong.
64
+
65
+ Investigation localized the cause to a *scope mismatch* between the
66
+ rubric and the human-grader's labeling procedure. The groundedness
67
+ rubric scopes entailment to the *retrieval snippets* — a specific
68
+ binary check: every claim in the agent's answer must be entailed by
69
+ at least one retrieved snippet. The human grader had instead been
70
+ checking against *corpus documents* (which the snippets are drawn
71
+ from but which contain additional context). Under the corpus-
72
+ supported reading, claims like "useful for expensive operations like
73
+ database connections" pass; under the strict-snippets-only reading,
74
+ they fail.
75
+
76
+ The fix: the rubric was clarified with an explicit "must score 0"
77
+ reference-scope sentence, a trivial-inference clause with a
78
+ canonical-name carve-out (e.g., the snippet says "FastAPI's
79
+ `HTTPException`" and the answer says "the `HTTPException` class" —
80
+ that's still grounded), and three calibration anchors covering the
81
+ boundary cases (`q006` subtle embellishment, `k8s_006` dramatic
82
+ over-extension, `q021` trivial-inference positive).
83
+
84
+ 22 v1.0 labels were flipped against the strict rubric. v1.1 inter-
85
+ rater agreement on groundedness rose to **29/30**. The methodology
86
+ note: *the rubric's reference scope was load-bearing for the dimension
87
+ to measure retrieval-grounded behavior rather than LLM general
88
+ knowledge*; relaxing it would have re-introduced the failure mode the
89
+ supersession was designed to remove.
90
+
91
+ **Why this matters for the writeup:** the strict-snippet groundedness
92
+ rubric is the v1 deliverable's identity. The benchmark is *zero
93
+ hallucinated citations on all API provider configurations* — that
94
+ claim is only meaningful under strict scope. Stress-testing the rubric
95
+ against a frontier model before publication is the cheap intervention
96
+ that catches the labeling-vs-rubric drift before the artifact ships.
97
+
98
+ ### 1.2 CoT-before-score asymmetry across dimensions (tangent — see appendix)
99
+
100
+ The `baseline_no_cot` ablation row reached κ = 1.000 on completeness
101
+ — counterintuitive given the conventional CoT-helps-judging story —
102
+ but at n = 24 (vs n = 26 for `baseline`), and the no_cot row's
103
+ groundedness AC1 falls from 1.000 to 0.897, so the finding is real
104
+ but doesn't drive v1.1 design choices. The longer treatment with the
105
+ n = 24 caveat surfaced honestly is in **Appendix B — CoT-before-
106
+ score by dimension**.
107
+
108
+ ### 1.3 v1 jury bug — two compounding weight-pipeline bugs
109
+
110
+ The v1 design doc's risks subsection listed *"jury κ worse than the
111
+ better individual judge — (a) kappa-weighting wrong, or (b) worse
112
+ judge drags mean"* as a tracked risk. The v1.0 calibration fired both
113
+ branches simultaneously.
114
+
115
+ The κ table row `jury_kappa_weighted` reads κ = 0.014 on
116
+ completeness, vs the single-judge `baseline` (Haiku) at κ = 0.416 —
117
+ a 30× regression. Per-member analysis from
118
+ `results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`:
119
+
120
+ | Member | n | raw% | κ | AC1 |
121
+ |---|---|---|---|---|
122
+ | Haiku 4.5 alone (gold ⋈ pred) | 26 | 84.6% | +0.416 | +0.792 |
123
+ | gpt-4o-mini-2024-07-18 alone | 26 | 26.9% | +0.020 | +0.006 |
124
+ | Jury aggregate (v1) | 26 | 26.9% | +0.014 | +0.016 |
125
+
126
+ The jury aggregate matches gpt-4o-mini almost exactly. The mechanism
127
+ is not "weighted voting in the usual sense" but *missing-weight + tie-
128
+ break compounding*:
129
+
130
+ - `scripts/run_calibration.py::_load_weights_from_baseline` was a
131
+ documented v1 stub returning `1.0` for every judge_id present in
132
+ `baseline.json`. `baseline.json` contains only Haiku predictions
133
+ (the baseline ablation is single-judge), so Haiku got `1.0` from
134
+ the stub.
135
+ - gpt-4o-mini was not in the baseline file — its judge_id never
136
+ appears there. v1's `Jury.score` had a fallback policy of
137
+ `weights.get(judge_id, 1.0)` with a `logger.warning` for visibility.
138
+ gpt-4o-mini got `1.0` from this fallback.
139
+ - Equal weights make a disputed (Haiku=2, gpt=1) cell aggregate as
140
+ `(2 × 1 + 1 × 1) / 2 = 1.5`. The discretization rule
141
+ (`_aggregate_scores`'s policy, mirrored in `_discretize_mean`) is
142
+ *ties to lower*: `frac > 0.5 → ceil else floor`, and `0.5 > 0.5` is
143
+ false, so 1.5 floors to 1. gpt-4o-mini's verdict wins every
144
+ disputed cell.
145
+
146
+ The deeper structural point: weighting alone cannot rescue a
147
+ systematically miscalibrated member. Even held-out validation that
148
+ correctly assigned gpt-4o-mini's true low weight on completeness
149
+ would still let it dominate disputed ties unless its weight were
150
+ driven near zero — and at that point exclusion is more honest than
151
+ near-zero inclusion.
152
+
153
+ **v1.1 fix.** Two coordinated code changes plus a configuration
154
+ re-point (single bundled commit, see `ab0e054`):
155
+ - `agent_bench/evaluation/variance/jury.py`: missing-weight fallback
156
+ to `1.0` → hard `ValueError`. v1.1 requires symmetric coverage in
157
+ the weights source.
158
+ - `scripts/run_calibration.py::_load_weights_from_baseline` →
159
+ `_compute_kappa_weights`: replaces the stub with real per-judge
160
+ Cohen's κ on the dimension. Negative κ clipped to 0 (soft exclusion
161
+ via weight). Hard-errors when any expected member is missing from
162
+ the source.
163
+ - Configuration: `weights_source` re-pointed from
164
+ `calibration_v1_judge_baseline.json` (Haiku-only, asymmetric) to
165
+ `calibration_v1_judge_jury_kappa_weighted_members.jsonl` (sidecar
166
+ from a prior jury run; both judges present). The source has
167
+ documented circularity — weights are computed from the same
168
+ calibration set used for κ reporting; v1.2 will use a held-out
169
+ validation set.
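The two code changes above amount to a small contract for the weights. A minimal sketch, assuming per-judge κ values have already been computed; `kappa_weights` and the short member names are hypothetical stand-ins, not the signature of `_compute_kappa_weights`:

```python
def kappa_weights(kappa_by_judge, expected_members):
    """Turn per-judge kappa into jury weights (illustrative helper).

    Negative kappa is clipped to 0 (soft exclusion via weight). A member
    missing from the source raises instead of silently defaulting to 1.0.
    """
    missing = [m for m in expected_members if m not in kappa_by_judge]
    if missing:
        raise ValueError(f"no kappa available for jury members: {missing}")
    return {m: max(0.0, kappa_by_judge[m]) for m in expected_members}


members = ["haiku", "gpt-4o-mini"]  # shortened labels, for illustration only
weights = kappa_weights({"haiku": 0.416, "gpt-4o-mini": 0.020}, members)
print(weights)
```

Raising on a missing member is what removes the asymmetric-coverage path: a judge absent from the weights source can no longer enter the jury at an implicit weight of 1.0.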
170
+
171
+ **Re-aggregation (no API spend).** Re-running the existing 164
172
+ sidecar rows with κ-derived weights (Haiku 0.416, gpt-4o-mini 0.020):
173
+
174
+ | | n | raw% | κ |
175
+ |---|---|---|---|
176
+ | Jury (v1.0, broken) | 26 | 26.9% | +0.014 |
177
+ | Jury (v1.1, corrected weights) | 26 | 84.6% | **+0.416** |
178
+ | Haiku-baseline (control) | 26 | 84.6% | +0.416 |
179
+
180
+ The corrected jury matches the Haiku-baseline κ exactly. The
181
+ mechanism: with corrected weights, a disputed (Haiku=2, gpt=1) cell
182
+ aggregates as `(2 × 0.416 + 1 × 0.020) / 0.436 = 1.954`, frac 0.954 >
183
+ 0.5, ceil to 2. Haiku's verdict wins. gpt-4o-mini's near-zero weight
184
+ correctly suppresses its verdict.
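Both aggregation outcomes for a disputed (Haiku=2, gpt=1) cell can be checked directly. The sketch below assumes the ties-to-lower rule exactly as quoted earlier (`frac > 0.5 → ceil else floor`) and non-zero total weight; `discretize_mean` and `weighted_vote` are illustrative names, not the bodies of `_discretize_mean` or `Jury.score`.

```python
import math


def discretize_mean(mean):
    """Round to an integer level, sending exact .5 ties to the lower level."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def weighted_vote(scores, weights):
    """Weighted mean of member scores, then ties-to-lower discretization."""
    total = sum(weights[judge] for judge in scores)
    mean = sum(scores[judge] * weights[judge] for judge in scores) / total
    return discretize_mean(mean)


disputed = {"haiku": 2, "gpt-4o-mini": 1}

# v1: both members end up at weight 1.0 (stub + fallback), mean = 1.5.
v1_verdict = weighted_vote(disputed, {"haiku": 1.0, "gpt-4o-mini": 1.0})

# v1.1: kappa-derived weights from the re-aggregation, mean is about 1.954.
v1_1_verdict = weighted_vote(disputed, {"haiku": 0.416, "gpt-4o-mini": 0.020})

print(v1_verdict, v1_1_verdict)
```

Under equal weights the 1.5 mean floors to 1, so the lower-scoring member wins; under the κ-derived weights the mean rounds up to 2.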
185
+
186
+ This is the **pre-committed Outcome 2** from the v1.1 jury-rescue
187
+ plan: jury matches baseline within ±0.05 → "soft exclusion via
188
+ weighting." The weighting suppresses the biased member to near-
189
+ irrelevance; the jury isn't *worse* than baseline, but it isn't
190
+ *doing meaningful work* either. The intervention is necessary but
191
+ not sufficient — the jury's value-add over single-judge depends on
192
+ the second judge being calibrated, which on completeness it isn't.
193
+
194
+ ### 1.4 v1.1.1 prompt-positional intervention — one of two failure modes
195
+
196
+ The next investigation localized *why* gpt-4o-mini was so badly
197
+ miscalibrated on completeness. Confusion-matrix analysis (1A in the
198
+ investigation plan) on the existing sidecar showed:
199
+
200
+ - **17 of 19 disagreements** are gold=2/pred=1 (one-step-down)
201
+ - 1 is gold=2/pred=0, 1 is gold=1/pred=0
202
+ - **0 disagreements** are pred > gold
203
+
204
+ This is direction-aware structure, not balanced random labeling. The
205
+ probability of producing 19 same-direction disagreements by chance
206
+ under a balanced labeler is ~2⁻¹⁹. The bias is structural and
207
+ reproducible; gpt-4o-mini *consistently applies* a stricter standard
208
+ than the rubric specifies.
209
+
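As a worked check of the figure above, assuming each disagreement independently lands above or below gold with probability 1/2 (the "balanced labeler" null):

```latex
P(\text{all 19 disagreements below gold})
  = \left(\tfrac{1}{2}\right)^{19}
  = \frac{1}{524288}
  \approx 1.9 \times 10^{-6}
```

Allowing either direction to count doubles this to $2^{-18} \approx 3.8 \times 10^{-6}$, which leaves the conclusion unchanged.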
210
+ Reading the per-item reasoning surfaced an **extraction-vs-reasoning
211
+ split**: gpt-4o-mini's `evidence_quotes` field correctly extracts the
212
+ paraphrased coverage from the agent's answer, and then its `reasoning`
213
+ field denies that those quotes constitute coverage. The cleanest
214
+ example is `k8s_002` (Deployment vs StatefulSet) — gpt's
215
+ `evidence_quotes` literally contain the strings `"declarative
216
+ updates"` and `"sticky identity"`, while its `reasoning` says "the
217
+ answer does not explicitly mention 'declarative updates' and 'sticky
218
+ identity'." The score follows the reasoning, not the evidence. (Two
219
+ more examples in `measurements/2026-05-06-gpt4o-extraction-reasoning-
220
+ split.md`.)
221
+
222
+ The *hypothesis* that motivates the intervention: the model loses
223
+ the rubric's "paraphrase allowed" instruction across the rubric body,
224
+ the gold reference, the system answer, and its own reasoning step.
225
+ By the time it commits to a score, the literal-string-match standard
226
+ has displaced the rubric's permissive one. **Recency-positioning**
227
+ the paraphrase clause adjacent to the score instruction tests this:
228
+
229
+ ```
230
+ {rubric body}
231
+ ---
232
+ ## Reference answer (gold)
233
+ {reference}
234
+ ## Answer to score
235
+ {system_answer}
236
+ Note: a paraphrase that captures the same meaning as a gold-answer
237
+ point counts as covered. Score on content equivalence, not surface
238
+ form.
239
+ Score this answer against the rubric above. Respond with ONLY a {schema}.
240
+ ```
241
+
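The template above can be assembled roughly as follows. This is a sketch only: `build_prompt` and `PARAPHRASE_NOTE` are hypothetical names, and the real prompt builder in `agent_bench/evaluation/judges/` may be structured differently. The point it illustrates is ordering: the paraphrase clause is the last thing the model reads before the score instruction.

```python
PARAPHRASE_NOTE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)


def build_prompt(rubric_body, reference, system_answer, schema):
    """Assemble the judge prompt with the paraphrase clause adjacent to the score instruction."""
    return "\n".join([
        rubric_body,
        "---",
        "## Reference answer (gold)",
        reference,
        "## Answer to score",
        system_answer,
        PARAPHRASE_NOTE,  # recency-positioned: after the answer, before the instruction
        f"Score this answer against the rubric above. Respond with ONLY a {schema}.",
    ])


# Placeholder inputs, used only to show the resulting section order.
prompt = build_prompt("RUBRIC", "GOLD", "ANSWER", "JSON object")
print(prompt)
```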
242
+ **3A 5-item probe** (`q006`, `q011`, `k8s_002`, `k8s_006`, `k8s_018`,
243
+ $0.0013): 3/5 disputed items shifted 1 → 2 — at the binomial-
244
+ significance threshold per the pre-committed criteria. The protocol
245
+ triggered the full-26 re-run on gpt-4o-mini only (Haiku held as
246
+ control to make the v1.1 → v1.1.1 delta cleanly attributable).
247
+
248
+ **Full-26 re-run** (`scripts/_dev/rerun_completeness_v1_1_1.py`,
249
+ $0.0075):
250
+
251
+ | Judge | n | raw% | κ | AC1 |
252
+ |---|---|---|---|---|
253
+ | v1.1 gpt-4o-mini | 26 | 26.9% | +0.020 | +0.006 |
254
+ | **v1.1.1 gpt-4o-mini** | 28 | **42.9%** | **+0.000** | **+0.232** |
255
+ | v1.1 Haiku (control) | 26 | 84.6% | +0.416 | +0.792 |
256
+
257
+ 7 items shifted up (6 correct: gold=2/pred=1 → gold=2/pred=2 on
258
+ `q006`, `k8s_002`, `k8s_013`, `k8s_015`, `k8s_016`, `k8s_017`; 1
259
+ regression: `k8s_025` over-credited gold=1/pred=2). Net per-item
260
+ correctness delta: +5 items.
261
+
262
+ **Cohen's κ flat-lined** despite a 38× AC1 improvement and +16pp raw
263
+ agreement. This is the κ-as-weight degeneracy — section 1.6 below
264
+ covers the mechanism.
265
+
266
+ The intervention is real and partial: 5/19 disputed items recovered
267
+ via prompt positioning. 14 disagreements remained uncharacterized
268
+ after this step.
269
+
270
+ ### 1.5 4A residual characterization — model-class-specific
271
+
272
+ The v1.1.1 result is precarious in an interview if framed as "fixed" (5/19 is
273
+ a partial fix, not a complete one). The right diagnostic for the
274
+ residual was the originally-deferred 4A: run a frontier-class model
275
+ on 5 of the 14 unchanged items at the same v1.1.1 prompt, and see
276
+ whether the residual is small-model-specific or rubric-under-
277
+ specified.
278
+
279
+ **4A** (`gpt-4o-2024-08-06`, items `k8s_006`, `k8s_018`, `q011`,
280
+ `q012`, `k8s_001`, $0.005–0.01): **5/5 scored correctly** — every
281
+ item that gpt-4o-mini got wrong at the v1.1.1 prompt, GPT-4o got
282
+ right at the same prompt. Clean A/B at fixed prompt varying only
283
+ the model.
284
+
285
+ The cleanest side-by-side is `k8s_018` (autoscaling/v2 vs v1). The
286
+ reference specifies three points: stable API version, memory metrics
287
+ support, custom metrics support. Both models receive the same
288
+ prompt:
289
+
290
+ - **gpt-4o-mini (score 1):** "It mentions some key points from the
291
+ reference, including the stable version of `autoscaling/v2`,
292
+ support for custom metrics, and memory metrics, but it does not
293
+ explicitly state that the new fields in `autoscaling/v2` are
294
+ preserved as annotations when using `autoscaling/v1`, nor does it
295
+ mention the need to use `autoscaling/v2` directly for memory or
296
+ custom metric scaling for a Deployment or StatefulSet."
297
+ - **gpt-4o (score 2):** "The answer covers all the key points from
298
+ the reference. It mentions that the current stable version is
299
+ autoscaling/v2, which supports scaling on memory and custom
300
+ metrics, similar to the reference. It also notes that
301
+ autoscaling/v1 only supports CPU-based scaling, aligning with the
302
+ reference's points."
303
+
304
+ gpt-4o-mini's reasoning step **invents additional gold-criteria the
305
+ reference doesn't require** — "preserved as annotations," "use v2
306
+ directly for a Deployment or StatefulSet" — and deducts against
307
+ them. gpt-4o reads the reference's three points and scores against
308
+ exactly those. This is a **second, distinct failure mode** from the
309
+ 1.4 finding:
310
+
311
+ - **Failure mode A (rubric-positional):** literal-match regression
312
+ on paraphrased coverage. *Fixable* by recency-positioning the
313
+ paraphrase clause. Recovers 5/19 items. (Section 1.4.)
314
+ - **Failure mode B (capacity-limited):** criteria-invention during
315
+ the reasoning step — the model manufactures additional gold
316
+ criteria the reference never specified, then deducts against them.
317
+ *Not fixable* by the same prompt; demonstrably absent in gpt-4o.
318
+ (This section.)
319
+
320
+ The v1.1.1 prompt addresses A but not B. B is what 4A characterizes.
321
+
322
+ ### 1.6 κ-as-weight degeneracy — methodological observation
323
+
324
+ > **This section is the writeup's deepest finding.** The methodology
325
+ > arc 1.1–1.5 leads here: an intervention that improved a judge
326
+ > member at the per-cell level (raw 26.9% → 42.9%, AC1 0.006 → 0.232)
327
+ > was *silently excluded* from the jury aggregate by the weighting
328
+ > metric itself. The mechanism below generalizes beyond the v1.1.1
329
+ > instance and is what motivates v1.2 fix #5.
330
+
331
+ The v1.1.1 gpt-4o-mini result reveals a property of Cohen's κ as a
332
+ jury weight that the v1 design didn't anticipate: κ has a **self-
333
+ defeating property** under intervention-induced marginal shifts. An
334
+ intervention that improves a member can *lower* its weight even as
335
+ the member gets more accurate.
336
+
337
+ **Mechanism.** Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where
338
+ `P_e = Σ_k P(gold=k) × P(pred=k)`. P_e is *not* invariant to the
339
+ predictor's marginal distribution. When a member's predictions
340
+ become more diverse — closer to gold's marginals — P_e rises in
341
+ lockstep with P_o. The numerator stays small, and κ deflates even
342
+ as raw accuracy improves.
343
+
344
+ **Empirical instance.** v1.1 gpt-4o-mini completeness pred dist:
345
+ `{0:2, 1:19, 2:5}` (concentrated at 1). v1.1.1 dist: `{0:4, 1:12,
346
+ 2:12}` (more diverse, closer to gold's `{1:5, 2:23}`). Per-cell raw
347
+ accuracy 26.9% → 42.9%. AC1 (Gwet 2008, prevalence-robust):
348
+ 0.006 → 0.232 (38×). Cohen's κ: 0.020 → 0.000.
349
+
350
+ `_compute_kappa_weights` clips κ < 0 to weight = 0. v1.1.1's
351
+ gpt-4o-mini κ = 0.000 → weight = 0.000 → contribution to jury
352
+ verdict is multiplied by zero. The improved member is invisible at
353
+ the aggregate level. **The κ table doesn't move at v1.1.1** despite
354
+ a real per-member improvement; the visible artifact disagrees with
355
+ the per-judge measurement.
356
+
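The v1.1.1 numbers above can be re-derived from the marginals alone. The sketch below assumes unweighted Cohen's κ, the standard Gwet AC1 chance term over the three rubric levels, and 12 agreeing cells out of 28 (the 42.9% raw agreement in the §1.4 table). The function names are illustrative and are not the ones in `calibration/metrics.py`.

```python
def cohen_kappa(p_o, gold, pred, levels):
    """Unweighted Cohen's kappa from observed agreement and marginal counts."""
    n_gold, n_pred = sum(gold.values()), sum(pred.values())
    p_e = sum((gold.get(k, 0) / n_gold) * (pred.get(k, 0) / n_pred) for k in levels)
    return (p_o - p_e) / (1 - p_e)


def gwet_ac1(p_o, gold, pred, levels):
    """Gwet's AC1: chance agreement uses the mean of both raters' marginals."""
    n_gold, n_pred = sum(gold.values()), sum(pred.values())
    pi = [(gold.get(k, 0) / n_gold + pred.get(k, 0) / n_pred) / 2 for k in levels]
    p_e = sum(p * (1 - p) for p in pi) / (len(levels) - 1)
    return (p_o - p_e) / (1 - p_e)


levels = (0, 1, 2)
gold = {1: 5, 2: 23}          # gold marginals quoted above
pred = {0: 4, 1: 12, 2: 12}   # v1.1.1 gpt-4o-mini marginals quoted above
p_o = 12 / 28                 # assumption: 42.9% raw agreement means 12 of 28 cells

kappa = cohen_kappa(p_o, gold, pred, levels)
ac1 = gwet_ac1(p_o, gold, pred, levels)
weight = max(0.0, kappa)      # the clip-at-zero rule described above

print(f"kappa={kappa:.3f} ac1={ac1:.3f} weight={weight:.3f}")
```

Under these marginals the κ chance term works out to 336/784 = 12/28, which equals the observed agreement, so the numerator vanishes and κ is zero up to floating-point error. The AC1 chance term is about 0.256, which leaves a clearly positive coefficient.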
357
+ Why this is non-obvious: in static conditions (no intervention,
358
+ fixed prompts), κ as weight is a sensible default. The self-
359
+ defeating property is invisible until you observe a real
360
+ intervention that shifts marginals. v1.0's calibration sweep
361
+ couldn't surface it because nothing was changing the marginals;
362
+ v1.1.1's intervention is the first time the calibration set has
363
+ produced an intervention-induced marginal shift.
364
+
365
+ The same prevalence trap is what motivates AC1 over κ on the
366
+ relevance and groundedness *reporting* rows of the κ table. The
367
+ v1.1.1 finding is that the same trap also affects κ when used as a
368
+ *weight*, with worse consequences: a reporting-degenerate κ is just
369
+ visually surprising; a weighting-degenerate κ silently excludes a
370
+ correctly-improved member from the aggregate.
371
+
372
+ **Implication.** The v1.2 fix-list (section 3) splits weighting and
373
+ reporting cleanly: per-dimension weight metric reusing the
374
+ `_DIM_METRIC` mapping already used for reporting. AC1 where κ
375
+ degenerates; κ where the gold's prevalence supports it.
376
+
377
+ ---
378
+
379
+ ## 2. Position statement — when not to use LLM-judge
380
+
381
+ The combined findings support a sharper position than "small models
382
+ are bad at completeness." Two distinct failure modes were surfaced
383
+ on the same dimension, and they have different intervention classes:
384
+
385
+ | | Failure mode A (1.4) | Failure mode B (1.5) |
386
+ |--------------------|----------------------|----------------------|
387
+ | Mechanism | Literal-match regression on paraphrased coverage | Criteria-invention during reasoning |
388
+ | Diagnostic | 1A confusion matrix (17/19 disagreements one-step-down) | 4A A/B against gpt-4o (5/5 model-class swap fixes) |
389
+ | Intervention class | Rubric-positional prompt engineering | Model selection |
390
+ | Outcome | Recovers 5/19 items | Recovers all 5 sampled at the same prompt |
391
+
392
+ The v1.1.1 prompt-positional fix exhausts what prompt engineering
393
+ can do on this rubric: the recency clause directs the model to
394
+ paraphrase semantics, and that's the only failure mode the
395
+ intervention can address. Iterating further on prompt design to
396
+ address criteria-invention would either (a) need a longer prompt
397
+ that re-explains the rubric's score levels in the score-decision
398
+ adjacency — which would cost tokens and likely confuse smaller
399
+ models more — or (b) require rubric simplification (binary instead
400
+ of 3-point), which is a v1.2 design change, not a tuning change.
401
+
402
+ **The structural answer for v1.2 is per-dimension judge selection.**
403
+ 3-point ordinal completeness with paraphrase semantics is at the
404
+ boundary where mid-tier models exhibit capacity limits independent
405
+ of prompt engineering. Two defensible v1.2 paths:
406
+
407
+ 1. **Exclude gpt-4o-mini from completeness scoring.** Per-dimension
408
+ judge membership; jury reduces to single-judge Haiku on
409
+ completeness; explicit and visible in the jury config (not
410
+ emergent from κ-weight collapse).
411
+ 2. **Replace gpt-4o-mini with GPT-4o on completeness.** Per-
412
+ dimension judge selection; jury keeps two members; the second is
413
+ a frontier-class model on the dimension that needs it.
414
+
415
+ The choice depends on cost budget. agent-bench's calibration scale
416
+ (~30 items × per-row × dimension-count) is trivially cheap on either
417
+ model; production deployment evaluating thousands of agent outputs
418
+ makes the trade-off material. For v1.2 the calibration cost
419
+ difference between the two paths is on the order of $0.15 per full
420
+ calibration sweep — well below the threshold where cost should
421
+ constrain the choice.
422
+
423
+ The honest interview answer to *"did you fix gpt-4o-mini on
424
+ completeness?"* is **no, deliberately**: the GPT-4o A/B showed the
425
+ residual bias is model-class-specific. The fix isn't another prompt
426
+ intervention; it's per-dimension judge selection. v1.1.1
427
+ demonstrated that rubric-engineering can address one of two failure
428
+ modes; the second one is what model choice is for.
429
+
430
+ **This generalizes beyond the specific dimension as a hypothesis the
431
+ v1 data is consistent with, not a claim the v1 data establishes.**
432
+ The empirical scope is narrow: 3-point ordinal × paraphrase ×
433
+ completeness, n = 26–28 items, one mid-tier model (gpt-4o-mini)
434
+ tested against one frontier model (gpt-4o) at the same prompt.
435
+
436
+ Within that scope, the combination of (multi-class discrimination) ×
437
+ (paraphrase tolerance) × (reasoning-induced elaboration latitude) is
438
+ at the capacity boundary where mid-tier models manufacture failure
439
+ modes that look like they should be prompt-tunable but aren't. Within
440
+ the same scope, use frontier-class models on those dimensions and mid-tier
441
+ models on binary or strict-match dimensions where they perform
442
+ identically (groundedness AC1 = 1.000, relevance AC1 = 1.000 on the
443
+ same gpt-4o-mini that fails on completeness).
444
+
445
+ Whether this generalizes to other ordinal arities (4-point, 5-point),
446
+ other mid-tier models (Mistral, Sonnet, Gemini-Flash), or other
447
+ dimensions with paraphrase tolerance is *open* and worth replication
448
+ in v1.2. The v1 data is one mid-tier vs one frontier on one
449
+ dimension; the broader categorical claim ("don't use mid-tier on any
450
+ ordinal-with-paraphrase task") needs replication across model
451
+ families and ordinal arities before it's defensible as a general
452
+ recommendation.
453
+
454
+ ---
455
+
456
+ ## 3. v1.2 fix-list with empirical justification
457
+
458
+ Five items, ordered by methodology depth. Items 1–4 are escalations
459
+ of known v1 risks the calibration confirmed; item 5 is the new
460
+ finding from the v1.1.1 + 4A investigation.
461
+
462
+ ### 3.1 Held-out jury weights
463
+
464
+ **v1 state.** v1.1 weights are computed on the same calibration set
465
+ used for κ reporting (circular). The pragmatic choice was driven by
466
+ N = 30 — splitting into a held-out subset would lose statistical
467
+ power on both halves.
468
+
469
+ **v1.2 fix.** A held-out 20-item validation set used solely for
470
+ jury-weight estimation; the 30-item calibration set retained for κ
471
+ reporting. Items selected by stratification across (corpus, gold-
472
+ class) so the validation set reflects the calibration set's
473
+ prevalence distribution.
474
+
475
+ **Empirical justification.** v1.1's circular weighting is documented
476
+ honestly (DECISIONS "v1.1 jury rescue" entry); a held-out set would
477
+ make the jury-weight numbers reproducible across calibration set
478
+ revisions without re-circularity.
479
+
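One way the stratified selection described in §3.1 could look, as a sketch under stated assumptions: the candidate pool, the field names `corpus` and `gold`, and the function `stratified_sample` are all hypothetical, and the pool below is synthetic toy data rather than anything from the repository.

```python
import random
from collections import defaultdict


def stratified_sample(items, k, seed=0):
    """Pick about k items, giving each (corpus, gold) stratum a share proportional to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[(item["corpus"], item["gold"])].append(item)
    picked = []
    for key in sorted(strata):
        members = strata[key]
        # At least one item per stratum so rare gold classes stay represented.
        quota = max(1, round(k * len(members) / len(items)))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked


# Synthetic pool: two corpora, gold classes 1 and 2 at a 1:4 ratio.
pool = [
    {"id": f"{corpus}_{gold}_{i}", "corpus": corpus, "gold": gold}
    for corpus in ("tech_docs", "k8s")
    for gold, count in ((1, 6), (2, 24))
    for i in range(count)
]

picked = stratified_sample(pool, k=20)
print(len(picked), sum(1 for item in picked if item["gold"] == 1))
```

Because quotas are rounded per stratum, the returned size can differ slightly from `k` on pools whose strata do not divide evenly; a real selection script would need to decide how to reconcile that.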
480
+ ### 3.2 Symmetric coverage / hard-error on missing weights — DONE in v1.1
481
+
482
+ The v1 silent fallback to `1.0` was the second of the two compounding
483
+ bugs in section 1.3. v1.1 made this a hard `ValueError` per
484
+ DECISIONS commit `ab0e054`. Listed here for completeness; closed.
485
+
486
+ ### 3.3 Per-dimension judge membership
487
+
488
+ **v1 state.** Jury config declares members globally across all
489
+ dimensions (`configs/calibration/rows/jury_kappa_weighted.yaml`).
490
+ Weights are per-(member, dimension) but membership is per-jury.
491
+
492
+ **v1.2 fix.** Membership declared per-dimension in the jury config:
493
+
494
+ ```yaml
495
+ jury:
496
+ groundedness:
497
+ - haiku
498
+ - gpt-4o-mini
499
+ relevance:
500
+ - haiku
501
+ - gpt-4o-mini
502
+ completeness:
503
+ - haiku # gpt-4o-mini excluded; see writeup §1.5 + 4A
504
+ ```
505
+
506
+ The exclusion is *visible* in the config, with a comment pointing
507
+ to the rationale. Not buried in code logic.
508
+
509
+ **Empirical justification.** 4A (writeup §1.5): GPT-4o handles 5/5
510
+ of the v1.1.1-residual items at the same prompt; gpt-4o-mini's
511
+ residual bias is model-class-specific (criteria-invention during
512
+ reasoning). v1.1's κ-as-weight handles this by collapsing the
513
+ member's weight to 0; v1.2 makes the exclusion explicit.
514
+
515
+ ### 3.4 Per-dimension tie-break rule
516
+
517
+ `_discretize_mean` currently uses *ties to lower* (`floor + 1 if frac
518
+ > 0.5 else floor`) globally — selected for conservative behavior on
519
+ binary scales where "score 0 on uncertainty" matches the conservative
520
+ direction (hallucination, off-topic). v1.2 flips this per-dimension:
521
+ on 3-point completeness, "conservative" means scoring toward
522
+ *incomplete*, which is the wrong default given member miscalibration
523
+ already biases toward 1.
524
+
525
+ **This fix is independent of §3.5; even with correct AC1-weighted
526
+ aggregation, the global ties-to-lower default mis-handles ordinal
527
+ scales where the conservative direction differs from binary scales'
528
+ conservative direction.** Per-dimension tie-break is the *structural*
529
+ fix for ordinal asymmetry; per-dimension weight metric in §3.5 is the
530
+ *distributional* fix for prevalence-induced κ degeneracy. Different
531
+ defects, different fixes.
532
+
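A sketch of what a per-dimension tie-break could look like. The `TIE_BREAK` table and the function name are hypothetical, and "ties to upper" on completeness is this sketch's reading of the paragraph above, not a confirmed v1.2 decision.

```python
import math

# Hypothetical per-dimension policy; dimension names follow the jury config in §3.3.
TIE_BREAK = {
    "groundedness": "lower",   # binary scale: score 0 on uncertainty
    "relevance": "lower",      # binary scale: score 0 on uncertainty
    "completeness": "upper",   # 3-point scale: do not compound the downward member bias
}


def discretize_mean_per_dim(mean, dimension):
    """Map a weighted mean to a discrete level using the dimension's tie-break rule."""
    floor = math.floor(mean)
    frac = mean - floor
    if TIE_BREAK[dimension] == "upper":
        return floor + 1 if frac >= 0.5 else floor
    return floor + 1 if frac > 0.5 else floor


print(discretize_mean_per_dim(1.5, "completeness"))
print(discretize_mean_per_dim(0.5, "groundedness"))
```

Only exact ties change behavior: means such as 1.954 or 1.4 discretize the same way under either rule.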
533
+ ### 3.5 Per-dimension weight metric (NEW from v1.1.1)
534
+
535
+ **v1 state.** `_compute_kappa_weights` uses Cohen's κ for every
536
+ dimension. Section 1.6 demonstrated that κ has a self-defeating
537
+ property under intervention-induced marginal shifts — an
538
+ intervention that improves a member can lower its weight to zero,
539
+ silently excluding it from the aggregate.
540
+
541
+ **v1.2 fix.** Per-dimension weight metric reusing the `_DIM_METRIC`
542
+ mapping already used in
543
+ `agent_bench/evaluation/calibration/report.py`. Use AC1 (Gwet 2008)
544
+ where the dimension's gold prevalence makes κ degenerate;
545
+ κ where the gold's prevalence supports it. Same lookup, same per-
546
+ dimension policy at both reporting and weighting layers.
547
+
548
+ **Empirical justification.** v1.1.1's gpt-4o-mini intervention
549
+ (writeup §1.4 + 1.6): raw 26.9% → 42.9%, AC1 0.006 → 0.232 (38×),
550
+ κ 0.020 → 0.000. v1.1's `_compute_kappa_weights` clips the new κ at
551
+ zero, weight = 0, member silently excluded from the aggregate. AC1
552
+ as weight would have given the v1.1.1-improved member a non-zero
553
+ contribution proportional to its actual reliability, surfacing the
554
+ intervention's per-member improvement in the jury aggregate.
555
+
556
+ This is the writeup's deepest finding. The interaction between
557
+ Cohen's κ and prevalence-induced marginal skew is well-documented in
558
+ the κ-reporting literature — Gwet (2008) introduced AC1 specifically
559
+ to address it, and the κ table at `docs/_generated/kappa_table.md`
560
+ already uses AC1 over κ on relevance and groundedness for that
561
+ reason. *What's underexplored, to the author's knowledge,* is the
562
+ specific case where κ is used as a jury *weight* rather than as a
563
+ reporting statistic, and where an intervention shifts the predictor's
564
+ marginals while the gold's marginals stay fixed. v1.2's per-dimension
565
+ weight metric addresses this case structurally.
566
+
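A sketch of the per-dimension weight lookup. The contents of the real `_DIM_METRIC` mapping are not shown in this writeup, so `DIM_METRIC` below is a stand-in; in particular, routing completeness to AC1 is an assumption made here to illustrate the §1.6 case. `member_weight` is likewise a hypothetical name.

```python
# Stand-in for `_DIM_METRIC`; the values are assumptions, not the repository's mapping.
DIM_METRIC = {"groundedness": "ac1", "relevance": "ac1", "completeness": "ac1"}


def member_weight(dimension, agreement):
    """Return the jury weight for one (member, dimension) pair.

    `agreement` maps metric name to that member's measured value on the dimension.
    """
    metric = DIM_METRIC[dimension]
    if metric not in agreement:
        # Hard error rather than a silent fallback, in the spirit of §3.2.
        raise ValueError(f"no {metric} measurement for dimension {dimension!r}")
    return max(0.0, agreement[metric])


# v1.1.1 gpt-4o-mini completeness numbers quoted in the empirical justification above.
v1_1_1 = {"kappa": 0.000, "ac1": 0.232}

kappa_weight = max(0.0, v1_1_1["kappa"])            # v1.1 behavior: kappa for every dimension
ac1_weight = member_weight("completeness", v1_1_1)  # per-dimension metric under the assumed mapping

print(kappa_weight, ac1_weight)
```

Under κ the improved member contributes nothing; under the assumed mapping it contributes in proportion to its AC1.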
567
+ ---
568
+
569
+ ## 4. Closing position
570
+
571
+ The v1 calibration set — 30 hand-labeled items, two corpora, three
572
+ dimensions — was small enough that every finding above lived inside
573
+ single-digit item counts on the disputed surface. The fact that the
574
+ calibration produced six *separable* findings rather than one or two
575
+ flat κ numbers is itself a signal about evaluation design: a
576
+ calibration set sized to support stratified ablation (rubric × CoT ×
577
+ abstain × jury × prompt-positional × model-class) returns more per
578
+ item than a larger flat set used only for headline-κ reporting.
579
+
580
+ The methodology arc the calibration produced is reproducible from
581
+ the artifacts on disk:
582
+
583
+ - `docs/_generated/kappa_table.md` — the headline κ table, joined
584
+ on `(item_id, dimension)` from
585
+ `results/calibration_v1_judge_*.json` ⋈
586
+ `measurements/2026-05-04-judge-calibration-labels.jsonl`. v1.1
587
+ jury-rescue row visible at `jury_kappa_weighted_v1_1` (κ = 0.416,
588
+ vs `jury_kappa_weighted` at κ = 0.014).
589
+ - `measurements/2026-05-05-judge-rubric-opus-stress.jsonl` — Opus-4
590
+ stress-test that surfaced the rubric drift (§1.1).
591
+ - `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` —
592
+ three side-by-side reasoning + evidence_quotes excerpts
593
+ demonstrating the literal-match regression mechanism (§1.4).
594
+ - `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl` — the
595
+ 5-item probe artifact for the prompt-positional intervention
596
+ (§1.4).
597
+ - `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl` — GPT-4o A/B
598
+ on the v1.1.1 residual; the empirical separator between the two
599
+ failure modes (§1.5).
600
+ - `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl`
601
+ — merged sidecar (v1.1 unchanged dims + v1.1.1 fresh gpt-4o-mini
602
+ completeness rows). The data behind the per-member numbers in §1.4.
603
+ - `DECISIONS.md` — per-decision rationale for v1.1, v1.1.1, 3A, 4A.
604
+
605
+ **Total session API spend:** ~$0.014–0.019. v1.1 introduced no API
606
+ spend (re-aggregated existing predictions). v1.1.1 spent $0.0088 on
607
+ the prompt-positional intervention (5-item probe + 30-item full re-
608
+ run). 4A spent $0.005–0.01 on the diagnostic A/B.
609
+
610
+ **The v1 deliverable's position on when not to use LLM-judge:** mid-
611
+ tier models (gpt-4o-mini class) on 3-point ordinal scales with
612
+ paraphrase semantics exhibit capacity limits independent of prompt
613
+ engineering. The right architectural choice is per-dimension judge
614
+ selection, not iterative prompt tuning. Two defensible v1.2 paths
615
+ are listed in §2, with the config change in §3.3; the empirical evidence supports either one. The
616
+ choice between them depends on the cost of frontier inference at
617
+ production scale, which is a separate v1.2 decision.
618
+
619
+ ---
620
+
621
+ ## Appendix A — reproducer index
622
+
623
+ | Script | What it does | Cost |
624
+ |---|---|---|
625
+ | `scripts/_dev/reaggregate_jury_v1_1.py` | Re-aggregates the existing 164 sidecar rows with κ-derived weights; produces v1.1-corrected jury verdicts. Mirrors the production `Jury.score` aggregation logic offline. | $0.00 |
626
+ | `scripts/_dev/probe_3a_paraphrase_recency.py` | 5-item probe of the prompt-positional intervention on disputed completeness items; tests whether recency-positioning the paraphrase clause shifts gpt-4o-mini's verdicts. | $0.0013 |
627
+ | `scripts/_dev/rerun_completeness_v1_1_1.py` | Full-26 re-run of gpt-4o-mini completeness with the v1.1.1 production prompt. Haiku held as control. | $0.0075 |
628
+ | `scripts/_dev/probe_4a_gpt4o_full.py` | GPT-4o (full) A/B on 5 of the 14 v1.1.1-unchanged items at the same v1.1.1 prompt. Diagnostic for whether the residual is small-model-specific or rubric-under-specified. | $0.005–0.01 |
629
+
630
+ The production calibration runner (`scripts/run_calibration.py`) is
631
+ not in this list because it produces the headline κ table from the
632
+ canonical row configs; the `_dev` scripts above are one-off
633
+ diagnostics that produce the writeup's interpretive evidence.
634
+
635
+ ---
636
+
637
+ ## Appendix B — CoT-before-score by dimension
638
+
639
+ The `baseline_no_cot` ablation row (`use_cot=false`, schema requests
640
+ only the `score` field; reasoning + evidence_quotes omitted) shows a
641
+ per-dimension asymmetry that's interesting on its own but didn't
642
+ drive v1.1 design choices. Pulled out of the body to keep the
643
+ methodology arc focused on the v1.1 → v1.1.1 → 4A path.
644
+
645
+ | Dimension | baseline (CoT) | baseline_no_cot |
646
+ |---|---|---|
647
+ | completeness | κ = 0.416 (n = 26) | **κ = 1.000** (n = 24) |
648
+ | groundedness | AC1 = 1.000 (n = 26) | AC1 = 0.897 (n = 23) |
649
+ | relevance | AC1 = 0.964 (n = 29) | AC1 = 0.963 (n = 28) |
650
+
651
+ **Counterintuitive headline on completeness.** With CoT, the judge's
652
+ reasoning step over-emphasizes partial coverage and rationalizes
653
+ score = 1 ("the answer covers most of the points but misses
654
+ detail X") even when the gold's holistic reading is "covers the
655
+ points." Without CoT, the judge commits to a verdict against the
656
+ rubric directly, and the verdict aligns with the holistic reading.
657
+ The mechanism generalizes specifically to *ordinal scales with
658
+ permissive semantics* — where reasoning-induced elaboration can
659
+ manufacture grounds for downward verdicts.
660
+
661
+ **The n = 24 caveat.** `baseline_no_cot` excludes 2 cells (`q021`,
662
+ `k8s_012`) due to provider rate-limit retry exhaustion. Both were
663
+ gold = 2; neither was in `baseline`'s disagreement set. So the
664
+ agreement *isn't* selective in the misleading sense (the abstain set
665
+ isn't disproportionately drawn from `baseline`'s mistakes), but the
666
+ n = 24 vs n = 26 comparison is asymmetric across rows, and the
667
+ κ = 1.000 number is partly an abstain-exclusion artifact rather than
668
+ a pure counterfactual against `baseline`. The point estimate is real;
669
+ the bootstrap CI is wider than the table cell suggests.
670
+
671
+ **Why this didn't drive v1.1 design.** The no_cot row's groundedness
672
+ AC1 falls from 1.000 to 0.897 — meaningfully worse on the dimension
673
+ where CoT *does* help. Across dimensions: CoT helps on groundedness,
674
+ hurts on completeness, neutral on relevance. The right path is
675
+ *per-dimension* CoT selection (independent of v1.2 fix-list items
676
+ 3.1–3.5; tracked separately as a v1.2 follow-up). Not included in
677
+ the §3 fix-list because the empirical evidence is partial (n = 24
678
+ caveat) and the asymmetric effect across dimensions makes a single
679
+ global change incorrect.
680
+
681
+ **Interview-readiness note.** A reader probing the κ table will see
682
+ the no_cot row's completeness κ = 1.000 and ask. The honest answer
683
+ is "interesting tangent, see appendix B, didn't change v1.1 design
684
+ choices because the asymmetry across dimensions doesn't support a
685
+ global flip." That answer is defensible because the appendix is
686
+ honest about the n = 24 caveat; it would not be defensible if the
687
+ body claimed CoT-before-score was load-bearing for v1's design.
docs/plans/2026-05-04-judge-layer-v1-design.md ADDED
@@ -0,0 +1,613 @@
1
+ # Judge Layer v1 — Design Document
2
+
3
+ **Date:** 2026-05-04
4
+ **Status:** Approved — ready for implementation
5
+ **Author:** Jane Yeung
6
+ **Scope:** v1 of a discrete-scale, per-dimension LLM-judge layer with a κ-validated 2-judge jury and a 30-item hand-labeled calibration set. Supersedes the existing continuous-scale `answer_faithfulness` / `answer_correctness` judges. Mistral self-hosted 3rd judge, Langfuse self-host, dual-pass intra-rater calibration, and DSPy/GEPA prompt optimization are explicitly v1.1+.
7
+
8
+ ---
9
+
10
+ ## Goal
11
+
12
+ Replace the existing single-call, continuous-score, no-abstain LLM-judge implementation in `agent_bench/evaluation/metrics.py` with a per-dimension judge layer that supports anchored discrete rubrics, abstain, evidence quotes, judge identity, rubric versioning, and variance-controlled aggregation (rubric permutation, jury). Validate the new layer against a 30-item hand-labeled calibration set with Cohen's κ and bootstrap CIs. Produce a κ ablation table that quantifies the contribution of each variance control (anchored rubric, abstain option, rubric permutation, 2-judge jury) on top of the single-judge baseline.
13
+
14
+ The deliverable is the merged PR. The interpretive artifact is `judge-design.md` (a separate writeup file, not this design doc), which presents the κ table, the methodology, and the closing position on when *not* to use LLM-judge. It is drafted on the third day of the v1 scope window and sourced from the calibration runs produced by this design.
15
+
16
+ ## Non-Goals
17
+
18
+ - **3rd-judge Mistral self-hosted via Modal.** Modal serving substrate exists from PR #8; deferring the third judge to v1.1 keeps the v1 jury at 2 members and the inference cost at the API-only floor.
19
+ - **Multi-seed self-consistency** (T=0 ensemble across seeds). Variance control via rubric permutation only in v1.
20
+ - **DSPy / GEPA / MIPROv2 prompt optimization.** Rubrics are hand-authored with anchored examples; automated optimization is v1.1+.
21
+ - **Length-bias study, bypass tests, full pass^k sweep.** Out-of-scope for v1.
22
+ - **Langfuse self-host integration.** Position paragraph in writeup §10 instead.
23
+ - **Dual-pass intra-rater calibration.** v1 cites the UK AISI bio/chem ceiling (κ ~0.8) as the literature reference; v1.1 may add intra-rater κ as an empirical ceiling.
24
+ - **Synthetic-anchor calibration set** (frontier-model-as-anchor). Methodologically delicate; v1.1+ if pursued.
25
+ - **Backward-compatible Optional fields on `EvalResult`.** Hard cut: `EvalResult.faithfulness` and `EvalResult.correctness` are removed. Existing run artifacts in `results/*.json` will not deserialize against the new schema; this is acceptable because those artifacts are documentation-of-history (read by humans), not inputs to live code, and none of the README's published numbers depend on the removed fields.
26
+
27
+ ## Architecture
28
+
29
+ ### Three-layer evaluation hierarchy
30
+
31
+ | Layer | What | Where | Cost | When |
32
+ |---|---|---|---|---|
33
+ | **L1 — Deterministic** | retrieval P@k/R@k, KHR, source_presence, grounded_refusal, citation_accuracy, calculator_used | `agent_bench/evaluation/metrics.py` (existing, untouched) | $0, CI-safe | every harness run |
34
+ | **L2 — LLM-judge** | per-dimension judges (groundedness, relevance, completeness; +citation_faithfulness opt-in), 2-judge jury, variance-controlled | `agent_bench/evaluation/{judges,rubrics,variance}/` (new) | ~$0.001–0.005/query | optional (`evaluation.judge_provider` set + `evaluation.judge_dimensions` non-empty) |
35
+ | **L3 — Human** | calibration set hand-labels (30 items × 3 dimensions) | `measurements/2026-05-04-judge-calibration-labels.jsonl` (new, hand-authored) | manual, one-time | once; locked |
36
+
37
+ L3 wraps L2 via the κ table; L1 wraps L2 by handling the cases regex can see (citation accuracy is the canonical example — v1 keeps the existing deterministic check; the writeup's §6 argues this is the right cut even after L2 exists).
38
+
39
+ ### Module layout
40
+
41
+ Four new sibling subpackages under `agent_bench/evaluation/`. They are siblings, not nested under a single `judging/` parent, because the file tree should make the L1/L2/L3 hierarchy legible and `calibration/` is L3 evaluation infrastructure that *uses* `judges/`, not a sub-concern of judging.
42
+
43
+ ```
44
+ agent_bench/evaluation/
45
+ harness.py # MIGRATED — drop inline _judge_call; plug in jury
46
+ metrics.py # KEEP deterministic; DELETE answer_faithfulness/answer_correctness/_judge_call/_FAITHFULNESS_PROMPT/_CORRECTNESS_PROMPT
47
+ report.py # existing
48
+ datasets/
49
+ tech_docs_golden.json # existing — 8 items get source_snippets added (calibration subset only)
50
+ k8s_golden.json # existing
51
+ k8s_golden_pilot.json # existing
52
+ calibration_v1.json # NEW — 30 stratified item IDs, version field, system_config_git_sha
53
+ judges/ # NEW
54
+ __init__.py
55
+ base.py # Judge ABC, ScoreResult, Rubric loader, MockJudge, abstain-reason constants
56
+ groundedness.py
57
+ relevance.py
58
+ completeness.py
59
+ citation_faithfulness.py # opt-in v1; default-on v1.1
60
+ rubrics/ # NEW (markdown)
61
+ groundedness.md
62
+ relevance.md
63
+ completeness.md
64
+ citation_faithfulness.md
65
+ variance/ # NEW
66
+ __init__.py
67
+ rubric_permute.py # wraps Judge; permutes rubric levels; aggregates
68
+ jury.py # multi-judge aggregation: mean | kappa_weighted; quorum
69
+ calibration/ # NEW
70
+ __init__.py
71
+ metrics.py # cohen_kappa (linear/quadratic), gwets_ac2, bootstrap_ci — hand-rolled
72
+ report.py # markdown table generator → docs/_generated/kappa_table.md
73
+
74
+ tests/evaluation/ # NEW directory (precedent: tests/test_langchain_baseline/)
75
+ __init__.py
76
+ test_judges.py
77
+ test_rubric_loading.py
78
+ test_calibration_metrics.py
79
+ test_jury_aggregation.py
80
+ test_calibration_report.py
81
+ test_harness_migration.py
82
+ test_mockjudge_coverage.py
83
+ ```
84
+
85
+ ### Supersession of existing judges (dedicated subsection)
86
+
87
+ The new `Judge` ABC fully supersedes `answer_faithfulness`, `answer_correctness`, and `_judge_call` in `agent_bench/evaluation/metrics.py:167-208`. The old code is **deleted** (no deprecation cycle). The supersession changes six axes:
88
+
89
+ | Axis | Old (`_judge_call`) | New (`Judge` ABC) |
90
+ |---|---|---|
91
+ | **Scale** | continuous 0.0–1.0, no anchors | discrete (binary or 3-point) with rubric-anchored examples per level |
92
+ | **Reasoning placement in JSON** | `{"score": …, "reasoning": …}` — score first | `{reasoning, evidence_quotes, score}` — score conditions on reasoning |
93
+ | **Granularity** | combined "faithfulness" / "correctness" | per-dimension (groundedness / relevance / completeness; citation_faithfulness opt-in) |
94
+ | **Versioning** | none — judge_id, rubric, prompt all unrecorded | `judge_id`, `rubric_version` (SHA-256 of rubric file content), `prompt_seed`, `system_output_hash` traceable in every `ScoreResult` |
95
+ | **Variance control** | single call only | composable wrappers (`rubric_permute`, `jury`) |
96
+ | **Failure mode** | bare `except Exception` returns `None`; harness silently drops | intentional: `"Unknown"` abstain on rubric/model noise (with structured-prefix reason); raise on caller bugs (see Error Handling) |
97
+
98
+ **Config knob preservation.** `evaluation.judge_provider` YAML field stays (5 configs reference it; `core/config.py:89`). New judges accept `judge_provider: LLMProvider` matching the existing harness signature pattern. Zero user-facing config migration. New `evaluation.judge_dimensions: list[str]` field (default `["groundedness", "relevance", "completeness"]`); `citation_faithfulness` is opt-in v1, default-on v1.1, decoupling the citation deterministic-vs-LLM head-to-head from the harness migration.
99
+
100
+ **Coupled artifact updates** (in scope of the judge PR):
101
+ - `docs/DESIGN.md:346-356, 395` — rewrite §"LLM-judge metrics (costs money, manual)" to point at this design doc and `judge-design.md` (the writeup).
102
+ - `DECISIONS.md` — append one supersession entry. Entry references file paths explicitly: `measurements/2026-05-04-judge-calibration-labels.jsonl`, the relevant `results/calibration_v1_judge_*.json` files, and the κ table file path. References by file path, not abstract claim — the supersession is defended by the calibration data, not by description.
103
+ - `measurements/README.md` — append one row pointing at the new calibration-labels file (otherwise it orphans next to the cold-start logs).
104
+ - `README.md` — add a "Targets that cost money" subheading (separate concern; see the README cost-disclosure obligation under Testing).
105
+
106
+ ### Dependency direction
107
+
108
+ Judge → Rubric (filesystem markdown loader) → existing `LLMProvider` ABC at `agent_bench/core/provider.py`. **No new external runtime dependencies.** Cohen's κ, Gwet's AC2, and bootstrap CI are hand-rolled (rationale in `calibration/metrics.py` under Components). scikit-learn is *not* added to the project; sklearn appears only in dev tooling under `scripts/_dev/` (see the sklearn fixture pattern under Testing).
109
+
110
+ ## Components
111
+
112
+ ### Rubric (the spec object)
113
+
114
+ ```python
115
+ class Rubric(BaseModel):
116
+     dimension: Literal["groundedness", "relevance", "completeness", "citation_faithfulness"]
117
+     scale: Literal["binary", "three_point"]
118
+     reference_based: bool
119
+     abstain_allowed: bool
120
+     levels: list[RubricLevel]  # parsed from markdown sections
121
+     body_markdown: str  # full file contents
122
+
123
+     @property
124
+     def source_hash(self) -> str:
125
+         # SHA-256 of body_markdown; immutable per file content, independent of git
126
+         ...
127
+
128
+     def render_prompt(self, *, level_permutation_seed: int = 0) -> str:
129
+         # if seed > 0, permute self.levels deterministically using PRNG(seed)
130
+         ...
131
+ ```
132
+
133
+ **Two-hash provenance.** `source_hash` (SHA-256 of canonical body) is immutable per rubric file; `prompt_seed` (per-call int, 0 = no permutation) is recorded on the call. κ aggregation groups by `source_hash`; ScoreResults with the same `source_hash` and different `prompt_seed` are agreement-eligible against the same label. Both fields appear in every `ScoreResult` so records are self-contained.
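
The two provenance values can be illustrated with a minimal sketch. It assumes the rubric body is hashed as UTF-8 text and that the seeded permutation uses the standard-library `random.Random`; the helper names `source_hash` and `permute_levels` are hypothetical stand-ins for the `Rubric` property and the permutation step inside `render_prompt`.

```python
import hashlib
import random


def source_hash(body_markdown: str) -> str:
    """SHA-256 hex digest of the rubric file content (assumed UTF-8)."""
    return hashlib.sha256(body_markdown.encode("utf-8")).hexdigest()


def permute_levels(levels: list[str], seed: int = 0) -> list[str]:
    """Return levels in prompt order; seed 0 keeps the canonical order."""
    ordered = list(levels)  # never mutate the rubric's own list
    if seed > 0:
        # A private PRNG instance keeps the shuffle reproducible per seed.
        random.Random(seed).shuffle(ordered)
    return ordered


body = "## Score 0\nUnsupported claim.\n\n## Score 1\nFully supported.\n"
levels = ["Score 0", "Score 1", "Score 2"]
digest = source_hash(body)
```

Because the hash depends only on file content, two calls with different `prompt_seed` values still share one `rubric_version`, which is what lets them be grouped against the same label.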
134
+
135
+ Loader reads markdown with YAML frontmatter (matching repo convention). Anchored examples are parsed by section header pattern (`## Score 0`, `## Score 1`, …) so level-permutation rewrites the prompt by reordering sections.
136
+
137
+ **Construction validates aggressively** (see Rubric construction validation under Error Handling): scale ∈ {binary, three_point}, levels arity matches scale, every level has at least one anchored example with thinking-trace explanation, frontmatter has all required fields. ValidationError raises with file path + field path. Failing at rubric construction (Day 1) is much cheaper than failing on first `judge.score` call (Day 2 with API budget already spent).
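
A rough sketch of construction-time validation, under stated assumptions: sections follow the `## Score N` header pattern, a non-empty section body stands in for "has an anchored example", and `RubricValidationError` and `parse_levels` are hypothetical names (the spec only says a ValidationError is raised with file path and field path).

```python
import re

EXPECTED_LEVELS = {"binary": 2, "three_point": 3}
SECTION_RE = re.compile(r"^## Score (\d+)[ \t]*$", re.MULTILINE)


class RubricValidationError(ValueError):
    """Hypothetical error type carrying file path and field path."""


def parse_levels(path: str, scale: str, body_markdown: str) -> dict[int, str]:
    """Split a rubric body into levels and fail fast on malformed input."""
    if scale not in EXPECTED_LEVELS:
        raise RubricValidationError(f"{path}: scale: unknown value {scale!r}")
    # With one capture group, split() yields [preamble, "0", text0, "1", text1, ...].
    parts = SECTION_RE.split(body_markdown)
    levels = {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts), 2)}
    if len(levels) != EXPECTED_LEVELS[scale]:
        raise RubricValidationError(
            f"{path}: levels: expected {EXPECTED_LEVELS[scale]}, found {len(levels)}"
        )
    for score, text in levels.items():
        if not text:
            raise RubricValidationError(f"{path}: levels[{score}]: no anchored example")
    return levels


good = "intro\n## Score 0\nExample: unsupported.\n## Score 1\nExample: supported.\n"
parsed = parse_levels("rubrics/groundedness.md", "binary", good)
```

Running this at load time means a rubric with a missing section is rejected before any judge call is made.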
138
+
139
+ ### ScoreResult (per-call record)
140
+
141
+ ```python
142
+ class ScoreResult(BaseModel):
143
+     # Reasoning-first ordering: matters for Pydantic field order
144
+     # AND for the JSON schema sent to the model
145
+     reasoning: str
146
+     evidence_quotes: list[str] = Field(default_factory=list)
147
+     score: int | Literal["Unknown"]
148
+
149
+     # Provenance (self-contained; no run-metadata cross-reference needed)
150
+     judge_id: str  # f"{model_id}_{dimension}", e.g. "claude-haiku-4-5_groundedness"
151
+     rubric_version: str  # = Rubric.source_hash
152
+     prompt_seed: int = 0
153
+     system_output_hash: str  # SHA-256 of canonical (item.id, output.answer, sorted(output.sources))
154
+
155
+     # Operations
156
+     cost_usd: float
157
+     latency_ms: float
158
+
159
+     @property
160
+     def abstained(self) -> bool:
161
+         return self.score == "Unknown"
162
+ ```
163
+
164
+ `score` is `int | Literal["Unknown"]` (not `int | None`) so abstain is structurally distinct from "we don't have a value yet" — the silent-`None` failure mode that the old `_judge_call` exhibited becomes impossible.
165
+
166
+ `system_output_hash` is the cross-run-aggregation guard: scores are agreement-eligible iff `(item.id, dimension, system_output_hash)` match. Any mismatch between labels and predictions raises in the calibration report (see Calibration report failure modes under Error Handling).
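
One way the guard value might be computed is sketched below. The canonical encoding (compact JSON of `[item_id, answer, sorted(sources)]`) is an assumption; the spec fixes only the three inputs and the use of SHA-256.

```python
import hashlib
import json


def system_output_hash(item_id: str, answer: str, sources: list[str]) -> str:
    """Hash the (item id, answer, sorted sources) triple in a canonical form."""
    # Sorting makes the hash independent of retrieval order; compact
    # separators avoid whitespace differences between writers.
    canonical = json.dumps(
        [item_id, answer, sorted(sources)],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


h1 = system_output_hash("k8s-007", "Use a Deployment.", ["b.md", "a.md"])
h2 = system_output_hash("k8s-007", "Use a Deployment.", ["a.md", "b.md"])
h3 = system_output_hash("k8s-007", "Use a StatefulSet.", ["a.md", "b.md"])
```

Here `h1` and `h2` match because only source order differs, while `h3` differs because the answer changed, which is exactly the case where an old label must not be reused.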
167
+
168
+ ### Judge ABC + concrete judges
169
+
170
+ ```python
171
+ class Judge(ABC):
172
+     def __init__(self, judge_provider: LLMProvider, rubric: Rubric, model_id: str):
173
+         self.judge_provider = judge_provider
174
+         self.rubric = rubric
175
+         self.model_id = model_id
176
+         self.judge_id = f"{model_id}_{rubric.dimension}"
177
+
178
+     @abstractmethod
179
+     async def score(
180
+         self,
181
+         item: GoldenQuestion,
182
+         output: AgentResponse,
183
+         *,
184
+         prompt_seed: int = 0,
185
+     ) -> ScoreResult: ...
186
+ ```
187
+
188
+ Concrete judges (`GroundednessJudge`, `RelevanceJudge`, `CompletenessJudge`, `CitationFaithfulnessJudge`) are thin per-dimension classes (~30 lines each), no shared base method. Factoring the prompt-assembly into a base method is rejected: at 3–4 judges of 30 lines each, each is more readable in full than as a delta against a base, and a shared base creates a future trap where dimension-specific logic creeps into the base via `if self.dimension == ...` branches.
189
+
190
+ **Per-judge input expectations** (matters for the FastAPI snippet-authoring scope):
191
+
192
+ | Judge | Reads from `item` | Reads from `output` |
193
+ |---|---|---|
194
+ | `GroundednessJudge` | `source_snippets` (the 8 FastAPI calibration items get hand-snippeted; see FastAPI snippet authoring under Calibration Methodology) | `answer` |
195
+ | `RelevanceJudge` | `question` only | `answer` |
196
+ | `CompletenessJudge` | `reference_answer` | `answer` |
197
+ | `CitationFaithfulnessJudge` | `source_chunk_ids` + retrieved-chunk text | `answer` (parsed for claims + citations) |
198
+
199
+ `CitationFaithfulnessJudge` returns one aggregate `ScoreResult` per item (preserving ABC polymorphism), with per-pair (claim, citation) detail in `evidence_quotes`. Aggregation rule for binary: **all-or-nothing** — any unfaithful citation → score=0. The rule is documented explicitly in `rubrics/citation_faithfulness.md`.
200
+
201
+ ### MockJudge
202
+
203
+ Same shape as `Judge`; constructor takes `verdicts: dict[str, ScoreResult]` keyed by `item.id`. Returns the pre-baked verdict on `score()`, no API call. **Raises `LookupError` on missing keys** — never returns a default — so test fixtures are self-checking. A separate fixture-validation test (`test_mockjudge_coverage.py`) walks `item.id` across all goldens and asserts every MockJudge instance has coverage for items its tests reference. Two-layer defense against the rename-breaking-tests failure mode. Mirrors the `MockProvider` pattern at `agent_bench/core/provider.py:118`.
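
A minimal sketch of the fail-loud lookup, using simplified stand-ins: `Verdict` replaces `ScoreResult`, and `score` takes an item id directly rather than the `(item, output)` pair of the real ABC. Both simplifications are assumptions made to keep the example self-contained.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Simplified stand-in for ScoreResult."""

    score: int | str
    reasoning: str = ""


class MockJudge:
    """Returns pre-baked verdicts; never falls back to a default."""

    def __init__(self, verdicts: dict[str, Verdict]):
        self._verdicts = dict(verdicts)

    async def score(self, item_id: str) -> Verdict:
        try:
            return self._verdicts[item_id]
        except KeyError:
            # Re-raise with a message that names the missing fixture key.
            raise LookupError(f"MockJudge has no verdict for item {item_id!r}") from None


judge = MockJudge({"q1": Verdict(score=1, reasoning="supported")})
known = asyncio.run(judge.score("q1"))
```

A test that references a renamed item id then fails at the lookup instead of passing on a silent default.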
204
+
205
+ ### rubric_permute (variance wrapper)
206
+
207
+ ```python
208
+ def rubric_permute(judge: Judge, n: int = 2, seeds: list[int] | None = None) -> PermutedJudge: ...
209
+ ```
210
+
211
+ `PermutedJudge.score(item, output)` runs `judge.score(item, output, prompt_seed=s)` for each `s` in `seeds` (default `[1, 2]`), aggregates:
212
+ - Binary: majority (n=2 → tie-break to lower score, more conservative)
213
+ - Three-point: mean, rounded to nearest level **with ties broken downward** (e.g., 1.5 → 1, 0.5 → 0); same conservative principle as the binary tie-break
214
+ - **Any abstain → "Unknown"** (any sample, not all): the whole point of rubric permutation is to surface whether judge behavior depends on prompt structure; averaging an abstain away with a confident sample defeats the technique. At N=2, "all abstain" essentially never fires, making it a silent aggressive default. "Any abstain → Unknown" is the conservative choice that preserves the variance signal.
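
The three aggregation rules above can be sketched as one function. The name `aggregate_permutations` is hypothetical, and exact rational arithmetic is used here only so the tie cases are unambiguous.

```python
from __future__ import annotations

import math
from fractions import Fraction

UNKNOWN = "Unknown"


def aggregate_permutations(scores: list[int | str], scale: str) -> int | str:
    """Combine per-permutation scores into one conservative verdict."""
    # Any abstain propagates: an abstain is signal, not noise to average away.
    if any(s == UNKNOWN for s in scores):
        return UNKNOWN
    mean = Fraction(sum(scores), len(scores))
    if scale == "binary":
        # Strict majority needed for 1; a tie falls to the lower score.
        return 1 if mean > Fraction(1, 2) else 0
    # Three-point: round to nearest level, with exact halves rounded down.
    return math.ceil(mean - Fraction(1, 2))


tie_binary = aggregate_permutations([1, 0], "binary")
tie_three_point = aggregate_permutations([1, 2], "three_point")
with_abstain = aggregate_permutations([1, UNKNOWN], "binary")
```

At the default n=2, a split binary vote yields 0 and a 1-versus-2 split on the three-point scale yields 1.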
215
+
216
+ Returns one `ScoreResult` with `judge_id = f"{judge.judge_id}_perm{n}"`, `prompt_seed=0` on the aggregate. Per-permutation results are written to a sidecar JSONL (same pattern as the jury subsection below) for traceability.
217
+
218
+ ### jury (multi-judge aggregator)
219
+
220
+ ```python
221
+ def jury(
222
+     judges: list[Judge],
223
+     aggregation: Literal["mean", "kappa_weighted"],
224
+     weights: dict[str, float] | None = None,  # required if kappa_weighted
225
+     quorum: int | None = None,  # default: len(judges), i.e. strict
226
+     sidecar_path: str | None = None,  # default: results/calibration_v1_judge_jury_{aggregation}_members.jsonl
227
+ ) -> Jury: ...
228
+ ```
229
+
230
+ `Jury.score(item, output)` wraps each member call in a task and awaits `asyncio.gather(*tasks, return_exceptions=False)` inside a try/except at the jury level. A non-retryable exception propagates out of `gather` immediately, and the except block cancels the remaining sibling tasks before re-raising (plain `gather` does not cancel them on its own), so caller bugs fail fast. Per-member ScoreResults are always written to the sidecar (successes and failure-as-abstains alike). Aggregate behavior:
231
+
232
+ 1. Count `successful_members = sum(1 for r in member_results if not r.abstained)`.
233
+ 2. If `successful_members < quorum`: aggregate = `ScoreResult(score="Unknown", reasoning=f"jury_below_quorum: {successful_members}/{len(judges)} members succeeded; required {quorum}", ...)`.
234
+ 3. Else: aggregate using `aggregation` strategy over the successful members' scores. **Discretization rule (same as `rubric_permute`):** binary scores threshold at 0.5 with ties → 0; three-point scores round to nearest with ties → lower level. Discretization happens at the aggregation step, before the κ join — Cohen's κ requires both inputs discrete.
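
The quorum check and discretization steps above might look like the following sketch. It assumes `kappa_weighted` means a weighted mean with positive per-judge weights; that reading, the function name `aggregate_jury`, and the `(score, reason)` return shape are illustrative assumptions, not the spec's definition.

```python
from __future__ import annotations

import math
from fractions import Fraction

UNKNOWN = "Unknown"


def aggregate_jury(
    member_scores: dict[str, int | str],
    scale: str,
    quorum: int | None = None,
    weights: dict[str, float] | None = None,
) -> tuple[int | str, str]:
    """Return (score, reason); reason is non-empty only below quorum."""
    successful = {j: s for j, s in member_scores.items() if s != UNKNOWN}
    required = len(member_scores) if quorum is None else quorum  # strict default
    if len(successful) < required:
        reason = (
            f"jury_below_quorum: {len(successful)}/{len(member_scores)} "
            f"members succeeded; required {required}"
        )
        return UNKNOWN, reason
    # Equal weights reproduce the plain mean; weights are assumed positive.
    w = {j: Fraction(str(weights[j])) if weights else Fraction(1) for j in successful}
    mean = sum(w[j] * successful[j] for j in successful) / sum(w.values())
    if scale == "binary":
        return (1 if mean > Fraction(1, 2) else 0), ""
    return math.ceil(mean - Fraction(1, 2)), ""  # ties go to the lower level


below = aggregate_jury({"judge_a": 1, "judge_b": UNKNOWN}, "binary")
split = aggregate_jury({"judge_a": 1, "judge_b": 0}, "binary")
weighted = aggregate_jury(
    {"judge_a": 2, "judge_b": 1}, "three_point", weights={"judge_a": 0.8, "judge_b": 0.2}
)
```

With two judges and the strict default, a single abstain makes the whole jury abstain, and a split binary vote discretizes to 0.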
235
+
236
+ **Strict quorum default for v1.** `quorum=N` (= `len(judges)`) at v1's 2-judge jury means any member abstain → jury abstain. Tolerant defaults at N=2 are silent single-judge in jury clothing. The parameter exists in v1 so v1.1's 3-judge jury can shift to `quorum=2` (majority) without rearchitecting failure semantics.
237
+
238
+ `kappa_weighted` requires explicit `weights` injection — computed offline once on the calibration set, *not* at jury construction (would be circular).
239
+
240
+ ### calibration/metrics.py (hand-rolled κ + bootstrap)
241
+
242
+ ```python
243
+ def cohen_kappa(
244
+     y1: list[int | str], y2: list[int | str],
245
+     weights: Literal[None, "linear", "quadratic"] = None,
246
+ ) -> float: ...
247
+
248
+ def gwets_ac2(
249
+     y1: list[int | str], y2: list[int | str],
250
+     weights: Literal[None, "linear", "quadratic"] = None,
251
+ ) -> float: ...
252
+
253
+ def bootstrap_ci(
254
+     y1: list, y2: list, metric_fn: Callable[[list, list], float],
255
+     n_iter: int = 1000, ci: float = 0.95, seed: int = 42,
256
+ ) -> tuple[float, float, float]: ...  # (point_estimate, ci_lo, ci_hi)
257
+ ```
258
+
259
+ **Hand-rolled, not sklearn.** Adding scikit-learn for one function (and transitively numpy + scipy + threadpoolctl + joblib) contradicts agent-bench's "built from primitives" identity. The hand-roll also serves the writeup: `(P_o − P_e) / (1 − P_e)` with explicit `P_e` computation demonstrates formula understanding in a way that an `sklearn.metrics.cohen_kappa_score` import does not. Fixture-tested against sklearn run *outside* the project venv (see the sklearn fixture pattern under Testing).
260
+
261
+ **Abstain handling in κ.** Excluded pairwise — if either side abstains on item *i*, item *i* drops from that κ calculation. Standard treatment (Tu et al. 2024, *Beyond Correlation*); abstain as "I don't know" is neither agreement nor disagreement. Abstain count per dimension is reported separately by the calibration report (see `calibration/report.py` below).
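
A minimal sketch of the unweighted case, showing the explicit `P_e` computation and the pairwise abstain exclusion. The function name `cohen_kappa_unweighted` and the choice to raise `ValueError` when κ is undefined are assumptions; the linear and quadratic weightings are not shown.

```python
from __future__ import annotations

from collections import Counter

UNKNOWN = "Unknown"


def cohen_kappa_unweighted(y1: list[int | str], y2: list[int | str]) -> float:
    """Cohen's kappa = (P_o - P_e) / (1 - P_e), abstains excluded pairwise."""
    pairs = [(a, b) for a, b in zip(y1, y2) if a != UNKNOWN and b != UNKNOWN]
    n = len(pairs)
    if n == 0:
        raise ValueError("kappa undefined: no agreement-eligible items")
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in pairs) / n
    # Chance agreement: sum over categories of the product of marginals.
    c1 = Counter(a for a, _ in pairs)
    c2 = Counter(b for _, b in pairs)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa undefined: no variance in ratings")
    return (p_o - p_e) / (1 - p_e)


# Item 5 drops out because the first rater abstained.
human = [1, 1, 0, 0, UNKNOWN]
judge = [1, 0, 0, 0, 1]
kappa = cohen_kappa_unweighted(human, judge)
```

Worked by hand for this example: four eligible items, `P_o = 3/4`, marginals (2, 2) and (3, 1) give `P_e = (2*3 + 2*1) / 16 = 1/2`, so κ = (0.75 − 0.5) / 0.5 = 0.5.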
262
+
263
+ **Gwet's AC2 deferred from headline numbers.** AC2 is implemented in v1 but the published numbers in the v1 writeup come from κ only; AC2 fixture-test rigor (sympy-derived intermediate steps, not arithmetic-derived) is v1.1 work. Hand-computed AC2 fixtures in v1 cover three inspection-verifiable cases (perfect agreement, perfect disagreement, mid-range).
264
+
265
+ ### calibration/report.py
266
+
267
+ One function: `generate_kappa_table(predictions_glob, labels_path, output_path, *, strict: bool = False)` → writes `docs/_generated/kappa_table.md`. Idempotent. Joins predictions ⋈ labels on `(item_id, dimension, system_output_hash)`; raises on hash mismatch (collect-all, error includes first-item expected/actual hashes plus full mismatched-id list). Computes per-config, per-dimension κ + bootstrap CI + abstain rate; flags rows where abstain rate **strictly greater than** 20% with a footnote (`"κ computed on N=X of 30 items; high abstain rate (Y% — breakdown: Z% schema parse, W% genuine abstain) suggests rubric ambiguity"`).
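
The collect-all hash check in the join could be sketched as follows. Records are plain dicts here, and `HashMismatchError` and `join_on_hash` are hypothetical names; missing labels are skipped in this sketch because the warn/strict handling is a separate concern.

```python
class HashMismatchError(Exception):
    """Hypothetical error raised when labels and predictions disagree on hashes."""


def join_on_hash(predictions: list[dict], labels: list[dict]) -> list[tuple]:
    """Join on (item_id, dimension); collect every hash mismatch, then raise."""
    label_index = {(lab["item_id"], lab["dimension"]): lab for lab in labels}
    joined, mismatched = [], []
    for pred in predictions:
        key = (pred["item_id"], pred["dimension"])
        label = label_index.get(key)
        if label is None:
            continue  # missing label: handled by the warn/strict path, not shown
        if label["system_output_hash"] != pred["system_output_hash"]:
            mismatched.append((key, label["system_output_hash"], pred["system_output_hash"]))
        else:
            joined.append((pred["score"], label["score"]))
    if mismatched:
        (item_id, dimension), expected, actual = mismatched[0]
        ids = sorted({key[0] for key, _, _ in mismatched})
        raise HashMismatchError(
            f"{len(mismatched)} stale pair(s); first: {item_id}/{dimension} "
            f"expected {expected}, got {actual}; item ids: {ids}"
        )
    return joined


labels = [{"item_id": "q1", "dimension": "relevance", "score": 1, "system_output_hash": "aaa"}]
fresh = [{"item_id": "q1", "dimension": "relevance", "score": 1, "system_output_hash": "aaa"}]
stale = [{"item_id": "q1", "dimension": "relevance", "score": 1, "system_output_hash": "bbb"}]
pairs = join_on_hash(fresh, labels)
```

Collecting all mismatches before raising means one run reports the full list of stale ids rather than stopping at the first.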
268
+
269
+ **Two modes for missing predictions/labels:**
270
+ - Default: WARN-and-exclude (Day-2 development loop — partial coverage is real interim state).
271
+ - `--strict`: RAISE on any missing prediction/label (final-artifact path; `make calibrate` invokes this; the writeup is by-construction produced from `--strict` output).
272
+
273
+ The κ table is copy-pasted into the writeup at draft time, not include-by-reference — the writeup is a frozen v1 artifact and copy-paste lets the writeup add inline annotations to specific cells.
274
+
275
+ ## Data Flow
276
+
277
+ ### Production harness run (existing, migrated)
278
+
279
+ ```
280
+ golden file → load_golden_dataset() → list[GoldenQuestion]
281
+ → for each item, parallel:
282
+ orchestrator.run() → AgentResponse
283
+ compute L1 metrics (existing — untouched)
284
+ if judge_provider is not None and item.category != "out_of_scope":
285
+ system_output_hash = hash(item.id, response.answer, sorted(response.sources))
286
+ for each Judge in evaluation.judge_dimensions:
287
+ ScoreResult = await judge.score(item, response)
288
+ attach to EvalResult.judge_scores: dict[str, ScoreResult]
289
+ → write results/{run_label}.json
290
+ ```
291
+
292
+ **Migration delta** at `agent_bench/evaluation/harness.py:153-166`:
293
+ - DELETE inline import of `answer_faithfulness, answer_correctness`
294
+ - DELETE `result.faithfulness = ...` and `result.correctness = ...` assignments
295
+ - ADD: load configured judges from `evaluation.judge_dimensions` config; build with existing `judge_provider`
296
+ - ADD: `result.judge_scores: dict[str, ScoreResult]` field on `EvalResult`
297
+ - KEEP: `if judge_provider is not None and q.category != "out_of_scope"` gate (out-of-scope items still bypass L2; refusal is deterministic)
298
+ - KEEP: `evaluation.judge_provider` YAML field (5 configs reference it)
299
+
300
+ ### Calibration run (new)
301
+
302
+ ```
303
+ calibration_v1.json (30 IDs + version + system_config_git_sha)
304
+ → filter k8s_golden.json + tech_docs_golden.json → 30 GoldenQuestions
305
+
306
+ Step A (once, frozen): generate system outputs
307
+ → orchestrator.run() with frozen config for each item
308
+ → write results/calibration_v1_system_outputs.json
309
+ (each record includes system_output_hash, item_id, answer, sources, source_chunks, citations)
310
+
311
+ Step B (manual): hand-label
312
+ → labeling notebook reads system_outputs file, injects system_output_hash automatically
313
+ → for each (item, dimension), human authors score + notes
314
+ → append to measurements/2026-05-04-judge-calibration-labels.jsonl
315
+ {item_id, dimension, score | "Unknown", abstained, notes, label_timestamp, system_output_hash}
316
+
317
+ Step C (per ablation row): score with judges
318
+ → load row config from configs/calibration/rows/{label}.yaml
319
+ → load system_outputs file (frozen)
320
+ → for each item, judge.score(item, output) per row's judge configuration
321
+ → write results/calibration_v1_judge_{row_label}.json
322
+ and (jury rows) results/calibration_v1_judge_jury_{aggregation}_members.jsonl
323
+
324
+ Step D (κ table):
325
+ → calibration/report.generate_kappa_table(strict=True for final artifact)
326
+ → join predictions ⋈ labels on (item_id, dimension, system_output_hash); raise on mismatch
327
+ → exclude pairs where either side abstains
328
+ → cohen_kappa + bootstrap_ci + abstain_rate per (config, dimension)
329
+ → write docs/_generated/kappa_table.md
330
+ ```
331
+
332
+ **Hash propagation through labels** is intentional: labels carry `system_output_hash` because they are tied to specific outputs. If `system_outputs` are ever regenerated (config change, retry), labels become stale and the κ join raises loudly. This eliminates the cross-run aggregation bug class.
333
+
334
+ ### Concurrency
335
+
336
+ - **Within an item, across judges (jury):** `asyncio.gather` over `judges`; existing provider rate-limit/retry kicks in.
337
+ - **Across items in a calibration row:** `asyncio.gather` with semaphore, default concurrency=5, configurable via CLI flag with config-field fallback. **Resolved value logged at run start** so artifacts capture which concurrency was used.
338
+ - **Across rows of the ablation:** rows run sequentially. Each row writes its predictions file before the next starts — partial progress survives interruption.
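
The across-items pattern (bounded `asyncio.gather`) can be sketched with a semaphore. The helper name `gather_bounded` and the use of zero-argument coroutine factories are assumptions made for illustration.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def gather_bounded(
    factories: list[Callable[[], Awaitable[Any]]], concurrency: int = 5
) -> list[Any]:
    """Run all jobs, at most `concurrency` at a time; results keep input order."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(factory: Callable[[], Awaitable[Any]]) -> Any:
        async with semaphore:  # blocks while `concurrency` jobs are in flight
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in factories))


in_flight = {"now": 0, "peak": 0}


def make_job(i: int) -> Callable[[], Awaitable[int]]:
    async def job() -> int:
        in_flight["now"] += 1
        in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
        await asyncio.sleep(0.01)  # stand-in for a judge call
        in_flight["now"] -= 1
        return i

    return job


results = asyncio.run(gather_bounded([make_job(i) for i in range(6)], concurrency=2))
```

Passing factories rather than already-created coroutines keeps a job from starting before it holds a semaphore slot.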
339
+
340
+ ### New scripts and Makefile targets
341
+
342
+ ```
343
+ scripts/
344
+ evaluate.py # existing — full-corpus harness runs
345
+ run_calibration.py # NEW — orchestrates Steps A, C, D
346
+ # subcommands: generate-outputs | run-judges --row-config=<path> | build-table [--strict]
347
+ # Step B (labeling) is manual — done in a notebook
348
+ configs/calibration/rows/ # NEW — one YAML per ablation row (config-file-per-row)
349
+ baseline.yaml
350
+ baseline_no_cot.yaml
351
+ baseline_no_anchors.yaml
352
+ baseline_no_abstain.yaml
353
+ permute.yaml
354
+ jury_kappa_weighted.yaml
355
+
356
+ Makefile:
357
+ calibrate # runs full pipeline: generate-outputs → run-judges (all rows) → build-table --strict
358
+ evaluate-judges # runs run-judges + build-table against existing system_outputs (no regeneration)
359
+ ```
360
+
361
+ Row configs are independently versioned reproducible artifacts in the PR. `run-judges` is a generic runner taking `--row-config=<path>`; the script does not own the row inventory. Discovering a bug in row 4 means fixing row 4's config and rerunning rows 4-6 without touching 1-3.
362
+
363
+ ### Failure modes eliminated by this design
364
+
365
+ | Bug class | Eliminated by |
366
+ |---|---|
367
+ | Cross-run aggregation (run-A outputs scored against run-B labels) | `system_output_hash` join with raise-on-mismatch |
368
+ | Stale labels after system re-run | Same |
369
+ | MockJudge silently passing tests with renamed item IDs | `LookupError` on missing keys + fixture-validation test |
370
+ | Single-call judge bias hidden | Rubric permutation surfaces it via abstain propagation |
371
+ | Per-judge κ unrecoverable from jury aggregate | Sidecar JSONL with deterministic path |
372
+ | Partial progress lost on Step C interruption | One predictions file per row, written sequentially |
373
+ | Schema parse failures silently dropped (old `_judge_call` `None`) | Discrete abstain-with-prefix; abstain rate flagged at >20% |
374
+ | Final writeup citing N=28 while prose claims N=30 | `--strict` mode for final-artifact build; default warns |
375
+
376
+ ## Error Handling
377
+
378
+ ### Failure taxonomy at L2
379
+
380
+ | Category | Source | Where caught | Decision |
381
+ |---|---|---|---|
382
+ | Provider retryable (rate limit, timeout, network) | Infra | Existing `LLMProvider` retry/backoff | Bubbles up only on retry exhaustion |
383
+ | Provider exhausted (retries exhausted) | Infra | `Judge.score` | Abstain with `ABSTAIN_REASON_PROVIDER_EXHAUSTED` |
384
+ | Provider non-retryable (401, 400) | Caller misconfig | `Judge.score`; jury cancels siblings | **Raise** — bug, not noise |
385
+ | Schema parse error | Model glitch or broken prompt | `Judge.score` | Abstain after one strict-reprompt retry; `ABSTAIN_REASON_SCHEMA_PARSE` |
386
+ | Score out of range | Model glitch | `Judge.score` | Abstain after one strict-reprompt retry; `ABSTAIN_REASON_OUT_OF_RANGE` |
387
+ | Genuine model abstain (rubric allows) | Model judgment | `Judge.score` | Abstain with empty-prefix sentinel (`ABSTAIN_REASON_GENUINE` = `""`) |
388
+ | Hash mismatch on κ join | Stale labels | `calibration/report.py` | Raise after collect-all; first-item expected/actual hashes in message |
389
+
390
+ ### The abstain-vs-raise discipline
391
+
392
+ **One retry with strict reprompt** on schema parse / score out of range. Original prompt's formatting instructions are augmented at the end with a recency-positioned reminder: `STRICT FORMATTING NOTE: respond ONLY with a JSON object matching the schema; reasoning first, then evidence_quotes, then score`. If second attempt also fails, abstain with structured-prefix reason. **Exactly one retry** — zero retries throws away signal that recovers cheaply; N>1 retries silently mask systematic schema breaks.
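
The exactly-one-retry control flow can be sketched as below. The model call is a synchronous callable here for brevity (the real `Judge.score` is async), and `score_with_one_retry` plus the dict return shape are hypothetical; the constants and the reprompt text are taken from this section.

```python
import json

STRICT_NOTE = (
    "STRICT FORMATTING NOTE: respond ONLY with a JSON object matching the schema; "
    "reasoning first, then evidence_quotes, then score"
)
ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "


def score_with_one_retry(call_model, prompt: str, allowed_scores: tuple) -> dict:
    """Try once, reprompt once with the strict note, then abstain."""
    failure = ""
    for attempt in (1, 2):
        # The reminder goes at the end of the prompt on the second attempt only.
        text = call_model(prompt if attempt == 1 else f"{prompt}\n\n{STRICT_NOTE}")
        try:
            payload = json.loads(text)
            score = payload["score"]
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            failure = ABSTAIN_REASON_SCHEMA_PARSE + str(exc)
            continue
        if score in allowed_scores:
            return {"score": score, "reasoning": payload.get("reasoning", "")}
        failure = ABSTAIN_REASON_OUT_OF_RANGE + repr(score)
    return {"score": "Unknown", "reasoning": failure}


seen_prompts: list[str] = []
replies = iter(["not json", '{"reasoning": "ok", "evidence_quotes": [], "score": 1}'])


def fake_model(prompt: str) -> str:
    seen_prompts.append(prompt)
    return next(replies)


recovered = score_with_one_retry(fake_model, "Rate the answer.", (0, 1))
```

In this run the first reply fails to parse and the second succeeds, which is the branch the first-attempt-failure log is meant to make visible.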
393
+
394
+ **Failure-reason prefixes as constants** in `judges/base.py`:
395
+
396
+ ```python
397
+ ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: "
398
+ ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: "
399
+ ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: "
400
+ ABSTAIN_REASON_GENUINE = "" # empty-prefix sentinel for rubric-allowed abstain
401
+ ```
402
+
403
+ Calibration report imports + pattern-matches against typed constants for the four-way abstain-cause breakdown in the >20% threshold flag.
404
+
405
+ ### First-attempt-failure log schema (fires on success-after-retry too)
406
+
407
+ WARN-level structured log line, fixed key set, no schema drift. Uses `structlog` matching repo precedent at `agent_bench/evaluation/metrics.py:14` (`logger = structlog.get_logger()`):
408
+
409
+ ```python
410
+ logger.warning(
411
+     "judge_first_attempt_failure",
412
+     judge_id=self.judge_id,
413
+     item_id=item.id,
414
+     provider=type(self.judge_provider).__name__,
415
+     failure_cause=ABSTAIN_REASON_SCHEMA_PARSE,  # one of the four constants
416
+     attempt_index=1,
417
+ )
418
+ ```
419
+
420
+ Fires on first-attempt failure regardless of whether the second attempt succeeds. The "first failed, second succeeded" branch is the most analytically interesting case — it tells you the reprompt is doing work rather than just consuming budget. Without this log, that branch is invisible.
421
+
422
+ ### Jury partial-failure (quorum)
423
+
424
+ Per the jury subsection above: strict quorum default; per-member ScoreResults always written to sidecar; aggregate is `score="Unknown"` with `jury_below_quorum` reason if `successful_members < quorum`. Provider non-retryable in any member → jury raises immediately and cancels the sibling tasks. This is the `return_exceptions=False` + try/except pattern, *not* `return_exceptions=True` + inspection: the two look identical to a careless reader, but only the former surfaces the first exception without waiting for every member to finish. `asyncio.gather` itself leaves the remaining awaitables running when it propagates that exception, so the except block must cancel them explicitly.
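
A minimal sketch of fail-fast with explicit sibling cancellation. Note that `asyncio.gather` with `return_exceptions=False` propagates the first exception to the awaiter but does not cancel the other awaitables by itself, so the cancel step is written out. `NonRetryableError` and `run_members` are hypothetical names.

```python
import asyncio


class NonRetryableError(Exception):
    """Hypothetical marker for caller-misconfiguration errors (e.g. 401, 400)."""


async def run_members(coros: list) -> list:
    """Await all members; on a non-retryable error, cancel siblings and re-raise."""
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        return await asyncio.gather(*tasks)  # return_exceptions=False (default)
    except NonRetryableError:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished
        # Drain so every cancellation is processed before the error leaves.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise


events: list[str] = []


async def misconfigured_member():
    raise NonRetryableError("401 unauthorized")


async def slow_member():
    try:
        await asyncio.sleep(10)
        events.append("finished")
    except asyncio.CancelledError:
        events.append("cancelled")
        raise


async def main() -> str:
    try:
        await run_members([misconfigured_member(), slow_member()])
    except NonRetryableError:
        return "raised"
    return "completed"


outcome = asyncio.run(main())
```

On Python 3.11+, `asyncio.TaskGroup` gives the same cancel-siblings behavior without the manual loop, at the cost of wrapping the error in an exception group.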
425
+
426
+ ### Permutation wrapper failure
427
+
428
+ Per the `rubric_permute` subsection above: any-permutation abstain → aggregate abstain. Per-permutation results written to sidecar.
429
+
430
+ ### Rubric construction validation
431
+
432
+ `Rubric.from_markdown_file()` validates aggressively: scale ∈ {binary, three_point}, levels arity matches scale, every level has at least one anchored example with thinking-trace explanation, frontmatter has all required fields. ValidationError raises with file path + field path. Validation discipline is named explicitly in the spec because the alternative ("validate lazily on first score call") is the kind of thing that creeps in if not specified — and a malformed-rubric error on Day 2 after API budget has been spent is materially worse than a malformed-rubric error on Day 1.
433
+
434
+ ### Calibration report failure modes
435
+
436
+ | Condition | Default behavior | `--strict` behavior |
437
+ |---|---|---|
438
+ | Hash mismatch | Raise after collect-all (first item expected/actual + full id list) — **applies to both modes; never warn** | Same |
439
+ | Missing prediction (label exists, no prediction for `(item_id, dim)`) | WARN; exclude from κ; coverage row in footer | RAISE |
440
+ | Missing label (prediction exists, no label) | WARN; exclude; coverage row in footer | RAISE |
441
+ | κ undefined (insufficient variance after exclusion, or N<3 agreement-eligible) | Render `"—"` with footnote — **applies to both modes** | Same |
442
+ | Abstain rate > 20% (strictly greater) | Render κ + footnote with cause breakdown — **applies to both modes** | Same |
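
The strictly-greater threshold and the cause breakdown can be sketched together. The prefix strings are the constants from `judges/base.py` above; treating any abstain without a known prefix as genuine mirrors the empty-prefix sentinel. The names `abstain_cause` and `abstain_summary` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

PREFIXES = {
    "provider_exhausted": "judge_call_failed_after_retry: ",
    "schema_parse": "schema_parse_failed_after_retry: ",
    "out_of_range": "score_out_of_range_after_retry: ",
}


def abstain_cause(reasoning: str) -> str:
    """Classify an abstain by its structured prefix; no prefix means genuine."""
    for cause, prefix in PREFIXES.items():
        if reasoning.startswith(prefix):
            return cause
    return "genuine"


def abstain_summary(abstain_reasons: list[str], total_items: int) -> dict:
    """Flag only when the abstain rate is strictly greater than 20%."""
    rate = Fraction(len(abstain_reasons), total_items)  # exact, no float rounding
    return {
        "flagged": rate > Fraction(1, 5),
        "causes": Counter(abstain_cause(r) for r in abstain_reasons),
    }


six = ["schema_parse_failed_after_retry: bad json"] * 4 + ["rubric is ambiguous here"] * 2
at_boundary = abstain_summary(six, 30)
over_boundary = abstain_summary(six + ["judge_call_failed_after_retry: timeout"], 30)
```

With 30 items, 6 abstains is exactly 20% and does not fire; 7 abstains does.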
443
+
444
+ ## Testing
445
+
446
+ ### File layout
447
+
448
+ Seven new test files (plus `__init__.py`) under `tests/evaluation/` covering the new module subpackages. Existing `tests/test_evaluation.py` stays at top level (precedent: `tests/test_langchain_baseline/`); the existing file's faithfulness/correctness assertions are dropped, but the file is not renamed (preserves git blame).
449
+
450
+ ### sklearn fixture pattern (κ parity tests)
451
+
452
+ Four-part discipline:
453
+
454
+ 1. **Generation script** at `scripts/_dev/generate_kappa_fixtures.py` — committed; `_dev` prefix marks as not-runtime. Imports sklearn; documented to run from a venv outside the project. **Action item:** verify `_dev/*` is excluded from ruff/mypy via `pyproject.toml` (currently no `extend-exclude` set; add as part of this PR).
455
+ 2. **Inline constants** in `test_calibration_metrics.py` — `SKLEARN_KAPPA_FIXTURES: dict[str, float]` and `SKLEARN_KAPPA_INPUTS: dict[str, dict]`. Locality preserved, type-checked.
456
+ 3. **Version-pinned comment header** — `# Fixtures generated against scikit-learn==1.5.2 cohen_kappa_score on 2026-05-04` with regeneration instructions. Drift detection if sklearn behavior changes in a future version.
457
+ 4. **Load-bearing comment** — `# DO NOT add scikit-learn to the project's dependencies — these constants are the contract.` Prevents the well-meaning future contributor from "fixing" tests by importing sklearn at runtime.
458
+
459
+ **Cross-check CI test:** the generation script writes its inputs to a JSON sidecar under `tests/evaluation/fixtures/sklearn_kappa_inputs.json`; a CI test asserts `SKLEARN_KAPPA_INPUTS` matches that JSON. Catches the "updated CASES list, forgot to regenerate" failure mode at CI time. Five lines of test code.
460
+
461
+ **No sklearn parity for AC2 in v1.** sklearn doesn't have AC2; pulling `irrCAC` reintroduces the dependency problem one level over. Three hand-computed AC2 cases (perfect agreement, perfect disagreement, mid-range) where the formula reduces to inspection-verifiable values. v1.1 may add sympy-derived AC2 fixtures (script under `scripts/_dev/generate_ac2_fixtures.py` with sympy as dev-only dep, sympy intermediate steps printed for audit). v1.1 spec line: *"AC2 hand-computed fixtures are sympy-derived not arithmetic-derived; verification requires reading the sympy intermediate output, not just inspecting the test."*
462
+
463
+ ### Test inventory (~35 tests total)
464
+
465
+ | File | Tests | Notes |
466
+ |---|---|---|
467
+ | `test_judges.py` | ~7 | ABC contract, MockJudge round-trip + LookupError, ScoreResult validation, abstain-with-prefix (parameterized over 3 causes), raise on non-retryable, first-attempt-failure log fires |
468
+ | `test_rubric_loading.py` | ~6 | Construction validation (parameterized over 4 invalid cases), source_hash determinism, source_hash changes with content, permutation seed reproducibility, permutation changes prompt |
469
+ | `test_calibration_metrics.py` | ~7 | 3 hand-computed κ cases + 3 sklearn-fixture parity + 1 bootstrap-CI seed reproducibility |
470
+ | `test_jury_aggregation.py` | ~5 | mean, kappa_weighted, strict-quorum-abstain, sidecar capture, cancel-on-non-retryable |
471
+ | `test_calibration_report.py` | ~6 | hash-mismatch with first-item detail, --strict raise, default WARN, undefined-κ dash, abstain-flag boundary 6/30 (does not fire) and 7/30 (fires), abstain breakdown by cause |
472
+ | `test_harness_migration.py` | ~3 | judge_scores populated when configured, out_of_scope skipped, judge_provider config preserved |
473
+ | `test_mockjudge_coverage.py` | ~1 | item.id walk across all goldens |
474
+ | **Total** | **~35** | |
475
+
476
+ The original "~15–20" estimate was made before the Error Handling section was designed. Designing error handling and not expanding the test count is the inconsistency: the abstain-cause logic is the highest-stakes-when-silently-wrong piece of the project (wrong abstain semantics → quietly wrong κ in the published report). If Day 3 budget runs short, the cuttable margin is `test_harness_migration.py` (integration-y, failures show up loudly); the metric-correctness and judge-failure-handling tests do not get cut.
477
+
478
+ ### Discipline conventions
479
+
480
+ - Mocked providers everywhere. Zero network calls in CI. `MockProvider` for the underlying LLM; `MockJudge` for tests that need pre-baked verdicts.
481
+ - `pytest-asyncio` (`asyncio_mode = "auto"` already set) for async tests.
482
+ - Hand-computed κ cases include worked-out arithmetic in a comment block so a reader can verify the formula without running the test.
483
+ - Larger reusable fixtures live under `tests/evaluation/fixtures/`; one-off small fixtures stay inline.
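As an illustration of the "worked-out arithmetic in a comment block" convention, here is a sketch of one hand-computed case. It assumes unweighted Cohen's κ and a hypothetical `cohen_kappa` helper; the project's own implementation in `calibration/metrics.py` may use a different signature or weighting.

```python
from collections import Counter
from typing import Optional, Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> Optional[float]:
    """Unweighted Cohen's kappa; None when chance agreement equals 1."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: sum over labels of the product of marginal rates.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return None  # undefined: both raters used one identical label
    return (p_o - p_e) / (1.0 - p_e)


# Worked arithmetic for the mid-range case (n = 4):
#   human = [0, 0, 1, 1], judge = [0, 1, 1, 1]
#   p_o   = 3/4                      (items 1, 3 and 4 agree)
#   p_e   = (2/4)(1/4) + (2/4)(3/4)  = 1/8 + 3/8 = 1/2
#   kappa = (3/4 - 1/2) / (1 - 1/2)  = 1/2
human = [0, 0, 1, 1]
judge = [0, 1, 1, 1]
print(cohen_kappa(human, judge))
```

The `None` return corresponds to the undefined-κ case that the report renders as a dash.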
484
+
485
+ ### CI scope
486
+
487
+ - All ~35 new tests run in `make test` in the existing GitHub Actions workflow. No new workflow files.
488
+ - `make lint` covers new modules (ruff + mypy).
489
+ - `make calibrate` and `make evaluate-judges` are **not** run in CI — they require API keys and burn budget. Manual invocation only.
490
+ - **GitHub Actions config** explicitly omits provider keys via an empty `env:` block, preventing the "PR worked in upstream because secret was injected; fails in contributor's fork because no secret" failure mode.
491
+
492
+ ### README cost-disclosure obligation (separate from spec)
493
+
494
+ `README.md` gets a "Targets that cost money" subheading with a four-column table (target, requires API key, approximate cost, what it produces). Not part of the spec body — a doc obligation owed to anyone running `make help` who shouldn't have to read the spec to know that `make calibrate` costs ~$2.
495
+
496
+ ## Calibration Methodology
497
+
498
+ ### Stratified sampling (30 items)
499
+
500
+ Stratification across the actual 52 golden items (FastAPI 27 + K8s 25):
501
+
502
+ FastAPI uses `category` as the stratification axis (the only typing in `tech_docs_golden.json`); K8s uses `question_type` (the CRAG 8-type taxonomy in `k8s_golden.json`). The 2 K8s items with `category: out_of_scope` are subsumed within their question_type stratum (most are within `false_premise`); they are not a separate K8s stratum.
503
+
504
+ | Stratum | Available | Sampled |
+ |---|---|---|
+ | FastAPI retrieval | 19 | 5 |
+ | FastAPI calculation | 3 | 1 |
+ | FastAPI out-of-scope | 5 | 2 |
+ | K8s simple | 6 | 4 |
+ | K8s simple_w_condition | 4 | 3 |
+ | K8s comparison | 4 | 3 |
+ | K8s multi_hop | 6 | 4 |
+ | K8s false_premise | 4 | 3 |
+ | K8s set | 1 | 1 |
+ | **Subtotal stratified** | **52** | **26** |
+ | Spare slots (filled from highest-variance R@5 strata) | n/a | 4 |
+ | **Total** | n/a | **30** |
518
+
519
+ The K8s `time_sensitive=True` flag is an overlay attribute, not an exclusive stratum — 2 K8s items carry the flag and are sampled incidentally based on the question_type they belong to. The flag does not constrain sampling.
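The per-stratum quotas in the table can be drawn reproducibly with a seeded sampler. The sketch that follows is illustrative only: the `corpus` field, the stratum key names, and the function names are hypothetical, and the 4 spare slots (filled from the highest-variance R@5 strata) are deliberately left out because the table does not specify how they are chosen.

```python
import random
from collections import defaultdict
from typing import Dict, List

# Quotas copied from the "Sampled" column; key names are illustrative.
QUOTAS: Dict[str, int] = {
    "fastapi:retrieval": 5, "fastapi:calculation": 1, "fastapi:out_of_scope": 2,
    "k8s:simple": 4, "k8s:simple_w_condition": 3, "k8s:comparison": 3,
    "k8s:multi_hop": 4, "k8s:false_premise": 3, "k8s:set": 1,
}


def stratum_of(item: dict) -> str:
    """FastAPI stratifies on `category`, K8s on `question_type`."""
    if item["corpus"] == "fastapi":  # `corpus` is a hypothetical field
        return f"fastapi:{item['category']}"
    return f"k8s:{item['question_type']}"


def stratified_sample(items: List[dict], quotas: Dict[str, int], seed: int) -> List[dict]:
    rng = random.Random(seed)
    pools = defaultdict(list)
    for item in items:
        pools[stratum_of(item)].append(item)
    picked: List[dict] = []
    for stratum in sorted(quotas):  # fixed order keeps the draw reproducible
        pool = sorted(pools[stratum], key=lambda it: it["id"])
        if len(pool) < quotas[stratum]:
            raise ValueError(f"{stratum}: need {quotas[stratum]}, have {len(pool)}")
        picked.extend(rng.sample(pool, quotas[stratum]))
    return picked


# Synthetic items matching the "Available" column, for demonstration only.
AVAILABLE = {"fastapi:retrieval": 19, "fastapi:calculation": 3, "fastapi:out_of_scope": 5,
             "k8s:simple": 6, "k8s:simple_w_condition": 4, "k8s:comparison": 4,
             "k8s:multi_hop": 6, "k8s:false_premise": 4, "k8s:set": 1}
items = []
for stratum, count in AVAILABLE.items():
    corpus, kind = stratum.split(":")
    field = "category" if corpus == "fastapi" else "question_type"
    items += [{"id": f"{stratum}-{i:02d}", "corpus": corpus, field: kind} for i in range(count)]

print(len(stratified_sample(items, QUOTAS, seed=7)))
```

Sorting both the strata and each pool before sampling means the result depends only on the seed and the item IDs, not on file order.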
520
+
521
+ **OOS items in calibration.** The 2 FastAPI items with `category: out_of_scope` (and however many of the sampled K8s false_premise items also carry `category: out_of_scope` — at most 2, since K8s has 2 OOS items total) follow the production harness gate: L2 judges are **skipped** for items where `category == "out_of_scope"` (the existing gate at `harness.py:153`). OOS items are still in the calibration set so that L1's `grounded_refusal` is exercised on the same items that produced labels. The κ-eligible item count per dimension is therefore at most 28 (30 minus the 2 FastAPI OOS) and possibly 26 (if both K8s OOS items get sampled into the K8s false_premise stratum); the writeup's κ table reports the actual N per row. This is the right cut because OOS handling is L1's job (deterministic refusal check) — judging "groundedness of a refusal" is methodologically incoherent (nothing to ground against).
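The κ-eligible count described above is just the calibration set minus the out-of-scope items. A small sketch of that filter, with a hypothetical helper name and synthetic items, reproduces the 28 and 26 figures:

```python
from typing import List


def kappa_eligible(items: List[dict]) -> List[dict]:
    """Drop items the L2 judges skip (category == "out_of_scope")."""
    return [it for it in items if it.get("category") != "out_of_scope"]


def make_set(n_total: int, n_oos: int) -> List[dict]:
    """Synthetic calibration set: the first n_oos items are out-of-scope."""
    return [
        {"id": f"item-{i:02d}", "category": "out_of_scope" if i < n_oos else "retrieval"}
        for i in range(n_total)
    ]


print(len(kappa_eligible(make_set(30, 2))))  # only the 2 FastAPI OOS items
print(len(kappa_eligible(make_set(30, 4))))  # both K8s OOS items also sampled
```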
522
+
523
+ IDs locked in `agent_bench/evaluation/datasets/calibration_v1.json` with `version: "v1"` field and `system_config_git_sha: <commit>` (the git SHA of the commit producing `system_outputs_v1.json` — name carries the limitation; v1.1 may add `system_config_resolved_hash` for stricter reproducibility).
524
+
525
+ ### FastAPI snippet authoring (calibration set only)
526
+
527
+ The 8 FastAPI items in the calibration set get hand-snippeted before labeling begins. Snippets are **verbatim spans** from `data/tech_docs/`, not paraphrases — same convention as the existing K8s `source_snippets`. **Scope discipline:** only the 8 calibration items, not the full 27-item FastAPI golden. The remaining 19 FastAPI items can be backfilled in v1.1.
528
+
529
+ If a verbatim span supporting the gold answer cannot be found, the gold answer is itself underspecified and the item is removed from the calibration set (replaced from the spare-slot stratum).
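The verbatim-span rule lends itself to a mechanical check before labeling starts. This is a minimal sketch assuming exact substring matching with no whitespace normalization; the function name and the in-memory `docs` mapping are hypothetical stand-ins for reading files under `data/tech_docs/`.

```python
from typing import Dict, Optional


def find_verbatim_source(snippet: str, docs: Dict[str, str]) -> Optional[str]:
    """Return the first document path containing `snippet` exactly, else None."""
    if not snippet.strip():
        return None  # an empty snippet supports nothing
    for path in sorted(docs):  # sorted for a deterministic answer
        if snippet in docs[path]:
            return path
    return None


# Toy corpus for demonstration; real paths and text would come from disk.
docs = {
    "dependencies.md": "Pass use_cache=False to disable the per-request cache.",
    "errors.md": "Raise HTTPException with a status_code and detail.",
}
print(find_verbatim_source("use_cache=False", docs))
print(find_verbatim_source("set use_cache to false", docs))  # paraphrase
```

A `None` result is the signal described above: either the snippet is a paraphrase, or the gold answer is underspecified and the item should be replaced.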
530
+
531
+ This step slots into Day 1 between sampling and labeling and adds ~30 min of work, so the Day 1 budget shifts from 8h to 8.5h.
532
+
533
+ ### Hand-labeling rules
534
+
535
+ - Score by the rubric, not by intuition. If the rubric and intuition disagree, fix the rubric *after* the labeling pass — do not change the labels mid-pass.
536
+ - Genuine uncertainty → `abstained: true` with note. Abstains are signal.
537
+ - Track time per item; >2 minutes → rubric ambiguity, note it.
538
+ - **No AI assistance on label values.** AI may help with the labeling notebook, JSONL formatting, schema validation. Label values are hand-authored.
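Schema validation is one of the places where tooling is allowed to help. The sketch that follows validates a single label record against the field names used in the labels JSONL in this commit. The 0 to 2 score range and the rule that an abstain must carry a note are assumptions read off the labeling rules and the committed labels, not a published schema.

```python
import json

REQUIRED = {"item_id", "dimension", "score", "abstained", "notes",
            "label_timestamp", "system_output_hash"}
DIMENSIONS = {"groundedness", "relevance", "completeness"}
ALLOWED_SCORES = {0, 1, 2}  # assumed range, inferred from the committed labels


def validate_label(line: str) -> dict:
    """Parse one JSONL line and raise ValueError if it breaks the assumed schema."""
    rec = json.loads(line)
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if rec["dimension"] not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {rec['dimension']!r}")
    if not isinstance(rec["abstained"], bool):
        raise ValueError("abstained must be a boolean")
    if rec["abstained"]:
        if not str(rec["notes"]).strip():
            raise ValueError("an abstain needs a note explaining why")
    else:
        score = rec["score"]
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(score, bool) or score not in ALLOWED_SCORES:
            raise ValueError(f"score must be one of {sorted(ALLOWED_SCORES)}")
    return rec


sample = json.dumps({
    "item_id": "q021", "dimension": "relevance", "score": 2, "abstained": False,
    "notes": "directly answers the minutes conversion",
    "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "0" * 64,
})
print(validate_label(sample)["item_id"])
```

The validator only checks shape; it never proposes or alters a label value, which keeps it on the permitted side of the no-AI-assistance rule.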
539
+
540
+ ### Opus stress-test (rubric ambiguity assist)
541
+
542
+ After hand-labeling, Claude Opus labels the same 30 items × 3 dimensions blind to the human labels. Disagreements are flagged as `rubric_ambiguous` for v1.1 rubric revision. **Labels are not changed.** The Opus output is a rubric-quality signal, not a ground-truth substitute. ~20 minutes of work; methodological texture for the writeup's calibration section.
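The comparison step can be sketched as a pure function that reads both label sets and emits flags without touching either. The dict-keyed-by-`(item_id, dimension)` shape and the function name are assumptions for illustration.

```python
from typing import Dict, List, Tuple

LabelKey = Tuple[str, str]  # (item_id, dimension)


def flag_rubric_ambiguous(
    human: Dict[LabelKey, object], opus: Dict[LabelKey, object]
) -> List[dict]:
    """List disagreements between human and Opus labels; inputs are not modified."""
    flagged = []
    for key in sorted(human):
        if key in opus and human[key] != opus[key]:
            flagged.append({
                "item_id": key[0],
                "dimension": key[1],
                "human": human[key],
                "opus": opus[key],
                "flag": "rubric_ambiguous",
            })
    return flagged


# Invented labels, for demonstration only.
human = {("q006", "groundedness"): 0, ("q006", "relevance"): 2}
opus = {("q006", "groundedness"): 1, ("q006", "relevance"): 2}
for row in flag_rubric_ambiguous(human, opus):
    print(row["item_id"], row["dimension"], row["flag"])
```

Because the function returns a new list and never writes back into `human`, the "labels are not changed" rule holds by construction.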
543
+
544
+ ## Implementation Sequencing Notes
545
+
546
+ ### Rubric authoring order
547
+
548
+ Write the **groundedness rubric first**, alone. Dry-fit it against 3–4 calibration items to test operationalizability before authoring the other two. *Then* write relevance and completeness using whatever pattern worked for groundedness. This converts rubric authoring from "three parallel risky tasks" into "one risky task plus two near-mechanical replications," compressing realistic time variance and reducing spillover risk. The dry-fit step is what makes the tactic load-bearing: if groundedness turns out to be ill-shaped, you know after one rubric, not after three.
549
+
550
+ ### Contingency cuts (priority order)
551
+
552
+ If scope pressure forces cuts:
553
+
554
+ 1. Drop the citation deterministic-vs-LLM head-to-head section of the writeup (this section was already a stretch goal).
555
+ 2. Drop the per-judge individual κ table — keep only the variance ablation.
556
+ 3. Reduce the variance ablation to 4 rows (baseline → CoT → rubric+abstain → 2-judge jury), skipping rubric-permute.
557
+ 4. Reduce calibration set to 20 items if labeling has slipped — cite literature ceiling more heavily.
558
+
559
+ **Do not cut:** the writeup itself, the κ numbers, the rubric files, the closing position-statement paragraph (when NOT to use LLM-judge). Those are non-negotiable.
560
+
561
+ ## Acceptance Gates
562
+
563
+ Two gates with different scopes. The code PR is reviewable and mergeable independently of the writeup; coupling them creates an artificial blocker.
564
+
565
+ ### PR-open gate (required to merge `feat/judge-layer-v1`)
566
+
567
+ - All ~35 new tests pass; full `make test` suite green; `make lint` clean.
568
+ - `make calibrate --strict` runs end-to-end from a clean checkout (with API keys) and produces `docs/_generated/kappa_table.md`.
569
+ - `agent_bench/evaluation/metrics.py` no longer contains `answer_faithfulness`, `answer_correctness`, `_judge_call`, `_FAITHFULNESS_PROMPT`, or `_CORRECTNESS_PROMPT`.
570
+ - `agent_bench/evaluation/harness.py` no longer imports the deleted functions; new judges populate `EvalResult.judge_scores`.
571
+ - `evaluation.judge_provider` YAML field still functions (regression test).
572
+ - DECISIONS.md has the supersession entry referencing file paths explicitly.
573
+ - `docs/DESIGN.md` §"LLM-judge metrics" is rewritten to point at this design doc and `judge-design.md`.
574
+ - `measurements/README.md` has the new row.
575
+ - `README.md` has the "Targets that cost money" subheading.
576
+ - `pyproject.toml` excludes `scripts/_dev/*` from ruff/mypy if not already excluded.
577
+ - GitHub Actions workflow has an explicit empty `env:` block on the test job (verified to be documentation of existing behavior, not a behavior change — current workflow has no `env:` block and tests already run without provider keys via MockProvider).
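The "no longer contains" gate items can be checked mechanically. A minimal sketch, assuming a whole-word text search is sufficient (the helper name is hypothetical, and a comment that merely mentions one of these names would also be reported):

```python
import re
from typing import List

# Names that the PR-open gate says must be gone from metrics.py.
FORBIDDEN = (
    "answer_faithfulness",
    "answer_correctness",
    "_judge_call",
    "_FAITHFULNESS_PROMPT",
    "_CORRECTNESS_PROMPT",
)


def leftover_symbols(source: str) -> List[str]:
    """Return the forbidden names that still appear as whole words in `source`."""
    return [
        name for name in FORBIDDEN
        if re.search(rf"\b{re.escape(name)}\b", source)
    ]


# Typical use (path taken from the gate list):
#   from pathlib import Path
#   leftover_symbols(Path("agent_bench/evaluation/metrics.py").read_text())
print(leftover_symbols("def answer_faithfulness(x): return _judge_call(x)"))
print(leftover_symbols("def recall_at_k(hits, k): return hits / k"))
```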
578
+
579
+ ### v1-completion gate (lags PR merge by 1–2 days)
580
+
581
+ The writeup is interview material, not a PR-merge dependency. It is produced from the merged PR's calibration runs and is committed separately.
582
+
583
+ - `judge-design.md` (the writeup, separate file at `docs/judge-design.md`) is drafted with the κ ablation table copy-pasted in from `docs/_generated/kappa_table.md`.
584
+ - DECISIONS supersession entry's file-path references resolve (the calibration-labels JSONL and the relevant `results/calibration_v1_judge_*.json` files exist on `main` post-merge).
585
+
586
+ ## Out of Scope (v1.1+)
587
+
588
+ - 3rd judge (Mistral self-hosted via Modal) and quorum=2 default for the 3-judge jury.
589
+ - Multi-seed self-consistency (T=0 ensemble) on top of rubric permutation.
590
+ - DSPy / GEPA / MIPROv2 prompt optimization for rubric refinement.
591
+ - Length-bias study, bypass tests, full pass^k sweep.
592
+ - Langfuse self-host integration (judge call traces, cost dashboards).
593
+ - Dual-pass intra-rater calibration (4–6 day calendar gap; replaces literature ceiling with measured intra-rater κ in the writeup).
594
+ - Synthetic-anchor calibration set scaling (frontier-model-as-anchor on 200 items).
595
+ - AC2 sympy-derived parity tests (sympy as dev-only dep; intermediate steps printed for audit).
596
+ - Backfill `source_snippets` for the remaining 19 FastAPI golden items (only the 8 calibration items get snippets in v1).
597
+ - `system_config_resolved_hash` (canonical serialization of resolved config) added alongside `system_config_git_sha` for stricter reproducibility across noise commits.
598
+ - Citation faithfulness default-on (currently opt-in v1; `judge_dimensions` default extends to include it in v1.1).
599
+
600
+ ## Risks
601
+
602
+ | Risk | Mitigation |
+ |---|---|
+ | Day 1 rubric authoring overflows 2.5h budget | The rubric-authoring sequencing tactic (Implementation Sequencing Notes) compresses variance; if all three rubrics need full 2.5h each, fall back to the Contingency cuts subsection |
+ | Bootstrap CI half-width >0.15 at N=30 (κ values not defensibly distinct between rows) | Note in writeup; reduces strength of comparative claims but doesn't invalidate the table |
+ | Jury κ worse than the better individual judge (kappa-weighting wrong, or worse judge drags mean) | Sanity-check before final table; possible switch to trimmed mean; sidecar JSONL preserves per-judge data either way |
+ | Schema parse failures spike >20% on one dimension (rubric-prompt mismatch) | Abstain-rate flag surfaces in the report; fix prompt or rubric, rerun affected row only (config-file-per-row makes this cheap) |
+ | Hand-labeling time exceeds 2h budget | Reduce to 20-item subset (contingency cut #4); cite literature ceiling more heavily in writeup |
+ | Branch state at start (in-flight `docs/readme-test-count` README diff) | Land that 4-line PR first (~5 min, README test-count only; the previously-pending Option A DECISIONS entries and the warmup-penalty addendum already landed via commit `6409a40` on 2026-04-22, so they are not on the docs-PR critical path); branch `feat/judge-layer-v1` off updated main |
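The bootstrap-CI risk in the table can be made concrete with a seeded percentile bootstrap. This sketch assumes unweighted Cohen's κ, a percentile interval, and hypothetical function names; the 0.15 half-width threshold comes from the table, while the resample count and the example labels are invented for illustration.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> Optional[float]:
    """Unweighted Cohen's kappa; None when chance agreement equals 1."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return None if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def bootstrap_kappa_ci(
    a: Sequence[int], b: Sequence[int],
    n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for kappa; the same seed gives the same interval."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    rng = random.Random(seed)
    n = len(a)
    stats: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        k = cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k is not None:  # skip resamples where kappa is undefined
            stats.append(k)
    if not stats:
        raise ValueError("kappa was undefined in every resample")
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi


def half_width(ci: Tuple[float, float]) -> float:
    return (ci[1] - ci[0]) / 2


# Invented labels for 30 items on a 0 to 2 scale.
human = [0, 1, 2] * 10
judge = [0, 1, 2, 0, 1, 1] * 5
ci = bootstrap_kappa_ci(human, judge, seed=42)
print(ci, half_width(ci) > 0.15)
```

Resampling item indices, rather than the two label lists independently, preserves the pairing between human and judge labels, which is what κ measures.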
610
+
611
+ ---
612
+
613
+ **End of design document.** Implementation plan to follow in `docs/plans/2026-05-04-judge-layer-v1-implementation.md` (produced via the `writing-plans` skill).
docs/plans/2026-05-04-judge-layer-v1-implementation.md ADDED
The diff for this file is too large to render. See raw diff
 
measurements/2026-05-04-judge-calibration-labels.jsonl ADDED
@@ -0,0 +1,90 @@
1
+ {"item_id": "q021", "dimension": "groundedness", "score": 1, "abstained": false, "notes": "600 seconds and preflight caching are supported; conversion is arithmetic", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
2
+ {"item_id": "q021", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers the minutes conversion", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
3
+ {"item_id": "q021", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers 600/60 = 10 minutes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de"}
4
+ {"item_id": "q010", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer makes unsupported GraphQL and library claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
5
+ {"item_id": "q010", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly addresses whether native GraphQL schema generation exists", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
6
+ {"item_id": "q010", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "reference answer is empty/missing for completeness", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930"}
7
+ {"item_id": "q027", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer makes unsupported load-balancing claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
8
+ {"item_id": "q027", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly addresses FastAPI load balancing", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
9
+ {"item_id": "q027", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "reference answer is empty/missing for completeness", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e"}
10
+ {"item_id": "q006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "v1.0 -> v1.1 flip: claim 'particularly useful for expensive operations like database connections' adds a use-case argument the snippet does not make. Snippet's get_db is an identifier in the example, not a use-case claim. Other claims entailed; this one is general LLM knowledge.", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
11
+ {"item_id": "q006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers caching behavior and disabling mechanism", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
12
+ {"item_id": "q006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers once-per-request cache and use_cache=False", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118"}
13
+ {"item_id": "q011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported claims about other endpoints and customization beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
14
+ {"item_id": "q011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly gives the default Swagger UI endpoint", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
15
+ {"item_id": "q011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers /docs and interactive documentation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef"}
16
+ {"item_id": "q012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported headers, response format, inheritance, and custom-handler claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
17
+ {"item_id": "q012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains raising HTTPException in a route", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
18
+ {"item_id": "q012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers HTTPException with status_code and detail", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b"}
19
+ {"item_id": "q023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "large code sample adds many unsupported implementation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
20
+ {"item_id": "q023", "dimension": "relevance", "score": 1, "abstained": false, "notes": "on-topic but truncated before testing and dependency overrides", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
21
+ {"item_id": "q023", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers error handling and CORS but misses TestClient/dependency_overrides", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde"}
22
+ {"item_id": "q025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "only cursor pagination is supported; response_model/background task claims are unsupported by snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
23
+ {"item_id": "q025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses pagination, validation, and analytics logging", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
24
+ {"item_id": "q025", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers cursor navigation, response_model, and BackgroundTasks", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715"}
25
+ {"item_id": "k8s_002", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported naming, storage, ordering, examples, and YAML details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
26
+ {"item_id": "k8s_002", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares StatefulSet and Deployment use cases", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
27
+ {"item_id": "k8s_002", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers stateless Deployment vs sticky StatefulSet identity and when to use each", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f"}
28
+ {"item_id": "k8s_014", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported port range, cloud-provider, production, and allocation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
29
+ {"item_id": "k8s_014", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares NodePort and LoadBalancer Services", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
30
+ {"item_id": "k8s_014", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers node IP/static port versus external load balancer and relationship", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8"}
31
+ {"item_id": "k8s_016", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported scheduler, nodeAffinity, and nodeName implementation details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
32
+ {"item_id": "k8s_016", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers Deployment vs DaemonSet scheduling difference", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
33
+ {"item_id": "k8s_016", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers fixed replicas versus one copy on all or selected nodes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2"}
34
+ {"item_id": "k8s_004", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer gives unsupported Jaeger configuration guidance", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
35
+ {"item_id": "k8s_004", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses Jaeger sidecar injection setup", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
36
+ {"item_id": "k8s_004", "dimension": "completeness", "score": 1, "abstained": false, "notes": "notes corpus lacks Jaeger docs but fails to refuse as required", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437"}
37
+ {"item_id": "k8s_022", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported CEL and admission-controller deny alternatives", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
38
+ {"item_id": "k8s_022", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers RBAC deny-rule question", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
39
+ {"item_id": "k8s_022", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers no deny rules and not granting delete permission", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4"}
40
+ {"item_id": "k8s_024", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "empty source snippets; answer gives unsupported Envoy ADS configuration", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
41
+ {"item_id": "k8s_024", "dimension": "relevance", "score": 2, "abstained": false, "notes": "addresses Envoy ADS sidecar configuration", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
42
+ {"item_id": "k8s_024", "dimension": "completeness", "score": 1, "abstained": false, "notes": "notes corpus lacks Envoy ADS docs but fails to refuse as required", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f"}
43
+ {"item_id": "k8s_003", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported ingress-controller, EndpointSlice, kube-proxy, and DNAT details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
44
+ {"item_id": "k8s_003", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains external HTTP traffic flow", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
45
+ {"item_id": "k8s_003", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Ingress to Service to Pod routing and selector/load-balancing role", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5"}
46
+ {"item_id": "k8s_017", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "snippets do not support sequential order, retry policy, or lifecycle details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
47
+ {"item_id": "k8s_017", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains init-container startup order", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
48
+ {"item_id": "k8s_017", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers sequential init containers, completion before app containers, and failure retry", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285"}
49
+ {"item_id": "k8s_018", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported autoscaling/v2, memory/custom metric, and v1 comparison details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
50
+ {"item_id": "k8s_018", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly identifies the HPA API version to use", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
51
+ {"item_id": "k8s_018", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers autoscaling/v2 and why it supports memory/custom metrics", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7"}
52
+ {"item_id": "k8s_019", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "source only defines ConfigMap; mechanisms and update behavior are unsupported", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
53
+ {"item_id": "k8s_019", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains how ConfigMap values reach Pods", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
54
+ {"item_id": "k8s_019", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers env vars, volume mounts, and update behavior", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7"}
55
+ {"item_id": "k8s_025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported NodePort, ExternalIPs, Ingress, and Gateway claims beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
56
+ {"item_id": "k8s_025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly lists Service exposure options", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
57
+ {"item_id": "k8s_025", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers NodePort/LoadBalancer and ClusterIP/Ingress but misses ExternalName", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b"}
58
+ {"item_id": "k8s_001", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported sticky identity, rescheduling, headless service, and policy details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
59
+ {"item_id": "k8s_001", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers StatefulSet Pod identity guarantees", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
60
+ {"item_id": "k8s_001", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers ordinal, network identity, stable storage, and sticky identity", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217"}
61
+ {"item_id": "k8s_006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported consumption mechanisms and Secret guidance beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
62
+ {"item_id": "k8s_006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains ConfigMap purpose and data type", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
63
+ {"item_id": "k8s_006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers non-confidential key-value config and not storing secrets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd"}
64
+ {"item_id": "k8s_007", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported completion modes and configuration details beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
65
+ {"item_id": "k8s_007", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains what Jobs do and completion criteria", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
66
+ {"item_id": "k8s_007", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Pod creation, successful completions, retries, and completion state", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916"}
67
+ {"item_id": "k8s_009", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "source snippet names the four kinds but not the detailed role/binding explanations", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
68
+ {"item_id": "k8s_009", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly names and explains the four RBAC object kinds", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
69
+ {"item_id": "k8s_009", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers Role, ClusterRole, RoleBinding, and ClusterRoleBinding with scope/use", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a"}
70
+ {"item_id": "k8s_005", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported warn-mode and workload-resource behavior beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
71
+ {"item_id": "k8s_005", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares enforce and warn modes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
72
+ {"item_id": "k8s_005", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers stable PSA, enforce rejection, warn allowance, and combined modes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6"}
73
+ {"item_id": "k8s_012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported default medium, memory accounting, size, and performance claims", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
74
+ {"item_id": "k8s_012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares default emptyDir and Memory medium", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
75
+ {"item_id": "k8s_012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers node-backed default, tmpfs Memory, speed, and memory-limit accounting", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5"}
76
+ {"item_id": "k8s_013", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported kill, Service traffic, grace-period, and best-practice details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
77
+ {"item_id": "k8s_013", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares failed liveness and readiness probes", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
78
+ {"item_id": "k8s_013", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers liveness restart and readiness removal from traffic without restart", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566"}
79
+ {"item_id": "k8s_015", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported idempotency and deadline details beyond snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
80
+ {"item_id": "k8s_015", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares Job and CronJob usage", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
81
+ {"item_id": "k8s_015", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers one-off task versus recurring cron-scheduled Jobs", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6"}
82
+ {"item_id": "k8s_023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "single snippet does not support bypass, host-network, or trusted-workload details", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
83
+ {"item_id": "k8s_023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly answers what privileged profile enforces", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
84
+ {"item_id": "k8s_023", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers no restrictions, unrestricted policy, and bypassing isolation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544"}
85
+ {"item_id": "k8s_020", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported ingress/egress isolation rules beyond terse snippets", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
86
+ {"item_id": "k8s_020", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly explains non-isolated baseline and NetworkPolicy isolation", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
87
+ {"item_id": "k8s_020", "dimension": "completeness", "score": 1, "abstained": false, "notes": "covers baseline and allowed traffic but omits CNI enforcement point", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b"}
88
+ {"item_id": "k8s_011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "adds unsupported update semantics and traffic details beyond snippet", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}
89
+ {"item_id": "k8s_011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "directly compares Secret volume and environment-variable consumption", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}
90
+ {"item_id": "k8s_011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "covers both modes and update behavior difference", "label_timestamp": "2026-05-05T00:00:00Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4"}
measurements/2026-05-05-judge-rubric-opus-stress.jsonl ADDED
@@ -0,0 +1,90 @@
1
+ {"item_id": "k8s_001", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that StatefulSet Pods have a unique identity consisting of an ordinal, stable network identity, and stable storage. The answer makes many claims not supported by this single snippet: 'sticky identity' terminology, comparison to Deployments, podManagementPolicy field with OrderedReady and Parallel options, requirement of a Headless Service, and use cases like databases. None of these specific claims are entailed by the snippet provided.", "label_timestamp": "2026-05-05T14:29:41Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021464, "_latency_ms": 3804.864978999831}
2
+ {"item_id": "k8s_001", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by describing the identity guarantees StatefulSets provide to Pods, including sticky identity, persistent identifiers, uniqueness/ordering, and network identity via headless service. It covers the core ask comprehensively.", "label_timestamp": "2026-05-05T14:34:20Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020344, "_latency_ms": 3798.0725439847447}
3
+ {"item_id": "k8s_001", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has key points: (1) ordinal index, (2) stable network identity, (3) stable persistent storage, (4) identity sticks across rescheduling, (5) contrast with Deployment's interchangeable Pods. The answer covers sticky identity across rescheduling, network identity, and the contrast with Deployments. However, it does not explicitly mention ordinal index (only vaguely mentions 'ordering') and does not mention stable persistent storage at all. So partial coverage.", "label_timestamp": "2026-05-05T14:38:40Z", "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0022176000000000006, "_latency_ms": 6259.8209959687665}
4
+ {"item_id": "k8s_002", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains many claims that go far beyond the gold snippets. The snippets only support: (1) Deployment manages Pods for stateless workloads, and (2) StatefulSet maintains sticky identity per Pod. The answer adds numerous unsupported claims: ordinal-based naming (web-0, web-1), volumeClaimTemplates, headless Service requirement, ordered sequential operations, specific examples like MySQL/Cassandra/Kafka, YAML configurations, persistent volumes per Pod, etc. None of these specifics are entailed by the two brief snippets provided.", "label_timestamp": "2026-05-05T14:29:51Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0031912, "_latency_ms": 5066.343065002002}
5
+ {"item_id": "k8s_002", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it compares StatefulSet and Deployment across multiple dimensions (pod identity, naming, network, storage, ordering) and explicitly explains when to prefer each with examples. This matches Example F's on-target comparison pattern.", "label_timestamp": "2026-05-05T14:34:29Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0027416000000000003, "_latency_ms": 4008.2699899794534}
6
+ {"item_id": "k8s_002", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has two main parts: (1) Deployment manages stateless, interchangeable replicas with declarative updates, and (2) StatefulSet provides sticky identity with stable network IDs, stable persistent storage, and ordered deployment/scaling. The answer covers both: Deployment is described as managing stateless apps with interchangeable Pods, and StatefulSet is described with stable hostnames, persistent storage per Pod, ordered operations, and unique identity.", "label_timestamp": "2026-05-05T14:38:51Z", "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0031176000000000003, "_latency_ms": 5786.756471032277}
7
+ {"item_id": "k8s_003", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that Ingress exposes HTTP/HTTPS routes from outside the cluster, and that Services target Pods usually via a selector. The answer makes many claims not supported by the snippets: kube-proxy details, EndpointSlices, iptables, DNAT, ClusterIP virtual IP, load balancing mechanics, session affinity, etc. None of these are in the gold snippets.", "label_timestamp": "2026-05-05T14:30:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002524, "_latency_ms": 3977.3940969607793}
8
+ {"item_id": "k8s_003", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the full traffic path from Ingress through Service to Pod, covering all three layers requested. It explains the Ingress controller's role at the edge, the Service layer with ClusterIP and EndpointSlices, kube-proxy's routing function, and final Pod delivery via DNAT.", "label_timestamp": "2026-05-05T14:34:38Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002372, "_latency_ms": 3949.3474710034207}
9
+ {"item_id": "k8s_003", "dimension": "completeness", "score": 2, "abstained": false, "notes": "Reference points: (1) Ingress exposes HTTP/HTTPS routes from outside and maps to backend Services based on rules - covered. (2) Service is abstraction defining logical set of endpoints (Pods) using selector - covered. (3) Load-balances traffic across Pods - covered. (4) Service delivers traffic to container port each Pod exposes - covered.", "label_timestamp": "2026-05-05T14:39:00Z", "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028088, "_latency_ms": 4416.859550983645}
10
+ {"item_id": "k8s_004", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold source snippets are empty, meaning there are no snippets to ground any claims against. The answer provides substantive technical claims about Jaeger sidecar injection (labels, annotations, ports, etc.) that cannot be verified against the (empty) snippets. The answer also acknowledges the docs don't contain Jaeger-specific info but then proceeds to make many specific claims anyway, so this is not a pure refusal/abstain.", "label_timestamp": "2026-05-05T14:30:08Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002128, "_latency_ms": 3805.78425899148}
11
+ {"item_id": "k8s_004", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses how to enable Jaeger sidecar injection in a Kubernetes Deployment, providing the relevant label/annotation methods, namespace-level injection, configuration options, and verification steps. This is on-target for the question's core ask.", "label_timestamp": "2026-05-05T14:34:46Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020648000000000003, "_latency_ms": 2984.175974968821}
12
+ {"item_id": "k8s_004", "dimension": "completeness", "score": 0, "abstained": false, "notes": "The reference answer's key point is that the agent should refuse to answer because Jaeger is not covered in the Kubernetes documentation corpus. The agent's answer does the opposite - it provides detailed instructions on Jaeger sidecar injection rather than refusing. While it briefly notes 'the specific search results don't contain Jaeger-specific documentation,' it then proceeds to give a full answer anyway, which contradicts the reference's point about refusing and citing zero sources.", "label_timestamp": "2026-05-05T14:39:11Z", "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0022992, "_latency_ms": 5866.323757043574}
13
+ {"item_id": "k8s_005", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only contain two minimal pieces of information: that policy violations cause pod rejection, and a feature state stable in v1.25. The answer makes many claims that are not supported by these snippets, including: warn mode triggers user-facing warnings, enforce mode not being applied to workload resources while warn is, the specific label syntax, and the configuration examples. Only the claim about enforce mode rejecting pods is supported by snippet [1].", "label_timestamp": "2026-05-05T14:30:18Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020616000000000002, "_latency_ms": 4882.2048010188155}
14
+ {"item_id": "k8s_005", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question about how Pod Security Admission behaves differently in enforce vs warn modes. It provides a clear comparison covering pod rejection behavior, user feedback, and applicability to workload resources, which are the key behavioral differences between these modes.", "label_timestamp": "2026-05-05T14:34:55Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020016, "_latency_ms": 3597.780067997519}
15
+ {"item_id": "k8s_005", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points: enforce mode rejects pods on violation, warn mode allows pods but triggers user-facing warning, and modes can be combined on the same namespace. Namespace-level application is also mentioned via labels. The stable since v1.25 detail is missing but is contextual rather than a core behavioral point about enforce vs warn.", "label_timestamp": "2026-05-05T14:39:21Z", "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002044, "_latency_ms": 4796.786619001068}
16
+ {"item_id": "k8s_006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only supports that a ConfigMap is an API object used to store non-confidential data in key-value pairs. The answer makes many additional claims not supported by the snippet: the three ways pods consume ConfigMaps (env vars, command-line args, volumes), the warning about not storing passwords/API keys/tokens/certificates, the recommendation to use Secrets, and the existence of `data` and `binaryData` fields. None of these are entailed by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:27Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020984000000000003, "_latency_ms": 3795.695087988861}
17
+ {"item_id": "k8s_006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly defines what a ConfigMap is (a Kubernetes API object for storing non-confidential data in key-value pairs) and explicitly addresses what kind of data should be stored in it (non-sensitive configuration data like settings, parameters, feature flags), while also clarifying what should not be stored. Both parts of the question are addressed on-target.", "label_timestamp": "2026-05-05T14:35:04Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002028, "_latency_ms": 4578.002344002016}
18
+ {"item_id": "k8s_006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) ConfigMap is an API object storing non-confidential data in key-value pairs, (2) used for application configuration that doesn't need to be secret, and (3) confidential data like passwords/tokens should be in Secrets instead.", "label_timestamp": "2026-05-05T14:39:30Z", "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0020288, "_latency_ms": 3426.676850998774}
19
+ {"item_id": "k8s_007", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that a Job creates Pods and retries until a specified number successfully terminate. The answer makes many additional claims not supported by the snippet: NonIndexed vs Indexed completion modes, .spec.completionMode, completion indices 0 to N-1, annotations/labels like batch.kubernetes.io/job-completion-index, JOB_COMPLETION_INDEX env variable, .spec.parallelism, .spec.backoffLimit, and the 'Completed' state transition. None of these are entailed by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:36Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0023552, "_latency_ms": 3761.0972960246727}
20
+ {"item_id": "k8s_007", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: what a Kubernetes Job does (runs finite/batch tasks to completion, creates Pods that run once) and how it decides the task is complete (based on completion mode, .spec.completions, NonIndexed vs Indexed). Both core asks are covered on-target.", "label_timestamp": "2026-05-05T14:35:13Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019904000000000002, "_latency_ms": 3753.890769032296}
21
+ {"item_id": "k8s_007", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has these key points: (1) Job creates one or more Pods, (2) retries execution until specified number successfully terminate, (3) tracks successful completions, (4) complete when specified number reached, (5) deleting a Job cleans up the Pods it created. The answer covers points 1 (creates one or more Pods), 2 (implied via backoffLimit retries), 3 and 4 (tracks completions and completes when .spec.completions reached). However, point 5 (deleting a Job cleans up the Pods) is not mentioned in the answer.", "label_timestamp": "2026-05-05T14:39:40Z", "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0024712, "_latency_ms": 5021.353780000936}
22
+ {"item_id": "k8s_009", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that the RBAC API declares four kinds: Role, ClusterRole, RoleBinding, and ClusterRoleBinding. The answer adds many descriptive claims about each (e.g., 'Defines a set of permissions within a specific namespace', 'cluster-scoped rather than namespace-scoped', 'Grants the permissions defined in a Role to a user or set of users', 'connects a Role to subjects', etc.) that are not supported by the single snippet provided.", "label_timestamp": "2026-05-05T14:30:45Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021512, "_latency_ms": 3805.1239320193417}
23
+ {"item_id": "k8s_009", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly lists all four RBAC object kinds (Role, ClusterRole, RoleBinding, ClusterRoleBinding) and explains what each does, fully addressing the question's core ask.", "label_timestamp": "2026-05-05T14:35:21Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001976, "_latency_ms": 2915.8728439942934}
24
+ {"item_id": "k8s_009", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: the four object kinds (Role, ClusterRole, RoleBinding, ClusterRoleBinding), that Roles/ClusterRoles contain permission rules while bindings grant them to users/groups/service accounts, and the namespace vs cluster scope distinction.", "label_timestamp": "2026-05-05T14:39:48Z", "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002052, "_latency_ms": 3570.147737977095}
25
+ {"item_id": "k8s_011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippet only states that Secrets can be mounted as data volumes or exposed (presumably as environment variables). It does not contain any of the detailed claims about automatic updates, subPath behavior, character restrictions, pod restart requirements, or eventual consistency. The answer makes many specific claims that are not supported by the single truncated snippet provided.", "label_timestamp": "2026-05-05T14:30:53Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019776, "_latency_ms": 3797.7397789945826}
26
+ {"item_id": "k8s_011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question by comparing the two methods of consuming Secrets in Pods. It covers update behavior, restart requirements, naming constraints, and subPath limitations, providing a clear comparison of both approaches.", "label_timestamp": "2026-05-05T14:35:29Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019824, "_latency_ms": 3311.2139879958704}
27
+ {"item_id": "k8s_011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) Secrets can be mounted as a data volume, (2) Secrets can be exposed as environment variables, (3) mounted volumes receive in-place updates when the Secret changes, and (4) environment variables are set at Pod start and don't update. The 'each key becomes a file' point is implicitly covered by 'Secrets are exposed as files in the mounted directory'.", "label_timestamp": "2026-05-05T14:39:57Z", "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023824000000000002, "_latency_ms": 4128.535017021932}
28
+ {"item_id": "k8s_012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that setting emptyDir.medium to 'Memory' causes Kubernetes to mount a tmpfs. The answer makes many additional claims not supported by the snippet: that default emptyDir is stored on node-backing media, that memory-backed files count against container memory limits, sizing behavior when no size is specified, cost/size comparisons, and other resource considerations. None of these are entailed by the single provided snippet.", "label_timestamp": "2026-05-05T14:31:02Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021024, "_latency_ms": 4015.2714860159904}
29
+ {"item_id": "k8s_012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core comparison between default emptyDir and Memory-medium emptyDir, covering storage location, performance, memory accounting, and size limits for both cases.", "label_timestamp": "2026-05-05T14:35:37Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017872, "_latency_ms": 2831.339330004994}
30
+ {"item_id": "k8s_012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) default emptyDir is stored on whatever medium backs the node (disk, SSD, or network storage), (2) setting medium to 'Memory' mounts a tmpfs/RAM-backed filesystem, and (3) tmpfs is fast but counts against the container's memory limit. The answer covers all three: it states the default storage location verbatim, describes Memory as tmpfs/RAM-backed, notes it's very fast, and explicitly states files count against the container's memory limit.", "label_timestamp": "2026-05-05T14:40:08Z", "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023808, "_latency_ms": 5550.627671997063}
31
+ {"item_id": "k8s_013", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that liveness probes are used to know when to restart a container, and readiness probes are used to know when a container is ready to accept traffic. The answer makes many additional claims not supported by the snippets: terminationGracePeriodSeconds behavior, deadlock examples, Pod Ready condition being set to false, kubelet continuing to run more probes periodically, best practice recommendations about shared endpoints and failureThreshold settings, etc. These claims may be true but are not entailed by the provided snippets.", "label_timestamp": "2026-05-05T14:31:12Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002268, "_latency_ms": 4528.785168018658}
+ {"item_id": "k8s_013", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question by explaining how the kubelet responds differently to failing liveness vs readiness probes. It clearly states liveness failures cause container restart while readiness failures remove the pod from service endpoints without killing the container.", "label_timestamp": "2026-05-05T14:35:46Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0019464, "_latency_ms": 4366.952612996101}
+ {"item_id": "k8s_013", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has key points: (1) liveness failure \u2192 kubelet restarts container, (2) purpose is to recover from wedged state like deadlock, (3) readiness failure \u2192 container not restarted, (4) Pod marked not-ready and removed from Service load balancers. The answer covers all of these: liveness kills/restarts container, mentions deadlocks as use case, readiness keeps container running, and pod won't receive traffic from Services.", "label_timestamp": "2026-05-05T14:40:18Z", "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023168000000000004, "_latency_ms": 4956.465922994539}
+ {"item_id": "k8s_014", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only support two basic claims: NodePort exposes the Service on each Node, and LoadBalancer exposes the Service externally using an external load balancer. The answer makes many additional claims not supported by the snippets, including the static port range (30000-32767), the access pattern <NodeIP>:<NodePort>, the cluster IP creation, the relationship that LoadBalancer is built on NodePort, the spec.allocateLoadBalancerNodePorts flag, and production/development suitability claims. None of these are entailed by the two minimal snippets provided.", "label_timestamp": "2026-05-05T14:31:21Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002252, "_latency_ms": 4352.603528008331}
+ {"item_id": "k8s_014", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core ask by comparing NodePort and LoadBalancer Service types, covering both sides with their characteristics, use cases, and relationship.", "label_timestamp": "2026-05-05T14:35:54Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017527999999999999, "_latency_ms": 2996.093010995537}
+ {"item_id": "k8s_014", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) NodePort exposes Service on each Node's IP at a static port - explicitly stated; (2) LoadBalancer exposes externally using an external load balancer - explicitly stated; (3) Kubernetes doesn't directly provide the load balancer, requires cloud provider integration - explicitly stated; (4) LoadBalancer typically implemented on top of NodePort - explicitly stated.", "label_timestamp": "2026-05-05T14:40:26Z", "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0023512, "_latency_ms": 3537.0634549763054}
+ {"item_id": "k8s_015", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains many claims not supported by the snippets. The snippets only state that Jobs are one-off tasks that run to completion and that CronJobs start Jobs on a repeating schedule. The answer adds unsupported claims about idempotency requirements, exactly-once semantics, two Jobs being created for a single schedule, startingDeadlineSeconds, and deadline handling behavior. None of these are in the snippets.", "label_timestamp": "2026-05-05T14:31:31Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021944, "_latency_ms": 4788.899898994714}
+ {"item_id": "k8s_015", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: how CronJobs differ from Jobs (execution, scheduling, use case) and when to use one over the other. It also adds relevant considerations about idempotency and deadlines.", "label_timestamp": "2026-05-05T14:36:02Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001912, "_latency_ms": 2846.5557509916835}
+ {"item_id": "k8s_015", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) Job runs a one-off task to completion, creating Pods and retrying until success; (2) CronJob creates Jobs on a repeating cron schedule for recurring actions like backups; (3) use Job for single batch run, CronJob for recurring schedule. The answer covers all three: Job runs once to completion for batch tasks, CronJob runs on cron schedule for recurring tasks like backups/reports, and explicitly states use Job for one-time and CronJob for recurring. The 'creates Pods and retries' detail is not explicitly mentioned but the run-to-completion concept is covered.", "label_timestamp": "2026-05-05T14:40:38Z", "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0026271999999999997, "_latency_ms": 6194.046579999849}
+ {"item_id": "k8s_016", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that a Deployment manages Pods for stateless workloads and that a DaemonSet ensures all/some Nodes run a copy of a Pod. The answer makes many additional claims not supported by the snippets: details about the Kubernetes scheduler, nodeAffinity rules added by the DaemonSet controller, the .spec.nodeName field being set, specific replica distribution examples, and common use cases like logging agents, monitoring daemons, web servers. None of these specifics are entailed by the two brief snippets provided.", "label_timestamp": "2026-05-05T14:31:51Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002292, "_latency_ms": 14661.54205496423}
+ {"item_id": "k8s_016", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core scheduling difference between Deployment and DaemonSet, explaining that Deployments run a specified number of replicas distributed across nodes via the scheduler, while DaemonSets ensure one Pod per eligible node. Both sides of the comparison are covered.", "label_timestamp": "2026-05-05T14:36:11Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0018928, "_latency_ms": 3361.026384984143}
+ {"item_id": "k8s_016", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has two key points: (1) Deployment schedules a fixed/configured number of replicas independent of node count, and (2) DaemonSet runs a Pod on each (matching) node, so replica count scales with nodes. The answer covers both: it states Deployment manages a specified number of replicas placed on any eligible nodes, and DaemonSet ensures all eligible nodes run one copy of a Pod. While it doesn't explicitly say 'as nodes are added DaemonSet Pods are added with them,' it conveys that the count is tied to eligible nodes.", "label_timestamp": "2026-05-05T14:40:50Z", "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025031999999999997, "_latency_ms": 7288.97923103068}
+ {"item_id": "k8s_017", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only state that init containers are like regular containers and run before the main application container. The answer makes many additional claims (sequential execution, retry behavior with restartPolicy, OnFailure handling, networking/storage readiness, one-time execution, exit code 0 requirement, etc.) that are not supported by the provided snippets.", "label_timestamp": "2026-05-05T14:32:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019376, "_latency_ms": 3801.133704953827}
+ {"item_id": "k8s_017", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: the order of init and application container execution, and the guarantees Kubernetes provides. It explains sequential init container execution, blocking of app containers, and includes related guarantees about retries and restart policy.", "label_timestamp": "2026-05-05T14:36:19Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001876, "_latency_ms": 3559.6700820024125}
+ {"item_id": "k8s_017", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has four key points: (1) init containers run one at a time in defined order, (2) each must complete before next starts, (3) regular containers start only after all init containers terminate successfully, (4) on failure, Pod restarts per restartPolicy and init sequence begins again, (5) suitable for one-time setup. The answer covers sequential ordering, completion requirement, app containers starting after init completion, and restartPolicy retry behavior. The 'one-time setup' purpose is implied at the end ('initialization tasks complete reliably before your application containers begin running').", "label_timestamp": "2026-05-05T14:41:00Z", "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025688, "_latency_ms": 5247.3236820078455}
+ {"item_id": "k8s_018", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippets only mention that HorizontalPodAutoscaler automatically updates a workload resource and reference a current stable version. They do not support specific claims about autoscaling/v2 being stable since v1.23, support for custom/memory/external metrics, multiple metrics evaluation, scale-up/scale-down behaviors, or that v1 only supports CPU. Nearly every claim in the answer is unsupported by the snippets.", "label_timestamp": "2026-05-05T14:32:09Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020632, "_latency_ms": 4089.8927800008096}
+ {"item_id": "k8s_018", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it specifies the API version (autoscaling/v2) and explains why (stable support for custom metrics, memory metrics, multiple metrics, advanced features). Both the 'which' and 'why' components are covered.", "label_timestamp": "2026-05-05T14:36:28Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020032, "_latency_ms": 4004.737361043226}
+ {"item_id": "k8s_018", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has these key points: (1) current stable HPA API is autoscaling/v2, (2) it adds memory and custom metrics beyond CPU-only autoscaling/v1, (3) new fields are preserved as annotations when working with autoscaling/v1, (4) use autoscaling/v2 for memory/custom metric scaling. The answer covers points 1, 2, and 4 clearly. Point 3 about annotation preservation in v1 is not mentioned. This is partial coverage.", "label_timestamp": "2026-05-05T14:41:09Z", "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.002352, "_latency_ms": 4204.742238041945}
+ {"item_id": "k8s_019", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only states that a ConfigMap is an API object for storing non-confidential key-value pairs. The answer makes many claims about four mechanisms (env vars individual/bulk, volume mounts, API access), update behaviors, subPath limitations, kubelet handling, etc. None of these claims are supported by the single provided snippet.", "label_timestamp": "2026-05-05T14:32:18Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002272, "_latency_ms": 4105.534160975367}
+ {"item_id": "k8s_019", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by listing and explaining the mechanisms Kubernetes provides to make ConfigMap values available to a Pod: individual env vars, bulk env vars, volume mounts, and direct API access. It includes examples and notes on update behavior, fully covering the core ask.", "label_timestamp": "2026-05-05T14:36:37Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0023304000000000003, "_latency_ms": 3380.863350990694}
+ {"item_id": "k8s_019", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) env variables expose keys, (2) volume mount makes keys into files, (3) volume-mounted data updates in place while env vars require pod restart. The answer covers all three: it describes env vars, volume mounts with keys as filenames, and explicitly notes env vars require restart while volume mounts support dynamic updates.", "label_timestamp": "2026-05-05T14:41:19Z", "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025688, "_latency_ms": 4229.396597947925}
+ {"item_id": "k8s_020", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold snippets only contain two fragments: 'non-isolated' and 'namespaceSelector'. The answer makes many detailed claims about ingress/egress defaults, policyTypes, podSelector, connections from the Pod's own node, etc. While 'non-isolated' supports the default claim, the vast majority of claims (policyTypes values, podSelector mechanism, node connection allowance, egress rule behavior) are not supported by the minimal snippets provided.", "label_timestamp": "2026-05-05T14:32:26Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020512, "_latency_ms": 3578.6442419630475}
+ {"item_id": "k8s_020", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: it explains that by default Pods are non-isolated (and thus not subject to NetworkPolicy filtering in the restrictive sense), and it explains how applying a NetworkPolicy that selects a Pod changes the baseline by making the Pod isolated for the specified direction(s), allowing only explicitly permitted traffic.", "label_timestamp": "2026-05-05T14:36:45Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020232, "_latency_ms": 3397.536567004863}
+ {"item_id": "k8s_020", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has four key points: (1) Pods are non-isolated by default, (2) a Pod becomes isolated when a NetworkPolicy selects it via podSelector, (3) only traffic allowed by the union of NetworkPolicies is permitted, (4) policies can target ingress/egress/both, and (5) the CNI plugin enforces the policy, not Kubernetes itself. The answer covers points 1, 2, 3, and 4 clearly. However, it does not mention that the CNI plugin is what enforces the policy.", "label_timestamp": "2026-05-05T14:41:28Z", "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0025376000000000005, "_latency_ms": 4806.188436981756}
+ {"item_id": "k8s_022", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet only supports the claim that RBAC is purely additive with no deny rules. The answer makes many additional claims about CEL-based AuthorizationPolicy (with specific API version and schema), ValidatingAdmissionPolicy syntax, Kubernetes 1.26+ version requirements, and specific YAML structures that are not supported by the single snippet provided.", "label_timestamp": "2026-05-05T14:32:35Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0023096, "_latency_ms": 3849.2560360464267}
+ {"item_id": "k8s_022", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the user's question about writing an RBAC deny rule for blocking pod deletion. It correctly explains that RBAC doesn't support explicit deny rules and provides alternative approaches including not granting delete permission, admission controllers, etc. This directly engages with the core ask.", "label_timestamp": "2026-05-05T14:36:53Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0022656000000000004, "_latency_ms": 3391.0853110137396}
+ {"item_id": "k8s_022", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference's key points are: (1) RBAC does not support deny rules, (2) rules are purely additive, (3) to block an action, simply don't grant the permission. The answer covers all of these: it states RBAC does not have explicit deny rules, is allow-based only, and recommends not granting the delete verb as the primary approach.", "label_timestamp": "2026-05-05T14:41:37Z", "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0024144, "_latency_ms": 3988.3979110163637}
+ {"item_id": "k8s_023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The only snippet provided is 'Unrestricted policy', which is a very minimal piece of information. The answer makes multiple specific claims: that the privileged profile is 'entirely unrestricted', that pods can 'bypass typical container isolation mechanisms', that pods can access the node's host network, and that it's aimed at system/infrastructure-level workloads managed by privileged trusted users. While the snippet's phrase 'Unrestricted policy' loosely supports the 'unrestricted' claim, the specific claims about bypassing container isolation, host network access, and the intended audience (system/infrastructure workloads, trusted users) are not entailed by the snippet.", "label_timestamp": "2026-05-05T14:32:45Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.002084, "_latency_ms": 4419.9781229835935}
+ {"item_id": "k8s_023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The question asks which container-isolation restrictions the 'privileged' profile enforces. The answer directly addresses this by stating that the privileged profile enforces no restrictions and is unrestricted by design, which is a direct on-topic answer to the core ask.", "label_timestamp": "2026-05-05T14:37:01Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017008000000000001, "_latency_ms": 2984.6201190375723}
+ {"item_id": "k8s_023", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all key points from the reference: (1) privileged enforces none/absence of restrictions, (2) purposely-open and entirely unrestricted, (3) bypasses typical container isolation mechanisms with host network example. The only minor omission is the explicit recommendation to use baseline or restricted for actual isolation, but the answer does contrast with these more restrictive profiles.", "label_timestamp": "2026-05-05T14:41:47Z", "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0018664, "_latency_ms": 4233.887287962716}
+ {"item_id": "k8s_024", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The gold source snippets section is empty - there are no snippets provided to ground any claims against. The answer makes numerous technical claims about Envoy ADS configuration, bootstrap settings, Kubernetes deployment, and control plane setup, none of which can be supported by the (nonexistent) snippets. The answer itself acknowledges the documentation doesn't contain this information but then proceeds to provide unsupported claims rather than abstaining. Since the answer is not a pure refusal/abstention but instead provides substantive unsupported claims, it fails groundedness.", "label_timestamp": "2026-05-05T14:32:55Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0027576000000000002, "_latency_ms": 5302.7043039910495}
+ {"item_id": "k8s_024", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question about configuring Envoy xDS ADS for sidecar proxies in a Kubernetes Deployment. It provides bootstrap configuration with ADS settings, a Kubernetes Deployment manifest with the Envoy sidecar, and discusses the control plane setup. Both aspects of the question (ADS configuration and Kubernetes Deployment integration) are covered.", "label_timestamp": "2026-05-05T14:37:11Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002908, "_latency_ms": 4470.350661023986}
+ {"item_id": "k8s_024", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference states the correct answer is to refuse and cite zero sources because the corpus doesn't cover Envoy/xDS/ADS. The agent does acknowledge that the documentation doesn't contain this information ('The documentation corpus doesn't appear to contain specific Envoy ADS configuration documentation'), which partially covers the reference point. However, instead of refusing, it then provides extensive configuration guidance, which goes against the 'refuse' aspect of the reference. The acknowledgment of absence is present but the refusal stance is not maintained.", "label_timestamp": "2026-05-05T14:41:56Z", "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0029904000000000003, "_latency_ms": 4743.537011963781}
+ {"item_id": "k8s_025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only describe ClusterIP ('Exposes the Service on a cluster-internal IP') and LoadBalancer ('Exposes the Service externally using an external load balancer'). The answer makes many claims about NodePort, ExternalIPs, Ingress, Gateway, and details about LoadBalancer (cloud provider, stable external IP) that are not supported by the snippets. Multiple unsupported claims fail the strict conjunction rule.", "label_timestamp": "2026-05-05T14:33:05Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0020808000000000003, "_latency_ms": 4765.863787964918}
+ {"item_id": "k8s_025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by listing Kubernetes Service types that expose applications externally (NodePort, LoadBalancer) and adds ExternalIPs as additional context, while clarifying ClusterIP is internal-only. This directly addresses the core ask.", "label_timestamp": "2026-05-05T14:37:19Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0017576000000000002, "_latency_ms": 3343.788483995013}
+ {"item_id": "k8s_025", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference identifies three external Service types: NodePort, LoadBalancer, and ExternalName, plus mentions ClusterIP as default and Ingress as alternative. The answer covers NodePort, LoadBalancer, ClusterIP as default, and Ingress. However, it does not mention ExternalName (instead it discusses ExternalIPs, which is a different concept). So one of the three external-exposing types is missing.", "label_timestamp": "2026-05-05T14:42:06Z", "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0021791999999999996, "_latency_ms": 4295.65988300601}
+ {"item_id": "q006", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer's core claims about default caching within a single request and using use_cache=False to disable it are supported by the snippet. However, the answer adds claims not supported: 'particularly useful for expensive operations like database connections' is an embellishment not in the snippet. More critically, the example shows using use_cache=False on two separate Depends calls in the same handler and claims this 'ensures that the dependency function is called twice' \u2014 but the snippet only describes caching when the same dependency is used multiple times (e.g., route + sub-dependency). The performance improvement claim ('improves performance by avoiding redundant function calls') is a reasonable inference but not explicitly stated. The strict-conjunction rule means unsupported claims fail the rubric.", "label_timestamp": "2026-05-05T14:28:25Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0026408, "_latency_ms": 6825.492414005566}
+ {"item_id": "q006", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses both parts of the question: how dependency caching works in FastAPI (per-request caching of dependency results) and how to disable it (using use_cache=False). Both core asks are covered with explanation and example.", "label_timestamp": "2026-05-05T14:33:13Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.001804, "_latency_ms": 3368.1467549758963}
+ {"item_id": "q006", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers both reference points: (1) FastAPI caches dependency results so each dependency is called only once per request, and (2) caching can be disabled by setting use_cache=False in Depends().", "label_timestamp": "2026-05-05T14:37:29Z", "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.001668, "_latency_ms": 5226.763973012567}
+ {"item_id": "q010", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "No gold snippets are provided to support any of the claims in the answer. The answer makes multiple specific claims (FastAPI generates OpenAPI 3.1.0, no native GraphQL support, specific third-party libraries like Strawberry/Graphene/Ariadne) that cannot be verified against the empty snippet set.", "label_timestamp": "2026-05-05T14:28:33Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0016656000000000002, "_latency_ms": 2784.843849018216}
+ {"item_id": "q010", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the core question of whether FastAPI can natively generate GraphQL schemas, clearly stating it cannot, and provides additional context about alternatives.", "label_timestamp": "2026-05-05T14:33:21Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0016496000000000002, "_latency_ms": 2810.719419037923}
+ {"item_id": "q010", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "Without a gold reference provided, I cannot assess coverage of specific reference points against the answer.", "label_timestamp": "2026-05-05T14:37:38Z", "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0012887999999999999, "_latency_ms": 3803.2577409758233}
+ {"item_id": "q011", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippets only confirm that `/docs` is the Swagger UI endpoint and that FastAPI exposes three documentation-related endpoints by default. The answer makes many additional claims not supported by the snippets: the specific names `/redoc` and `/openapi.json`, that ReDoc is the alternative viewer, that `/openapi.json` returns the raw OpenAPI schema, and the customization/disabling code with parameters like `docs_url`, `redoc_url`, `openapi_url`. None of these specifics are entailed by the two snippets provided.", "label_timestamp": "2026-05-05T14:28:43Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0021912, "_latency_ms": 5105.2629159530625}
+ {"item_id": "q011", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by stating that the default Swagger UI endpoint in FastAPI is /docs, and provides additional helpful context about other documentation endpoints and customization.", "label_timestamp": "2026-05-05T14:33:29Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0015224000000000001, "_latency_ms": 3036.487515026238}
+ {"item_id": "q011", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference states two points: (1) the default Swagger UI endpoint is /docs, and (2) it provides interactive API documentation. The answer covers both, explicitly stating /docs is the endpoint and describing it as an interactive API explorer.", "label_timestamp": "2026-05-05T14:37:46Z", "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0016392, "_latency_ms": 2931.246972992085}
+ {"item_id": "q012", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer contains several claims not supported by the snippets. The snippets only mention HTTPException as the primary way to return errors, that it terminates processing, returns status code and detail, and that detail can be string/list/dict serialized to JSON. Unsupported claims include: the 'headers' parameter, the example with custom headers, the default JSON response format showing {\"detail\": \"Item not found\"}, that HTTPException inherits from Starlette's HTTPException, and that you can create custom exception handlers. These are not entailed by the gold snippets.", "label_timestamp": "2026-05-05T14:28:52Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.00242, "_latency_ms": 4274.66034895042}
+ {"item_id": "q012", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses how to raise an HTTP error in a FastAPI route handler by explaining the use of HTTPException, providing code examples, parameters, and additional context. This is a direct, on-target response to the question's core ask.", "label_timestamp": "2026-05-05T14:33:37Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0018336000000000001, "_latency_ms": 3302.3087149485946}
+ {"item_id": "q012", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference has three key points: (1) raise HTTPException, (2) with a status_code, (3) and a detail message. The answer covers all three explicitly, showing the raise statement, status_code parameter, and detail parameter.", "label_timestamp": "2026-05-05T14:37:55Z", "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0017992000000000001, "_latency_ms": 4265.608784975484}
+ {"item_id": "q021", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The snippet states max_age=600 seconds means the browser caches preflight results. The answer's core claim that 600 seconds = 10 minutes is a math conversion supported by the snippet's value. However, the answer adds claims about the 'Access-Control-Max-Age' header name and that max_age is a 'directive' specifying maximum cache time of preflight requests \u2014 these specifics (header name in particular) are not in the snippet. The snippet only mentions 'max_age' as a config with value 600 and that browser caches preflight results.", "label_timestamp": "2026-05-05T14:29:03Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0018016, "_latency_ms": 5443.063500977587}
+ {"item_id": "q021", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses the question by converting 600 seconds to 10 minutes and explaining the calculation.", "label_timestamp": "2026-05-05T14:33:45Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0012775999999999998, "_latency_ms": 2611.7517079692334}
+ {"item_id": "q021", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The reference states that with max_age 600 seconds, the browser caches preflight results for 10 minutes. The answer covers exactly this: 10 minutes caching, with the 600/60=10 calculation shown.", "label_timestamp": "2026-05-05T14:38:03Z", "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0014024, "_latency_ms": 2987.084314983804}
+ {"item_id": "q023", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The answer makes many specific claims that go far beyond the two snippets provided. The snippets only state that HTTPException is the primary way to return error responses and that CORS is configured using CORSMiddleware from Starlette. The answer includes extensive code with specific parameters (allow_credentials, allow_methods, expose_headers, max_age), custom exception handlers with @app.exception_handler decorator syntax, RequestValidationError handling, dependency injection patterns, and many other details not supported by the snippets.", "label_timestamp": "2026-05-05T14:29:12Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.00298, "_latency_ms": 4431.315872003324}
+ {"item_id": "q023", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses all three parts of the question: custom error handling (custom exception classes and handlers), CORS middleware (CORSMiddleware setup), and begins setting up dependency injection for testing. Although the answer is truncated mid-code, it clearly engages with the core ask on-topic.", "label_timestamp": "2026-05-05T14:33:53Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0024936000000000003, "_latency_ms": 3396.7526520136744}
+ {"item_id": "q023", "dimension": "completeness", "score": 1, "abstained": false, "notes": "The reference has three key points: (1) custom error handling via HTTPException/exception handlers, (2) CORS via CORSMiddleware with allowed origins, (3) testing with TestClient and app.dependency_overrides. The answer covers exception handlers extensively (point 1) and CORSMiddleware with allow_origins (point 2). However, the answer is cut off and does not cover testing with TestClient or app.dependency_overrides (point 3 missing).", "label_timestamp": "2026-05-05T14:38:13Z", "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028344, "_latency_ms": 4287.58555999957}
+ {"item_id": "q025", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "The only snippet provided describes cursor-based pagination as using an opaque token pointing to the last item in the previous page, to avoid performance issues with large offsets. The answer makes many claims far beyond this: specific Pydantic models, field constraints (min_length, max_length, le=100), base64 encoding of cursors, analytics logging structure, FastAPI implementation details, etc. None of these specifics are supported by the single snippet. While the general concept of cursor pagination is mentioned, the vast majority of the answer's claims (validation rules, base64 encoding scheme, analytics fields, etc.) are not entailed by the snippet.", "label_timestamp": "2026-05-05T14:29:23Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0030064000000000002, "_latency_ms": 5922.551519994158}
+ {"item_id": "q025", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The answer directly addresses all three components of the question: cursor-based pagination (with encode/decode utilities and cursor query params), response model validation (Pydantic models with Field validators), and background task processing for analytics logging (FastAPI BackgroundTasks with analytics logger). Although the code is cut off at the end, the answer substantively engages with the core ask.", "label_timestamp": "2026-05-05T14:34:02Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.002676, "_latency_ms": 4254.6321749687195}
+ {"item_id": "q025", "dimension": "completeness", "score": 2, "abstained": false, "notes": "The answer covers all three reference points: (1) cursor-based pagination with opaque cursor tokens (shown via base64 encoding/decoding utilities and cursor parameter), (2) response_model parameter on route decorators for validation (shown in @app.get with response_model=PaginatedResponse[Item]), and (3) analytics logging handled asynchronously via FastAPI's BackgroundTasks (imported and used as a parameter).", "label_timestamp": "2026-05-05T14:38:22Z", "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.0028224, "_latency_ms": 3933.714357961435}
+ {"item_id": "q027", "dimension": "groundedness", "score": 0, "abstained": false, "notes": "There are no gold source snippets provided. With no snippets, no claims in the answer can be grounded/supported. The answer makes multiple substantive claims (about Uvicorn workers, Nginx/HAProxy, Kubernetes, statelessness, etc.) that cannot be verified against any snippet. This is not a refusal answer, so abstain doesn't apply.", "label_timestamp": "2026-05-05T14:29:32Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_groundedness", "_rubric_version": "ac7ba699b20db68055ba5520c0d6520aef102cf6d56f305a6e6ba715a362c784", "_cost_usd": 0.0019168000000000002, "_latency_ms": 3519.5088270120323}
+ {"item_id": "q027", "dimension": "relevance", "score": 2, "abstained": false, "notes": "The question asks how FastAPI handles automatic load balancing across multiple servers. The answer directly addresses this by clarifying that FastAPI does not provide built-in load balancing, explains what it does provide (worker processes, ASGI foundation), and describes how load balancing is actually achieved via external tools. This directly engages with the question's core ask.", "label_timestamp": "2026-05-05T14:34:11Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_relevance", "_rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f", "_cost_usd": 0.0020432000000000002, "_latency_ms": 3767.282517044805}
+ {"item_id": "q027", "dimension": "completeness", "score": "Unknown", "abstained": true, "notes": "No gold reference answer was provided to compare against. Without reference points to check coverage, the score cannot be determined.", "label_timestamp": "2026-05-05T14:38:29Z", "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e", "_judge_id": "claude-opus-4-7_completeness", "_rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "_cost_usd": 0.001504, "_latency_ms": 2372.651186946314}
measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl ADDED
@@ -0,0 +1,5 @@
+ {"item_id": "q006", "prior": 1, "new": 2, "gold": 2, "reasoning": "The answer covers the key points from the reference about how FastAPI caches dependency results and how to disable caching. It explains that dependencies are called only once per request and provides the method to disable caching using `use_cache=False`. All essential points from the reference are present, albeit with additional detail and explanation.", "evidence_quotes": ["FastAPI implements automatic caching of dependency results within a single request.", "To disable caching and force a fresh call each time a dependency is needed, use the `use_cache` parameter set to `False`."], "shifted_up": true}
+ {"item_id": "q011", "prior": 1, "new": 2, "gold": 2, "reasoning": "The answer covers the key point from the reference about the default Swagger UI endpoint being /docs. However, it also includes additional details about the availability of the endpoint, other documentation endpoints, and customization options, which are not part of the reference. Since the main point about the default endpoint is clearly stated, the answer is considered to have full coverage of the reference's key point.", "evidence_quotes": ["/docs"], "shifted_up": true}
+ {"item_id": "k8s_002", "prior": 1, "new": 2, "gold": 2, "reasoning": "The answer provides a detailed comparison between Deployments and StatefulSets, covering all key points from the reference answer. It explains that a Deployment manages interchangeable Pods for stateless applications and provides declarative updates, while a StatefulSet maintains a sticky identity for each Pod, including stable network identifiers, stable persistent storage, and ordered deployment/scaling. All aspects of the reference are addressed, including the use cases for each type. Therefore, the answer fully covers the reference points.", "evidence_quotes": ["A Deployment manages a set of Pods for an application workload that does not maintain state and provides declarative updates; its Pods are interchangeable replicas.", "A StatefulSet, by contrast, maintains a sticky identity for each of its Pods \u2014 stable network identifiers, stable persistent storage, and ordered deployment/scaling."], "shifted_up": true}
+ {"item_id": "k8s_006", "prior": 1, "new": 1, "gold": 2, "reasoning": "The answer covers the key point that a ConfigMap is an API object used to store non-confidential data in key-value pairs, which is a direct match to the gold reference. However, it does not mention that ConfigMaps are intended for application configuration that does not need to be kept secret, nor does it mention that confidential data should be stored in a Secret instead of a ConfigMap. Therefore, while one key point is fully covered, the second key point is missing.", "evidence_quotes": ["A ConfigMap is a Kubernetes API object used to store non-confidential data in key-value pairs."], "shifted_up": false}
+ {"item_id": "k8s_018", "prior": 1, "new": 1, "gold": 2, "reasoning": "The answer covers several key points from the reference, including the current stable version of the HorizontalPodAutoscaler API (autoscaling/v2), the support for scaling on memory and custom metrics, and the distinction between autoscaling/v1 and autoscaling/v2. However, it does not explicitly mention that the new fields in autoscaling/v2 are preserved as annotations when working with autoscaling/v1, which is a key point in the reference. Therefore, while it covers most of the reference's points, it does not cover all of them.", "evidence_quotes": ["The `autoscaling/v2` API version is the current stable version (stable since Kubernetes v1.23) that includes support for scaling on **custom metrics**.", "The `autoscaling/v2` API version includes support for scaling on **memory metrics**, which is not available in the older `autoscaling/v1` version.", "With `autoscaling/v2`, you can specify multiple metrics for a HorizontalPodAutoscaler to scale on.", "When you create a HorizontalPodAutoscaler for scaling on custom or memory metrics, you must use `apiVersion: autoscaling/v2` in your manifest."], "shifted_up": false}
measurements/2026-05-06-4a-gpt4o-full-probe.jsonl ADDED
@@ -0,0 +1,5 @@
+ {"reasoning": "The answer covers all the key points from the reference. It mentions that a ConfigMap is used to store non-confidential data in key-value pairs, which aligns with the reference's description. It also states that confidential data should be stored in a Secret, not a ConfigMap, which matches the reference's point about where to store sensitive information.", "evidence_quotes": ["A ConfigMap is a Kubernetes API object used to store non-confidential data in key-value pairs.", "Do NOT store sensitive or confidential data in ConfigMaps", "For sensitive data, you should use Kubernetes Secrets instead"], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd", "cost_usd": 0.00022004999999999998, "latency_ms": 3570.2048700186424, "item_id": "k8s_006", "mini_prior_score": 1, "gold_score": 2}
+ {"reasoning": "The answer covers all the key points from the reference. It mentions that the current stable version is autoscaling/v2, which supports scaling on memory and custom metrics, similar to the reference. It also notes that autoscaling/v1 only supports CPU-based scaling, aligning with the reference's point about using autoscaling/v2 for memory or custom metric scaling.", "evidence_quotes": ["The `autoscaling/v2` API version is the current stable version", "includes support for scaling on **custom metrics**", "includes support for scaling on **memory metrics**", "Only supports CPU-based scaling"], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7", "cost_usd": 0.00023295, "latency_ms": 1422.9646890307777, "item_id": "k8s_018", "mini_prior_score": 1, "gold_score": 2}
+ {"reasoning": "The agent's answer covers the key point from the reference answer, which is that the default Swagger UI endpoint in FastAPI is `/docs`. The additional details provided by the agent do not affect the completeness score as they do not detract from the coverage of the reference's key point.", "evidence_quotes": ["The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface.", "The default Swagger UI endpoint in FastAPI is **`/docs`**."], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef", "cost_usd": 0.0001923, "latency_ms": 1622.3357539856806, "item_id": "q011", "mini_prior_score": 1, "gold_score": 2}
+ {"reasoning": "The agent's answer covers all the key points from the reference answer. The reference states that an HTTP error in FastAPI is raised by using an HTTPException with a status_code and a detail message. The agent's answer explains the use of HTTPException, including the status_code and detail parameters, and provides examples that illustrate these points. Therefore, the agent's answer fully covers the reference's key points.", "evidence_quotes": ["The primary way to raise HTTP errors in FastAPI is using the `HTTPException` class:", "- **status_code** (required): The HTTP status code to return (e.g., 404, 403, 400)", "- **detail** (optional): The error message/details. Can be a string, list, or dictionary - FastAPI automatically serializes it to JSON"], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b", "cost_usd": 0.0002484, "latency_ms": 2396.0261089960113, "item_id": "q012", "mini_prior_score": 1, "gold_score": 2}
+ {"reasoning": "The agent's answer covers all the key points from the reference answer. It mentions the unique identity composed of an ordinal index, stable network identity, and persistent storage, which are the core components of the reference's identity description. Additionally, it explains the persistence of identity across rescheduling, similar to the reference's explanation of identity sticking to each Pod.", "evidence_quotes": ["StatefulSets maintain a **sticky identity** for each of its Pods.", "Each Pod has a **persistent identifier** that remains consistent even if the Pod is rescheduled or restarted.", "StatefulSets require a **Headless Service** to be responsible for the network identity of the Pods."], "score": 2, "judge_id": "gpt-4o-2024-08-06_completeness", "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20", "prompt_seed": 0, "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217", "cost_usd": 0.00023145, "latency_ms": 2257.6226279488765, "item_id": "k8s_001", "mini_prior_score": 1, "gold_score": 2}
measurements/2026-05-06-gpt4o-extraction-reasoning-split.md ADDED
@@ -0,0 +1,162 @@
+ # GPT-4o-mini extraction-vs-reasoning split: three calibration items
+
+ This artifact documents the specific failure mode behind v1.1 finding 3 (jury
+ κ regression on completeness): on each of the three representative
+ disputed cells (gold=2, Haiku=2, gpt-4o-mini=1), gpt-4o-mini's
+ `evidence_quotes` correctly extract the paraphrased coverage from the
+ agent's answer, and then its `reasoning` field denies that those quotes
+ constitute coverage. The score follows the reasoning, not the evidence.
+
+ The mechanism is a *post-extraction reasoning regression*: the
+ structured-output discipline forces the model to commit to an extraction
+ step (which it does correctly), and then the reasoning step applies a
+ literal-string-match standard to the answer text, even though the
+ rubric says "paraphrase allowed". This is a known failure mode in
+ chain-of-thought judges. It plausibly shows up more in smaller models
+ because the reasoning step has less capacity to reconcile the rubric's
+ "paraphrase allowed" instruction with the literal-text comparison the
+ model falls back on by default.
+
+ Source: `results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`
+ filtered to `judge_id="gpt-4o-mini-2024-07-18_completeness"` and joined to
+ gold labels in `measurements/2026-05-04-judge-calibration-labels.jsonl`.
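
The filter-and-join described above could be reproduced along these lines. This is a minimal sketch, not the project's actual loader: the field names `judge_id`, `item_id`, `score`, `dimension`, and `abstained` are assumptions based on the JSONL rows shown elsewhere in this commit, and `read_jsonl` and `disputed_cells` are hypothetical helpers.

```python
import json
from pathlib import Path

# Judge id taken from the source note above.
JUDGE_ID = "gpt-4o-mini-2024-07-18_completeness"


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def disputed_cells(member_rows, gold_rows, judge_id=JUDGE_ID,
                   dimension="completeness"):
    """Join one judge's rows to gold labels; keep rows where scores differ.

    Assumed schema (hypothetical): member rows carry judge_id/item_id/score,
    gold rows carry item_id/dimension/score and an optional abstained flag.
    """
    gold = {
        row["item_id"]: row["score"]
        for row in gold_rows
        if row.get("dimension") == dimension and not row.get("abstained", False)
    }
    out = []
    for row in member_rows:
        if row.get("judge_id") != judge_id:
            continue
        item_id = row["item_id"]
        if item_id in gold and row["score"] != gold[item_id]:
            out.append({"item_id": item_id,
                        "gold": gold[item_id],
                        "judge": row["score"]})
    return out
```

Passing `read_jsonl(...)` over the two files named above into `disputed_cells` would give the candidate list from which the three items below were picked, assuming the schema matches.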
+
+ ---
+
+ ## Item q006: FastAPI dependency caching (gold=2, Haiku=2, gpt=1)
+
+ **Reference (gold) answer:** "FastAPI caches dependency results so each
+ dependency is called only once per request; caching can be disabled by
+ setting `use_cache=False` in the `Depends()` call."
+
+ ### gpt-4o-mini evidence_quotes (extraction step: correct)
+
+ ```
+ [
+ "FastAPI implements automatic caching of dependency results within a single request.",
+ "To disable caching and force a fresh call each time a dependency is needed, use the `use_cache` parameter set to `False`."
+ ]
+ ```
+
+ Both gold points are present in the extracted quotes: (1) per-request
+ caching, (2) `use_cache=False` mechanism.
+
+ ### gpt-4o-mini reasoning (literal-match regression)
+
+ > "The answer provides a detailed explanation of how dependency caching
+ > works in FastAPI and how to disable it, but it does not explicitly
+ > mention that caching is enabled by default and that it can be disabled
+ > by setting `use_cache=False` in the `Depends()` call. While it covers
+ > the concept of caching and provides an example, it does not directly
+ > address the key points from the reference answer."
+
+ **The reasoning denies what the evidence_quotes confirm.** The model has
+ the strings "automatic caching of dependency results within a single
+ request" and "use the `use_cache` parameter set to `False`" in its own
+ extracted evidence, then writes "does not explicitly mention that caching
+ is enabled by default and that it can be disabled by setting
+ `use_cache=False`". Score=1 follows the reasoning.
+
+ ---
+
+ ## Item k8s_002: Deployment vs StatefulSet (gold=2, Haiku=2, gpt=1)
+
+ **Reference (gold) answer:** "Deployment manages stateless pods with
+ declarative updates; StatefulSet maintains sticky identity for each pod
+ — stable network identifiers, stable persistent storage, ordered
+ operations."
+
+ ### gpt-4o-mini evidence_quotes (extraction step: correct)
+
+ ```
+ [
+ "A Deployment manages a set of Pods for an application workload that does not maintain state and provides declarative updates; its Pods are interchangeable replicas.",
+ "A StatefulSet, by contrast, maintains a sticky identity for each of its Pods — stable network identifiers, stable persistent storage, and ordered deployment/scaling."
+ ]
+ ```
+
+ The evidence_quotes literally contain the strings `"declarative updates"`
+ and `"sticky identity"`, the two reference phrases that gpt-4o-mini's
+ reasoning will claim are missing.
+
+ ### gpt-4o-mini reasoning (literal-match regression)
+
+ > "The answer provides a detailed comparison between Deployments and
+ > StatefulSets, covering key differences such as pod identity, naming,
+ > network identity, storage, ordering, and use cases. However, it does
+ > not explicitly mention the points about 'declarative updates' and
+ > 'sticky identity' for StatefulSets as stated in the reference."
+
+ **This is the cleanest evidence-vs-reasoning contradiction in the
+ calibration set.** The model quotes "declarative updates" and "sticky
+ identity" into evidence, then writes "does not explicitly mention
+ 'declarative updates' and 'sticky identity'". The score field is then
+ 1, following the reasoning.
+
+ ---
+
+ ## Item k8s_018: autoscaling/v2 vs v1 (gold=2, Haiku=2, gpt=1)
+
+ **Reference (gold) answer:** "autoscaling/v2 is the current stable API
+ version; supports memory metrics beyond CPU-only autoscaling/v1; supports
+ custom metrics."
+
+ ### gpt-4o-mini evidence_quotes (extraction step: correct)
+
+ ```
+ [
+ "The `autoscaling/v2` API version is the current stable version (stable since Kubernetes v1.23) that includes support for scaling on **custom metrics**.",
+ "The `autoscaling/v2` API version includes support for scaling on **memory metrics**, which is not available in the older `autoscaling/v1` version.",
+ "With `autoscaling/v2`, you can specify multiple metrics for a HorizontalPodAutoscaler to scale on."
+ ]
+ ```
+
+ All three reference points are extracted: stable version, memory metrics,
+ custom metrics.
+
+ ### gpt-4o-mini reasoning (literal-match regression)
+
+ > "The answer covers some key points from the reference but does not
+ > fully address all of them. It mentions the stable version of
+ > `autoscaling/v2`, support for custom metrics, and memory metrics, but
+ > it does not explicitly state that the new fields in `autoscaling/v2`
+ > are preserved as annotations when using `autoscaling/v1`, nor does it
+ > mention the need to use `autoscaling/v2` directly for memory or custom
+ > metric scaling for a Deployment or StatefulSet. Therefore, it
+ > partially covers the reference points."
+
+ **Same pattern, with extra-credit deduction.** The reasoning
+ acknowledges the three reference points are covered ("It mentions the
+ stable version of `autoscaling/v2`, support for custom metrics, and
+ memory metrics") and then deducts for points the *reference does not
+ require* ("does not explicitly state that the new fields in
+ `autoscaling/v2` are preserved as annotations when using
+ `autoscaling/v1`"). The reference (per the gold annotation) requires
+ three points; gpt's reasoning invents a fourth and penalizes for it.
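
The contradiction in this item (evidence covers every reference point, yet the score is deducted) can be flagged mechanically. The following is a minimal sketch, not code from this repository: `JudgeRecord`, `covered_points`, the keyword lists, and the "all points covered means score 2" mapping are illustrative assumptions, and keyword matching is only a rough stand-in for a real coverage check.

```python
from dataclasses import dataclass


@dataclass
class JudgeRecord:
    """Hypothetical subset of a judge output record."""
    item_id: str
    evidence_quotes: list[str]
    score: int  # 0, 1, or 2 on the 3-point completeness scale


def covered_points(evidence_quotes, reference_points):
    """Return names of reference points whose keywords all occur in one quote."""
    lowered = [q.lower() for q in evidence_quotes]
    return [
        name
        for name, keywords in reference_points.items()
        if any(all(k in q for k in keywords) for q in lowered)
    ]


def evidence_implied_score(n_covered, n_required):
    """Assumed mapping: all points -> 2, some -> 1, none -> 0."""
    if n_covered == n_required:
        return 2
    return 1 if n_covered > 0 else 0


def is_consistent(record, reference_points):
    """True when the emitted score matches what the evidence supports."""
    covered = covered_points(record.evidence_quotes, reference_points)
    return record.score == evidence_implied_score(len(covered), len(reference_points))


# Keywords are an assumption; the quotes are the ones listed for k8s_018.
reference_points = {
    "stable": ["stable"],
    "memory": ["memory metrics"],
    "custom": ["custom metrics"],
}
gpt_record = JudgeRecord(
    item_id="k8s_018",
    evidence_quotes=[
        "The `autoscaling/v2` API version is the current stable version (stable since Kubernetes v1.23) that includes support for scaling on **custom metrics**.",
        "The `autoscaling/v2` API version includes support for scaling on **memory metrics**, which is not available in the older `autoscaling/v1` version.",
        "With `autoscaling/v2`, you can specify multiple metrics for a HorizontalPodAutoscaler to scale on.",
    ],
    score=1,
)
print(is_consistent(gpt_record, reference_points))  # False: evidence supports 2
```

A flag from a check like this would not say which field is wrong; it only marks the record for manual review.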
+
+ ---
+
+ ## Why this matters for the writeup
+
+ This isn't "GPT-4o-mini is bad at completeness." It's a sharper claim:
+ *the structured-output discipline forces correct extraction, but the
+ reasoning step regresses to a literal-match standard the rubric does
+ not specify*. That regression is dimension-specific (groundedness AC1 =
+ 1.000, relevance AC1 = 1.000 on the same model) — it surfaces only on
+ the 3-point ordinal scale where "paraphrase allowed" is load-bearing.
+
+ Two consequences for evaluation framework design:
+
+ 1. **Per-dimension judge selection matters more than per-judge selection.**
+ gpt-4o-mini is fine for binary groundedness and saturated relevance;
+ it's miscalibrated for paraphrase-tolerant ordinal completeness. v1's
+ global "include in jury" decision flattens this.
+
+ 2. **A judge's `reasoning` field can contradict its `evidence_quotes`
+ field, and the score follows the reasoning.** Internal consistency
+ between the two structured-output fields is not enforced by any
+ provider's structured-output API; it's a property of the model's
+ capability that varies across model sizes and dimensions. v1.2
+ diagnostics should include an internal-consistency check (does the
+ reasoning's score-direction match what the evidence_quotes would
+ support?) as an additional signal beyond raw κ.
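
The first consequence, choosing jury members per dimension instead of globally, could look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's jury code: `select_judges`, the judge labels, and the 0.8 threshold are hypothetical. Only the two gpt-4o-mini values of 1.000 come from the text above; every other number is a placeholder and not a measured result.

```python
def select_judges(agreement, threshold):
    """Map each dimension to the judges whose agreement meets the threshold.

    agreement: {judge_name: {dimension: agreement_with_gold}}
    """
    selected = {}
    for judge, by_dimension in agreement.items():
        for dimension, value in by_dimension.items():
            selected.setdefault(dimension, [])
            if value >= threshold:
                selected[dimension].append(judge)
    return selected


agreement = {
    "gpt-4o-mini": {
        "groundedness": 1.000,  # AC1 reported in the text above
        "relevance": 1.000,     # AC1 reported in the text above
        "completeness": 0.5,    # PLACEHOLDER, not a measured value
    },
    "claude-haiku-4-5": {
        "groundedness": 0.9,    # PLACEHOLDER
        "relevance": 0.9,       # PLACEHOLDER
        "completeness": 0.9,    # PLACEHOLDER
    },
}

jury = select_judges(agreement, threshold=0.8)  # threshold is an assumption
print(jury["completeness"])  # ['claude-haiku-4-5']
print(jury["groundedness"])  # ['gpt-4o-mini', 'claude-haiku-4-5']
```

With a global include/exclude decision, gpt-4o-mini would be either kept for completeness or dropped from groundedness and relevance; the per-dimension map avoids both outcomes.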
measurements/README.md CHANGED
@@ -12,3 +12,4 @@ Naming: `YYYY-MM-DD-<topic>-<variant>.log`
 
  Current entries:
  - `2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log` — HF Spaces cold-start samples N=1..3. Backs the DECISIONS.md entry "Cold-start gate fired — assumption falsified, fix deferred to v1.1 at the right cause."
+ - `2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3 dimensions hand-labels (single rater) for the κ ablation table in `docs/_generated/kappa_table.md` and the writeup at `docs/judge-design.md`. Backs the DECISIONS.md entry "LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call". Lands in Phase 10 (manual labeling).
pyproject.toml CHANGED
@@ -54,6 +54,7 @@ testpaths = ["tests"]
  [tool.ruff]
  target-version = "py311"
  line-length = 100
+ extend-exclude = ["scripts/_dev"]
 
  [tool.ruff.lint]
  select = ["E", "F", "I", "N", "W"]
@@ -62,3 +63,4 @@ select = ["E", "F", "I", "N", "W"]
  python_version = "3.11"
  warn_return_any = true
  warn_unused_configs = true
+ exclude = ["scripts/_dev/"]
results/calibration_v1_judge_baseline.json ADDED
The diff for this file is too large to render. See raw diff
 
results/calibration_v1_judge_baseline_no_abstain.json ADDED
The diff for this file is too large to render. See raw diff
 
results/calibration_v1_judge_baseline_no_anchors.json ADDED
The diff for this file is too large to render. See raw diff
 
results/calibration_v1_judge_baseline_no_cot.json ADDED
@@ -0,0 +1,2115 @@
1
+ [
2
+ {
3
+ "dimension": "groundedness",
4
+ "reasoning": "The answer contains a claim not entailed by the snippet. The snippet describes the caching mechanism (what it does) but does not state that caching 'is particularly useful for expensive operations like database connections.' This is an embellishment importing general LLM knowledge about why caching is useful, similar to Example B in the rubric. The snippet uses `Depends(get_db)` as an identifier in its example, not as a normative argument that caching is useful for database connections. All other claims in the answer are entailed: the default caching behavior, the once-per-request semantics, the mechanism of reusing cached values, the `use_cache=False` disable mechanism, and the code example demonstrating the disable feature. However, the single ungrounded claim about usefulness for expensive operations fails the strict binary rubric.",
5
+ "evidence_quotes": [
6
+ "Snippet: 'By default, if the same dependency is used multiple times within a single request (e.g., both a route and a sub-dependency use `Depends(get_db)`), FastAPI caches the result and calls the dependency only once.'",
7
+ "Answer claim: 'This improves performance by avoiding redundant function calls and is particularly useful for expensive operations like database connections.'",
8
+ "The 'particularly useful for expensive operations like database connections' portion is not entailed by the snippet."
9
+ ],
10
+ "score": 0,
11
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
12
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
13
+ "prompt_seed": 0,
14
+ "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
15
+ "cost_usd": 0.005529600000000001,
16
+ "latency_ms": 5959.3987130210735,
17
+ "item_id": "q006"
18
+ },
19
+ {
20
+ "dimension": "groundedness",
21
+ "reasoning": "The answer makes several claims that need to be evaluated against the gold snippets:\n\n1. 'The default Swagger UI endpoint in FastAPI is `/docs`' - Snippet [1] directly states this.\n2. 'This endpoint is automatically available in every FastAPI application with zero configuration required' - Snippet [2] states 'Every FastAPI application exposes three documentation-related endpoints by default', which entails this claim.\n3. 'When you navigate to http://localhost:8000/docs... you'll see an interactive API explorer powered by Swagger UI' - Snippet [1] describes `/docs` as 'Swagger UI -- interactive API explorer', which entails this.\n4. 'FastAPI also provides two other documentation endpoints by default: `/redoc` and `/openapi.json`' - Snippet [2] states there are 'three documentation-related endpoints by default', and the answer identifies three total endpoints. However, the snippets do not explicitly name `/redoc` and `/openapi.json` as the other two endpoints.\n5. The code examples showing customization of `docs_url`, `redoc_url`, and `openapi_url` parameters - These are not entailed by the snippets provided.\n6. The code example showing how to disable endpoints by setting them to `None` - This is not entailed by the snippets provided.\n\nThe critical issue is that while Snippet [2] confirms there are three documentation endpoints by default, it does not specify what those three endpoints are. The answer names `/redoc` and `/openapi.json` as the other two, but this is not entailed by the snippets. Additionally, the customization examples and the ability to disable endpoints are not mentioned in the gold snippets at all.\n\nUnder the strict reference scope rule, claims that are factually correct but not entailed by the snippets must score 0. The answer goes beyond what the snippets explicitly state regarding the specific endpoint names and the customization/disabling capabilities.",
22
+ "evidence_quotes": [
23
+ "Snippet [1]: '/docs | Swagger UI -- interactive API explorer'",
24
+ "Snippet [2]: 'Every FastAPI application exposes three documentation-related endpoints by default'"
25
+ ],
26
+ "score": 0,
27
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
28
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
29
+ "prompt_seed": 0,
30
+ "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
31
+ "cost_usd": 0.006808,
32
+ "latency_ms": 11671.607129916083,
33
+ "item_id": "q011"
34
+ },
35
+ {
36
+ "dimension": "groundedness",
37
+ "reasoning": "Evaluating each major claim in the answer against the gold snippets:\n\n1. 'HTTPException is the primary way to raise HTTP errors' - ENTAILED by snippet [1].\n2. 'Raising HTTPException immediately terminates request processing and returns specified status code and detail' - ENTAILED by snippet [2].\n3. 'detail parameter can be a string, list, or dictionary' - ENTAILED by snippet [2].\n4. 'FastAPI serializes detail to JSON automatically' - ENTAILED by snippet [2].\n5. 'status_code is required parameter' - NOT ENTAILED. Snippets do not specify which parameters are required vs optional.\n6. 'detail is optional parameter' - NOT ENTAILED. Snippets do not specify optionality.\n7. 'headers parameter (optional)' - NOT ENTAILED. Snippets make no mention of a headers parameter.\n8. 'Custom headers example with X-Custom-Header' - NOT ENTAILED. Snippets do not mention headers capability.\n9. 'Response is automatically formatted as JSON with detail field' - ENTAILED by snippet [2] (serializes to JSON).\n10. 'HTTPException inherits from Starlette's HTTPException' - NOT ENTAILED. Snippets do not mention Starlette or inheritance.\n11. 'Can create custom exception handlers' - NOT ENTAILED. Snippets do not mention custom exception handlers.\n\nThe answer contains multiple ungrounded claims: the headers parameter and its usage, parameter optionality/requirement specifications, the Starlette inheritance detail, and custom exception handlers. Under the strict binary rubric, even one ungrounded claim requires a score of 0.",
38
+ "evidence_quotes": [
39
+ "Snippet [1]: 'The HTTPException class is the primary way to return error responses from route handlers'",
40
+ "Snippet [2]: 'When raised, HTTPException immediately terminates request processing and returns the specified status code and detail message. The detail parameter can be a string, list, or dictionary -- FastAPI serializes it to JSON automatically.'"
41
+ ],
42
+ "score": 0,
43
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
44
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
45
+ "prompt_seed": 0,
46
+ "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
47
+ "cost_usd": 0.006624,
48
+ "latency_ms": 7320.906443987042,
49
+ "item_id": "q012"
50
+ },
51
+ {
52
+ "dimension": "groundedness",
53
+ "reasoning": "",
54
+ "evidence_quotes": [],
55
+ "score": 1,
56
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
57
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
58
+ "prompt_seed": 0,
59
+ "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
60
+ "cost_usd": 0.0016296000000000001,
61
+ "latency_ms": 806.1586309922859,
62
+ "item_id": "q021"
63
+ },
64
+ {
65
+ "dimension": "groundedness",
66
+ "reasoning": "The answer provides a comprehensive FastAPI setup guide covering error handling, CORS, and dependency injection. However, examining against the gold snippets: Snippet [1] states 'The `HTTPException` class is the primary way to return error responses from route handlers.' Snippet [2] states 'Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette.' The answer makes numerous claims beyond what is entailed by these snippets: (1) Custom exception classes (ItemNotFoundException, DatabaseError) and their handlers are presented as a primary pattern, but the snippet specifies HTTPException as 'the primary way'; (2) Custom exception handlers for RequestValidationError are shown without snippet support; (3) Detailed CORS configuration parameters (allow_origins, allow_credentials, allow_methods, allow_headers, expose_headers, max_age) are provided, but the snippet only states that CORS 'is configured using CORSMiddleware from Starlette' without detailing these parameters; (4) The entire dependency injection section is completely unsupported by the snippets; (5) Database class implementation, Pydantic models, and route handler examples are all beyond snippet scope. While the answer is factually correct FastAPI documentation, it dramatically exceeds what is entailed by the two minimal snippets provided. The strict-conjunction rule applies: even though most claims are correct in general FastAPI knowledge, they are not entailed by the gold snippets.",
67
+ "evidence_quotes": [
68
+ "Snippet [1]: 'The `HTTPException` class is the primary way to return error responses from route handlers'",
69
+ "Snippet [2]: 'Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette'",
70
+ "Answer claims: Custom exception classes, RequestValidationError handlers, detailed CORS parameters (allow_origins, allow_credentials, allow_methods, allow_headers, expose_headers, max_age), dependency injection patterns, database classes, Pydantic models, and route implementations"
71
+ ],
72
+ "score": 0,
73
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
74
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
75
+ "prompt_seed": 0,
76
+ "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
77
+ "cost_usd": 0.007796000000000001,
78
+ "latency_ms": 9348.714338964783,
79
+ "item_id": "q023"
80
+ },
81
+ {
82
+ "dimension": "groundedness",
83
+ "reasoning": "The answer provides a comprehensive implementation of a paginated API with cursor-based pagination, validation, and analytics. However, the gold snippet only states: 'Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets.' The answer goes far beyond this single sentence by providing: (1) complete Pydantic response models with validation, (2) cursor encoding/decoding utilities using base64, (3) analytics logging with background tasks, (4) a complete FastAPI endpoint implementation with mock database, and (5) detailed field specifications and configurations. While these implementations are factually correct and follow best practices, they are NOT entailed by the gold snippet. The snippet only describes what cursor-based pagination is and why it's useful (avoiding performance degradation). The answer adds extensive implementation details, specific technologies (Pydantic, base64 encoding, logging, BackgroundTasks), and architectural patterns that are not mentioned in or entailed by the snippet. This is similar to the k8s_006 calibration anchor: the answer dramatically extends beyond what the gold snippet covers.",
84
+ "evidence_quotes": [
85
+ "Gold snippet: 'Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets.'",
86
+ "Answer claims not in snippet: Complete Pydantic BaseModel definitions, Generic type parameters, Field validators with constraints, base64 encoding/decoding implementation, analytics logging with JSON serialization, FastAPI endpoint decorators, BackgroundTasks integration, mock database with 100 items, datetime handling, and specific HTTP query parameter configurations."
87
+ ],
88
+ "score": 0,
89
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
90
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
91
+ "prompt_seed": 0,
92
+ "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
93
+ "cost_usd": 0.007131199999999999,
94
+ "latency_ms": 8201.74567901995,
95
+ "item_id": "q025"
96
+ },
97
+ {
98
+ "dimension": "groundedness",
99
+ "reasoning": "",
100
+ "evidence_quotes": [],
101
+ "score": 1,
102
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
103
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
104
+ "prompt_seed": 0,
105
+ "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
106
+ "cost_usd": 0.0018592,
107
+ "latency_ms": 16725.520576001145,
108
+ "item_id": "k8s_001"
109
+ },
110
+ {
111
+ "dimension": "groundedness",
112
+ "reasoning": "The answer makes numerous claims beyond what is entailed by the two gold snippets provided. The snippets state: (1) A Deployment manages Pods for stateless workloads, and (2) A StatefulSet maintains sticky identity for each Pod. The answer goes far beyond these minimal claims to include: ordinal-based naming conventions, stable hostnames, persistent volumes via volumeClaimTemplates, ordered scaling/updates, specific use cases (databases, message queues, etc.), YAML configuration examples, headless service requirements, and detailed comparisons across multiple dimensions. While these claims are factually correct in Kubernetes documentation, they are NOT entailed by the two provided snippets. The snippets only establish the core distinction (stateless vs. stateful identity), not the mechanisms, naming schemes, storage patterns, ordering semantics, or use-case guidance that comprise most of the answer. This is analogous to the k8s_006 calibration anchor: the answer dramatically extends beyond the gold-snippet scope into general Kubernetes knowledge not present in the reference material.",
113
+ "evidence_quotes": [
114
+ "Snippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'",
115
+ "Snippet [2]: 'Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods'",
116
+ "Answer claims not in snippets: 'Ordinal-based names (e.g., web-0, web-1, web-2)', 'Stable hostname maintained across rescheduling', 'Persistent volumes per Pod via volumeClaimTemplates', 'Ordered, sequential operations (0\u21921\u21922)', 'Headless Service Required', detailed YAML examples, specific use cases like 'Databases (MySQL, PostgreSQL, MongoDB)'"
117
+ ],
118
+ "score": 0,
119
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
120
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
121
+ "prompt_seed": 0,
122
+ "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
123
+ "cost_usd": 0.0074048,
124
+ "latency_ms": 20093.030811985955,
125
+ "item_id": "k8s_002"
126
+ },
127
+ {
128
+ "dimension": "groundedness",
129
+ "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011CajszTeiZPK5rrayb3438'}",
130
+ "evidence_quotes": [],
131
+ "score": "Unknown",
132
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
133
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
134
+ "prompt_seed": 0,
135
+ "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
136
+ "cost_usd": 0.0031856,
137
+ "latency_ms": 51021.941915969364,
138
+ "item_id": "k8s_003"
139
+ },
140
+ {
141
+ "dimension": "groundedness",
142
+ "reasoning": "The answer makes several claims about Pod Security Admission's enforce and warn modes. Evaluating against the gold snippets:\n\nSnippet [1] states: 'Policy violations will cause the pod to be rejected'\nSnippet [2] only provides feature state information (Kubernetes v1.25 [stable])\n\nThe answer's key claims:\n1. 'Policy violations will cause the pod to be rejected' in enforce mode - ENTAILED by snippet [1]\n2. 'Policy violations are otherwise allowed' in warn mode - NOT ENTAILED by snippets\n3. 'Pod creation fails silently' in enforce mode - NOT ENTAILED by snippets\n4. 'Policy violations trigger a user-facing warning' in warn mode - NOT ENTAILED by snippets\n5. Claims about workload resources not being applied in enforce mode - NOT ENTAILED by snippets\n6. Claims about workload resources being applied in warn mode - NOT ENTAILED by snippets\n7. Namespace labeling examples and configuration details - NOT ENTAILED by snippets\n8. Claims about independent configuration - NOT ENTAILED by snippets\n\nThe gold snippets provide only minimal information: that policy violations cause pod rejection (in some mode, presumably enforce), and a feature state marker. The answer goes far beyond this, making numerous claims about warn mode behavior, workload resource handling, configuration examples, and practical implications that are not entailed by the provided snippets. While these claims may be factually correct in Kubernetes documentation, they are not grounded in the gold snippets provided. The strict-entailment rule requires that every claim be entailed by the gold snippets only.",
143
+ "evidence_quotes": [
144
+ "Snippet [1]: 'Policy violations will cause the pod to be rejected'",
145
+ "Snippet [2]: 'FEATURE STATE: Kubernetes v1.25 [stable]'",
146
+ "Answer claims about warn mode, workload resources, configuration examples, and practical implications are not present in or entailed by the gold snippets"
147
+ ],
148
+ "score": 0,
149
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
150
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
151
+ "prompt_seed": 0,
152
+ "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
153
+ "cost_usd": 0.0065352000000000006,
154
+ "latency_ms": 36792.009823955595,
155
+ "item_id": "k8s_005"
156
+ },
157
+ {
158
+ "dimension": "groundedness",
159
+ "reasoning": "The gold snippet states only: 'A ConfigMap is an API object used to store non-confidential data in key-value pairs.' The answer makes numerous claims beyond this single sentence: (1) three consumption methods (environment variables, command-line arguments, volumes), (2) specific examples of what to store (application configuration settings, environment-specific parameters, configuration files, feature flags), (3) what NOT to store (passwords, API keys, tokens, certificates), (4) recommendation to use Kubernetes Secrets instead, and (5) details about `data` and `binaryData` fields. While these claims are factually correct in Kubernetes documentation, they are NOT entailed by the gold snippet provided. The snippet only defines what a ConfigMap is and that it stores non-confidential data in key-value pairs. It does not describe consumption methods, specific use cases, security warnings, or schema fields. This is analogous to calibration anchor k8s_006: the answer dramatically extends beyond the snippet's scope. The strict-conjunction rule applies\u2014a single ungrounded claim fails the binary rubric, and this answer contains multiple ungrounded claims.",
160
+ "evidence_quotes": [
161
+ "G",
162
+ "o",
163
+ "l",
164
+ "d",
165
+ " ",
166
+ "s",
167
+ "n",
168
+ "i",
169
+ "p",
170
+ "p",
171
+ "e",
172
+ "t",
173
+ ":",
174
+ " ",
175
+ "'",
176
+ "A",
177
+ " ",
178
+ "C",
179
+ "o",
180
+ "n",
181
+ "f",
182
+ "i",
183
+ "g",
184
+ "M",
185
+ "a",
186
+ "p",
187
+ " ",
188
+ "i",
189
+ "s",
190
+ " ",
191
+ "a",
192
+ "n",
193
+ " ",
194
+ "A",
195
+ "P",
196
+ "I",
197
+ " ",
198
+ "o",
199
+ "b",
200
+ "j",
201
+ "e",
202
+ "c",
203
+ "t",
204
+ " ",
205
+ "u",
206
+ "s",
207
+ "e",
208
+ "d",
209
+ " ",
210
+ "t",
211
+ "o",
212
+ " ",
213
+ "s",
214
+ "t",
215
+ "o",
216
+ "r",
217
+ "e",
218
+ " ",
219
+ "n",
220
+ "o",
221
+ "n",
222
+ "-",
223
+ "c",
224
+ "o",
225
+ "n",
226
+ "f",
227
+ "i",
228
+ "d",
229
+ "e",
230
+ "n",
231
+ "t",
232
+ "i",
233
+ "a",
234
+ "l",
235
+ " ",
236
+ "d",
237
+ "a",
238
+ "t",
239
+ "a",
240
+ " ",
241
+ "i",
242
+ "n",
243
+ " ",
244
+ "k",
245
+ "e",
246
+ "y",
247
+ "-",
248
+ "v",
249
+ "a",
250
+ "l",
251
+ "u",
252
+ "e",
253
+ " ",
254
+ "p",
255
+ "a",
256
+ "i",
257
+ "r",
258
+ "s",
259
+ ".",
260
+ "'",
261
+ " ",
262
+ "A",
263
+ "n",
264
+ "s",
265
+ "w",
266
+ "e",
267
+ "r",
268
+ " ",
269
+ "c",
270
+ "l",
271
+ "a",
272
+ "i",
273
+ "m",
274
+ "s",
275
+ " ",
276
+ "n",
277
+ "o",
278
+ "t",
279
+ " ",
280
+ "i",
281
+ "n",
282
+ " ",
283
+ "s",
284
+ "n",
285
+ "i",
286
+ "p",
287
+ "p",
288
+ "e",
289
+ "t",
290
+ ":",
291
+ " ",
292
+ "'",
293
+ "P",
294
+ "o",
295
+ "d",
296
+ "s",
297
+ " ",
298
+ "c",
299
+ "a",
300
+ "n",
301
+ " ",
302
+ "c",
303
+ "o",
304
+ "n",
305
+ "s",
306
+ "u",
307
+ "m",
308
+ "e",
309
+ " ",
310
+ "C",
311
+ "o",
312
+ "n",
313
+ "f",
314
+ "i",
315
+ "g",
316
+ "M",
317
+ "a",
318
+ "p",
319
+ "s",
320
+ " ",
321
+ "i",
322
+ "n",
323
+ " ",
324
+ "t",
325
+ "h",
326
+ "r",
327
+ "e",
328
+ "e",
329
+ " ",
330
+ "w",
331
+ "a",
332
+ "y",
333
+ "s",
334
+ ":",
335
+ " ",
336
+ "E",
337
+ "n",
338
+ "v",
339
+ "i",
340
+ "r",
341
+ "o",
342
+ "n",
343
+ "m",
344
+ "e",
345
+ "n",
346
+ "t",
347
+ " ",
348
+ "v",
349
+ "a",
350
+ "r",
351
+ "i",
352
+ "a",
353
+ "b",
354
+ "l",
355
+ "e",
356
+ "s",
357
+ ",",
358
+ " ",
359
+ "C",
360
+ "o",
361
+ "m",
362
+ "m",
363
+ "a",
364
+ "n",
365
+ "d",
366
+ "-",
367
+ "l",
368
+ "i",
369
+ "n",
370
+ "e",
371
+ " ",
372
+ "a",
373
+ "r",
374
+ "g",
375
+ "u",
376
+ "m",
377
+ "e",
378
+ "n",
379
+ "t",
380
+ "s",
381
+ ",",
382
+ " ",
383
+ "C",
384
+ "o",
385
+ "n",
386
+ "f",
387
+ "i",
388
+ "g",
389
+ "u",
390
+ "r",
391
+ "a",
392
+ "t",
393
+ "i",
394
+ "o",
395
+ "n",
396
+ " ",
397
+ "f",
398
+ "i",
399
+ "l",
400
+ "e",
401
+ "s",
402
+ " ",
403
+ "i",
404
+ "n",
405
+ " ",
406
+ "a",
407
+ " ",
408
+ "v",
409
+ "o",
410
+ "l",
411
+ "u",
412
+ "m",
413
+ "e",
414
+ "'",
415
+ ";",
416
+ " ",
417
+ "'",
418
+ "C",
419
+ "o",
420
+ "n",
421
+ "f",
422
+ "i",
423
+ "g",
424
+ "M",
425
+ "a",
426
+ "p",
427
+ "s",
428
+ " ",
429
+ "a",
430
+ "r",
431
+ "e",
432
+ " ",
433
+ "d",
434
+ "e",
435
+ "s",
436
+ "i",
437
+ "g",
438
+ "n",
439
+ "e",
440
+ "d",
441
+ " ",
442
+ "f",
443
+ "o",
444
+ "r",
445
+ " ",
446
+ "n",
447
+ "o",
448
+ "n",
449
+ "-",
450
+ "c",
451
+ "o",
452
+ "n",
453
+ "f",
454
+ "i",
455
+ "d",
456
+ "e",
457
+ "n",
458
+ "t",
459
+ "i",
460
+ "a",
461
+ "l",
462
+ ",",
463
+ " ",
464
+ "n",
465
+ "o",
466
+ "n",
467
+ "-",
468
+ "s",
469
+ "e",
470
+ "n",
471
+ "s",
472
+ "i",
473
+ "t",
474
+ "i",
475
+ "v",
476
+ "e",
477
+ " ",
478
+ "d",
479
+ "a",
480
+ "t",
481
+ "a",
482
+ ",",
483
+ " ",
484
+ "s",
485
+ "u",
486
+ "c",
487
+ "h",
488
+ " ",
489
+ "a",
490
+ "s",
491
+ ":",
492
+ " ",
493
+ "A",
494
+ "p",
495
+ "p",
496
+ "l",
497
+ "i",
498
+ "c",
499
+ "a",
500
+ "t",
501
+ "i",
502
+ "o",
503
+ "n",
504
+ " ",
505
+ "c",
506
+ "o",
507
+ "n",
508
+ "f",
509
+ "i",
510
+ "g",
511
+ "u",
512
+ "r",
513
+ "a",
514
+ "t",
515
+ "i",
516
+ "o",
517
+ "n",
518
+ " ",
519
+ "s",
520
+ "e",
521
+ "t",
522
+ "t",
523
+ "i",
524
+ "n",
525
+ "g",
526
+ "s",
527
+ ",",
528
+ " ",
529
+ "E",
530
+ "n",
531
+ "v",
532
+ "i",
533
+ "r",
534
+ "o",
535
+ "n",
536
+ "m",
537
+ "e",
538
+ "n",
539
+ "t",
540
+ "-",
541
+ "s",
542
+ "p",
543
+ "e",
544
+ "c",
545
+ "i",
546
+ "f",
547
+ "i",
548
+ "c",
549
+ " ",
550
+ "p",
551
+ "a",
552
+ "r",
553
+ "a",
554
+ "m",
555
+ "e",
556
+ "t",
557
+ "e",
558
+ "r",
559
+ "s",
560
+ ",",
561
+ " ",
562
+ "C",
563
+ "o",
564
+ "n",
565
+ "f",
566
+ "i",
567
+ "g",
568
+ "u",
569
+ "r",
570
+ "a",
571
+ "t",
572
+ "i",
573
+ "o",
574
+ "n",
575
+ " ",
576
+ "f",
577
+ "i",
578
+ "l",
579
+ "e",
580
+ "s",
581
+ ",",
582
+ " ",
583
+ "F",
584
+ "e",
585
+ "a",
586
+ "t",
587
+ "u",
588
+ "r",
589
+ "e",
590
+ " ",
591
+ "f",
592
+ "l",
593
+ "a",
594
+ "g",
595
+ "s",
596
+ "'",
597
+ ";",
598
+ " ",
599
+ "'",
600
+ "D",
601
+ "o",
602
+ " ",
603
+ "N",
604
+ "O",
605
+ "T",
606
+ " ",
607
+ "s",
608
+ "t",
609
+ "o",
610
+ "r",
611
+ "e",
612
+ " ",
613
+ "s",
614
+ "e",
615
+ "n",
616
+ "s",
617
+ "i",
618
+ "t",
619
+ "i",
620
+ "v",
621
+ "e",
622
+ " ",
623
+ "o",
624
+ "r",
625
+ " ",
626
+ "c",
627
+ "o",
628
+ "n",
629
+ "f",
630
+ "i",
631
+ "d",
632
+ "e",
633
+ "n",
634
+ "t",
635
+ "i",
636
+ "a",
637
+ "l",
638
+ " ",
639
+ "d",
640
+ "a",
641
+ "t",
642
+ "a",
643
+ " ",
644
+ "i",
645
+ "n",
646
+ " ",
647
+ "C",
648
+ "o",
649
+ "n",
650
+ "f",
651
+ "i",
652
+ "g",
653
+ "M",
654
+ "a",
655
+ "p",
656
+ "s",
657
+ ",",
658
+ " ",
659
+ "s",
660
+ "u",
661
+ "c",
662
+ "h",
663
+ " ",
664
+ "a",
665
+ "s",
666
+ ":",
667
+ " ",
668
+ "P",
669
+ "a",
670
+ "s",
671
+ "s",
672
+ "w",
673
+ "o",
674
+ "r",
675
+ "d",
676
+ "s",
677
+ ",",
678
+ " ",
679
+ "A",
680
+ "P",
681
+ "I",
682
+ " ",
683
+ "k",
684
+ "e",
685
+ "y",
686
+ "s",
687
+ ",",
688
+ " ",
689
+ "T",
690
+ "o",
691
+ "k",
692
+ "e",
693
+ "n",
694
+ "s",
695
+ ",",
696
+ " ",
697
+ "C",
698
+ "e",
699
+ "r",
700
+ "t",
701
+ "i",
702
+ "f",
703
+ "i",
704
+ "c",
705
+ "a",
706
+ "t",
707
+ "e",
708
+ "s",
709
+ "'",
710
+ ";",
711
+ " ",
712
+ "'",
713
+ "y",
714
+ "o",
715
+ "u",
716
+ " ",
717
+ "s",
718
+ "h",
719
+ "o",
720
+ "u",
721
+ "l",
722
+ "d",
723
+ " ",
724
+ "u",
725
+ "s",
726
+ "e",
727
+ " ",
728
+ "K",
729
+ "u",
730
+ "b",
731
+ "e",
732
+ "r",
733
+ "n",
734
+ "e",
735
+ "t",
736
+ "e",
737
+ "s",
738
+ " ",
739
+ "S",
740
+ "e",
741
+ "c",
742
+ "r",
743
+ "e",
744
+ "t",
745
+ "s",
746
+ " ",
747
+ "i",
748
+ "n",
749
+ "s",
750
+ "t",
751
+ "e",
752
+ "a",
753
+ "d",
754
+ "'",
755
+ ";",
756
+ " ",
757
+ "'",
758
+ "C",
759
+ "o",
760
+ "n",
761
+ "f",
762
+ "i",
763
+ "g",
764
+ "M",
765
+ "a",
766
+ "p",
767
+ "s",
768
+ " ",
769
+ "h",
770
+ "a",
771
+ "v",
772
+ "e",
773
+ " ",
774
+ "`",
775
+ "d",
776
+ "a",
777
+ "t",
778
+ "a",
779
+ "`",
780
+ " ",
781
+ "a",
782
+ "n",
783
+ "d",
784
+ " ",
785
+ "`",
786
+ "b",
787
+ "i",
788
+ "n",
789
+ "a",
790
+ "r",
791
+ "y",
792
+ "D",
793
+ "a",
794
+ "t",
795
+ "a",
796
+ "`",
797
+ " ",
798
+ "f",
799
+ "i",
800
+ "e",
801
+ "l",
802
+ "d",
803
+ "s",
804
+ "'"
805
+ ],
806
+ "score": 0,
807
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
808
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
809
+ "prompt_seed": 0,
810
+ "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
811
+ "cost_usd": 0.006216,
812
+ "latency_ms": 16998.984744946938,
813
+ "item_id": "k8s_006"
814
+ },
815
+ {
816
+ "dimension": "groundedness",
817
+ "reasoning": "The gold snippet states: 'A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate.' The answer makes numerous claims that go far beyond what is entailed by this single snippet. Specifically: (1) The answer describes Jobs as 'finite or batch tasks to completion' \u2014 not entailed by the snippet. (2) The answer explains completion modes (NonIndexed vs Indexed) with detailed specifications about `.spec.completionMode`, `.spec.completions`, and how each mode determines completion \u2014 none of this is mentioned in the snippet. (3) The answer describes how completion indices are made available through pod annotations, labels, and environment variables \u2014 not entailed. (4) The answer explains `.spec.parallelism` and `.spec.backoffLimit` parameters \u2014 not entailed. (5) The answer states Jobs are 'useful for running tasks that need to run once and then stop, unlike Deployments which run continuously' \u2014 a use-case comparison not in the snippet. While the answer is factually correct about Kubernetes Jobs, the strict-snippet scope rule is clear: claims not entailed by the gold snippet must score 0, even if they are correct in the broader documentation or world knowledge. This answer dramatically extends beyond the single sentence provided, similar to the k8s_006 calibration anchor. The only claim entailed by the snippet is that a Job 'creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate,' which appears in the opening paragraph. Everything else fails strict entailment.",
818
+ "evidence_quotes": [
819
+ "Gold snippet: 'A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate'",
820
+ "Answer claims about completion modes, indexed vs non-indexed, .spec.completionMode, environment variables, parallelism, backoffLimit \u2014 none present in snippet"
821
+ ],
822
+ "score": 0,
823
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
824
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
825
+ "prompt_seed": 0,
826
+ "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
827
+ "cost_usd": 0.0067864,
828
+ "latency_ms": 12302.207177970558,
829
+ "item_id": "k8s_007"
830
+ },
831
+ {
832
+ "dimension": "groundedness",
833
+ "reasoning": "The gold snippet states: 'The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole, RoleBinding and ClusterRoleBinding.' The answer correctly identifies these four object types. However, the answer goes significantly beyond what is entailed by the snippet. The snippet only declares that these four kinds exist; it does not describe their purposes, scopes, or relationships. The answer makes numerous claims not entailed by the snippet: (1) Role 'defines a set of permissions within a specific namespace' and 'contains rules that represent a set of permissions'; (2) ClusterRole is 'cluster-scoped rather than namespace-scoped' and 'can be used to grant permissions across the entire cluster'; (3) RoleBinding 'grants the permissions defined in a Role' and 'connects a Role to subjects'; (4) ClusterRoleBinding 'grants the permissions defined in a ClusterRole at the cluster level'; (5) the distinction that 'Roles and ClusterRoles define what permissions are available, while RoleBindings and ClusterRoleBindings define who has those permissions'; (6) the namespace vs cluster-scoped distinction and its implications. While these claims are factually correct in Kubernetes documentation, they are not entailed by the single-sentence snippet provided. The snippet only names the four objects; it does not describe their functionality, scope, or relationships. This is analogous to calibration anchor k8s_006, where the answer dramatically extends beyond what the snippet entails.",
834
+ "evidence_quotes": [
835
+ "Gold snippet: 'The RBAC API declares four kinds of Kubernetes object: Role, ClusterRole, RoleBinding and ClusterRoleBinding'",
836
+ "Answer claims about Role: 'Defines a set of permissions within a specific namespace. It contains rules that represent a set of permissions.' \u2014 NOT in snippet",
837
+ "Answer claims about ClusterRole: 'Similar to a Role, but is cluster-scoped rather than namespace-scoped. It can be used to grant permissions across the entire cluster.' \u2014 NOT in snippet",
838
+ "Answer claims about RoleBinding: 'Grants the permissions defined in a Role to a user or set of users within a specific namespace. It connects a Role to subjects' \u2014 NOT in snippet",
839
+ "Answer claims about ClusterRoleBinding: 'Similar to a RoleBinding, but grants the permissions defined in a ClusterRole at the cluster level.' \u2014 NOT in snippet",
840
+ "Answer meta-claim: 'Roles and ClusterRoles define what permissions are available, while RoleBindings and ClusterRoleBindings define who has those permissions.' \u2014 NOT in snippet"
841
+ ],
842
+ "score": 0,
843
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
844
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
845
+ "prompt_seed": 0,
846
+ "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
847
+ "cost_usd": 0.0070079999999999995,
848
+ "latency_ms": 10670.279069978278,
849
+ "item_id": "k8s_009"
850
+ },
851
+ {
852
+ "dimension": "groundedness",
853
+ "reasoning": "The gold snippet provided is incomplete: '[1] Secrets can be mounted as data volumes or exposed as'. It cuts off mid-sentence and provides no substantive information about the behavioral differences between these two approaches. The answer makes numerous specific claims about automatic updates, dynamic refresh, subPath mount limitations, static environment variables, pod restart requirements, character restrictions, and use-case recommendations. None of these claims are entailed by the incomplete snippet fragment, which merely states that Secrets *can* be mounted or exposed without elaborating on how they differ or what their characteristics are. Under strict reference scope (gold snippets only) and the binary rubric requiring every claim to be entailed, this answer fails because the snippet provides no content to ground any of the detailed behavioral claims made.",
854
+ "evidence_quotes": [
855
+ "Gold snippet: '[1] Secrets can be mounted as data volumes or exposed as' (incomplete/truncated)"
856
+ ],
857
+ "score": 0,
858
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
859
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
860
+ "prompt_seed": 0,
861
+ "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
862
+ "cost_usd": 0.0053408,
863
+ "latency_ms": 11916.670492035337,
864
+ "item_id": "k8s_011"
865
+ },
866
+ {
867
+ "dimension": "groundedness",
868
+ "reasoning": "The gold snippet states only: 'If you set the `emptyDir.medium` field to `\"Memory\"`, Kubernetes mounts a tmpfs'. This is a single factual claim about what happens when medium is set to Memory. The answer makes numerous claims beyond this snippet: (1) default emptyDir storage location and behavior, (2) performance characteristics of both modes, (3) memory impact and counting against container limits, (4) size limit specifications and node allocatable memory, (5) resource considerations about memory cost and capacity. While many of these claims may be factually correct in Kubernetes documentation, they are NOT entailed by the provided gold snippet. The snippet only establishes that setting medium to 'Memory' results in tmpfs mounting. Claims about default behavior, memory accounting, size limits, and resource considerations go dramatically beyond what the snippet entails. This is analogous to the k8s_006 calibration anchor where the answer extends far beyond the single-sentence snippet provided.",
869
+ "evidence_quotes": [
870
+ "Gold snippet: 'If you set the `emptyDir.medium` field to `\"Memory\"`, Kubernetes mounts a tmpfs'",
871
+ "Answer claims not in snippet: storage location for default emptyDir, performance characteristics, memory impact and counting against limits, size limit behavior, node allocatable memory sizing, resource considerations"
872
+ ],
873
+ "score": 0,
874
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
875
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
876
+ "prompt_seed": 0,
877
+ "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
878
+ "cost_usd": 0.005747200000000001,
879
+ "latency_ms": 30587.856588943396,
880
+ "item_id": "k8s_012"
881
+ },
882
+ {
883
+ "dimension": "groundedness",
884
+ "reasoning": "The answer makes several claims about liveness and readiness probes. Checking against the gold snippets:\n\nSnippet [1]: 'uses liveness probes to know when to restart a container'\nSnippet [2]: 'uses readiness probes to know when a container is ready to start accepting traffic'\n\nThe answer claims:\n1. Liveness probe failure \u2192 kubelet kills container and restarts it\n2. Readiness probe failure \u2192 kubelet continues running container, doesn't kill it\n3. Readiness probe failure \u2192 pod won't receive traffic from Services\n4. Liveness probe is for detecting unrecoverable failures\n5. Readiness probe is for detecting temporary unavailability\n6. Best practice details about using both probes together with same endpoint, higher failureThreshold for liveness\n\nSnippet [1] only states that liveness probes are used 'to know when to restart a container' - it entails that liveness probes trigger restarts.\n\nSnippet [2] only states that readiness probes are used 'to know when a container is ready to start accepting traffic' - it entails that readiness probes relate to traffic acceptance.\n\nHowever, the answer makes many claims NOT entailed by these minimal snippets:\n- The specific behavior that readiness probe failure keeps the container running (not stated in snippet [2])\n- The detail about Pod's Ready condition being set to false\n- The use-case explanations (unrecoverable failures, temporary unavailability)\n- The best practice section about using both together, same endpoint, failureThreshold settings\n- The terminationGracePeriodSeconds detail\n\nWhile these claims are factually correct in Kubernetes documentation, they are NOT entailed by the two provided snippets. The snippets only state the basic purpose of each probe type, not the detailed behavioral differences or best practices described in the answer.\n\nUnder strict reference scope (gold snippets only), this answer goes far beyond what is entailed.",
885
+ "evidence_quotes": [
886
+ "Snippet [1]: 'uses liveness probes to know when to restart a container'",
887
+ "Snippet [2]: 'uses readiness probes to know when a container is ready to start accepting traffic'",
888
+ "Answer claims: 'The kubelet kills the container and restarts it', 'The kubelet continues running the container', 'The pod will not receive traffic from Kubernetes Services', 'Use case: Detecting unrecoverable application failures', 'Use case: Indicating that a container is temporarily unable to handle traffic', 'A common pattern is to use both probes together'"
889
+ ],
890
+ "score": 0,
891
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
892
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
893
+ "prompt_seed": 0,
894
+ "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
895
+ "cost_usd": 0.007291200000000001,
896
+ "latency_ms": 26607.242761005182,
897
+ "item_id": "k8s_013"
898
+ },
899
+ {
900
+ "dimension": "groundedness",
901
+ "reasoning": "The answer makes numerous claims beyond what is entailed by the two gold snippets. The snippets only state: (1) NodePort 'Exposes the Service on each Node' and (2) LoadBalancer 'Exposes the Service externally using an external load balancer.' The answer adds many ungrounded claims including: the NodePort range (30000-32767), that NodePort creates a cluster IP internally, that external traffic accesses via <NodeIP>:<NodePort>, that LoadBalancer requires cloud provider integration, that LoadBalancer creates a NodePort internally, the optional `spec.allocateLoadBalancerNodePorts: false` configuration, and various use-case recommendations (development/testing vs production). While these claims may be factually correct in Kubernetes documentation, they are not entailed by the gold snippets provided. The strict-reference-scope rule requires that claims be entailed by the gold snippets only, not by broader domain knowledge or the underlying documentation. This is analogous to calibration anchor k8s_006, where correct information beyond the snippet scope scores 0.",
902
+ "evidence_quotes": [
903
+ "Gold snippet [1]: 'Exposes the Service on each Node'",
904
+ "Gold snippet [2]: 'Exposes the Service externally using an external load balancer'",
905
+ "Answer claims not in snippets: NodePort range (30000-32767), cluster IP creation, <NodeIP>:<NodePort> access pattern, cloud provider integration requirement, LoadBalancer creates NodePort internally, spec.allocateLoadBalancerNodePorts configuration, development vs production suitability"
906
+ ],
907
+ "score": 0,
908
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
909
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
910
+ "prompt_seed": 0,
911
+ "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
912
+ "cost_usd": 0.0065256,
913
+ "latency_ms": 11835.180599999148,
914
+ "item_id": "k8s_014"
915
+ },
916
+ {
917
+ "dimension": "groundedness",
918
+ "reasoning": "The answer makes several claims that must be checked against the two gold snippets: [1] 'Jobs represent one-off tasks that run to completion and then stop' and [2] 'A CronJob starts one-time Jobs on a repeating schedule'. The answer's core claims are: (1) Jobs run once when created; (2) CronJobs run on a repeating schedule; (3) Jobs are for one-time batch tasks; (4) CronJobs are for recurring tasks; (5) CronJobs create Job objects approximately once per scheduled execution; (6) Kubernetes doesn't guarantee exactly-once semantics; (7) Two Jobs might be created for a single schedule; (8) No Job might be created for a scheduled time; (9) Jobs created by CronJobs should be idempotent; (10) You can set startingDeadlineSeconds on a CronJob; (11) startingDeadlineSeconds controls what happens if a scheduled Job is missed. Claims 1-4 are directly entailed by the snippets (paraphrase of 'one-off tasks' and 'repeating schedule'). However, claims 5-11 regarding exactly-once semantics, idempotency requirements, the possibility of multiple or zero Job creations, and the startingDeadlineSeconds mechanism are NOT entailed by the gold snippets. These are factually correct Kubernetes behaviors but are not present in the provided reference material. The strict-entailment rubric requires that every claim be entailed by the gold snippets only. The answer goes well beyond what the snippets support, similar to the k8s_006 calibration anchor where correct information from the broader documentation fails because it's not in the gold snippet scope.",
919
+ "evidence_quotes": [
920
+ "Gold snippet [1]: 'Jobs represent one-off tasks that run to completion and then stop'",
921
+ "Gold snippet [2]: 'A CronJob starts one-time Jobs on a repeating schedule'",
922
+ "Answer claims not in snippets: 'Kubernetes doesn't guarantee exactly-once semantics', 'Two Jobs might be created for a single schedule', 'No Job might be created for a scheduled time', 'Jobs created by CronJobs should be idempotent', 'You can set startingDeadlineSeconds on a CronJob'"
923
+ ],
924
+ "score": 0,
925
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
926
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
927
+ "prompt_seed": 0,
928
+ "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
929
+ "cost_usd": 0.007357600000000001,
930
+ "latency_ms": 27444.345601077657,
931
+ "item_id": "k8s_015"
932
+ },
933
+ {
934
+ "dimension": "groundedness",
935
+ "reasoning": "The answer makes several claims about Deployments and DaemonSets. Checking against the gold snippets:\n\nSnippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'\nSnippet [2]: 'DaemonSet ensures that all (or some) Nodes run a copy of a Pod'\n\nClaims in the answer:\n1. Deployment manages 'specified number of Pod replicas' - entailed by [1] ('manages a set of Pods')\n2. Deployment Pods run on 'any eligible nodes' - NOT entailed by snippet [1]\n3. Deployment 'uses the standard Kubernetes scheduler' - NOT entailed by snippets\n4. Deployment 'Pods are distributed across nodes based on availability and scheduling policies' - NOT entailed by snippets\n5. DaemonSet 'ensures that all eligible nodes (or a subset based on node selectors/affinity) run exactly one copy' - partially entailed by [2] ('all (or some) Nodes run a copy of a Pod'), though [2] doesn't specify 'exactly one'\n6. DaemonSet controller 'creates a Pod for each eligible node and adds nodeAffinity rules' - NOT entailed by [2]\n7. DaemonSet uses '.spec.nodeName field' - NOT entailed by [2]\n8. Use cases section - NOT entailed by snippets\n\nThe answer goes far beyond what the snippets entail. While some core concepts are present (Deployment manages Pods, DaemonSet runs on nodes), the answer adds substantial detail about scheduling mechanisms, nodeAffinity, nodeName fields, and use cases that are not in the gold snippets. Under strict snippet entailment, this fails the binary rubric.",
936
+ "evidence_quotes": [
937
+ "Snippet [1]: 'A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state'",
938
+ "Snippet [2]: 'DaemonSet ensures that all (or some) Nodes run a copy of a Pod'",
939
+ "Answer claims about 'standard Kubernetes scheduler', 'nodeAffinity rules', '.spec.nodeName field', and use cases are not entailed by the snippets"
940
+ ],
941
+ "score": 0,
942
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
943
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
944
+ "prompt_seed": 0,
945
+ "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
946
+ "cost_usd": 0.006692,
947
+ "latency_ms": 14147.950275044423,
948
+ "item_id": "k8s_016"
949
+ },
950
+ {
951
+ "dimension": "groundedness",
952
+ "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt33D8HsUdB7oGVqJ1A'}",
953
+ "evidence_quotes": [],
954
+ "score": "Unknown",
955
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
956
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
957
+ "prompt_seed": 0,
958
+ "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
959
+ "cost_usd": 0.0,
960
+ "latency_ms": 34940.02001499757,
961
+ "item_id": "k8s_017"
962
+ },
963
+ {
964
+ "dimension": "groundedness",
965
+ "reasoning": "The answer makes multiple specific claims about the `autoscaling/v2` API version and HorizontalPodAutoscaler functionality. However, the gold snippets provided are extremely minimal: [1] states that HorizontalPodAutoscaler 'automatically updates a workload resource' and [2] is incomplete ('The current stable version can be found in the'). Neither snippet entails the detailed claims in the answer about: (a) autoscaling/v2 being the current stable version since Kubernetes v1.23, (b) support for custom metrics, (c) support for memory metrics, (d) the ability to specify multiple metrics, (e) configurable scaling behavior with separate scale-up/scale-down behaviors, (f) autoscaling/v1 only supporting CPU-based scaling, or (g) the comparison between v1 and v2 features. While these claims may be factually correct in the broader Kubernetes documentation, they are not entailed by the gold snippets provided. The strict reference scope is the gold snippets only, and the answer dramatically exceeds what can be inferred from them. This is analogous to the k8s_006 calibration anchor: most of the answer is well-supported by the broader corpus, but the gold-snippet scope is what the rubric measures.",
966
+ "evidence_quotes": [
967
+ "Gold snippet [1]: 'HorizontalPodAutoscaler* automatically updates a workload resource'. Gold snippet [2]: 'The current stable version can be found in the' (incomplete). Neither snippet supports claims about API versions, custom metrics, memory metrics, multiple metrics, scaling behaviors, or version comparisons."
1280
+ ],
1281
+ "score": 0,
1282
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
1283
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
1284
+ "prompt_seed": 0,
1285
+ "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
1286
+ "cost_usd": 0.005912000000000001,
1287
+ "latency_ms": 29024.256383010652,
1288
+ "item_id": "k8s_018"
1289
+ },
1290
+ {
1291
+ "dimension": "groundedness",
1292
+ "reasoning": "The gold snippet states only that 'A ConfigMap is an API object used to store non-confidential data in key-value pairs.' The answer provides four detailed mechanisms for consuming ConfigMap values (environment variables individual, environment variables bulk, volume mounts, and Kubernetes API access), along with YAML examples, limitations, update behavior, and a comparison table. None of these consumption methods, technical details, YAML syntax, update semantics, or use-case guidance are entailed by the single-sentence snippet. While these claims are factually correct in Kubernetes documentation, they are not entailed by the gold snippet provided. This is analogous to calibration anchor k8s_006: the answer goes dramatically beyond what the snippet supports, covering consumption methods, schema details, and operational guidance that the snippet does not mention. The strict-conjunction rule applies: even though the answer is well-supported by the broader Kubernetes corpus, the gold-snippet scope is what the rubric measures.",
1293
+ "evidence_quotes": [
1294
+ "Gold snippet: 'A ConfigMap is an API object used to store non-confidential data in key-value pairs'",
1295
+ "Answer claims not in snippet: 'four main mechanisms', 'Environment Variables (Individual)', 'Environment Variables (Bulk)', 'Volume Mounts (Files)', 'Kubernetes API (Direct Access)', YAML examples, 'not updated automatically', 'supports dynamic updates', 'propagation delays', 'subPath volume mounts will not receive ConfigMap updates', comparison table with update behavior"
1296
+ ],
1297
+ "score": 0,
1298
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
1299
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
1300
+ "prompt_seed": 0,
1301
+ "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
1302
+ "cost_usd": 0.0060544,
1303
+ "latency_ms": 21910.545678983908,
1304
+ "item_id": "k8s_019"
1305
+ },
1306
+ {
1307
+ "dimension": "groundedness",
1308
+ "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50 requests per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt3cHxhyPgmbXLYt5AS'}",
1309
+ "evidence_quotes": [],
1310
+ "score": "Unknown",
1311
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
1312
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
1313
+ "prompt_seed": 0,
1314
+ "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
1315
+ "cost_usd": 0.0022536000000000006,
1316
+ "latency_ms": 36177.12139704963,
1317
+ "item_id": "k8s_020"
1318
+ },
1319
+ {
1320
+ "dimension": "groundedness",
1321
+ "reasoning": "The gold snippet states: 'purely additive (there are no \"deny\" rules)'. This establishes that standard Kubernetes RBAC is allow-based only with no deny rules. The answer's opening claim that 'Standard Kubernetes RBAC does **not have explicit deny rules**' and 'RBAC is **allow-based only**' are directly entailed by this snippet. However, the answer then introduces Option 2 (CEL-based Authorization Policy) and Option 3 (Admission Controllers) as mechanisms to implement explicit deny rules. These options are presented as valid approaches to achieve deny functionality in Kubernetes. The gold snippet does not mention CEL-based authorization policies, AuthorizationPolicy resources, or admission controllers as ways to implement deny rules. While these may be factually correct in the broader Kubernetes ecosystem, they are not entailed by the gold snippet provided. The snippet only establishes what standard RBAC is (allow-based, no deny rules), not what alternative mechanisms exist to add deny functionality. The answer goes beyond the scope of the gold snippet by introducing solutions not mentioned in the reference material. This violates the strict-entailment requirement that claims must be entailed by the gold snippets only, not by general LLM knowledge.",
1322
+ "evidence_quotes": [
1323
+ "Gold snippet: 'purely additive (there are no \"deny\" rules)'",
1324
+ "Answer claim: 'Option 2: Use CEL-based Authorization Policy (Kubernetes 1.26+)' with detailed YAML examples",
1325
+ "Answer claim: 'Option 3: Use Admission Controllers' with ValidatingAdmissionPolicy examples"
1326
+ ],
1327
+ "score": 0,
1328
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
1329
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
1330
+ "prompt_seed": 0,
1331
+ "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
1332
+ "cost_usd": 0.0063864,
1333
+ "latency_ms": 14159.782970964443,
1334
+ "item_id": "k8s_022"
1335
+ },
1336
+ {
1337
+ "dimension": "groundedness",
1338
+ "reasoning": "",
1339
+ "evidence_quotes": [],
1340
+ "score": 1,
1341
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
1342
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
1343
+ "prompt_seed": 0,
1344
+ "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
1345
+ "cost_usd": 0.001712,
1346
+ "latency_ms": 558.5277889622375,
1347
+ "item_id": "k8s_023"
1348
+ },
1349
+ {
1350
+ "dimension": "groundedness",
1351
+ "reasoning": "The gold snippets provided are minimal: [1] 'Exposes the Service on a cluster-internal IP' and [2] 'Exposes the Service externally using an external load balancer.' Snippet [1] describes ClusterIP (internal exposure), and snippet [2] describes LoadBalancer (external exposure). The answer makes numerous claims about NodePort, LoadBalancer, ExternalIPs, ClusterIP, and Ingress/Gateway resources. Most of these claims are not entailed by the two snippets. Specifically: (1) The detailed description of NodePort (allocating ports on every node, connecting via node IP and port, traffic forwarding) is not mentioned in either snippet. (2) The claim about LoadBalancer 'automatically provisions an external load balancer' and 'provides a stable external IP address' goes beyond snippet [2], which only states it 'exposes the Service externally using an external load balancer.' (3) The entire section on ExternalIPs is not covered by either snippet. (4) The ClusterIP section's claim that it 'only exposes the Service on a cluster-internal IP' is a paraphrase of snippet [1], which is grounded. (5) The claim about using Ingress or Gateway resources to expose ClusterIP is not entailed by either snippet. Under the strict-conjunction rule, a single ungrounded claim fails the binary rubric. This answer contains multiple ungrounded claims beyond what the snippets entail.",
1352
+ "evidence_quotes": [
1353
+ "Snippet [1]: 'Exposes the Service on a cluster-internal IP'",
1354
+ "Snippet [2]: 'Exposes the Service externally using an external load balancer'",
1355
+ "Answer claims about NodePort (allocation, node IP access, traffic forwarding) are not in snippets",
1356
+ "Answer claims about LoadBalancer details (automatic provisioning, stable external IP) exceed snippet [2]",
1357
+ "Answer section on ExternalIPs is entirely absent from snippets",
1358
+ "Answer claims about Ingress/Gateway are not entailed by snippets"
1359
+ ],
1360
+ "score": 0,
1361
+ "judge_id": "claude-haiku-4-5-20251001_groundedness",
1362
+ "rubric_version": "bf4f9de02373061079765f00e06d970bbc60b2c5f376a8132d68cf4be82b120b",
1363
+ "prompt_seed": 0,
1364
+ "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
1365
+ "cost_usd": 0.0067864,
1366
+ "latency_ms": 20653.504143992905,
1367
+ "item_id": "k8s_025"
1368
+ },
1369
+ {
1370
+ "dimension": "relevance",
1371
+ "reasoning": "",
1372
+ "evidence_quotes": [],
1373
+ "score": 2,
1374
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1375
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1376
+ "prompt_seed": 0,
1377
+ "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
1378
+ "cost_usd": 0.0008928,
1379
+ "latency_ms": 5275.652518961579,
1380
+ "item_id": "q006"
1381
+ },
1382
+ {
1383
+ "dimension": "relevance",
1384
+ "reasoning": "",
1385
+ "evidence_quotes": [],
1386
+ "score": 2,
1387
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1388
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1389
+ "prompt_seed": 0,
1390
+ "system_output_hash": "614e55fa482638a470a21120af307cbf65e5ed64380882e3addbd99d996a3930",
1391
+ "cost_usd": 0.000776,
1392
+ "latency_ms": 613.2173900259659,
1393
+ "item_id": "q010"
1394
+ },
1395
+ {
1396
+ "dimension": "relevance",
1397
+ "reasoning": "",
1398
+ "evidence_quotes": [],
1399
+ "score": 2,
1400
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1401
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1402
+ "prompt_seed": 0,
1403
+ "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
1404
+ "cost_usd": 0.0007768,
1405
+ "latency_ms": 4376.481520943344,
1406
+ "item_id": "q011"
1407
+ },
1408
+ {
1409
+ "dimension": "relevance",
1410
+ "reasoning": "",
1411
+ "evidence_quotes": [],
1412
+ "score": 2,
1413
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1414
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1415
+ "prompt_seed": 0,
1416
+ "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
1417
+ "cost_usd": 0.0008864000000000001,
1418
+ "latency_ms": 8851.143473002594,
1419
+ "item_id": "q012"
1420
+ },
1421
+ {
1422
+ "dimension": "relevance",
1423
+ "reasoning": "",
1424
+ "evidence_quotes": [],
1425
+ "score": 2,
1426
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1427
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1428
+ "prompt_seed": 0,
1429
+ "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
1430
+ "cost_usd": 0.0006552000000000001,
1431
+ "latency_ms": 6161.781317030545,
1432
+ "item_id": "q021"
1433
+ },
1434
+ {
1435
+ "dimension": "relevance",
1436
+ "reasoning": "",
1437
+ "evidence_quotes": [],
1438
+ "score": 2,
1439
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1440
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1441
+ "prompt_seed": 0,
1442
+ "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
1443
+ "cost_usd": 0.001404,
1444
+ "latency_ms": 1018.7896409770474,
1445
+ "item_id": "q023"
1446
+ },
1447
+ {
1448
+ "dimension": "relevance",
1449
+ "reasoning": "",
1450
+ "evidence_quotes": [],
1451
+ "score": 2,
1452
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1453
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1454
+ "prompt_seed": 0,
1455
+ "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
1456
+ "cost_usd": 0.001404,
1457
+ "latency_ms": 22714.352431998122,
1458
+ "item_id": "q025"
1459
+ },
1460
+ {
1461
+ "dimension": "relevance",
1462
+ "reasoning": "",
1463
+ "evidence_quotes": [],
1464
+ "score": 2,
1465
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1466
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1467
+ "prompt_seed": 0,
1468
+ "system_output_hash": "7d1fa1afe474dc2cf5944be153e9151584f9ce66aa78f804fd8e225c3936ad1e",
1469
+ "cost_usd": 0.0009288,
1470
+ "latency_ms": 4240.290573972743,
1471
+ "item_id": "q027"
1472
+ },
1473
+ {
1474
+ "dimension": "relevance",
1475
+ "reasoning": "",
1476
+ "evidence_quotes": [],
1477
+ "score": 2,
1478
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1479
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1480
+ "prompt_seed": 0,
1481
+ "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
1482
+ "cost_usd": 0.00088,
1483
+ "latency_ms": 4285.718351020478,
1484
+ "item_id": "k8s_001"
1485
+ },
1486
+ {
1487
+ "dimension": "relevance",
1488
+ "reasoning": "",
1489
+ "evidence_quotes": [],
1490
+ "score": 2,
1491
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1492
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1493
+ "prompt_seed": 0,
1494
+ "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
1495
+ "cost_usd": 0.001404,
1496
+ "latency_ms": 6870.535210997332,
1497
+ "item_id": "k8s_002"
1498
+ },
1499
+ {
1500
+ "dimension": "relevance",
1501
+ "reasoning": "",
1502
+ "evidence_quotes": [],
1503
+ "score": 2,
1504
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1505
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1506
+ "prompt_seed": 0,
1507
+ "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
1508
+ "cost_usd": 0.0011152,
1509
+ "latency_ms": 2953.8072769646533,
1510
+ "item_id": "k8s_003"
1511
+ },
1512
+ {
1513
+ "dimension": "relevance",
1514
+ "reasoning": "",
1515
+ "evidence_quotes": [],
1516
+ "score": 2,
1517
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1518
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1519
+ "prompt_seed": 0,
1520
+ "system_output_hash": "09b4cf08124a393533ba34d779fc4729c7c9b9e3b3b488d04bbcf782354a6437",
1521
+ "cost_usd": 0.0010424000000000002,
1522
+ "latency_ms": 4848.448178963736,
1523
+ "item_id": "k8s_004"
1524
+ },
1525
+ {
1526
+ "dimension": "relevance",
1527
+ "reasoning": "",
1528
+ "evidence_quotes": [],
1529
+ "score": 2,
1530
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1531
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1532
+ "prompt_seed": 0,
1533
+ "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
1534
+ "cost_usd": 0.0008928,
1535
+ "latency_ms": 10320.249837997835,
1536
+ "item_id": "k8s_005"
1537
+ },
1538
+ {
1539
+ "dimension": "relevance",
1540
+ "reasoning": "",
1541
+ "evidence_quotes": [],
1542
+ "score": 2,
1543
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1544
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1545
+ "prompt_seed": 0,
1546
+ "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
1547
+ "cost_usd": 0.0008608000000000001,
1548
+ "latency_ms": 8522.756394988392,
1549
+ "item_id": "k8s_006"
1550
+ },
1551
+ {
1552
+ "dimension": "relevance",
1553
+ "reasoning": "",
1554
+ "evidence_quotes": [],
1555
+ "score": 2,
1556
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1557
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1558
+ "prompt_seed": 0,
1559
+ "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
1560
+ "cost_usd": 0.0009432,
1561
+ "latency_ms": 6325.502372987103,
1562
+ "item_id": "k8s_007"
1563
+ },
1564
+ {
1565
+ "dimension": "relevance",
1566
+ "reasoning": "",
1567
+ "evidence_quotes": [],
1568
+ "score": 2,
1569
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1570
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1571
+ "prompt_seed": 0,
1572
+ "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
1573
+ "cost_usd": 0.0007928000000000001,
1574
+ "latency_ms": 1811.099338985514,
1575
+ "item_id": "k8s_009"
1576
+ },
1577
+ {
1578
+ "dimension": "relevance",
1579
+ "reasoning": "",
1580
+ "evidence_quotes": [],
1581
+ "score": 2,
1582
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1583
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1584
+ "prompt_seed": 0,
1585
+ "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
1586
+ "cost_usd": 0.0009128000000000001,
1587
+ "latency_ms": 768.1542619829997,
1588
+ "item_id": "k8s_011"
1589
+ },
1590
+ {
1591
+ "dimension": "relevance",
1592
+ "reasoning": "",
1593
+ "evidence_quotes": [],
1594
+ "score": 2,
1595
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1596
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1597
+ "prompt_seed": 0,
1598
+ "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
1599
+ "cost_usd": 0.0008704000000000001,
1600
+ "latency_ms": 12236.18174099829,
1601
+ "item_id": "k8s_012"
1602
+ },
1603
+ {
1604
+ "dimension": "relevance",
1605
+ "reasoning": "",
1606
+ "evidence_quotes": [],
1607
+ "score": 2,
1608
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1609
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1610
+ "prompt_seed": 0,
1611
+ "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
1612
+ "cost_usd": 0.000932,
1613
+ "latency_ms": 11616.58075498417,
1614
+ "item_id": "k8s_013"
1615
+ },
1616
+ {
1617
+ "dimension": "relevance",
1618
+ "reasoning": "",
1619
+ "evidence_quotes": [],
1620
+ "score": 2,
1621
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1622
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1623
+ "prompt_seed": 0,
1624
+ "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
1625
+ "cost_usd": 0.0009008000000000001,
1626
+ "latency_ms": 4196.989124990068,
1627
+ "item_id": "k8s_014"
1628
+ },
1629
+ {
1630
+ "dimension": "relevance",
1631
+ "reasoning": "",
1632
+ "evidence_quotes": [],
1633
+ "score": 2,
1634
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1635
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1636
+ "prompt_seed": 0,
1637
+ "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
1638
+ "cost_usd": 0.0009376,
1639
+ "latency_ms": 838.445411005523,
1640
+ "item_id": "k8s_015"
1641
+ },
1642
+ {
1643
+ "dimension": "relevance",
1644
+ "reasoning": "",
1645
+ "evidence_quotes": [],
1646
+ "score": 2,
1647
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1648
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1649
+ "prompt_seed": 0,
1650
+ "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
1651
+ "cost_usd": 0.0008352,
1652
+ "latency_ms": 5632.905109028798,
1653
+ "item_id": "k8s_016"
1654
+ },
1655
+ {
1656
+ "dimension": "relevance",
1657
+ "reasoning": "",
1658
+ "evidence_quotes": [],
1659
+ "score": 2,
1660
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1661
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1662
+ "prompt_seed": 0,
1663
+ "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
1664
+ "cost_usd": 0.0009328000000000001,
1665
+ "latency_ms": 2904.8574669868685,
1666
+ "item_id": "k8s_017"
1667
+ },
1668
+ {
1669
+ "dimension": "relevance",
1670
+ "reasoning": "",
1671
+ "evidence_quotes": [],
1672
+ "score": 2,
1673
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1674
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1675
+ "prompt_seed": 0,
1676
+ "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
1677
+ "cost_usd": 0.000872,
1678
+ "latency_ms": 16631.02817395702,
1679
+ "item_id": "k8s_018"
1680
+ },
1681
+ {
1682
+ "dimension": "relevance",
1683
+ "reasoning": "",
1684
+ "evidence_quotes": [],
1685
+ "score": 2,
1686
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1687
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1688
+ "prompt_seed": 0,
1689
+ "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
1690
+ "cost_usd": 0.0011104,
1691
+ "latency_ms": 5025.444047001656,
1692
+ "item_id": "k8s_019"
1693
+ },
1694
+ {
1695
+ "dimension": "relevance",
1696
+ "reasoning": "",
1697
+ "evidence_quotes": [],
1698
+ "score": 2,
1699
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1700
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1701
+ "prompt_seed": 0,
1702
+ "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
1703
+ "cost_usd": 0.000888,
1704
+ "latency_ms": 6382.348418002948,
1705
+ "item_id": "k8s_020"
1706
+ },
1707
+ {
1708
+ "dimension": "relevance",
1709
+ "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt6uVe7LRm3cyR9dtBR'}",
1710
+ "evidence_quotes": [],
1711
+ "score": "Unknown",
1712
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1713
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1714
+ "prompt_seed": 0,
1715
+ "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
1716
+ "cost_usd": 0.0,
1717
+ "latency_ms": 20767.425035010092,
1718
+ "item_id": "k8s_022"
1719
+ },
1720
+ {
1721
+ "dimension": "relevance",
1722
+ "reasoning": "",
1723
+ "evidence_quotes": [],
1724
+ "score": 2,
1725
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1726
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1727
+ "prompt_seed": 0,
1728
+ "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
1729
+ "cost_usd": 0.0007504000000000001,
1730
+ "latency_ms": 5516.492156020831,
1731
+ "item_id": "k8s_023"
1732
+ },
1733
+ {
1734
+ "dimension": "relevance",
1735
+ "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt6wc9SEcDRfPrzZBVp'}",
1736
+ "evidence_quotes": [],
1737
+ "score": "Unknown",
1738
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1739
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1740
+ "prompt_seed": 0,
1741
+ "system_output_hash": "1e8fc6086c8751052c1b22fcc728df75411562f3ecdffa30146931afd47dd37f",
1742
+ "cost_usd": 0.0,
1743
+ "latency_ms": 18978.80935500143,
1744
+ "item_id": "k8s_024"
1745
+ },
1746
+ {
1747
+ "dimension": "relevance",
1748
+ "reasoning": "",
1749
+ "evidence_quotes": [],
1750
+ "score": 2,
1751
+ "judge_id": "claude-haiku-4-5-20251001_relevance",
1752
+ "rubric_version": "6db05bfe4bfc8d25eb11bd9bf1a9b005cad7fc99ecbfd72d8cc3a404a96bf68f",
1753
+ "prompt_seed": 0,
1754
+ "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
1755
+ "cost_usd": 0.0008464000000000001,
1756
+ "latency_ms": 3490.8632279839367,
1757
+ "item_id": "k8s_025"
1758
+ },
1759
+ {
1760
+ "dimension": "completeness",
1761
+ "reasoning": "",
1762
+ "evidence_quotes": [],
1763
+ "score": 2,
1764
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1765
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1766
+ "prompt_seed": 0,
1767
+ "system_output_hash": "3a79cfc1b2704a3bc427751108a08f038b33612329abee296ee3f25610c8e118",
1768
+ "cost_usd": 0.0008888,
1769
+ "latency_ms": 722.3775110323913,
1770
+ "item_id": "q006"
1771
+ },
1772
+ {
1773
+ "dimension": "completeness",
1774
+ "reasoning": "",
1775
+ "evidence_quotes": [],
1776
+ "score": 2,
1777
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1778
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1779
+ "prompt_seed": 0,
1780
+ "system_output_hash": "6d8d2e201916d9c9d4d8f525009acaa8a02280dcd1573b8ecbb7bae461e26eef",
1781
+ "cost_usd": 0.0007648,
1782
+ "latency_ms": 4397.730973025318,
1783
+ "item_id": "q011"
1784
+ },
1785
+ {
1786
+ "dimension": "completeness",
1787
+ "reasoning": "",
1788
+ "evidence_quotes": [],
1789
+ "score": 2,
1790
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1791
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1792
+ "prompt_seed": 0,
1793
+ "system_output_hash": "b2fa2200ac582365a5f2c96fb8bcdc2d9788be5693046a68af870d073779e31b",
1794
+ "cost_usd": 0.0008768000000000001,
1795
+ "latency_ms": 1102.155871980358,
1796
+ "item_id": "q012"
1797
+ },
1798
+ {
1799
+ "dimension": "completeness",
1800
+ "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt7RiL9hz5pVxt333xL'}",
1801
+ "evidence_quotes": [],
1802
+ "score": "Unknown",
1803
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1804
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1805
+ "prompt_seed": 0,
1806
+ "system_output_hash": "32640bd1016bf34227a79195ad181f538bbbe937d3172f21ca733e7c729903de",
1807
+ "cost_usd": 0.0,
1808
+ "latency_ms": 18437.84686899744,
1809
+ "item_id": "q021"
1810
+ },
1811
+ {
1812
+ "dimension": "completeness",
1813
+ "reasoning": "",
1814
+ "evidence_quotes": [],
1815
+ "score": 1,
1816
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1817
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1818
+ "prompt_seed": 0,
1819
+ "system_output_hash": "42df91909038e53d05fe290544b6dbe63c631cedb6464cece79775105a7ddcde",
1820
+ "cost_usd": 0.0014032,
1821
+ "latency_ms": 1967.281456978526,
1822
+ "item_id": "q023"
1823
+ },
1824
+ {
1825
+ "dimension": "completeness",
1826
+ "reasoning": "",
1827
+ "evidence_quotes": [],
1828
+ "score": 2,
1829
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1830
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1831
+ "prompt_seed": 0,
1832
+ "system_output_hash": "e65efe1620df931603666588bea32ab8768035928f43b9170cd30cde25d89715",
1833
+ "cost_usd": 0.0014048,
1834
+ "latency_ms": 4361.171844007913,
1835
+ "item_id": "q025"
1836
+ },
1837
+ {
1838
+ "dimension": "completeness",
1839
+ "reasoning": "",
1840
+ "evidence_quotes": [],
1841
+ "score": 2,
1842
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1843
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1844
+ "prompt_seed": 0,
1845
+ "system_output_hash": "95582498779bbb3574afc12b70b73c8229f2d86aeb2cb02d96fbc44b4661e217",
1846
+ "cost_usd": 0.0008992000000000001,
1847
+ "latency_ms": 821.1477959994227,
1848
+ "item_id": "k8s_001"
1849
+ },
1850
+ {
1851
+ "dimension": "completeness",
1852
+ "reasoning": "",
1853
+ "evidence_quotes": [],
1854
+ "score": 2,
1855
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1856
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1857
+ "prompt_seed": 0,
1858
+ "system_output_hash": "858b5d51052c4491a8340a8676367f07b446db3e8ad1110863e07a23662fa30f",
1859
+ "cost_usd": 0.001436,
1860
+ "latency_ms": 4178.335952979978,
1861
+ "item_id": "k8s_002"
1862
+ },
1863
+ {
1864
+ "dimension": "completeness",
1865
+ "reasoning": "",
1866
+ "evidence_quotes": [],
1867
+ "score": 2,
1868
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1869
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1870
+ "prompt_seed": 0,
1871
+ "system_output_hash": "feb4dfee8e9d49dd2fa61616b515e0be633d8f93d202a1a37a5c88e77803f4f5",
1872
+ "cost_usd": 0.0011384000000000001,
1873
+ "latency_ms": 5098.598277952988,
1874
+ "item_id": "k8s_003"
1875
+ },
1876
+ {
1877
+ "dimension": "completeness",
1878
+ "reasoning": "",
1879
+ "evidence_quotes": [],
1880
+ "score": 2,
1881
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1882
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1883
+ "prompt_seed": 0,
1884
+ "system_output_hash": "bfad2ede0dd56bcbd0a32d9ed0fa9f78bc1eea7ad5364f6f764fd133b60e20f6",
1885
+ "cost_usd": 0.0009088000000000001,
1886
+ "latency_ms": 1836.335435975343,
1887
+ "item_id": "k8s_005"
1888
+ },
1889
+ {
1890
+ "dimension": "completeness",
1891
+ "reasoning": "",
1892
+ "evidence_quotes": [],
1893
+ "score": 2,
1894
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1895
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1896
+ "prompt_seed": 0,
1897
+ "system_output_hash": "578d1632f1f46be8a8f4d45758d433fc223546d7ec92df5ca2d0877f3e8198cd",
1898
+ "cost_usd": 0.0008712,
1899
+ "latency_ms": 20613.944871001877,
1900
+ "item_id": "k8s_006"
1901
+ },
1902
+ {
1903
+ "dimension": "completeness",
1904
+ "reasoning": "",
1905
+ "evidence_quotes": [],
1906
+ "score": 2,
1907
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1908
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1909
+ "prompt_seed": 0,
1910
+ "system_output_hash": "6ed7deff5411307bebfa2f318fa82011fb499b068dc733b77ffd0a16c1776916",
1911
+ "cost_usd": 0.0009632,
1912
+ "latency_ms": 1971.2769520119764,
1913
+ "item_id": "k8s_007"
1914
+ },
1915
+ {
1916
+ "dimension": "completeness",
1917
+ "reasoning": "",
1918
+ "evidence_quotes": [],
1919
+ "score": 2,
1920
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1921
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1922
+ "prompt_seed": 0,
1923
+ "system_output_hash": "acddc826212df9c439cc2185cf54d832a77b89d14f3272f9b7cff9e9949f217a",
1924
+ "cost_usd": 0.0008248000000000001,
1925
+ "latency_ms": 5351.545320998412,
1926
+ "item_id": "k8s_009"
1927
+ },
1928
+ {
1929
+ "dimension": "completeness",
1930
+ "reasoning": "",
1931
+ "evidence_quotes": [],
1932
+ "score": 2,
1933
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1934
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1935
+ "prompt_seed": 0,
1936
+ "system_output_hash": "5f662014659d6aa61d27fd67f348b014f88025d58df9f5f0f2d451714d021cb4",
1937
+ "cost_usd": 0.0009288,
1938
+ "latency_ms": 2994.747666991316,
1939
+ "item_id": "k8s_011"
1940
+ },
1941
+ {
1942
+ "dimension": "completeness",
1943
+ "reasoning": "judge_call_failed_after_retry: ProviderRateLimitError: Rate limited after 3 retries: Error code: 429 - {'type': 'error', 'error': {'type': 'rate_limit_error', 'message': \"This request would exceed your organization's rate limit of 50,000 input tokens per minute (org: d1e9ee34-748f-4bd3-a248-753a7e6a7275, model: claude-haiku-4-5-20251001). For details, refer to: https://docs.claude.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://claude.com/contact-sales to discuss your options for a rate limit increase.\"}, 'request_id': 'req_011Cajt8UJ8fPy6FvgG5MQKo'}",
1944
+ "evidence_quotes": [],
1945
+ "score": "Unknown",
1946
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1947
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1948
+ "prompt_seed": 0,
1949
+ "system_output_hash": "e35bb38c436523fe2336aaa56045152e389e274662fba67633a1e4c39ab743b5",
1950
+ "cost_usd": 0.0,
1951
+ "latency_ms": 19476.016786997207,
1952
+ "item_id": "k8s_012"
1953
+ },
1954
+ {
1955
+ "dimension": "completeness",
1956
+ "reasoning": "",
1957
+ "evidence_quotes": [],
1958
+ "score": 2,
1959
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1960
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1961
+ "prompt_seed": 0,
1962
+ "system_output_hash": "028660796eeb558b1293457bbec76392877d86c0ee859308b20ae90ec1a65566",
1963
+ "cost_usd": 0.0009552,
1964
+ "latency_ms": 574.3700260063633,
1965
+ "item_id": "k8s_013"
1966
+ },
1967
+ {
1968
+ "dimension": "completeness",
1969
+ "reasoning": "",
1970
+ "evidence_quotes": [],
1971
+ "score": 2,
1972
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1973
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1974
+ "prompt_seed": 0,
1975
+ "system_output_hash": "63a0e85b245371ce00082ed8827b0d9efd3c76dac9a3c1de9574df2ff2e097d8",
1976
+ "cost_usd": 0.0009384,
1977
+ "latency_ms": 5021.697896998376,
1978
+ "item_id": "k8s_014"
1979
+ },
1980
+ {
1981
+ "dimension": "completeness",
1982
+ "reasoning": "",
1983
+ "evidence_quotes": [],
1984
+ "score": 2,
1985
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1986
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
1987
+ "prompt_seed": 0,
1988
+ "system_output_hash": "28cce97784ed6be2331cb3757ddc2b93cb558939b96bd271f289c2ae16f55fb6",
1989
+ "cost_usd": 0.000972,
1990
+ "latency_ms": 776.8337430316024,
1991
+ "item_id": "k8s_015"
1992
+ },
1993
+ {
1994
+ "dimension": "completeness",
1995
+ "reasoning": "",
1996
+ "evidence_quotes": [],
1997
+ "score": 2,
1998
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
1999
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
2000
+ "prompt_seed": 0,
2001
+ "system_output_hash": "d5ce2becd7e454321d33605c5d123a1298d16b0bd2a031280161e38ec61263a2",
2002
+ "cost_usd": 0.0008712,
2003
+ "latency_ms": 4251.321821007878,
2004
+ "item_id": "k8s_016"
2005
+ },
2006
+ {
2007
+ "dimension": "completeness",
2008
+ "reasoning": "",
2009
+ "evidence_quotes": [],
2010
+ "score": 2,
2011
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
2012
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
2013
+ "prompt_seed": 0,
2014
+ "system_output_hash": "7dc9ed4e57d4c46d18503075dee17ab44ed9f522465c4c41ce1b4e7c8704e285",
2015
+ "cost_usd": 0.0009608,
2016
+ "latency_ms": 4064.9051589425653,
2017
+ "item_id": "k8s_017"
2018
+ },
2019
+ {
2020
+ "dimension": "completeness",
2021
+ "reasoning": "",
2022
+ "evidence_quotes": [],
2023
+ "score": 2,
2024
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
2025
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
2026
+ "prompt_seed": 0,
2027
+ "system_output_hash": "2954a16f1a00e175ff9e8185698563b44054de6181e3c309a2c38c2c0b8e44f7",
2028
+ "cost_usd": 0.0008984,
2029
+ "latency_ms": 4741.949952032883,
+ "item_id": "k8s_018"
+ },
+ {
+ "dimension": "completeness",
+ "reasoning": "The reference answer identifies two main ways to surface ConfigMap data to Pods: (1) exposing specific keys as environment variables, and (2) mounting as a volume. It also specifies a key difference: volume-mounted data updates in place when ConfigMap changes, while environment variables do not update until Pod restart. The agent's answer covers four mechanisms instead of two. Evaluating against the reference's key points: (1) Environment variables as a method - COVERED (both individual and bulk import shown). (2) Volume mounting as a method - COVERED (with detailed example). (3) Environment variables don't update until Pod restart - COVERED (explicitly stated in sections 1 and 2). (4) Volume-mounted data updates in place when ConfigMap changes - COVERED (stated in section 3). The agent adds two additional mechanisms (Kubernetes API direct access) not in the reference, but the rubric explicitly states 'The judge does not penalize the agent for adding correct extra detail.' All four key points from the reference are present in the answer, though paraphrased and expanded with additional context and examples.",
+ "evidence_quotes": [
+ "Reference point 1 - env vars: Agent states 'ConfigMap key-value pairs can be injected as individual environment variables' and 'All key-value pairs from a ConfigMap can be imported as environment variables'",
+ "Reference point 2 - volume mounting: Agent states 'ConfigMap data can be mounted as files in the container's filesystem' with 'Each key in the ConfigMap becomes a filename'",
+ "Reference point 3 - env vars don't auto-update: Agent explicitly states 'ConfigMaps consumed as environment variables are not updated automatically and require a pod restart'",
+ "Reference point 4 - volume updates in place: Agent states 'This mechanism supports dynamic updates when the ConfigMap changes (though there may be propagation delays)'"
+ ],
+ "score": 2,
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+ "prompt_seed": 0,
+ "system_output_hash": "7761711620ffc8120f1aafdfb0e550fda47a0a70232686f087c45a97877ea6c7",
+ "cost_usd": 0.004544,
+ "latency_ms": 10436.621424974874,
+ "item_id": "k8s_019"
+ },
+ {
+ "dimension": "completeness",
+ "reasoning": "Comparing the agent's answer against the reference's key points: (1) 'By default, Pods are non-isolated \u2014 they accept traffic from any source' - The agent covers this explicitly in the 'Default Baseline' section, stating 'By default, all Pods are non-isolated' for both ingress and egress. (2) 'A Pod becomes isolated as soon as any NetworkPolicy in its namespace selects it via podSelector' - The agent covers this, explaining that isolation occurs when a NetworkPolicy selects the Pod via podSelector. (3) 'only traffic explicitly allowed by the union of NetworkPolicies that select that Pod is permitted' - The agent covers this concept, stating 'only the following connections are allowed' and 'only connections explicitly allowed by the rules...are permitted'. (4) 'NetworkPolicy rules can target ingress, egress, or both' - The agent covers this extensively with separate sections on ingress and egress isolation. (5) 'the CNI plugin is what enforces the policy \u2014 Kubernetes itself does not' - The agent does NOT mention the CNI plugin or clarify that Kubernetes itself does not enforce the policy. This is a notable omission of a key technical point from the reference.",
+ "evidence_quotes": [
+ "Reference: 'By default, Pods are non-isolated' | Agent: 'By default, all Pods are non-isolated for both ingress and egress traffic'",
+ "Reference: 'A Pod becomes isolated as soon as any NetworkPolicy in its namespace selects it via podSelector' | Agent: 'A Pod becomes isolated for ingress if there is any NetworkPolicy that selects the Pod (via podSelector)'",
+ "Reference: 'only traffic explicitly allowed by the union of NetworkPolicies that select that Pod is permitted' | Agent: 'only the following connections are allowed' and 'only connections explicitly allowed by the rules'",
+ "Reference: 'NetworkPolicy rules can target ingress, egress, or both' | Agent: Covers this with separate ingress and egress sections",
+ "Reference: 'the CNI plugin is what enforces the policy \u2014 Kubernetes itself does not' | Agent: No mention of CNI plugin or enforcement mechanism"
+ ],
+ "score": 1,
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+ "prompt_seed": 0,
+ "system_output_hash": "b957c3ed390693ede4acfdf07509200e52bf42dbc86c0c5588400b685a45288b",
+ "cost_usd": 0.0041456,
+ "latency_ms": 9525.302856054623,
+ "item_id": "k8s_020"
+ },
+ {
+ "dimension": "completeness",
+ "reasoning": "The reference answer makes one key point: Kubernetes RBAC does not support deny rules; it is purely additive, and the way to prevent an action is to not grant the permission. The answer to score covers this point explicitly in the opening section ('Standard Kubernetes RBAC does not have explicit deny rules. RBAC is allow-based only') and recommends Option 1 (not granting the delete verb) as the primary solution. However, the answer goes significantly beyond the reference by introducing two additional options (CEL-based Authorization Policy and Admission Controllers) that are not mentioned in the reference. While these additions are correct and helpful, the scoring rubric explicitly states: 'The judge does not penalize the agent for adding correct extra detail (that's a separate concern). Score only on what fraction of the reference's points are present.' The single key point from the reference\u2014that RBAC has no deny rules and you must not grant permissions\u2014is fully covered in the answer.",
+ "evidence_quotes": [
+ "Reference: 'Kubernetes RBAC does not support deny rules...there are no deny rules. To prevent a user from deleting Pods you simply do not grant them a Role that contains the delete verb'",
+ "Answer: 'Standard Kubernetes RBAC does not have explicit deny rules. RBAC is allow-based only...The simplest approach is to not grant the delete verb to the user.'"
+ ],
+ "score": 2,
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+ "prompt_seed": 0,
+ "system_output_hash": "a12fba966149f0e235bc5bc483b748b4693d9f52a215fecdbd8965ff6a9ac7b4",
+ "cost_usd": 0.0041928,
+ "latency_ms": 13928.576675010845,
+ "item_id": "k8s_022"
+ },
+ {
+ "dimension": "completeness",
+ "reasoning": "",
+ "evidence_quotes": [],
+ "score": 2,
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+ "prompt_seed": 0,
+ "system_output_hash": "9f58ab3eaeaaae5e5b500e686040b0c59ec06b789659406b79b32991c489d544",
+ "cost_usd": 0.0007776,
+ "latency_ms": 2871.9944750191644,
+ "item_id": "k8s_023"
+ },
+ {
+ "dimension": "completeness",
+ "reasoning": "The reference answer identifies four key points: (1) NodePort exposes on each Node's IP at a static port, (2) LoadBalancer uses external load balancer from cloud integration, (3) ExternalName maps to external DNS name via CNAME record, and (4) ClusterIP is default/cluster-internal only, with Ingress as alternative. The agent's answer covers: (1) NodePort - fully covered with detailed explanation, (2) LoadBalancer - fully covered, (3) ExternalIPs - this is mentioned but is NOT the same as ExternalName from the reference (ExternalIPs route to cluster nodes; ExternalName maps to external DNS via CNAME), (4) ClusterIP - covered as default/cluster-internal, (5) Ingress - mentioned as alternative. The agent's answer is missing the ExternalName Service type entirely, which is explicitly listed in the reference as one of the three main types that expose applications outside the cluster. The agent substitutes ExternalIPs instead, which is a different mechanism. This represents incomplete coverage of the reference's key points.",
+ "evidence_quotes": [
+ "Reference lists: NodePort, LoadBalancer, ExternalName, and ClusterIP/Ingress alternative",
+ "Agent covers: NodePort (\u2713), LoadBalancer (\u2713), ExternalIPs (\u2717 - not in reference), ClusterIP (\u2713), Ingress (\u2713)",
+ "Missing: ExternalName (maps Service to external DNS name via CNAME record)"
+ ],
+ "score": 1,
+ "judge_id": "claude-haiku-4-5-20251001_completeness",
+ "rubric_version": "c71cdcf39c72489486b81a1306f3e5199cd5e3b6011f530302b6da979ce84f20",
+ "prompt_seed": 0,
+ "system_output_hash": "64bfb5acf94d98b960c9d679463c7852613e55e1ce5883781f50b4b7814d9b3b",
+ "cost_usd": 0.0036983999999999997,
+ "latency_ms": 8602.465078001842,
+ "item_id": "k8s_025"
+ }
+ ]