diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml index 8a4bd632da2a44e5a8103ac64c034d92fcf03373..78ef3552e85b5b201313af0329a16f60669c24de 100644 --- a/.github/workflows/ci.yaml +++ b/.github/workflows/ci.yaml @@ -9,6 +9,12 @@ on: jobs: test: runs-on: ubuntu-latest + # Explicit empty env: prevents accidental dependency on injected + # secrets. Tests use MockProvider and require no API keys; if a + # future test imports a provider that needs a key, it will fail + # in CI and in any contributor fork the same way (no silent + # divergence based on whether secrets are present). + env: {} steps: - uses: actions/checkout@v4 diff --git a/.gitignore b/.gitignore index fc14753b76850448e5195d4fe88d760607cbef9d..dfdfa8c550987d4d13d829e79f94fb155424eb52 100644 --- a/.gitignore +++ b/.gitignore @@ -24,6 +24,13 @@ venv/ logs/ *.jsonl +# Evidence-bearing measurement artifacts referenced from DECISIONS.md. +# Narrow exception to the *.jsonl ignore above. Add new measurement files +# explicitly here so the audit-trail intent stays opt-in. +!measurements/*.jsonl +# Calibration jury/permute sidecars (per-member detail for κ ablation table). +!results/*.jsonl + # Opaque binary artifacts — no PDFs in the repo today, and any that # appear here are almost always local reference material (downloaded # papers, vendor docs) that should not be committed. If a PDF ever diff --git a/DECISIONS.md b/DECISIONS.md index 1ed3a220eb0bb9f04846ecf5c74794a6b46b9b54..4cef71f4bf127f28ff364d1f5432ed7d9570d79e 100644 --- a/DECISIONS.md +++ b/DECISIONS.md @@ -2116,3 +2116,704 @@ the actual container filesystem would have caught it pre-deploy. Such a test is out of scope for v1 (adds ~5 min to CI plus Docker build infrastructure) but is the right long-term mitigation for this class of bug. 
+ +## LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call + +The continuous-score single-call judges in `agent_bench/evaluation/metrics.py` +(`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted +and replaced by the per-dimension Judge layer at +`agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle. + +**Design doc:** `docs/plans/2026-05-04-judge-layer-v1-design.md`. + +**Why this is a supersession, not a refactor.** The new layer differs from +the old on six axes: discrete-anchored scale (vs continuous 0–1), +reasoning-before-score JSON ordering (vs score-first), per-dimension +judges (vs combined faithfulness/correctness), full provenance per call +(judge_id + rubric_version + system_output_hash + prompt_seed; old had +none), composable variance wrappers (rubric_permute, jury — old was +single-call), and an intentional abstain-vs-raise discipline (vs silent +`None` from a bare `except Exception`). + +**Evidence backing the supersession claim** — the calibration κ table +quantifies the new layer's agreement with hand-labels across 6 ablation +rows (baseline + 3 variance ablations + permute + 2-judge jury). The +files defending this entry's claim, by file path: + +- `measurements/2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3 + dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the + literature ceiling). Lands in Phase 10. +- `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`, + `_baseline_no_anchors.json`, `_baseline_no_abstain.json`, + `_permute.json`, `_jury_kappa_weighted.json` — per-row predictions. + Land in Phase 11. +- `docs/_generated/kappa_table.md` — generated κ ablation table copy- + pasted into the writeup. Lands in Phase 11. +- `docs/judge-design.md` — interpretive writeup with the closing + "when NOT to use LLM-judge" position. Lands in Phase 12. 
+ +**Config-knob preservation.** `evaluation.judge_provider` is unchanged +across all 5 YAML configs; new `evaluation.judge_dimensions` field +defaults to the three v1 dimensions. Zero user-facing config migration. + +**Out of scope (v1.1+).** Mistral self-hosted as the third jury member, +Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2 +prompt optimization, citation_faithfulness in the default +judge_dimensions, AC2 sympy-derived parity tests. + +## Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran — 2026-05-05 + +The Opus stress-test pass over the 30 calibration items × 3 dimensions +disagreed with the single-rater human gold on **22 of 30 groundedness +items** (8/30 agreement). Relevance and completeness agreed at 28/30 and +25/30 respectively. The groundedness disagreement is consistent in +direction — every disagreed-on item is `human=1, opus=0` — and has a +single root cause. + +**Root cause: reference-scope drift between rubric author and labeler.** +`agent_bench/evaluation/rubrics/groundedness.md` defines the reference +scope as the gold snippets attached to each item: + +> The judge sees only the gold snippets — not the retrieved chunks. A +> claim that happens to be true in the world but is not entailed by the +> snippets fails groundedness. + +The single-rater notes on the disagreed-on items describe checking +against the broader documentation, not against `source_snippets`: +"supported by the corpus", "supported by the docs", "supported by the +provided dependency snippet". For items like `k8s_006` the gold snippet +is one sentence ("A ConfigMap is an API object used to store +non-confidential data in key-value pairs"), while the agent's answer +correctly synthesizes seven or eight additional claims from the full +`k8s_configmap.md`. Those claims are true in the world and well-supported +by the full doc, but **not entailed by the one snippet**. 
Opus applied +the strict-snippet rubric; the human rater applied a corpus-supported +rubric. + +**Why this blocks `make calibrate` against the current gold.** The κ +ablation compares Haiku and GPT-4o-mini judges against the human gold. +A judge that correctly applies the strict-snippet rubric will disagree +with miscalibrated gold; a judge that's too lenient will agree. The +ablation rewards leniency and punishes rigor — the opposite of the +intended measurement. This is the same failure mode codified earlier in +this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune +compensation when the measurement is wrong, not the intended effect. + +**Why the rubric stays as written, not relaxed to "corpus-supported".** +Strict-snippet groundedness measures *RAG behavior*: did the agent +synthesize from what it retrieved? Corpus-supported groundedness +measures *LLM general knowledge passing through a RAG harness*: did the +agent happen to be correct? The first is what this benchmark is for; +the second is what `agent_bench/evaluation/metrics.py` measured before +supersession. Relaxing the rubric to "corpus-supported" would silently +re-introduce the failure mode the supersession entry above just removed. + +**Decision — three-step correction lands before `make calibrate` runs:** + +1. **Rubric clarification commit on `agent_bench/evaluation/rubrics/groundedness.md`.** + Add an explicit reference-scope line and one anchored example + contrasting "supported by the snippet" vs "true in the world but + not in the snippet". Audit-trail requirement: the v1.1 writeup will + cite "rubric clarified between v1.0 and v1.1", and the git history + needs to back that claim. +2. **Re-label the 22 disagreed-on groundedness items** in + `measurements/2026-05-04-judge-calibration-labels.jsonl` against the + clarified rubric, snippet-only. **Do not mechanically copy Opus's + labels.** The labels remain the human single-rater's; what changes is + the rubric being applied. 
Mechanical copy would turn the κ table + into "judge vs Opus", which is not what the writeup claims it + measures. +3. **Recompute `make calibrate` against the corrected gold** and emit + `docs/_generated/kappa_table.md` from the v1.1 labels. + +**Evidence files for the v1.1 writeup section:** + +- `measurements/2026-05-05-judge-rubric-opus-stress.jsonl` — 90 Opus + labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min + org rate limit, ~$0.20, ~14 min wall, zero infra-abstains). +- `measurements/2026-05-04-judge-calibration-labels.jsonl` — original + v1.0 single-rater gold; will be diffed against v1.1 corrected gold to + quantify the re-label delta. +- `agent_bench/evaluation/rubrics/groundedness.md` — pre/post diff is + the rubric clarification. + +**Pre-labeling observations also worth recording for the writeup +methodology section:** + +- `q021` (fastapi · calculation) answered the CORS preflight question + correctly (600 / 60 = 10 minutes) with `sources: []` and + `ranked_sources: []` — the agent did the arithmetic without retrieval + and emitted an answer consistent with the snippet without having + retrieved it. Methodologically interesting for the + citation-faithfulness story (Block 2.7) if it ships: an answer can be + correct without being grounded-by-citation. +- `q025` (fastapi · multi_hop) answer was truncated mid-token by the + orchestrator's max_tokens limit. The labels reflect what the system + produced, not a mentally-patched complete version. The completeness + rubric does not currently anchor "truncated response" as a level — + v1.1 rubric work should add an anchor. +- Several K8s items embed external knowledge that's correct but not in + the snippet phrasing (`k8s_017` mentions exit-code-0 for init-container + success; `k8s_009` describes Roles vs ClusterRoles by their semantics). 
+ The clarified groundedness rubric should pick **strict** on this case + (claim must be supportable by the retrieved spans, not just consistent + with them) and the anchored example should show that ruling. + +**Methodology framing for the writeup.** The Opus stress-test was added +specifically to catch hand-labeled-gold fragility before the κ table is +published. It caught it. The writeup's calibration section should +disclose the rubric clarification, quantify the re-label delta on +groundedness, and report κ against the v1.1 corrected gold — that is a +more credible story than a first-try clean κ table would have been. + +**Outcome — 2026-05-05 calibrate run on v1.1 gold.** All 6 ablation rows +ran cleanly after three coupled production-code fixes that landed on the +same branch as the rubric clarification: (1) markdown fence stripping in +`agent_bench/evaluation/judges/base.py::_strip_markdown_fence` because +Haiku 4.5 wraps JSON output in ` ```json ... ``` `, (2) `max_tokens` +512 → 1024 because v1.1 anchored examples elicit longer model reasoning, +(3) calibration runner v1.0 omitted `item_id` from prediction records; +fixed in v1.1 with backfill of the 6 already-written row files via +`hash → item_id` map (no re-spend). Probe-one-cell-before-sweep saved a +fourth $0.50 wasted run after the fence-strip change — the methodology +note in `feedback_judge_probe_before_sweep.md` was earned by this +session's two failed full-row attempts that paid ~$1.15 for unparseable +output before the diagnosis converged. + +The κ table at `docs/_generated/kappa_table.md` (regenerated on +2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for +completeness — see report.py `_DIM_METRIC`) shows three findings +that the writeup interprets rather than reports verbatim: + +**v1.1 finding 1 — relevance is not "judges fail" territory.** +Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the +29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00. 
+AC1 is the load-bearing statistic on relevance and groundedness; both +metrics agree on completeness where the gold (23×2 / 5×1) is balanced. + +**v1.1 finding 2 — `no_cot completeness` agreement is real, not +selective abstain.** AC1 = κ = 1.000 at n=24. The 2 absent cells +(`q021`, `k8s_012`) are infrastructure abstains (provider rate-limit +retry exhaustion), both gold=`2`, neither in baseline's disagreement +set. On the 24 scored cells, all 4 baseline-with-CoT disagreements +(3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to +agreement when CoT is removed. The interview-relevant claim is the +*opposite* of the conventional CoT-helps story: CoT-before-score on +3-point completeness lets the judge over-emphasize partial coverage +and rationalize `1` when the human gold sides with the holistic +"covers the points" reading. + +**v1.1 finding 3 — `jury_kappa_weighted` underperformed baseline on +completeness, with a precise mechanism.** Per-member analysis from +`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`: +Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%; +gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw +26.9% — systematically harsh on the 3-point scale, almost never +scoring `2`. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9% — +matches gpt-4o-mini alone exactly because the jury verdict reduces +to gpt-4o-mini's verdict on every disputed cell. + +The mechanism is *missing-weight + round-down* compounding, not +weighted voting in the usual sense. `scripts/run_calibration.py +::_load_weights_from_baseline` is a documented v1 stub that returns +weight = 1.0 for every judge_id present in baseline. baseline.json +contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini +gets 1.0 from `jury.py`'s missing-key fallback (with a logged +`jury_missing_weight_fallback_to_one` warning per call). 
Equal +weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean +of 1.5; the `_discretize_mean` rule is `frac > 0.5 → ceil else floor`, +and `0.5 > 0.5` is false, so 1.5 floors to 1. gpt-4o-mini's verdict +wins every disputed cell. The v1 design doc's risks subsection listed +"jury κ worse than the better individual judge — (a) kappa-weighting +wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired +*both* branches simultaneously: branch (a) because the weighting is a +stub returning equal weights, and branch (b) because round-down at +exact 0.5 ties hands the verdict to the lower-scoring member. + +The deeper structural point is that weighting alone cannot rescue a +systematically miscalibrated member. Even held-out validation that +correctly assigned gpt-4o-mini's true low weight on completeness +would still let it dominate disputed ties unless its weight were +driven near zero — and at that point exclusion is more honest than +near-zero inclusion. The conservative-on-binary "ties to lower" rule +also doesn't transfer cleanly to ordinal scales: on completeness, +"conservative" means scoring *toward incomplete*, which is precisely +the direction of gpt-4o-mini's bias. + +**v1.2 fix list (four items, expanding the earlier two-item list):** + +1. **Held-out jury weights.** Replace the + `_load_weights_from_baseline` stub with a real κ-derived + computation, evaluated on a *held-out validation set* — not the + same calibration row whose κ is being measured against the gold. + Closes the circular-weighting hole. +2. **Symmetric member coverage in the weights source.** Missing-member + fallback to weight = 1.0 amplifies an unweighted member rather than + suppressing it. Either every jury member must have a weight in the + source file or the run must abort. The `jury_missing_weight_ + fallback_to_one` warning fired loudly on every call this run; in + v1.2 it should be a hard error. +3. 
**Per-dimension member exclusion when individual κ falls below a + threshold.** gpt-4o-mini at κ = 0.020 on completeness should not be + in the completeness jury at all. Weights below a floor (suggested + κ < 0.2) should be treated as exclusion, not as small-weight + inclusion. Held-out validation fixes circular weighting; it does + not fix systematic member bias. +4. **Per-dimension tie-break rule.** v1's `_discretize_mean` rule + (ties to lower) was selected for conservative behavior on binary + scales, where "conservative" means scoring 0 on uncertainty. On + 3-point completeness, "conservative" means scoring toward + *incomplete*, which interacts badly with member miscalibration. + v1.2 should select the tie-break rule per-dimension based on the + rubric's conservative direction, not globally. + +**Evidence files:** `docs/_generated/kappa_table.md` (regenerated with +AC1 for groundedness/relevance, κ for completeness); +`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl` +(per-member sidecar where the gpt-4o-mini completeness bias is +visible per item); `results/calibration_v1_judge_baseline.json` +(weights source — note the absence of any gpt-4o-mini-2024-07-18 +entries, which is why the missing-weight fallback fires). + +## v1.1 jury rescue — sharpened diagnostic + pre-committed A+B success criteria + +**Date:** 2026-05-06. **Status:** in-flight; this entry is the pre-experiment +contract that pins down what counts as success before the re-aggregation +runs, so the outcome can't be negotiated post-hoc. + +**Sharpened diagnostic — extraction-vs-reasoning split, not just "model is +biased".** Re-reading the per-member sidecar (item-level, not aggregate) +on the gpt-4o-mini completeness disputes shows a more specific failure +mode than "harsh on 3-point". 
On the three representative gold=2 / Haiku=2 +/ gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's `evidence_quotes` +field correctly extracts the paraphrased coverage from the agent answer +— and then its `reasoning` field denies that those very quotes constitute +coverage. k8s_002 is the cleanest instance: the model quotes the strings +"declarative updates" and "sticky identity" into evidence, then writes +"the answer does not explicitly mention 'declarative updates' and 'sticky +identity'". The score follows the reasoning, not the evidence. The +mechanism is that the model's *post-extraction reasoning step* applies a +literal-string-match standard to the answer text while the rubric +requires "paraphrase allowed" — i.e., the structured-output discipline +forced an extraction step that the reasoning step then contradicted on +autopilot. This is a known failure mode in chain-of-thought judges and +shows up more in smaller models because the reasoning step has less +capacity to integrate the rubric's instruction with the literal-text +comparison the model is running by default. The artifact for the writeup +is `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` (three +side-by-side reasoning + evidence_quotes excerpts). + +**Pragmatic v1.1 weights-source decision.** The v1.2 fix-list above +specifies a held-out validation set for jury weights — methodologically +clean but requires either splitting N=30 (loses statistical power on +both halves) or labeling more items (eats interview prep time). v1.1 +chooses pragmatic: weights computed from the same calibration set used +for κ reporting, with the circularity flagged in the writeup. Reason: +(a) the alternative is splitting N=30, (b) the per-member κ values used +as weights are internally consistent, (c) v1.2 will use a held-out 20- +item set. The writeup will contain a sentence acknowledging the +circularity rather than hiding it. 
+ +**v1.1 elevated fix-list (subset of the v1.2 list above).** Item 2 +(symmetric coverage / hard-error) is elevated unconditionally. Item 1 +(real κ-derived weights) is elevated in pragmatic form (same set with +circularity caveat). Items 3 (per-dimension exclusion) and 4 (per- +dimension tie-break) remain v1.2 unless B's outcome forces them up. + +**Pre-committed B success criteria.** Plan B is "re-aggregate the existing +164 member-rows in `calibration_v1_judge_jury_kappa_weighted_members.jsonl` +with corrected κ-derived weights, no new API spend." The outcome maps +deterministically to one of three predefined responses, picked *before* +B runs: + +- **Outcome 1 — jury κ on completeness exceeds Haiku-baseline κ by ≥ + 0.05** (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story: + "v1's weights-source bug masked correct aggregation; once both bugs + (asymmetric coverage + missing-weight fallback) are fixed, the jury + improves on baseline. Per-dimension exclusion remains a v1.2 design + pattern but is not needed at v1.1." This is the strong story. +- **Outcome 2 — jury κ within ±0.05 of Haiku-baseline** (i.e., 0.366 ≤ + jury κ ≤ 0.466). Writeup story: "weights-source fix recovers parity + but the jury isn't doing meaningful work on completeness — gpt-4o- + mini's near-zero weight makes it effectively excluded by aggregation. + This is *soft exclusion via weighting*; v1.2 will make exclusion + explicit." Defensible but less clean. +- **Outcome 3 — jury κ falls below Haiku-baseline κ by >0.05** (i.e., + jury κ < 0.366). Writeup story: "weights-source fix is necessary but + not sufficient; even at near-zero weight gpt-4o-mini's verdict tips + disputed (1, 2) ties due to the round-down rule. v1.1 escalates to + per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1. 
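The three-way mapping above is mechanical enough to express as a small function. The sketch below is illustrative only: `classify_plan_b_outcome` is a hypothetical name, and resolving the shared boundary at exactly +0.05 in favour of Outcome 1 is an assumption, since the criteria as written list 0.466 under both Outcome 1 and Outcome 2.

```python
def classify_plan_b_outcome(jury_kappa, baseline_kappa=0.416, margin=0.05):
    """Map a jury kappa on completeness to the pre-committed outcome number.

    Assumption: a delta of exactly +margin counts as Outcome 1. Values that
    sit exactly on a boundary are also subject to floating-point rounding,
    so a real check would compare at the reported precision.
    """
    delta = jury_kappa - baseline_kappa
    if delta >= margin:
        return 1  # jury beats the Haiku baseline by at least the margin
    if delta < -margin:
        return 3  # jury falls below the baseline by more than the margin
    return 2      # parity: soft exclusion via weighting


# Illustrative inputs, not measured results.
examples = {k: classify_plan_b_outcome(k) for k in (0.50, 0.45, 0.416, 0.30)}
```

Writing the mapping down before the run is the point of the contract: the number that comes back selects a response, and no judgment call is left for afterwards.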
+ +**Why the predefined-criteria framing matters.** "I ran B, looked at the +number, decided it was good enough" is the same data with a weaker frame +than "I predefined the success criteria before running the experiment, B +landed at outcome X, which mapped to predefined response Y". The latter +demonstrates evaluation maturity in the writeup; the former invites +post-hoc reading of the outcome. + +**B outcome — 2026-05-06.** Plan B re-aggregated the existing 164 sidecar +rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on +completeness; clipped at 0 from raw κ values). Result: **jury κ on +completeness = 0.416**, exactly matching Haiku-baseline. Δ = 0.000; +maps to **Outcome 2 (soft exclusion via weighting)**. Per the +pre-committed response, v1.1 stops here and writes up; per-dimension +member exclusion (item C / v1.2 fix #3) is not escalated to v1.1. + +Mechanism, validated empirically — a disputed cell (Haiku=2, gpt=1) +with corrected weights aggregates as `(2 × 0.416 + 1 × 0.020) / 0.436 = +1.954`. The frac (0.954) > 0.5 round-up rule ceils to 2, giving the +correct verdict. v1's two compounding bugs (asymmetric source returning +weight=1.0 for Haiku and the missing-key fallback returning 1.0 for gpt- +4o-mini) jointly forced equal weights, and equal-weights with the same +round-up rule produced `(2 × 1 + 1 × 1) / 2 = 1.5`, which has frac +exactly 0.5 (not > 0.5), and floored to 1 — gpt's verdict winning every +disputed cell. The bug fixes recover the right verdict purely +mechanically; no judge model behavior changes. + +The empirical reading: the weighting is *not doing meaningful work* — +gpt-4o-mini's near-zero weight effectively excludes it on completeness, +and the jury's κ matches Haiku-alone exactly because Haiku's verdict +wins every disputed cell. 
This is "soft exclusion via weighting"; v1.2's +explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the +exclusion visible in the jury config rather than emergent from κ-derived +weight collapse. + +**v1.1 code changes (this commit):** +- `agent_bench/evaluation/variance/jury.py` — silent missing-weight + fallback to 1.0 → hard `ValueError`. Two existing tests that asserted + the old contract (`test_kappa_weighted_reasoning_reports_applied_weights_not_dict`, + `test_kappa_weighted_logs_warning_on_missing_weight`) updated to + assert the new contract. +- `scripts/run_calibration.py::_load_weights_from_baseline` → + `_compute_kappa_weights` — replaces the v1 stub with real per-judge + Cohen's κ on the dimension; hard-errors when any expected member is + missing from the source. Clips κ < 0 to weight = 0 (soft exclusion). +- `configs/calibration/rows/jury_kappa_weighted.yaml` — `weights_source` + re-pointed from `calibration_v1_judge_baseline.json` (Haiku-only, + asymmetric coverage) to + `calibration_v1_judge_jury_kappa_weighted_members.jsonl` (both judges, + same calibration set with documented circularity). +- `tests/scripts/test_run_calibration_dispatch.py` — two new tests cover + `_compute_kappa_weights`: (a) computes real κ (high-agreement judge → + weight=1.0, chance-agreement judge → 0); (b) hard-errors on + asymmetric source coverage. +- `results/calibration_v1_judge_jury_kappa_weighted_v1_1.json` — new + predictions row produced by re-aggregating the existing sidecar + offline (no API spend; via `scripts/_dev/reaggregate_jury_v1_1.py`). + `docs/_generated/kappa_table.md` regenerated with this row alongside + the broken v1 row, giving the writeup a clean before/after diff + (completeness: 0.014 → 0.416, n=26). +- `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` — the + three side-by-side reasoning + evidence_quotes excerpts (q006 / + k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split + diagnostic finding. 
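The aggregation arithmetic and the new missing-weight contract described in this entry can be reproduced with a short sketch. It is not the code in `agent_bench/evaluation/variance/jury.py`: the function names `discretize_mean` and `weighted_verdict`, the dict-based inputs, and the error raised when all weights are zero are assumptions made for illustration. Only the `frac > 0.5 → ceil else floor` rule and the hard `ValueError` on a missing weight come from the entry.

```python
import math


def discretize_mean(mean):
    """Round a weighted mean to an integer score; exact .5 ties go down."""
    floor = math.floor(mean)
    frac = mean - floor
    return math.ceil(mean) if frac > 0.5 else floor


def weighted_verdict(scores, weights):
    """Aggregate per-judge scores with per-judge weights.

    scores and weights are dicts keyed by judge id (hypothetical layout).
    A judge without a weight is a hard error, not a silent 1.0 fallback.
    """
    missing = [judge for judge in scores if judge not in weights]
    if missing:
        raise ValueError(f"no weight for jury member(s): {missing}")
    total = sum(weights[judge] for judge in scores)
    if total <= 0:  # assumption: an all-zero jury cannot produce a verdict
        raise ValueError("all jury member weights are zero")
    mean = sum(scores[judge] * weights[judge] for judge in scores) / total
    return discretize_mean(mean)


disputed = {"haiku": 2, "gpt-4o-mini": 1}

# v1 behaviour: equal weights give a mean of 1.5, which floors to 1.
v1_verdict = weighted_verdict(disputed, {"haiku": 1.0, "gpt-4o-mini": 1.0})

# v1.1 behaviour: kappa-derived weights give a mean near 1.954, which ceils to 2.
v1_1_verdict = weighted_verdict(disputed, {"haiku": 0.416, "gpt-4o-mini": 0.020})
```

The same disputed cell flips from 1 to 2 purely because the weights change; the tie-break rule is identical in both calls.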
+ +The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1 +(pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2. + +## Plan 3A — recency-positioned paraphrase instruction (pre-committed criteria) + +**Date:** 2026-05-06. **Status:** in-flight; this entry pins down the +hypothesis and success criteria before the experiment runs. + +**Hypothesis sharpened by the 1A direction-of-bias finding.** GPT-4o- +mini's completeness disagreements are 17/19 gold=2/pred=1 with zero +up-mistakes across 26 items spanning two corpora — direction-aware noise, +not balanced random labeling. The model is consistently applying *some* +rule stricter than the rubric requires. The hypothesis under test: that +stricter rule is "literal-string match required, paraphrase doesn't +count," and the bias is fixable by recency-positioning the rubric's +"paraphrase allowed" instruction adjacent to the commit-to-score +decision instead of leaving it 500+ tokens upstream in the rubric body. + +**The intervention is positional, not lexical.** The current +`CompletenessJudge` prompt (`agent_bench/evaluation/judges/completeness.py`) +sends the rubric body, then the gold reference, then the system answer, +then a one-line "Score this answer..." instruction immediately followed +by the JSON schema clause. The rubric body's "paraphrase allowed" clause +appears in the introductory paragraphs, hundreds of tokens before the +score decision. The intervention adds one sentence between the system +answer and the score instruction: + +> *"Note: a paraphrase that captures the same meaning as a gold-answer +> point counts as covered. Score on content equivalence, not surface +> form."* + +This is the recency-positioning hypothesis: the model loses the +paraphrase conditioning across the rubric anchors and the reasoning +step. Restating the instruction adjacent to the score decision tests +whether the bias is positionally correctable. 
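A minimal sketch of the positional change follows, assuming the prompt is assembled by joining text sections in order. The builder function, the section labels, and the `SCORE_INSTRUCTION` placeholder are hypothetical; the real prompt lives in `agent_bench/evaluation/judges/completeness.py` and its exact wording is not reproduced in this entry. Only the clause text and its position between the system answer and the score instruction are taken from the description above.

```python
# Clause text quoted from this entry; the constant name is illustrative here.
PARAPHRASE_RECENCY_CLAUSE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)

# Placeholder only: the real one-line instruction is elided in this entry.
SCORE_INSTRUCTION = "Score this answer..."


def build_completeness_prompt(rubric_body, gold_reference, system_answer):
    """Assemble the prompt with the clause adjacent to the score decision."""
    sections = [
        rubric_body,
        "Gold reference:\n" + gold_reference,
        "System answer:\n" + system_answer,
        PARAPHRASE_RECENCY_CLAUSE,  # restated here, right before scoring
        SCORE_INSTRUCTION,
    ]
    return "\n\n".join(sections)


prompt = build_completeness_prompt(
    rubric_body="RUBRIC BODY",
    gold_reference="GOLD POINTS",
    system_answer="AGENT ANSWER",
)
```

Because the hypothesis is about position and not wording, a test of this builder should assert ordering (answer, then clause, then score instruction) and not just that the clause appears somewhere in the prompt.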
+ +**Selected 5 disputed items** (representative of the gold=2 / Haiku=2 / +gpt=1 pattern across both corpora): `q006`, `q011`, `k8s_002`, `k8s_006`, +`k8s_018`. All five are pure paraphrase-coverage cases (the system +answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored +1 with the extraction-vs-reasoning split documented in +`measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`). + +**Pre-committed 3A success criteria.** + +- **Fixed (≥3/5 shift from 1 → 2):** Recency-positioning is sufficient. + Re-run GPT-4o-mini on the full 26 disputed items with the corrected + prompt, recompute κ, update the writeup table. Story: "rubric- + engineering matters more than judge model choice for ordinal scales — + recency-positioning the paraphrase instruction recovered N% of + disputed items." The completeness story becomes actionable, not + diagnostic-only. +- **Partially fixed (1–2/5 shift):** Inconclusive at N=5 (binomial- + significance line is ~3+). Re-run on the full 26 disputed items + (~$0.20) to get a clean number; write up whatever the full-26 says. +- **Not fixed (0/5 shift):** The instruction is being received and + ignored — the model can't act on it under reasoning load. Escalate + to 4A (GPT-4o full on the same 5 items) to verify the small-model- + specific claim. Story: "repositioning the paraphrase instruction + adjacent to the score decision did not shift any of 5 disputed items; + GPT-4o handled the same prompts. The bias is small-model-specific, + not prompt-fixable." + +The 3/5 threshold is the binomial-significance line at this N — random +shifting under the null produces 0 or 1 changes most of the time. Pre- +committing avoids the "2 shifted, that's kind of a fix" negotiation. + +**On the 1A relevance finding — confirmed.** Both judges essentially +correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ +degeneracy is structural under 29/30 prevalence at class-2; AC1 + +raw agreement is the right reporting. 
No further investigation on +relevance. Writeup paragraph is one short sentence: prevalence-induced +degeneracy → AC1 is load-bearing. + +## Plan 3A — outcome on the 5-item probe + full-26 re-run (v1.1.1) + +**Date:** 2026-05-06. **Status:** complete; the v1.1.1 prompt is now +permanent in `agent_bench/evaluation/judges/completeness.py`. + +**3A 5-item probe:** 3/5 disputed items shifted 1 → 2 (q006, q011, +k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. This meets the pre- +committed threshold (≥3/5 → "fixed"), so the protocol triggered the +full-26 re-run on gpt-4o-mini only (Haiku held as control to make the +v1.1 → v1.1.1 delta cleanly attributable to the intervention's effect on +the affected judge). + +**Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):**
+
+| judge | n | raw | κ | AC1 |
+|------------------------------|----|--------|--------|--------|
+| v1.1 gpt-4o-mini | 26 | 26.9% | +0.020 | +0.006 |
+| **v1.1.1 gpt-4o-mini** | 28 | **42.9%** | **+0.000** | **+0.232** |
+| v1.1 Haiku (control) | 26 | 84.6% | +0.416 | +0.792 |
+
+**Per-item delta (v1.1 → v1.1.1):** 7 items shifted up (all 1 → 2), +0 shifted down, 19 unchanged. Of the 7 up-shifts: 6 are correct (gold=2 +items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016, +k8s_017, q006), 1 is a regression (k8s_025: gold=1, was correctly pred=1 +in v1.1, now over-credited at pred=2). Net per-item correctness delta: ++5 items. + +**Cohen's κ is misleading on this comparison.** v1.1.1 raw agreement +rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from +0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0 — slightly +*lower* than v1.1's 0.020. The mechanism is prevalence-rebalancing in +the marginals: gpt-4o-mini's pred distribution shifted from `{0:2, 1:19, +2:5}` (concentrated at 1) to `{0:4, 1:12, 2:12}` (more balanced, closer +to gold's `{1:5, 2:23}` over n=28). 
Cohen's κ = `(P_o - P_e)/(1 - P_e)`; +when marginals become more diverse, P_e (chance agreement) rises in +lockstep with P_o (observed agreement), and κ deflates. AC1 uses +prevalence-robust chance correction (`P_e = (1/(q-1)) Σ pi_k(1-pi_k)`) +and reads the actual signal. + +This is the same trap that motivated AC1 over κ on the relevance and +groundedness rows of the original κ table, surfacing here at a +different distribution boundary. The κ table footer already explains +why per-dimension metric selection matters; v1.1.1's outcome +demonstrates the trap *induced by the intervention itself*. + +**Effect on the jury aggregate.** With κ-derived weights and gpt-4o- +mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury +verdict on completeness is now mathematically equivalent to Haiku-alone +on every item (gpt's contribution is multiplied by zero). Jury κ stays +at 0.416, identical to v1.1's corrected aggregate. The intervention's +per-member improvement is *invisible at the jury level* under this +weighting scheme. + +**Methodological consequence — v1.2 fix-list addition.** The v1.2 fix- +list now expands by one item: + +5. **Prevalence-robust weights for prevalence-skewed dimensions.** + v1.1's `_compute_kappa_weights` uses Cohen's κ for every dimension, + which has a *self-defeating property* on prevalence-skewed gold: + improving a member can lower its weight even as it gets more + accurate. + + **Mechanism.** Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where + `P_e = Σ_k P(gold=k) × P(pred=k)` is the chance-agreement term + computed from the marginal distributions. P_e is *not* invariant to + the predictor's marginal distribution — when a member's predictions + become more diverse (less concentrated at one class), P_e *rises* + as the marginals approach gold's marginals. Concretely: when an + intervention moves a member's pred distribution from concentrated- + at-one-class toward gold's distribution, P_o and P_e rise together + in lockstep. 
The numerator `P_o - P_e` stays small, and κ deflates + even as raw accuracy improves. This is the same prevalence-induced + degeneracy that motivated AC1 over κ on relevance/groundedness rows + in the κ table — it surfaces in jury weighting at any + distribution-shifting intervention's boundary. + + **Empirically observed in v1.1.1.** The recency-positioning + intervention shifted gpt-4o-mini completeness pred dist from + `{0:2, 1:19, 2:5}` to `{0:4, 1:12, 2:12}`, closer to gold's + `{1:5, 2:23}` over n=28. Per-cell raw agreement 26.9% → 42.9%. + AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×). + Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o + ≈ 0.43. v1.1's `_compute_kappa_weights` clips the new κ at zero, + producing weight = 0 — and the jury aggregate loses access to a + member that was empirically improved. The intervention's per- + member improvement is invisible at the jury level under κ-weighting. + + **Architectural decomposition for v1.2.** The right separation: + - **Per-dimension metric for κ table reporting** (already in v1.1 + via `agent_bench/evaluation/calibration/report.py::_DIM_METRIC`). + - **Per-dimension weight metric for jury aggregation** (new in + v1.2, reuses `_DIM_METRIC`). Use κ where the gold's prevalence + supports it, AC1 where κ degenerates. Same lookup, same per- + dimension policy at both reporting and weighting layers. + - **Per-dimension membership as explicit configuration override** + for members that are structurally inappropriate (v1.2 fix #3, + unchanged) — distinct from "low score on the chosen metric," + which is handled by the weight floor. + + **Why this is non-obvious.** A reader's first instinct is that + "weight by κ" is a sensible default — κ is *the* standard inter- + rater statistic. The self-defeating property is invisible until + you observe a real intervention that shifts marginals; in static + conditions (no intervention, fixed prompts), the κ-weight choice + is benign. 
The v1.1.1 outcome is the first time the agent-bench + calibration set has produced an intervention-induced marginal + shift on the same gold; the failure mode wouldn't have been + visible in v1.0's static calibration sweep. + +**v1.1.1 code changes (this commit):** +- `agent_bench/evaluation/judges/completeness.py` — adds + `PARAPHRASE_RECENCY_CLAUSE` constant, inserted between the system + answer and the score instruction. Comment cites the 3A probe. +- `tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt` + — extends to assert the recency clause appears AND is positioned + between the answer and the score instruction (position is load- + bearing, not just lexical inclusion). +- `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl` + — merged sidecar: v1.1 groundedness/relevance rows (unchanged + judges) + fresh v1.1.1 gpt-4o-mini completeness rows + v1.1 Haiku + completeness rows. +- `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl` — the + 5-item probe artifact with reasoning + evidence_quotes for each. +- `scripts/_dev/probe_3a_paraphrase_recency.py`, + `scripts/_dev/rerun_completeness_v1_1_1.py` — reproducers; not + part of the production calibration runner. + +**No changes to the κ table.** The jury aggregate κ on completeness is +unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy +described above; adding a `jury_kappa_weighted_v1_1_1` row with +identical numbers would be visual noise. The v1.1.1 finding lives in +the writeup body, not the table — the per-member AC1 improvement +(0.006 → 0.232) is the headline number, surfaced as a separate +paragraph next to the κ table rather than inside it. + +**Total spend through Plan 3A:** $0.0013 (3A probe) + $0.0075 (full-26 +re-run) = $0.0088. + +## Plan 4A — GPT-4o (full) on the v1.1.1 residual + +**Date:** 2026-05-06. **Status:** complete. 
Run after the writeup- +framing review surfaced that v1.1.1's "fixed" verdict was overclaim- +prone — 5/19 items were recovered, 14 remained unchanged and +uncharacterized. 4A was originally scoped as conditional on 3A *not* +being fixed (per the predefined sequencing rule), but became valuable +as a *post-3A* diagnostic to characterize the residual: is it small- +model-specific or rubric-under-specified? + +**Scope.** GPT-4o (`gpt-4o-2024-08-06`) on 5 of the 14 v1.1.1-unchanged +items: `k8s_006`, `k8s_018`, `q011`, `q012`, `k8s_001`. Same v1.1.1 +production prompt (paraphrase recency clause active). The first two +(k8s_006, k8s_018) are the items that didn't shift in the original 3A +5-item probe — we have gpt-4o-mini's reasoning on those items *with* +the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt +varying only the model. q011, q012, k8s_001 cover the broader +fastapi/k8s residual surface (k8s_001 also a Haiku miscall — 4A +checks whether GPT-4o agrees with gold or with Haiku). + +**Result: 5/5 correct.** All 5 items scored 2 by GPT-4o, matching gold +exactly. Cost: $0.0011 reported (caveat: pricing config falls back to +gpt-4o-mini rates for unlisted models, so actual cost is closer to +$0.005–0.01 — the reported number under-reports by ~5–10×). + +**Sharpened mechanism — criteria-invention, not just literal-match.** +The original 3-example artifact (q006, k8s_002, k8s_018) was framed +as gpt-4o-mini "applying a literal-string-match standard" while +correctly extracting paraphrased coverage into evidence_quotes. 4A's +side-by-side reasoning on `k8s_018` shows a distinct second mechanism: + + - **gpt-4o-mini (v1.1.1, score 1):** "It mentions some key points + from the reference... but does not explicitly state that the new + fields in `autoscaling/v2` are preserved as annotations when using + `autoscaling/v1`, nor does it mention the need to use + `autoscaling/v2` directly for memory or custom metric scaling for + a Deployment or StatefulSet." 
+ - **gpt-4o (4A, score 2):** "The answer covers all the key points + from the reference. It mentions that the current stable version is + autoscaling/v2, which supports scaling on memory and custom + metrics, similar to the reference. It also notes that + autoscaling/v1 only supports CPU-based scaling, aligning with the + reference's points." + +The reference for k8s_018 specifies three points: (1) autoscaling/v2 +is the current stable API, (2) it adds memory metrics support beyond +v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's +reasoning step *invents additional criteria* the reference does not +require ("preserved as annotations when using autoscaling/v1," "use +autoscaling/v2 directly for ... a Deployment or StatefulSet") and then +deducts against them, scoring 1. GPT-4o reads the reference's three +points and scores against exactly those, scoring 2. + +This is a *capacity* finding distinct from the paraphrase-recency +finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing +it toward paraphrase semantics, manufactures additional gold criteria +during scoring that aren't in the reference. Recency-positioning the +"paraphrase allowed" clause doesn't address this — the bias isn't +"missed paraphrase," it's "invented extra requirements." Two failure +modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces. + +**Implication for v1.2.** With 5/5 confirmed, v1.2 fix #3 (per- +dimension membership) gets clean empirical support: gpt-4o-mini is +the wrong tool for 3-point completeness with paraphrase semantics, and +no amount of prompt engineering on this rubric is going to bridge the +capacity gap. The right v1.2 path is one of: + + - **Exclude gpt-4o-mini from completeness scoring** (per-dim + membership; jury reduces to single-judge Haiku on completeness; + explicit and visible in config). 
+ - **Replace gpt-4o-mini with GPT-4o on completeness** (per-dim + judge selection; jury keeps two members but the second is a + frontier-class model on the dimension that needs it). + +Both are defensible v1.2 designs. The choice depends on cost +budget — gpt-4o is ~10× the per-call cost of gpt-4o-mini. For +agent-bench's calibration set scale (~30 items × per-row), even gpt- +4o is trivially cheap; for production deployment evaluating thousands +of agent outputs, the cost trade-off matters more. + +**4A artifact:** `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl` +(per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs +with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for +the side-by-side analysis above). + +**Updated honest framing for the writeup.** "v1.1.1 addressed one +identified failure mode (paraphrase-instruction-loss across reasoning, +recovered 5/19 disputed items via positional change). 4A confirmed the +residual 14 are a distinct failure mode (capacity-limited criteria +invention during the reasoning step) — GPT-4o handles all 5 sampled +residuals at the same v1.1.1 prompt, so the failure is small-model- +specific rather than rubric-limited. v1.2 fix #3 (per-dimension judge +membership / model selection) is the right escalation; the rubric +itself doesn't need changes." + +**Total session spend:** $0.0099 reported (~$0.013–0.018 actual after +gpt-4o pricing correction). 
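The κ-versus-AC1 divergence described under Plan 3A can be reproduced on a toy example. The sketch below is illustrative only: the labels are synthetic two-class data chosen to mimic a skewed gold (5 × `1`, 23 × `2`), not the calibration set, and the helper names (`observed_agreement`, `cohen_kappa`, `gwet_ac1`) are hypothetical stand-ins rather than the functions in `agent_bench/evaluation/calibration/metrics.py`. It applies the formulas quoted in the entry: κ with `P_e = Σ_k P(gold=k) × P(pred=k)`, and AC1 with `P_e = (1/(q-1)) Σ pi_k(1-pi_k)`.

```python
from collections import Counter


def observed_agreement(gold, pred):
    """Raw agreement P_o: fraction of items where gold and pred match."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Unweighted Cohen's kappa; P_e is the product of the two marginals."""
    n = len(gold)
    labels = sorted(set(gold) | set(pred))
    g_counts, p_counts = Counter(gold), Counter(pred)
    p_o = observed_agreement(gold, pred)
    p_e = sum((g_counts[k] / n) * (p_counts[k] / n) for k in labels)
    return (p_o - p_e) / (1 - p_e)


def gwet_ac1(gold, pred):
    """Unweighted Gwet's AC1; P_e is built from the mean marginal pi_k."""
    n = len(gold)
    labels = sorted(set(gold) | set(pred))
    q = len(labels)
    g_counts, p_counts = Counter(gold), Counter(pred)
    p_o = observed_agreement(gold, pred)
    pi = [(g_counts[k] + p_counts[k]) / (2 * n) for k in labels]
    p_e = sum(x * (1 - x) for x in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)


# Synthetic labels (assumption: NOT the real calibration data).
gold = [1] * 5 + [2] * 23
# "Before": predictions concentrated at class 1.
pred_before = [1] * 5 + [1] * 20 + [2] * 3
# "After": predictions spread toward gold's marginals; more items agree.
pred_after = [1] * 3 + [2] * 2 + [1] * 12 + [2] * 11

for name, pred in (("before", pred_before), ("after", pred_after)):
    print(
        name,
        f"raw={observed_agreement(gold, pred):.3f}",
        f"kappa={cohen_kappa(gold, pred):.3f}",
        f"ac1={gwet_ac1(gold, pred):.3f}",
    )
```

On this toy data raw agreement rises from 8/28 to 14/28 while κ drifts slightly down (roughly 0.051 to 0.044) and AC1 rises (roughly −0.42 to 0.08). That is the same qualitative pattern as the v1.1 → v1.1.1 rows, though the magnitudes here are an artifact of the synthetic labels.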
diff --git a/Makefile b/Makefile index da39f17c487bcfb61ee206d20cd3227e0eaf1cc8..aaa2e31f2f1fc85270840184793c6a357580d6dc 100644 --- a/Makefile +++ b/Makefile @@ -1,6 +1,6 @@ PYTHON ?= /usr/local/opt/python@3.11/bin/python3.11 -.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate +.PHONY: install test lint serve ingest ingest-k8s evaluate-fast evaluate-full benchmark evaluate-langchain calibrate evaluate-judges docker modal-deploy modal-stop vllm-up benchmark-all k8s-dev k8s-prod tf-plan tf-validate install: $(PYTHON) -m pip install -e ".[dev]" @@ -34,6 +34,21 @@ benchmark: evaluate-langchain: $(PYTHON) scripts/run_langchain_eval.py --provider openai +calibrate: ## Run full calibration pipeline (system outputs → all rows → strict κ table). Costs ~$2 in API calls. + $(PYTHON) scripts/run_calibration.py generate-outputs + @for cfg in configs/calibration/rows/*.yaml; do \ + echo "==> running judges for $$cfg"; \ + $(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \ + done + $(PYTHON) scripts/run_calibration.py build-table --strict + +evaluate-judges: ## Re-run all rows + build-table against existing system_outputs (no regeneration). Costs ~$1. + @for cfg in configs/calibration/rows/*.yaml; do \ + echo "==> running judges for $$cfg"; \ + $(PYTHON) scripts/run_calibration.py run-judges --row-config=$$cfg || exit 1; \ + done + $(PYTHON) scripts/run_calibration.py build-table --strict + docker: docker-compose -f docker/docker-compose.yaml up --build diff --git a/README.md b/README.md index af09e4df507e38a94df5acc519e921406efcf2ae..d07e5fcec7f34c431fc46a27c389b7ad91a814d7 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ app_port: 7860 Agentic knowledge retrieval system with evaluation benchmark. 
Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down. -`444 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI` +`443 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI` ## Benchmark Results @@ -249,7 +249,7 @@ security: - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data) - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist) - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums -- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 444 deterministic tests with mock providers +- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 443 deterministic tests with mock providers
API Reference @@ -311,12 +311,25 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3 ## Testing ```bash -make test # 444 deterministic tests, no API keys needed +make test # 523 deterministic tests, no API keys needed make lint # ruff + mypy ``` All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads. CI-safe. +### Targets that cost money + +These Make targets call paid LLM APIs. Run locally; they are excluded from CI. + +| Target | Requires API key | Approximate cost | What it produces | +|---|---|---|---| +| `make evaluate-full` | OpenAI or Anthropic | $0.01–0.10 per run | Full-corpus harness run with L1 + L2 judges; results in `results/{run_label}.json`. Cost scales with item count × judge dimensions: in-scope items get all 3 (groundedness + relevance + completeness), out-of-scope items get relevance only (~$0.0001/item). | +| `make calibrate` | Anthropic + OpenAI | ~$2 per full run | Generates frozen system outputs, scores all 6 ablation rows, builds `docs/_generated/kappa_table.md` | +| `make evaluate-judges` | Anthropic + OpenAI | ~$1 per run | Re-runs the 6 rows against existing system outputs (no regeneration) | +| `make evaluate-langchain` | OpenAI or Anthropic | $0.01–0.05 per run | LangChain baseline harness for the comparison report | + +Set keys via `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables. CI does not have these (test job uses `MockProvider`). + ## Design Decisions See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, security architecture tradeoffs, and more. 
@@ -334,4 +347,4 @@ See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF | **PII redaction** | None | None | Regex + optional NER | | **Output validation** | None | None | PII leakage + URL + blocklist | | **Audit logging** | None | None | JSONL, HMAC-hashed IPs | -| Tests | 97 | 205 | 288 | +| Tests | 97 | 205 | 443 | diff --git a/agent_bench/core/config.py b/agent_bench/core/config.py index 1d02600a2d3a99d00ad41805019b5b0c34fac281..cc3081de46a2516e513e7402c7870f14ab24e531 100644 --- a/agent_bench/core/config.py +++ b/agent_bench/core/config.py @@ -88,6 +88,9 @@ class MemoryConfig(BaseModel): class EvaluationConfig(BaseModel): judge_provider: str = "openai" golden_dataset: str = "agent_bench/evaluation/datasets/tech_docs_golden.json" + # New in judge-layer v1: which dimensions to score with L2 LLM judges. + # citation_faithfulness is opt-in v1 (default-on v1.1). + judge_dimensions: list[str] = ["groundedness", "relevance", "completeness"] _VALID_TIERS = {"heuristic", "classifier"} diff --git a/agent_bench/core/provider.py b/agent_bench/core/provider.py index dcbcfa73bc802d77d87fccf3b996c85446c74804..d755710ae26e574cd8eec1d6c958cd1ab699bc64 100644 --- a/agent_bench/core/provider.py +++ b/agent_bench/core/provider.py @@ -192,9 +192,17 @@ class MockProvider(LLMProvider): class OpenAIProvider(LLMProvider): - """OpenAI API provider pinned to a dated gpt-4o-mini snapshot.""" + """OpenAI API provider pinned to a dated gpt-4o-mini snapshot. - def __init__(self, config: AppConfig | None = None) -> None: + The ``model`` parameter overrides the default pin (used by the + calibration runner so a row config's ``model_id`` is what actually + gets called — without an override, ``judge_id`` would be a label + that disagrees with the API request, breaking provenance). 
+ """ + + def __init__( + self, config: AppConfig | None = None, *, model: str | None = None + ) -> None: try: from openai import AsyncOpenAI except ImportError as e: @@ -205,7 +213,7 @@ class OpenAIProvider(LLMProvider): self.config = config or load_config() api_key = os.environ.get("OPENAI_API_KEY", "") self.client = AsyncOpenAI(api_key=api_key) - self.model = "gpt-4o-mini-2024-07-18" + self.model = model or "gpt-4o-mini-2024-07-18" model_pricing = self.config.provider.models.get(self.model) self._input_cost = model_pricing.input_cost_per_mtok if model_pricing else 0.15 self._output_cost = model_pricing.output_cost_per_mtok if model_pricing else 0.60 @@ -410,9 +418,17 @@ def format_messages_anthropic( class AnthropicProvider(LLMProvider): - """Anthropic Claude provider.""" + """Anthropic Claude provider. - def __init__(self, config: AppConfig | None = None) -> None: + The ``model`` parameter overrides the default pin (used by the + calibration runner so a row config's ``model_id`` is what actually + gets called — without an override, ``judge_id`` would be a label + that disagrees with the API request, breaking provenance). 
+ """ + + def __init__( + self, config: AppConfig | None = None, *, model: str | None = None + ) -> None: try: from anthropic import AsyncAnthropic except ImportError as e: @@ -425,7 +441,7 @@ class AnthropicProvider(LLMProvider): self.config = config or load_config() api_key = os.environ.get("ANTHROPIC_API_KEY", "") self.client = AsyncAnthropic(api_key=api_key) - self.model = "claude-haiku-4-5-20251001" + self.model = model or "claude-haiku-4-5-20251001" model_pricing = self.config.provider.models.get(self.model) self._input_cost = ( model_pricing.input_cost_per_mtok if model_pricing else 0.80 diff --git a/agent_bench/evaluation/calibration/__init__.py b/agent_bench/evaluation/calibration/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..abcbca5344a661e4844c50014504409edd5c9ca5 --- /dev/null +++ b/agent_bench/evaluation/calibration/__init__.py @@ -0,0 +1,9 @@ +"""Hand-rolled inter-rater agreement metrics + calibration report generator.""" + +from agent_bench.evaluation.calibration.metrics import ( + bootstrap_ci, + cohen_kappa, + gwets_ac2, +) + +__all__ = ["bootstrap_ci", "cohen_kappa", "gwets_ac2"] diff --git a/agent_bench/evaluation/calibration/metrics.py b/agent_bench/evaluation/calibration/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..20fb95ddd8e5a8781030b256728f101fd1bbc0a9 --- /dev/null +++ b/agent_bench/evaluation/calibration/metrics.py @@ -0,0 +1,173 @@ +"""Hand-rolled Cohen's kappa, Gwet's AC2, bootstrap CI. + +Hand-rolled (not sklearn) for two reasons: +1. agent-bench's identity is "built from primitives" — adding sklearn + for one function (and transitively numpy + scipy + threadpoolctl + + joblib) contradicts that. +2. The hand-roll demonstrates formula understanding in a way that + sklearn.metrics.cohen_kappa_score does not. + +Fixture-tested against sklearn run *outside* the project venv — +see tests/evaluation/test_calibration_metrics.py and +scripts/_dev/generate_kappa_fixtures.py. 
+""" + +from __future__ import annotations + +import random +from collections.abc import Callable +from typing import Literal + + +def cohen_kappa( + y1: list, + y2: list, + weights: Literal[None, "linear", "quadratic"] = None, +) -> float: + """Cohen's κ = (P_o - P_e) / (1 - P_e). + + Supports unweighted, linear-weighted, and quadratic-weighted variants + for ordinal scales. y1 and y2 must be parallel lists of label values + (int or str). Both must have the same length. + """ + if len(y1) != len(y2): + raise ValueError( + f"y1 and y2 must have same length; got {len(y1)} vs {len(y2)}" + ) + if not y1: + raise ValueError("Empty input — kappa undefined") + + labels = sorted({*y1, *y2}, key=str) + k = len(labels) + label_idx = {lab: i for i, lab in enumerate(labels)} + + cm = [[0] * k for _ in range(k)] + for a, b in zip(y1, y2): + cm[label_idx[a]][label_idx[b]] += 1 + + n = len(y1) + + if weights is None: + w = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)] + elif weights == "linear": + if k <= 1: + w = [[1.0]] + else: + w = [ + [1.0 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k) + ] + elif weights == "quadratic": + if k <= 1: + w = [[1.0]] + else: + w = [ + [1.0 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k) + ] + else: + raise ValueError(f"Invalid weights {weights!r}") + + p_o = sum(w[i][j] * cm[i][j] for i in range(k) for j in range(k)) / n + + row_marg = [sum(cm[i][j] for j in range(k)) / n for i in range(k)] + col_marg = [sum(cm[i][j] for i in range(k)) / n for j in range(k)] + + p_e = sum( + w[i][j] * row_marg[i] * col_marg[j] for i in range(k) for j in range(k) + ) + + if p_e >= 1.0: + return 1.0 + return (p_o - p_e) / (1.0 - p_e) + + +def gwets_ac2( + y1: list, + y2: list, + weights: Literal[None] = None, +) -> float: + """Gwet's AC1 — chance-corrected agreement using mean marginals. 
+ + AC1 = (P_o - P_e) / (1 - P_e) + where P_e = (1/(q-1)) * Σ pi_k * (1 - pi_k) + and pi_k is the mean marginal probability for category k. + + Despite the function name, v1 only supports the *unweighted* (AC1) + formula. The weighted AC2 variant has multiple inconsistent definitions + in the literature (Gwet 2008 vs Gwet 2014); without a sklearn analogue + to cross-check against (sklearn ships κ but not AC1/AC2), shipping a + weighted formula without a fixture is a methodology hazard. Pass + weights=None or omit; passing 'linear' or 'quadratic' raises + NotImplementedError. Fix the formula + fixture in v1.1 (out of scope + per the design's Out-of-Scope section). + """ + if weights is not None: + raise NotImplementedError( + "Weighted Gwet's AC2 is not implemented in v1. The unweighted " + "AC1 formula is correct and tested; the weighted variant has " + "literature inconsistency that needs a pinned fixture before " + "shipping. Pass weights=None or use cohen_kappa(weights=...)." + ) + if len(y1) != len(y2): + raise ValueError("y1 and y2 length mismatch") + if not y1: + raise ValueError("Empty input") + + labels = sorted({*y1, *y2}, key=str) + k = len(labels) + label_idx = {lab: i for i, lab in enumerate(labels)} + + cm = [[0] * k for _ in range(k)] + for a, b in zip(y1, y2): + cm[label_idx[a]][label_idx[b]] += 1 + n = len(y1) + + p_o = sum(cm[i][i] for i in range(k)) / n # diagonal sum (unweighted) + + row_marg = [sum(cm[i][j] for j in range(k)) / n for i in range(k)] + col_marg = [sum(cm[i][j] for i in range(k)) / n for j in range(k)] + pi = [(row_marg[i] + col_marg[i]) / 2 for i in range(k)] + + if k <= 1: + return 1.0 + # AC1 chance term: (1/(q-1)) * Σ pi_k * (1 - pi_k) + p_e_ac1 = sum(pi[i] * (1 - pi[i]) for i in range(k)) / (k - 1) + + if p_e_ac1 >= 1.0: + return 1.0 + return (p_o - p_e_ac1) / (1.0 - p_e_ac1) + + +def bootstrap_ci( + y1: list, + y2: list, + metric_fn: Callable[[list, list], float], + n_iter: int = 1000, + ci: float = 0.95, + seed: int = 42, +) 
-> tuple[float, float, float]: + """Bootstrap confidence interval for an inter-rater metric. + + Returns (point_estimate, ci_lo, ci_hi). Resamples with replacement + n_iter times and takes the (1-ci)/2 and (1+ci)/2 percentiles. + """ + if len(y1) != len(y2): + raise ValueError("length mismatch") + n = len(y1) + rng = random.Random(seed) + point = metric_fn(y1, y2) + samples: list[float] = [] + for _ in range(n_iter): + idx = [rng.randrange(n) for _ in range(n)] + s1 = [y1[i] for i in idx] + s2 = [y2[i] for i in idx] + try: + samples.append(metric_fn(s1, s2)) + except (ValueError, ZeroDivisionError): + # Degenerate resample (e.g., all one label) — skip + continue + samples.sort() + if not samples: + return point, point, point + lo_idx = int(((1 - ci) / 2) * len(samples)) + hi_idx = int(((1 + ci) / 2) * len(samples)) - 1 + return point, samples[lo_idx], samples[hi_idx] diff --git a/agent_bench/evaluation/calibration/report.py b/agent_bench/evaluation/calibration/report.py new file mode 100644 index 0000000000000000000000000000000000000000..e92a7414af8e1014e779cb7a169e2df2dd882353 --- /dev/null +++ b/agent_bench/evaluation/calibration/report.py @@ -0,0 +1,325 @@ +"""generate_kappa_table — joins predictions ⋈ labels by (item_id, dimension, +system_output_hash); computes per-row κ + bootstrap CI + abstain breakdown; +emits markdown table at docs/_generated/kappa_table.md. +""" + +from __future__ import annotations + +import glob as _glob +import json +from collections import defaultdict +from collections.abc import Callable +from pathlib import Path + +import structlog + +from agent_bench.evaluation.calibration.metrics import ( + bootstrap_ci, + cohen_kappa, + gwets_ac2, +) +from agent_bench.evaluation.judges.base import ( + ABSTAIN_REASON_OUT_OF_RANGE, + ABSTAIN_REASON_PROVIDER_EXHAUSTED, + ABSTAIN_REASON_SCHEMA_PARSE, +) + +logger = structlog.get_logger() + +ABSTAIN_THRESHOLD = 0.20 # strictly greater than fires the flag + +# Per-dimension headline metric. 
Cohen's κ degenerates under the prevalence +# imbalance produced by the v1.1 strict-snippet groundedness rubric (1×score=1, +# ~25×score=0) and by the inherent skew on relevance (29×score=2, 1×score=1): +# both Po and Pe approach 1.0, the formula collapses to ~0/0, and the rendered +# κ reads as 0.000 even when raw agreement is >95%. Gwet's AC1 (gwets_ac2 with +# weights=None per metrics.py) uses mean marginals and stays informative under +# imbalance. Completeness has a more balanced gold (23×2, 5×1, 2×Unknown) so +# Cohen's κ is the conventional choice there. The metric per dim is rendered +# explicitly in the footer so a writeup reader sees the methodology choice. +# Type annotation prevents a mypy 1.20.x INTERNAL ERROR triggered by the +# tuple-unpack of `_DIM_METRIC.get(dim, default)` further down. Without it +# mypy fails to infer the metric_fn callable signature consistently across +# the dict literal and the fallback default, and crashes with no real +# user-facing type error to fix. +_MetricFn = Callable[[list, list], float] +_DIM_METRIC: dict[str, tuple[str, _MetricFn]] = { + "groundedness": ("AC1", gwets_ac2), + "relevance": ("AC1", gwets_ac2), + "completeness": ("κ", cohen_kappa), +} + +# Filename marker for jury / permute sidecar files. Any prediction file whose +# basename contains this token is per-member detail, not aggregate predictions, +# and is excluded from the κ table. Pinned here so a future extension change +# (jsonl → json) is caught at the contract site rather than at report time. +_SIDECAR_BASENAME_MARKER = "_members." 
+ + +def _classify_abstain(reasoning: str) -> str: + if reasoning.startswith(ABSTAIN_REASON_PROVIDER_EXHAUSTED): + return "provider_exhausted" + if reasoning.startswith(ABSTAIN_REASON_SCHEMA_PARSE): + return "schema_parse" + if reasoning.startswith(ABSTAIN_REASON_OUT_OF_RANGE): + return "out_of_range" + return "genuine" + + +def generate_kappa_table( + *, + predictions_glob: str, + labels_path: str, + output_path: str, + strict: bool = False, +) -> None: + """Aggregate predictions across rows + dimensions into one markdown table. + + On hash mismatch: ALWAYS raises (both modes), with first-item expected + /actual hashes plus full mismatched-id list. + On missing prediction or label: WARN+exclude in default mode; RAISE in strict. + On undefined κ: render '—' with a footnote (both modes). + On abstain rate > 20%: render κ + footnote with cause breakdown (both modes). + """ + labels: list[dict] = [] + for line in Path(labels_path).read_text().splitlines(): + line = line.strip() + if not line: + continue + labels.append(json.loads(line)) + + label_by_key: dict[tuple[str, str], dict] = { + (label_rec["item_id"], label_rec["dimension"]): label_rec + for label_rec in labels + } + + pred_files = sorted(_glob.glob(predictions_glob)) + if not pred_files: + raise ValueError(f"No prediction files matched: {predictions_glob}") + + rows: list[dict] = [] + for pf in pred_files: + # Skip sidecars (per-member detail, not aggregate predictions). + # Match the basename marker, not a specific extension, so a future + # jsonl → json migration of jury._DEFAULT_SIDECAR_TEMPLATE doesn't + # silently start contaminating the κ table. 
+ if _SIDECAR_BASENAME_MARKER in Path(pf).name: + continue + row_label = ( + Path(pf).stem.replace("calibration_v1_judge_", "") + ) + preds = json.loads(Path(pf).read_text()) + + # Hash-mismatch detection (always raises) + mismatches: list[tuple[str, str, str]] = [] + for p in preds: + key = (p["item_id"], p["dimension"]) + if key in label_by_key: + expected = label_by_key[key]["system_output_hash"] + actual = p["system_output_hash"] + if expected != actual: + mismatches.append((p["item_id"], expected, actual)) + if mismatches: + first_id, first_exp, first_act = mismatches[0] + raise ValueError( + f"Hash mismatch in {pf}: item {first_id!r} " + f"label.system_output_hash={first_exp!r} but " + f"prediction.system_output_hash={first_act!r}. " + f"Full mismatched-id list ({len(mismatches)}): " + f"{[m[0] for m in mismatches]}. " + f"Labels are stale relative to predictions — regenerate one or " + f"the other so hashes align." + ) + + preds_by_dim: dict[str, list[dict]] = defaultdict(list) + for p in preds: + preds_by_dim[p["dimension"]].append(p) + + labels_by_dim: dict[str, list[dict]] = defaultdict(list) + for label_rec in labels: + labels_by_dim[label_rec["dimension"]].append(label_rec) + + for dim in sorted(preds_by_dim.keys()): + # Resolve dimension's headline metric once per dim, instead of + # tuple-unpacking _DIM_METRIC.get(...) at each use site below. + # The repeated unpack pattern triggered a mypy 1.19+ INTERNAL + # ERROR; one resolution call here is also less code. 
+ metric_name, metric_fn = _DIM_METRIC.get( + dim, ("κ", cohen_kappa) + ) + + preds_d = {p["item_id"]: p for p in preds_by_dim[dim]} + labs_d = { + label_rec["item_id"]: label_rec + for label_rec in labels_by_dim.get(dim, []) + } + + common = sorted(set(preds_d) & set(labs_d)) + missing_pred = sorted(set(labs_d) - set(preds_d)) + missing_lab = sorted(set(preds_d) - set(labs_d)) + if missing_pred or missing_lab: + msg = ( + f"row={row_label} dim={dim} " + f"missing_predictions={missing_pred} " + f"missing_labels={missing_lab}" + ) + if strict: + raise ValueError(f"strict mode: missing items: {msg}") + logger.warning("calibration_report_missing", message=msg) + + y_pred: list = [] + y_lab: list = [] + abstains = 0 + abstain_causes: dict[str, int] = { + "provider_exhausted": 0, + "schema_parse": 0, + "out_of_range": 0, + "genuine": 0, + } + for iid in common: + p = preds_d[iid] + label_rec = labs_d[iid] + if p["score"] == "Unknown" or label_rec["score"] == "Unknown": + abstains += 1 + if p["score"] == "Unknown": + abstain_causes[ + _classify_abstain(p.get("reasoning", "")) + ] += 1 + continue + y_pred.append(int(p["score"])) + y_lab.append(int(label_rec["score"])) + + n_eligible = len(y_pred) + abstain_rate = abstains / max(len(common), 1) + + if n_eligible < 3: + rows.append( + { + "row": row_label, + "dim": dim, + "metric": metric_name, + "kappa": None, + "ci_lo": None, + "ci_hi": None, + "n_eligible": n_eligible, + "abstains": abstains, + "abstain_rate": abstain_rate, + "abstain_causes": abstain_causes, + "footnote": ( + f"{metric_name} undefined: insufficient " + f"agreement-eligible items (N={n_eligible})" + ), + } + ) + continue + + try: + kappa = metric_fn(y_lab, y_pred) + point, lo, hi = bootstrap_ci( + y_lab, y_pred, metric_fn, n_iter=1000, seed=42 + ) + except (ValueError, ZeroDivisionError): + rows.append( + { + "row": row_label, + "dim": dim, + "metric": metric_name, + "kappa": None, + "ci_lo": None, + "ci_hi": None, + "n_eligible": n_eligible, + 
"abstains": abstains, + "abstain_rate": abstain_rate, + "abstain_causes": abstain_causes, + "footnote": ( + f"{metric_name} undefined: insufficient " + f"variance after exclusion" + ), + } + ) + continue + + # Detect degenerate κ (perfectly constant labels → P_e=1 → kappa + # was clamped to 1.0 in metrics.py, but with no observed + # disagreement the result is statistically meaningless) + if len(set(y_lab)) <= 1 and len(set(y_pred)) <= 1: + rows.append( + { + "row": row_label, + "dim": dim, + "metric": metric_name, + "kappa": None, + "ci_lo": None, + "ci_hi": None, + "n_eligible": n_eligible, + "abstains": abstains, + "abstain_rate": abstain_rate, + "abstain_causes": abstain_causes, + "footnote": ( + f"{metric_name} undefined: all labels and " + f"predictions in a single category (no variance " + f"to measure)" + ), + } + ) + continue + + footnote = "" + if abstain_rate > ABSTAIN_THRESHOLD: + breakdown = ", ".join( + f"{int(100 * v / abstains)}% {k.replace('_', ' ')}" + for k, v in abstain_causes.items() + if v > 0 + ) + footnote = ( + f"{metric_name} computed on N={n_eligible} of " + f"{len(common)} items; high abstain rate " + f"({100 * abstain_rate:.1f}% — breakdown: {breakdown}) " + f"suggests rubric ambiguity." + ) + + rows.append( + { + "row": row_label, + "dim": dim, + "metric": metric_name, + "kappa": kappa, + "ci_lo": lo, + "ci_hi": hi, + "n_eligible": n_eligible, + "abstains": abstains, + "abstain_rate": abstain_rate, + "abstain_causes": abstain_causes, + "footnote": footnote, + } + ) + + out = ["# κ ablation table — calibration v1\n"] + out.append( + "Headline metric per dimension: " + ", ".join( + f"**{d} → {m}**" for d, (m, _) in _DIM_METRIC.items() + ) + ". 
" + "AC1 (Gwet 2008, unweighted) is used on dimensions whose v1.1 gold " + "is prevalence-skewed enough to make Cohen's κ degenerate " + "(groundedness 1×`1`/29×`0`, relevance 29×`2`/1×`1`); both metrics " + "produce ≥0.95 raw agreement on those rows but Cohen's κ collapses " + "to ≈0 because Pe approaches 1. Completeness uses Cohen's κ — its " + "gold (23×`2`/5×`1`) is balanced enough for κ to behave normally." + ) + out.append("") + out.append("| Row | Dimension | Metric | Agreement (95% CI) | N | Abstain rate | Notes |") + out.append("|---|---|---|---|---|---|---|") + for r in rows: + if r["kappa"] is None: + kcell = " — " + else: + kcell = f"{r['kappa']:.3f} ({r['ci_lo']:.3f}, {r['ci_hi']:.3f})" + rate = f"{100 * r['abstain_rate']:.1f}%" + out.append( + f"| {r['row']} | {r['dim']} | {r['metric']} | {kcell} | " + f"{r['n_eligible']} | {rate} | {r['footnote']} |" + ) + + Path(output_path).parent.mkdir(parents=True, exist_ok=True) + Path(output_path).write_text("\n".join(out) + "\n") + logger.info("kappa_table_written", path=output_path, rows=len(rows)) diff --git a/agent_bench/evaluation/datasets/calibration_v1.json b/agent_bench/evaluation/datasets/calibration_v1.json new file mode 100644 index 0000000000000000000000000000000000000000..e66281f2c3051bee57cfeca734ed4cca2a69f795 --- /dev/null +++ b/agent_bench/evaluation/datasets/calibration_v1.json @@ -0,0 +1,158 @@ +{ + "version": "v1", + "system_config_git_sha": "3a2ed359eb16437cf95987b1fca47281a37fb74c", + "sample_seed": 20260504, + "notes": "30-item stratified calibration set per the design doc. 
Spare slots filled from K8s simple_w_condition and multi_hop (typically highest-variance R@5 strata).", + "items": [ + { + "id": "q021", + "corpus": "fastapi", + "stratum": "calculation" + }, + { + "id": "q010", + "corpus": "fastapi", + "stratum": "out_of_scope" + }, + { + "id": "q027", + "corpus": "fastapi", + "stratum": "out_of_scope" + }, + { + "id": "q006", + "corpus": "fastapi", + "stratum": "retrieval" + }, + { + "id": "q011", + "corpus": "fastapi", + "stratum": "retrieval" + }, + { + "id": "q012", + "corpus": "fastapi", + "stratum": "retrieval" + }, + { + "id": "q023", + "corpus": "fastapi", + "stratum": "retrieval" + }, + { + "id": "q025", + "corpus": "fastapi", + "stratum": "retrieval" + }, + { + "id": "k8s_002", + "corpus": "k8s", + "stratum": "comparison" + }, + { + "id": "k8s_014", + "corpus": "k8s", + "stratum": "comparison" + }, + { + "id": "k8s_016", + "corpus": "k8s", + "stratum": "comparison" + }, + { + "id": "k8s_004", + "corpus": "k8s", + "stratum": "false_premise" + }, + { + "id": "k8s_022", + "corpus": "k8s", + "stratum": "false_premise" + }, + { + "id": "k8s_024", + "corpus": "k8s", + "stratum": "false_premise" + }, + { + "id": "k8s_003", + "corpus": "k8s", + "stratum": "multi_hop" + }, + { + "id": "k8s_017", + "corpus": "k8s", + "stratum": "multi_hop" + }, + { + "id": "k8s_018", + "corpus": "k8s", + "stratum": "multi_hop" + }, + { + "id": "k8s_019", + "corpus": "k8s", + "stratum": "multi_hop" + }, + { + "id": "k8s_025", + "corpus": "k8s", + "stratum": "set" + }, + { + "id": "k8s_001", + "corpus": "k8s", + "stratum": "simple" + }, + { + "id": "k8s_006", + "corpus": "k8s", + "stratum": "simple" + }, + { + "id": "k8s_007", + "corpus": "k8s", + "stratum": "simple" + }, + { + "id": "k8s_009", + "corpus": "k8s", + "stratum": "simple" + }, + { + "id": "k8s_005", + "corpus": "k8s", + "stratum": "simple_w_condition" + }, + { + "id": "k8s_012", + "corpus": "k8s", + "stratum": "simple_w_condition" + }, + { + "id": "k8s_013", + "corpus": "k8s", + 
"stratum": "simple_w_condition" + }, + { + "id": "k8s_015", + "corpus": "k8s", + "stratum": "spare_comparison" + }, + { + "id": "k8s_023", + "corpus": "k8s", + "stratum": "spare_false_premise" + }, + { + "id": "k8s_020", + "corpus": "k8s", + "stratum": "spare_multi_hop" + }, + { + "id": "k8s_011", + "corpus": "k8s", + "stratum": "spare_simple_w_condition" + } + ] +} diff --git a/agent_bench/evaluation/datasets/tech_docs_golden.json b/agent_bench/evaluation/datasets/tech_docs_golden.json index 617db6517c79eadc8ba08c84036ba30ed3b4e8ae..4e62aaf60d575962b072501c96a00bc425dc791e 100644 --- a/agent_bench/evaluation/datasets/tech_docs_golden.json +++ b/agent_bench/evaluation/datasets/tech_docs_golden.json @@ -2,8 +2,15 @@ { "id": "q001", "question": "How do you define a path parameter in FastAPI?", - "expected_answer_keywords": ["curly braces", "path", "function parameter", "URL"], - "expected_sources": ["fastapi_path_params.md"], + "expected_answer_keywords": [ + "curly braces", + "path", + "function parameter", + "URL" + ], + "expected_sources": [ + "fastapi_path_params.md" + ], "category": "retrieval", "difficulty": "easy", "requires_calculator": false, @@ -12,8 +19,15 @@ { "id": "q002", "question": "What is the default page size for pagination in FastAPI and what is the maximum allowed?", - "expected_answer_keywords": ["20", "100", "default", "maximum"], - "expected_sources": ["fastapi_pagination.md"], + "expected_answer_keywords": [ + "20", + "100", + "default", + "maximum" + ], + "expected_sources": [ + "fastapi_pagination.md" + ], "category": "retrieval", "difficulty": "easy", "requires_calculator": false, @@ -22,8 +36,15 @@ { "id": "q003", "question": "How does FastAPI handle CORS and what is the default max_age for preflight caching?", - "expected_answer_keywords": ["CORSMiddleware", "600", "seconds", "preflight"], - "expected_sources": ["fastapi_middleware.md"], + "expected_answer_keywords": [ + "CORSMiddleware", + "600", + "seconds", + "preflight" + ], + 
"expected_sources": [ + "fastapi_middleware.md" + ], "category": "retrieval", "difficulty": "easy", "requires_calculator": false, @@ -32,8 +53,14 @@ { "id": "q004", "question": "What algorithm and expiry time does the FastAPI security example use for JWT tokens?", - "expected_answer_keywords": ["HS256", "30", "minutes"], - "expected_sources": ["fastapi_security.md"], + "expected_answer_keywords": [ + "HS256", + "30", + "minutes" + ], + "expected_sources": [ + "fastapi_security.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, @@ -42,8 +69,15 @@ { "id": "q005", "question": "What is the recommended formula for calculating the number of Gunicorn workers for a FastAPI deployment?", - "expected_answer_keywords": ["2", "CPU", "cores", "1"], - "expected_sources": ["fastapi_deployment.md"], + "expected_answer_keywords": [ + "2", + "CPU", + "cores", + "1" + ], + "expected_sources": [ + "fastapi_deployment.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, @@ -52,18 +86,35 @@ { "id": "q006", "question": "How does dependency caching work in FastAPI, and how can you disable it?", - "expected_answer_keywords": ["cache", "once", "use_cache", "False"], - "expected_sources": ["fastapi_dependencies.md"], + "expected_answer_keywords": [ + "cache", + "once", + "use_cache", + "False" + ], + "expected_sources": [ + "fastapi_dependencies.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, - "reference_answer": "FastAPI caches dependency results so each dependency is called only once per request. Caching can be disabled by setting use_cache=False in the Depends() call." + "reference_answer": "FastAPI caches dependency results so each dependency is called only once per request. 
Caching can be disabled by setting use_cache=False in the Depends() call.", + "source_snippets": [ + "By default, if the same dependency is used multiple times within a single request (e.g., both a route and a sub-dependency use `Depends(get_db)`), FastAPI caches the result and calls the dependency only once. To disable caching and force a fresh call each time, use `Depends(get_db, use_cache=False)`." + ] }, { "id": "q007", "question": "If a paginated endpoint returns 20 items per page and there are 10,000 items total, how many total pages are there? And if the page size is changed to 30, how many pages would there be?", - "expected_answer_keywords": ["500", "334", "ceil", "pages"], - "expected_sources": ["fastapi_pagination.md"], + "expected_answer_keywords": [ + "500", + "334", + "ceil", + "pages" + ], + "expected_sources": [ + "fastapi_pagination.md" + ], "category": "calculation", "difficulty": "medium", "requires_calculator": true, @@ -72,7 +123,11 @@ { "id": "q008", "question": "Does FastAPI support automatic Kubernetes deployment?", - "expected_answer_keywords": ["not", "does not contain", "no information"], + "expected_answer_keywords": [ + "not", + "does not contain", + "no information" + ], "expected_sources": [], "category": "out_of_scope", "difficulty": "easy", @@ -82,7 +137,11 @@ { "id": "q009", "question": "How does FastAPI integrate with Apache Kafka for event streaming?", - "expected_answer_keywords": ["not", "does not contain", "no information"], + "expected_answer_keywords": [ + "not", + "does not contain", + "no information" + ], "expected_sources": [], "category": "out_of_scope", "difficulty": "easy", @@ -92,38 +151,69 @@ { "id": "q010", "question": "Can FastAPI generate GraphQL schemas natively?", - "expected_answer_keywords": ["not", "does not contain", "no information"], + "expected_answer_keywords": [ + "not", + "does not contain", + "no information" + ], "expected_sources": [], "category": "out_of_scope", "difficulty": "easy", 
"requires_calculator": false, - "reference_answer": "" + "reference_answer": "", + "source_snippets": [] }, { "id": "q011", "question": "What is the default Swagger UI endpoint in FastAPI?", - "expected_answer_keywords": ["/docs", "Swagger", "interactive"], - "expected_sources": ["fastapi_openapi.md"], + "expected_answer_keywords": [ + "/docs", + "Swagger", + "interactive" + ], + "expected_sources": [ + "fastapi_openapi.md" + ], "category": "retrieval", "difficulty": "easy", "requires_calculator": false, - "reference_answer": "The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface." + "reference_answer": "The default Swagger UI endpoint in FastAPI is /docs, which provides an interactive API documentation interface.", + "source_snippets": [ + "| `/docs` | Swagger UI -- interactive API explorer |", + "Every FastAPI application exposes three documentation-related endpoints by default:" + ] }, { "id": "q012", "question": "How do you raise an HTTP error in a FastAPI route handler?", - "expected_answer_keywords": ["HTTPException", "status_code", "detail"], - "expected_sources": ["fastapi_error_handling.md"], + "expected_answer_keywords": [ + "HTTPException", + "status_code", + "detail" + ], + "expected_sources": [ + "fastapi_error_handling.md" + ], "category": "retrieval", "difficulty": "easy", "requires_calculator": false, - "reference_answer": "You raise an HTTP error in FastAPI by raising an HTTPException with a status_code and a detail message describing the error." + "reference_answer": "You raise an HTTP error in FastAPI by raising an HTTPException with a status_code and a detail message describing the error.", + "source_snippets": [ + "The `HTTPException` class is the primary way to return error responses from route handlers:", + "When raised, `HTTPException` immediately terminates request processing and returns the specified status code and detail message. 
The `detail` parameter can be a string, list, or dictionary -- FastAPI serializes it to JSON automatically." + ] }, { "id": "q013", "question": "How do you define a request body in FastAPI?", - "expected_answer_keywords": ["Pydantic", "BaseModel", "JSON"], - "expected_sources": ["fastapi_request_body.md"], + "expected_answer_keywords": [ + "Pydantic", + "BaseModel", + "JSON" + ], + "expected_sources": [ + "fastapi_request_body.md" + ], "category": "retrieval", "difficulty": "easy", "requires_calculator": false, @@ -132,8 +222,14 @@ { "id": "q014", "question": "What testing tools does FastAPI use, and what class provides the test client?", - "expected_answer_keywords": ["TestClient", "pytest", "Starlette"], - "expected_sources": ["fastapi_testing.md"], + "expected_answer_keywords": [ + "TestClient", + "pytest", + "Starlette" + ], + "expected_sources": [ + "fastapi_testing.md" + ], "category": "retrieval", "difficulty": "easy", "requires_calculator": false, @@ -142,8 +238,15 @@ { "id": "q015", "question": "How does FastAPI manage application configuration and environment variables?", - "expected_answer_keywords": ["BaseSettings", "pydantic", "env", "environment"], - "expected_sources": ["fastapi_configuration.md"], + "expected_answer_keywords": [ + "BaseSettings", + "pydantic", + "env", + "environment" + ], + "expected_sources": [ + "fastapi_configuration.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, @@ -152,8 +255,15 @@ { "id": "q016", "question": "What is the minimum response size for GZip compression middleware in FastAPI, and how do you enable it?", - "expected_answer_keywords": ["500", "bytes", "GZipMiddleware", "minimum_size"], - "expected_sources": ["fastapi_middleware.md"], + "expected_answer_keywords": [ + "500", + "bytes", + "GZipMiddleware", + "minimum_size" + ], + "expected_sources": [ + "fastapi_middleware.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, @@ -162,8 +272,15 @@ 
{ "id": "q017", "question": "How do yield dependencies work in FastAPI and what is the maximum number supported per request?", - "expected_answer_keywords": ["yield", "cleanup", "finally", "32"], - "expected_sources": ["fastapi_dependencies.md"], + "expected_answer_keywords": [ + "yield", + "cleanup", + "finally", + "32" + ], + "expected_sources": [ + "fastapi_dependencies.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, @@ -172,8 +289,15 @@ { "id": "q018", "question": "What are the three documentation endpoints FastAPI exposes by default and what OpenAPI version does it use?", - "expected_answer_keywords": ["/docs", "/redoc", "/openapi.json", "3.1"], - "expected_sources": ["fastapi_openapi.md"], + "expected_answer_keywords": [ + "/docs", + "/redoc", + "/openapi.json", + "3.1" + ], + "expected_sources": [ + "fastapi_openapi.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, @@ -182,8 +306,15 @@ { "id": "q019", "question": "How does FastAPI handle WebSocket connections, and what must be called before sending data?", - "expected_answer_keywords": ["accept", "WebSocket", "send", "receive"], - "expected_sources": ["fastapi_websockets.md"], + "expected_answer_keywords": [ + "accept", + "WebSocket", + "send", + "receive" + ], + "expected_sources": [ + "fastapi_websockets.md" + ], "category": "retrieval", "difficulty": "medium", "requires_calculator": false, @@ -192,8 +323,16 @@ { "id": "q020", "question": "For a server with 4 CPU cores, how many Gunicorn workers should be configured using the recommended formula?", - "expected_answer_keywords": ["9", "workers", "2", "CPU", "1"], - "expected_sources": ["fastapi_deployment.md"], + "expected_answer_keywords": [ + "9", + "workers", + "2", + "CPU", + "1" + ], + "expected_sources": [ + "fastapi_deployment.md" + ], "category": "calculation", "difficulty": "medium", "requires_calculator": true, @@ -202,18 +341,35 @@ { "id": "q021", "question": "If the 
CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?", - "expected_answer_keywords": ["10", "minutes"], - "expected_sources": ["fastapi_middleware.md"], + "expected_answer_keywords": [ + "10", + "minutes" + ], + "expected_sources": [ + "fastapi_middleware.md" + ], "category": "calculation", "difficulty": "easy", "requires_calculator": true, - "reference_answer": "With a CORS max_age of 600 seconds, the browser caches preflight results for 10 minutes (600 / 60 = 10)." + "reference_answer": "With a CORS max_age of 600 seconds, the browser caches preflight results for 10 minutes (600 / 60 = 10).", + "source_snippets": [ + "| `max_age` | `600` | Seconds the browser caches preflight results |" + ] }, { "id": "q022", "question": "How do route ordering and dependency injection interact when building a secure FastAPI application with scoped endpoints?", - "expected_answer_keywords": ["order", "Depends", "Security", "scopes"], - "expected_sources": ["fastapi_path_params.md", "fastapi_dependencies.md", "fastapi_security.md"], + "expected_answer_keywords": [ + "order", + "Depends", + "Security", + "scopes" + ], + "expected_sources": [ + "fastapi_path_params.md", + "fastapi_dependencies.md", + "fastapi_security.md" + ], "category": "retrieval", "difficulty": "hard", "requires_calculator": false, @@ -222,18 +378,40 @@ { "id": "q023", "question": "How would you set up a FastAPI application with custom error handling, CORS middleware, and structured testing including dependency overrides?", - "expected_answer_keywords": ["HTTPException", "CORSMiddleware", "TestClient", "override"], - "expected_sources": ["fastapi_error_handling.md", "fastapi_middleware.md", "fastapi_testing.md"], + "expected_answer_keywords": [ + "HTTPException", + "CORSMiddleware", + "TestClient", + "override" + ], + "expected_sources": [ + "fastapi_error_handling.md", + "fastapi_middleware.md", + "fastapi_testing.md" + ], "category": "retrieval", "difficulty": "hard", 
"requires_calculator": false, - "reference_answer": "Custom error handling is set up by raising HTTPException or registering exception handlers, CORS is configured by adding CORSMiddleware with allowed origins, and testing uses TestClient with app.dependency_overrides to replace dependencies during tests." + "reference_answer": "Custom error handling is set up by raising HTTPException or registering exception handlers, CORS is configured by adding CORSMiddleware with allowed origins, and testing uses TestClient with app.dependency_overrides to replace dependencies during tests.", + "source_snippets": [ + "The `HTTPException` class is the primary way to return error responses from route handlers:", + "Cross-Origin Resource Sharing (CORS) is configured using `CORSMiddleware` from Starlette:" + ] }, { "id": "q024", "question": "Explain how to deploy a FastAPI app with Docker using Gunicorn workers, health checks, and environment-based configuration via Pydantic Settings.", - "expected_answer_keywords": ["Docker", "Gunicorn", "health", "BaseSettings", "env"], - "expected_sources": ["fastapi_deployment.md", "fastapi_configuration.md"], + "expected_answer_keywords": [ + "Docker", + "Gunicorn", + "health", + "BaseSettings", + "env" + ], + "expected_sources": [ + "fastapi_deployment.md", + "fastapi_configuration.md" + ], "category": "retrieval", "difficulty": "hard", "requires_calculator": false, @@ -242,17 +420,32 @@ { "id": "q025", "question": "How would you build a paginated API with cursor-based navigation, response model validation, and background task processing for analytics logging?", - "expected_answer_keywords": ["cursor", "response_model", "BackgroundTasks"], - "expected_sources": ["fastapi_pagination.md", "fastapi_response_model.md", "fastapi_background_tasks.md"], + "expected_answer_keywords": [ + "cursor", + "response_model", + "BackgroundTasks" + ], + "expected_sources": [ + "fastapi_pagination.md", + "fastapi_response_model.md", + 
"fastapi_background_tasks.md" + ], "category": "retrieval", "difficulty": "hard", "requires_calculator": false, - "reference_answer": "Cursor-based pagination uses an opaque cursor token for navigation instead of page numbers. Response models are validated using the response_model parameter on route decorators, and analytics logging is handled asynchronously via FastAPI's BackgroundTasks dependency." + "reference_answer": "Cursor-based pagination uses an opaque cursor token for navigation instead of page numbers. Response models are validated using the response_model parameter on route decorators, and analytics logging is handled asynchronously via FastAPI's BackgroundTasks dependency.", + "source_snippets": [ + "Cursor-based pagination uses an opaque token (cursor) pointing to the last item in the previous page. This avoids the performance degradation of large offsets:" + ] }, { "id": "q026", "question": "Does FastAPI have built-in support for database migrations like Alembic?", - "expected_answer_keywords": ["not", "does not contain", "no information"], + "expected_answer_keywords": [ + "not", + "does not contain", + "no information" + ], "expected_sources": [], "category": "out_of_scope", "difficulty": "easy", @@ -262,11 +455,16 @@ { "id": "q027", "question": "How does FastAPI handle automatic load balancing across multiple servers?", - "expected_answer_keywords": ["not", "does not contain", "no information"], + "expected_answer_keywords": [ + "not", + "does not contain", + "no information" + ], "expected_sources": [], "category": "out_of_scope", "difficulty": "easy", "requires_calculator": false, - "reference_answer": "" + "reference_answer": "", + "source_snippets": [] } ] diff --git a/agent_bench/evaluation/harness.py b/agent_bench/evaluation/harness.py index 23d1e0c6dfca440f20094ebd53adf428883152a2..a49812339ee96ab73813fcd186e68441f9b5aa4e 100644 --- a/agent_bench/evaluation/harness.py +++ b/agent_bench/evaluation/harness.py @@ -8,8 +8,13 @@ from pathlib 
import Path from pydantic import BaseModel, Field from agent_bench.agents.orchestrator import Orchestrator +from agent_bench.core.config import load_config from agent_bench.core.provider import LLMProvider from agent_bench.core.types import TokenUsage +from agent_bench.evaluation.judges.base import Rubric, ScoreResult +from agent_bench.evaluation.judges.completeness import CompletenessJudge +from agent_bench.evaluation.judges.groundedness import GroundednessJudge +from agent_bench.evaluation.judges.relevance import RelevanceJudge from agent_bench.evaluation.metrics import ( calculator_used_when_expected, citation_accuracy, @@ -21,6 +26,18 @@ from agent_bench.evaluation.metrics import ( tool_call_count, ) +# Annotating the values as type[Judge] would lose concrete-class info and trigger +# mypy's "cannot instantiate abstract class" on the dispatch site below. +# The dict's runtime values are concrete, instantiable subclasses; the +# explicit union annotation below preserves that information. +_JUDGE_CLASS_BY_DIMENSION: dict[ + str, type[GroundednessJudge] | type[RelevanceJudge] | type[CompletenessJudge] +] = { + "groundedness": GroundednessJudge, + "relevance": RelevanceJudge, + "completeness": CompletenessJudge, +} + class GoldenQuestion(BaseModel): id: str @@ -70,9 +87,13 @@ class EvalResult(BaseModel): # Raw answer for reporting answer: str = "" retrieved_sources: list[str] = [] - # LLM judge (None if not run) - faithfulness: float | None = None - correctness: float | None = None + # New in judge-layer v1: per-dimension judge scores. Empty when no + # judge_provider is configured. With a provider, OOS items receive + # relevance only (refusal-vs-engagement is the L2 signal worth + # measuring); reference-based dimensions (groundedness, completeness) + # are skipped on OOS. Completeness is also skipped when + # reference_answer is empty regardless of category. 
+ judge_scores: dict[str, ScoreResult] = Field(default_factory=dict) def load_golden_dataset(path: str | Path) -> list[GoldenQuestion]: @@ -149,21 +170,46 @@ async def run_evaluation( retrieved_sources=ranked_sources, ) - # Optional LLM judge - if judge_provider is not None and q.category != "out_of_scope": - from agent_bench.evaluation.metrics import answer_correctness, answer_faithfulness - - result.faithfulness = await answer_faithfulness( - answer=agent_response.answer, - source_chunks=agent_response.source_chunks, - judge_provider=judge_provider, - ) - if q.reference_answer: - result.correctness = await answer_correctness( - answer=agent_response.answer, - reference_answer=q.reference_answer, + # Optional L2 LLM-judge layer (per-dimension; gated per-dim). + # + # OOS items get relevance scoring (a non-refusal answer to an OOS + # question is exactly what relevance is designed to detect — the + # rubric's "refusal that ignores the question" example covers this + # case). Groundedness and completeness are skipped on OOS because + # neither has a meaningful reference (no source_snippets, no + # reference_answer for OOS items). + # + # This per-dimension gating matches the calibration runner's + # behavior so the κ table's distribution of scored items lines up + # with what the production harness produces. Diverging gates would + # mean the calibration κ for relevance was estimated on items the + # production harness never sees, breaking the supersession's + # empirical backing. + if judge_provider is not None: + cfg = load_config() + rubric_dir = Path(__file__).resolve().parent / "rubrics" + is_oos = q.category == "out_of_scope" + for dim in cfg.evaluation.judge_dimensions: + if dim not in _JUDGE_CLASS_BY_DIMENSION: + continue # citation_faithfulness opt-in; not in default loop + # Per-dimension OOS gating: skip reference-based dimensions + # (groundedness, completeness) on OOS; allow relevance. 
+ if is_oos and dim != "relevance": + continue + # CompletenessJudge is reference-based on q.reference_answer; + # scoring an empty reference is guaranteed-noisy and burns + # tokens. Pre-supersession code had the same gate (correctness + # was conditional on reference_answer being non-empty). + if dim == "completeness" and not q.reference_answer: + continue + rubric = Rubric.from_markdown_file(rubric_dir / f"{dim}.md") + judge = _JUDGE_CLASS_BY_DIMENSION[dim]( judge_provider=judge_provider, + rubric=rubric, + model_id=getattr(judge_provider, "model", "unknown"), ) + score_result = await judge.score(q, agent_response) + result.judge_scores[dim] = score_result results.append(result) diff --git a/agent_bench/evaluation/judges/__init__.py b/agent_bench/evaluation/judges/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..8e060f3f56b18dce1f08943c6414c0f882e626a1 --- /dev/null +++ b/agent_bench/evaluation/judges/__init__.py @@ -0,0 +1,25 @@ +"""Discrete-scale per-dimension LLM judges with anchored rubrics.""" + +from agent_bench.evaluation.judges.base import ( + ABSTAIN_REASON_GENUINE, + ABSTAIN_REASON_OUT_OF_RANGE, + ABSTAIN_REASON_PROVIDER_EXHAUSTED, + ABSTAIN_REASON_SCHEMA_PARSE, + Judge, + MockJudge, + Rubric, + RubricLevel, + ScoreResult, +) + +__all__ = [ + "ABSTAIN_REASON_GENUINE", + "ABSTAIN_REASON_OUT_OF_RANGE", + "ABSTAIN_REASON_PROVIDER_EXHAUSTED", + "ABSTAIN_REASON_SCHEMA_PARSE", + "Judge", + "MockJudge", + "Rubric", + "RubricLevel", + "ScoreResult", +] diff --git a/agent_bench/evaluation/judges/base.py b/agent_bench/evaluation/judges/base.py new file mode 100644 index 0000000000000000000000000000000000000000..8733f1078ab0136f1adb64f5feaf2abce1cfc3a9 --- /dev/null +++ b/agent_bench/evaluation/judges/base.py @@ -0,0 +1,628 @@ +"""Judge ABC, ScoreResult, Rubric, MockJudge, abstain-reason constants. 
+ +The Judge layer supersedes the continuous-scale answer_faithfulness / +answer_correctness functions in agent_bench/evaluation/metrics.py. See +docs/plans/2026-05-04-judge-layer-v1-design.md for the supersession +rationale and the six-axis comparison table. +""" + +from __future__ import annotations + +import hashlib +import json as _json +import random +import re +import time +from abc import ABC, abstractmethod +from pathlib import Path +from typing import TYPE_CHECKING, Literal, Self + +import structlog +import yaml +from pydantic import BaseModel, Field + +from agent_bench.core.provider import ( + ProviderRateLimitError, + ProviderTimeoutError, +) +from agent_bench.core.types import Message, Role + +if TYPE_CHECKING: + from agent_bench.agents.orchestrator import AgentResponse + from agent_bench.core.provider import LLMProvider + from agent_bench.evaluation.harness import GoldenQuestion + +logger = structlog.get_logger() + +# --- Abstain-reason constants --- +# +# Failure-as-abstain ScoreResults carry a reasoning string with one of +# these prefixes. The calibration report pattern-matches against these +# constants for the four-way breakdown in the >20% abstain-rate flag. +# Genuine model abstain (rubric-allowed) uses the empty-string sentinel. + +ABSTAIN_REASON_PROVIDER_EXHAUSTED = "judge_call_failed_after_retry: " +ABSTAIN_REASON_SCHEMA_PARSE = "schema_parse_failed_after_retry: " +ABSTAIN_REASON_OUT_OF_RANGE = "score_out_of_range_after_retry: " +ABSTAIN_REASON_GENUINE = "" + + +class ScoreResult(BaseModel): + """One judge call's result. Self-contained provenance — no run + metadata cross-reference needed for κ aggregation. + + Field order matters: reasoning + evidence_quotes come BEFORE score + in both Pydantic field order and the JSON schema sent to the model, + so the score conditions on the reasoning rather than being + post-hoc rationalized. 
+ """ + + # Reasoning-first ordering — load-bearing for the JSON schema + reasoning: str + evidence_quotes: list[str] = Field(default_factory=list) + score: int | Literal["Unknown"] + + # Provenance + judge_id: str + rubric_version: str + prompt_seed: int = 0 + system_output_hash: str + + # Operations + cost_usd: float + latency_ms: float + + @property + def abstained(self) -> bool: + return self.score == "Unknown" + + +_FENCE_PATTERN = re.compile(r"^```[^\n]*\n.*?^```\n?", re.MULTILINE | re.DOTALL) + + +def _mask_code_fences(text: str) -> str: + """Replace fenced code blocks (``` ... ```) with same-length whitespace, + preserving newlines so byte offsets align with the original. Used by + the rubric loader to skip fenced ``## Score N`` literals when scanning + for structural level headers. + """ + + def _replace(match: re.Match[str]) -> str: + return "".join("\n" if c == "\n" else " " for c in match.group(0)) + + return _FENCE_PATTERN.sub(_replace, text) + + +class RubricLevel(BaseModel): + """One score level in a rubric, with anchored examples. + + Parsed from markdown sections under `## Score N` headers. The + `examples` list contains the H3 sub-sections (`### Example X`) + each with a thinking-trace explanation of why that output got + that score. + """ + + score: int + description: str + examples: list[str] # raw markdown of `### Example` sections + + +class Rubric(BaseModel): + """A scoring rubric loaded from a markdown file with YAML frontmatter. + + Construction validates aggressively: scale ∈ {binary, three_point}, + levels arity matches scale, every level has at least one anchored + example. ValidationError raises with file path + field path so a + Day-1 rubric typo doesn't surface as a Day-2 judge.score crash with + API budget already spent. 
+ """ + + dimension: Literal[ + "groundedness", "relevance", "completeness", "citation_faithfulness" + ] + scale: Literal["binary", "three_point"] + reference_based: bool + abstain_allowed: bool + levels: list[RubricLevel] + body_markdown: str + + @property + def source_hash(self) -> str: + """SHA-256 of the canonical body. Immutable per file content, + independent of git state. Used as ScoreResult.rubric_version. + """ + return hashlib.sha256(self.body_markdown.encode("utf-8")).hexdigest() + + @classmethod + def from_markdown_file(cls, path: Path | str) -> Self: + path = Path(path) + body = path.read_text(encoding="utf-8") + + # Parse YAML frontmatter delimited by --- ... --- + fm_match = re.match(r"^---\n(.+?)\n---\n(.*)$", body, re.DOTALL) + if not fm_match: + raise ValueError( + f"Rubric {path.name}: missing YAML frontmatter " + f"(expected --- ... --- block at top of file)" + ) + try: + frontmatter = yaml.safe_load(fm_match.group(1)) or {} + except yaml.YAMLError as e: + raise ValueError( + f"Rubric {path.name}: frontmatter YAML parse error: {e}" + ) from e + + required = {"dimension", "scale", "reference_based", "abstain_allowed"} + missing = required - frontmatter.keys() + if missing: + raise ValueError( + f"Rubric {path.name}: frontmatter missing fields: {sorted(missing)}" + ) + + scale = frontmatter["scale"] + if scale not in ("binary", "three_point"): + raise ValueError( + f"Rubric {path.name}: invalid scale {scale!r}; " + f"must be 'binary' or 'three_point'" + ) + + # Parse levels by ## Score N headers. Mask fenced code blocks first + # so a literal "## Score N" inside an example's code fence is not + # interpreted as a structural level header. The mask preserves byte + # offsets (replacing non-newline chars with spaces) so we can slice + # the original `body_no_fm` at the masked-text header positions to + # recover level bodies with their fenced content intact. 
+ body_no_fm = fm_match.group(2) + masked_body = _mask_code_fences(body_no_fm) + header_pattern = re.compile(r"^## Score (\d+)\n", re.MULTILINE) + header_matches = list(header_pattern.finditer(masked_body)) + raw_levels: list[tuple[int, str]] = [] + for i, m in enumerate(header_matches): + start = m.end() + end = ( + header_matches[i + 1].start() + if i + 1 < len(header_matches) + else len(body_no_fm) + ) + raw_levels.append((int(m.group(1)), body_no_fm[start:end])) + + expected_arity = 2 if scale == "binary" else 3 + if len(raw_levels) != expected_arity: + raise ValueError( + f"Rubric {path.name}: arity mismatch — scale {scale!r} " + f"requires {expected_arity} levels, found {len(raw_levels)}" + ) + + # Parse examples (### Example) per level + levels: list[RubricLevel] = [] + for score, level_body in raw_levels: + example_pattern = re.compile( + r"^### (Example .+?)\n(.*?)(?=^### |\Z)", re.MULTILINE | re.DOTALL + ) + examples = [m.group(0) for m in example_pattern.finditer(level_body)] + if not examples: + raise ValueError( + f"Rubric {path.name}: level Score {score} has no " + f"anchored example (expected at least one ### Example header)" + ) + description = level_body.split("###", 1)[0].strip() + levels.append( + RubricLevel(score=score, description=description, examples=examples) + ) + + return cls( + dimension=frontmatter["dimension"], + scale=scale, + reference_based=bool(frontmatter["reference_based"]), + abstain_allowed=bool(frontmatter["abstain_allowed"]), + levels=levels, + body_markdown=body, + ) + + def render_prompt(self, *, level_permutation_seed: int = 0) -> str: + """Render the rubric body for inclusion in a judge prompt. + + If level_permutation_seed > 0, levels are reordered deterministically + using a seeded PRNG. seed=0 returns the canonical order. 
+ """ + if level_permutation_seed == 0: + return self.body_markdown + rng = random.Random(level_permutation_seed) + permuted_levels = list(self.levels) + rng.shuffle(permuted_levels) + # Reconstruct: keep frontmatter + intro paragraphs intact; + # reorder the ## Score N sections. + fm_match = re.match(r"^(---\n.+?\n---\n)(.*)$", self.body_markdown, re.DOTALL) + if not fm_match: + return self.body_markdown # defensive — should never happen post-construction + head = fm_match.group(1) + rest = fm_match.group(2) + intro = re.split(r"^## Score ", rest, maxsplit=1, flags=re.MULTILINE)[0] + permuted_body = head + intro + "\n".join( + f"## Score {lvl.score}\n{lvl.description}\n" + "\n".join(lvl.examples) + for lvl in permuted_levels + ) + return permuted_body + + def strip_anchors(self) -> Self: + """Return a new Rubric with anchored examples removed from every + level (and a regenerated body_markdown that omits the ``### Example`` + sections). Used by the calibration runner's `use_anchors=false` + ablation row to measure the contribution of anchored examples. + + source_hash naturally diverges because body_markdown changes — so + ScoreResults from the stripped rubric carry a different + rubric_version, and the calibration report can bucket them + correctly without requiring a separate provenance field. + """ + fm_match = re.match(r"^(---\n.+?\n---\n)(.*)$", self.body_markdown, re.DOTALL) + head = fm_match.group(1) if fm_match else "" + rest = fm_match.group(2) if fm_match else self.body_markdown + intro = re.split(r"^## Score ", rest, maxsplit=1, flags=re.MULTILINE)[0] + # Render each level with its description but no examples. 
+ stripped_body = head + intro + "\n".join( + f"## Score {lvl.score}\n{lvl.description}\n" for lvl in self.levels + ) + stripped_levels = [ + RubricLevel(score=lvl.score, description=lvl.description, examples=[]) + for lvl in self.levels + ] + return type(self)( + dimension=self.dimension, + scale=self.scale, + reference_based=self.reference_based, + abstain_allowed=self.abstain_allowed, + levels=stripped_levels, + body_markdown=stripped_body, + ) + + +class Judge(ABC): + """Per-dimension LLM judge. Concrete subclasses implement score() + for one rubric dimension; they are thin (~30 lines) and not + factored against a shared base method (see design doc for why). + + Three calibration knobs are accepted at construction so the + calibration runner can run baseline-vs-ablation rows from the same + code path without monkey-patching: + + - ``use_cot`` (default True) — when False, the JSON schema requested + from the model omits the ``reasoning`` and ``evidence_quotes`` + fields, ablating the chain-of-thought-before-score discipline. + - ``abstain_allowed_override`` (default None) — when set, overrides + the rubric's ``abstain_allowed`` flag for this judge's calls. Used + by the ``baseline_no_abstain`` ablation row. + - The ``use_anchors`` knob is implemented by passing a stripped + rubric (via ``Rubric.strip_anchors()``) at construction time, not + via a separate flag here — that way ScoreResult.rubric_version + naturally distinguishes anchored vs stripped variants. + """ + + def __init__( + self, + judge_provider: "LLMProvider", + rubric: Rubric, + model_id: str, + *, + use_cot: bool = True, + abstain_allowed_override: bool | None = None, + ) -> None: + self.judge_provider = judge_provider + self.rubric = rubric + self.model_id = model_id + self.use_cot = use_cot + self.abstain_allowed_override = abstain_allowed_override + # judge_id format: ``{model_id}_{dimension}`` — load-bearing for + # the calibration report's per-judge κ breakdown. 
Ablation knobs + # do NOT enter the judge_id; the row label + ScoreResult. + # rubric_version (which differs for stripped anchors) carry that + # signal. This keeps the per-judge bucketing stable across + # baseline + ablation rows for the same model. + self.judge_id = f"{model_id}_{rubric.dimension}" + + @property + def effective_abstain_allowed(self) -> bool: + """Whether abstain is permitted for this judge's calls; the + override (when set) takes precedence over the rubric's flag. + """ + if self.abstain_allowed_override is not None: + return self.abstain_allowed_override + return self.rubric.abstain_allowed + + def _json_schema_clause(self, valid_scores_str: str) -> str: + """Render the trailing JSON-schema instruction for the prompt. + + With ``use_cot=True`` (default) the schema asks for reasoning + and evidence_quotes before the score, so the model's response + conditions the score on the reasoning. With ``use_cot=False`` + only the score field is requested — used for the ``no_cot`` + ablation row. + """ + if self.use_cot: + return ( + f'JSON object: {{"reasoning": "...", ' + f'"evidence_quotes": [...], "score": {valid_scores_str}}}.' + ) + return f'JSON object: {{"score": {valid_scores_str}}}.' + + @abstractmethod + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + *, + prompt_seed: int = 0, + ) -> ScoreResult: + """Score one (item, output) pair against this judge's rubric. + + Returns a ScoreResult whose system_output_hash is computed from + (item.id, output.answer, sorted(output.sources)). Failures map + to abstain via the abstain-reason constants; provider non- + retryable errors raise (caller bug, not noise). + """ + ... + + +class MockJudge(Judge): + """Pre-baked-verdict judge for deterministic tests. No API calls. + + Constructor takes verdicts: dict[item_id, ScoreResult]. score() + raises LookupError on missing keys — never returns a default — + so test fixtures are self-checking. 
A separate fixture-validation + test (test_mockjudge_coverage.py) walks item.id across all goldens + and asserts every MockJudge instance has coverage for the items + its tests reference. + + Mirrors the MockProvider pattern at agent_bench/core/provider.py. + """ + + def __init__(self, verdicts: dict[str, ScoreResult]) -> None: + # MockJudge does not need provider/rubric/model_id; supply + # placeholder values so the ABC's __init__ doesn't matter. + self.judge_provider = None # type: ignore[assignment] + self.rubric = None # type: ignore[assignment] + self.model_id = "mock" + self.judge_id = "mock_judge" + self._verdicts = verdicts + + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + *, + prompt_seed: int = 0, + ) -> ScoreResult: + if item.id not in self._verdicts: + raise LookupError( + f"MockJudge has no pre-baked verdict for item_id {item.id!r}; " + f"available: {sorted(self._verdicts.keys())[:5]}" + + (" ..." if len(self._verdicts) > 5 else "") + ) + return self._verdicts[item.id] + + +# --- _call_judge_with_retry helper --- + +_STRICT_REPROMPT_SUFFIX = ( + "\n\nSTRICT FORMATTING NOTE: respond ONLY with a JSON object matching " + "the schema; reasoning first, then evidence_quotes, then score. " + "Do not wrap the JSON in a markdown code fence." +) + + +_MARKDOWN_FENCE_RE = re.compile(r"^\s*```(?:json|JSON)?\s*\n(.*?)\n```\s*$", re.DOTALL) + + +def _strip_markdown_fence(text: str) -> str: + """Strip a leading/trailing ```json ... ``` markdown fence if present. + + Some chat models wrap structured JSON in a markdown code fence even + when the prompt asks for a bare JSON object. The judge parser uses + json.loads on the raw content, which fails at char 0 on the literal + backtick. This helper unwraps the fence so the parse can proceed. + Idempotent: returns text unchanged if no fence is present. 
+ """ + m = _MARKDOWN_FENCE_RE.match(text.strip()) + return m.group(1) if m else text + + +async def _call_judge_with_retry( + *, + provider: "LLMProvider", + prompt: str, + valid_scores: set[int], + judge_id: str, + rubric_version: str, + prompt_seed: int, + system_output_hash: str, + item_id: str, + abstain_allowed: bool = True, + max_tokens: int = 1024, +) -> ScoreResult: + """Send prompt to provider; one retry with strict reprompt on + schema-parse / score-out-of-range; abstain on persistent failure + or provider exhaustion. Re-raises unknown exceptions (caller bugs). + + max_tokens defaults to 1024 (was 512 pre-v1.1). The v1.1 groundedness + rubric ships with calibration anchors whose verbose thinking traces + elicit longer model reasoning in turn; 512 truncated the JSON + response mid-reasoning and caused 78/82 schema_parse_failed + abstains in the first run after the rubric clarification. 1024 leaves + enough headroom; bump again if a future rubric revision pushes + reasoning longer. + """ + accumulated_cost = 0.0 + accumulated_latency = 0.0 + + for attempt in range(2): # 2 = original + one retry + send_prompt = prompt if attempt == 0 else prompt + _STRICT_REPROMPT_SUFFIX + start = time.perf_counter() + try: + response = await provider.complete( + [Message(role=Role.USER, content=send_prompt)], + temperature=0.0, + max_tokens=max_tokens, + ) + except (ProviderRateLimitError, ProviderTimeoutError) as e: + return ScoreResult( + reasoning=f"{ABSTAIN_REASON_PROVIDER_EXHAUSTED}{type(e).__name__}: {e}", + evidence_quotes=[], + score="Unknown", + judge_id=judge_id, + rubric_version=rubric_version, + prompt_seed=prompt_seed, + system_output_hash=system_output_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency + (time.perf_counter() - start) * 1000, + ) + # Other exceptions (caller bugs like 401, 400) propagate. 
+ accumulated_cost += response.usage.estimated_cost_usd + accumulated_latency += (time.perf_counter() - start) * 1000 + last_raw = response.content[:300] + + # Parse — reasoning and evidence_quotes are optional so judges + # configured with use_cot=False (which prompt for {"score": ...} + # only) don't fail parsing on the missing key. + # + # Some models (observed on Haiku 4.5 under the v1.1 rubric) wrap + # their JSON in a ```json ... ``` markdown fence. Strip the fence + # before parsing rather than abstaining on a syntactically valid + # but conventionally formatted response. + content = _strip_markdown_fence(response.content) + try: + data = _json.loads(content) + reasoning = str(data.get("reasoning", "")) + evidence_quotes = list(data.get("evidence_quotes", [])) + raw_score = data["score"] + except (_json.JSONDecodeError, KeyError, TypeError, AttributeError) as e: + cause = ABSTAIN_REASON_SCHEMA_PARSE + if attempt == 0: + logger.warning( + "judge_first_attempt_failure", + judge_id=judge_id, + item_id=item_id, + provider=type(provider).__name__, + failure_cause=cause, + attempt_index=1, + ) + continue + return ScoreResult( + reasoning=f"{cause}raw={last_raw!r} parse_error={e}", + evidence_quotes=[], + score="Unknown", + judge_id=judge_id, + rubric_version=rubric_version, + prompt_seed=prompt_seed, + system_output_hash=system_output_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency, + ) + + # Score validation + if raw_score == "Unknown": + if not abstain_allowed: + cause = ABSTAIN_REASON_OUT_OF_RANGE + if attempt == 0: + logger.warning( + "judge_first_attempt_failure", + judge_id=judge_id, + item_id=item_id, + provider=type(provider).__name__, + failure_cause=cause, + attempt_index=1, + ) + continue + return ScoreResult( + reasoning=( + f"{cause}model returned 'Unknown' but rubric " + f"abstain_allowed=False" + ), + evidence_quotes=[], + score="Unknown", + judge_id=judge_id, + rubric_version=rubric_version, + prompt_seed=prompt_seed, +
system_output_hash=system_output_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency, + ) + # Genuine abstain — no prefix, no retry + return ScoreResult( + reasoning=reasoning, + evidence_quotes=evidence_quotes, + score="Unknown", + judge_id=judge_id, + rubric_version=rubric_version, + prompt_seed=prompt_seed, + system_output_hash=system_output_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency, + ) + + try: + score_int = int(raw_score) + except (ValueError, TypeError): + cause = ABSTAIN_REASON_OUT_OF_RANGE + if attempt == 0: + logger.warning( + "judge_first_attempt_failure", + judge_id=judge_id, + item_id=item_id, + provider=type(provider).__name__, + failure_cause=cause, + attempt_index=1, + ) + continue + return ScoreResult( + reasoning=f"{cause}non-int score: {raw_score!r}", + evidence_quotes=[], + score="Unknown", + judge_id=judge_id, + rubric_version=rubric_version, + prompt_seed=prompt_seed, + system_output_hash=system_output_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency, + ) + + if score_int not in valid_scores: + cause = ABSTAIN_REASON_OUT_OF_RANGE + if attempt == 0: + logger.warning( + "judge_first_attempt_failure", + judge_id=judge_id, + item_id=item_id, + provider=type(provider).__name__, + failure_cause=cause, + attempt_index=1, + ) + continue + return ScoreResult( + reasoning=( + f"{cause}model returned {score_int}, valid levels " + f"{sorted(valid_scores)}" + ), + evidence_quotes=[], + score="Unknown", + judge_id=judge_id, + rubric_version=rubric_version, + prompt_seed=prompt_seed, + system_output_hash=system_output_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency, + ) + + # Success + return ScoreResult( + reasoning=reasoning, + evidence_quotes=evidence_quotes, + score=score_int, + judge_id=judge_id, + rubric_version=rubric_version, + prompt_seed=prompt_seed, + system_output_hash=system_output_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency, + ) + + raise 
RuntimeError("_call_judge_with_retry: unreachable code path") diff --git a/agent_bench/evaluation/judges/citation_faithfulness.py b/agent_bench/evaluation/judges/citation_faithfulness.py new file mode 100644 index 0000000000000000000000000000000000000000..19e79a86ef35a4c17df88c5d0e5a3f60c958770b --- /dev/null +++ b/agent_bench/evaluation/judges/citation_faithfulness.py @@ -0,0 +1,188 @@ +"""CitationFaithfulnessJudge — binary, per-(claim,citation) all-or-nothing.""" + +from __future__ import annotations + +import re +from typing import TYPE_CHECKING, Literal + +import structlog + +from agent_bench.evaluation.judges.base import ( + Judge, + ScoreResult, + _call_judge_with_retry, +) +from agent_bench.evaluation.judges.groundedness import _system_output_hash + +if TYPE_CHECKING: + from agent_bench.agents.orchestrator import AgentResponse + from agent_bench.evaluation.harness import GoldenQuestion + +logger = structlog.get_logger() + +_CITATION_PATTERN = re.compile(r"\[source:\s*([^\]]+)\]") + + +def _extract_claims_with_citations(answer: str) -> list[tuple[str, str]]: + """Return list of (claim_text, cited_source) pairs. + + A "claim" is the sentence (including its terminating punctuation) + immediately preceding a [source:] citation. Prior citation tags + inside `before` are stripped so multi-citation answers yield clean + claim strings. + """ + pairs: list[tuple[str, str]] = [] + for match in _CITATION_PATTERN.finditer(answer): + cited = match.group(1).strip() + before = answer[: match.start()] + # Strip prior [source:...] 
tags so they don't pollute the claim + before_clean = _CITATION_PATTERN.sub("", before) + last_end = max( + before_clean.rfind("."), before_clean.rfind("!"), before_clean.rfind("?") + ) + if last_end >= 0: + prev_end = max( + before_clean.rfind(".", 0, last_end), + before_clean.rfind("!", 0, last_end), + before_clean.rfind("?", 0, last_end), + ) + claim = before_clean[prev_end + 1 : last_end + 1].strip() + else: + claim = before_clean.strip() + pairs.append((claim, cited)) + return pairs + + +class CitationFaithfulnessJudge(Judge): + """Aggregates per-(claim, citation) judgments into one item-level + binary ScoreResult. Per-pair detail is in evidence_quotes. + + All-or-nothing aggregation: any unfaithful citation → score 0. + The rubric documents the rule explicitly. + """ + + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + *, + prompt_seed: int = 0, + ) -> ScoreResult: + pairs = _extract_claims_with_citations(output.answer) + # Map cited source name to its retrieved chunk text via output.source_chunks + # (assumes index alignment with output.sources, matching harness + # convention). If the same source appears multiple times in the + # sources list with distinct chunks (legitimate when multiple + # retrievals match the same doc), `setdefault` keeps only the first + # — every "[source: X]" claim then evaluates against that one chunk, + # a false-failure risk. Warn so the operator notices. + source_names = [s.source for s in output.sources] + if len(set(source_names)) < len(source_names): + from collections import Counter + + duplicates = sorted( + name for name, n in Counter(source_names).items() if n > 1 + ) + logger.warning( + "citation_faithfulness_lossy_source_lookup", + item_id=item.id, + duplicate_source_names=duplicates, + detail=( + "source name appears multiple times in output.sources " + "with distinct chunks; only the first chunk will be " + "associated with the name during citation evaluation." 
+ ), + ) + source_to_chunk: dict[str, str] = {} + for src_ref, chunk in zip(output.sources, output.source_chunks): + source_to_chunk.setdefault(src_ref.source, chunk) + + per_pair_results: list[ScoreResult] = [] + sys_hash = _system_output_hash( + item.id, output.answer, [s.source for s in output.sources] + ) + + if not pairs: + return ScoreResult( + reasoning="no_citations_in_answer", + evidence_quotes=[], + score=1, + judge_id=self.judge_id, + rubric_version=self.rubric.source_hash, + prompt_seed=prompt_seed, + system_output_hash=sys_hash, + cost_usd=0.0, + latency_ms=0.0, + ) + + accumulated_cost = 0.0 + accumulated_latency = 0.0 + any_unfaithful = False + for claim, cited in pairs: + # Empty claim → leading-citation case (e.g., answer starts with + # "[source: a.md] ..." with no prior content). There is no claim + # to evaluate against the chunk; the well-defined verdict is + # vacuously faithful. Skip the API call; record a synthetic + # ScoreResult so per-pair detail still appears in evidence_quotes. + if not claim: + per_pair_results.append( + ScoreResult( + reasoning="empty_claim_vacuously_faithful", + evidence_quotes=[], + score=1, + judge_id=self.judge_id, + rubric_version=self.rubric.source_hash, + prompt_seed=prompt_seed, + system_output_hash=sys_hash, + cost_usd=0.0, + latency_ms=0.0, + ) + ) + continue + chunk = source_to_chunk.get(cited, "") + schema_clause = self._json_schema_clause('0 or 1 or "Unknown"') + prompt = ( + f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n" + f"---\n\n" + f"## Claim (from agent's answer)\n{claim}\n\n" + f"## Cited chunk content\n{chunk}\n\n" + f"Does the cited chunk support the claim? 
Respond with ONLY a " + f"{schema_clause}" + ) + sub_result = await _call_judge_with_retry( + provider=self.judge_provider, + prompt=prompt, + valid_scores={0, 1}, + judge_id=self.judge_id, + rubric_version=self.rubric.source_hash, + prompt_seed=prompt_seed, + system_output_hash=sys_hash, + item_id=f"{item.id}::{cited}", + abstain_allowed=self.effective_abstain_allowed, + ) + per_pair_results.append(sub_result) + accumulated_cost += sub_result.cost_usd + accumulated_latency += sub_result.latency_ms + if sub_result.score == 0: + any_unfaithful = True + + aggregate_score: int | Literal["Unknown"] = 0 if any_unfaithful else 1 + # Any sub-call abstain → propagate Unknown (consistent with strict-quorum) + if any(r.abstained for r in per_pair_results): + aggregate_score = "Unknown" + + return ScoreResult( + reasoning=( + f"all_or_nothing aggregate over {len(per_pair_results)} (claim, citation) pairs; " + f"unfaithful={sum(1 for r in per_pair_results if r.score == 0)}, " + f"abstained={sum(1 for r in per_pair_results if r.abstained)}" + ), + evidence_quotes=[r.reasoning[:120] for r in per_pair_results], + score=aggregate_score, + judge_id=self.judge_id, + rubric_version=self.rubric.source_hash, + prompt_seed=prompt_seed, + system_output_hash=sys_hash, + cost_usd=accumulated_cost, + latency_ms=accumulated_latency, + ) diff --git a/agent_bench/evaluation/judges/completeness.py b/agent_bench/evaluation/judges/completeness.py new file mode 100644 index 0000000000000000000000000000000000000000..b2d6766a0dd8557324e669bd27d3f8695977e144 --- /dev/null +++ b/agent_bench/evaluation/judges/completeness.py @@ -0,0 +1,62 @@ +"""CompletenessJudge — three-point, reference-based on item.reference_answer.""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from agent_bench.evaluation.judges.base import ( + Judge, + ScoreResult, + _call_judge_with_retry, +) +from agent_bench.evaluation.judges.groundedness import _system_output_hash + +if TYPE_CHECKING: + from 
agent_bench.agents.orchestrator import AgentResponse + from agent_bench.evaluation.harness import GoldenQuestion + + +# v1.1.1: recency-positioned restatement of the rubric's "paraphrase +# allowed" semantics. Earned by the 3A probe (3/5 disputed items shifted +# 1→2 on gpt-4o-mini) which validated that gpt-4o-mini's directional +# downward bias on 3-point completeness was prompt-positionally +# correctable rather than model-intrinsic. The clause appears immediately +# before the score instruction so the conditioning isn't lost across the +# rubric body and the reasoning step. See DECISIONS "Plan 3A" entry. +PARAPHRASE_RECENCY_CLAUSE = ( + "Note: a paraphrase that captures the same meaning as a gold-answer " + "point counts as covered. Score on content equivalence, not surface form." +) + + +class CompletenessJudge(Judge): + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + *, + prompt_seed: int = 0, + ) -> ScoreResult: + schema_clause = self._json_schema_clause('0 or 1 or 2 or "Unknown"') + prompt = ( + f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n" + f"---\n\n" + f"## Reference answer (gold)\n{item.reference_answer}\n\n" + f"## Answer to score\n{output.answer}\n\n" + f"{PARAPHRASE_RECENCY_CLAUSE}\n\n" + f"Score this answer against the rubric above. 
Respond with ONLY a " + f"{schema_clause}" + ) + return await _call_judge_with_retry( + provider=self.judge_provider, + prompt=prompt, + valid_scores={0, 1, 2}, + judge_id=self.judge_id, + rubric_version=self.rubric.source_hash, + prompt_seed=prompt_seed, + system_output_hash=_system_output_hash( + item.id, output.answer, [s.source for s in output.sources] + ), + item_id=item.id, + abstain_allowed=self.effective_abstain_allowed, + ) diff --git a/agent_bench/evaluation/judges/groundedness.py b/agent_bench/evaluation/judges/groundedness.py new file mode 100644 index 0000000000000000000000000000000000000000..4048d3bddf09fef9e110464afd9025c0c63af63b --- /dev/null +++ b/agent_bench/evaluation/judges/groundedness.py @@ -0,0 +1,57 @@ +"""GroundednessJudge — binary, reference-based on item.source_snippets.""" + +from __future__ import annotations + +import hashlib +from typing import TYPE_CHECKING + +from agent_bench.evaluation.judges.base import ( + Judge, + ScoreResult, + _call_judge_with_retry, +) + +if TYPE_CHECKING: + from agent_bench.agents.orchestrator import AgentResponse + from agent_bench.evaluation.harness import GoldenQuestion + + +def _system_output_hash(item_id: str, answer: str, sources: list[str]) -> str: + sorted_sources = sorted(sources) + canonical = f"{item_id}\x00{answer}\x00{','.join(sorted_sources)}" + return hashlib.sha256(canonical.encode("utf-8")).hexdigest() + + +class GroundednessJudge(Judge): + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + *, + prompt_seed: int = 0, + ) -> ScoreResult: + snippets_block = "\n".join( + f"[{i + 1}] {s}" for i, s in enumerate(item.source_snippets) + ) + schema_clause = self._json_schema_clause('0 or 1 or "Unknown"') + prompt = ( + f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n" + f"---\n\n" + f"## Gold source snippets\n{snippets_block}\n\n" + f"## Answer to score\n{output.answer}\n\n" + f"Score this answer against the rubric above. 
Respond with ONLY a " + f"{schema_clause}" + ) + return await _call_judge_with_retry( + provider=self.judge_provider, + prompt=prompt, + valid_scores={0, 1}, + judge_id=self.judge_id, + rubric_version=self.rubric.source_hash, + prompt_seed=prompt_seed, + system_output_hash=_system_output_hash( + item.id, output.answer, [s.source for s in output.sources] + ), + item_id=item.id, + abstain_allowed=self.effective_abstain_allowed, + ) diff --git a/agent_bench/evaluation/judges/relevance.py b/agent_bench/evaluation/judges/relevance.py new file mode 100644 index 0000000000000000000000000000000000000000..672da2926ae9f2372f04447aa55fbda9083fe69c --- /dev/null +++ b/agent_bench/evaluation/judges/relevance.py @@ -0,0 +1,48 @@ +"""RelevanceJudge — three-point, reference-free.""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from agent_bench.evaluation.judges.base import ( + Judge, + ScoreResult, + _call_judge_with_retry, +) +from agent_bench.evaluation.judges.groundedness import _system_output_hash + +if TYPE_CHECKING: + from agent_bench.agents.orchestrator import AgentResponse + from agent_bench.evaluation.harness import GoldenQuestion + + +class RelevanceJudge(Judge): + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + *, + prompt_seed: int = 0, + ) -> ScoreResult: + schema_clause = self._json_schema_clause('0 or 1 or 2 or "Unknown"') + prompt = ( + f"{self.rubric.render_prompt(level_permutation_seed=prompt_seed)}\n\n" + f"---\n\n" + f"## Question\n{item.question}\n\n" + f"## Answer to score\n{output.answer}\n\n" + f"Score this answer against the rubric above. 
Respond with ONLY a " + f"{schema_clause}" + ) + return await _call_judge_with_retry( + provider=self.judge_provider, + prompt=prompt, + valid_scores={0, 1, 2}, + judge_id=self.judge_id, + rubric_version=self.rubric.source_hash, + prompt_seed=prompt_seed, + system_output_hash=_system_output_hash( + item.id, output.answer, [s.source for s in output.sources] + ), + item_id=item.id, + abstain_allowed=self.effective_abstain_allowed, + ) diff --git a/agent_bench/evaluation/metrics.py b/agent_bench/evaluation/metrics.py index dfb7911d48221fcf9a0f44602cc85b151fa6b5d5..135b9e06536ae41733c47985e4329bade29efabe 100644 --- a/agent_bench/evaluation/metrics.py +++ b/agent_bench/evaluation/metrics.py @@ -1,15 +1,19 @@ -"""Deterministic and LLM-judge evaluation metrics.""" +"""Deterministic evaluation metrics. + +The continuous-scale LLM-judge functions (answer_faithfulness, +answer_correctness, _judge_call) were removed in the judge-layer v1 +supersession. The replacement lives at agent_bench/evaluation/judges/ +as discrete-anchored, per-dimension judges with κ-validated calibration. +See docs/plans/2026-05-04-judge-layer-v1-design.md for the rationale. +""" from __future__ import annotations -import json import re import structlog from agent_bench.agents.orchestrator import AgentResponse -from agent_bench.core.provider import LLMProvider -from agent_bench.core.types import Message, Role logger = structlog.get_logger() @@ -125,84 +129,4 @@ def calculator_used_when_expected( return "calculator" in response.tools_used -# --- LLM-judge metrics (costs money, manual) --- - -_FAITHFULNESS_PROMPT = """\ -You are evaluating whether an AI assistant's answer \ -is fully supported by the provided source passages. 
- -Source passages: -{chunks} - -Answer to evaluate: -{answer} - -Score the answer's faithfulness to the sources from 0.0 to 1.0: -- 1.0: Every claim is directly supported by the sources -- 0.5: Some claims are supported, others are extrapolated -- 0.0: The answer contradicts or is entirely unsupported - -Respond with ONLY a JSON object: -{{"score": 0.8, "reasoning": "brief explanation"}}""" - -_CORRECTNESS_PROMPT = """\ -You are evaluating whether an AI assistant's answer \ -is factually correct compared to a reference answer. - -Reference answer: -{reference} - -Answer to evaluate: -{answer} - -Score correctness from 0.0 to 1.0: -- 1.0: All key facts match the reference -- 0.5: Some facts are correct, some are missing or wrong -- 0.0: The answer is factually incorrect - -Respond with ONLY a JSON object: -{{"score": 0.8, "reasoning": "brief explanation"}}""" - - -async def answer_faithfulness( - answer: str, - source_chunks: list[str], - judge_provider: LLMProvider, -) -> float | None: - """LLM-judged: is the answer supported by the sources? 0.0-1.0.""" - chunks_text = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(source_chunks)) - prompt = _FAITHFULNESS_PROMPT.format(chunks=chunks_text, answer=answer) - - return await _judge_call(prompt, judge_provider) - - -async def answer_correctness( - answer: str, - reference_answer: str, - judge_provider: LLMProvider, -) -> float | None: - """LLM-judged: is the answer factually correct vs reference? 
0.0-1.0.""" - prompt = _CORRECTNESS_PROMPT.format(reference=reference_answer, answer=answer) - - return await _judge_call(prompt, judge_provider) - - -async def _judge_call(prompt: str, provider: LLMProvider) -> float | None: - """Make a judge call and parse the JSON response.""" - try: - response = await provider.complete( - [Message(role=Role.USER, content=prompt)], - temperature=0.0, - max_tokens=256, - ) - data = json.loads(response.content) - score = float(data["score"]) - reasoning = data.get("reasoning", "") - logger.info("llm_judge_result", score=score, reasoning=reasoning) - return max(0.0, min(1.0, score)) - except (json.JSONDecodeError, KeyError, ValueError, TypeError) as e: - logger.warning("llm_judge_parse_error", error=str(e), raw=response.content[:200]) - return None - except Exception as e: - logger.error("llm_judge_call_error", error=str(e)) - return None +# LLM-judge metrics moved to agent_bench/evaluation/judges/ in judge-layer v1. diff --git a/agent_bench/evaluation/report.py b/agent_bench/evaluation/report.py index d0ff5f368049b8533b028cfabc4dea77ca91d288..57725eecb6bc3914266c132992c2d245cc025645 100644 --- a/agent_bench/evaluation/report.py +++ b/agent_bench/evaluation/report.py @@ -52,9 +52,18 @@ def generate_report( total_cost = sum(r.tokens_used.estimated_cost_usd for r in results) avg_cost = total_cost / max(len(results), 1) - # Optional faithfulness - faith_scores = [r.faithfulness for r in positive if r.faithfulness is not None] - avg_faith = _safe_avg(faith_scores) if faith_scores else None + # Optional groundedness (replaces continuous faithfulness in v1). + # Discrete-anchored binary 0/1; abstain ('Unknown' score) is excluded + # from the average. The float() cast narrows ScoreResult.score from + # `int | Literal["Unknown"]` to float for _safe_avg — abstained=False + # already guarantees the value is int but mypy doesn't propagate that. 
+ grounded_scores: list[float] = [ + float(r.judge_scores["groundedness"].score) # type: ignore[arg-type] + for r in positive + if "groundedness" in r.judge_scores + and not r.judge_scores["groundedness"].abstained + ] + avg_grounded = _safe_avg(grounded_scores) if grounded_scores else None lines.append("| Metric | Value |") lines.append("|--------|-------|") @@ -65,8 +74,8 @@ def generate_report( lines.append(f"| Citation Accuracy | {avg_citation:.2f} |") lines.append(f"| Grounded Refusal Rate | {refusal_rate}/{len(negative)} |") lines.append(f"| Calculator Accuracy | {calc_correct}/{len(calc_qs)} |") - if avg_faith is not None: - lines.append(f"| Answer Faithfulness (LLM) | {avg_faith:.2f} |") + if avg_grounded is not None: + lines.append(f"| Answer Groundedness (LLM judge) | {avg_grounded:.2f} |") lines.append(f"| Latency p50 | {p50:,.0f} ms |") lines.append(f"| Latency p95 | {p95:,.0f} ms |") lines.append(f"| Cost per query | ${avg_cost:.4f} |") diff --git a/agent_bench/evaluation/rubrics/citation_faithfulness.md b/agent_bench/evaluation/rubrics/citation_faithfulness.md new file mode 100644 index 0000000000000000000000000000000000000000..64d7933ef125d5f8991d4493d234b8485ddb8bc3 --- /dev/null +++ b/agent_bench/evaluation/rubrics/citation_faithfulness.md @@ -0,0 +1,57 @@ +--- +dimension: citation_faithfulness +scale: binary +reference_based: true +abstain_allowed: true +--- + +# Citation faithfulness (binary, all-or-nothing aggregation per item) + +For each [source: X.md] citation in the answer, is the cited chunk's +content actually relevant to the claim it supports? This is stricter +than the deterministic citation_accuracy metric, which only checks +that the cited chunk_id appears in the retrieved set — citation +faithfulness checks the **relevance** of the chunk to the claim. + +**Aggregation rule (item-level):** any unfaithful citation in the +answer → item score = 0. 
A single bad citation in a multi-citation +answer is a real failure that all-or-nothing surfaces; treating it as +partial would obscure the failure mode. + +## Score 0 + +The cited chunk's content does not support the adjacent claim. + +### Example A — citation drift + +Claim: "The default port is 8080." +Cited chunk content: "The dashboard supports OAuth and SAML authentication." + +Score=0 because the chunk talks about authentication, not the port. +The citation is misleading even though the claim happens to be true. + +### Example B — wrong topic citation + +Claim: "StatefulSet pods get ordinal indices." +Cited chunk content: "Deployments support rolling updates with maxSurge and maxUnavailable parameters." + +Score=0 — the cited chunk is about Deployments, not StatefulSets. +The citation does not support the claim about StatefulSet ordinals. + +## Score 1 + +The cited chunk's content directly supports the adjacent claim. + +### Example C — single accurate citation + +Claim: "The default port is 8080." +Cited chunk content: "The dashboard listens on port 8080 by default." + +Score=1. + +### Example D — paraphrase-supported citation + +Claim: "Each pod has a stable hostname." +Cited chunk content: "StatefulSet pods receive hostnames derived from the StatefulSet name plus their ordinal, and these hostnames persist across reschedules." + +Score=1 — the chunk supports the claim via paraphrase. diff --git a/agent_bench/evaluation/rubrics/completeness.md b/agent_bench/evaluation/rubrics/completeness.md new file mode 100644 index 0000000000000000000000000000000000000000..ff17caf601f55ab9199dde7ce6f3e927bbf398bb --- /dev/null +++ b/agent_bench/evaluation/rubrics/completeness.md @@ -0,0 +1,71 @@ +--- +dimension: completeness +scale: three_point +reference_based: true +abstain_allowed: true +--- + +# Completeness (three-point) + +Score how much of the gold reference answer is covered by the agent's +answer. 
This is reference-based — the judge sees the gold reference +and the agent's answer; score on **coverage of facts** in the +reference, not on additional facts the agent may have included. + +The judge does not penalize the agent for adding correct extra detail +(that's a separate concern). Score only on what fraction of the +reference's points are present. + +## Score 0 + +None of the reference's key points are present in the answer. + +### Example A — answer addresses different facts + +Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage." +Answer: "Kubernetes uses YAML manifests to declare resources." + +Score=0 — none of the three reference points (ordinal, hostname, storage) appear. + +### Example B — refusal that covers nothing + +Reference: "The default port is 8080." +Answer: "I cannot find that information." + +Score=0 — the reference's single point (port=8080) is not in the answer. + +## Score 1 + +Some but not all of the reference's points are present. + +### Example C — partial coverage + +Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage." +Answer: "StatefulSet pods get ordinal indices." + +Score=1 — one of three points covered. + +### Example D — half a comparison + +Reference: "Deployments manage stateless replicas; StatefulSets manage stateful pods with stable identities." +Answer: "Deployments manage stateless replicas with rolling updates." + +Score=1 — Deployment side covered, StatefulSet side missing. + +## Score 2 + +All of the reference's key points are present (paraphrase allowed). + +### Example E — full coverage with paraphrase + +Reference: "StatefulSet pods receive ordinal indices, stable hostnames, and persistent storage." +Answer: "Each pod gets an ordinal number, a stable DNS name, and storage that survives restarts." + +Score=2 — all three points covered with paraphrase. 
+ +### Example F — full coverage of single-fact reference + +Reference: "The default port is 8080." +Answer: "Port 8080." + +Score=2 — the only reference point is covered. diff --git a/agent_bench/evaluation/rubrics/groundedness.md b/agent_bench/evaluation/rubrics/groundedness.md new file mode 100644 index 0000000000000000000000000000000000000000..21d9ac756c4b7f586a1adee63e900966feb9a5d1 --- /dev/null +++ b/agent_bench/evaluation/rubrics/groundedness.md @@ -0,0 +1,142 @@ +--- +dimension: groundedness +scale: binary +reference_based: true +abstain_allowed: true +--- + +# Groundedness (binary) + +Score whether **every claim** in the agent's answer is entailed by the gold +source snippets attached to this item. Paraphrase is allowed; what matters +is content equivalence, not surface form. + +## Reference scope (strict, clarified in v1.1) + +Reference scope is the **gold snippets only**, not the broader corpus, not +the retrieved chunks, not the LLM's general domain knowledge. A claim that +is factually correct in the world but not entailed by the snippets **must +score 0**. The "must" forecloses the "well, mostly grounded" reading: a +single ungrounded claim in an otherwise solid answer fails the binary +rubric. + +The strict-entailment posture is a methodological choice. A claim that is +correct because the model happened to know it isn't grounded — it's lucky. +Strict-snippet groundedness measures *retrieval-grounded behavior*, not +LLM general knowledge passing through a RAG harness. + +## Trivial inference is entailment + +Some surface-form variations of a snippet's content are entailment, not +new claims. The test is **content equivalence**, not surface form: + +- **Paraphrase.** "X causes Y" ↔ "Y is caused by X". +- **Unit conversion.** "600 seconds" ↔ "10 minutes". +- **Syntactic variation.** Pluralization, tense, voice, declarative ↔ imperative. 
+- **Canonical name of the snippet's concept.** When the snippet describes + a field, header, or API element by configuration syntax (e.g., a + `max_age` table row), the canonical name (`Access-Control-Max-Age` HTTP + header) is the same content in different surface form. This is a + separate carve-out from pure paraphrase: it admits domain knowledge + tightly bound to the snippet's referent. + +> **v1.2 debt.** The trivial-inference clause — especially the +> canonical-name carve-out — is the strictest-rubric concession most +> likely to require revision in v1.2. If labelers find themselves +> applying it broadly to rescue answers from score-0, the clause is +> too permissive and should be tightened. + +**When to abstain (`"Unknown"`)**: if the answer is a refusal ("I don't +know" / "not in the documentation") and there is nothing to ground, score +abstain rather than 1. + +## Score 0 + +At least one claim in the answer is not entailed by any snippet, after +applying the trivial-inference clause. + +### Example A — calibration anchor `k8s_006` (dramatic over-extension) + +Question: "What is a ConfigMap in Kubernetes and what kind of data should you store in it?" + +Snippet: "A ConfigMap is an API object used to store non-confidential data in key-value pairs." + +Answer (excerpted): The agent gives a comprehensive multi-section answer +covering (i) the definition, (ii) three consumption methods (env vars, +command-line args, volumes), (iii) a warning not to store +passwords/tokens/certificates, (iv) a recommendation to use Secrets +instead, and (v) details about `data` and `binaryData` fields. + +Thinking trace: Score = 0. Only the definition (i) is entailed by the +snippet. Claims (ii)–(v) are factually correct against the underlying +`k8s_configmap.md` doc, but **none are entailed by the one-sentence +snippet**. The snippet does not describe consumption methods, security +guidance, or schema fields. 
The strict-conjunction rule applies: even +though most of the answer is well-supported by the broader corpus, the +gold-snippet scope is what the rubric measures, and the answer goes +dramatically beyond it. + +### Example B — calibration anchor `q006` (subtle embellishment) + +Question: "How does dependency caching work in FastAPI, and how can you disable it?" + +Snippet: "By default, if the same dependency is used multiple times within +a single request (e.g., both a route and a sub-dependency use +`Depends(get_db)`), FastAPI caches the result and calls the dependency +only once. To disable caching and force a fresh call each time, use +`Depends(get_db, use_cache=False)`." + +Answer (excerpted): Most of the answer is entailed by the snippet — the +default-caching behavior, the once-per-request semantics, the +`use_cache=False` disable mechanism, and an example. But the answer also +states that caching is "particularly useful for expensive operations like +database connections." + +Thinking trace: Score = 0. The "particularly useful for expensive +operations like database connections" claim is **not entailed** by the +snippet. The snippet describes *what* caching does (a mechanism); the +claim adds *why* it is useful (a use-case argument the snippet does not +make). The example uses `Depends(get_db)` as an identifier, not as a +normative argument that caching is useful for database connections. This +is the subtle case: most of the answer is good, but a single embellishment +that imports general LLM knowledge ("expensive operations like database +connections") fails strict snippet entailment. This is the boundary case +the rubric exists to catch — easier to mis-label than `k8s_006` because +the failure is one claim in the middle of a mostly-grounded answer. + +## Score 1 + +Every claim in the answer is entailed by at least one snippet, including +trivial inferences (paraphrase, unit conversion, syntactic variation, +canonical-name substitution). 
+ +### Example C — calibration anchor `q021` (trivial inference + canonical name) + +Question: "If the CORS max_age is 600 seconds, how many minutes does the browser cache preflight results?" + +Snippet (table row): `| max_age | 600 | Seconds the browser caches preflight results |` + +Answer: "If the CORS `max_age` is 600 seconds, the browser caches +preflight results for 10 minutes. The `max_age` directive in the +`Access-Control-Max-Age` header specifies the maximum time (in seconds) +that the results of a preflight request can be cached. So 600 seconds +÷ 60 seconds/minute = 10 minutes." + +Thinking trace: Score = 1. Every claim is entailed under the +trivial-inference clause: + +- "max_age=600 → 10 minutes" — unit conversion. +- "Access-Control-Max-Age header" — canonical-name carve-out: the snippet + describes `max_age` by its configuration syntax; the answer uses the + canonical HTTP header name for the same field. +- "specifies the maximum time (in seconds)" — paraphrase of "Seconds the + browser caches preflight results". +- "600 ÷ 60 = 10 minutes" — arithmetic, the same trivial-inference class + as unit conversion. + +The canonical-name carve-out is doing the heaviest lifting in this +example. Without it, "Access-Control-Max-Age" would be ungrounded +(domain knowledge not in the snippet text). With it, the answer is a +clean strict-snippet pass. This is exactly the v1.2-debt sentence above +— if many future labels rescue score-1 via canonical-name appeals, the +clause is over-rescuing and should be tightened. 
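Reviewer sketch (not part of the diff): the groundedness rubric above reduces to a strict conjunction over per-claim entailment verdicts, with abstain reserved for refusals. The snippet below is a minimal illustration under stated assumptions. `groundedness_label`, `claims_entailed`, and `is_refusal` are hypothetical names, and the per-claim verdicts are assumed to already exist; in the PR itself the LLM judge produces the item-level score directly. Treating an empty claim list like a refusal is also an assumption of the sketch, not something the rubric states.

```python
from __future__ import annotations

from typing import Literal


def groundedness_label(
    claims_entailed: list[bool],
    is_refusal: bool = False,
) -> int | Literal["Unknown"]:
    """Hypothetical helper: fold per-claim entailment verdicts into one label."""
    # A refusal leaves nothing to ground, so the rubric says abstain.
    # Handling an empty claim list the same way is an assumption of this sketch.
    if is_refusal or not claims_entailed:
        return "Unknown"
    # Strict conjunction: a single ungrounded claim fails the whole item.
    return 1 if all(claims_entailed) else 0


# q006-style case: four entailed claims plus one embellishment.
print(groundedness_label([True, True, True, True, False]))
# q021-style case: every claim entailed under the trivial-inference clause.
print(groundedness_label([True, True, True, True]))
# Refusal with nothing to ground.
print(groundedness_label([], is_refusal=True))
```

The `all(...)` call is what makes the `q006` anchor score 0 despite being mostly grounded: there is no partial credit on the binary scale.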
diff --git a/agent_bench/evaluation/rubrics/relevance.md b/agent_bench/evaluation/rubrics/relevance.md new file mode 100644 index 0000000000000000000000000000000000000000..f58e009cb8d09bc323d91279775aaefc1d34300d --- /dev/null +++ b/agent_bench/evaluation/rubrics/relevance.md @@ -0,0 +1,74 @@ +--- +dimension: relevance +scale: three_point +reference_based: false +abstain_allowed: true +--- + +# Relevance (three-point) + +Does the agent's answer address the user's question? This is reference-free +— the judge sees only the question and the answer, not gold snippets or a +reference answer. Score the topic-match, not the truth-value. + +## Score 0 + +Off-topic. The answer addresses a different question, is unintelligible, +or is a refusal that does not engage with the question's premise. + +### Example A — wrong topic + +Question: "How do I deploy to Kubernetes?" +Answer: "Python virtual environments isolate dependencies between projects." + +Score=0 — the answer is about Python venvs, not Kubernetes deployment. + +### Example B — refusal that ignores the question + +Question: "What's the default replica count for a StatefulSet?" +Answer: "I cannot help with that request." + +Score=0 — the refusal does not engage with the StatefulSet topic. A +proper grounded refusal ("the documentation does not specify a default +replica count for StatefulSets") would score higher. + +## Score 1 + +Partially relevant. The answer touches the question's topic but misses +the core ask, or addresses a related-but-different question. + +### Example C — adjacent but off-target + +Question: "How do I deploy a StatefulSet?" +Answer: "Kubernetes runs containerized workloads on a cluster of nodes." + +Score=1 because it's about Kubernetes but doesn't address StatefulSet +deployment specifically. + +### Example D — answers a sibling question + +Question: "What's the difference between Deployment and StatefulSet?" +Answer: "A Deployment manages stateless replicas with rolling updates." 
+ +Score=1 because it describes Deployment but doesn't compare it to +StatefulSet — only half the question is addressed. + +## Score 2 + +Directly addresses the question's core ask. + +### Example E — on-target single-fact answer + +Question: "What's the default port for kubelet?" +Answer: "Port 10250." + +Score=2 because it directly answers the question. + +### Example F — on-target comparison + +Question: "What's the difference between Deployment and StatefulSet?" +Answer: "Deployments manage stateless, interchangeable pods with rolling +updates; StatefulSets manage stateful pods with stable identities, +ordered rollouts, and persistent per-pod storage." + +Score=2 — both sides of the comparison are addressed. diff --git a/agent_bench/evaluation/variance/__init__.py b/agent_bench/evaluation/variance/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..9ce61a6b557e730e1348978c6f5d4fa77bd64ae0 --- /dev/null +++ b/agent_bench/evaluation/variance/__init__.py @@ -0,0 +1,9 @@ +"""Variance-control wrappers around Judge instances.""" + +from agent_bench.evaluation.variance.jury import Jury, jury +from agent_bench.evaluation.variance.rubric_permute import ( + PermutedJudge, + rubric_permute, +) + +__all__ = ["Jury", "PermutedJudge", "jury", "rubric_permute"] diff --git a/agent_bench/evaluation/variance/jury.py b/agent_bench/evaluation/variance/jury.py new file mode 100644 index 0000000000000000000000000000000000000000..e4e145014505b32080d08856c7881393fc3c4e18 --- /dev/null +++ b/agent_bench/evaluation/variance/jury.py @@ -0,0 +1,181 @@ +"""Jury — multi-judge aggregator with strict-quorum default and sidecar.""" + +from __future__ import annotations + +import asyncio +from pathlib import Path +from typing import TYPE_CHECKING, Literal + +import structlog + +from agent_bench.evaluation.judges.base import Judge, ScoreResult +from agent_bench.evaluation.variance.rubric_permute import _aggregate_scores + +if TYPE_CHECKING: + from 
agent_bench.agents.orchestrator import AgentResponse + from agent_bench.evaluation.harness import GoldenQuestion + +_DEFAULT_SIDECAR_TEMPLATE = "results/calibration_v1_judge_{aggregation}_members.jsonl" + +logger = structlog.get_logger() + + +def _discretize_mean(mean: float, scale: str) -> int: + """Discretize a float mean to a discrete level per scale, ties → lower + (mirrors `_aggregate_scores`'s policy without going through int(round()) + which would invoke Python's banker's rounding and silently violate the + tie-breaking contract). + """ + if scale == "binary": + return 1 if mean > 0.5 else 0 + floor = int(mean) + frac = mean - floor + return floor + 1 if frac > 0.5 else floor + + +class Jury: + """Aggregates a list of Judge instances into one ScoreResult per item. + + Strict quorum default (quorum = len(judges)): any member abstain → + aggregate abstain. The parameter exists in v1 so v1.1's 3-judge jury + can shift to quorum=2 (majority) without rearchitecting failure + semantics. + + Per-member ScoreResults always written to sidecar (successes and + failure-as-abstains alike). Provider non-retryable exceptions in + any member raise immediately. asyncio.gather() does not cancel the + sibling tasks; they run to completion and their results are discarded. 
+ """ + + def __init__( + self, + judges: list[Judge], + aggregation: Literal["mean", "kappa_weighted"], + weights: dict[str, float] | None = None, + quorum: int | None = None, + sidecar_path: Path | str | None = None, + ) -> None: + if not judges: + raise ValueError("jury requires at least one judge") + if aggregation == "kappa_weighted" and not weights: + raise ValueError( + "kappa_weighted aggregation requires explicit weights " + "(computed offline on calibration set; not at jury construction)" + ) + self.judges = judges + self.aggregation = aggregation + self.weights = weights or {} + self.quorum = quorum if quorum is not None else len(judges) + self.sidecar_path = ( + Path(sidecar_path) + if sidecar_path is not None + else Path(_DEFAULT_SIDECAR_TEMPLATE.format(aggregation=aggregation)) + ) + self.judge_id = f"jury_v1_{aggregation}" + + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + ) -> ScoreResult: + # return_exceptions=False → first exception propagates immediately; + # gather() does not cancel the sibling tasks, which keep running + member_results: list[ScoreResult] = await asyncio.gather( + *[j.score(item, output) for j in self.judges], + return_exceptions=False, + ) + + # Sidecar (append; one line per member per call) + self.sidecar_path.parent.mkdir(parents=True, exist_ok=True) + with self.sidecar_path.open("a", encoding="utf-8") as f: + for r in member_results: + f.write(r.model_dump_json() + "\n") + + successful = [r for r in member_results if not r.abstained] + sys_hash = member_results[0].system_output_hash + + if len(successful) < self.quorum: + return ScoreResult( + reasoning=( + f"jury_below_quorum: {len(successful)}/{len(self.judges)} " + f"members succeeded; required {self.quorum}" + ), + evidence_quotes=[], + score="Unknown", + judge_id=self.judge_id, + rubric_version=member_results[0].rubric_version, + prompt_seed=0, + system_output_hash=sys_hash, + cost_usd=sum(r.cost_usd for r in member_results), + 
latency_ms=max(r.latency_ms for r in member_results), + ) + + # Aggregate over 
successful members + scores = [int(r.score) for r in successful] + scale = self.judges[0].rubric.scale + applied_weights: list[float] = [] + if self.aggregation == "mean": + agg = _aggregate_scores(scores, scale) + else: # kappa_weighted + # Weight successful members by judge_id. v1.1: missing weight is + # a hard error (was a silent fallback to 1.0 in v1, which let an + # asymmetric weights source amplify the unweighted member rather + # than suppressing it — see the v1.1 jury-rescue entry in + # DECISIONS.md for the calibration evidence). + missing = [r.judge_id for r in successful if r.judge_id not in self.weights] + if missing: + raise ValueError( + f"jury kappa_weighted: weights dict missing entries for " + f"member judge_ids {sorted(set(missing))}. Configured " + f"weights cover {sorted(self.weights.keys())}. " + f"v1.1 requires symmetric coverage — every jury member " + f"must have an explicit weight in the source. The v1 " + f"silent fallback to 1.0 was a documented contract " + f"violation that masked the source's asymmetric coverage." + ) + for r in successful: + applied_weights.append(self.weights[r.judge_id]) + weighted_sum = sum(s * w for s, w in zip(scores, applied_weights)) + weight_total = sum(applied_weights) + weighted_mean = ( + weighted_sum / weight_total if weight_total > 0 else 0.0 + ) + # Discretize via the shared ties-to-lower policy (NOT int(round()) + # which uses banker's rounding and would diverge from the `mean` + # path on half-integer aggregates). + agg = _discretize_mean(weighted_mean, scale) + + # Reasoning string reports the per-member weights actually applied, + # in member order (not the constructor's dict: since v1.1 a missing + # entry raises above instead of falling back to 1.0, but the dict can + # still cover judges that abstained on this item and were excluded). 
+ weights_str = applied_weights if self.aggregation == "kappa_weighted" else "n/a" + return ScoreResult( + reasoning=( + f"jury_{self.aggregation}: " + f"members={[r.score for r in successful]}, " + f"weights={weights_str}" + ), + evidence_quotes=[], + score=agg, + judge_id=self.judge_id, + rubric_version=member_results[0].rubric_version, + prompt_seed=0, + system_output_hash=sys_hash, + cost_usd=sum(r.cost_usd for r in member_results), + latency_ms=max(r.latency_ms for r in member_results), + ) + + +def jury( + judges: list[Judge], + aggregation: Literal["mean", "kappa_weighted"], + weights: dict[str, float] | None = None, + quorum: int | None = None, + sidecar_path: Path | str | None = None, +) -> Jury: + return Jury( + judges=judges, + aggregation=aggregation, + weights=weights, + quorum=quorum, + sidecar_path=sidecar_path, + ) diff --git a/agent_bench/evaluation/variance/rubric_permute.py b/agent_bench/evaluation/variance/rubric_permute.py new file mode 100644 index 0000000000000000000000000000000000000000..b3f78fb17979cbc90ddabfc3a7f5362f05bd5683 --- /dev/null +++ b/agent_bench/evaluation/variance/rubric_permute.py @@ -0,0 +1,109 @@ +"""rubric_permute — runs the same judge with permuted rubric levels and aggregates.""" + +from __future__ import annotations + +from pathlib import Path +from typing import TYPE_CHECKING, Literal + +from agent_bench.evaluation.judges.base import Judge, ScoreResult + +if TYPE_CHECKING: + from agent_bench.agents.orchestrator import AgentResponse + from agent_bench.evaluation.harness import GoldenQuestion + + +def _aggregate_scores(scores: list[int], scale: str) -> int: + """Discretize aggregated score per scale. + + Binary: threshold 0.5 with ties → 0 (conservative). + Three-point: round to nearest with ties → lower level (conservative). 
+ """ + mean = sum(scores) / len(scores) + if scale == "binary": + return 1 if mean > 0.5 else 0 + floor = int(mean) + frac = mean - floor + if frac > 0.5: + return floor + 1 + return floor + + +class PermutedJudge: + """Wraps a Judge; runs N permutations with different prompt_seeds. + + Aggregation: + - Any abstain in any permutation → aggregate score = "Unknown". + - Otherwise, discretize the per-permutation scores per scale. + + Per-permutation ScoreResults are written to the sidecar JSONL on + every score() call (one batch per call, append-mode JSONL across calls). + """ + + def __init__( + self, + judge: Judge, + n: int = 2, + seeds: list[int] | None = None, + sidecar_path: Path | str | None = None, + ) -> None: + self.judge = judge + self.n = n + self.seeds = seeds if seeds is not None else list(range(1, n + 1)) + if len(self.seeds) != n: + raise ValueError(f"seeds length {len(self.seeds)} != n {n}") + self.sidecar_path = Path(sidecar_path) if sidecar_path else None + self.judge_id = f"{judge.judge_id}_perm{n}" + + async def score( + self, + item: "GoldenQuestion", + output: "AgentResponse", + ) -> ScoreResult: + per_perm_results: list[ScoreResult] = [] + for seed in self.seeds: + r = await self.judge.score(item, output, prompt_seed=seed) + per_perm_results.append(r) + + if self.sidecar_path is not None: + self.sidecar_path.parent.mkdir(parents=True, exist_ok=True) + with self.sidecar_path.open("a", encoding="utf-8") as f: + for r in per_perm_results: + f.write(r.model_dump_json() + "\n") + + any_abstain = any(r.abstained for r in per_perm_results) + if any_abstain: + score: int | Literal["Unknown"] = "Unknown" + reasoning = ( + f"any_abstain_propagated: " + f"{sum(1 for r in per_perm_results if r.abstained)}/{self.n} " + f"permutations abstained" + ) + else: + score = _aggregate_scores( + [int(r.score) for r in per_perm_results], + self.judge.rubric.scale, + ) + reasoning = ( + f"perm_mean over {self.n} seeds: {[r.score for r in per_perm_results]}" + ) + + 
return ScoreResult( + reasoning=reasoning, + evidence_quotes=[], + score=score, + judge_id=self.judge_id, + rubric_version=self.judge.rubric.source_hash, + prompt_seed=0, + system_output_hash=per_perm_results[0].system_output_hash, + cost_usd=sum(r.cost_usd for r in per_perm_results), + latency_ms=sum(r.latency_ms for r in per_perm_results), + ) + + +def rubric_permute( + judge: Judge, + n: int = 2, + seeds: list[int] | None = None, + sidecar_path: Path | str | None = None, +) -> PermutedJudge: + return PermutedJudge(judge=judge, n=n, seeds=seeds, sidecar_path=sidecar_path) diff --git a/agent_bench/serving/static/index.html b/agent_bench/serving/static/index.html index e4c0a9576017e8f19b8cae54edd5bbfd63c396f4..a37b0882570272677d67bf87fe26020cbf2b99c3 100644 --- a/agent_bench/serving/static/index.html +++ b/agent_bench/serving/static/index.html @@ -721,6 +721,141 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"} border: 1px solid var(--rule-2); background: var(--paper); color: var(--ink); } +/* ── Evaluation harness section ───────────── */ +.harness-intro{ + display: grid; grid-template-columns: 1.2fr 1fr; gap: 32px; + align-items: end; margin-bottom: 28px; +} +.harness-intro p{color: var(--ink-2); max-width: 56ch} +.harness-intro .sig{ + font-family: var(--font-mono); font-size: 0.74rem; color: var(--ink-3); + display: flex; flex-direction: column; gap: 4px; +} +.harness-intro .sig b{color: var(--ink); font-weight: 600} + +.rubric-grid{ + display: grid; grid-template-columns: repeat(4, 1fr); + gap: 1px; background: var(--rule); + border: 1px solid var(--rule); +} +.rubric{ + background: var(--paper); padding: 22px 20px; + display: flex; flex-direction: column; gap: 10px; +} +.rubric .dim{ + font-family: var(--font-mono); font-size: 0.7rem; font-weight: 600; + letter-spacing: 0.1em; text-transform: uppercase; color: var(--ink); +} +.rubric .scale{ + font-family: var(--font-mono); font-size: 0.7rem; color: var(--ink-3); + display: 
flex; gap: 8px; flex-wrap: wrap; +} +.rubric .scale span{border: 1px solid var(--rule); padding: 1px 6px} +.rubric .scale span.on{border-color: var(--ink); color: var(--ink)} +.rubric .desc{ + font-size: 0.86rem; color: var(--ink-2); line-height: 1.5; +} +.rubric .anchor{ + font-family: var(--font-mono); font-size: 0.72rem; + border-left: 2px solid var(--rule-2); padding: 8px 10px; + background: var(--paper-2); color: var(--ink-2); line-height: 1.5; + margin-top: auto; +} +.rubric .anchor b{color: var(--ink); font-weight: 600; font-size: 0.7rem; letter-spacing: 0.06em} + +/* Compact one-row κ summary that lives above Findings (deep table is in appendix) */ +.kappa-summary{ + margin-top: 22px; border: 1px solid var(--rule); + padding: 14px 18px; + display: flex; flex-direction: column; gap: 10px; + background: var(--paper-2); +} +.kappa-summary .ks-head{ + font-family: var(--font-mono); font-size: 0.7rem; font-weight: 600; + letter-spacing: 0.1em; text-transform: uppercase; color: var(--ink-3); +} +.kappa-summary .ks-head .ks-sub{ + letter-spacing: 0.04em; text-transform: none; color: var(--ink-3); + font-weight: 400; margin-left: 4px; +} +.kappa-summary .ks-row{ + display: flex; flex-wrap: wrap; align-items: baseline; gap: 22px; + font-family: var(--font-mono); font-size: 0.85rem; + font-feature-settings: "tnum","zero"; +} +.kappa-summary .ks-stat{display: flex; align-items: baseline; gap: 8px} +.kappa-summary .ks-stat .k{color: var(--ink-3); font-size: 0.78rem} +.kappa-summary .ks-stat .v{color: var(--ink); font-weight: 600} +.kappa-summary .ks-stat .v.win{color: var(--ok)} +.kappa-summary .ks-link{ + margin-left: auto; font-size: 0.78rem; color: var(--ink-2); + border-bottom: 1px solid var(--rule-2); +} +.kappa-summary .ks-link:hover{color: var(--ink); border-color: var(--ink)} + +.kappa-wrap{ + margin-top: 28px; border: 1px solid var(--rule); + display: grid; grid-template-columns: 1.4fr 1fr; +} +.kappa-table{ + border-right: 1px solid var(--rule); + padding: 
22px 24px; +} +.kappa-table h4{ + font-family: var(--font-mono); font-size: 0.72rem; font-weight: 600; + letter-spacing: 0.12em; text-transform: uppercase; color: var(--ink-3); + margin-bottom: 14px; +} +.kappa-table table{width: 100%; border-collapse: collapse; font-family: var(--font-mono); font-size: 0.78rem} +.kappa-table th, .kappa-table td{ + text-align: left; padding: 7px 10px; border-bottom: 1px solid var(--rule); + font-feature-settings: "tnum","zero"; +} +.kappa-table th{ + font-weight: 600; color: var(--ink-3); font-size: 0.68rem; + letter-spacing: 0.08em; text-transform: uppercase; +} +.kappa-table td.num{text-align: right; color: var(--ink)} +.kappa-table td.num.win{color: var(--ok); font-weight: 600} +.kappa-table tr.config-row td{background: var(--paper)} +.kappa-table tr:last-child td{border-bottom: none} +.kappa-note{ + font-family: var(--font-ui); font-size: 0.78rem; color: var(--ink-3); + margin-top: 10px; line-height: 1.5; max-width: 60ch; +} + +.variance{ + padding: 22px 24px; + display: flex; flex-direction: column; gap: 14px; + background: var(--paper-2); +} +.variance h4{ + font-family: var(--font-mono); font-size: 0.72rem; font-weight: 600; + letter-spacing: 0.12em; text-transform: uppercase; color: var(--ink-3); +} +.variance .v-row{ + display: flex; flex-direction: column; gap: 4px; + padding: 12px 14px; background: var(--paper); border: 1px solid var(--rule); +} +.variance .v-row .name{ + font-family: var(--font-mono); font-size: 0.82rem; font-weight: 600; color: var(--ink); +} +.variance .v-row .name code{ + font-family: var(--font-mono); font-size: 0.78rem; color: var(--accent-ink); + background: var(--accent-soft); padding: 1px 5px; +} +.variance .v-row .why{ + font-size: 0.82rem; color: var(--ink-2); line-height: 1.5; +} + +/* Harness responsive overrides — collapse rubric grid + κ split at narrower viewport */ +@media (max-width: 1000px){ + .rubric-grid{grid-template-columns: repeat(2, 1fr)} + .kappa-wrap{grid-template-columns: 1fr} 
+ .kappa-table{border-right: none; border-bottom: 1px solid var(--rule)} + .harness-intro{grid-template-columns: 1fr; gap: 16px} +} + /* ── Responsive ────────────────────────────── */ @media (max-width: 880px){ .grid{grid-template-columns: 1fr} @@ -742,6 +877,7 @@ code, .mono{font-family: var(--font-mono); font-feature-settings: "zero","ss02"}
agent-bench