agentbench / DECISIONS.md

	# Design Decisions

	## Why build from primitives, not LangChain?

	I wanted to demonstrate I understand tool dispatch, memory management,
	and retrieval orchestration at the implementation level. My provider
	abstraction is ~150 lines. When reranking or a second provider is needed,
	I know exactly where it plugs in — because I built every layer.

	## Why one provider in V1?

	The interface supports multiple providers. V1 shipped OpenAI + Mock to
	prove the abstraction. V2 added Anthropic (claude-haiku-4-5), confirming
	that switching providers is a one-line config change. The orchestrator
	and tools are completely unchanged between providers.

	## Why one domain (technical docs)?

	Clean text produces clean evaluation. Research papers introduce PDF
	parsing noise (tables, figures, formulas) that degrades eval quality
	without adding signal. The framework handles any text corpus — the
	domain is a config swap.

	## Why Reciprocal Rank Fusion, not score normalization?

	BM25 scores and cosine similarities live on different scales.
	Normalizing across scales is brittle: min-max normalization is
	sensitive to outliers, z-score requires distribution assumptions.
	RRF fuses by rank position `1/(k + rank)`, which is robust,
	parameter-light (only k=60), and well-studied. Trade-off: loses
	magnitude information, but for top-5 retrieval this rarely matters.
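
A minimal sketch of rank-based fusion, assuming each retriever hands back an ordered list of chunk IDs. The function name `rrf_fuse` and the tie-break rule are illustrative, not taken from the repo.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; each list contributes 1 / (k + rank).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
# Fused order: b, a, d, c ("b" and "a" appear in both lists).
```

Only rank positions enter the score, so the raw BM25 and cosine magnitudes never need to share a scale.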

	## Why ~16 curated docs, not a large corpus?

	Small corpus produces predictable retrieval, reproducible benchmarks,
	and easy debugging. Golden dataset questions map to specific source
	files. A reviewer can verify any result by reading the source. The
	framework scales to larger corpora — the choice is about evaluation
	quality, not capability.

	## Why no reranker in V1?

	Feature-flagged in config (`rag.reranker.enabled: false`). V1
	benchmarks without reranking establish an honest baseline. V2 adds
	cross-encoder reranking and shows the delta.

	## Why no delete in the vector store?

	FAISS flat index doesn't support efficient deletion. For a small
	corpus that changes rarely, rebuild-on-ingest is simpler and
	eliminates consistency bugs.

	## Why async internals, sync user behavior?

	FastAPI and the OpenAI SDK are async-native. Using async for I/O
	avoids blocking the event loop. V2 added SSE streaming (`/ask/stream`)
	for the final synthesis step — tool calls remain non-streamed since
	they complete in ~100ms.

	## Why SQLite-backed conversation sessions

	V1 was stateless by design — no conversation_id, no cross-request
	memory. V2 adds optional SQLite-backed sessions: pass `session_id`
	on `/ask` to persist and load conversation history. When omitted,
	behavior is identical to V1 (stateless). See the "Why SQLite for
	conversation persistence" entry below for the full rationale.

	## Why negative evaluation cases?

	A system that always answers sounds confident but may hallucinate.
	5 out-of-scope questions test whether the system refuses gracefully
	when the corpus doesn't contain the answer. Grounded refusal
	requires both refusal language AND zero cited sources — an answer
	that says "not found" but still cites docs is not a valid refusal.
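
The two-part rule can be sketched as a single predicate. The refusal phrases below are hypothetical placeholders; the actual phrase list used by the harness is not shown in this document.

```python
# Hypothetical refusal phrases, for illustration only.
REFUSAL_MARKERS = ("not found", "cannot answer", "no relevant documents")


def is_grounded_refusal(answer: str, sources: list[str]) -> bool:
    """A refusal counts only if it uses refusal language AND cites nothing."""
    has_refusal_language = any(m in answer.lower() for m in REFUSAL_MARKERS)
    return has_refusal_language and len(sources) == 0
```

An answer that says "not found" while still listing a source fails the check, which is the case the entry above calls out.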

	## Why deterministic eval + optional LLM judge?

	CI needs free, deterministic tests. Retrieval P@5, keyword hit
	rate, citation accuracy, and grounded refusal rate run without
	API keys. LLM-judged faithfulness and correctness are manual
	enrichment steps, run locally, with results committed to the
	benchmark report.

	## Why structlog, not LangSmith/Langfuse?

	Third-party observability contradicts the "built from primitives"
	narrative. structlog provides JSON-structured logs, `/metrics`
	exposes latency + cost. If a team uses LangSmith, adding it is
	a one-day task.

	## Why explicit citation format [source: filename.md]?

	The system prompt mandates inline citations so the evaluation
	harness can parse them with `\[source:\s*(.+?)\]` and check
	against the structured sources list. This catches hallucinated
	citations and measures citation accuracy as a metric.
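
A sketch of how the harness could use that regex. The accuracy definition here (share of cited filenames that appear in the structured sources list, `None` when nothing is cited) is an assumption; the repo's exact formula is not given in this document.

```python
import re
from typing import Optional

CITATION_RE = re.compile(r"\[source:\s*(.+?)\]")


def extract_citations(answer: str) -> list[str]:
    """Return every filename cited inline as [source: filename.md]."""
    return [name.strip() for name in CITATION_RE.findall(answer)]


def citation_accuracy(answer: str, sources: list[str]) -> Optional[float]:
    """Fraction of inline citations backed by the structured sources list."""
    cited = extract_citations(answer)
    if not cited:
        return None  # nothing cited, so accuracy is undefined in this sketch
    known = set(sources)
    return sum(1 for name in cited if name in known) / len(cited)
```

Any cited filename missing from `sources` is a hallucinated citation and lowers the score.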

	## Why per-request retrieval settings via kwargs, not singleton mutation?

	The orchestrator is a singleton shared across concurrent requests.
	Storing `top_k` / `strategy` on `self` causes cross-request state
	bleed. Instead, these are passed as local variables through the
	tool execution kwargs — no shared state is mutated.
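
A stripped-down sketch of the pattern. The class names and the `run` / `execute` signatures are hypothetical stand-ins for the real orchestrator and search tool.

```python
class SearchToolSketch:
    def execute(self, query: str, *, top_k: int = 5, strategy: str = "hybrid") -> dict:
        # Settings arrive as arguments; nothing is read from shared state.
        return {"query": query, "top_k": top_k, "strategy": strategy}


class OrchestratorSketch:
    """Singleton-style object: holds the tool, never per-request settings."""

    def __init__(self, tool: SearchToolSketch) -> None:
        self.tool = tool

    def run(self, question: str, *, top_k: int = 5, strategy: str = "hybrid") -> dict:
        # top_k and strategy stay local to this call and are forwarded as kwargs.
        return self.tool.execute(question, top_k=top_k, strategy=strategy)
```

Two concurrent requests with different `top_k` values cannot see each other's settings because nothing is written to `self`.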

	## Why a relevance threshold for grounded refusal

	V1 never refuses — it always retrieves tangentially related content and
	synthesizes an answer. This is a trust failure: users cannot distinguish
	"the system found relevant information" from "the system fabricated from
	vaguely related chunks." Grounded refusal rate was 0/5.

	We add a refusal gate in `SearchTool.execute()` based on the maximum RRF
	score across retrieved chunks. If no chunk scores above the threshold, the
	tool returns "No relevant documents found" — the LLM then refuses via the
	system prompt rather than fabricating from irrelevant content.
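
A sketch of the go/no-go check, assuming RRF scores are summed over two rankers (BM25 and dense) with k=60. Under that assumption a chunk ranked first by only one ranker scores 1/61 ≈ 0.0164, and a chunk ranked first by both scores 2/61 ≈ 0.0328, so a threshold near 0.02 effectively asks for support from both rankers. `ScoredChunk` and `refusal_gate` are illustrative names, and treating a threshold of 0.0 as "disabled" is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

NO_RESULTS_MESSAGE = "No relevant documents found"


@dataclass
class ScoredChunk:
    chunk_id: str
    rrf_score: float


def refusal_gate(chunks: list[ScoredChunk], threshold: float) -> Optional[list[ScoredChunk]]:
    """Return the full candidate set if the gate passes, else None."""
    if threshold <= 0.0:
        return chunks  # gate disabled in this sketch
    best = max((c.rrf_score for c in chunks), default=0.0)
    # Go/no-go on the best score; individual chunks are never filtered.
    return chunks if best > threshold else None


def search_tool_output(chunks: list[ScoredChunk], threshold: float) -> str:
    passed = refusal_gate(chunks, threshold)
    if passed is None:
        return NO_RESULTS_MESSAGE
    return ", ".join(c.chunk_id for c in passed)
```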

	Gate location: The gate fires in `SearchTool.execute()`, not the
	orchestrator. `SearchTool` is where retrieval scores are still available —
	they are dropped before results reach the orchestrator. This also keeps
	the orchestrator unchanged.

	Threshold value: `rag.refusal_threshold: 0.02` is a provisional default
	pending an empirical sweep across the evaluation set. The sweep will test
	values 0.01–0.03 and select the value that maximizes refusal on out-of-scope
	queries without degrading in-scope P@5 and R@5. The actual RRF score
	distribution will be documented here after tuning.

	Interaction with reranking: The refusal gate fires on RRF scores BEFORE
	reranking. It is a go/no-go decision, not a per-chunk filter. If the gate
	passes, the full candidate set proceeds to the reranker. This keeps the
	threshold calibration independent of whether reranking is enabled.

	Default disabled: `refusal_threshold: 0.0` disables the gate and preserves
	V1 behavior exactly. The feature is opt-in until the threshold is tuned;
	0.02 is the provisional value to use when opting in.

	Alternative considered: LLM-based relevance judgment ("is this content
	relevant to the query?"). Rejected because it adds latency, cost, and a
	second point of failure. The score-based approach is deterministic, fast,
	and debuggable.

	## Why cross-encoder reranking improves precision

	BM25 can retrieve chunks that are lexically similar but semantically
	irrelevant. RRF fusion mitigates this partially, but noisy BM25 results
	still dilute the top-5 set. P@5 was 0.70 in V1.

	A cross-encoder (`ms-marco-MiniLM-L-6-v2`, ~80MB) scores each
	(query, chunk) pair jointly, capturing semantic relevance that
	bi-encoder similarity misses. The tradeoff is ~100–200ms extra latency
	per query — acceptable given our 4.7s baseline is dominated by LLM
	generation, not retrieval.

	The reranker is enabled by default. Setting `rag.reranker.enabled: false`
	restores V1 behavior exactly. `reranker.top_k` is independent of
	`retrieval.top_k`, so the reranker's output count can be tuned without
	affecting the RRF candidate pool.

	The retriever passes all RRF-fused candidates to the reranker rather
	than a computed subset. The reranker's `top_k` handles truncation.
	This is simpler and more robust than computing an input size from
	per-system candidate counts.

	## Why provider retry with exponential backoff

	OpenAI returns 429 (rate limit) errors under load. Without retry logic,
	a single 429 causes a user-visible failure. We add exponential backoff:
	retry after waits of 1s, 2s, and 4s. After 3 failed retries, raise `ProviderRateLimitError`
	so the middleware returns a clear 503.

	The retry wraps the raw `openai.RateLimitError` — it must fire BEFORE
	the error gets translated to `ProviderRateLimitError`, otherwise retry
	logic is dead code. Other errors (400, 401, timeout) fail immediately.
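
A sketch of the retry ordering described above. `RawRateLimitError` stands in for `openai.RateLimitError` so the example runs without the SDK, and `call_with_backoff` is a hypothetical helper name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RawRateLimitError(Exception):
    """Stand-in for the SDK's raw rate-limit error."""


class ProviderRateLimitError(Exception):
    """Translated error that the middleware maps to a 503."""


def call_with_backoff(
    call: Callable[[], T],
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RawRateLimitError as exc:
            # Catch the RAW error here; translate only once retries are exhausted.
            if attempt == max_retries:
                raise ProviderRateLimitError("rate limited after retries") from exc
            sleep(base_delay * 2 ** attempt)  # waits of 1s, 2s, 4s
    raise AssertionError("unreachable")
```

Any other exception type is not caught, so 400s, 401s, and timeouts propagate on the first failure.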

	## Why in-memory API rate limiting

	A public-facing API needs abuse protection. We use a simple in-memory
	sliding window limiter: 10 requests/minute per IP. Sufficient for a
	demo deployment; a production system would use Redis.

	Known limitation: the per-IP dict grows without bound across distinct
	IPs. Acceptable for Fly.io with auto-stop (memory resets). If running
	continuously under bot traffic, add a periodic sweep or switch to a
	TTL-based structure.

	Design choices:
	- `/health` and `/metrics` exempt: monitoring should never be rate-limited.
	- `Retry-After` header: follows HTTP 429 spec, lets clients back off.
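
A sketch of a per-IP sliding window with the limits named above. The class name, the injectable `clock`, and the `(allowed, retry_after)` return shape are assumptions for illustration; wiring into FastAPI middleware is omitted.

```python
import math
import time
from collections import defaultdict, deque
from typing import Callable

EXEMPT_PATHS = {"/health", "/metrics"}  # monitoring is never rate-limited


class SlidingWindowLimiter:
    def __init__(
        self,
        limit: int = 10,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # Grows with the number of distinct IPs (the known limitation above).
        self._hits: dict = defaultdict(deque)

    def check(self, ip: str, path: str = "/ask") -> tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for one request."""
        if path in EXEMPT_PATHS:
            return True, 0
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            retry_after = math.ceil(self.window - (now - hits[0]))
            return False, max(retry_after, 1)
        hits.append(now)
        return True, 0
```

The second tuple element is what a 429 response would place in the `Retry-After` header.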

	## Why SQLite for conversation persistence

	Three options considered:
	1. In-memory dict: Lost on restart.
	2. SQLite: Zero-dependency, file-based, survives restarts.
	3. Redis/PostgreSQL: Adds infrastructure complexity.

	SQLite is right for this scale. `session_id` is optional — when omitted,
	the system behaves identically to V1 (stateless). This preserves backward
	compatibility and keeps benchmark evaluation deterministic.

	The route handler manages session state (load history, store Q+A), not
	the orchestrator. The orchestrator accepts an optional `history` parameter
	but has no knowledge of persistence. This keeps the agent loop testable
	without a database.

	Note: On HF Spaces, SQLite is ephemeral (no persistent storage on free
	tier). For the demo this is acceptable — sessions last until the container
	sleeps. Production would use a volume or managed database.

	## Why a second provider (Anthropic)

	The provider abstraction existed since V1 but only had OpenAI + Mock.
	Adding Anthropic proves the abstraction works across fundamentally
	different APIs:

	- System message: `system=` parameter, not in the messages list
	- Tool definitions: `input_schema` instead of `parameters`
	- Tool results: `tool_result` content blocks in user messages
	- Tool calls: `tool_use` content blocks, not a separate field
	- Stop reason: `stop_reason` of `tool_use` / `end_turn`, vs OpenAI's `finish_reason` of `tool_calls` / `stop`
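
Two of these differences can be sketched as pure translation helpers. The dictionary shapes follow the public OpenAI and Anthropic request formats; the helper names are hypothetical and not the repo's actual provider code.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert an OpenAI-style function tool into an Anthropic tool definition."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],  # Anthropic's name for `parameters`
    }


def split_system_message(messages: list[dict]) -> tuple[str, list[dict]]:
    """Pull system messages out of the list so they can be passed as `system=`."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return system, rest
```

Keeping translation in helpers like these is one way the orchestrator can stay provider-agnostic.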

	The implementation is a config swap — `provider.default: anthropic` in
	YAML switches the entire system to Claude. The orchestrator, tools,
	evaluation harness, and serving layer are completely unchanged.

	Same retry/timeout handling as OpenAI. Both providers are tested with
	mocked HTTP responses — no API keys needed in CI.

	## Why ranked_sources separate from deduplicated sources?

	The deduplicated `sources` list in `AgentResponse` is for the API
	response. The `ranked_sources` list preserves rank order with
	duplicates for evaluation metrics. P@5 and R@5 need the raw
	retrieval ranking, not the post-processed answer metadata.
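
A sketch of why the two lists diverge, using the conventional hits-over-k definition of precision. The repo's exact metric code is not shown here, so treat the function names and the definition as assumptions.

```python
def precision_at_k(ranked_sources: list[str], relevant: set[str], k: int = 5) -> float:
    """Share of the top-k retrieved chunks whose source file is relevant."""
    top = ranked_sources[:k]
    return sum(1 for source in top if source in relevant) / k


def dedupe_sources(ranked_sources: list[str]) -> list[str]:
    """Order-preserving de-duplication, as used for the API response."""
    return list(dict.fromkeys(ranked_sources))


ranked = ["a.md", "a.md", "b.md", "c.md", "a.md"]
# precision_at_k(ranked, {"a.md"}) is 3/5; on the deduplicated list it would be 1/5.
```

Computing P@5 on the deduplicated list would penalize retrieval for returning several relevant chunks from the same file.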

	## Why vLLM over TGI / llama.cpp

	vLLM has broad model support, high throughput via PagedAttention, and a native
	OpenAI-compatible server (`/v1/chat/completions`). TGI is a valid alternative; llama.cpp
	targets different use cases (edge/CPU inference). This is a deliberate choice, not
	ignorance of alternatives.

	## Why Modal for GPU inference

	Serverless GPU eliminates idle cost and GPU node management. A10G at ~$1.30/hr costs
	~$0.50 per full 27-question benchmark run. The Docker Compose path (`docker-compose.vllm.yml`)
	is retained for users who have local GPUs or prefer persistent serving.

	## Why split topology (K8s API + Modal GPU)

	The API layer (retrieval, orchestration, tool routing) is CPU-bound and benefits from
	horizontal scaling via K8s HPA. The LLM inference layer is GPU-bound and benefits from
	serverless elasticity — Modal scales to zero when idle, scales up on demand with no node
	provisioning. Co-locating both in K8s would require GPU node pools with idle cost,
	node autoscaler latency, and NVIDIA device plugin management. This mirrors a common
	production pattern.

	## Why Helm only, not Kustomize + Helm

	Showing two K8s deployment methods for the same app adds complexity without demonstrating
	distinct skills. Helm with `values-dev.yaml` / `values-prod.yaml` covers
	environment-specific configuration cleanly.

	## Why CPU-based HPA, not custom metrics

	CPU utilization works without a Prometheus adapter or custom metrics server. A production
	improvement would use the Prometheus adapter to scale on p95 latency from the `/metrics`
	endpoint — this requires bridging the JSON metrics to Prometheus exposition format.
	Documented as a follow-up.

	## Why env var fallback in SelfHostedProvider

	Follows the same pattern as OpenAIProvider reading `OPENAI_API_KEY`. The YAML config
	provides defaults; env vars override at runtime. No config loader changes needed.

	## Why lazy tool-call detection, not metadata check

	Checking `/v1/models` metadata for tool-calling support is unreliable — model metadata
	doesn't consistently report this capability. Instead, the provider sends one tool-calling
	request on first `complete()` call with tools and checks if the response contains
	`tool_calls`. The result is cached as `self._supports_tool_calling`. Transient failures
	(timeout, 5xx) return `None` and retry on the next call rather than permanently
	downgrading to prompt-based fallback.
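
A sketch of the tri-state cache, assuming the probe is a callable that sends one tool-calling request and returns the parsed response. `TransientProbeError` and the class name are hypothetical; only `_supports_tool_calling` comes from the entry above.

```python
from typing import Callable, Optional


class TransientProbeError(Exception):
    """Stand-in for a timeout or 5xx from the inference server."""


class ToolCallDetector:
    def __init__(self, probe: Callable[[], dict]) -> None:
        self._probe = probe
        # None = unknown, True/False = cached verdict.
        self._supports_tool_calling: Optional[bool] = None

    def supports_tool_calling(self) -> Optional[bool]:
        if self._supports_tool_calling is not None:
            return self._supports_tool_calling
        try:
            response = self._probe()
        except TransientProbeError:
            # Do not cache: the next call probes again.
            return None
        self._supports_tool_calling = bool(response.get("tool_calls"))
        return self._supports_tool_calling
```

Only a completed probe writes to the cache, so a flaky first request cannot lock the provider into the prompt-based fallback.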

	## Why two-tier injection detection, not three

	The original design included a middle tier (embedding similarity against known injection examples). Dropped because the existing embedding model (all-MiniLM-L6-v2) is a general-purpose sentence encoder, not specialized for adversarial detection. Cosine similarity can't distinguish semantic similarity from intent similarity — "how do I ignore a field in Pydantic?" clusters near "ignore previous instructions" in that embedding space. The threshold between "ambiguous" and "suspicious" is an untunable hyperparameter with no ground truth.

	Two tiers are cleaner: heuristic regex is deterministic (matches or doesn't), DeBERTa classifier is probabilistic (confidence score). No ambiguous handoff between two probabilistic layers. Deployments without GPU get heuristic-only — documented, not hidden.

	## Why regex + optional spaCy for PII, not a cloud API

	Three reasons: cost (cloud PII APIs charge per call), latency (adds network round-trip to every retrieved chunk), and data residency (PII leaves the system boundary). Regex covers the PII types with actual legal/compliance risk: SSNs, credit cards, emails, phone numbers, IP addresses.

	spaCy NER (PERSON, ORG) is optional because false-positive rates on technical text are unacceptable without domain tuning. "FastAPI" triggers ORG, "Jordan" triggers PERSON. The optional import pattern (`try: import spacy`) degrades gracefully with a logged warning — no crash if someone sets `use_ner: true` without installing spaCy.

	## Why append-only JSONL for audit, not SQLite

	One codepath, one format, no config branching. JSONL is append-only by nature — no schema migrations, no transactions, no connection pooling. Log rotation handles size. `jq` provides immediate queryability without building a custom API.

	The original design included an optional SQLite backend and a query endpoint (`GET /admin/audit`). Both were dropped: SQLite adds a second storage codepath with no consumer, and the query endpoint would require API key authentication — an inconsistency when `/ask` itself has no auth.

	JSONL imports trivially into SQLite/DuckDB if structured queries are needed later. No bridges burned.

	## Why HMAC-SHA256 IP hashing in audit logs

	HMAC-SHA256 with a server secret hashes client IPs before logging. Plain SHA-256 was considered but rejected: the IPv4 address space (~4.3 billion) is small enough that unsalted hashes are reversible by offline enumeration. HMAC-SHA256 with a secret key makes precomputation infeasible without the key. The key is sourced from an explicit parameter, `AUDIT_HMAC_KEY` env var, or (with a logged warning) a random per-process fallback.
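
A sketch of the key resolution order and the hash itself, using only the standard library. The function names are illustrative; `AUDIT_HMAC_KEY` is the env var named above.

```python
import hashlib
import hmac
import logging
import os
import secrets
from typing import Optional

logger = logging.getLogger("audit")


def resolve_hmac_key(explicit: Optional[bytes] = None) -> bytes:
    """Explicit parameter, then AUDIT_HMAC_KEY, then a random per-process key."""
    if explicit:
        return explicit
    from_env = os.environ.get("AUDIT_HMAC_KEY")
    if from_env:
        return from_env.encode("utf-8")
    logger.warning("AUDIT_HMAC_KEY not set; using a random per-process key")
    return secrets.token_bytes(32)


def hash_ip(ip: str, key: bytes) -> str:
    """Keyed hash of a client IP; not enumerable offline without the key."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

With the random fallback, hashes are stable only within one process, so the same IP cannot be correlated across restarts.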

	## Why three output validators, not four

	The original design included a "length/format sanity check" (reject suspiciously short responses or raw JSON in natural-language context). Dropped because the calculator tool returns short numeric answers and the tech docs domain legitimately contains code blocks and JSON examples. Every false positive erodes trust in the validation layer. The three remaining checks — PII leakage, URL hallucination, blocklist — are deterministic with clear pass/fail semantics.

	## Why buffer-then-validate for streaming output

	The `/ask/stream` endpoint buffers all events from the orchestrator before sending to the client, then validates the assembled answer. This means the client waits for the full answer before receiving any content chunks. The orchestrator emits the final synthesis as a single chunk (tool-use iterations are not streamed), so the buffering adds no perceptible latency. The alternative — streaming chunks immediately and appending a safety marker — leaks unsafe content to any client that stops reading after the `done` event.

	## Why no authentication on API endpoints

	The HF Spaces demo is public by design — the `curl` examples in the README work without credentials, which is the point. Adding API key authentication would gate access but break the zero-friction demo experience that makes the project evaluable.

	The security pipeline protects content (injection detection, PII redaction, output validation), not access. This is a deliberate scope boundary: application-layer guardrails ensure the system behaves safely regardless of who calls it, rather than assuming trusted callers. Rate limiting (10 RPM per IP) provides basic abuse protection.

	A production deployment would add authentication (API keys or OAuth) at the infrastructure layer — reverse proxy, API gateway, or middleware. The security pipeline's `getattr(..., None)` pattern means auth can be layered on without modifying the existing security components.

	## Why monitor mode for output validation, not gating?

	Output validation runs post-stream as a monitoring layer. The answer
	streams to the client, then validation runs and emits its verdict. Gating
	(buffer-then-validate) would add 4-5 seconds of dead air while the full
	answer generates — unacceptable streaming UX for a documentation Q&A bot.
	Trade-off: a hallucinated URL or PII fragment could reach the client
	before validation catches it. For this use case (FastAPI docs, no real
	PII in corpus), the risk is near-zero. The dashboard labels this
	"monitored" (not "gated") to be explicit about the posture.

	## Why named residual risks and scope limits, not "fully mitigated" verdicts?

	The OWASP LLM Top 10 (2025) mapping could have been written as a 10-row table where LLM01 and LLM02 read as "addressed" without qualifiers — shorter and cleaner-looking. Rejected because OWASP's own 2025 text is explicit about what an input guardrail can and cannot do, and writing a verdict that contradicts the source the mapping cites would be compliance theater.

	LLM01 Prompt Injection — OWASP 2025 states that RAG and fine-tuning do not fully mitigate prompt injection, and that indirect injection through retrieved content remains a core risk class. "Fully mitigated" is unsupportable for any system retrieving untrusted content into an LLM context window, which is every RAG system including this one. The LLM01 verdict reads "addressed directly with named residual risk"; the residual-risk cell cites OWASP's own "do not fully mitigate" language verbatim.

	LLM02 Sensitive Information Disclosure — OWASP 2025's LLM02 mitigations span four concern classes: access controls, training-data handling, user-consent transparency, and proprietary-information governance. This implementation addresses a narrower output-side subset (output validation for PII leakage, secret formats, and URL hallucination) — not cleanly one of the four concern classes, but a narrower scope than any of them. The verdict reads "addressed directly for the applicable scope"; the scope-limit cell enumerates the four concern classes verbatim and names what addressing the broader concerns would require (multi-tenant or authenticated architecture).

	The tension the entry resolves is honesty-vs-scannability: a mapping that surfaces named residual risks and scope limits is longer and harder to skim than one with uniform "addressed" verdicts, but the scannable version over-claims relative to the cited source. Honest evaluation is the brand. Every verdict cell in SECURITY.md must survive a reviewer reading OWASP 2025 in a second tab.

	See [SECURITY.md § LLM01 Prompt Injection](SECURITY.md#llm01-prompt-injection) and [§ LLM02 Sensitive Information Disclosure](SECURITY.md#llm02-sensitive-information-disclosure) for the verdict cells; this entry covers why the verdict discipline takes the form it does. The LLM01 "do not fully mitigate" phrasing and the LLM02 four-concern-class enumeration are canonical in SECURITY.md; the README tail and landing-page subtitle paraphrase but must preserve the named-residual-risk and scope-limit structure.

	## Why additive SSE stage events?

	The enhanced `/ask/stream` adds `meta` and `stage` event types alongside
	the existing `sources`, `chunk`, and `done` events. Existing consumers
	that only handle the three legacy types are unaffected — they simply
	ignore events with unknown types. This avoids versioning the endpoint
	or breaking the non-streaming `/ask` contract. The `meta` event fires
	first (before any stages) so the frontend can display provider/model
	info immediately.
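
A sketch of a legacy consumer that only knows the three original event types. The payload shapes (`{"text": ...}` for chunks, a list for sources) are hypothetical; only the event names come from this entry.

```python
import json
from typing import Iterator, Optional

LEGACY_EVENT_TYPES = {"sources", "chunk", "done"}


def parse_sse(raw: str) -> Iterator[tuple[str, Optional[object]]]:
    """Yield (event_type, payload) pairs from a raw SSE text stream."""
    for block in raw.strip().split("\n\n"):
        event_type, data_lines = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        payload = json.loads("\n".join(data_lines)) if data_lines else None
        yield event_type, payload


def legacy_consume(raw: str) -> dict:
    result = {"sources": [], "answer": "", "done": False}
    for event_type, payload in parse_sse(raw):
        if event_type not in LEGACY_EVENT_TYPES:
            continue  # `meta` and `stage` are skipped without error
        if event_type == "sources":
            result["sources"] = payload
        elif event_type == "chunk":
            result["answer"] += payload["text"]
        else:
            result["done"] = True
    return result
```

Because unknown types fall through the `continue`, adding `meta` and `stage` changes nothing for this consumer.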

	## Why vanilla JS for the frontend, not Alpine or React?

	The showcase dashboard has ~5 pieces of reactive state (pipeline stages,
	retrieval results, security badges, stats, chat messages). The SSE
	handler is inherently imperative: receive event, querySelector the
	target node, update classList and textContent. Wrapping this in a
	reactive framework adds a dependency, interview questions about
	"why is there a framework for 5 state variables", and indirection
	that fights the imperative SSE pattern. One `state` object + a few
	`render()` functions handles it in ~150 lines.

	## Phase 1 SSE gate closure — two baselines on record, not one

	The Phase 1 acceptance gate for the SSE backend work (meta event,
	stage events, iteration-aware metadata threading, route-level
	injection/output-validation events) requires re-running
	`make evaluate-fast` and confirming numbers match pre-change state
	on the pinned `gpt-4o-mini-2024-07-18` snapshot. The re-run was
	honored literally rather than substituted with a git-diff
	argument, even though the SSE commits did not touch
	`scripts/evaluate.py`'s legacy code path. Two reasons: the
	re-commitment discipline that kept Fix 1 and Fix 2 honest applies
	equally here, and the legacy path and the `--corpus fastapi` path
	produce materially different baselines that cannot substitute for
	each other.

	**Two distinct baselines now exist at the pinned snapshot, and
	both are on record** — one per prompt path:

	| Baseline file | Invocation | Prompt source | In-scope P@5 | In-scope R@5 | Citation | Mean calls |
	|---|---|---|---|---|---|---|
	| `results/fastapi_preedit.json` @ `213da36` | `--corpus fastapi` | `format_system_prompt("FastAPI")` | 0.718 | 0.833 | 1.000 | 1.14 |
	| `results/fastapi_legacy_baseline_pinned.json` @ this commit | `make evaluate-fast` (no `--corpus`) | `tech_docs.yaml` `task.system_prompt` | 0.655 | 0.849 | 1.000 | 1.45 |

	Citation accuracy holds at 1.000 on both paths, both in-scope and
	out-of-scope. The retrieval metric deltas, legacy path relative to
	`--corpus fastapi` (P@5 −0.063, R@5 +0.016, keyword hit rate +0.045),
	and the behavioral delta (mean tool calls +0.318 in-scope,
	+1.00 out-of-scope) trace to the prompt-path divergence
	(`scripts/evaluate.py:67` reads `task.system_prompt` in the legacy
	branch vs. `format_system_prompt(label)` in the `--corpus` branch),
	not to any change in retrieval, reranking, or refusal-gate code.
	This divergence is the same one the "evaluation-layer multi-corpus
	support lagged the serving-layer refactor" entry documents; the
	narrowed serving-migration deferral tracks its eventual migration.

	Why both baselines are retained. When the serving-migration
	deferral lands and `scripts/evaluate.py`'s legacy branch is removed
	(everything routes through `--corpus fastapi`), the regression gate
	is "post-migration `make evaluate-fast` output matches pre-migration
	`--corpus fastapi` output within pre-committed tolerances." That
	gate requires the `--corpus fastapi` baseline as the comparison
	reference AND the legacy baseline as evidence of the pre-migration
	state that is being retired. Retaining both makes the migration
	auditable and bounds its regression budget; retaining only one
	would force the post-migration run to compare against a baseline
	from a different prompt path, guaranteeing the gate fires on
	prompt divergence rather than on any actual regression.

	Gate verdict: passed. No regression vs pre-SSE legacy path
	expectations (citation 1.000 holds, refusal gate fires on the same
	5 out-of-scope questions, retrieval numbers in sane in-scope
	ranges). Phase 1 SSE backend work is closed from the backend side;
	the frontend's consumption of iteration-aware stage events is
	orthogonal and owned by Week 1 step 7 (showcase UI).

	## Why per-corpus refusal thresholds?

	FastAPI and Kubernetes have different corpus characteristics. FastAPI
	has 16 short, well-structured docs with sparse cross-references —
	relevance tends to concentrate in 1-2 chunks per query. Kubernetes
	has 30-40 docs with heavy cross-referencing between concepts (Pod →
	Deployment → Service → Ingress), which spreads relevance across more
	chunks. A single global refusal threshold would either refuse too
	aggressively on K8s (no single chunk dominates, so the top score
	looks "low") or not aggressively enough on FastAPI (where a
	moderate-scoring chunk might be the only hit and the system should
	still refuse).

	`CorpusConfig` carries `refusal_threshold` as a per-corpus field.
	Each threshold gets tuned against its own golden dataset. There is
	no "fair" shared threshold because, following BEIR's
	heterogeneous-benchmark framing, retrieval scores are not treated as
	comparable across corpora. Placeholder values ship in default.yaml
	and are replaced by tuned values during the per-corpus evaluation
	sweep.

	## Why corpus and provider toggles compose — corpus_map[corpus][provider]

	The simpler design would have been `corpus_map[corpus]` returning a
	single orchestrator. It ships in 10 fewer lines. It also silently
	breaks the provider toggle in multi-corpus mode: the orchestrator
	inside each corpus cell holds one fixed provider, and clicking
	"Anthropic" in the dashboard keeps running on OpenAI.

	This project's hero-tile metric is the provider comparison (`1.00 API /
	0.14 7B self-hosted`). Breaking the mechanism that demonstrates that
	metric — on a portfolio demo where a reviewer will open DevTools and
	notice — would erode the honest-evaluation brand the whole repo is
	built around. The nested `corpus_map[corpus][provider]` structure
	keeps both toggles functional. Store, retriever, and search tool are
	shared across providers within a corpus (the expensive objects are
	held once per corpus); only the orchestrator varies per provider
	since it holds the LLM client. Per-corpus × per-provider memory
	overhead is an orchestrator struct, not a FAISS index.

	RSS is logged per corpus, not per corpus × provider, because the
	store is what drives memory. The provider multiplier is negligible
	compared to a hybrid index + embedder.
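
A sketch of the nested structure, with placeholder classes standing in for the real store, retriever, and orchestrator. All class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CorpusResources:
    """Expensive per-corpus objects (store, retriever, search tool), built once."""
    corpus: str


@dataclass
class OrchestratorCell:
    """Cheap per-provider wrapper: holds the LLM client, shares the resources."""
    provider: str
    resources: CorpusResources


def build_corpus_map(corpora: list[str], providers: list[str]) -> dict:
    corpus_map: dict = {}
    for corpus in corpora:
        shared = CorpusResources(corpus)  # one index per corpus, not per provider
        corpus_map[corpus] = {p: OrchestratorCell(p, shared) for p in providers}
    return corpus_map


corpus_map = build_corpus_map(["fastapi", "k8s"], ["openai", "anthropic"])
cell = corpus_map["k8s"]["anthropic"]  # both toggles select a distinct cell
```

The provider cells within a corpus point at the same `CorpusResources` instance, which is why memory scales with corpora rather than with corpus × provider.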

	## Why one parameterized system prompt, not per-corpus templates

	The template is `"You are a technical documentation assistant for
	{corpus_label}..."`. The only corpus-specific element is the label;
	prompt content is identical across corpora: same citation format,
	same refusal language, same grounding instructions. Having two
	separate prompt files would invite drift — someone tweaks the FastAPI
	prompt for a specific failure mode and forgets to update the K8s
	version, and the demo silently answers differently on the two toggles.

	The parameterization is enforced by two tests: (a)
	`format_system_prompt("")` raises `ValueError` so an unresolved
	`{corpus_label}` can never reach the LLM, and (b) a spy on
	`orchestrator.run_stream` asserts FastAPI and K8s requests receive
	different prompts with the correct label substituted.
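
A sketch of the guard that test (a) exercises. The template text after the label is elided in this document, so the string below stops where the quoted excerpt stops; rejecting whitespace-only labels is an added assumption.

```python
# Template as quoted above; the remainder of the prompt is elided in the source.
SYSTEM_PROMPT_TEMPLATE = (
    "You are a technical documentation assistant for {corpus_label}..."
)


def format_system_prompt(corpus_label: str) -> str:
    """Substitute the corpus label, refusing empty or blank values."""
    if not corpus_label or not corpus_label.strip():
        raise ValueError("corpus_label must be a non-empty string")
    return SYSTEM_PROMPT_TEMPLATE.format(corpus_label=corpus_label.strip())
```

Raising at format time means a misconfigured corpus fails loudly at startup or request time instead of sending a literal `{corpus_label}` to the model.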

	The wording deliberately differs from the typical "don't hallucinate"
	RAG template:

	- "refuse the question explicitly" matches our refusal-gate
	mechanism. "Say so politely" is soft language that models interpret
	as "hedge and answer anyway".
	- **"do not infer, do not extrapolate, do not draw on general
	knowledge"** is the three-verb prohibition. "Do not fabricate" is
	empirically easier to slip past because models distinguish
	fabrication (making things up) from extrapolation (drawing
	conclusions from adjacent but non-authoritative context).

	## Why Kubernetes curation targets recruiter-likely questions, not coverage

	The K8s corpus targets ~30-40 pages curated around concepts a
	technical reviewer would naturally type (Pod, Deployment, Service,
	Ingress, ConfigMap, RBAC) plus cross-referencing overview pages that
	stress the reranker. Cluster administration deep-dives, tutorials,
	and kubectl reference are explicitly excluded — they add noise without
	adding reviewer value and hurt retrieval precision when adjacent
	content is thin on concept definitions.

	`data/k8s_docs/SOURCES.md` is a version-controlled curation artifact.
	Each ingested URL has a one-line rationale, a date pulled, and a
	license note. This makes the corpus reproducible and documents the
	curation reasoning for any reviewer who looks closely.

	Trade-off: the corpus is not comprehensive K8s knowledge. A question
	about etcd raft internals will be correctly refused. This is not a
	bug — the refusal is part of the demo story, and "the system knows
	what it doesn't know" is a feature of the grounded-refusal mechanism.

	## Why no cross-corpus score comparison (inspired by BEIR)

	Following BEIR's heterogeneous-benchmark framing (Thakur et al.,
	NeurIPS 2021), which spans 18 datasets across 9 task types, this repo
	does not treat absolute retrieval scores as comparable across the
	FastAPI and K8s corpora: score distributions depend on chunk length,
	vocabulary overlap, and corpus density, none of which are held
	constant across domains. Only the rank-ordering of system
	configurations within a single corpus is meaningful. Concrete
	consequences for this repo:

	- Per-corpus evaluation results are reported separately, never
	aggregated into a single "combined" number.
	- The hero-tile citation accuracy (`1.00 API / 0.14 7B self-hosted`)
	stays FastAPI-specific. It is not restated as a cross-corpus average.
	- `make evaluate-fast` accepts a `--corpus` flag but has no "combined"
	mode. Anyone who wants a cross-corpus number has to run twice and
	acknowledge the incomparability in prose.
	- The landing page "Key Findings" cards avoid sentences that compare
	FastAPI and K8s numbers directly.

	The multi-corpus demo is a **surface feature for interactive
	exploration**, not a rebenchmark. The benchmark section of the README
	remains FastAPI-only and cites 27 questions on 16 docs with specific
	chunker settings.

	## K8s golden dataset uses CRAG's 8-type taxonomy as the schema

	The K8s golden dataset uses CRAG's 8-type taxonomy (Yang et al.,
	NeurIPS 2024) as the schema for `question_type`, not as a
	requirement to cover all 8 types. CRAG's taxonomy: `simple`,
	`simple_w_condition`, `set`, `comparison`, `aggregation`,
	`multi_hop`, `post_processing_heavy`, `false_premise`. Temporal
	dynamism is a separate orthogonal property captured as
	`time_sensitive: bool` on the question schema — it is not a CRAG
	category.

	Target distribution across the 25-question K8s golden set:

	- `simple` (5–6): baseline retrieval
	- `simple_w_condition` (3–4): nuanced understanding under conditions
	- `comparison` (3–4): retrieval across concept pages, reranker stress
	- `multi_hop` (5–6): synthesis across 2–4 docs, reranker stress
	- `false_premise` (3–4): grounded refusal mechanism
	- `set` / `aggregation` / `post_processing_heavy` (0–3): included
	only where corpus content naturally supports

	`time_sensitive: bool` flags 2–3 questions targeting version-bounded
	content (feature state, deprecations, API version migration).

	`false_premise` questions come in two flavors (see separate
	"False-premise questions come in two flavors" entry): pure refusal
	(flavor A) and documented negative (flavor B). The K8s set includes
	at least one of each. Flavor A tests the path where retrieval
	correctly returns nothing useful; flavor B tests the path where the
	corpus contains an explicit negative answer and the agent must
	surface it with citation rather than confabulating a positive.

	Rationale for using CRAG as schema (not coverage requirement):
	`false_premise` and `time_sensitive` stress grounded refusal and
	reduce test-set contamination risk; `multi_hop` and `comparison`
	stress the reranker because relevance spreads across multiple
	chunks. The distribution was chosen to exercise the parts of the
	pipeline the benchmark story claims — not to mimic a general-purpose
	QA benchmark.

	The golden dataset JSON schema (v2, backward-compatible with the
	FastAPI flat list) includes:

	- `source_chunk_ids: list[str]` for multi-hop partial credit
	(answer must cite at least one of the expected chunks)
	- `source_snippets: list[str]` for human-readable context during
	review
	- `question_type: str` (CRAG taxonomy value)
	- `is_multi_hop: bool` for filtered reporting
	- Dataset-level header with `corpus`, `version`, `snapshot_date`,
	and pinned `chunker` parameters so the dataset is reproducible
	against a specific K8s docs snapshot

	See `docs/plans/2026-04-12-multi-corpus-refactor-design.md` for the
	full schema and rationale.
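
For orientation, the per-question fields listed above could be modeled roughly as follows. This is a sketch, not the schema from the design doc: the field names come from this document, while `id`, `question`, the defaults, and the helper `cites_expected_chunk` are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# The eight CRAG question types used as the vocabulary for question_type.
CRAG_QUESTION_TYPES = frozenset({
    "simple", "simple_w_condition", "set", "comparison",
    "aggregation", "multi_hop", "post_processing_heavy", "false_premise",
})


@dataclass
class GoldenQuestion:
    id: str                      # assumed field name
    question: str                # assumed field name
    question_type: str
    expected_sources: list[str] = field(default_factory=list)
    source_chunk_ids: list[str] = field(default_factory=list)
    source_snippets: list[str] = field(default_factory=list)
    is_multi_hop: bool = False
    time_sensitive: bool = False  # orthogonal flag, not a CRAG category

    def __post_init__(self) -> None:
        # The taxonomy is a schema, so unknown values are rejected.
        if self.question_type not in CRAG_QUESTION_TYPES:
            raise ValueError(f"unknown question_type: {self.question_type!r}")


def cites_expected_chunk(q: GoldenQuestion, cited_chunk_ids: list[str]) -> bool:
    """Multi-hop partial credit: at least one expected chunk is cited."""
    return bool(set(q.source_chunk_ids) & set(cited_chunk_ids))


# Hypothetical record; the chunk IDs are placeholders.
sample = GoldenQuestion(
    id="example_001",
    question="How do Deployments and StatefulSets differ in pod identity?",
    question_type="multi_hop",
    source_chunk_ids=["chunk-a", "chunk-b"],
    is_multi_hop=True,
)
```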

	## EU AI Act corpus deferred to v1.2

	EU AI Act compliance mapping is deferred to v1.2. Rationale: v1
	ships two corpora (FastAPI, K8s) to demonstrate the multi-corpus
	architecture; EU AI Act as a third corpus would add ingestion and
	golden-set work without exercising architecturally new surface.
	Scoped as the first v1.2 addition after v1 launch.

	## Cold-start contingency: measure first, lazy-load if needed

	Loading two corpora at startup costs memory and cold-start time. On
	HF Spaces (target deployment), the realistic ceiling is 8–10 GB
	resident RAM and ~60 seconds cold-start before the demo feels broken.

	Policy:

	1. Measure HF Spaces cold-start on Day 1 of deployment.
	2. If cold-start < 60 s: plan validated, no changes.
	3. If cold-start > 60 s: implement a lazy-load path (FastAPI eager,
	K8s lazy on first K8s request). Scoped ~2 hours implementation.

	This contingency is not pre-built. Pre-building a lazy-load path
	that may never ship creates dead code that rots, and the test surface
	for "lazy loading plus corpus routing plus provider switching" is
	non-trivial. The RSS logging in `app.py` (Task 2) emits the exact
	numbers needed to make the decision; the decision is documented here
	so future-me remembers the threshold and doesn't optimize prematurely
	on a hunch.

	## Cold-start gate fired — assumption falsified, fix deferred to v1.1 at the right cause

	The preceding "Cold-start contingency" entry pre-committed a lazy-load
	fix (FastAPI eager, K8s lazy on first request) if the measured cold
	start exceeded 60 seconds. Measurement falsified the entry's core
	assumption: corpus loading is not the dominant cold-start cost.
	The committed fix addresses ~1 % of the observed overshoot. Executing
	it verbatim would honor the gate's letter but not its intent — theater
	dressed as discipline. This entry documents the measurement, the
	falsified assumption, and the new contingency pre-committed at the
	actual cause.

	Measurement (N=3, 2026-04-15, HF Spaces target deployment):

	| Sample | Cold start | Silent Python init | Visible phase |
	|---|---|---|---|
	| N=1 | 113 s | ~101 s | ~12 s |
	| N=2 | 89 s | ~70 s | ~19 s |
	| N=3 | 129 s | ~115 s | ~14 s |

	- Median 113 s, mean ~110 s, range 89–129 s (spread ~40 s)
	- Gate fire is unambiguous at both tails. Even the fastest sample
	(89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is
	~115 % over. No boundary ambiguity.
	- Sample-size justification. N=3 is acknowledged as a small sample.
	It is adequate here because (a) the gate-fire conclusion is stable
	across both tails, (b) the "silent Python init dominates variance"
	finding is stable across all three samples (silent phase varies
	70 → 115 s across runs; visible phase varies only 12 → 19 s), and
	(c) the cost of additional samples (manual HF Space restart + ~2 min
	wait + log extraction per sample) exceeds the marginal information
	gain once both tails fire the gate and the variance pattern is stable.
	N=4 would tighten the confidence interval on the median but does not
	change either the gate-fire conclusion or the falsified-assumption
	finding.
	- Variance source named. HF Spaces shared-infrastructure CPU / IO
	contention during Python module imports. The silent-init phase
	varies 45 s across samples (70 → 115 s); the visible phase is stable
	(12–19 s). That is the signature of host-level contention on a
	shared physical node, not code-level variability. An
	exclusively-owned container would plausibly show a tighter bound.
	- Raw log captures (preserved so this entry can be cross-checked
	against the underlying evidence without re-running the measurement):
	`measurements/2026-04-15-coldstart-n1.log`, `-n2.log`, `-n3.log`.
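
The summary figures in the bullets above can be re-derived from the three samples in the table. A small sketch using only the standard library; the variable names are illustrative, and the phase values are the approximate ones from the table.

```python
from statistics import mean, median

THRESHOLD_S = 60                  # gate from the preceding entry
cold_start_s = [113, 89, 129]     # total cold start per sample
silent_init_s = [101, 70, 115]    # approximate silent Python init phase
visible_s = [12, 19, 14]          # approximate visible startup phase

summary = {
    "median_s": median(cold_start_s),
    "mean_s": round(mean(cold_start_s), 1),
    "spread_s": max(cold_start_s) - min(cold_start_s),
    # Percentage by which the fastest and slowest samples exceed the gate.
    "fastest_pct_over": round(100 * (min(cold_start_s) / THRESHOLD_S - 1)),
    "slowest_pct_over": round(100 * (max(cold_start_s) / THRESHOLD_S - 1)),
    # Spread per phase: the silent phase carries most of the variance.
    "silent_spread_s": max(silent_init_s) - min(silent_init_s),
    "visible_spread_s": max(visible_s) - min(visible_s),
}
```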

	Where the cost lives. At the median (113 s):

	- Silent Python init phase — ~90 s (≈ 80 % of total): interpreter
	start, module imports (`torch`, `transformers`, `langchain`, `faiss`,
	`fastapi`, `httpx`, the full dependency closure), and initial model
	weight loading (`all-MiniLM-L6-v2` embedder, cross-encoder
	reranker). Not logged — no observability inside the import chain.
	- Visible startup phase — ~15 s (≈ 15 % of total): injection
	classifier init (~10 s, includes the "classifier skipped" warning),
	FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s,
	+25.8 MB RSS), reranker warmup (~2 s).

	**The K8s corpus load — which the pre-committed fix was designed to
	defer — contributes under 1 second of the 113-second median.**
	Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load
	is the same order of magnitude. Corpus loading is simply not where the
	cost lives on this deployment.

	Why we are not executing the pre-committed fix. The preceding
	contingency was written under an empirical assumption about cost
	attribution (corpus loading is the dominant cost). Measurement
	falsified the assumption. Implementing the fix anyway would be a
	mechanical execution of a recipe whose premise has been disproven —
	it checks the gate-honoring box while failing to address the cause.
	That is structurally identical to relaxing-by-redefinition ("60 s was
	too tight"), just in the opposite direction: relaxing by execution.
	The pre-commitment rule's purpose is to prevent motivated reasoning
	about the gate, not to mandate mechanical compliance with a recipe
	whose empirical foundation has collapsed.

	The honest action is (1) accept the measurement as the v1 baseline,
	(2) document the falsified assumption explicitly (this entry),
	(3) re-pre-commit a new contingency at the actual dominant cost with
	an explicit trigger condition so the decision is not relitigated at
	review time, and (4) update the user-facing README surface to reflect
	the measured cold-wake number rather than the optimistic pre-deploy
	estimate.

	v1.1 contingency — pre-committed:

	> If HF Spaces traffic produces more than N cold wakes per day
	> (N to be determined from observed usage patterns after launch, **not
	> estimated in advance**), defer eager loading of (a) the cross-encoder
	> reranker, (b) the sentence-transformers embedder, and (c) the
	> injection classifier tier to first-relevant-request.
	>
	> Estimated work: 4–6 hours (lazy-init wrappers + first-request
	> caching + integration tests for the warm/cold transition).
	>
	> Expected tradeoff: cold wake ~113 s → ~50–60 s (approaches the
	> original 60 s target); **first request after any cold wake incurs
	> +8–15 s** additional latency (model weights load synchronously in
	> the request path), after which subsequent warm requests return to
	> normal ~5 s latency.
	>
	> Trigger is usage-justified, not estimate-justified. Until real
	> traffic data justifies the work, there is nothing to optimize — a
	> recruiter demo that gets one cold wake per day does not pay for
	> 4–6 hours of engineering plus the new first-request-latency failure
	> mode. The trigger threshold N is left unnamed deliberately: naming a
	> number in advance would invite the same falsification pattern this
	> entry is documenting.
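
One possible shape for the lazy-init wrappers mentioned in the contingency, shown as a sketch only. `LazyResource` and `fake_load_reranker` are hypothetical names; a real loader would construct the reranker, embedder, or classifier instead of returning a string.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defer an expensive load until first use, then cache the result."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        # Double-checked locking: concurrent first requests load only once.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]


load_calls: list[int] = []


def fake_load_reranker() -> str:
    # Stand-in for a slow model load.
    load_calls.append(1)
    return "reranker-weights"


reranker = LazyResource(fake_load_reranker)
calls_before_first_use = len(load_calls)   # nothing loaded at startup
first = reranker.get()                     # load happens in the request path
second = reranker.get()                    # cached, no second load
```

The first `get()` is where the projected first-request latency penalty would be paid; every later call returns the cached object.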

	Methodology lesson. When a pre-committed contingency is written
	under an empirical assumption, the contingency only holds if the
	assumption survives measurement. If measurement falsifies the
	assumption, the correct action is to document the falsification,
	accept the observed baseline, and re-pre-commit at the actual cause.
	The wrong action is to execute the original recipe anyway, which
	trades one form of motivated reasoning (threshold relaxation) for
	another (recipe compliance). The underlying discipline — "pre-commit
	your gates and honor them" — does not mean "mechanically run the
	pre-committed fix regardless of what it addresses." It means "honor
	the gate's intent, which is to prevent motivated reasoning about
	pass/fail."

	Post-hoc refinement (2026-04-22) — three latency regimes observed.
	Follow-up warm-latency measurements after the initial cold-start
	characterization distinguish a middle regime that the README's
	"~2 min cold / ~5 s warm" phrasing collapses:

	| Regime | Latency | Cause |
	|---|---|---|
	| Cold-start (container spin-up) | 89–129 s | Python imports, model load (per the table above) |
	| Wake-from-idle (first `/ask` after container is up) | ~6.7 s (≈ +2 s over warm) | Residual lazy init on first request |
	| Steady-state warm | ~5 s (mean 4.74 s, n=5) | LLM API round-trip + retrieval |

	Source: n=1 first-hit-after-cold-wake @ 6.7 s, then n=5 steady-state
	warm @ 4.07, 4.69, 4.78, 5.06, 5.11 s (2026-04-22). The README claim
	captures regimes 1 and 3; the ~2 s first-hit penalty between them is
	a refinement of the warm-cold boundary, not a contradiction. The
	v1.1 projection of "+8–15 s first request after any cold wake"
	measures the delta from this ~2 s current baseline, not from a
	zero-penalty starting point — the true v1.1 UX cost over current
	behavior is closer to +6–13 s.

	## False-premise questions come in two flavors

	When authoring golden-dataset questions whose premise is wrong, the
	question can point at one of two genuinely different failure modes.
	Both are valid; they test different pipeline paths and should be
	labeled distinctly so the evaluator routes correctly.

	Flavor A — pure refusal. The premise is not addressed anywhere in
	the corpus. Example: "How do I configure Claude API rate limits in
	Kubernetes?" K8s has no such concept. Schema: `category: "out_of_scope"`,
	`expected_sources: []`, `source_snippets: []`. The evaluator's
	`grounded_refusal` metric expects the answer to contain a refusal
	phrase ("does not contain", "no information") AND cite zero sources.
	Tests the pipeline path where retrieval correctly returns nothing
	useful and the agent correctly declines.

	Flavor B — documented negative. The corpus contains an explicit
	negative answer. Example: "How do I configure NetworkPolicy to enforce
	mTLS?" The K8s NetworkPolicy docs have a "What you can't do with
	network policies" section that explicitly says "Anything TLS related
	(use a service mesh or ingress controller for this)". Schema:
	`category: "retrieval"`, `question_type: "false_premise"`,
	`expected_sources: [<the negative-answer page>]`, `source_snippets:
	[<the verbatim negative statement>]`. The evaluator expects the agent
	to retrieve the page, find the negative statement, and answer
	negatively with a citation. Tests the stricter path where the corpus
	genuinely contains the answer and the agent must not hallucinate a
	contradictory capability.

	Why both matter for the honest-evaluation brand. Grounded refusal
	is not "refuse when retrieval is weak." It is "answer exactly what the
	source says, including when the source says no." Flavor A tests the
	first half (refuse when there is nothing to ground on); flavor B tests
	the second half (report the documented negative instead of
	confabulating a positive). The K8s golden dataset includes at least
	one of each. The first K8s pilot (`k8s_pilot_005`, NetworkPolicy
	mTLS) is flavor B. Flavor A is reserved for questions targeting
	features that genuinely do not exist in the K8s corpus; at least one
	such question is required in the full 25-question set.
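
A rough sketch of how an evaluator might route the two flavors, assuming questions are plain dicts with the schema fields named above. The function names and the substring-based refusal check are assumptions for illustration, and the flavor-B check covers only the citation requirement, not whether the answer text is actually negative.

```python
REFUSAL_PHRASES = ("does not contain", "no information")


def grounded_refusal_ok(answer: str, cited_sources: list[str]) -> bool:
    """Flavor A: a refusal phrase is present AND zero sources are cited."""
    text = answer.lower()
    return any(p in text for p in REFUSAL_PHRASES) and not cited_sources


def documented_negative_ok(
    cited_sources: list[str], expected_sources: list[str]
) -> bool:
    """Flavor B (citation part only): the negative-answer page is cited."""
    return bool(set(cited_sources) & set(expected_sources))


def check(question: dict, answer: str, cited_sources: list[str]) -> bool:
    # Route on the labels so each flavor hits its own pipeline check.
    if question["category"] == "out_of_scope":
        return grounded_refusal_ok(answer, cited_sources)
    if question.get("question_type") == "false_premise":
        return documented_negative_ok(cited_sources, question["expected_sources"])
    raise ValueError("not a false-premise question")


flavor_a = {"category": "out_of_scope", "expected_sources": [], "source_snippets": []}
flavor_b = {
    "category": "retrieval",
    "question_type": "false_premise",
    "expected_sources": ["k8s_network_policies.md"],
    "source_snippets": [
        "Anything TLS related (use a service mesh or ingress controller for this)"
    ],
}
```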

	## Pilot_005 refusal-gate + agent-behavior measurement

	The first K8s pilot run surfaced two distinct flavor-B failure modes
	on `k8s_pilot_005` (NetworkPolicy mTLS). Both are empirical, both
	have specific numbers, and both are logged in
	`results/k8s_pilot_threshold_0.02.json` and
	`results/k8s_pilot_threshold_0.015.json`.

	Failure mode 1 — threshold calibration (at 0.02). The
	`SearchTool.execute()` refusal gate fired with `max_score=0.01639`,
	exactly `1/(60+1)`: the RRF score of a rank-1 hit from a single retriever.
	BM25 hit "NetworkPolicy" at rank 1; the dense encoder contributed
	nothing, because "Anything TLS related (use a service mesh or ingress
	controller for this)" is a single negative sentence, not a conceptual
	topic the page is semantically "about." Hybrid fusion inherited only
	the BM25 rank-1 score. At threshold 0.02 (the FastAPI working value),
	the gate refused before the agent saw any chunks. Retrieval P@5 and
	R@5 both 0.00; answer is a generic refusal.
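
The arithmetic behind that score can be reproduced with a few lines. This is a sketch of textbook Reciprocal Rank Fusion, not the repo's implementation; the chunk IDs are placeholders, and `gate_refuses` assumes the gate refuses when `max_score` is strictly below the threshold.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranker that returned the document
    (rank is 1-based)."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


def gate_refuses(max_score: float, threshold: float) -> bool:
    # Assumed gate semantics: refuse when the best fused score is too low.
    return max_score < threshold


# BM25 and the dense ranker agree on nothing, so no chunk gets two votes.
bm25_ranking = ["chunk_tls_negative", "chunk_other"]
dense_ranking = ["chunk_ingress", "chunk_egress"]
scores = rrf_fuse([bm25_ranking, dense_ranking])
max_score = max(scores.values())          # 1/61, one rank-1 vote

# For contrast: a chunk ranked first by both rankers scores 2/61.
agreeing_score = rrf_fuse([["chunk_x"], ["chunk_x"]])["chunk_x"]
```

With `k = 60`, a single rank-1 vote lands between 0.015 and 0.02, while two agreeing rank-1 votes clear 0.02 comfortably.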

	Failure mode 2 — agent behavior on documented negative (at 0.015).
	With the threshold dropped just below the measured max score
	(`0.015 < 0.01639`), retrieval is perfect: P@5 1.00, R@5 1.00, all
	five top chunks from `k8s_network_policies.md`. But the agent still
	produces a flavor-A-style refusal: *"The Kubernetes documentation
	does not provide specific instructions on configuring a NetworkPolicy
	to enforce mutual TLS..."* The "Anything TLS related" sentence is in
	the retrieved chunks — the agent simply treats the absence of
	positive instructions as grounds for refusal, rather than reading the
	explicit negative sentence and citing it as the answer. KHR 0.67: the
	`service mesh` and `ingress controller` keywords (the documented
	alternatives the page points to) are missing from the answer.

	Implication. The flavor-B mechanism requires more than threshold
	tuning. Fixing the gate is necessary but not sufficient. The system
	prompt needs a flavor-B clause (e.g., *"if the documentation
	explicitly says a feature does not exist or is not supported, report
	that with citation; do not treat it as unanswerable"*), or the
	K8s golden dataset's flavor-B questions must use phrasing the
	current prompt can route correctly. The 0.30 placeholder value from
	the design doc was based on "prefer conservative" intuition without
	empirical grounding — the measured working range for K8s pilot
	retrieval is lower by more than an order of magnitude than that
	intuition, and even at the working threshold the prompt layer is the
	blocker.

	What this measurement is. A pilot smoke-test result, not a
	benchmark claim. Aggregates at 0.02: P@5 0.63, R@5 0.83, KHR 0.69.
	Aggregates at 0.015: P@5 0.80, R@5 1.00, KHR 0.75. Five of six pilots
	produce substantively correct answers on K8s content under the
	working threshold — evidence the retrieval stack generalizes to K8s.
	The pilot's job was schema validation + calibration evidence, not
	launch metrics. Launch metrics come from the 25-question K8s golden
	set with tuned threshold and (likely) a revised system prompt,
	sequenced after this pilot.

	## Evaluation-layer multi-corpus support lagged the serving-layer refactor

	The Tasks 1–8 multi-corpus refactor wired corpora through
	`app.state.corpus_map` and the `/ask` serving route. `scripts/evaluate.py`
	was not touched and remained single-corpus — it read
	`config.rag.store_path` and `config.evaluation.golden_dataset`
	directly, with no awareness of the `corpora` dict. This was an
	accurate scoping of the refactor (serving-layer, not eval-layer) but
	the gap was not surfaced in the original task list.

	The K8s pilot commit adds `--corpus <name>` to `scripts/evaluate.py`,
	routing through `config.corpora[name]` for `store_path`,
	`refusal_threshold`, and a new optional `golden_dataset` field on
	`CorpusConfig`. Without `--corpus`, the legacy single-store path is
	preserved for backward compatibility with `make evaluate-fast` and
	any existing invocations.

	`CorpusConfig.golden_dataset` is `str | None = None` (optional
	rather than required) because two legitimate states exist: corpus
	has a golden dataset (FastAPI, K8s post-authoring), and corpus has no
	golden dataset yet (any corpus during bring-up). The CLI errors
	cleanly with "corpus '<name>' has no golden_dataset configured"
	when the field is None, rather than requiring all corpora to ship
	with datasets.
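
A sketch of the optional-field behavior described above, assuming a dataclass-style `CorpusConfig` and an `argparse` CLI. Only the three field names and the error message come from this document; the helper names, the store paths, and the `bringup` corpus are hypothetical. `Optional[str]` is used here as the equivalent of `str | None`.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorpusConfig:
    store_path: str
    refusal_threshold: float
    golden_dataset: Optional[str] = None   # None while a corpus is in bring-up


def resolve_golden_dataset(corpora: dict[str, CorpusConfig], name: str) -> str:
    """Return the dataset path, or exit cleanly if none is configured."""
    corpus = corpora[name]
    if corpus.golden_dataset is None:
        raise SystemExit(f"corpus '{name}' has no golden_dataset configured")
    return corpus.golden_dataset


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Omitting --corpus would select the legacy single-store path.
    parser.add_argument("--corpus", default=None)
    return parser.parse_args(argv)


corpora = {
    "fastapi": CorpusConfig(
        store_path="stores/fastapi",       # placeholder path
        refusal_threshold=0.02,
        golden_dataset="agent_bench/evaluation/datasets/tech_docs_golden.json",
    ),
    "bringup": CorpusConfig(store_path="stores/bringup", refusal_threshold=0.02),
}
args = parse_args(["--corpus", "fastapi"])
dataset_path = resolve_golden_dataset(corpora, args.corpus)
```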

	## Deferred: path-preserving ingestion

	`scripts/ingest.py` uses `doc_path.glob("*.md")` (non-recursive) and
	stores the bare filename as the chunk's `source` field. This forces
	a flat-namespace convention: FastAPI ships as `fastapi_*.md`, K8s
	ships as `k8s_*.md`, and golden dataset `expected_sources` are
	filename stems. The path-preserving alternative (recursive `rglob`
	plus relative-path source IDs, e.g., `concepts/workloads/pods`) was
	evaluated during the K8s pilot planning and explicitly deferred. The
	root-cause refactor would have required FastAPI re-ingestion and a
	rewrite of the FastAPI golden dataset's `expected_sources` — trading
	certain regression risk on a green baseline (288 tests, citation
	accuracy 1.00 on API providers) for speculative legibility benefit
	on K8s authoring.
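
The difference between the two ingestion conventions can be seen on a throwaway directory. This sketch uses only `pathlib` and `tempfile`; the file layout is invented, and stripping the `.md` suffix from the relative path is an assumption based on the `concepts/workloads/pods` example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "k8s_pods.md").write_text("# Pods\n")
    nested = root / "concepts" / "workloads"
    nested.mkdir(parents=True)
    (nested / "pods.md").write_text("# Pods\n")

    # Current convention: non-recursive glob, bare filename as source ID.
    # The nested file is invisible to this pattern.
    flat_sources = sorted(p.name for p in root.glob("*.md"))

    # Deferred alternative: recursive rglob, relative path as source ID.
    path_sources = sorted(
        p.relative_to(root).with_suffix("").as_posix()
        for p in root.rglob("*.md")
    )
```

Switching conventions changes every source ID, which is why the refactor would also force a rewrite of `expected_sources` in the golden datasets.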

	The `source_pages` field on `GoldenQuestion` preserves the
	human-readable path anchor separately from the machine identifier,
	so the deferral does not lose information. Authors see both
	`expected_sources: ["k8s_pods.md"]` (what the evaluator matches on)
	and `source_pages: ["concepts/workloads/pods"]` (where the content
	came from on kubernetes.io) in the same question record.

	Pattern marker, not a promise. This is the second visa-timeline
	deferral of a root-cause refactor in favor of a minimal-blast-radius
	fix; the first was the Mar 25 → Apr 12 P@5 slide bisection. Both
	deferrals were deliberate, not oversights. Not scheduled until
	post-launch; marker only. Post-launch scope: modify `ingest.py` to
	`rglob` + relative-path source IDs, re-ingest FastAPI, rewrite both
	golden datasets' `expected_sources` to path-style. Estimated 3h.

	## K8s refusal_threshold empirical calibration — 0.02 → 0.015

	Change. `configs/default.yaml`, `corpora.k8s.refusal_threshold`:
	`0.02` → `0.015`. Single-line config change, pilot-corpus only.
	FastAPI threshold unchanged.

	Empirical evidence. Diagnostic instrumentation of `k8s_pilot_005`
	(*"How do I configure a Kubernetes NetworkPolicy to enforce mutual
	TLS (mTLS) between Pods in the same namespace?"*) captured the
	retrieval gate firing at `max_score = 0.01639344262295082` — exactly
	`1 / (60 + 1)`, the algebraic floor for a single rank-1 BM25 hit
	under RRF with `rrf_k = 60`, dense contribution zero. At
	`refusal_threshold = 0.02`, pilot_005 tripped the gate and short-
	circuited before retrieval chunks reached the agent. At
	`refusal_threshold = 0.015` (one tick below the measured floor), the
	gate releases and retrieval proceeds. The 0.015 value is not a
	tuning guess — it is the nearest round-number floor below the
	observed gate-fire value for the single worst pilot in the set.

	Validation. `results/k8s_preedit.json` captures the full 6-pilot
	run at 0.015. Aggregate: P@5 0.80, R@5 1.00, KHR 0.78, mean
	`tool_calls_made` 1.167. All six questions receive retrieval; no
	gate-fire short-circuits. pilot_005 still refuses as a separate
	downstream issue (see next entry when the counterfactual-query fix
	lands); that is not a threshold problem.

	Scope of this commit. K8s only. FastAPI `refusal_threshold`
	(0.02) is not affected and FastAPI baseline is not re-measured.
	Launch-intent `0.30` placeholder for K8s remains as a comment
	marker; the full threshold sweep against the 25-question golden set
	replaces 0.015 with a properly-tuned value in a later commit. 0.015
	is the pilot-floor safety value, not the production-target value.

	Why this is a separate commit from the prompt revision. The
	threshold calibration is empirically grounded on its own — it
	removes the 0.01639 gate-fire blocker, which is the precondition for
	any downstream evaluation of pilot_005's actual agent behavior. The
	prompt revision addresses a different failure mode surfaced once
	the gate releases (agent search strategy is monotone positive-
	framing). Two independent changes must not be entangled in one commit;
	if the prompt revision fails its regression gate and is reverted,
	the threshold calibration should stand on its own empirical merit.
	Feedback memory `feedback_fix_before_sweep.md` applies recursively:
	fix measurement-affecting bugs at every layer before combining
	fixes into single experiments.

	## Prep for counterfactual-query prompt regression — pin, wire, tolerances

	**Three sub-changes bundled as one prep commit, each small and in
	service of making the downstream regression measurement valid.**

	1. OpenAI model pin. `agent_bench/core/provider.py:208` changes
	`self.model = "gpt-4o-mini"` → `self.model = "gpt-4o-mini-2024-07-18"`.
	The unpinned alias is a known drift vector — the Mar 25 → Apr 12 P@5
	slide bisection is an already-open parallel track item traceable to
	silent alias migration. A regression run that uses the alias across
	pre-edit and post-edit phases conflates prompt-clause effect with
	model drift, even within a single session if the alias happens to
	roll between runs. Pinning the dated snapshot removes the variable.
	Pricing dict in `configs/default.yaml` gets a matching
	`gpt-4o-mini-2024-07-18` entry so the cost-lookup at
	`provider.py:209` still resolves. The only tests that hard-code the
	model string do so in mock response payloads (not outgoing-request
	assertions) and in the langchain baseline (a separate code path);
	neither is affected.

	2. FastAPI multi-corpus eval wiring. `configs/default.yaml`
	adds `corpora.fastapi.golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json`.
	The production serving path at `routes.py:105-120 _resolve_system_prompt`
	already routes `/ask` and `/ask/stream` through `format_system_prompt(label)`
	from `core/prompts.py` — the `app.state.system_prompt` legacy fallback
	(`serving/app.py:276`) is effectively dead code given the shipped multi-corpus
	config. The only remaining caller of `task.system_prompt` is the
	`scripts/evaluate.py` legacy branch used by `make evaluate-fast`. Adding
	the missing `golden_dataset` field makes `--corpus fastapi` work so the
	regression gate can measure the actual production prompt path, not the
	legacy eval-scaffolding prompt. Purely additive; zero blast radius on
	serving (serving doesn't read `golden_dataset`).

	3. Pre-committed four-metric tolerances. Written down now, before
	the post-edit runs, so the pass/fail call on the counterfactual-query
	prompt clause is not a judgment under confirmation-bias pressure.
	Applied identically to FastAPI and K8s:

	| Metric | Pass criterion |
	|---|---|
	| P@5 | post-edit ≥ pre-edit − 0.02 |
	| R@5 | post-edit ≥ pre-edit − 0.02 |
	| Citation accuracy | post-edit ≥ pre-edit (hard gate: any drop blocks commit) |
	| Mean `tool_calls_made` | post-edit ≤ pre-edit + 0.30 |
	| Individual question cap | no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit |
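
The table translates directly into a mechanical check. The sketch below assumes aggregate metrics and per-question tool-call counts are available as dicts; the key names and both functions are hypothetical. The pre-edit aggregates reuse the K8s baseline quoted in this entry, while the per-question counts and every post-edit value are invented to show a failing run.

```python
MAX_ITERATIONS = 3


def new_cap_hits(pre_calls: dict[str, int], post_calls: dict[str, int]) -> list[str]:
    """Questions that were below the cap pre-edit but reach it post-edit."""
    return sorted(
        qid
        for qid, calls in post_calls.items()
        if calls >= MAX_ITERATIONS and pre_calls.get(qid, 0) < MAX_ITERATIONS
    )


def regression_gate(pre: dict, post: dict) -> dict[str, bool]:
    """Evaluate each pre-committed criterion; True means it passes."""
    return {
        "p_at_5": post["p_at_5"] >= pre["p_at_5"] - 0.02,
        "r_at_5": post["r_at_5"] >= pre["r_at_5"] - 0.02,
        "citation": post["citation"] >= pre["citation"],  # hard gate
        "mean_tool_calls": post["mean_tool_calls"] <= pre["mean_tool_calls"] + 0.30,
        "no_new_cap_hits": not new_cap_hits(pre["calls"], post["calls"]),
    }


# Invented per-question counts; they do not correspond to the aggregates.
pre = {"p_at_5": 0.80, "r_at_5": 1.00, "citation": 1.00,
       "mean_tool_calls": 1.167, "calls": {"q_a": 1, "q_b": 3}}
post = {"p_at_5": 0.79, "r_at_5": 1.00, "citation": 1.00,
        "mean_tool_calls": 1.50, "calls": {"q_a": 3, "q_b": 3}}

result = regression_gate(pre, post)
passed = all(result.values())
```

Here `q_b` was already at the cap pre-edit, so only `q_a` counts as a new cap-hit; a correctness-neutral edit can still fail the gate on process metrics alone.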

	pilot_005 strict flip criterion (K8s-only):
	- `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
	- Answer cites `k8s_network_policies.md`
	- Answer contains "service mesh" OR "ingress controller" (the concrete documented-negative evidence the pre-edit refusal lacked)
	- Answer does NOT begin with refusal phrasing ("The ... documentation does not provide", "I cannot answer")
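
The four criteria above can be sketched as a single predicate. This is an illustration under stated assumptions: keyword matching is done by case-insensitive substring, the refusal opener is approximated with a regular expression, and the function names are hypothetical. The evaluator's actual `keyword_hit_rate` may be computed differently.

```python
import re

GOLDEN_KEYWORDS = [
    "not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller",
]
# Approximates "The ... documentation does not provide" and "I cannot answer".
REFUSAL_OPENER = re.compile(
    r"(the(?: \w+){0,3} documentation does not provide|i cannot answer)"
)


def keyword_hit_rate(answer: str, keywords: list[str]) -> float:
    text = answer.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)


def strict_flip(answer: str, cited_sources: list[str]) -> bool:
    text = answer.lower().lstrip()
    return (
        keyword_hit_rate(answer, GOLDEN_KEYWORDS) >= 0.60
        and "k8s_network_policies.md" in cited_sources
        and ("service mesh" in text or "ingress controller" in text)
        and REFUSAL_OPENER.match(text) is None
    )


# Pre-edit refusal as quoted earlier in this document.
refusal = (
    "The Kubernetes documentation does not provide specific instructions on "
    "configuring a NetworkPolicy to enforce mutual TLS..."
)
# Invented example of a documented-negative answer.
documented_negative = (
    "NetworkPolicy does not cover anything TLS related; "
    "use a service mesh or ingress controller for mTLS."
)
```

Under substring matching the quoted refusal hits four of the six keywords, so the hit-rate criterion alone would not reject it; the remaining three criteria are what make the flip strict.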

	Baseline reference: K8s pre-edit numbers from `results/k8s_preedit.json`
	at commit `125dac0` — P@5 0.80, R@5 1.00, citation 1.00 (all 6),
	mean tool_calls 1.167. FastAPI pre-edit reference established by
	`results/fastapi_preedit.json` in the next step of this session,
	same pinned ID, same refusal threshold (0.02).

	Rationale for bundling. All three sub-changes answer "what must
	be true before the regression measurement is valid" — drift control,
	evaluation path, decision criteria. Splitting into three commits
	would add noise without adding signal. None of them change the
	prompt template itself; the prompt edit is the NEXT commit and is
	the sole experimental variable the regression measures.

	## Fix 1 (prompt-level counterfactual clause) attempted and reverted

	Outcome. K8s regression clean on every metric (P@5, R@5, KHR,
	citation, mean tool_calls all within tolerance or unchanged); K8s
	pilot_005 flipped from refusal to documented-negative-with-citation
	as designed (KHR 0.67 → 1.00, answer contains both "service mesh"
	and "ingress controller", cites `k8s_network_policies.md`).
	FastAPI regression failed on the iteration-inflation tolerance:
	mean `tool_calls_made` 1.111 → 1.556 (delta +0.444, gate +0.30),
	and two retrieval questions (q024, q025) were pushed from 1 pre-edit
	tool call to 3 post-edit tool calls (hitting `max_iterations=3`
	cap), violating the pre-committed "no new cap-hits from sub-cap
	baseline" criterion.

	Correctness metrics on FastAPI all held. Citation accuracy
	stayed at 1.000 / 1.000 across all 27 questions. P@5 delta −0.007,
	R@5 delta 0.000, KHR delta +0.006. The failure is purely process
	inflation, not output regression. q024 and q025 produce identical
	P@5/R@5/KHR/citation numbers pre and post despite the cap-hit — the
	orchestrator's "max iterations hit → one final complete() without
	tools" path happened to keep answers correct, but that is
	observation, not structural protection.

	Failure mode. The clause's trigger condition — *"your first
	search returned documentation about the subject of the question
	without addressing the specific capability or feature the user is
	asking about"* — relies on subjective LLM judgment about whether
	retrieved content "addresses" a capability. The judgment is fuzzy
	on compound multi-topic questions where the first search returns
	partial-topic coverage. q024 asks about "Docker + Gunicorn workers
	+ health checks + Pydantic Settings"; first search returns Docker
	content, LLM reads "documentation about the subject without
	addressing the specific capability," fires the follow-up with
	negative framing, gets nothing useful, does a third normal search
	to cover the remaining topics, hits the cap. Same pattern on q025.
	Over-firing on this class of question is an inherent fragility of
	prompt-level LLM-judged triggers; a wording refinement might
	narrow the misfire rate but cannot eliminate it as long as the
	judgment itself is fuzzy.

	q023 vs q024/q025 asymmetry is a useful signal for Fix 2. q023
	is a pre-existing 3-tool-call compound question ("custom error
	handling + CORS middleware + structured testing with dependency
	overrides"). Under the prompt clause, q023 was unchanged — the
	clause did not fire on it — while q024 and q025, structurally
	similar compound questions, were pushed into 3-tool-call cap-hit.
	The difference is not in question structure but in how the LLM
	interpreted the first-search return for each. That asymmetry is
	the precise reason a deterministic trigger is the right next step:
	any Fix 2 / Fix 3 candidate should be unit-testable against
	`(pilot_005, q023, q024, q025)` — the right fix must fire on
	pilot_005 and behave predictably on all three compound questions
	(either fire on all of them or none of them, but not pick them
	selectively by LLM whim).

	Gate discipline honored. The pre-committed FastAPI tolerances
	fired for exactly the reason the pre-commitment was designed:
	catching process-metric regressions before they ship. Relaxing the
	tolerances post hoc would burn the session's strongest discipline
	artifact (pre-committed tolerances plus an honored gate) for a
	marginal gain in the expected value of shipping this approach.
	The narrow pilot_005 finding does not
	evaporate with the revert — chunk 63 (`d0806d5da91d6026`) is real,
	the negative-framing retrieval is reproducible, and Fix 2 will
	surface the documented negative the same way via a deterministic
	path.

	Fix 2 deferred to a later session. Deterministic query
	expansion at the `SearchTool` layer: when a `search_documents`
	call returns no chunk containing a direct answer string, issue a
	second internal search with negative-framing keywords and merge
	results before returning to the orchestrator. Offline-testable,
	corpus-agnostic, no LLM judgment required, no iteration-budget
	impact (the double-search happens inside a single tool call, not
	across iterations). Unit-testable against the
	`(pilot_005, q023, q024, q025)` asymmetry as an acceptance fixture.

	Evidence retained. Four result JSONs in `results/` document the
	regression measurement at the pinned `gpt-4o-mini-2024-07-18`
	snapshot in this session:
	- `fastapi_preedit.json` — 27 questions, HEAD prompt, 0.02 threshold
	- `fastapi_postedit.json` — 27 questions, clause prompt, 0.02 threshold (gate-failing run)
	- `k8s_preedit_pinned.json` — 6 pilots, HEAD prompt, 0.015 threshold
	- `k8s_postedit.json` — 6 pilots, clause prompt, 0.015 threshold (gate-passing run, pilot_005 strict flip confirmed)

	The previously-committed `results/k8s_preedit.json` (from `125dac0`)
	is also a valid K8s-pinned measurement at the session-equivalent
	snapshot and remains the canonical threshold-commit evidence.

	Held DECISIONS.md drafts stay held. The counterfactual-query
	finding draft (to be updated when Fix 2 lands) and the threshold-
	calibration entry already committed at `125dac0` are both correct
	in scope. The narrowed serving-migration deferral entry (tied to
	any external reference to the counterfactual-query fix) also stays
	deferred until Fix 2 lands, since the production/eval-harness
	prompt divergence is unchanged by this revert.

	## Fix 2 pre-committed regression gate — SearchTool deterministic query expansion

	Pre-committed BEFORE post-edit runs (same discipline pattern
	that caught Fix 1's iteration inflation cleanly).

	Mechanism under test. `agent_bench/tools/search.py`
	`SearchTool.execute` gains a deterministic two-query retrieval
	path. When the primary retrieval passes the refusal gate, a
	secondary retrieval is issued against an expanded query
	(`original_query + " not supported limitations cannot"`), and the
	final context returned to the LLM is `primary_top_3 + secondary_top_5`,
	deduplicated by `chunk.id`. Both retrievals run
	inside a single `SearchTool.execute` call — from the LLM's
	perspective, the tool schema, name, parameters, and return shape
	are unchanged, and the iteration budget is untouched.

	Why this is architecturally different from Fix 1. Fix 1 placed
	a behavioral clause in the system prompt that told the agent to
	issue follow-up searches itself. The trigger was an LLM judgment
	("did the first search return content addressing the specific
	capability?") and the follow-up was a separate tool call, so it
	counted against `max_iterations`. Over-firing on compound questions
	inflated iteration counts and pushed q024/q025 to the cap. Fix 2
	replaces this with a deterministic trigger (primary passes gate),
	a fixed expansion suffix, and a merge that happens entirely inside
	one tool call. No LLM judgment; no iteration change; corpus-
	agnostic.

	Suffix choice. `" not supported limitations cannot"`. Keyword-
	dense, ungrammatical on purpose — the suffix exists to shift BM25
	and embedding mass toward "what you cannot do" / "limitations"
	sections, not to read well. The ungrammatical form is also a self-
	documenting signal in retrieval logs: anyone reading a query trace
	sees the suffix and immediately knows it is a synthetic expansion,
	not user input. A one-line comment in `search.py` preserves the
	rationale for future readers.

	Merge choice. `primary_top_3 + secondary_top_5` deduped by
	`chunk.id`, producing 5–8 unique chunks per call. Rationale: top-5
	primary would make the expansion redundant on high-overlap queries
	(defeating the mechanism), while primary-top-3 guarantees the
	expansion always contributes to the final context window. Probe
	data (`/tmp/probe_fix2_v2.py`, throwaway) confirms this merge
	strategy surfaces pilot_005's target chunk
	(`d0806d5da91d6026`, chunk_index 63, "Anything TLS related ... use
	a service mesh or ingress controller for this") at position 6–8 in
	the merged list.

	Opt-in flag, defaulting ON. `SearchTool` accepts
	`negative_framing_expansion: bool = True`. Default is the shipping
	configuration because the regression gate must measure the shipping
	behavior, not the no-op path. A `False` default would mean the gate
	validates an unused parameter, and a subsequent commit flipping the
	default would have no regression evidence. Kill switch is preserved
	via explicit `False` at construction if a future regression
	requires an A/B comparison.
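
A minimal sketch of the two-query path described above, assuming a hypothetical `retrieve` callable and a stand-in `Chunk` type (neither is the real `search.py` code). The suffix and the `primary[:3] + secondary[:5]` merge are taken from this entry; the refusal gate is omitted, and the top-5 return on the flag-off path is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

# Suffix quoted from this entry; everything else below is illustrative.
EXPANSION_SUFFIX = " not supported limitations cannot"


@dataclass(frozen=True)
class Chunk:
    """Stand-in for the real Chunk class; only `id` matters for the merge."""
    id: str
    source: str


def merge_dedup(primary: List[Chunk], secondary: List[Chunk],
                primary_k: int = 3, secondary_k: int = 5) -> List[Chunk]:
    """Concatenate the two top-k slices, keeping the first copy of each chunk.id."""
    seen = set()
    merged = []
    for chunk in primary[:primary_k] + secondary[:secondary_k]:
        if chunk.id not in seen:
            seen.add(chunk.id)
            merged.append(chunk)
    return merged


def search(query: str, retrieve: Callable[[str], List[Chunk]],
           negative_framing_expansion: bool = True) -> List[Chunk]:
    """One tool call from the LLM's perspective, up to two retrievals inside it."""
    primary = retrieve(query)
    if not negative_framing_expansion:
        # Kill-switch path: assumed to return the plain top-5 primary results.
        return primary[:5]
    secondary = retrieve(query + EXPANSION_SUFFIX)
    return merge_dedup(primary, secondary)
```

Because the merge takes at most 3 primary and 5 secondary chunks, the result always has between 5 and 8 unique chunks when both retrievals return at least 5 results.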

	Baseline reuse. The Fix 1 session's pre-edit JSONs
	(`results/fastapi_preedit.json`, `results/k8s_preedit_pinned.json`,
	both committed at `213da36`) were measured under the currently-
	committed state of the repo: pinned `gpt-4o-mini-2024-07-18`, K8s
	threshold 0.015, FastAPI threshold 0.02, HEAD `prompts.py` with no
	clause, HEAD `search.py` with no expansion. The working tree
	verification confirms this state is unchanged. These JSONs are
	therefore reused as the Fix 2 pre-edit baseline and do not need to
	be re-measured. Only post-edit runs are required for the Fix 2
	regression (~$0.02 saved).

	Pre-committed tolerances.

	\| Metric \| Pass criterion \|
	\|---\|---\|
	\| P@5 \| post-edit ≥ pre-edit − 0.02 \|
	\| R@5 \| post-edit ≥ pre-edit − 0.02 \|
	\| Citation accuracy \| post-edit ≥ pre-edit (hard gate — any drop blocks commit) \|
	\| Mean `tool_calls_made` \| post-edit ≤ pre-edit + 0.05 (design-correctness gate — see note) \|
	\| Individual cap-hit \| no question that used fewer than `max_iterations=3` iterations pre-edit may hit the cap post-edit \|

	Note on the tool_calls gate. ≤ +0.05 is a design-correctness
	gate, not a performance gate. Fix 2's invariant is that both
	retrievals happen inside one `SearchTool.execute` call, so the
	LLM's iteration count is unchanged by construction. Any non-trivial
	movement in `mean tool_calls_made` indicates the design invariant
	is broken — e.g., expansion accidentally exposed as a separate
	tool, or the LLM observing two-call behavior and adapting its
	strategy. The gate fires on design violation, not on performance
	regression. The 0.05 absolute threshold absorbs legitimate run-to-
	run variance from non-determinism in the LLM even at temperature
	0, without absorbing real iteration-count movement.
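
The aggregate tolerances above can be sketched as a single check. This is an illustration under assumed names (`Aggregate`, `gate_failures`), not the harness code, and it leaves out the per-question cap-hit criterion. The baseline values are the K8s row of the baseline reference table in this entry; the post-edit values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Aggregate:
    p_at_5: float
    r_at_5: float
    citation_accuracy: float
    mean_tool_calls: float


def gate_failures(pre: Aggregate, post: Aggregate) -> List[str]:
    """Return the names of the aggregate criteria that fired (empty list means pass)."""
    failures = []
    if post.p_at_5 < pre.p_at_5 - 0.02:
        failures.append("P@5")
    if post.r_at_5 < pre.r_at_5 - 0.02:
        failures.append("R@5")
    if post.citation_accuracy < pre.citation_accuracy:  # hard gate: any drop blocks
        failures.append("citation_accuracy")
    if post.mean_tool_calls > pre.mean_tool_calls + 0.05:  # design-correctness gate
        failures.append("mean_tool_calls")
    return failures


# K8s pilot baseline (6 pilots, 7 tool calls in total).
k8s_pre = Aggregate(p_at_5=0.800, r_at_5=1.000, citation_accuracy=1.000,
                    mean_tool_calls=7 / 6)

# Hypothetical post-edit run in which one pilot makes one extra tool call.
k8s_post_extra_call = Aggregate(p_at_5=0.800, r_at_5=1.000, citation_accuracy=1.000,
                                mean_tool_calls=8 / 6)

fired = gate_failures(k8s_pre, k8s_post_extra_call)
```

On the 6-pilot K8s set a single extra tool call moves the mean by 1/6 ≈ 0.167, so the 0.05 tolerance fires on any extra call. On the 27-question FastAPI set one extra call moves the mean by 1/27 ≈ 0.037, which stays inside the tolerance; two extra calls (≈ 0.074) do not.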

	**pilot_005 strict flip criterion (K8s-only, unchanged from Fix 1
	gate):**
	- `keyword_hit_rate ≥ 0.60` against golden keywords `["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]`
	- Answer cites `k8s_network_policies.md`
	- Answer contains "service mesh" OR "ingress controller"
	- Answer does NOT begin with refusal phrasing

	Baseline reference for the gate.

	\| Corpus \| Pre-edit source \| P@5 \| R@5 \| Citation \| Mean tool_calls \|
	\|---\|---\|---\|---\|---\|---\|
	\| FastAPI (27) \| `results/fastapi_preedit.json` @ `213da36` \| 0.585 \| 0.679 \| 1.000 \| 1.111 \|
	\| K8s (6 pilots) \| `results/k8s_preedit_pinned.json` @ `213da36` \| 0.800 \| 1.000 \| 1.000 \| 1.167 \|

	Post-edit filenames (to be produced).
	- `results/fastapi_postedit_fix2.json`
	- `results/k8s_postedit_fix2.json`

	If the gate passes: commit Fix 2 with `search.py` change, unit
	tests (including the tool-spec snapshot test), the two post-edit
	result JSONs, and this DECISIONS.md entry extended with the
	regression outcome.

	If the gate fires: revert, document the failure mode, surface
	the specific criterion that fired. No tolerance relaxation — same
	discipline pattern as Fix 1 revert.

	## Fix 2 outcome — mechanism works, response-style criterion fired, reverted

	Regression runs produced. Two post-edit runs on K8s (FastAPI not
	run — K8s findings gated the decision before API spend on the
	broader set):

	\| Run \| Merge rule \| File \| Purpose \|
	\|---\|---\|---\|---\|
	\| Fix 2 v1 \| `primary[:3] + secondary[:5]` \| `results/k8s_postedit_fix2.json` \| Initial implementation \|
	\| Fix 2 v2 \| `primary[:5] + secondary[:5]` \| `results/k8s_postedit_fix2_merge_v2.json` \| Path A refinement after v1 failed P@5 on a metric-definition mismatch \|

	v1 findings. Aggregate: P@5 0.800 → 0.767 (Δ −0.033, FAILED the
	pre-committed P@5 tolerance of Δ ≥ −0.02). The failure traced to a merge-rule /
	metric-semantics interaction: `retrieval_precision_at_k` computes
	precision on `retrieved_sources[:5]`, and with `primary[:3] +
	secondary[:5]` the first 5 entries were `primary_top_3 +
	secondary_top_2`. For pilot_005, `secondary[1]` was
	`k8s_pods.md` (chunk_index 40, surfaced because the reranker
	matched its "localhost communication" content against the expanded
	query). That single off-source chunk in position 5 dropped P@5
	from 1.00 to 0.80 for pilot_005 and similarly for pilot_006.
	Iteration invariant held (tool_calls 1.167 → 1.167). Citation
	accuracy held (1.000 → 1.000). Target chunk
	(`d0806d5da91d6026`, "Anything TLS related") reached the LLM
	context for pilot_005 at merged position 7.

	Path A refinement (merge v2). Change `primary[:3] +
	secondary[:5]` → `primary[:5] + secondary[:5]`. Rationale:
	primary_top_5 is preserved in positions 1–5 by construction, so
	P@5 computed on `ranked_sources[:5]` is unchanged from the
	no-expansion baseline. Expansion chunks land in positions 6–10.
	Target chunk still reaches LLM context (position 9 for pilot_005).
	This is an **implementation refinement, not a tolerance
	relaxation** — the pre-committed gate thresholds stand; only the
	merge rule was adjusted to respect the metric's window semantics.
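
The merge-rule / metric-window interaction can be reproduced with toy data. The sketch below assumes precision is the fraction of the first 5 sources that appear in the expected set (the real `retrieval_precision_at_k` may differ in detail), and the chunk ids are hypothetical, arranged so that the second secondary result is an off-source `k8s_pods.md` chunk as in pilot_005.

```python
from typing import List, Set, Tuple

Chunk = Tuple[str, str]  # (chunk_id, source_file), a stand-in for the real Chunk class


def merge_dedup(primary: List[Chunk], secondary: List[Chunk],
                primary_k: int, secondary_k: int = 5) -> List[Chunk]:
    """primary[:primary_k] + secondary[:secondary_k], first copy of each id wins."""
    seen: Set[str] = set()
    merged: List[Chunk] = []
    for chunk_id, source in primary[:primary_k] + secondary[:secondary_k]:
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append((chunk_id, source))
    return merged


def precision_at_k(sources: List[str], expected: Set[str], k: int = 5) -> float:
    """Assumed definition: share of the first k sources that are expected."""
    top = sources[:k]
    return sum(s in expected for s in top) / len(top) if top else 0.0


NP = "k8s_network_policies.md"
primary = [(f"p{i}", NP) for i in range(1, 6)]
secondary = [("s1", NP), ("s2", "k8s_pods.md"), ("p1", NP), ("s4", NP), ("target", NP)]

v1 = merge_dedup(primary, secondary, primary_k=3)  # expansion leaks into positions 4-5
v2 = merge_dedup(primary, secondary, primary_k=5)  # expansion lands in positions 6+

p5_v1 = precision_at_k([src for _, src in v1], {NP})
p5_v2 = precision_at_k([src for _, src in v2], {NP})
```

With this toy data the v1 merge puts the off-source chunk in position 5 and the target chunk in position 7, while the v2 merge keeps the primary top 5 intact and moves the target chunk to position 9.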

	v2 findings — perfect metric preservation, but strict-flip fails on response style.

	Aggregate:

	\| Metric \| Pre-edit \| Fix 2 v2 \| Delta \|
	\|---\|---\|---\|---\|
	\| P@5 \| 0.800 \| 0.800 \| 0.000 \|
	\| R@5 \| 1.000 \| 1.000 \| 0.000 \|
	\| KHR \| 0.806 \| 0.806 \| 0.000 \|
	\| Citation accuracy \| 1.000 \| 1.000 \| 0.000 \|
	\| Mean `tool_calls_made` \| 1.167 \| 1.167 \| 0.000 \|

	Every aggregate metric literally unchanged. Per-question
	deltas: zero on every metric, every question. The design
	invariant (iteration budget unchanged, tool schema unchanged,
	refusal gate behavior unchanged) holds perfectly.

	But pilot_005 strict flip fails on the refusal-phrasing criterion.
	Post-edit answer:

	> *"The Kubernetes documentation does not provide specific
	> instructions on configuring a NetworkPolicy to enforce mutual TLS
	> (mTLS) between Pods in the same namespace. For mTLS, it is
	> generally recommended to use a service mesh or other proxy
	> solutions, as NetworkPolicy alone does not handle TLS
	> configurations directly [source: k8s_network_policies.md]."*

	The answer substantively contains the documented negative with
	citation. But it opens with *"The Kubernetes documentation does
	not provide specific instructions..."* — the exact refusal-
	phrasing opener the strict-flip criterion was pre-committed to
	reject. The criterion exists because the brand is honest
	evaluation: an answer that opens apologizing that the
	documentation "does not provide specific instructions" reads, to
	a technical reviewer, like the system failed to find the answer
	and is papering over the gap, even though the facts and citation
	are present. The criterion fired as designed.

	Compare to Fix 1 post-edit answer (from `213da36` evidence):

	> *"Kubernetes NetworkPolicy does not support enforcing mutual TLS
	> (mTLS) directly. The documentation states that anything TLS
	> related should be handled using a service mesh or ingress
	> controller, rather than through NetworkPolicy [source: k8s_network_policies.md]."*

	Fix 1's answer asserts a fact about NetworkPolicy ("does not
	support"); Fix 2's answer asserts a fact about the documentation
	("does not provide instructions"). The first forecloses the
	capability; the second leaves open whether the capability exists
	somewhere the system didn't see. That distinction is load-bearing
	for any grounded-refusal narrative, and it separates a system that
	handles documented negatives crisply from one that hedges around
	them.
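
The strict-flip criterion applied to the two quoted answers can be sketched as follows. The golden keywords and the four checks come from the pre-committed gate; the case-insensitive substring form of `keyword_hit_rate` and the `REFUSAL_OPENER` pattern are assumptions made for illustration, since the project's actual refusal-phrasing list is not reproduced in this entry.

```python
import re
from typing import Dict

GOLDEN_KEYWORDS = ["not", "does not", "NetworkPolicy", "service mesh",
                   "TLS", "ingress controller"]

# Hypothetical opener pattern: "The ... documentation does not provide/contain ...".
REFUSAL_OPENER = re.compile(
    r"^the\b[^.]*\bdocumentation does not (provide|contain)", re.IGNORECASE)


def keyword_hit_rate(answer: str) -> float:
    """Assumed definition: share of golden keywords present, ignoring case."""
    lowered = answer.lower()
    return sum(kw.lower() in lowered for kw in GOLDEN_KEYWORDS) / len(GOLDEN_KEYWORDS)


def strict_flip(answer: str) -> Dict[str, bool]:
    """Evaluate each pilot_005 strict-flip criterion separately."""
    lowered = answer.lower()
    return {
        "keyword_hit_rate": keyword_hit_rate(answer) >= 0.60,
        "cites_source": "k8s_network_policies.md" in answer,
        "names_alternative": "service mesh" in lowered or "ingress controller" in lowered,
        "no_refusal_opener": REFUSAL_OPENER.match(answer.strip()) is None,
    }


FIX1_ANSWER = (
    "Kubernetes NetworkPolicy does not support enforcing mutual TLS (mTLS) directly. "
    "The documentation states that anything TLS related should be handled using a "
    "service mesh or ingress controller, rather than through NetworkPolicy "
    "[source: k8s_network_policies.md]."
)
FIX2_ANSWER = (
    "The Kubernetes documentation does not provide specific instructions on "
    "configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the "
    "same namespace. For mTLS, it is generally recommended to use a service mesh or "
    "other proxy solutions, as NetworkPolicy alone does not handle TLS configurations "
    "directly [source: k8s_network_policies.md]."
)

fix1_result = strict_flip(FIX1_ANSWER)
fix2_result = strict_flip(FIX2_ANSWER)
```

Under these assumptions the Fix 1 answer passes all four checks, and the Fix 2 answer passes the keyword, citation, and alternative checks but fails only on the opener.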

	Diagnosis. Fix 2's mechanism successfully gets the target chunk
	into the LLM's context window — the retrieval side of the problem
	is solved. What Fix 2 cannot provide is explicit guidance on
	how to phrase the documented negative once the chunk is present.
	Fix 1's prompt clause was doing that guidance work; removing the
	clause and relying on the LLM's unaided response style produces a
	hedging answer because the LLM, seeing both NetworkPolicy-spec
	content and a TLS limitation bullet, defaults to contextual
	hedging rather than crisp assertion.

	**Fix 2 is therefore not an alternative to Fix 1's prompt clause
	— it is a prerequisite.** Fix 2 guarantees the chunk reaches
	context; a future "Fix 2 + targeted prompt clause" stack could
	resolve both the retrieval gap and the response-style gap without
	Fix 1's over-firing problem, because the clause would no longer
	need to direct the agent to do a follow-up search (Fix 2 handled
	that). The over-firing on compound questions that broke Fix 1 was
	caused by the agent deciding to do extra search iterations under
	LLM judgment; if the expansion already happened deterministically
	inside the first tool call, the clause has less work to do and
	may not trigger the second-LLM-call pattern at all. **Speculative
	and not for this session.** Future work item.

	Gate verdict: failed on pilot_005 strict flip criterion.
	Reverting, same Fix-1 pattern.

	What this commit contains.
	- `agent_bench/tools/search.py` reverted to HEAD (no Fix 2
	code changes)
	- `tests/test_tools.py` retains the `MockChunk.id` hygiene fix
	(the real `Chunk` class has `id`; mock should match the real API
	for future test authors)
	- `tests/test_tools.py` adds `TestSearchToolSpecSnapshot`: a
	general-purpose guard that freezes `SearchTool`'s LLM-facing
	contract (name, description, parameters). The lesson from Fix 2
	is that any future refactor exposing internal SearchTool state
	to the LLM would break iteration-budget invariants — the
	snapshot test catches that at test time, independent of whether
	Fix 2 lands.
	- Two regression evidence JSONs: `results/k8s_postedit_fix2.json`
	(v1, the P@5 failure) and `results/k8s_postedit_fix2_merge_v2.json`
	(v2, the strict-flip failure). Retained as the measurement
	trail behind the revert decision.
	- This DECISIONS.md entry (pre-committed gate + outcome + revert
	narrative).

	What this commit does NOT contain. No changes to
	`agent_bench/tools/search.py`, `agent_bench/core/prompts.py`, or
	`configs/default.yaml`. Both Fix 1 (prompt clause) and Fix 2
	(SearchTool expansion) have been attempted and reverted this
	session. Three commits of progress nonetheless: `125dac0`
	(threshold calibration, empirical), `5c1f49f` (prep bundle: model
	pin + fastapi wire + Fix 1 pre-committed tolerances), `213da36`
	(Fix 1 revert narrative). The threshold calibration and model pin
	are real, shipped, measurement-grounded infrastructure changes.
	The two fix attempts are documented learning that shapes the
	future direction.

	## `grounded_refusal` metric reads answer text, not retrieved sources — 2026-04-14

	Context. Week 1 step 5 authoring (25-question K8s golden set). Two
	flavor-A out-of-scope questions (`k8s_004` Jaeger sidecar, `k8s_024`
	Envoy xDS ADS) surfaced a pre-existing bug in the
	`grounded_refusal` metric during the functional check.

	Bug 1 — wrong signal. The metric's docstring said it checks
	whether the answer correctly refuses AND cites no sources, but the
	implementation was checking `len(response_sources) == 0` where
	`response_sources` is the retrieved-sources list. Real agents
	retrieve candidates on any non-trivial OOS query (the grounded-refusal
	gate at tool level only catches the thinnest queries), inspect the
	candidates, find nothing relevant, and refuse in the answer text
	without citing anything. Checking retrieval emptiness flagged those
	correct refusals as failures. Fix: inspect the answer text for
	`[source: X.md]` citations via regex; drop the `response_sources`
	parameter from the signature entirely.
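
A small sketch of the difference between the two signals. The citation pattern and helper names are assumptions for illustration; the exact regex in `metrics.py` is not reproduced here.

```python
import re
from typing import List

# Assumed citation shape: [source: some_file.md]
CITATION_RE = re.compile(r"\[source:\s*([^\]\s]+\.md)\]")


def cited_sources(answer: str) -> List[str]:
    """Source files the answer text actually cites."""
    return CITATION_RE.findall(answer)


def no_sources_old(response_sources: List[str]) -> bool:
    """Pre-fix signal: were any candidates retrieved at all?"""
    return len(response_sources) == 0


def no_sources_new(answer: str) -> bool:
    """Post-fix signal: does the answer text cite anything?"""
    return not cited_sources(answer)


# An out-of-scope query still retrieves candidates, but the answer cites none.
retrieved = ["k8s_pods.md", "k8s_services.md"]
refusal = "The answer is not in the Kubernetes documentation."

old_signal = no_sources_old(retrieved)  # correct refusal flagged as a failure
new_signal = no_sources_new(refusal)    # correct refusal recognized
```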

	This was a silent false negative on all 5 fastapi out-of-scope
	questions (`q008`–`q010`, `q026`–`q027`) which all correctly refuse
	but were being marked `grounded_refusal=False`. Aggregate
	`refusal_rate` in `report.py` shifts by the resulting 5-question
	delta; any historical comparison to pre-fix fastapi numbers needs
	to acknowledge this.

	Bug 2 — metric coverage gap surfaced during 25-question authoring.
	`grounded_refusal_rate` recognized "does not contain information"
	phrasing (in `refusal_phrases` list) but missed "not in the
	{corpus_label} documentation" phrasing — the exact shape taught by
	the system prompt at `core/prompts.py:17-18`. The LLM produced the
	canonical form on some questions and the phrase-list form on others,
	so the metric under-counted correct refusals depending on which
	phrasing the LLM happened to choose. Fix: narrow
	regex `\bnot in the\b[^.]{0,60}\bdocumentation\b` added alongside
	phrase-list matching.

	Rejected alternative. Substring `"not in the"` would produce
	false positives on valid-answer phrasing — "the rate limit is not in
	the same scope as the request timeout", "the flag is not in the 1.28
	release; it landed in 1.29", "this value is not in the default
	range" — all of which are legitimate retrieval answers with
	conditional or scope-limiting language, not refusals. Honest
	evaluation cannot afford a metric that silently counts these as
	grounded refusals.

	Tests. Two unit tests pin both directions:
	`test_canonical_refusal_phrasing_recognized` covers the positive
	case ("The answer is not in the Kubernetes documentation"), and
	`test_not_in_the_is_not_substring_refusal` covers the negative case
	("The rate limit is not in the same scope as the request timeout").
	The negative test is the load-bearing one — without it, a future
	refactor could silently widen the matcher back to substring and pass
	all existing tests. The negative test pins design intent.
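
The two directions can be checked directly against the regex quoted above. The sketch compiles the pattern without flags (whether the real matcher uses any is not stated here) and contrasts it with the rejected substring check; the helper names are illustrative.

```python
import re

NOT_IN_DOCS_RE = re.compile(r"\bnot in the\b[^.]{0,60}\bdocumentation\b")


def regex_refusal(answer: str) -> bool:
    """Narrow matcher: 'not in the ... documentation' within one sentence."""
    return NOT_IN_DOCS_RE.search(answer) is not None


def substring_refusal(answer: str) -> bool:
    """Rejected alternative: bare substring match."""
    return "not in the" in answer


positive = "The answer is not in the Kubernetes documentation"
negatives = [
    "The rate limit is not in the same scope as the request timeout",
    "the flag is not in the 1.28 release; it landed in 1.29",
    "this value is not in the default range",
]

regex_hits = [regex_refusal(s) for s in negatives]
substring_hits = [substring_refusal(s) for s in negatives]
```

The narrow regex accepts the canonical refusal and rejects all three valid-answer phrasings, while the substring check would count every one of them as a refusal.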

	Scope bound. This is a metric correctness fix, not a threshold
	change. The 0.015 refusal-gate threshold (calibrated in `125dac0`
	against the 6-question pilot) is unchanged by this commit. Whether
	the corrected metric shifts the optimal threshold against the full
	25-question set is a question for the threshold-sweep session, not
	this authoring session.

	## Parallel tracks / deferred items — 2026-04-14

	Tracked list of work items that are deferred to parallel sessions.
	Each item has a reason for deferral and a rough scope boundary so
	the session that picks it up has the context to pre-commit tolerances
	and decision criteria before measuring.

	1. `routes.py:552` audit-logger semantics unification. The
	serving layer's audit record field still uses the pre-fix
	`grounded_refusal = not bool(sources)` expression, which disagrees
	with the evaluation metric's answer-text-based definition. Not
	surfaced to the dashboard (audit log only), but external reviewers
	who reference audit records for runtime verification would see a
	different definition than the benchmark claims. Fix: call
	`grounded_refusal(answer, category)` from `metrics.py` directly.
	When this lands, the "grounded_refusal metric" DECISIONS.md entry
	above should get a one-line addendum noting the unification.

	2. **Full 25Q threshold sweep → production-target `refusal_threshold`
	for K8s.** The 25Q set exists, the metric is correct. Sweep
	against the full set, compare to pilot-floor 0.015, pick the
	production-target value, update `configs/default.yaml` placeholder
	comment. Pre-commit before measuring: sweep range, decision
	criteria, tolerances. Do not entangle with flavor-B response-style
	work below — those are independent axes.

	3. Flavor-B response-style class (pilot_005 + k8s_022). Two
	independent reproductions of "LLM refuses when documented negative
	is in retrieved context". Retrieval is healthy on both; the gap
	is prompting. Future session: Fix 2 (counterfactual-query
	expansion in `SearchTool`) + targeted prompt clause stacked —
	previously speculative in the Fix 2 revert entry, now addresses
	a documented reproducible class. Two reproductions, not one-off.

	4. Serving-migration deferral. Tied to external references to
	the counterfactual-query fix. Unchanged from prior sessions.

	5. `agent-bench` → `refusal-bench` rename — CLOSED 2026-04-14.
	Decision: keep `agent-bench`, reframe via tagline. The original
	concern was name collision with AgentBench (Liu et al., ICLR
	2024, ~1000 citations). Due diligence at launch time: the name
	is `agent-bench` (hyphenated) vs. `AgentBench` (CamelCase),
	which are distinct identifiers across GitHub, arXiv, and PyPI.
	The two projects target different audiences (LLM-as-agent
	capability vs. RAG+refusal benchmark) and any reviewer reaching
	the repo via LinkedIn or CV sees the scope in the README within
	seconds. Rename cost is substantial (~350 internal references
	across ~60 files, two external account renames, one HF Space
	URL break with no redirect) for a naming-precision benefit that
	isn't supported by the actual scope: the benchmark's seven axes
	include retrieval, grounding, multi-hop, citation accuracy, and
	refusal, not refusal alone. Tagline reframe captures the
	honest-evaluation positioning without the rename cost:
	> "A RAG benchmark built from primitives, with honest
	> evaluation of retrieval, refusal, and grounded citation."
	HF Space rename (`Nomearod/agentbench` → `Nomearod/agent-bench`
	for GitHub-name consistency) is a separate, smaller follow-up
	deferred approximately one week. Reason: several job
	applications submitted the preceding week reference the current
	HF URL (`nomearod-agentbench.hf.space`); renaming the Space now
	would break those inbound links with no HF-side redirect. The
	rename absorbs cleanly once the application wave lands and the
	reference window expires. Until then the README, dashboard, and
	DECISIONS.md continue to reference the current `agentbench` URL;
	launch-adjacent work (Post #1, screenshots, cold-start measure)
	uses the current URL and will be updated in a single small
	follow-up commit when the rename happens.

	6. OpenAI snapshot drift bisection. Mar 25 → Apr 12 P@5 slide;
	the model pin at `5c1f49f` (`gpt-4o-mini-2024-07-18`) removed
	the ongoing drift risk, so any future measurement is apples-to-
	apples. The original bisection is still unresolved but cheap at
	this point — tractable whenever there is session capacity, low
	urgency because the pin protects forward runs.

	7. Fix 2 revert commit SHA missing from the Fix 2 outcome entry.
	The "Fix 2 outcome — mechanism works, response-style criterion
	fired, reverted" DECISIONS.md entry describes the revert
	narratively but does not cite the revert commit's SHA
	(post-rewrite: `27c2e17` — `docs(eval): Fix 2 SearchTool query
	expansion — attempted and reverted`). Add retroactive SHA
	reference in the next docs pass. Not urgent; noted so the
	narrative-without-SHA pattern does not spread to other entries.
	Lesson going forward: prefer explicit SHAs over positional
	references like "this commit" / "commit above" in DECISIONS.md
	entries — positional references do not survive history rewrites
	as robustly as SHA references do.

	## K8s refusal_threshold sweep against 25-question golden — 2026-04-14

	Override notice. This sweep ran in the same session as the
	25-question authoring + grounded_refusal metric fix (`4454894`),
	after I explicitly flagged that the parallel-tracks guidance from
	earlier in the session recommended waiting for a fresh session with
	pre-commitment discipline. The user issued an explicit override:
	"proceed on best-judgment sweep range and criteria" — logged here
	for audit trail. The pre-commitment frame below was drafted BEFORE
	running any sweep value, not after. The decision criteria were
	locked before the first data point was observed, not retrofitted.

	Sweep grid. 4 threshold values: `0.010`, `0.015` (already
	measured in `.cache/eval_k8s_full25_postfix.json`, the post-metric-
	fix run from `4454894`), `0.020`, `0.025`.
	- `0.010`: one tick below current calibration; sanity-check floor.
	- `0.015`: current calibration (pilot-floor, one tick below
	pilot_005's 0.01639 max_score).
	- `0.020`: matches legacy FastAPI threshold and the original
	provisional K8s default before the `125dac0` calibration.
	- `0.025`: one tick above legacy; exploration of whether aggressive
	OOS short-circuiting is worth the correctness risk.

	Decision criteria (pre-committed).
	1. OOS refusal must hold. Both `k8s_004` (Jaeger) and `k8s_024`
	(Envoy xDS) must retain `grounded_refusal=True` at the chosen
	threshold — whether the gate fires at the tool level or the
	LLM refuses after inspecting context doesn't matter, only that
	the metric reports True.
	2. Retrieval recall must not degrade. Each retrieval-category
	question's R@5 at the chosen threshold must be ≥ its R@5 at
	`0.015` (the post-fix-25Q baseline) with a noise tolerance of at
	most ONE question dropping by at most 0.20. Two or more drops,
	or any drop > 0.20, disqualifies the value.
	3. Citation accuracy must hold. All questions' citation_accuracy
	must be ≥ 0.95 at the chosen threshold. One question at 0.80 is
	noise-tolerated; two or more is a hard stop.
	4. k8s_022 (flavor-B) retrieval must remain at R@5=1.0. The
	gap is prompting-side, not retrieval-side; any threshold that
	breaks the already-working retrieval on flavor-B questions is
	a regression.
	5. Pick the highest threshold that satisfies 1–4. Rationale:
	a higher threshold short-circuits more OOS queries at the tool
	level, saving a retrieval round trip and an LLM call — this is
	a real latency and token-cost win when the correctness is held.
	6. Tie-break. If multiple values all satisfy 1–4, prefer the
	value closest to a clean round number (0.020 over 0.018) for
	documentation clarity.
	7. Floor. If no threshold > 0.015 satisfies 1–4, keep 0.015.
	No threshold < 0.015 will be chosen regardless — sub-0.015 is
	strictly less protective than the pilot-floor.
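
Criteria 2, 5, and 7 are mechanical enough to sketch. The function names and the per-question values below are hypothetical; the tie-break in criterion 6 and the OOS, citation, and flavor-B checks are not modeled.

```python
from typing import Dict


def recall_criterion_holds(baseline: Dict[str, float],
                           candidate: Dict[str, float]) -> bool:
    """Criterion 2: at most one question may drop, and by at most 0.20."""
    drops = [baseline[q] - candidate[q] for q in baseline if candidate[q] < baseline[q]]
    # Small epsilon so a drop of exactly 0.20 is not rejected by float rounding.
    return len(drops) <= 1 and all(d <= 0.20 + 1e-9 for d in drops)


def pick_threshold(passes: Dict[float, bool], floor: float = 0.015) -> float:
    """Criteria 5 and 7: highest passing threshold, never below the floor."""
    eligible = [t for t, ok in passes.items() if ok and t >= floor]
    return max(eligible, default=floor)


# Hypothetical per-question R@5 values at the 0.015 baseline.
baseline = {"q_a": 1.0, "q_b": 1.0, "q_c": 0.5}

two_drops = {"q_a": 0.0, "q_b": 0.0, "q_c": 0.5}       # two questions drop
one_large_drop = {"q_a": 1.0, "q_b": 0.0, "q_c": 0.5}  # one drop, but > 0.20
one_small_drop = {"q_a": 0.8, "q_b": 1.0, "q_c": 0.5}  # within noise tolerance

chosen = pick_threshold({0.010: True, 0.015: True, 0.020: False, 0.025: False})
```

A gate-fired retrieval question scores R@5 = 0.00, so a single gate-fire on a question that previously scored above 0.20 is enough to disqualify a threshold under criterion 2.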

	Scope bound. K8s only; FastAPI's `refusal_threshold: 0.02` is
	unchanged. The flavor-B response-style gap (parallel track #3) is
	NOT a sweep variable — changing the threshold does not fix LLM
	phrasing; that's the Fix 2 + prompt guidance stacked experiment
	the parallel-tracks list already defers.

	Measured results. All four runs use the post-metric-fix pipeline
	(grounded_refusal metric from `4454894`), deterministic mode,
	`gpt-4o-mini-2024-07-18`, same retriever config.

	\| threshold \| avg R@5 \| OOS refusal \| gate fired on \| broken retrieval \|
	\|-----------\|---------\|-------------\|-----------------------------------\|------------------------\|
	\| 0.010 \| 0.957 \| 2/2 \| — \| — \|
	\| 0.015 \| 0.957 \| 2/2 \| — \| — \|
	\| 0.020 \| 0.870 \| 2/2 \| k8s_006, k8s_007, k8s_024 \| k8s_006, k8s_007 (R@5=0.00) \|
	\| 0.025 \| 0.913 \| 2/2 \| k8s_004, k8s_007, k8s_024 \| k8s_007 (R@5=0.00) \|

	Structural finding: LLM query variance makes max_scores non-deterministic.
	At 0.020, `k8s_006` (ConfigMap, simple) gate-fired → empty retrieval →
	R@5=0.00. At 0.025, `k8s_006` did NOT gate-fire → 5 sources → R@5=1.00.
	A question that gate-fires at 0.020 but not at 0.025 is impossible
	if the query and retrieval are deterministic: a max_score below
	0.020 is also below 0.025. The explanation is that the SearchTool
	receives different queries across runs, because the orchestrator
	issues LLM-generated queries, and the same question can produce
	different top-k max_scores run-to-run. `k8s_006`'s max_score lives
	somewhere around the 0.018–0.025 boundary; which side of any given
	threshold it lands on depends on which query the LLM wrote.

	This means any threshold above 0.015 is structurally fragile, not
	merely "failed on this run." Even if a run at 0.018 passed, a future
	run could gate-fire on `k8s_006` or `k8s_007` because the query is
	non-reproducible. The production threshold needs to sit below all
	legitimate simple-question max_scores with enough margin to absorb
	LLM query variance.
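
The monotonicity argument can be made concrete. The sketch assumes the gate fires when the best retrieval score is strictly below the threshold; the scores used are hypothetical, not measured values.

```python
def gate_fires(max_score: float, threshold: float) -> bool:
    """Assumed gate rule: refuse at tool level when max_score is below the threshold."""
    return max_score < threshold


# Deterministic case: with one fixed score per question, firing at 0.020
# always implies firing at 0.025.
for score in (0.012, 0.018, 0.022, 0.030):
    assert not (gate_fires(score, 0.020) and not gate_fires(score, 0.025))

# The observed pattern needs two different scores, i.e. two different queries.
fired_at_020 = gate_fires(0.018, 0.020)  # hypothetical score from one LLM-written query
fired_at_025 = gate_fires(0.026, 0.025)  # hypothetical score from a different query
```

The first run fires and the second does not, which reproduces the `k8s_006` pattern only because the two runs saw different max_scores.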

	Decision: keep `refusal_threshold: 0.015`.

	- `0.010`: meets all criteria, identical measured metrics to `0.015`
	(avg R@5=0.957, OOS refusal 2/2, no citation fails). Not chosen:
	lowering strictly weakens the gate's ability to catch low-
	confidence retrievals without improving any measured metric.
	- `0.015`: chosen. Meets all criteria and is the highest value that
	does not degrade retrieval — which is the definition of the
	correct refusal-gate threshold. Preserving the gate's signal is
	the gate's purpose; `0.015` gives maximum gate strength without
	cost, `0.010` gives the same measurable behavior with less gate
	signal, so `0.015` dominates.
	- `0.020`: breaks TWO retrieval questions (`k8s_006`, `k8s_007`);
	disqualified per criterion 2.
	- `0.025`: breaks ONE retrieval question in this run (`k8s_007`),
	and the non-determinism finding means a future run could break
	more. Even ignoring non-determinism, it is still disqualified per
	criterion 2: `k8s_007`'s R@5 fell to 0.00, a drop greater than the
	0.20 noise tolerance.

	Corpus characteristic finding. The 0.020 default inherited from
	FastAPI breaks on K8s because K8s retrieval score distributions are
	lower for "easy" questions. `k8s_006` ("What is a ConfigMap?") and
	`k8s_007` ("What does a Kubernetes Job do?") are both `type: simple`
	with clean single-source expected answers — exactly the cases where
	BM25+embedding scores should be highest. They land at max_scores in
	the ~0.018 range, below the FastAPI-calibrated 0.020 default. This
	is not an authoring bug — both questions retrieve their
	`expected_sources` correctly when the gate doesn't fire. It's a
	corpus characteristic: K8s documentation has more topic-overlap
	across pages than FastAPI, diluting top-k concentration.

	The 25-question set exposed this because the 6-question pilot had
	no simple questions with low max_scores — the pilot was drawn from
	retrieval-stressful areas (comparison, multi-hop, flavor-B). The
	25-question authoring deliberately added simple questions to hit
	the CRAG distribution target (6 simple questions authored against
	a target of 5–6), and those
	simple questions revealed the corpus-characteristic floor.

	Config change. `configs/default.yaml` `corpora.k8s.refusal_threshold`
	comment updated to reference this sweep. Value unchanged at `0.015`.

	Not in scope. (a) Adding retry-with-query-variance to the
	SearchTool to reduce max_score variance — separate session, affects
	other corpora. (b) Tuning FastAPI's threshold against its golden
	set: the FastAPI default was empirically fine on its own 27Q set
	and is not a documented regression. (c) Fixing the `k8s_015`
	R@5=0.50 value observed across all threshold runs — pre-existing
	authoring state from `4454894`, tracked separately if it becomes
	a concern on future runs.

	Narrative summary. Session hypothesis: pilot_005 is a
	counterfactual-query-expansion problem. Session evidence: the
	hypothesis is correct on retrieval — the target chunk is reachable
	via negative-framing queries and Fix 2 surfaces it deterministically
	with zero iteration-budget impact. Session evidence also shows the
	hypothesis is incomplete — retrieval-only fixes cannot close
	the response-style gap, because the LLM under unaided prompting
	hedges when a documented negative is surrounded by unrelated
	topical content. A future session exploring **Fix 2 + targeted
	prompt guidance stacked** is the natural next experiment; this
	session's pilot-first discipline has been preserved against two
	distinct pre-committed gates, both firing for the reasons they
	were designed to catch.

	## Credential-exposure incident and history rewrite — 2026-04-14/15

	Summary. During Week 1 work on the
	`feat/user-friendly-landing-page-live-dashboard` branch, an
	`instruction.txt` file containing plaintext OpenAI and Anthropic
	API keys was accidentally committed at pre-rewrite SHA `2b3150f`
	(`style: fix ruff lint — import sorting, line length`) and removed
	from the working tree in a later commit (pre-rewrite SHA `3a2c5ef`,
	`security: remove instruction.txt containing plaintext credentials`).
	The removal did not clean git history — the keys remained accessible
	via `git show 2b3150f:instruction.txt` in local history.

	Discovery. The issue was discovered when GitHub push protection
	rejected the first push of the branch to the `origin` remote,
	flagging the credentials via its secret-scanning system. The branch
	had never been pushed to any public remote prior to the rewrite;
	the detection fired on the very first push attempt, which is the
	correct moment for secret-scanning to act. Honest credit to the
	tooling: GitHub's push protection did exactly what it was designed
	to do, and the alternative failure mode (silent push of real
	credentials to a public repo) did not occur.

	Immediate actions, in order.

	1. Key rotation. Rotated both OpenAI and Anthropic keys at the
	respective provider dashboards, revoking the exposed values
	immediately. Rotation was confirmed before any git operation
	ran — the reasoning was that the keys were exposed on the local
	disk regardless of whether they ever made it to a public remote,
	so the exposure window needed to be closed first.

	2. Unauthorized-use check. Verified billing/usage dashboards on
	both OpenAI and Anthropic for the exposure window (from commit
	`2b3150f` landing until rotation). No unauthorized activity
	observed on either account.

	3. Local `.env` update and smoke test. Updated local `.env`
	with the new keys. Verified both worked via minimal API calls
	that return only HTTP status codes (never the key values
	themselves): `GET /v1/models` for OpenAI (200), `POST /v1/messages`
	with a 1-token request for Anthropic (200). Total verification
	cost: <$0.0001.

	4. Repository backup. Before running any history-rewriting
	command, backed up the entire repository via `rsync -a` to
	`/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>`,
	excluding only `.mypy_cache` and `.cache` (both derivative,
	regenerable, and explicitly `.gitignore`'d). The backup preserved
	`.git/`, all four worktree state files under `.git/worktrees/`,
	the `.worktrees/` checkouts themselves, and all tracked source
	files. The backup is the safety net if the rewrite had gone
	wrong in any way; this session never needed to consult it.

	5. History rewrite via `git filter-repo`. Ran
	`git filter-repo --path instruction.txt --invert-paths --force`
	on the main clone. The `--force` flag was required because
	filter-repo's default safety check refuses to run on non-fresh
	clones; the backup step above mitigates the risk that this flag
	is usually guarding against. 186 commits were parsed and
	rewritten in ~2.4 seconds; filter-repo's internal repacking
	completed in an additional ~5 seconds. The `origin` and `hf`
	remotes were automatically unset by filter-repo as its standard
	safety behavior (and restored from a saved file before the push).

	6. Dropped empty commit. Pre-rewrite commit `3a2c5ef` (which
	removed `instruction.txt` from the working tree but did not
	clean history) became empty after filter-repo stripped the file
	from all prior commits and was dropped automatically. This is
	correct filter-repo behavior: the commit's only net effect was
	to remove a file that no longer exists in any predecessor, so
	post-rewrite it has no content change and is elided from the
	linear history. The total commit count went from 186 → 185.
	Pre-rewrite SHA `3a2c5ef` maps to `00000...00000` in
	`.git/filter-repo/commit-map`, indicating the drop. The dropped
	SHA was not referenced anywhere in DECISIONS.md, so the drop
	had zero audit-trail impact.

	7. Multi-layer verification sweep. Ran six checks across every
	location where the credentials could still be present:
	(a) `git log --all --full-history -- instruction.txt` returned
	empty; (b) `git rev-list --all --objects | grep instruction.txt`
	returned 0 matches; (c) `git reflog --all` was empty after
	`git reflog expire --expire=now --all`; (d) `git fsck
	--unreachable` returned clean; (e) `git stash list` was empty;
	(f) a precise key-value regex scan across all blobs in the
	rewritten object database (`sk-[A-Za-z0-9]{30,}`,
	`sk-ant-[A-Za-z0-9]{20,}`, and env-var-assignment patterns)
	found 23 matches, all verified to be non-secret content
	— specifically: 15 historical README.md blobs containing the
	documentation placeholder `ANTHROPIC_API_KEY=sk-ant-...`
	(with three literal dots), 7 historical `docs/provider_comparison.md`
	blobs with the same documentation placeholder pattern, and 1
	`tests/test_output_validator.py` blob containing test fixtures
	that intentionally use mock key-shaped strings to verify the
	output-validator's secret-redaction logic. The precise scan is
	a meaningful check: it demonstrates that the exposure was
	isolated to `instruction.txt` and did not spread via copy-paste
	of the key values into other files before removal.

	8. Worktree walk. All four worktrees (`feat-infra-sprint`,
	`feature-grounded-refusal`, `langchain-baseline`,
	`security-hardening`) were checked for `instruction.txt` history
	pollution and for uncommitted changes. All four were clean —
	no pollution in any branch's history (filter-repo operates on
	all refs in a shared `.git/`, so the worktrees were reached
	through the main clone's object database) and no local dirty
	state in any working tree. No worktree deletion or recreation
	was needed.

	9. DECISIONS.md SHA remap. The filter-repo operation rewrote
	every commit's SHA downstream of the first rewritten commit.
	This broke every explicit SHA reference in DECISIONS.md because
	those references pointed to pre-rewrite SHAs that no longer
	exist. The remap used `.git/filter-repo/commit-map` as the
	authoritative SHA-based mapping (not message-based pairing,
	which would have been vulnerable to duplicate-message
	ambiguity — 2 pairs of commits in the pre-rewrite history did
	in fact have identical messages, though neither was in the
	substitution set). Four unique old SHAs were remapped across
	18 substitution sites:

	| OLD (pre-rewrite) | NEW (post-rewrite) | Commit role |
	|---|---|---|
	| `bd2b913` | `213da36` | Fix 1 counterfactual prompt clause revert |
	| `b97f00f` | `125dac0` | K8s refusal_threshold 0.02 → 0.015 calibration |
	| `77017db` | `5c1f49f` | pin gpt-4o-mini snapshot + wire fastapi golden |
	| `526be18` | `4454894` | Week 1 step 5 — 25Q golden + grounded_refusal fix |

	Every message matched exactly across the old→new pairing; no
	new SHA prefix collides with any old SHA prefix; post-remap
	grep confirmed zero remaining references to any old SHA.
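
The remap in step 9 can be sketched as follows. This is a minimal illustration rather than the script used during the incident: it assumes the `commit-map` layout that filter-repo writes (a header line followed by `old new` pairs of full SHAs), it assumes short prefixes are unique, and the function names and sample SHAs are hypothetical.

```python
import re

ZERO_SHA = "0" * 40  # filter-repo maps dropped commits to an all-zero SHA


def parse_commit_map(text: str) -> dict[str, str]:
    """Return {old_full_sha: new_full_sha}; the non-hex header line is skipped."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and all(re.fullmatch(r"[0-9a-f]{40}", p) for p in parts):
            mapping[parts[0]] = parts[1]
    return mapping


def remap_short_shas(doc: str, mapping: dict[str, str], width: int = 7) -> tuple[str, int]:
    """Rewrite backticked SHAs in doc; return (new_doc, number_of_substitutions)."""
    by_prefix = {
        old[:width]: new
        for old, new in mapping.items()
        if old != new and new != ZERO_SHA  # unchanged or dropped: nothing to rewrite
    }
    count = 0

    def swap(match: re.Match) -> str:
        nonlocal count
        sha = match.group(1)
        new = by_prefix.get(sha[:width])
        if new is None:
            return match.group(0)
        count += 1
        return "`" + new[: len(sha)] + "`"  # keep the reference at its original length

    return re.sub(r"`([0-9a-f]{7,40})`", swap, doc), count


# Hypothetical commit-map: one rewritten commit, one dropped commit.
SAMPLE_MAP = (
    "old                                      new\n"
    + "a1" * 20 + " " + "b2" * 20 + "\n"
    + "c3" * 20 + " " + ZERO_SHA + "\n"
)
```

Keying the substitution on the SHA itself, not on the commit message, is what sidesteps the duplicate-message ambiguity noted above.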

	Exposure scope assessment. The branch had never been pushed
	to any public remote prior to the rewrite. The credentials existed
	in:
	- Local git history at `/Users/zenith/Desktop/agent-bench/.git/` (cleaned)
	- Four worktree clones sharing the same `.git/` (cleaned via the main repo)
	- The rsync backup at
	`/Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>`
	(to be deleted after this commit and test suite confirm the
	rewrite is correct)

	No external exposure via GitHub, HF Spaces, or any other shared
	system occurred. No cached CI artifacts contain the keys because
	CI only runs on pushed branches and this branch was never pushed.
	No forks or clones exist outside the local machine. GitHub's
	push-protection detection itself touched the key strings during
	the rejected push attempt, but GitHub's secret scanning is trusted
	infrastructure and the rejection is the good outcome, not an
	additional exposure event.

	Why this entry exists. Credential hygiene failures are worth
	documenting, not hiding. A reviewer who reads this entry sees a
	developer who: made a mistake, caught it via automated tooling
	working as designed, rotated keys before touching git, rewrote
	history surgically with a backup as the safety net, verified the
	rewrite across six independent checks, and preserved audit-trail
	integrity through the SHA remap. The honest-evaluation brand
	extends to credential-handling incidents — the alternative of
	pretending this didn't happen, or silently unblocking the
	secret-scanning rejection to push exposed values to a public
	repo, would be a strictly worse outcome for both security
	posture and brand credibility.

	Procedural lessons for DECISIONS.md going forward. Prefer
	explicit commit SHAs over positional references like "this commit"
	or "commit above" — positional references do not survive history
	rewrites as robustly as explicit SHAs do. The "Fix 2 outcome"
	entry above was identified during this incident as missing an
	explicit SHA reference to the Fix 2 revert commit (post-rewrite
	SHA `27c2e17`); this is tracked as parallel-tracks item #7 for a
	retroactive fix in the next docs pass.

	### Round 2 — Google API key format in a test fixture

	After the round-1 rewrite was complete and the feature branch had
	been pushed to `origin` for the first time, GitHub secret scanning
	raised a second alert (alert #1, `secret_type: google_api_key`)
	against `tests/test_output_validator.py` line 152 at pre-round-2
	commit `8ebe3964af7d` (`security: fail-closed on secret extraction
	and env var leakage`). The alert was on a test fixture inside a
	`@pytest.mark.parametrize` list, structurally consistent with the
	other fake fixtures in the same list (OpenAI `sk-test123`,
	Anthropic `sk-ant-xyz`, AWS `AKIAIOSFODNN7EXAMPLE`). The Google
	fixture, however, was 35 chars after the `AIza` prefix and matched
	both GitHub's detection pattern and the output validator's own
	detection regex exactly.

	Disambiguation. Asked whether the string was a hand-typed fake
	or a real-leaked Google API key, the developer confirmed: (1) yes,
	a Google API key had been created at some point in a GCP or
	Google AI Studio context unrelated to this project, and (2) no,
	the string on line 152 was not recognizably hand-typed. Unlike
	the other clearly fake fixtures in the same parametrize list,
	it also had the full length and shape of a real key, so the
	safe interpretation was to treat it as potentially real and
	rotate and rewrite rather than dismiss it as a false positive.

	Actions, in order.

	1. Google API key rotation. All Google API keys on the
	developer's GCP and Google AI Studio accounts rotated at the
	provider dashboards, regardless of which specific key matched
	line 152, because the specific match was not known with
	certainty. Rotation confirmed before any git operation.

	2. Billing/activity check. Verified Google Cloud billing and
	API activity on every project for the window since commit
	`8ebe3964af7d` landed (2026-04-12 18:18). No unauthorized
	activity observed.

	3. Why the validator regex and GitHub's detector are identical.
	The output validator's regex at `agent_bench/security/output_validator.py`
	line 23 is `\bAIza[0-9A-Za-z_\-]{35}\b` — byte-for-byte identical
	to GitHub's secret-scanning Google API Key detection pattern.
	This means there is no static test fixture that satisfies the
	validator's test assertion (the validator must block the input)
	without also triggering GitHub's push protection. Any replacement
	with a fixture that matches the validator's regex is immediately
	re-flagged; any replacement with a fixture that does not match
	the validator's regex breaks the test assertion. The cleanest
	resolution is to remove the Google fixture from the static
	parametrize list entirely and restore Google API key format
	coverage via a runtime-generated fixture that constructs a
	35-char `AIza`-prefixed string at test time and never lands as
	a literal in source code. Tracked as a parallel-tracks item.
	The output validator's regex is NOT weakened; the test loses
	one of seven parametrize cases but continues to verify OpenAI,
	Anthropic, AWS, JWT, and env-var-assignment detection.

	4. Round-2 filter-repo. Ran
	`git filter-repo --replace-text <file> --force` with the pattern
	file containing `regex:AIza[A-Za-z0-9_\-]{35}==>AIzaFIXTUREREDACTED`.
	This replaced the Google API key format anywhere it appeared
	in any historical blob across the entire repository. Every
	commit from `8ebe3964af7d` forward was rewritten, which
	cascaded through the full post-round-1 history including all
	round-1-remapped SHAs and tonight's 5 commits. Total commits
	processed: 186. filter-repo's internal commit-map wrote 152
	changed entries and 35 unchanged entries (commits before
	`8ebe3964af7d` that never touched the pattern).

	5. Working-tree fixture removal. After the filter-repo rewrite,
	`tests/test_output_validator.py` line 152 read
	`"google says AIzaFIXTUREREDACTED"` (15 chars after `AIza`,
	below the validator's 35-char regex threshold). Removed the
	line entirely from the parametrize list and added a block
	comment explaining the removal, the regex-collision reason,
	the parallel-tracks item to restore via runtime-generated
	fixture, and an explicit note that the validator's regex
	remains unchanged. Committed as a separate new commit on top
	of the rewritten history.

	6. Round-2 verification sweep. Re-ran the same six-check
	sweep: `git log`, `git rev-list --all --objects`, reflog,
	fsck, stash, and a precise regex scan across all blobs for
	the `\bAIza[0-9A-Za-z_\-]{35}\b` pattern. Zero blobs in
	the post-round-2 object database contain a 35-char `AIza`
	pattern. The scrub is complete across all history.

	7. Round-2 DECISIONS.md SHA remap. The round-1 remap table
	above uses SHAs `213da36`, `125dac0`, `5c1f49f`, `4454894`
	as the "NEW (post-rewrite)" column. These are the
	post-round-2 SHAs; they were `e6d9675`, `c1d8163`,
	`740c9d5`, `6d177ba` after round 1 and got rewritten again by
	round 2. To avoid a three-column mapping table showing
	intermediate round-1 SHAs, the table above reads as a direct
	pre-rewrite → current-state mapping. The round-1-only
	intermediate SHAs are preserved in this narrative as
	"round-1 SHAs" for audit completeness but are not the
	canonical SHAs anyone looking up a commit should use. The
	canonical SHAs are the post-round-2 values.

	Additional round-2 SHA update: parallel-tracks item #7
	(Fix 2 revert commit SHA missing from the Fix 2 outcome entry)
	was updated from `8c836f5` (post-round-1) to `27c2e17`
	(post-round-2).
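
Steps 4 and 6 of round 2 can be illustrated on in-memory blobs. The sketch below is not filter-repo itself: it applies the same substitution rule with Python's `re` module and then runs the verification pattern over the result. The blob names and contents are invented for the example, and the key-shaped string is assembled at run time so that no matching literal appears in this file.

```python
import re

# Substitution rule taken from the --replace-text pattern file (step 4).
REPLACE_RE = re.compile(rb"AIza[A-Za-z0-9_\-]{35}")
REPLACEMENT = b"AIzaFIXTUREREDACTED"

# Verification pattern used by the blob scan (step 6).
SCAN_RE = re.compile(rb"\bAIza[0-9A-Za-z_\-]{35}\b")


def scrub(blobs: dict[str, bytes]) -> dict[str, bytes]:
    """Apply the replacement to every blob, as the history rewrite does."""
    return {name: REPLACE_RE.sub(REPLACEMENT, data) for name, data in blobs.items()}


def scan(blobs: dict[str, bytes]) -> list[str]:
    """Return the names of blobs that still contain a key-shaped string."""
    return sorted(name for name, data in blobs.items() if SCAN_RE.search(data))


# Hypothetical history: one fixture with a key-shaped value, one placeholder.
fake_key = b"AIza" + b"x" * 35
history = {
    "tests/test_output_validator.py": b'"google says ' + fake_key + b'",\n',
    "README.md": b"GOOGLE_API_KEY=AIza...\n",
}

before = scan(history)         # the fixture blob is flagged
after = scan(scrub(history))   # nothing is flagged once the rule is applied
```

The replacement text is 15 characters after the prefix, so a scrubbed blob cannot re-match either pattern.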

	Exposure scope, round 2. The branch had been pushed to origin
	exactly once before round-2 was discovered (the first push at the
	end of round 1, which landed commit `3167b59` at origin). The
	feature branch was the only affected ref — `main` was not updated,
	and no PR had been merged. The round-2 cleanup requires a
	force-push with `--force-with-lease` to overwrite the pushed
	round-1 history with the round-2 history. Force-push is normally a
	discipline concern, but here it is safe: the branch was published
	less than one hour before round-2 was discovered, no other work
	was based on the pushed round-1 history, and the force-push is
	scoped to this specific branch (not `main` or any long-lived ref).

	Alert dismissal. GitHub alert #1 was dismissed as
	`false_positive` via `gh api` after the force-push, with the
	resolution comment noting that the pre-round-2 commit SHA the
	alert referenced (`8ebe3964af7d`) no longer exists in the
	rewritten history and the test fixture has been removed from
	`tests/test_output_validator.py` pending a runtime-generated
	replacement.

	Round-2 procedural lesson. The validator-regex ↔ detector-regex
	identity is a structural finding worth noting for future security
	test design. Any test fixture that verifies detection of a
	specific secret format will, by construction, match the format
	it is testing. If the format is one GitHub (or any upstream
	detector) also scans for, the fixture will trigger an alert on
	every push where it is introduced. The three durable mitigations
	are: (a) generate fixtures at runtime so they never land in source,
	(b) use an isolated regex that is a proper subset of the production
	detector's regex so fixtures fall below the detector's match
	threshold, or (c) exclude the file explicitly via a `paths-ignore`
	entry in `.github/secret_scanning.yml`. This project is adopting
	option (a) as the follow-up, because it preserves the production
	detector regex without weakening and keeps the test's fidelity to
	the actual attack surface.
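
Option (a) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's test: `make_google_key_fixture` is a hypothetical helper, and the pattern is copied from the validator regex quoted earlier in this entry. The body is drawn from letters and digits only, because a value ending in `-` would fail the trailing `\b` at the end of a string.

```python
import random
import re
import string

# Validator pattern as quoted earlier in this entry.
GOOGLE_KEY_RE = re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b")

# Alphanumerics only: keeps the trailing word boundary satisfied in any context.
_ALPHABET = string.ascii_letters + string.digits


def make_google_key_fixture(seed: int = 0) -> str:
    """Build a key-shaped string at run time; nothing matching lands in source."""
    rng = random.Random(seed)  # seeded so a failing test is reproducible
    body = "".join(rng.choice(_ALPHABET) for _ in range(35))
    return "AIza" + body


def test_pattern_matches_runtime_fixture() -> None:
    text = f"google says {make_google_key_fixture()}"
    assert GOOGLE_KEY_RE.search(text) is not None


def test_redacted_placeholder_does_not_match() -> None:
    # The post-rewrite placeholder is too short to trip the pattern.
    assert GOOGLE_KEY_RE.search("google says AIzaFIXTUREREDACTED") is None
```

Under pytest the two functions would be collected as ordinary tests; a parametrized variant could pass the helper's output through the same list as the static fixtures.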

	## Audit-path bug — streams masked a request-crashing failure (2026-04-15)

	During v1 deploy smoke testing, every non-stream `POST /ask` request
	and every injection-blocked request on the HF Space returned
	`{"detail": "Internal server error"}` (HTTP 500) instead of the
	intended 200 / 403. Normal queries via `POST /ask/stream` appeared to
	work correctly from the dashboard, so the bug was invisible until a
	direct curl hit a non-stream endpoint. Discovered and fixed same-day.

	Root cause. `agent_bench/security/audit_logger.py:60` called
	`self.path.parent.mkdir(parents=True, exist_ok=True)` the first time
	a request wrote an audit record. The default path `logs/audit.jsonl`
	resolved to `/home/user/app/logs/audit.jsonl` at runtime. The
	Dockerfile's `WORKDIR /home/user/app` creates that directory as
	root:root mode 0755 because Docker's WORKDIR directive does not
	honor `--chown`. Subsequent `COPY --chown=user` lines only change
	ownership of the copied files, not the directory itself. At runtime
	under `USER user` (uid 1000), the process had `r-x` on WORKDIR — it
	could read and execute, but not create new subdirectories. First
	audit write → `PermissionError: [Errno 13] Permission denied: 'logs'`.

	Why `.cache/` worked and `logs/` didn't. `.cache/` is created at
	build time by the two `RUN python scripts/ingest.py ...` steps
	(running as root) and then explicitly chowned by
	`RUN chown -R user:user .cache/`. `logs/` was created lazily at
	runtime by the audit logger, as a non-root user, in a directory owned
	by root, so the `mkdir` call hit a permission wall.

	Why the streaming endpoint masked it. `/ask/stream` calls
	`_write_audit` at the end of the event generator
	(`routes.py:438`), after all stage events and the final answer have
	already been yielded over SSE. A failure there is already too late
	to affect the client's view of the response — the client sees the
	answer, the stream ends, and the audit entry is silently missing.
	Non-stream `/ask` calls `_write_audit` synchronously before `return`
	(line 263), so the failure propagates up through the middleware
	exception handler and becomes a visible 500. Injection-blocked
	requests on both endpoints also audit synchronously before returning
	(lines 193, 302) and produce the same visible 500. Normal dashboard
	use hits `/ask/stream` → symptom invisible; smoke testing
	non-stream `/ask` → symptom obvious.

	Traceback confirmed from the HF Space runtime log:

	```
	  File "/home/user/app/agent_bench/serving/routes.py", line 302, in ask_stream
	    _write_audit(
	  File "/home/user/app/agent_bench/serving/routes.py", line 558, in _write_audit
	    audit_logger.log(record)
	  File "/home/user/app/agent_bench/security/audit_logger.py", line 60, in log
	    self.path.parent.mkdir(parents=True, exist_ok=True)
	PermissionError: [Errno 13] Permission denied: 'logs'
	```

	Two-fix patch. Both applied in the same deploy cycle:

	1. Dockerfile (proximate). Commit `ca34ccb`:
	`RUN mkdir -p logs && chown -R user:user .cache/ logs/` at build
	time, before `USER user`. Ensures the runtime user owns the
	audit-log directory from container start.
	2. `AuditLogger.log()` (architectural, defense-in-depth). Commit
	`25e0f1b`: wrap the entire write body in `try/except Exception`,
	log the failure via structlog as `audit_write_failed` with the
	exception type and path, return normally.
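
The shape of fix 2 can be sketched as below. This is a hypothetical reconstruction, not the code in `audit_logger.py`: the class name `SafeAuditLogger` is invented, the standard `logging` module stands in for structlog, and the boolean return value is an addition that makes the outcome easy to assert on.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("audit")  # stand-in for the project's structlog logger


class SafeAuditLogger:
    """Hypothetical audit logger that never raises into the request path."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def log(self, record: dict) -> bool:
        try:
            # Directory creation sits inside the guard: this is the call that failed.
            self.path.parent.mkdir(parents=True, exist_ok=True)
            with self.path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
            return True
        except Exception as exc:
            logger.error(
                "audit_write_failed type=%s path=%s", type(exc).__name__, self.path
            )
            return False


def demo() -> tuple[bool, bool]:
    """Write once to a usable directory and once to a path that cannot be created."""
    with tempfile.TemporaryDirectory() as tmp:
        good = SafeAuditLogger(Path(tmp) / "logs" / "audit.jsonl").log({"event": "ok"})
        blocker = Path(tmp) / "blocker"
        blocker.write_text("not a directory")
        # The parent is a regular file, so mkdir fails even when running as root.
        bad = SafeAuditLogger(blocker / "logs" / "audit.jsonl").log({"event": "x"})
        return good, bad
```

Placing a regular file where the directory should go reproduces the failure without relying on file modes, which a root-owned test process would bypass.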

	Why both, not either alone.

	- Only fix 1: unblocks the Space but leaves the architectural
	bug — an audit logger that can crash the app is misdesigned
	regardless of filesystem permissions. The SECURITY.md OWASP LLM10
	writeup would be structurally weak.
	- Only fix 2: stops the crashes, but audit writes would still
	silently fail on HF Spaces because `logs/` remains unwritable. The
	OWASP LLM10 claim would be live-untrue on the demo surface.
	- Both: demo is honest (audit log actually written on HF Space),
	architecture is sound (audit failures degrade gracefully), and the
	OWASP LLM10 writeup has live evidence backing both claims.

	Verification on the live Space after the patch. `GET /health`
	flipped from `"degraded"` to `"healthy"` with `provider_available:
	true`; `POST /ask` normal returned 200 with grounded answers citing
	real sources from both corpora (was 500); `POST /ask` with the
	`ignore_previous` injection pattern returned 403
	`"Request blocked: potential prompt injection detected"` (was 500);
	the runtime log shows `injection_detected pattern=ignore_previous
	tier=heuristic` followed by a 403 status and no `audit_write_failed`
	entries, confirming the audit write succeeded end-to-end.

	Methodology lesson — streaming endpoints can hide backend failures.
	Any request-path side effect (audit logging, metrics, cache write,
	persistence) that runs at the end of a streaming generator will fail
	silently from the client's perspective, because content is already
	committed to the wire before the failure fires. Testing such side
	effects against their failure modes through a streaming happy path
	is not sufficient — the happy path masks the failure. Every
	request-path side effect needs at least one test that verifies it
	fails loudly (or, where the design says graceful, degrades loudly)
	when its underlying dependency is broken, **exercised synchronously,
	not via the streaming code path.**
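
The masking effect can be reproduced in a few lines. The sketch below uses plain Python functions and generators, not the project's FastAPI routes; every name in it is hypothetical. The same failing side effect is fatal in the synchronous handler and arrives only after all content has been delivered in the streaming one.

```python
from typing import Callable, Iterator, Optional

Audit = Callable[[dict], None]


def failing_audit(record: dict) -> None:
    """Simulates the unwritable log directory."""
    raise PermissionError(13, "Permission denied", "logs")


def ask_sync(question: str, audit: Audit) -> str:
    answer = f"answer to {question}"
    audit({"question": question})  # runs before return, so the caller sees the error
    return answer


def ask_stream(question: str, audit: Audit) -> Iterator[str]:
    for token in ("answer", "to", question):
        yield token  # already on the wire by the time audit runs
    audit({"question": question})  # fails after the last token was delivered


def consume(stream: Iterator[str]) -> tuple[list[str], Optional[Exception]]:
    """Collect tokens the way a client would, recording any late failure."""
    received: list[str] = []
    try:
        for token in stream:
            received.append(token)
    except Exception as exc:
        return received, exc
    return received, None
```

In this in-process sketch the late exception is still observable to the consumer; over SSE the client has usually closed its view of the response by then, which is why the integration test needs to inspect server-side state instead of the response body.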

	For this project, `AuditLogger.log()` should gain a unit test that
	asserts it returns normally (not raises) when its target directory is
	unwritable — i.e., a regression test for the `try/except` fix above.
	And the dependency injection of the audit logger in the route
	handlers should get an integration test that exercises both the
	streaming and non-streaming paths against the same failure injection,
	to assert the streaming path does not hide what the non-streaming
	path surfaces. Tracked as a v1.1 test-coverage follow-up.

	Why this wasn't caught locally. Existing
	`tests/test_audit_logger.py` tests use pytest's `tmp_path` fixture,
	which is always writable by the test process. The failure mode only
	manifests when the runtime user cannot create the log directory, a
	deployment-environment condition not reproducible in a unit test
	against a local filesystem owned by the developer. An integration
	test that runs the Docker image end-to-end and hits `/ask` against
	the actual container filesystem would have caught it pre-deploy.
	Such a test is out of scope for v1 (adds ~5 min to CI plus Docker
	build infrastructure) but is the right long-term mitigation for this
	class of bug.

	## LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call

	The continuous-score single-call judges in `agent_bench/evaluation/metrics.py`
	(`answer_faithfulness`, `answer_correctness`, `_judge_call`) are deleted
	and replaced by the per-dimension Judge layer at
	`agent_bench/evaluation/judges/`. Hard cut, no deprecation cycle.

	Design doc: `docs/plans/2026-05-04-judge-layer-v1-design.md`.

	Why this is a supersession, not a refactor. The new layer differs from
	the old on six axes: discrete-anchored scale (vs continuous 0–1),
	reasoning-before-score JSON ordering (vs score-first), per-dimension
	judges (vs combined faithfulness/correctness), full provenance per call
	(judge_id + rubric_version + system_output_hash + prompt_seed; old had
	none), composable variance wrappers (rubric_permute, jury — old was
	single-call), and an intentional abstain-vs-raise discipline (vs silent
	`None` from a bare `except Exception`).

	Evidence backing the supersession claim — the calibration κ table
	quantifies the new layer's agreement with hand-labels across 6 ablation
	rows (baseline + 3 variance ablations + permute + 2-judge jury). The
	files defending this entry's claim, by file path:

	- `measurements/2026-05-04-judge-calibration-labels.jsonl` — 30 items × 3
	dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the
	literature ceiling). Lands in Phase 10.
	- `results/calibration_v1_judge_baseline.json`, `_baseline_no_cot.json`,
	`_baseline_no_anchors.json`, `_baseline_no_abstain.json`,
	`_permute.json`, `_jury_kappa_weighted.json` — per-row predictions.
	Land in Phase 11.
	- `docs/_generated/kappa_table.md` — generated κ ablation table
	copy-pasted into the writeup. Lands in Phase 11.
	- `docs/judge-design.md` — interpretive writeup with the closing
	"when NOT to use LLM-judge" position. Lands in Phase 12.

	Config-knob preservation. `evaluation.judge_provider` is unchanged
	across all 5 YAML configs; new `evaluation.judge_dimensions` field
	defaults to the three v1 dimensions. Zero user-facing config migration.

	Out of scope (v1.1+). Mistral self-hosted as the third jury member,
	Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2
	prompt optimization, citation_faithfulness in the default
	judge_dimensions, AC2 sympy-derived parity tests.

	## Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran — 2026-05-05

	The Opus stress-test pass over the 30 calibration items × 3 dimensions
	disagreed with the single-rater human gold on **22 of 30 groundedness
	items** (8/30 agreement). Relevance and completeness agreed at 28/30 and
	25/30 respectively. The groundedness disagreement is consistent in
	direction — every disagreed-on item is `human=1, opus=0` — and has a
	single root cause.

	Root cause: reference-scope drift between rubric author and labeler.
	`agent_bench/evaluation/rubrics/groundedness.md` defines the reference
	scope as the gold snippets attached to each item:

	> The judge sees only the gold snippets — not the retrieved chunks. A
	> claim that happens to be true in the world but is not entailed by the
	> snippets fails groundedness.

	The single-rater notes on the disagreed-on items describe checking
	against the broader documentation, not against `source_snippets`:
	"supported by the corpus", "supported by the docs", "supported by the
	provided dependency snippet". For items like `k8s_006` the gold snippet
	is one sentence ("A ConfigMap is an API object used to store
	non-confidential data in key-value pairs"), while the agent's answer
	correctly synthesizes seven or eight additional claims from the full
	`k8s_configmap.md`. Those claims are true in the world and well-supported
	by the full doc, but not entailed by the one snippet. Opus applied
	the strict-snippet rubric; the human rater applied a corpus-supported
	rubric.

	Why this blocks `make calibrate` against the current gold. The κ
	ablation compares Haiku and GPT-4o-mini judges against the human gold.
	A judge that correctly applies the strict-snippet rubric will disagree
	with miscalibrated gold; a judge that's too lenient will agree. The
	ablation rewards leniency and punishes rigor — the opposite of the
	intended measurement. This is the same failure mode codified earlier in
	this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune
	compensation when the measurement is wrong, not the intended effect.

	Why the rubric stays as written, not relaxed to "corpus-supported".
	Strict-snippet groundedness measures RAG behavior: did the agent
	synthesize from what it retrieved? Corpus-supported groundedness
	measures LLM general knowledge passing through a RAG harness: did the
	agent happen to be correct? The first is what this benchmark is for;
	the second is what `agent_bench/evaluation/metrics.py` measured before
	supersession. Relaxing the rubric to "corpus-supported" would silently
	re-introduce the failure mode the supersession entry above just removed.

	Decision — three-step correction lands before `make calibrate` runs:

	1. Rubric clarification commit on `agent_bench/evaluation/rubrics/groundedness.md`.
	Add an explicit reference-scope line and one anchored example
	contrasting "supported by the snippet" vs "true in the world but
	not in the snippet". Audit-trail requirement: the v1.1 writeup will
	cite "rubric clarified between v1.0 and v1.1", and the git history
	needs to back that claim.
	2. Re-label the 22 disagreed-on groundedness items in
	`measurements/2026-05-04-judge-calibration-labels.jsonl` against the
	clarified rubric, snippet-only. **Do not mechanically copy Opus's
	labels.** The labels remain the human single-rater's; what changes is
	the rubric being applied. Mechanical copy would turn the κ table
	into "judge vs Opus", which is not what the writeup claims it
	measures.
	3. Recompute `make calibrate` against the corrected gold and emit
	`docs/_generated/kappa_table.md` from the v1.1 labels.

	Evidence files for the v1.1 writeup section:

	- `measurements/2026-05-05-judge-rubric-opus-stress.jsonl` — 90 Opus
	labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min
	org rate limit, ~$0.20, ~14 min wall, zero infra-abstains).
	- `measurements/2026-05-04-judge-calibration-labels.jsonl` — original
	v1.0 single-rater gold; will be diffed against v1.1 corrected gold to
	quantify the re-label delta.
	- `agent_bench/evaluation/rubrics/groundedness.md` — pre/post diff is
	the rubric clarification.

	**Pre-labeling observations also worth recording for the writeup
	methodology section:**

	- `q021` (fastapi · calculation) answered the CORS preflight question
	correctly (600 / 60 = 10 minutes) with `sources: []` and
	`ranked_sources: []` — the agent did the arithmetic without retrieval
	and emitted an answer consistent with the snippet without having
	retrieved it. Methodologically interesting for the
	citation-faithfulness story (Block 2.7) if it ships: an answer can be
	correct without being grounded-by-citation.
	- `q025` (fastapi · multi_hop) answer was truncated mid-token by the
	orchestrator's max_tokens limit. The labels reflect what the system
	produced, not a mentally-patched complete version. The completeness
	rubric does not currently anchor "truncated response" as a level —
	v1.1 rubric work should add an anchor.
	- Several K8s items embed external knowledge that's correct but not in
	the snippet phrasing (`k8s_017` mentions exit-code-0 for init-container
	success; `k8s_009` describes Roles vs ClusterRoles by their semantics).
	The clarified groundedness rubric should pick strict on this case
	(claim must be supportable by the gold snippets, not just consistent
	with them) and the anchored example should show that ruling.

	Methodology framing for the writeup. The Opus stress-test was added
	specifically to catch hand-labeled-gold fragility before the κ table is
	published. It caught it. The writeup's calibration section should
	disclose the rubric clarification, quantify the re-label delta on
	groundedness, and report κ against the v1.1 corrected gold — that is a
	more credible story than a first-try clean κ table would have been.

	Outcome — 2026-05-05 calibrate run on v1.1 gold. All 6 ablation rows
	ran cleanly after three coupled production-code fixes that landed on the
	same branch as the rubric clarification: (1) markdown fence stripping in
	`agent_bench/evaluation/judges/base.py::_strip_markdown_fence` because
	Haiku 4.5 wraps JSON output in ` ```json ... ``` `, (2) `max_tokens`
	512 → 1024 because v1.1 anchored examples elicit longer model reasoning,
	(3) calibration runner v1.0 omitted `item_id` from prediction records;
	fixed in v1.1 with backfill of the 6 already-written row files via
	`hash → item_id` map (no re-spend). Probe-one-cell-before-sweep saved a
	fourth $0.50 wasted run after the fence-strip change — the methodology
	note in `feedback_judge_probe_before_sweep.md` was earned by this
	session's two failed full-row attempts that paid ~$1.15 for unparseable
	output before the diagnosis converged.
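
The fence-stripping fix can be pictured with the sketch below. It is a guess at the shape of such a helper, not the body of `_strip_markdown_fence` in `base.py`; the regex and the function name `strip_markdown_fence` are assumptions. The fence marker is built at run time only to keep this example free of nested fences.

```python
import json
import re

FENCE = "`" * 3  # three backticks, assembled here to avoid a nested fence

# Optional language tag after the opening fence, then the payload, then the close.
_FENCE_RE = re.compile(
    r"^\s*" + re.escape(FENCE) + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*"
    + re.escape(FENCE) + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fence(text: str) -> str:
    """Return the payload of a fenced block, or the stripped text if unfenced."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON whether or not the model wrapped it in a fence."""
    return json.loads(strip_markdown_fence(raw))


# Hypothetical judge outputs: one fenced, one bare.
fenced = FENCE + 'json\n{"reasoning": "covers the points", "score": 2}\n' + FENCE
bare = '{"reasoning": "partial", "score": 1}'
```

Probing a single cell with a parser like this before launching a full row is the cheap check that the methodology note argues for.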

	The κ table at `docs/_generated/kappa_table.md` (regenerated on
	2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for
	completeness — see report.py `_DIM_METRIC`) shows three findings
	that the writeup interprets rather than reports verbatim:

	v1.1 finding 1 — relevance is not "judges fail" territory.
	Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the
	29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00.
	AC1 is the load-bearing statistic on relevance and groundedness; both
	metrics agree on completeness, where the gold (23×2 / 5×1) is less skewed.
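
The prevalence degeneracy is easy to reproduce. The sketch below uses the standard two-rater formulas for Cohen's κ and Gwet's AC1; it is not the code in `report.py`. It assumes a three-level scale (0, 1, 2) and a hypothetical judge that answers `2` on every item, scored against the 29×2 + 1×1 gold distribution described above.

```python
from collections import Counter
from typing import Sequence


def raw_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[int], b: Sequence[int], categories: Sequence[int]) -> float:
    n = len(a)
    po = raw_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginals, per category.
    pe = sum((ca[k] / n) * (cb[k] / n) for k in categories)
    return 0.0 if pe == 1.0 else (po - pe) / (1 - pe)


def gwet_ac1(a: Sequence[int], b: Sequence[int], categories: Sequence[int]) -> float:
    n = len(a)
    po = raw_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: built from the average marginal of each category.
    pi = [(ca[k] + cb[k]) / (2 * n) for k in categories]
    pe = sum(p * (1 - p) for p in pi) / (len(categories) - 1)
    return (po - pe) / (1 - pe)


SCALE = (0, 1, 2)          # assumed three-level scale
gold = [2] * 29 + [1]      # distribution described in the text
judge = [2] * 30           # hypothetical judge that always answers 2

agreement = raw_agreement(gold, judge)
kappa = cohen_kappa(gold, judge, SCALE)
ac1 = gwet_ac1(gold, judge, SCALE)
```

With one rater constant, κ's chance term equals the observed agreement and the statistic collapses to zero, while AC1 stays close to the raw agreement.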

	**v1.1 finding 2 — `no_cot completeness` agreement is real, not
	selective abstain.** AC1 = κ = 1.000 at n=24. The 2 absent cells
	(`q021`, `k8s_012`) are infrastructure abstains (provider rate-limit
	retry exhaustion), both gold=`2`, neither in baseline's disagreement
	set. On the 24 scored cells, all 4 baseline-with-CoT disagreements
	(3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to
	agreement when CoT is removed. The interview-relevant claim is the
	opposite of the conventional CoT-helps story: CoT-before-score on
	3-point completeness lets the judge over-emphasize partial coverage
	and rationalize `1` when the human gold sides with the holistic
	"covers the points" reading.

	**v1.1 finding 3 — `jury_kappa_weighted` underperformed baseline on
	completeness, with a precise mechanism.** Per-member analysis from
	`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`:
	Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%;
	gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw
	26.9% — systematically harsh on the 3-point scale, almost never
	scoring `2`. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9%:
	the raw agreement matches gpt-4o-mini alone exactly because the jury
	verdict reduces to gpt-4o-mini's verdict on every disputed cell.

	The mechanism is missing-weight + round-down compounding, not
	weighted voting in the usual sense.
	`scripts/run_calibration.py::_load_weights_from_baseline` is a
	documented v1 stub that returns weight = 1.0 for every judge_id
	present in baseline. baseline.json
	contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini
	gets 1.0 from `jury.py`'s missing-key fallback (with a logged
	`jury_missing_weight_fallback_to_one` warning per call). Equal
	weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean
	of 1.5; the `_discretize_mean` rule is `frac > 0.5 → ceil else floor`,
	and `0.5 > 0.5` is false, so 1.5 floors to 1. gpt-4o-mini's verdict
	wins every disputed cell. The v1 design doc's risks subsection listed
	"jury κ worse than the better individual judge — (a) kappa-weighting
	wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired
	both branches simultaneously: branch (a) because the weighting is a
	stub returning equal weights, and branch (b) because round-down at
	exact 0.5 ties hands the verdict to the lower-scoring member.
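
The compounding can be traced with a small reconstruction of the rule as described above. This is not the code in `jury.py`: the function names, the judge ids, and the weights dictionary are stand-ins, and only the `frac > 0.5` rule and the fallback weight of 1.0 come from the text.

```python
import math


def discretize_mean(mean: float) -> int:
    """Round to an integer score; exact .5 goes down because the test is strict."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


def jury_verdict(
    scores: dict[str, int],
    weights: dict[str, float],
    fallback_weight: float = 1.0,
) -> int:
    """Weighted mean of member scores; members missing from weights get the fallback."""
    total = sum(weights.get(judge, fallback_weight) for judge in scores)
    mean = sum(weights.get(judge, fallback_weight) * s for judge, s in scores.items()) / total
    return discretize_mean(mean)


# Weights source that only knows about one member, as with the baseline-only stub.
stub_weights = {"haiku": 1.0}

# Disputed cell: the fallback makes the weights equal, the mean lands on 1.5,
# and the strict comparison sends it to the lower score.
disputed = jury_verdict({"haiku": 2, "gpt-4o-mini": 1}, stub_weights)

# Undisputed cell for comparison: both members agree, so the verdict is theirs.
agreed = jury_verdict({"haiku": 2, "gpt-4o-mini": 2}, stub_weights)
```

Here `disputed` equals the lower member's score, which is the behavior that makes the jury's raw agreement track gpt-4o-mini's.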

	The deeper structural point is that weighting alone cannot rescue a
	systematically miscalibrated member. Even held-out validation that
	correctly assigned gpt-4o-mini's true low weight on completeness
	would still let it dominate disputed ties unless its weight were
	driven near zero — and at that point exclusion is more honest than
	near-zero inclusion. The conservative-on-binary "ties to lower" rule
	also doesn't transfer cleanly to ordinal scales: on completeness,
	"conservative" means scoring toward incomplete, which is precisely
	the direction of gpt-4o-mini's bias.

	v1.2 fix list (four items, expanding the earlier two-item list):

	1. Held-out jury weights. Replace the
	`_load_weights_from_baseline` stub with a real κ-derived
	computation, evaluated on a held-out validation set — not the
	same calibration row whose κ is being measured against the gold.
	Closes the circular-weighting hole.
	2. Symmetric member coverage in the weights source. Missing-member
	fallback to weight = 1.0 amplifies an unweighted member rather than
	suppressing it. Either every jury member must have a weight in the
	source file or the run must abort. The
	`jury_missing_weight_fallback_to_one` warning fired loudly on every
	call this run; in v1.2 it should be a hard error.
	3. **Per-dimension member exclusion when individual κ falls below a
	threshold.** gpt-4o-mini at κ = 0.020 on completeness should not be
	in the completeness jury at all. Weights below a floor (suggested
	κ < 0.2) should be treated as exclusion, not as small-weight
	inclusion. Held-out validation fixes circular weighting; it does
	not fix systematic member bias.
	4. Per-dimension tie-break rule. v1's `_discretize_mean` rule
	(ties to lower) was selected for conservative behavior on binary
	scales, where "conservative" means scoring 0 on uncertainty. On
	3-point completeness, "conservative" means scoring toward
	incomplete, which interacts badly with member miscalibration.
	v1.2 should select the tie-break rule per-dimension based on the
	rubric's conservative direction, not globally.
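
Items 3 and 4 could look roughly like the sketch below. Everything here is hypothetical: the `TIE_BREAK` table, the function names, and in particular the direction chosen for completeness, which the list above leaves open. Only the κ < 0.2 floor and the v1 "ties to lower" rule come from the text.

```python
import math

KAPPA_FLOOR = 0.2  # suggested floor from item 3

# Hypothetical per-dimension policy for exact .5 ties. "lower" is the v1
# rule; the "upper" entry for completeness is a placeholder, not a decision.
TIE_BREAK = {"groundedness": "lower", "completeness": "upper"}


def eligible_members(kappas: dict[str, float], floor: float = KAPPA_FLOOR) -> dict[str, float]:
    """Item 3: members below the floor are excluded, not down-weighted."""
    return {judge: k for judge, k in kappas.items() if k >= floor}


def discretize(mean: float, tie_break: str) -> int:
    """Item 4: only exact .5 ties consult the per-dimension rule."""
    frac = mean - math.floor(mean)
    if frac == 0.5:
        return math.ceil(mean) if tie_break == "upper" else math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


completeness_kappas = {"haiku": 0.416, "gpt-4o-mini": 0.020}
print(eligible_members(completeness_kappas))        # gpt-4o-mini is dropped
print(discretize(1.5, TIE_BREAK["completeness"]))   # tie resolved per dimension
```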

	Evidence files: `docs/_generated/kappa_table.md` (regenerated with
	AC1 for groundedness/relevance, κ for completeness);
	`results/calibration_v1_judge_jury_kappa_weighted_members.jsonl`
	(per-member sidecar where the gpt-4o-mini completeness bias is
	visible per item); `results/calibration_v1_judge_baseline.json`
	(weights source — note the absence of any gpt-4o-mini-2024-07-18
	entries, which is why the missing-weight fallback fires).

	## v1.1 jury rescue — sharpened diagnostic + pre-committed A+B success criteria

	Date: 2026-05-06. Status: in-flight; this entry is the pre-experiment
	contract that pins down what counts as success before the re-aggregation
	runs, so the outcome can't be negotiated post-hoc.

	**Sharpened diagnostic — extraction-vs-reasoning split, not just "model is
	biased".** Re-reading the per-member sidecar (item-level, not aggregate)
	on the gpt-4o-mini completeness disputes shows a more specific failure
	mode than "harsh on 3-point". On the three representative gold=2 / Haiku=2
	/ gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's `evidence_quotes`
	field correctly extracts the paraphrased coverage from the agent answer
	— and then its `reasoning` field denies that those very quotes constitute
	coverage. k8s_002 is the cleanest instance: the model quotes the strings
	"declarative updates" and "sticky identity" into evidence, then writes
	"the answer does not explicitly mention 'declarative updates' and 'sticky
	identity'". The score follows the reasoning, not the evidence. The
	mechanism is that the model's post-extraction reasoning step applies a
	literal-string-match standard to the answer text while the rubric
	requires "paraphrase allowed" — i.e., the structured-output discipline
	forced an extraction step that the reasoning step then contradicted on
	autopilot. This is a known failure mode in chain-of-thought judges and
	shows up more in smaller models because the reasoning step has less
	capacity to integrate the rubric's instruction with the literal-text
	comparison the model is running by default. The artifact for the writeup
	is `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` (three
	side-by-side reasoning + evidence_quotes excerpts).

	Pragmatic v1.1 weights-source decision. The v1.2 fix-list above
	specifies a held-out validation set for jury weights — methodologically
	clean but requires either splitting N=30 (loses statistical power on
	both halves) or labeling more items (eats interview prep time). v1.1
	chooses pragmatic: weights computed from the same calibration set used
	for κ reporting, with the circularity flagged in the writeup. Reason:
	(a) the alternative is splitting N=30, (b) the per-member κ values used
	as weights are internally consistent, (c) v1.2 will use a held-out 20-
	item set. The writeup will contain a sentence acknowledging the
	circularity rather than hiding it.

	v1.1 elevated fix-list (subset of the v1.2 list above). Item 2
	(symmetric coverage / hard-error) is elevated unconditionally. Item 1
	(real κ-derived weights) is elevated in pragmatic form (same set with
	circularity caveat). Items 3 (per-dimension exclusion) and 4 (per-
	dimension tie-break) remain v1.2 unless B's outcome forces them up.

	Pre-committed B success criteria. Plan B is "re-aggregate the existing
	164 member-rows in `calibration_v1_judge_jury_kappa_weighted_members.jsonl`
	with corrected κ-derived weights, no new API spend." The outcome maps
	deterministically to one of three predefined responses, picked before
	B runs:

	- **Outcome 1 — jury κ on completeness exceeds Haiku-baseline κ by ≥
	0.05** (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story:
	"v1's weights-source bug masked correct aggregation; once both bugs
	(asymmetric coverage + missing-weight fallback) are fixed, the jury
	improves on baseline. Per-dimension exclusion remains a v1.2 design
	pattern but is not needed at v1.1." This is the strong story.
	- Outcome 2 — jury κ within ±0.05 of Haiku-baseline (i.e., 0.366 ≤
	jury κ < 0.466). Writeup story: "weights-source fix recovers parity
	but the jury isn't doing meaningful work on completeness — gpt-4o-
	mini's near-zero weight makes it effectively excluded by aggregation.
	This is soft exclusion via weighting; v1.2 will make exclusion
	explicit." Defensible but less clean.
	- Outcome 3 — jury κ falls below Haiku-baseline κ by >0.05 (i.e.,
	jury κ < 0.366). Writeup story: "weights-source fix is necessary but
	not sufficient; even at near-zero weight gpt-4o-mini's verdict tips
	disputed (1, 2) ties due to the round-down rule. v1.1 escalates to
	per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1.
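
Since the mapping is meant to be deterministic, it can be written down as a function before B runs. This is a sketch with illustrative names; the 0.416 baseline and the 0.05 margin are the values from the list above, and an exact +0.05 delta is assigned to Outcome 1 here.

```python
HAIKU_BASELINE_KAPPA = 0.416
MARGIN = 0.05


def classify_outcome(
    jury_kappa: float,
    baseline: float = HAIKU_BASELINE_KAPPA,
    margin: float = MARGIN,
) -> int:
    """Map a jury κ on completeness to pre-committed outcome 1, 2, or 3."""
    delta = jury_kappa - baseline
    if delta >= margin:
        return 1   # jury beats the Haiku baseline: the strong story
    if delta < -margin:
        return 3   # jury is worse: escalate per-dimension exclusion into v1.1
    return 2       # parity: soft exclusion via weighting


print(classify_outcome(0.416))  # parity with Haiku alone
```

Values that sit within floating-point error of a boundary would need an explicit tolerance; the sketch does not add one.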

	Why the predefined-criteria framing matters. "I ran B, looked at the
	number, decided it was good enough" is the same data with a weaker frame
	than "I predefined the success criteria before running the experiment, B
	landed at outcome X, which mapped to predefined response Y". The latter
	demonstrates evaluation maturity in the writeup; the former invites
	post-hoc reading of the outcome.

	B outcome — 2026-05-06. Plan B re-aggregated the existing 164 sidecar
	rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on
	completeness; clipped at 0 from raw κ values). Result: **jury κ on
	completeness = 0.416**, exactly matching Haiku-baseline. Δ = 0.000;
	maps to Outcome 2 (soft exclusion via weighting). Per the
	pre-committed response, v1.1 stops here and writes up; per-dimension
	member exclusion (item C / v1.2 fix #3) is not escalated to v1.1.

	Mechanism, validated empirically — a disputed cell (Haiku=2, gpt=1)
	with corrected weights aggregates as `(2 × 0.416 + 1 × 0.020) / 0.436 =
	1.954`. The fractional part (0.954) exceeds 0.5, so `_discretize_mean`
	rounds up to 2, giving the
	correct verdict. v1's two compounding bugs (asymmetric source returning
	weight=1.0 for Haiku and the missing-key fallback returning 1.0 for gpt-
	4o-mini) jointly forced equal weights, and equal-weights with the same
	round-up rule produced `(2 × 1 + 1 × 1) / 2 = 1.5`, which has frac
	exactly 0.5 (not > 0.5), and floored to 1 — gpt's verdict winning every
	disputed cell. The bug fixes recover the right verdict purely
	mechanically; no judge model behavior changes.
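
The corrected aggregation can be sketched the same way. The names below are illustrative, not the signatures in `jury.py` or `run_calibration.py`; what is taken from this entry is the contract: negative κ clips to weight 0, a missing weight is a hard `ValueError`, and the rounding rule is unchanged.

```python
import math


def kappa_to_weight(kappa: float) -> float:
    """Clip negative κ to zero so a worse-than-chance member gets no vote."""
    return max(kappa, 0.0)


def weighted_verdict_strict(scores: dict[str, int], weights: dict[str, float]) -> int:
    missing = sorted(set(scores) - set(weights))
    if missing:
        # v1.1 contract: no silent fallback to 1.0
        raise ValueError(f"no jury weight for: {missing}")
    total = sum(weights[judge] for judge in scores)
    if total == 0:
        raise ValueError("all jury weights are zero")
    mean = sum(scores[judge] * weights[judge] for judge in scores) / total
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


weights = {
    "haiku": kappa_to_weight(0.416),
    "gpt-4o-mini": kappa_to_weight(0.020),
}
# (2 * 0.416 + 1 * 0.020) / 0.436 is about 1.954, which rounds up to 2.
print(weighted_verdict_strict({"haiku": 2, "gpt-4o-mini": 1}, weights))
```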

	The empirical reading: the weighting is not doing meaningful work —
	gpt-4o-mini's near-zero weight effectively excludes it on completeness,
	and the jury's κ matches Haiku-alone exactly because Haiku's verdict
	wins every disputed cell. This is "soft exclusion via weighting"; v1.2's
	explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the
	exclusion visible in the jury config rather than emergent from κ-derived
	weight collapse.

	v1.1 code changes (this commit):
	- `agent_bench/evaluation/variance/jury.py` — silent missing-weight
	fallback to 1.0 → hard `ValueError`. Two existing tests that asserted
	the old contract (`test_kappa_weighted_reasoning_reports_applied_weights_not_dict`,
	`test_kappa_weighted_logs_warning_on_missing_weight`) updated to
	assert the new contract.
	- `scripts/run_calibration.py::_load_weights_from_baseline` →
	`_compute_kappa_weights` — replaces the v1 stub with real per-judge
	Cohen's κ on the dimension; hard-errors when any expected member is
	missing from the source. Clips κ < 0 to weight = 0 (soft exclusion).
	- `configs/calibration/rows/jury_kappa_weighted.yaml` — `weights_source`
	re-pointed from `calibration_v1_judge_baseline.json` (Haiku-only,
	asymmetric coverage) to
	`calibration_v1_judge_jury_kappa_weighted_members.jsonl` (both judges,
	same calibration set with documented circularity).
	- `tests/scripts/test_run_calibration_dispatch.py` — two new tests cover
	`_compute_kappa_weights`: (a) computes real κ (high-agreement judge →
	weight=1.0, chance-agreement judge → 0); (b) hard-errors on
	asymmetric source coverage.
	- `results/calibration_v1_judge_jury_kappa_weighted_v1_1.json` — new
	predictions row produced by re-aggregating the existing sidecar
	offline (no API spend; via `scripts/_dev/reaggregate_jury_v1_1.py`).
	`docs/_generated/kappa_table.md` regenerated with this row alongside
	the broken v1 row, giving the writeup a clean before/after diff
	(completeness: 0.014 → 0.416, n=26).
	- `measurements/2026-05-06-gpt4o-extraction-reasoning-split.md` — the
	three side-by-side reasoning + evidence_quotes excerpts (q006 /
	k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split
	diagnostic finding.

	The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1
	(pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2.

	## Plan 3A — recency-positioned paraphrase instruction (pre-committed criteria)

	Date: 2026-05-06. Status: in-flight; this entry pins down the
	hypothesis and success criteria before the experiment runs.

	Hypothesis sharpened by the 1A direction-of-bias finding. GPT-4o-
	mini's completeness disagreements are 17/19 gold=2/pred=1 with zero
	up-mistakes across 26 items spanning two corpora — direction-aware noise,
	not balanced random labeling. The model is consistently applying some
	rule stricter than the rubric requires. The hypothesis under test: that
	stricter rule is "literal-string match required, paraphrase doesn't
	count," and the bias is fixable by recency-positioning the rubric's
	"paraphrase allowed" instruction adjacent to the commit-to-score
	decision instead of leaving it 500+ tokens upstream in the rubric body.

	The intervention is positional, not lexical. The current
	`CompletenessJudge` prompt (`agent_bench/evaluation/judges/completeness.py`)
	sends the rubric body, then the gold reference, then the system answer,
	then a one-line "Score this answer..." instruction immediately followed
	by the JSON schema clause. The rubric body's "paraphrase allowed" clause
	appears in the introductory paragraphs, hundreds of tokens before the
	score decision. The intervention adds one sentence between the system
	answer and the score instruction:

	> *"Note: a paraphrase that captures the same meaning as a gold-answer
	> point counts as covered. Score on content equivalence, not surface
	> form."*

	This is the recency-positioning hypothesis: the model loses the
	paraphrase conditioning across the rubric anchors and the reasoning
	step. Restating the instruction adjacent to the score decision tests
	whether the bias is positionally correctable.
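
A sketch of the ordering under test, assuming a simple string-joined prompt. `build_completeness_prompt`, the section labels, and `JSON_SCHEMA_CLAUSE` are placeholders rather than the contents of `completeness.py`; the score instruction is left elided as it is above. Only the clause text and its position are taken from this entry.

```python
PARAPHRASE_RECENCY_CLAUSE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)
SCORE_INSTRUCTION = "Score this answer..."            # elided, as in the text above
JSON_SCHEMA_CLAUSE = "<JSON schema clause goes here>"  # placeholder


def build_completeness_prompt(rubric: str, reference: str, answer: str) -> str:
    """Assemble the prompt so the clause sits right before the score decision."""
    parts = [
        rubric,
        f"Reference answer:\n{reference}",
        f"System answer:\n{answer}",
        PARAPHRASE_RECENCY_CLAUSE,   # the intervention: restated here, not upstream
        SCORE_INSTRUCTION,
        JSON_SCHEMA_CLAUSE,
    ]
    return "\n\n".join(parts)


prompt = build_completeness_prompt("RUBRIC", "REF", "ANSWER")
# Position is the thing being tested, so check order, not just presence.
assert (
    prompt.index("ANSWER")
    < prompt.index(PARAPHRASE_RECENCY_CLAUSE)
    < prompt.index(SCORE_INSTRUCTION)
)
```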

	Selected 5 disputed items (representative of the gold=2 / Haiku=2 /
	gpt=1 pattern across both corpora): `q006`, `q011`, `k8s_002`, `k8s_006`,
	`k8s_018`. All five are pure paraphrase-coverage cases (the system
	answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored
	1 with the extraction-vs-reasoning split documented in
	`measurements/2026-05-06-gpt4o-extraction-reasoning-split.md`).

	Pre-committed 3A success criteria.

	- Fixed (≥3/5 shift from 1 → 2): Recency-positioning is sufficient.
	Re-run GPT-4o-mini on the full 26-item set with the corrected
	prompt, recompute κ, update the writeup table. Story: "rubric-
	engineering matters more than judge model choice for ordinal scales —
	recency-positioning the paraphrase instruction recovered N% of
	disputed items." The completeness story becomes actionable, not
	diagnostic-only.
	- Partially fixed (1–2/5 shift): Inconclusive at N=5 (the
	binomial-significance line is ~3+). Re-run on the full 26-item set
	(~$0.20) to get a clean number; write up whatever the full-26 says.
	- Not fixed (0/5 shift): The instruction is being received and
	ignored — the model can't act on it under reasoning load. Escalate
	to 4A (GPT-4o full on the same 5 items) to verify the small-model-
	specific claim. Story: "repositioning the paraphrase instruction
	adjacent to the score decision did not shift any of 5 disputed items;
	GPT-4o handled the same prompts. The bias is small-model-specific,
	not prompt-fixable."

	The 3/5 threshold is the binomial-significance line at this N — random
	shifting under the null produces 0 or 1 changes most of the time. Pre-
	committing avoids the "2 shifted, that's kind of a fix" negotiation.
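
How strong the 3/5 line is depends on the per-item flip rate assumed under the null, which this entry does not state. The sketch below uses 0.2 purely as an assumption to show the shape of the calculation; a different null rate moves the numbers substantially.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(k_min: int, n: int, p: float) -> float:
    """Probability of k_min or more successes in n independent trials."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))


NULL_FLIP_RATE = 0.2  # ASSUMPTION: chance a disputed item flips 1 -> 2 by noise alone

p_three_or_more = binom_tail(3, 5, NULL_FLIP_RATE)
p_zero_or_one = 1 - binom_tail(2, 5, NULL_FLIP_RATE)
print(round(p_three_or_more, 4))  # 0.0579
print(round(p_zero_or_one, 4))    # 0.7373
```

Under that assumed rate, three or more shifts out of five happen by chance about 6% of the time, and zero or one shifts about 74% of the time. At an assumed rate of 0.5 the same tail is 0.5, so the threshold is only meaningful if noise-driven flips are rare.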

	On the 1A relevance finding — confirmed. Both judges essentially
	correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ
	degeneracy is structural under 29/30 prevalence at class-2; AC1 +
	raw agreement is the right reporting. No further investigation on
	relevance. Writeup paragraph is one short sentence: prevalence-induced
	degeneracy → AC1 is load-bearing.

	## Plan 3A — outcome on the 5-item probe + full-26 re-run (v1.1.1)

	Date: 2026-05-06. Status: complete; the v1.1.1 prompt is now
	permanent in `agent_bench/evaluation/judges/completeness.py`.

	3A 5-item probe: 3/5 disputed items shifted 1 → 2 (q006, q011,
	k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. This meets the
	pre-committed threshold (≥3/5 → "fixed"), so the protocol triggered the
	full-26 re-run on gpt-4o-mini only (Haiku held as control to make the
	v1.1 → v1.1.1 delta cleanly attributable to the intervention's effect on
	the affected judge).

	Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):

	| Run                  | n  | raw   | κ      | AC1    |
	|----------------------|----|-------|--------|--------|
	| v1.1 gpt-4o-mini     | 26 | 26.9% | +0.020 | +0.006 |
	| v1.1.1 gpt-4o-mini   | 28 | 42.9% | +0.000 | +0.232 |
	| v1.1 Haiku (control) | 26 | 84.6% | +0.416 | +0.792 |

	Per-item delta (v1.1 → v1.1.1): 7 items shifted up (all 1 → 2),
	0 shifted down, 19 unchanged. Of the 7 up-shifts: 6 are correct (gold=2
	items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016,
	k8s_017, q006), 1 is a regression (k8s_025: gold=1, was correctly pred=1
	in v1.1, now over-credited at pred=2). Net per-item correctness delta:
	+5 items.

	Cohen's κ is misleading on this comparison. v1.1.1 raw agreement
	rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from
	0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0 — slightly
	lower than v1.1's 0.020. The mechanism is prevalence-rebalancing in
	the marginals: gpt-4o-mini's pred distribution shifted from `{0:2, 1:19,
	2:5}` (concentrated at 1) to `{0:4, 1:12, 2:12}` (more balanced, closer
	to gold's `{1:5, 2:23}` over n=28). Cohen's κ = `(P_o - P_e)/(1 - P_e)`;
	when marginals become more diverse, P_e (chance agreement) rises in
	lockstep with P_o (observed agreement), and κ deflates. AC1 uses
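	prevalence-robust chance correction (`P_e = (1/(q-1)) Σ_k π_k(1-π_k)`,
	where q is the number of classes and π_k is the mean of the two raters'
	marginal proportions for class k) and reads the actual signal.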
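
Both statistics can be recomputed from the marginals quoted above, because each chance term depends only on the marginal distributions. This is a sketch, not the code in `report.py`: the function name is illustrative, and the count of 12 agreeing items is derived from 42.9% of n=28 rather than read from the sidecar.

```python
def agreement_stats(gold_counts: dict[int, int], pred_counts: dict[int, int], n_agree: int):
    """Return (Cohen's κ, Gwet's AC1) from marginal counts and the agreement count."""
    n = sum(gold_counts.values())
    if n != sum(pred_counts.values()):
        raise ValueError("gold and pred must cover the same number of items")
    classes = sorted(set(gold_counts) | set(pred_counts))
    q = len(classes)
    p_o = n_agree / n

    gold = {k: gold_counts.get(k, 0) / n for k in classes}
    pred = {k: pred_counts.get(k, 0) / n for k in classes}

    # Cohen: chance agreement is the product of the two marginals per class.
    pe_kappa = sum(gold[k] * pred[k] for k in classes)
    # Gwet: chance agreement uses the averaged marginal pi_k per class.
    pi = {k: (gold[k] + pred[k]) / 2 for k in classes}
    pe_ac1 = sum(pi[k] * (1 - pi[k]) for k in classes) / (q - 1)

    kappa = (p_o - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


gold_v111 = {1: 5, 2: 23}
pred_v111 = {0: 4, 1: 12, 2: 12}
kappa, ac1 = agreement_stats(gold_v111, pred_v111, n_agree=12)
print(kappa, ac1)  # κ is approximately 0, AC1 approximately 0.232
```

For these marginals Cohen's chance term works out to 12/28, the same as the observed agreement, so the κ numerator is zero by construction. The AC1 chance term is about 0.256, which leaves room for the improvement to register.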

	This is the same trap that motivated AC1 over κ on the relevance and
	groundedness rows of the original κ table, surfacing here at a
	different distribution boundary. The κ table footer already explains
	why per-dimension metric selection matters; v1.1.1's outcome
	demonstrates the trap induced by the intervention itself.

	Effect on the jury aggregate. With κ-derived weights and gpt-4o-
	mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury
	verdict on completeness is now mathematically equivalent to Haiku-alone
	on every item (gpt's contribution is multiplied by zero). Jury κ stays
	at 0.416, identical to v1.1's corrected aggregate. The intervention's
	per-member improvement is invisible at the jury level under this
	weighting scheme.

	Methodological consequence — v1.2 fix-list addition. The v1.2 fix-
	list now expands by one item:

	5. Prevalence-robust weights for prevalence-skewed dimensions.
	v1.1's `_compute_kappa_weights` uses Cohen's κ for every dimension,
	which has a self-defeating property on prevalence-skewed gold:
	improving a member can lower its weight even as it gets more
	accurate.

	Mechanism. Cohen's κ = `(P_o - P_e) / (1 - P_e)`, where
	`P_e = Σ_k P(gold=k) × P(pred=k)` is the chance-agreement term
	computed from the marginal distributions. P_e is not invariant to
	the predictor's marginal distribution — when a member's predictions
	become more diverse (less concentrated at one class), P_e rises
	as the marginals approach gold's marginals. Concretely: when an
	intervention moves a member's pred distribution from concentrated-
	at-one-class toward gold's distribution, P_o and P_e rise together
	in lockstep. The numerator `P_o - P_e` stays small, and κ deflates
	even as raw accuracy improves. This is the same prevalence-induced
	degeneracy that motivated AC1 over κ on relevance/groundedness rows
	in the κ table — it surfaces in jury weighting at any
	distribution-shifting intervention's boundary.

	Empirically observed in v1.1.1. The recency-positioning
	intervention shifted gpt-4o-mini completeness pred dist from
	`{0:2, 1:19, 2:5}` to `{0:4, 1:12, 2:12}`, closer to gold's
	`{1:5, 2:23}` over n=28. Per-cell raw agreement 26.9% → 42.9%.
	AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×).
	Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o
	≈ 0.43. v1.1's `_compute_kappa_weights` clips the new κ at zero,
	producing weight = 0 — and the jury aggregate loses access to a
	member that was empirically improved. The intervention's per-
	member improvement is invisible at the jury level under κ-weighting.

	Architectural decomposition for v1.2. The right separation:
	- Per-dimension metric for κ table reporting (already in v1.1
	via `agent_bench/evaluation/calibration/report.py::_DIM_METRIC`).
	- Per-dimension weight metric for jury aggregation (new in
	v1.2, reuses `_DIM_METRIC`). Use κ where the gold's prevalence
	supports it, AC1 where κ degenerates. Same lookup, same per-
	dimension policy at both reporting and weighting layers.
	- Per-dimension membership as explicit configuration override
	for members that are structurally inappropriate (v1.2 fix #3,
	unchanged) — distinct from "low score on the chosen metric,"
	which is handled by the weight floor.

	Why this is non-obvious. A reader's first instinct is that
	"weight by κ" is a sensible default — κ is the standard inter-
	rater statistic. The self-defeating property is invisible until
	you observe a real intervention that shifts marginals; in static
	conditions (no intervention, fixed prompts), the κ-weight choice
	is benign. The v1.1.1 outcome is the first time the agent-bench
	calibration set has produced an intervention-induced marginal
	shift on the same gold; the failure mode wouldn't have been
	visible in v1.0's static calibration sweep.
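
The effect of the weight-metric choice can be sketched with the per-member numbers already reported in this entry. The dictionary layout and function names are hypothetical, and the sketch does not assume which metric the real `_DIM_METRIC` lookup assigns to completeness; it just compares the two policies side by side.

```python
# Per-member agreement on completeness, copied from the numbers reported above
# (Haiku from the v1.1 control row, gpt-4o-mini from the v1.1.1 re-run).
MEMBER_STATS = {
    "haiku": {"kappa": 0.416, "ac1": 0.792},
    "gpt-4o-mini": {"kappa": 0.000, "ac1": 0.232},
}


def jury_weights(stats: dict[str, dict[str, float]], metric: str) -> dict[str, float]:
    """Derive one weight per member from the chosen metric, clipped at zero."""
    return {judge: max(values[metric], 0.0) for judge, values in stats.items()}


def weighted_mean(scores: dict[str, int], weights: dict[str, float]) -> float:
    total = sum(weights[judge] for judge in scores)
    return sum(scores[judge] * weights[judge] for judge in scores) / total


disputed = {"haiku": 2, "gpt-4o-mini": 1}
for metric in ("kappa", "ac1"):
    w = jury_weights(MEMBER_STATS, metric)
    print(metric, w["gpt-4o-mini"], round(weighted_mean(disputed, w), 3))
```

On this particular cell the rounded verdict is 2 under either policy. The difference is that under κ the improved member contributes nothing, while under AC1 it keeps a nonzero vote that can matter on other cells.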

	v1.1.1 code changes (this commit):
	- `agent_bench/evaluation/judges/completeness.py` — adds
	`PARAPHRASE_RECENCY_CLAUSE` constant, inserted between the system
	answer and the score instruction. Comment cites the 3A probe.
	- `tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt`
	— extends to assert the recency clause appears AND is positioned
	between the answer and the score instruction (position is load-
	bearing, not just lexical inclusion).
	- `results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl`
	— merged sidecar: v1.1 groundedness/relevance rows (unchanged
	judges) + fresh v1.1.1 gpt-4o-mini completeness rows + v1.1 Haiku
	completeness rows.
	- `measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl` — the
	5-item probe artifact with reasoning + evidence_quotes for each.
	- `scripts/_dev/probe_3a_paraphrase_recency.py`,
	`scripts/_dev/rerun_completeness_v1_1_1.py` — reproducers; not
	part of the production calibration runner.

	No changes to the κ table. The jury aggregate κ on completeness is
	unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy
	described above; adding a `jury_kappa_weighted_v1_1_1` row with
	identical numbers would be visual noise. The v1.1.1 finding lives in
	the writeup body, not the table — the per-member AC1 improvement
	(0.006 → 0.232) is the headline number, surfaced as a separate
	paragraph next to the κ table rather than inside it.

	Total spend through Plan 3A: $0.0013 (3A probe) + $0.0075 (full-26
	re-run) = $0.0088.

	## Plan 4A — GPT-4o (full) on the v1.1.1 residual

	Date: 2026-05-06. Status: complete. Run after the writeup-
	framing review surfaced that v1.1.1's "fixed" verdict was overclaim-
	prone — 5/19 items were recovered, 14 remained unchanged and
	uncharacterized. 4A was originally scoped as conditional on 3A not
	being fixed (per the predefined sequencing rule), but became valuable
	as a post-3A diagnostic to characterize the residual: is it small-
	model-specific or rubric-under-specified?

	Scope. GPT-4o (`gpt-4o-2024-08-06`) on 5 of the 14 v1.1.1-unchanged
	items: `k8s_006`, `k8s_018`, `q011`, `q012`, `k8s_001`. Same v1.1.1
	production prompt (paraphrase recency clause active). The first two
	(k8s_006, k8s_018) are the items that didn't shift in the original 3A
	5-item probe — we have gpt-4o-mini's reasoning on those items with
	the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt
	varying only the model. q011, q012, k8s_001 cover the broader
	fastapi/k8s residual surface (k8s_001 also a Haiku miscall — 4A
	checks whether GPT-4o agrees with gold or with Haiku).

	Result: 5/5 correct. All 5 items scored 2 by GPT-4o, matching gold
	exactly. Cost: $0.0011 reported (caveat: pricing config falls back to
	gpt-4o-mini rates for unlisted models, so actual cost is closer to
	$0.005–0.01 — the reported number under-reports by ~5–10×).

	Sharpened mechanism — criteria-invention, not just literal-match.
	The original 3-example artifact (q006, k8s_002, k8s_018) was framed
	as gpt-4o-mini "applying a literal-string-match standard" while
	correctly extracting paraphrased coverage into evidence_quotes. 4A's
	side-by-side reasoning on `k8s_018` shows a distinct second mechanism:

	- gpt-4o-mini (v1.1.1, score 1): "It mentions some key points
	from the reference... but does not explicitly state that the new
	fields in `autoscaling/v2` are preserved as annotations when using
	`autoscaling/v1`, nor does it mention the need to use
	`autoscaling/v2` directly for memory or custom metric scaling for
	a Deployment or StatefulSet."
	- gpt-4o (4A, score 2): "The answer covers all the key points
	from the reference. It mentions that the current stable version is
	autoscaling/v2, which supports scaling on memory and custom
	metrics, similar to the reference. It also notes that
	autoscaling/v1 only supports CPU-based scaling, aligning with the
	reference's points."

	The reference for k8s_018 specifies three points: (1) autoscaling/v2
	is the current stable API, (2) it adds memory metrics support beyond
	v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's
	reasoning step invents additional criteria the reference does not
	require ("preserved as annotations when using autoscaling/v1," "use
	autoscaling/v2 directly for ... a Deployment or StatefulSet") and then
	deducts against them, scoring 1. GPT-4o reads the reference's three
	points and scores against exactly those, scoring 2.

	This is a capacity finding distinct from the paraphrase-recency
	finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing
	it toward paraphrase semantics, manufactures additional gold criteria
	during scoring that aren't in the reference. Recency-positioning the
	"paraphrase allowed" clause doesn't address this — the bias isn't
	"missed paraphrase," it's "invented extra requirements." Two failure
	modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces.

	Implication for v1.2. With 5/5 confirmed, v1.2 fix #3 (per-
	dimension membership) gets clean empirical support: gpt-4o-mini is
	the wrong tool for 3-point completeness with paraphrase semantics, and
	no amount of prompt engineering on this rubric is going to bridge the
	capacity gap. The right v1.2 path is one of:

	- Exclude gpt-4o-mini from completeness scoring (per-dim
	membership; jury reduces to single-judge Haiku on completeness;
	explicit and visible in config).
	- Replace gpt-4o-mini with GPT-4o on completeness (per-dim
	judge selection; jury keeps two members but the second is a
	frontier-class model on the dimension that needs it).

	Both are defensible v1.2 designs. The choice depends on cost
	budget — gpt-4o is ~10× the per-call cost of gpt-4o-mini. For
	agent-bench's calibration set scale (~30 items per row), even gpt-4o
	is trivially cheap; for production deployment evaluating thousands
	of agent outputs, the cost trade-off matters more.

	4A artifact: `measurements/2026-05-06-4a-gpt4o-full-probe.jsonl`
	(per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs
	with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for
	the side-by-side analysis above).

	Updated honest framing for the writeup. "v1.1.1 addressed one
	identified failure mode (paraphrase-instruction-loss across reasoning,
	recovered 5/19 disputed items via positional change). 4A confirmed the
	residual 14 are a distinct failure mode (capacity-limited criteria
	invention during the reasoning step) — GPT-4o handles all 5 sampled
	residuals at the same v1.1.1 prompt, so the failure is small-model-
	specific rather than rubric-limited. v1.2 fix #3 (per-dimension judge
	membership / model selection) is the right escalation; the rubric
	itself doesn't need changes."

	Total session spend: $0.0099 reported (~$0.013–0.018 actual after
	gpt-4o pricing correction).