agentbench / DECISIONS.md

Design Decisions

Why build from primitives, not LangChain?

I wanted to demonstrate I understand tool dispatch, memory management, and retrieval orchestration at the implementation level. My provider abstraction is ~150 lines. When reranking or a second provider is needed, I know exactly where it plugs in — because I built every layer.

Why one provider in V1?

The interface supports multiple providers. V1 shipped OpenAI + Mock to prove the abstraction. V2 added Anthropic (claude-haiku-4-5), confirming that switching providers is a one-line config change. The orchestrator and tools are completely unchanged between providers.

Why one domain (technical docs)?

Clean text produces clean evaluation. Research papers introduce PDF parsing noise (tables, figures, formulas) that degrades eval quality without adding signal. The framework handles any text corpus — the domain is a config swap.

Why Reciprocal Rank Fusion, not score normalization?

BM25 scores and cosine similarities live on different scales. Normalizing across scales is brittle: min-max normalization is sensitive to outliers, z-score requires distribution assumptions. RRF fuses by rank position 1/(k + rank), which is robust, parameter-light (only k=60), and well-studied. Trade-off: loses magnitude information, but for top-5 retrieval this rarely matters.
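
The fusion rule can be sketched in a few lines. This is a minimal illustration, assuming each retriever returns a ranked list of document IDs and that ranks are 1-based; the function name `reciprocal_rank_fusion` and the sample file names are hypothetical, not taken from the codebase.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs by summing 1 / (k + rank).

    `rankings` is a list of lists, best match first. Ranks are 1-based
    (an assumption of this sketch).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from the lexical and dense retrievers.
bm25_ranking = ["a.md", "b.md", "c.md"]
dense_ranking = ["b.md", "c.md", "a.md"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Only rank positions enter the sum, so the raw BM25 and cosine magnitudes never need to share a scale.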

Why ~16 curated docs, not a large corpus?

Small corpus produces predictable retrieval, reproducible benchmarks, and easy debugging. Golden dataset questions map to specific source files. A reviewer can verify any result by reading the source. The framework scales to larger corpora — the choice is about evaluation quality, not capability.

Why no reranker in V1?

Feature-flagged in config (rag.reranker.enabled: false). V1 benchmarks without reranking establish an honest baseline. V2 adds cross-encoder reranking and shows the delta.

Why no delete in the vector store?

FAISS flat index doesn't support efficient deletion. For a small corpus that changes rarely, rebuild-on-ingest is simpler and eliminates consistency bugs.

Why async internals, sync user behavior?

FastAPI and the OpenAI SDK are async-native. Using async for I/O avoids blocking the event loop. V2 added SSE streaming (/ask/stream) for the final synthesis step — tool calls remain non-streamed since they complete in ~100ms.

Why SQLite-backed conversation sessions

V1 was stateless by design: no conversation_id and no cross-request memory. V2 adds optional SQLite-backed sessions: pass session_id on /ask to persist and load conversation history. When it is omitted, behavior is identical to V1 (stateless). See "Why SQLite for conversation persistence" later in this file for the full rationale.

Why negative evaluation cases?

A system that always answers sounds confident but may hallucinate. 5 out-of-scope questions test whether the system refuses gracefully when the corpus doesn't contain the answer. Grounded refusal requires both refusal language AND zero cited sources — an answer that says "not found" but still cites docs is not a valid refusal.
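
The two-part condition can be expressed as a small predicate. This is a sketch only: the refusal phrases in `REFUSAL_PATTERN` and the function name `is_grounded_refusal` are assumptions for illustration, since the harness's actual phrase list is not shown here.

```python
import re

# Hypothetical refusal phrases; the real list used by the harness may differ.
REFUSAL_PATTERN = re.compile(
    r"not found|no relevant|cannot answer|does not contain", re.IGNORECASE
)


def is_grounded_refusal(answer: str, sources: list[str]) -> bool:
    """A refusal counts only if the answer refuses AND cites nothing."""
    has_refusal_language = bool(REFUSAL_PATTERN.search(answer))
    return has_refusal_language and len(sources) == 0
```

An answer that says "not found" while still listing sources fails the check, which is the case the second condition exists to catch.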

Why deterministic eval + optional LLM judge?

CI needs free, deterministic tests. Retrieval P@5, keyword hit rate, citation accuracy, and grounded refusal rate run without API keys. LLM-judged faithfulness and correctness are manual enrichment steps, run locally, with results committed to the benchmark report.

Why structlog, not LangSmith/Langfuse?

Third-party observability contradicts the "built from primitives" narrative. structlog provides JSON-structured logs, /metrics exposes latency + cost. If a team uses LangSmith, adding it is a one-day task.

Why explicit citation format [source: filename.md]?

The system prompt mandates inline citations so the evaluation harness can parse them with \[source:\s*(.+?)\] and check against the structured sources list. This catches hallucinated citations and measures citation accuracy as a metric.
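
A sketch of how the pattern above might be applied. The regex is the one quoted in this entry; the scoring function `citation_accuracy` and its "fraction of cited files present in the sources list" definition are assumptions for illustration.

```python
import re

# Pattern quoted in this entry: captures the filename inside [source: ...].
CITATION_RE = re.compile(r"\[source:\s*(.+?)\]")


def extract_citations(answer):
    """Return the cited filenames in order of appearance."""
    return [name.strip() for name in CITATION_RE.findall(answer)]


def citation_accuracy(answer, sources):
    """Fraction of inline citations that appear in the structured sources.

    Returns None when the answer contains no citations (assumed convention).
    """
    cited = extract_citations(answer)
    if not cited:
        return None
    known = set(sources)
    return sum(1 for name in cited if name in known) / len(cited)


answer = "Use Depends [source: dependencies.md] or [source: made_up.md]."
score = citation_accuracy(answer, ["dependencies.md"])
```

A citation to a file that is absent from the structured sources list lowers the score, which is how a hallucinated citation shows up in the metric.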

Why per-request retrieval settings via kwargs, not singleton mutation?

The orchestrator is a singleton shared across concurrent requests. Storing top_k / strategy on self causes cross-request state bleed. Instead, these are passed as local variables through the tool execution kwargs — no shared state is mutated.

Why a relevance threshold for grounded refusal

V1 never refuses — it always retrieves tangentially related content and synthesizes an answer. This is a trust failure: users cannot distinguish "the system found relevant information" from "the system fabricated from vaguely related chunks." Grounded refusal rate was 0/5.

We add a refusal gate in SearchTool.execute() based on the maximum RRF score across retrieved chunks. If no chunk scores above the threshold, the tool returns "No relevant documents found" — the LLM then refuses via the system prompt rather than fabricating from irrelevant content.

Gate location: The gate fires in SearchTool.execute(), not the orchestrator. SearchTool is where retrieval scores are still available — they are dropped before results reach the orchestrator. This also keeps the orchestrator unchanged.

Threshold value: rag.refusal_threshold: 0.02 is a provisional default pending an empirical sweep across the evaluation set. The sweep will test values 0.01–0.03 and select the value that maximizes refusal on out-of-scope queries without degrading in-scope P@5 and R@5. The actual RRF score distribution will be documented here after tuning.

Interaction with reranking: The refusal gate fires on RRF scores BEFORE reranking. It is a go/no-go decision, not a per-chunk filter. If the gate passes, the full candidate set proceeds to the reranker. This keeps the threshold calibration independent of whether reranking is enabled.

Default disabled: refusal_threshold: 0.0 preserves V1 behavior exactly. The feature is opt-in until the threshold is tuned.

Alternative considered: LLM-based relevance judgment ("is this content relevant to the query?"). Rejected because it adds latency, cost, and a second point of failure. The score-based approach is deterministic, fast, and debuggable.
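
The go/no-go behavior described above can be sketched as follows. This is an illustration under stated assumptions: chunks are modeled as `(text, rrf_score)` tuples, and the helper names `refusal_gate` and `search_tool_output` are hypothetical rather than the actual SearchTool.execute() internals.

```python
NO_RESULTS_MESSAGE = "No relevant documents found"


def refusal_gate(scored_chunks, refusal_threshold=0.0):
    """Return the full candidate set if the gate passes, else None.

    The gate is a single decision on the best RRF score, not a per-chunk
    filter. A threshold of 0.0 disables it (V1 behavior).
    """
    if refusal_threshold <= 0.0:
        return scored_chunks
    if any(score > refusal_threshold for _, score in scored_chunks):
        return scored_chunks
    return None


def search_tool_output(scored_chunks, refusal_threshold=0.0):
    """Hypothetical tool wrapper: refuse, or pass every candidate onward."""
    passed = refusal_gate(scored_chunks, refusal_threshold)
    if passed is None:
        return NO_RESULTS_MESSAGE
    # In the real pipeline the passing set would go on to the reranker.
    return "\n\n".join(text for text, _ in passed)


weak = [("chunk a", 0.0164), ("chunk b", 0.0159)]
strong = [("chunk a", 0.0325), ("chunk b", 0.0159)]
```

Note that when the gate passes, the low-scoring chunk in `strong` is still forwarded; truncation is left to later stages.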

Why cross-encoder reranking improves precision

BM25 retrieves lexically similar but semantically irrelevant chunks. RRF fusion mitigates this partially, but noisy BM25 results still dilute the top-5 set. P@5 was 0.70 in V1.

A cross-encoder (ms-marco-MiniLM-L-6-v2, ~80MB) scores each (query, chunk) pair jointly, capturing semantic relevance that bi-encoder similarity misses. The tradeoff is ~100–200ms extra latency per query — acceptable given our 4.7s baseline is dominated by LLM generation, not retrieval.

The reranker is enabled by default. Setting rag.reranker.enabled: false restores V1 behavior exactly. reranker.top_k is independent of retrieval.top_k, so the reranker's output count can be tuned without affecting the RRF candidate pool.

The retriever passes all RRF-fused candidates to the reranker rather than a computed subset. The reranker's top_k handles truncation. This is simpler and more robust than computing an input size from per-system candidate counts.

Why provider retry with exponential backoff

OpenAI returns 429 (rate limit) errors under load. Without retry logic, a single 429 causes a user-visible failure. We add exponential backoff: retries fire after 1s, 2s, and 4s. If the third retry also fails, the provider raises ProviderRateLimitError so the middleware returns a clear 503.

The retry wraps the raw openai.RateLimitError — it must fire BEFORE the error gets translated to ProviderRateLimitError, otherwise retry logic is dead code. Other errors (400, 401, timeout) fail immediately.
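
A minimal sketch of that ordering, assuming an async provider call. `RawRateLimitError` is a stand-in for openai.RateLimitError so the example runs without the SDK, and the helper name `call_with_backoff` and its `sleep` parameter are hypothetical.

```python
import asyncio


class RawRateLimitError(Exception):
    """Stand-in for openai.RateLimitError (assumption for this sketch)."""


class ProviderRateLimitError(Exception):
    """Translated error raised only after retries are exhausted."""


async def call_with_backoff(call, max_retries=3, base_delay=1.0,
                            sleep=asyncio.sleep):
    """Retry `call` on the raw rate-limit error with 1s, 2s, 4s delays.

    Any other exception propagates immediately. Translation to
    ProviderRateLimitError happens after the last retry, so the retry
    loop always sees the raw error.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RawRateLimitError as exc:
            if attempt == max_retries:
                raise ProviderRateLimitError("rate limited after retries") from exc
            await sleep(base_delay * 2 ** attempt)
```

Catching the raw error inside the loop and translating it only at the end is what keeps the retry path reachable.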

Why in-memory API rate limiting

A public-facing API needs abuse protection. We use a simple in-memory sliding window limiter: 10 requests/minute per IP. Sufficient for a demo deployment; a production system would use Redis.

Known limitation: the per-IP dict grows without bound across distinct IPs. Acceptable for Fly.io with auto-stop (memory resets). If running continuously under bot traffic, add a periodic sweep or switch to a TTL-based structure.

Design choices:

  • /health and /metrics exempt: monitoring should never be rate-limited.
  • Retry-After header: follows HTTP 429 spec, lets clients back off.
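
The limiter described in this entry might look roughly like the sketch below. The class name `SlidingWindowLimiter`, the injectable `clock`, and the tuple return value are assumptions for illustration; the exempt-path check and middleware wiring are omitted.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory sliding window: at most `limit` hits per window per IP."""

    def __init__(self, limit=10, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # One deque of timestamps per IP. This dict is the structure that
        # grows without bound across distinct IPs (the known limitation).
        self._hits = defaultdict(deque)

    def check(self, ip):
        """Return (allowed, retry_after_seconds) for one request."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # Time until the oldest hit expires; suitable for Retry-After.
            return False, self.window - (now - hits[0])
        hits.append(now)
        return True, 0.0
```

The second element of the return value is what a middleware would round up and send in the Retry-After header on a 429.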

Why SQLite for conversation persistence

Three options considered:

  1. In-memory dict: Lost on restart.
  2. SQLite: Zero-dependency, file-based, survives restarts.
  3. Redis/PostgreSQL: Adds infrastructure complexity.

SQLite is right for this scale. session_id is optional — when omitted, the system behaves identically to V1 (stateless). This preserves backward compatibility and keeps benchmark evaluation deterministic.

The route handler manages session state (load history, store Q+A), not the orchestrator. The orchestrator accepts an optional history parameter but has no knowledge of persistence. This keeps the agent loop testable without a database.

Note: On HF Spaces, SQLite is ephemeral (no persistent storage on free tier). For the demo this is acceptable — sessions last until the container sleeps. Production would use a volume or managed database.

Why a second provider (Anthropic)

The provider abstraction existed since V1 but only had OpenAI + Mock. Adding Anthropic proves the abstraction works across fundamentally different APIs:

  • System message: system= parameter, not in the messages list
  • Tool definitions: input_schema instead of parameters
  • Tool results: tool_result content blocks in user messages
  • Tool calls: tool_use content blocks, not a separate field
  • Stop reason: stop_reason is tool_use or end_turn, where OpenAI's finish_reason is tool_calls or stop

The implementation is a config swap — provider.default: anthropic in YAML switches the entire system to Claude. The orchestrator, tools, evaluation harness, and serving layer are completely unchanged.

Same retry/timeout handling as OpenAI. Both providers are tested with mocked HTTP responses — no API keys needed in CI.
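
Two of the differences listed above (tool definitions and the system message) can be sketched as pure translation functions. The function names are hypothetical, and the sketch assumes OpenAI-style chat messages and function-tool definitions as input; it is not the provider's actual code.

```python
def openai_tool_to_anthropic(tool):
    """Map an OpenAI function-tool definition to the Anthropic shape.

    OpenAI nests the JSON Schema under function.parameters; Anthropic
    expects it at the top level under input_schema.
    """
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }


def split_system_message(messages):
    """Pull system messages out of the list for Anthropic's system= parameter."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    remaining = [m for m in messages if m["role"] != "system"]
    return system, remaining


openai_tool = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search the docs",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
anthropic_tool = openai_tool_to_anthropic(openai_tool)
```

Keeping these translations inside the provider is what lets the orchestrator keep a single internal message and tool format.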

Why ranked_sources separate from deduplicated sources?

The deduplicated sources list in AgentResponse is for the API response. The ranked_sources list preserves rank order with duplicates for evaluation metrics. P@5 and R@5 need the raw retrieval ranking, not the post-processed answer metadata.
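
A small example of why the two lists diverge. It assumes P@5 is computed over source filenames of the ranked chunks and divides by k; the function names and sample data are hypothetical.

```python
def precision_at_k(ranked_sources, relevant, k=5):
    """Share of the top-k retrieved items whose source is relevant."""
    top = ranked_sources[:k]
    return sum(1 for source in top if source in relevant) / k


def deduplicate(ranked_sources):
    """Order-preserving dedup, as used for the API-facing sources list."""
    return list(dict.fromkeys(ranked_sources))


# Three of the top five chunks come from the same relevant file.
ranked = ["a.md", "a.md", "b.md", "c.md", "a.md"]
relevant = {"a.md"}

p_raw = precision_at_k(ranked, relevant)
p_dedup = precision_at_k(deduplicate(ranked), relevant)
```

On this sample the raw ranking and the deduplicated list give different precision values, so feeding the deduplicated list to the metric would misreport retrieval quality.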

Why vLLM over TGI / llama.cpp

vLLM has the widest model support, best throughput via PagedAttention, and a native OpenAI-compatible server (/v1/chat/completions). TGI is a valid alternative; llama.cpp targets different use cases (edge/CPU inference). This is a deliberate choice, not ignorance of alternatives.

Why Modal for GPU inference

Serverless GPU eliminates idle cost and GPU node management. A10G at ~$1.30/hr costs ~$0.50 per full 27-question benchmark run. The Docker Compose path (docker-compose.vllm.yml) is retained for users who have local GPUs or prefer persistent serving.

Why split topology (K8s API + Modal GPU)

The API layer (retrieval, orchestration, tool routing) is CPU-bound and benefits from horizontal scaling via K8s HPA. The LLM inference layer is GPU-bound and benefits from serverless elasticity — Modal scales to zero when idle, scales up on demand with no node provisioning. Co-locating both in K8s would require GPU node pools with idle cost, node autoscaler latency, and NVIDIA device plugin management. This mirrors a common production pattern.

Why Helm only, not Kustomize + Helm

Showing two K8s deployment methods for the same app adds complexity without demonstrating distinct skills. Helm with values-dev.yaml / values-prod.yaml covers environment-specific configuration cleanly.

Why CPU-based HPA, not custom metrics

CPU utilization works without a Prometheus adapter or custom metrics server. A production improvement would use the Prometheus adapter to scale on p95 latency from the /metrics endpoint — this requires bridging the JSON metrics to Prometheus exposition format. Documented as a follow-up.

Why env var fallback in SelfHostedProvider

Follows the same pattern as OpenAIProvider reading OPENAI_API_KEY. The YAML config provides defaults; env vars override at runtime. No config loader changes needed.

Why lazy tool-call detection, not metadata check

Checking /v1/models metadata for tool-calling support is unreliable — model metadata doesn't consistently report this capability. Instead, the provider sends one tool-calling request on first complete() call with tools and checks if the response contains tool_calls. The result is cached as self._supports_tool_calling. Transient failures (timeout, 5xx) return None and retry on the next call rather than permanently downgrading to prompt-based fallback.
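
The tri-state caching can be sketched as below. `TransientProbeError` stands in for a timeout or 5xx, `send_probe` is a hypothetical callable that performs the one tool-calling request, and the response is modeled as a plain dict; only the attribute name `_supports_tool_calling` comes from this entry.

```python
from typing import Callable, Optional


class TransientProbeError(Exception):
    """Stand-in for a timeout or 5xx from the inference server."""


class ToolCallingProbe:
    def __init__(self, send_probe: Callable[[], dict]):
        self._send_probe = send_probe
        # None = unknown, True/False = cached answer.
        self._supports_tool_calling: Optional[bool] = None

    def supports_tool_calling(self) -> Optional[bool]:
        if self._supports_tool_calling is not None:
            return self._supports_tool_calling
        try:
            response = self._send_probe()
        except TransientProbeError:
            # Do not cache: a transient failure says nothing about capability.
            return None
        self._supports_tool_calling = bool(response.get("tool_calls"))
        return self._supports_tool_calling
```

Returning None without caching is the part that prevents a single timeout from pinning the provider to the prompt-based fallback.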

Why two-tier injection detection, not three

The original design included a middle tier (embedding similarity against known injection examples). Dropped because the existing embedding model (all-MiniLM-L6-v2) is a general-purpose sentence encoder, not specialized for adversarial detection. Cosine similarity can't distinguish semantic similarity from intent similarity — "how do I ignore a field in Pydantic?" clusters near "ignore previous instructions" in that embedding space. The threshold between "ambiguous" and "suspicious" is an untunable hyperparameter with no ground truth.

Two tiers are cleaner: heuristic regex is deterministic (matches or doesn't), DeBERTa classifier is probabilistic (confidence score). No ambiguous handoff between two probabilistic layers. Deployments without GPU get heuristic-only — documented, not hidden.

Why regex + optional spaCy for PII, not a cloud API

Three reasons: cost (cloud PII APIs charge per call), latency (adds network round-trip to every retrieved chunk), and data residency (PII leaves the system boundary). Regex covers the PII types with actual legal/compliance risk: SSNs, credit cards, emails, phone numbers, IP addresses.

spaCy NER (PERSON, ORG) is optional because false-positive rates on technical text are unacceptable without domain tuning. "FastAPI" triggers ORG, "Jordan" triggers PERSON. The optional import pattern (try: import spacy) degrades gracefully with a logged warning — no crash if someone sets use_ner: true without installing spaCy.
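
A reduced sketch of the regex tier and the optional-import pattern. The three patterns are simplified illustrations (for example, the IPv4 pattern does not validate octet ranges), the function name `redact_pii` is hypothetical, and the NER pass itself is left out.

```python
import logging
import re

try:
    import spacy  # optional dependency
except ImportError:
    spacy = None

logger = logging.getLogger(__name__)

# Simplified illustrative patterns, not the project's actual regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact_pii(text, use_ner=False):
    """Replace regex-detected PII with a [LABEL] placeholder."""
    if use_ner and spacy is None:
        # Degrade gracefully instead of crashing.
        logger.warning("use_ner is set but spaCy is not installed; regex only")
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # An NER pass for PERSON/ORG would run here when spaCy is available.
    return text


sample = "Mail bob@example.com, SSN 123-45-6789, host 10.0.0.1"
redacted = redact_pii(sample)
```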

Why append-only JSONL for audit, not SQLite

One codepath, one format, no config branching. JSONL is append-only by nature — no schema migrations, no transactions, no connection pooling. Log rotation handles size. jq provides immediate queryability without building a custom API.

The original design included an optional SQLite backend and a query endpoint (GET /admin/audit). Both were dropped: SQLite adds a second storage codepath with no consumer, and the query endpoint would require API key authentication — an inconsistency when /ask itself has no auth.

JSONL imports trivially into SQLite/DuckDB if structured queries are needed later. No bridges burned.

Why HMAC-SHA256 IP hashing in audit logs

HMAC-SHA256 with a server secret hashes client IPs before logging. Plain SHA-256 was considered but rejected: the IPv4 address space (~4.3 billion) is small enough that unsalted hashes are reversible by offline enumeration. HMAC-SHA256 with a secret key makes precomputation infeasible without the key. The key is sourced from an explicit parameter, AUDIT_HMAC_KEY env var, or (with a logged warning) a random per-process fallback.
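
The key resolution order and the hashing step can be sketched with the standard library. The function names `resolve_hmac_key` and `hash_ip` are hypothetical; the AUDIT_HMAC_KEY variable and the fallback order come from this entry.

```python
import hashlib
import hmac
import logging
import os
import secrets

logger = logging.getLogger(__name__)


def resolve_hmac_key(explicit_key=None):
    """Explicit parameter, then AUDIT_HMAC_KEY, then a random per-process key."""
    if explicit_key:
        return explicit_key
    env_key = os.environ.get("AUDIT_HMAC_KEY")
    if env_key:
        return env_key.encode("utf-8")
    logger.warning("AUDIT_HMAC_KEY not set; using a random per-process key")
    return secrets.token_bytes(32)


def hash_ip(ip, key):
    """Keyed hash of the client IP, hex-encoded for the audit log."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

With the random fallback, hashes are stable within one process but cannot be correlated across restarts, which is why that path logs a warning.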

Why three output validators, not four

The original design included a "length/format sanity check" (reject suspiciously short responses or raw JSON in natural-language context). Dropped because the calculator tool returns short numeric answers and the tech docs domain legitimately contains code blocks and JSON examples. Every false positive erodes trust in the validation layer. The three remaining checks — PII leakage, URL hallucination, blocklist — are deterministic with clear pass/fail semantics.

Why buffer-then-validate for streaming output

The /ask/stream endpoint buffers all events from the orchestrator before sending to the client, then validates the assembled answer. This means the client waits for the full answer before receiving any content chunks. The orchestrator emits the final synthesis as a single chunk (tool-use iterations are not streamed), so the buffering adds no perceptible latency. The alternative — streaming chunks immediately and appending a safety marker — leaks unsafe content to any client that stops reading after the done event.

Why no authentication on API endpoints

The HF Spaces demo is public by design — the curl examples in the README work without credentials, which is the point. Adding API key authentication would gate access but break the zero-friction demo experience that makes the project evaluable.

The security pipeline protects content (injection detection, PII redaction, output validation), not access. This is a deliberate scope boundary: application-layer guardrails ensure the system behaves safely regardless of who calls it, rather than assuming trusted callers. Rate limiting (10 RPM per IP) provides basic abuse protection.

A production deployment would add authentication (API keys or OAuth) at the infrastructure layer — reverse proxy, API gateway, or middleware. The security pipeline's getattr(..., None) pattern means auth can be layered on without modifying the existing security components.

Why monitor mode for output validation, not gating?

Output validation runs post-stream as a monitoring layer. The answer streams to the client, then validation runs and emits its verdict. Gating (buffer-then-validate) would add 4-5 seconds of dead air while the full answer generates — unacceptable streaming UX for a documentation Q&A bot. Trade-off: a hallucinated URL or PII fragment could reach the client before validation catches it. For this use case (FastAPI docs, no real PII in corpus), the risk is near-zero. The dashboard labels this "monitored" (not "gated") to be explicit about the posture.

Why named residual risks and scope limits, not "fully mitigated" verdicts?

The OWASP LLM Top 10 (2025) mapping could have been written as a 10-row table where LLM01 and LLM02 read as "addressed" without qualifiers — shorter and cleaner-looking. Rejected because OWASP's own 2025 text is explicit about what an input guardrail can and cannot do, and writing a verdict that contradicts the source the mapping cites would be compliance theater.

LLM01 Prompt Injection — OWASP 2025 states that RAG and fine-tuning do not fully mitigate prompt injection, and that indirect injection through retrieved content remains a core risk class. "Fully mitigated" is unsupportable for any system retrieving untrusted content into an LLM context window, which is every RAG system including this one. The LLM01 verdict reads "addressed directly with named residual risk"; the residual-risk cell cites OWASP's own "do not fully mitigate" language verbatim.

LLM02 Sensitive Information Disclosure — OWASP 2025's LLM02 mitigations span four concern classes: access controls, training-data handling, user-consent transparency, and proprietary-information governance. This implementation addresses a narrower output-side subset (output validation for PII leakage, secret formats, and URL hallucination) — not cleanly one of the four concern classes, but a narrower scope than any of them. The verdict reads "addressed directly for the applicable scope"; the scope-limit cell enumerates the four concern classes verbatim and names what addressing the broader concerns would require (multi-tenant or authenticated architecture).

The tension the entry resolves is honesty-vs-scannability: a mapping that surfaces named residual risks and scope limits is longer and harder to skim than one with uniform "addressed" verdicts, but the scannable version over-claims relative to the cited source. Honest evaluation is the brand. Every verdict cell in SECURITY.md must survive a reviewer reading OWASP 2025 in a second tab.

See SECURITY.md § LLM01 Prompt Injection and § LLM02 Sensitive Information Disclosure for the verdict cells; this entry covers why the verdict discipline takes the form it does. The LLM01 "do not fully mitigate" phrasing and the LLM02 four-concern-class enumeration are canonical in SECURITY.md; the README tail and landing-page subtitle paraphrase but must preserve the named-residual-risk and scope-limit structure.

Why additive SSE stage events?

The enhanced /ask/stream adds meta and stage event types alongside the existing sources, chunk, and done events. Existing consumers that only handle the three legacy types are unaffected — they simply ignore events with unknown types. This avoids versioning the endpoint or breaking the non-streaming /ask contract. The meta event fires first (before any stages) so the frontend can display provider/model info immediately.
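
The backward-compatibility claim can be illustrated with a toy consumer. The event type names come from this entry; the payload fields in `SAMPLE_STREAM`, the parser, and the function names are assumptions made for the example.

```python
import json


def parse_sse(stream_text):
    """Parse a complete SSE body into (event_type, payload) tuples."""
    events = []
    for block in stream_text.strip().split("\n\n"):
        event_type, data_lines = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        events.append((event_type, json.loads("\n".join(data_lines))))
    return events


LEGACY_TYPES = {"sources", "chunk", "done"}


def legacy_consumer(events):
    """A pre-existing client: handles the three legacy types, skips the rest."""
    return [(etype, data) for etype, data in events if etype in LEGACY_TYPES]


# Hypothetical payloads; only the event type names are from the design.
SAMPLE_STREAM = (
    'event: meta\ndata: {"provider": "openai"}\n\n'
    'event: stage\ndata: {"name": "retrieval"}\n\n'
    'event: sources\ndata: {"sources": ["a.md"]}\n\n'
    'event: chunk\ndata: {"text": "hi"}\n\n'
    'event: done\ndata: {}\n\n'
)
seen = legacy_consumer(parse_sse(SAMPLE_STREAM))
```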

Why vanilla JS for the frontend, not Alpine or React?

The showcase dashboard has ~5 pieces of reactive state (pipeline stages, retrieval results, security badges, stats, chat messages). The SSE handler is inherently imperative: receive event, querySelector the target node, update classList and textContent. Wrapping this in a reactive framework adds a dependency, interview questions about "why is there a framework for 5 state variables", and indirection that fights the imperative SSE pattern. One state object + a few render() functions handles it in ~150 lines.

Phase 1 SSE gate closure — two baselines on record, not one

The Phase 1 acceptance gate for the SSE backend work (meta event, stage events, iteration-aware metadata threading, route-level injection/output-validation events) requires re-running make evaluate-fast and confirming numbers match pre-change state on the pinned gpt-4o-mini-2024-07-18 snapshot. The re-run was honored literally rather than substituted with a git-diff argument, even though the SSE commits did not touch scripts/evaluate.py's legacy code path. Two reasons: the re-commitment discipline that kept Fix 1 and Fix 2 honest applies equally here, and the legacy path and the --corpus fastapi path produce materially different baselines that cannot substitute for each other.

Two distinct baselines now exist at the pinned snapshot, and both are on record — one per prompt path:

| Baseline file | Invocation | Prompt source | In-scope P@5 | In-scope R@5 | Citation | Mean calls |
| --- | --- | --- | --- | --- | --- | --- |
| results/fastapi_preedit.json @ 213da36 | --corpus fastapi | format_system_prompt("FastAPI") | 0.718 | 0.833 | 1.000 | 1.14 |
| results/fastapi_legacy_baseline_pinned.json @ this commit | make evaluate-fast (no --corpus) | tech_docs.yaml task.system_prompt | 0.655 | 0.849 | 1.000 | 1.45 |

Citation accuracy holds at 1.000 on both paths, for in-scope and out-of-scope questions alike. The retrieval metric deltas, legacy path relative to the --corpus fastapi path (P@5 −0.063, R@5 +0.016, keyword hit rate +0.045), and the behavioral delta (mean tool calls +0.318 in-scope, +1.00 out-of-scope) trace to the prompt-path divergence (scripts/evaluate.py:67 reads task.system_prompt in the legacy branch vs. format_system_prompt(label) in the --corpus branch), not to any change in retrieval, reranking, or refusal-gate code. This divergence is the same one the "evaluation-layer multi-corpus support lagged the serving-layer refactor" entry documents; the narrowed serving-migration deferral tracks its eventual migration.

Why both baselines are retained. When the serving-migration deferral lands and scripts/evaluate.py's legacy branch is removed (everything routes through --corpus fastapi), the regression gate is "post-migration make evaluate-fast output matches pre-migration --corpus fastapi output within pre-committed tolerances." That gate requires the --corpus fastapi baseline as the comparison reference AND the legacy baseline as evidence of the pre-migration state that is being retired. Retaining both makes the migration auditable and bounds its regression budget; retaining only one would force the post-migration run to compare against a baseline from a different prompt path, guaranteeing the gate fires on prompt divergence rather than on any actual regression.

Gate verdict: passed. No regression vs pre-SSE legacy path expectations (citation 1.000 holds, refusal gate fires on the same 5 out-of-scope questions, retrieval numbers in sane in-scope ranges). Phase 1 SSE backend work is closed from the backend side; the frontend's consumption of iteration-aware stage events is orthogonal and owned by Week 1 step 7 (showcase UI).

Why per-corpus refusal thresholds?

FastAPI and Kubernetes have different corpus characteristics. FastAPI has 16 short, well-structured docs with sparse cross-references — relevance tends to concentrate in 1-2 chunks per query. Kubernetes has 30-40 docs with heavy cross-referencing between concepts (Pod → Deployment → Service → Ingress), which spreads relevance across more chunks. A single global refusal threshold would either refuse too aggressively on K8s (no single chunk dominates, so the top score looks "low") or not aggressively enough on FastAPI (where a moderate-scoring chunk might be the only hit and should still refuse).

CorpusConfig carries refusal_threshold as a per-corpus field. Each threshold is tuned against its own golden dataset. There is no "fair" shared threshold, because retrieval scores are not treated as comparable across corpora (the BEIR-inspired stance described under "Why no cross-corpus score comparison"). Placeholder values ship in default.yaml and are replaced by tuned values during the per-corpus evaluation sweep.

Why corpus and provider toggles compose — corpus_map[corpus][provider]

The simpler design would have been corpus_map[corpus] returning a single orchestrator. It ships in 10 fewer lines. It also silently breaks the provider toggle in multi-corpus mode: the orchestrator inside each corpus cell holds one fixed provider, and clicking "Anthropic" in the dashboard keeps running on OpenAI.

This project's hero-tile metric is the provider comparison (1.00 API / 0.14 7B self-hosted). Breaking the mechanism that demonstrates that metric — on a portfolio demo where a reviewer will open DevTools and notice — would erode the honest-evaluation brand the whole repo is built around. The nested corpus_map[corpus][provider] structure keeps both toggles functional. Store, retriever, and search tool are shared across providers within a corpus (the expensive objects are held once per corpus); only the orchestrator varies per provider since it holds the LLM client. Per-corpus × per-provider memory overhead is an orchestrator struct, not a FAISS index.

RSS is logged per corpus, not per corpus × provider, because the store is what drives memory. The provider multiplier is negligible compared to a hybrid index + embedder.
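
The nesting and the sharing rule can be sketched with placeholder classes. `SearchStack` stands in for the store, retriever, and search tool, and `Orchestrator` for the per-provider agent loop; both classes and `build_corpus_map` are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class SearchStack:
    """Stand-in for the expensive per-corpus objects (store, retriever, tool)."""
    corpus: str


@dataclass
class Orchestrator:
    """Stand-in for the agent loop; the provider is the only per-cell state."""
    provider: str
    search: SearchStack


def build_corpus_map(corpora, providers):
    """Build corpus_map[corpus][provider], sharing one SearchStack per corpus."""
    corpus_map = {}
    for corpus in corpora:
        shared = SearchStack(corpus)  # built once per corpus
        corpus_map[corpus] = {
            provider: Orchestrator(provider, shared) for provider in providers
        }
    return corpus_map


corpus_map = build_corpus_map(["fastapi", "k8s"], ["openai", "anthropic"])
orchestrator = corpus_map["k8s"]["anthropic"]
```

Looking up by both keys is what keeps the provider toggle live in multi-corpus mode, while the shared `SearchStack` keeps the memory cost tied to the number of corpora.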

Why one parameterized system prompt, not per-corpus templates

The template is "You are a technical documentation assistant for {corpus_label}...". The only corpus-specific element is the label; prompt content is identical across corpora: same citation format, same refusal language, same grounding instructions. Having two separate prompt files would invite drift — someone tweaks the FastAPI prompt for a specific failure mode and forgets to update the K8s version, and the demo silently answers differently on the two toggles.

The parameterization is enforced by two tests: (a) format_system_prompt("") raises ValueError so an unresolved {corpus_label} can never reach the LLM, and (b) a spy on orchestrator.run_stream asserts FastAPI and K8s requests receive different prompts with the correct label substituted.
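
A sketch of guard (a). The template text is limited to the opening sentence quoted in this entry, since the rest of the prompt is elided in this document; the whitespace check is an assumption beyond the empty-string case named above.

```python
# Only the opening sentence is quoted in this entry; the remainder of the
# real template is elided here and is not reproduced.
SYSTEM_PROMPT_TEMPLATE = (
    "You are a technical documentation assistant for {corpus_label}."
)


def format_system_prompt(corpus_label):
    """Substitute the corpus label, refusing empty or blank labels."""
    if not corpus_label or not corpus_label.strip():
        raise ValueError("corpus_label must be a non-empty string")
    return SYSTEM_PROMPT_TEMPLATE.format(corpus_label=corpus_label)


fastapi_prompt = format_system_prompt("FastAPI")
k8s_prompt = format_system_prompt("Kubernetes")
```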

The wording deliberately differs from the typical "don't hallucinate" RAG template:

  • "refuse the question explicitly" matches our refusal-gate mechanism. "Say so politely" is soft language that models interpret as "hedge and answer anyway".
  • "do not infer, do not extrapolate, do not draw on general knowledge" is the three-verb prohibition. "Do not fabricate" is empirically easier to slip past because models distinguish fabrication (making things up) from extrapolation (drawing conclusions from adjacent but non-authoritative context).

Why Kubernetes curation targets recruiter-likely questions, not coverage

The K8s corpus targets ~30-40 pages curated around concepts a technical reviewer would naturally type (Pod, Deployment, Service, Ingress, ConfigMap, RBAC) plus cross-referencing overview pages that stress the reranker. Cluster administration deep-dives, tutorials, and kubectl reference are explicitly excluded — they add noise without adding reviewer value and hurt retrieval precision when adjacent content is thin on concept definitions.

data/k8s_docs/SOURCES.md is a version-controlled curation artifact. Each ingested URL has a one-line rationale, a date pulled, and a license note. This makes the corpus reproducible and documents the curation reasoning for any reviewer who looks closely.

Trade-off: the corpus is not comprehensive K8s knowledge. A question about etcd raft internals will be correctly refused. This is not a bug — the refusal is part of the demo story, and "the system knows what it doesn't know" is a feature of the grounded-refusal mechanism.

Why no cross-corpus score comparison (inspired by BEIR)

Following BEIR's heterogeneous-benchmark framing (Thakur et al., NeurIPS 2021), which spans 18 datasets across 9 task types, this repo does not treat absolute retrieval scores as comparable across the FastAPI and K8s corpora: score distributions depend on chunk length, vocabulary overlap, and corpus density, none of which are held constant across domains. Only rank-ordering of system configurations within a single corpus is meaningful. Concrete consequences for this repo:

  • Per-corpus evaluation results are reported separately, never aggregated into a single "combined" number.
  • The hero-tile citation accuracy (1.00 API / 0.14 7B self-hosted) stays FastAPI-specific. It is not restated as a cross-corpus average.
  • make evaluate-fast accepts a --corpus flag but has no "combined" mode. Anyone who wants a cross-corpus number has to run twice and acknowledge the incomparability in prose.
  • The landing page "Key Findings" cards avoid sentences that compare FastAPI and K8s numbers directly.

The multi-corpus demo is a surface feature for interactive exploration, not a rebenchmark. The benchmark section of the README remains FastAPI-only and cites 27 questions on 16 docs with specific chunker settings.

K8s golden dataset uses CRAG's 8-type taxonomy as the schema

The K8s golden dataset uses CRAG's 8-type taxonomy (Yang et al., NeurIPS 2024) as the schema for question_type, not as a requirement to cover all 8 types. CRAG's taxonomy: simple, simple_w_condition, set, comparison, aggregation, multi_hop, post_processing_heavy, false_premise. Temporal dynamism is a separate orthogonal property captured as time_sensitive: bool on the question schema — it is not a CRAG category.

Target distribution across the 25-question K8s golden set:

  • simple (5–6): baseline retrieval
  • simple_w_condition (3–4): nuanced understanding under conditions
  • comparison (3–4): retrieval across concept pages, reranker stress
  • multi_hop (5–6): synthesis across 2–4 docs, reranker stress
  • false_premise (3–4): grounded refusal mechanism
  • set / aggregation / post_processing_heavy (0–3): included only where corpus content naturally supports

time_sensitive: bool flags 2–3 questions targeting version-bounded content (feature state, deprecations, API version migration).

false_premise questions come in two flavors (see separate "False-premise questions come in two flavors" entry): pure refusal (flavor A) and documented negative (flavor B). The K8s set includes at least one of each. Flavor A tests the path where retrieval correctly returns nothing useful; flavor B tests the path where the corpus contains an explicit negative answer and the agent must surface it with citation rather than confabulating a positive.

Rationale for using CRAG as schema (not coverage requirement): false_premise and time_sensitive stress grounded refusal and reduce test-set contamination risk; multi_hop and comparison stress the reranker because relevance spreads across multiple chunks. The distribution was chosen to exercise the parts of the pipeline the benchmark story claims — not to mimic a general-purpose QA benchmark.

The golden dataset JSON schema (v2, backward-compatible with the FastAPI flat list) includes:

  • source_chunk_ids: list[str] for multi-hop partial credit (answer must cite at least one of the expected chunks)
  • source_snippets: list[str] for human-readable context during review
  • question_type: str (CRAG taxonomy value)
  • is_multi_hop: bool for filtered reporting
  • Dataset-level header with corpus, version, snapshot_date, and pinned chunker parameters so the dataset is reproducible against a specific K8s docs snapshot
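A minimal sketch of what a v2 record could look like, assuming plain Python dataclasses rather than the project's actual models. The `id` and `question` fields, the chunker key names, and the helper `cites_expected_chunk` are illustrative assumptions; only the fields listed above come from this entry.

```python
from dataclasses import dataclass, field

# The eight CRAG question types named in this entry.
CRAG_TYPES = frozenset({
    "simple", "simple_w_condition", "set", "comparison",
    "aggregation", "multi_hop", "post_processing_heavy", "false_premise",
})


@dataclass
class GoldenQuestionV2:
    # `id` and `question` are assumed fields; the rest are named in the list above.
    id: str
    question: str
    question_type: str
    is_multi_hop: bool = False
    time_sensitive: bool = False  # orthogonal flag, not a CRAG category
    source_chunk_ids: list[str] = field(default_factory=list)
    source_snippets: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.question_type not in CRAG_TYPES:
            raise ValueError(f"unknown question_type: {self.question_type!r}")


@dataclass
class DatasetHeader:
    corpus: str
    version: str
    snapshot_date: str
    chunker: dict[str, int]  # pinned chunker parameters; key names are assumed


def cites_expected_chunk(q: GoldenQuestionV2, cited_chunk_ids: list[str]) -> bool:
    """Partial credit: the answer cites at least one of the expected chunks."""
    return bool(set(q.source_chunk_ids) & set(cited_chunk_ids))


# Hypothetical record, for illustration only.
example_question = GoldenQuestionV2(
    id="example_001",
    question="...",
    question_type="multi_hop",
    is_multi_hop=True,
    source_chunk_ids=["chunk-a", "chunk-b"],
)
```

Validating `question_type` at construction keeps the taxonomy a schema constraint without forcing every type to appear in the dataset.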

See docs/plans/2026-04-12-multi-corpus-refactor-design.md for the full schema and rationale.

EU AI Act corpus deferred to v1.2

EU AI Act compliance mapping is deferred to v1.2. Rationale: v1 ships two corpora (FastAPI, K8s) to demonstrate the multi-corpus architecture; EU AI Act as a third corpus would add ingestion and golden-set work without exercising architecturally new surface. Scoped as the first v1.2 addition after v1 launch.

Cold-start contingency: measure first, lazy-load if needed

Loading two corpora at startup costs memory and cold-start time. On HF Spaces (target deployment), the realistic ceiling is 8-10 GB resident RAM and ~60 seconds cold-start before the demo feels broken.

Policy:

  1. Measure HF Spaces cold-start on Day 1 of deployment.
  2. If cold-start < 60 s: plan validated, no changes.
  3. If cold-start > 60 s: implement a lazy-load path (FastAPI eager, K8s lazy on first K8s request). Scoped ~2 hours implementation.

This contingency is not pre-built. Pre-building a lazy-load path that may never ship creates dead code that rots, and the test surface for "lazy loading plus corpus routing plus provider switching" is non-trivial. The RSS logging in app.py (Task 2) emits the exact numbers needed to make the decision; the decision is documented here so future-me remembers the threshold and doesn't optimize prematurely on a hunch.

Cold-start gate fired — assumption falsified, fix deferred to v1.1 at the right cause

The preceding "Cold-start contingency" entry pre-committed a lazy-load fix (FastAPI eager, K8s lazy on first request) if the measured cold start exceeded 60 seconds. Measurement falsified the entry's core assumption: corpus loading is not the dominant cold-start cost. The committed fix addresses ~1 % of the observed overshoot. Executing it verbatim would honor the gate's letter but not its intent — theater dressed as discipline. This entry documents the measurement, the falsified assumption, and the new contingency pre-committed at the actual cause.

Measurement (N=3, 2026-04-15, HF Spaces target deployment):

| Sample | Cold start | Silent Python init | Visible phase |
| --- | --- | --- | --- |
| N=1 | 113 s | ~101 s | ~12 s |
| N=2 | 89 s | ~70 s | ~19 s |
| N=3 | 129 s | ~115 s | ~14 s |

  • Median 113 s, mean ~110 s, range 89–129 s (spread ~40 s)
  • Gate fire is unambiguous at both tails. Even the fastest sample (89 s) is ~48 % over the 60 s threshold; the slowest (129 s) is ~115 % over. No boundary ambiguity.
  • Sample-size justification. N=3 is acknowledged as a small sample. It is adequate here because (a) the gate-fire conclusion is stable across both tails, (b) the "silent Python init dominates variance" finding is stable across all three samples (silent phase varies 70 → 115 s across runs; visible phase varies only 12 → 19 s), and (c) the cost of additional samples (manual HF Space restart + ~2 min wait + log extraction per sample) exceeds the marginal information gain once both tails fire the gate and the variance pattern is stable. N=4 would tighten the confidence interval on the median but does not change either the gate-fire conclusion or the falsified-assumption finding.
  • Variance source named. HF Spaces shared-infrastructure CPU / IO contention during Python module imports. The silent-init phase varies 45 s across samples (70 → 115 s); the visible phase is stable (12–19 s). That is the signature of host-level contention on a shared physical node, not code-level variability. An exclusively-owned container would plausibly show a tighter bound.
  • Raw log captures (preserved so this entry can be cross-checked against the underlying evidence without re-running the measurement): measurements/2026-04-15-coldstart-n1.log, -n2.log, -n3.log.
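The summary figures in the bullets above can be re-derived from the three samples. This is a small sketch using only the table values; the variable and function names are illustrative.

```python
from statistics import mean, median

THRESHOLD_S = 60
# (total cold start, silent Python init, visible phase) in seconds, from the table.
samples = [(113, 101, 12), (89, 70, 19), (129, 115, 14)]

totals = [total for total, _, _ in samples]
silent = [s for _, s, _ in samples]
visible = [v for _, _, v in samples]


def pct_over(total_s: float, threshold_s: float = THRESHOLD_S) -> float:
    """How far a sample overshoots the gate, as a percentage of the threshold."""
    return 100 * (total_s - threshold_s) / threshold_s


print(f"median {median(totals)} s, mean {mean(totals):.1f} s, "
      f"spread {max(totals) - min(totals)} s")
print(f"fastest {pct_over(min(totals)):.0f} % over, "
      f"slowest {pct_over(max(totals)):.0f} % over")
print(f"silent spread {max(silent) - min(silent)} s, "
      f"visible spread {max(visible) - min(visible)} s")
```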

Where the cost lives. At the median (113 s):

  • Silent Python init phase — ~90 s (≈ 80 % of total): interpreter start, module imports (torch, transformers, langchain, faiss, fastapi, httpx, the full dependency closure), and initial model weight loading (all-MiniLM-L6-v2 embedder, cross-encoder reranker). Not logged — no observability inside the import chain.
  • Visible startup phase — ~15 s (≈ 15 % of total): injection classifier init (10 s, includes the "classifier skipped" warning), FastAPI corpus load (< 1 s, +0.9 MB RSS), K8s corpus load (< 1 s, +25.8 MB RSS), reranker warmup (2 s).

The K8s corpus load — which the pre-committed fix was designed to defer — contributes under 1 second of the 113-second median. Deferring it saves roughly 1 % of the overshoot. FastAPI corpus load is the same order of magnitude. Corpus loading is simply not where the cost lives on this deployment.

Why we are not executing the pre-committed fix. The preceding contingency was written under an empirical assumption about cost attribution (corpus loading is the dominant cost). Measurement falsified the assumption. Implementing the fix anyway would be a mechanical execution of a recipe whose premise has been disproven — it checks the gate-honoring box while failing to address the cause. That is structurally identical to relaxing-by-redefinition ("60 s was too tight"), just in the opposite direction: relaxing by execution. The pre-commitment rule's purpose is to prevent motivated reasoning about the gate, not to mandate mechanical compliance with a recipe whose empirical foundation has collapsed.

The honest action is (1) accept the measurement as the v1 baseline, (2) document the falsified assumption explicitly (this entry), (3) re-pre-commit a new contingency at the actual dominant cost with an explicit trigger condition so the decision is not relitigated at review time, and (4) update the user-facing README surface to reflect the measured cold-wake number rather than the optimistic pre-deploy estimate.

v1.1 contingency — pre-committed:

If HF Spaces traffic produces more than N cold wakes per day (N to be determined from observed usage patterns after launch, not estimated in advance), defer eager loading of (a) the cross-encoder reranker, (b) the sentence-transformers embedder, and (c) the injection classifier tier to first-relevant-request.

Estimated work: 4–6 hours (lazy-init wrappers + first-request caching + integration tests for the warm/cold transition).

Expected tradeoff: cold wake ~113 s → ~50–60 s (approaches the original 60 s target); first request after any cold wake incurs +8–15 s additional latency (model weights load synchronously in the request path), after which subsequent warm requests return to normal ~5 s latency.
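For orientation, a lazy-init wrapper of the kind this contingency describes might look like the sketch below. Nothing here is built; `LazyResource` and `load_reranker` are hypothetical names, and the loader is a placeholder rather than a real model load.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defer an expensive load until first use, then cache the result."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        if not self._loaded:            # fast path once warm
            with self._lock:            # only one request pays the load cost
                if not self._loaded:    # re-check after acquiring the lock
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]


load_calls = 0


def load_reranker() -> str:
    # Placeholder for a real model load (hypothetical).
    global load_calls
    load_calls += 1
    return "reranker-weights"


reranker = LazyResource(load_reranker)
first = reranker.get()    # first relevant request: pays the load synchronously
second = reranker.get()   # later requests: served from the cached value
```

The synchronous load inside `get()` is where the projected first-request latency would come from, which is why the warm/cold transition needs its own integration tests.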

Trigger is usage-justified, not estimate-justified. Until real traffic data justifies the work, there is nothing to optimize — a recruiter demo that gets one cold wake per day does not pay for 4–6 hours of engineering plus the new first-request-latency failure mode. The trigger threshold N is left unnamed deliberately: naming a number in advance would invite the same falsification pattern this entry is documenting.

Methodology lesson. When a pre-committed contingency is written under an empirical assumption, the contingency only holds if the assumption survives measurement. If measurement falsifies the assumption, the correct action is to document the falsification, accept the observed baseline, and re-pre-commit at the actual cause. The wrong action is to execute the original recipe anyway, which trades one form of motivated reasoning (threshold relaxation) for another (recipe compliance). The underlying discipline — "pre-commit your gates and honor them" — does not mean "mechanically run the pre-committed fix regardless of what it addresses." It means "honor the gate's intent, which is to prevent motivated reasoning about pass/fail."

Post-hoc refinement (2026-04-22) — three latency regimes observed. Follow-up warm-latency measurements after the initial cold-start characterization distinguish a middle regime that the README's "~2 min cold / ~5 s warm" phrasing collapses:

| Regime | Latency | Cause |
| --- | --- | --- |
| Cold-start (container spin-up) | 89–129 s | Python imports, model load (per the table above) |
| Wake-from-idle (first /ask after container is up) | ~6.7 s (≈ +2 s over warm) | Residual lazy init on first request |
| Steady-state warm | ~5 s (mean 4.74 s, n=5) | LLM API round-trip + retrieval |

Source: n=1 first-hit-after-cold-wake @ 6.7 s, then n=5 steady-state warm @ 4.07, 4.69, 4.78, 5.06, 5.11 s (2026-04-22). The README claim captures regimes 1 and 3; the ~2 s first-hit penalty between them is a refinement of the warm-cold boundary, not a contradiction. The v1.1 projection of "+8–15 s first request after any cold wake" measures the delta from this ~2 s current baseline, not from a zero-penalty starting point — the true v1.1 UX cost over current behavior is closer to +6–13 s.
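The arithmetic behind the refinement, as a short sketch over the samples quoted above (variable names are illustrative):

```python
from statistics import mean

warm_s = [4.07, 4.69, 4.78, 5.06, 5.11]   # n=5 steady-state warm samples
first_hit_s = 6.7                          # n=1 first request after cold wake

warm_mean = mean(warm_s)
first_hit_penalty = first_hit_s - warm_mean

# v1.1 projection of +8 to +15 s, re-expressed relative to today's first-hit penalty.
projected = (8.0, 15.0)
net_cost = tuple(p - first_hit_penalty for p in projected)

print(f"warm mean {warm_mean:.2f} s, first-hit penalty {first_hit_penalty:.2f} s")
print(f"net v1.1 first-request cost: +{net_cost[0]:.1f} to +{net_cost[1]:.1f} s")
```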

False-premise questions come in two flavors

When authoring golden-dataset questions whose premise is wrong, the question can point at one of two genuinely different failure modes. Both are valid; they test different pipeline paths and should be labeled distinctly so the evaluator routes correctly.

Flavor A — pure refusal. The premise is not addressed anywhere in the corpus. Example: "How do I configure Claude API rate limits in Kubernetes?" K8s has no such concept. Schema: category: "out_of_scope", expected_sources: [], source_snippets: []. The evaluator's grounded_refusal metric expects the answer to contain a refusal phrase ("does not contain", "no information") AND cite zero sources. Tests the pipeline path where retrieval correctly returns nothing useful and the agent correctly declines.

Flavor B — documented negative. The corpus contains an explicit negative answer. Example: "How do I configure NetworkPolicy to enforce mTLS?" The K8s NetworkPolicy docs have a "What you can't do with network policies" section that explicitly says "Anything TLS related (use a service mesh or ingress controller for this)". Schema: category: "retrieval", question_type: "false_premise", expected_sources: [<the negative-answer page>], source_snippets: [<the verbatim negative statement>]. The evaluator expects the agent to retrieve the page, find the negative statement, and answer negatively with a citation. Tests the stricter path where the corpus genuinely contains the answer and the agent must not hallucinate a contradictory capability.
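The two labelings can be illustrated with plain dictionaries. This is a sketch under assumptions: `flavor` and `grounded_refusal` are hypothetical helpers written for this example, not the evaluator's actual code, and the refusal phrases are only the two quoted above.

```python
REFUSAL_PHRASES = ("does not contain", "no information")

flavor_a = {
    "question": "How do I configure Claude API rate limits in Kubernetes?",
    "category": "out_of_scope",
    "expected_sources": [],
    "source_snippets": [],
}

flavor_b = {
    "question": "How do I configure NetworkPolicy to enforce mTLS?",
    "category": "retrieval",
    "question_type": "false_premise",
    "expected_sources": ["k8s_network_policies.md"],
    "source_snippets": [
        "Anything TLS related (use a service mesh or ingress controller for this)"
    ],
}


def flavor(record: dict):
    """Route a false-premise record to flavor 'A' or 'B' from its labels."""
    if record["category"] == "out_of_scope" and not record["expected_sources"]:
        return "A"
    if record.get("question_type") == "false_premise" and record["expected_sources"]:
        return "B"
    return None


def grounded_refusal(answer: str, cited_sources: list) -> bool:
    """Flavor A passes only with a refusal phrase AND zero cited sources."""
    has_phrase = any(p in answer.lower() for p in REFUSAL_PHRASES)
    return has_phrase and not cited_sources
```

Keeping the routing keyed on `category` and `expected_sources` means a flavor-B question is scored on retrieval and citation, not on refusal phrasing.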

Why both matter for the honest-evaluation brand. Grounded refusal is not "refuse when retrieval is weak." It is "answer exactly what the source says, including when the source says no." Flavor A tests the first half (refuse when there is nothing to ground on); flavor B tests the second half (report the documented negative instead of confabulating a positive). The K8s golden dataset includes at least one of each. The first K8s pilot (k8s_pilot_005, NetworkPolicy mTLS) is flavor B. Flavor A is reserved for questions targeting features that genuinely do not exist in the K8s corpus; at least one such question is required in the full 25-question set.

Pilot_005 refusal-gate + agent-behavior measurement

The first K8s pilot run surfaced two distinct flavor-B failure modes on k8s_pilot_005 (NetworkPolicy mTLS). Both are empirical, both have specific numbers, and both are logged in results/k8s_pilot_threshold_0.02.json and results/k8s_pilot_threshold_0.015.json.

Failure mode 1 — threshold calibration (at 0.02). The SearchTool.execute() refusal gate fired with max_score=0.01639 — exactly 1/(60+1), the rank-1 RRF score from a single fusion system. BM25 hit "NetworkPolicy" at rank 1; the dense encoder contributed nothing, because "Anything TLS related (use a service mesh or ingress controller for this)" is a single negative sentence, not a conceptual topic the page is semantically "about." Hybrid fusion inherited only the BM25 rank-1 score. At threshold 0.02 (the FastAPI working value), the gate refused before the agent saw any chunks. Retrieval P@5 and R@5 both 0.00; answer is a generic refusal.
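The score in question follows directly from the RRF formula. The sketch below assumes the gate refuses when the best fused score is below the threshold; `rrf_score` and `gate_refuses` are illustrative names, not the project's functions.

```python
RRF_K = 60


def rrf_score(ranks, k: int = RRF_K) -> float:
    """Sum 1/(k + rank) over the systems that returned the chunk (None = not returned)."""
    return sum(1.0 / (k + r) for r in ranks if r is not None)


def gate_refuses(max_score: float, threshold: float) -> bool:
    """Assumed gate semantics: refuse when the best fused score is below the threshold."""
    return max_score < threshold


bm25_only = rrf_score([1, None])   # BM25 rank 1, no dense contribution
both_rank1 = rrf_score([1, 1])     # for comparison: rank 1 in both systems

print(f"BM25-only rank 1: {bm25_only:.5f}")
print(f"rank 1 in both:   {both_rank1:.5f}")
```

A chunk ranked first by both systems would clear 0.02 comfortably; it is the single-system hit that lands just under it.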

Failure mode 2 — agent behavior on documented negative (at 0.015). With the threshold dropped just below the measured max score (0.015 < 0.01639), retrieval is perfect: P@5 1.00, R@5 1.00, all five top chunks from k8s_network_policies.md. But the agent still produces a flavor-A-style refusal: "The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS..." The "Anything TLS related" sentence is in the retrieved chunks — the agent simply treats the absence of positive instructions as grounds for refusal, rather than reading the explicit negative sentence and citing it as the answer. KHR 0.67: the service mesh and ingress controller keywords (the documented alternatives the page points to) are missing from the answer.

Implication. The flavor-B mechanism requires more than threshold tuning. Fixing the gate is necessary but not sufficient. The system prompt needs a flavor-B clause (e.g., "if the documentation explicitly says a feature does not exist or is not supported, report that with citation — do not treat it as unanswerable"), or the K8s golden dataset's flavor-B questions must use phrasing the current prompt can route correctly. The 0.30 placeholder value from the design doc was based on "prefer conservative" intuition without empirical grounding — the measured working range for K8s pilot retrieval is lower by more than an order of magnitude than that intuition, and even at the working threshold the prompt layer is the blocker.

What this measurement is. A pilot smoke-test result, not a benchmark claim. Aggregates at 0.02: P@5 0.63, R@5 0.83, KHR 0.69. Aggregates at 0.015: P@5 0.80, R@5 1.00, KHR 0.75. Five of six pilots produce substantively correct answers on K8s content under the working threshold — evidence the retrieval stack generalizes to K8s. The pilot's job was schema validation + calibration evidence, not launch metrics. Launch metrics come from the 25-question K8s golden set with tuned threshold and (likely) a revised system prompt, sequenced after this pilot.

Evaluation-layer multi-corpus support lagged the serving-layer refactor

The Tasks 1–8 multi-corpus refactor wired corpora through app.state.corpus_map and the /ask serving route. scripts/evaluate.py was not touched and remained single-corpus — it read config.rag.store_path and config.evaluation.golden_dataset directly, with no awareness of the corpora dict. This was an accurate scoping of the refactor (serving-layer, not eval-layer) but the gap was not surfaced in the original task list.

The K8s pilot commit adds --corpus <name> to scripts/evaluate.py, routing through config.corpora[name] for store_path, refusal_threshold, and a new optional golden_dataset field on CorpusConfig. Without --corpus, the legacy single-store path is preserved for backward compatibility with make evaluate-fast and any existing invocations.

CorpusConfig.golden_dataset is str | None = None — optional rather than required — because two legitimate states exist: corpus has a golden dataset (FastAPI, K8s post-authoring), and corpus has no golden dataset yet (any corpus during bring-up). The CLI errors cleanly with "corpus '' has no golden_dataset configured" when the field is None, rather than requiring all corpora to ship with datasets.
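A minimal sketch of the optional field and the clean CLI failure, assuming a plain dataclass in place of the project's real config model. The store paths, the `resolve_golden_dataset` helper, and the exact interpolation of the corpus name into the message are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CorpusConfig:
    store_path: str
    refusal_threshold: float
    golden_dataset: str | None = None  # None while a corpus is in bring-up


def resolve_golden_dataset(corpora: dict[str, CorpusConfig], name: str) -> str:
    """Return the golden dataset path for a corpus, or exit with a clear message."""
    corpus = corpora[name]
    if corpus.golden_dataset is None:
        raise SystemExit(f"corpus '{name}' has no golden_dataset configured")
    return corpus.golden_dataset


# Hypothetical config: store paths and the 'new_corpus' entry are made up.
corpora = {
    "fastapi": CorpusConfig(
        store_path="stores/fastapi",
        refusal_threshold=0.02,
        golden_dataset="agent_bench/evaluation/datasets/tech_docs_golden.json",
    ),
    "new_corpus": CorpusConfig(store_path="stores/new", refusal_threshold=0.02),
}
```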

Deferred: path-preserving ingestion

scripts/ingest.py uses doc_path.glob("*.md") (non-recursive) and stores the bare filename as the chunk's source field. This forces a flat-namespace convention: FastAPI ships as fastapi_*.md, K8s ships as k8s_*.md, and golden dataset expected_sources are filename stems. The path-preserving alternative (recursive rglob plus relative-path source IDs, e.g., concepts/workloads/pods) was evaluated during the K8s pilot planning and explicitly deferred. The root-cause refactor would have required FastAPI re-ingestion and a rewrite of the FastAPI golden dataset's expected_sources — trading certain regression risk on a green baseline (288 tests, citation accuracy 1.00 on API providers) for speculative legibility benefit on K8s authoring.

The source_pages field on GoldenQuestion preserves the human-readable path anchor separately from the machine identifier, so the deferral does not lose information. Authors see both expected_sources: ["k8s_pods.md"] (what the evaluator matches on) and source_pages: ["concepts/workloads/pods"] (where the content came from on kubernetes.io) in the same question record.
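The difference between the two ingestion conventions can be seen with `pathlib` alone. This sketch builds a throwaway directory; `flat_source_ids` and `path_source_ids` are illustrative helpers, not the code in scripts/ingest.py.

```python
import tempfile
from pathlib import Path


def flat_source_ids(doc_path: Path) -> list:
    """Current convention: non-recursive glob, bare filename as the source ID."""
    return sorted(p.name for p in doc_path.glob("*.md"))


def path_source_ids(doc_path: Path) -> list:
    """Deferred alternative: recursive rglob, relative path without the suffix."""
    return sorted(
        p.relative_to(doc_path).with_suffix("").as_posix()
        for p in doc_path.rglob("*.md")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "k8s_pods.md").write_text("flat layout")
    nested = root / "concepts" / "workloads"
    nested.mkdir(parents=True)
    (nested / "pods.md").write_text("nested layout")

    flat_ids = flat_source_ids(root)   # sees only the top-level file
    path_ids = path_source_ids(root)   # sees both, keyed by relative path

print(flat_ids)
print(path_ids)
```

The non-recursive glob silently ignores nested files, which is why the flat `k8s_*.md` naming convention is load-bearing until the refactor happens.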

Pattern marker, not a promise. This is the second visa-timeline deferral of a root-cause refactor in favor of a minimal-blast-radius fix; the first was the Mar 25 → Apr 12 P@5 slide bisection. Both deferrals were deliberate, not forgetting. Not scheduled until post-launch; marker only. Post-launch scope: modify ingest.py to rglob + relative-path source IDs, re-ingest FastAPI, rewrite both golden datasets' expected_sources to path-style. Estimated 3h.

K8s refusal_threshold empirical calibration — 0.02 → 0.015

Change. configs/default.yaml, corpora.k8s.refusal_threshold: 0.02 → 0.015. Single-line config change, pilot-corpus only. FastAPI threshold unchanged.

Empirical evidence. Diagnostic instrumentation of k8s_pilot_005 ("How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?") captured the retrieval gate firing at max_score = 0.01639344262295082, which is exactly 1 / (60 + 1), the algebraic floor for a single rank-1 BM25 hit under RRF with rrf_k = 60, dense contribution zero. At refusal_threshold = 0.02, pilot_005 tripped the gate and short-circuited before retrieval chunks reached the agent. At refusal_threshold = 0.015 (one tick below the measured floor), the gate releases and retrieval proceeds. The 0.015 value is not a tuning guess: it is the nearest round-number floor below the observed gate-fire value for the single worst pilot in the set.

Validation. results/k8s_preedit.json captures the full 6-pilot run at 0.015. Aggregate: P@5 0.80, R@5 1.00, KHR 0.78, mean tool_calls_made 1.167. All six questions receive retrieval; no gate-fire short-circuits. pilot_005 still refuses as a separate downstream issue (see next entry when the counterfactual-query fix lands); that is not a threshold problem.

Scope of this commit. K8s only. FastAPI refusal_threshold (0.02) is not affected and FastAPI baseline is not re-measured. Launch-intent 0.30 placeholder for K8s remains as a comment marker; the full threshold sweep against the 25-question golden set replaces 0.015 with a properly-tuned value in a later commit. 0.015 is the pilot-floor safety value, not the production-target value.

Why this is a separate commit from the prompt revision. The threshold calibration is empirically grounded on its own: it removes the 0.01639 gate-fire blocker, which is the precondition for any downstream evaluation of pilot_005's actual agent behavior. The prompt revision addresses a different failure mode surfaced once the gate releases (agent search strategy is monotone positive-framing). Two independent changes must not entangle in one commit; if the prompt revision fails its regression gate and is reverted, the threshold calibration should stand on its own empirical merit. Feedback memory feedback_fix_before_sweep.md applies recursively: fix measurement-affecting bugs at every layer before combining fixes into single experiments.

Prep for counterfactual-query prompt regression — pin, wire, tolerances

Three sub-changes bundled as one prep commit, each small and in service of making the downstream regression measurement valid.

1. OpenAI model pin. agent_bench/core/provider.py:208 changes self.model = "gpt-4o-mini" → self.model = "gpt-4o-mini-2024-07-18". The unpinned alias is a known drift vector: the Mar 25 → Apr 12 P@5 slide bisection is an already-open parallel track item traceable to silent alias migration. A regression run that uses the alias across pre-edit and post-edit phases conflates prompt-clause effect with model drift, even within a single session if the alias happens to roll between runs. Pinning the dated snapshot removes the variable. Pricing dict in configs/default.yaml gets a matching gpt-4o-mini-2024-07-18 entry so the cost-lookup at provider.py:209 still resolves. Tests that pin the model string live in mock response payloads (not outgoing assertions) and the langchain baseline (separate code path); neither is affected.

2. FastAPI multi-corpus eval wiring. configs/default.yaml adds corpora.fastapi.golden_dataset: agent_bench/evaluation/datasets/tech_docs_golden.json. The production serving path at routes.py:105-120 _resolve_system_prompt already routes /ask and /ask/stream through format_system_prompt(label) from core/prompts.py — the app.state.system_prompt legacy fallback (serving/app.py:276) is effectively dead code given the shipped multi-corpus config. The only remaining caller of task.system_prompt is the scripts/evaluate.py legacy branch used by make evaluate-fast. Adding the missing golden_dataset field makes --corpus fastapi work so the regression gate can measure the actual production prompt path, not the legacy eval-scaffolding prompt. Purely additive; zero blast radius on serving (serving doesn't read golden_dataset).

3. Pre-committed four-metric tolerances. Written down now, before the post-edit runs, so the pass/fail call on the counterfactual-query prompt clause is not a judgment under confirmation-bias pressure. Applied identically to FastAPI and K8s:

| Metric | Pass criterion |
| --- | --- |
| P@5 | post-edit ≥ pre-edit − 0.02 |
| R@5 | post-edit ≥ pre-edit − 0.02 |
| Citation accuracy | post-edit ≥ pre-edit (hard gate: any drop blocks commit) |
| Mean tool_calls_made | post-edit ≤ pre-edit + 0.30 |
| Individual question cap | no question that used fewer than max_iterations=3 iterations pre-edit may hit the cap post-edit |

pilot_005 strict flip criterion (K8s-only):

  • keyword_hit_rate ≥ 0.60 against golden keywords ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]
  • Answer cites k8s_network_policies.md
  • Answer contains "service mesh" OR "ingress controller" (the concrete documented-negative evidence the pre-edit refusal lacked)
  • Answer does NOT begin with refusal phrasing ("The ... documentation does not provide", "I cannot answer")
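The four conditions above can be expressed as one predicate. This is a sketch: `keyword_hit_rate` is assumed here to be the fraction of golden keywords found as case-insensitive substrings, and `strict_flip` and the post-edit answer are hypothetical. Under that assumed definition, the pre-edit refusal quoted earlier scores 4 of 6 keywords.

```python
import re

GOLDEN_KEYWORDS = ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]
REFUSAL_OPENERS = (r"^The .* documentation does not provide", r"^I cannot answer")


def keyword_hit_rate(answer: str, keywords=GOLDEN_KEYWORDS) -> float:
    # Assumed definition: fraction of keywords present as case-insensitive substrings.
    text = answer.lower()
    return sum(k.lower() in text for k in keywords) / len(keywords)


def strict_flip(answer: str, cited_sources: list) -> bool:
    """All four pilot_005 conditions must hold."""
    text = answer.strip()
    lowered = text.lower()
    return (
        keyword_hit_rate(text) >= 0.60
        and "k8s_network_policies.md" in cited_sources
        and ("service mesh" in lowered or "ingress controller" in lowered)
        and not any(re.match(pattern, text) for pattern in REFUSAL_OPENERS)
    )


# Pre-edit answer as quoted in the pilot measurement entry.
pre_edit = (
    "The Kubernetes documentation does not provide specific instructions on "
    "configuring a NetworkPolicy to enforce mutual TLS..."
)
# Hypothetical passing answer, written for this example only.
post_edit = (
    "NetworkPolicy does not handle TLS; the docs say to use a service mesh "
    "or ingress controller for this."
)
```

Note that keyword coverage alone would let the pre-edit refusal through at 0.67; it is the evidence and opening-phrase conditions that reject it.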

Baseline reference: K8s pre-edit numbers from results/k8s_preedit.json at commit 125dac0 — P@5 0.80, R@5 1.00, citation 1.00 (all 6), mean tool_calls 1.167. FastAPI pre-edit reference established by results/fastapi_preedit.json in the next step of this session, same pinned ID, same refusal threshold (0.02).

Rationale for bundling. All three sub-changes answer "what must be true before the regression measurement is valid" — drift control, evaluation path, decision criteria. Splitting into three commits would add noise without adding signal. None of them change the prompt template itself; the prompt edit is the NEXT commit and is the sole experimental variable the regression measures.

Fix 1 (prompt-level counterfactual clause) attempted and reverted

Outcome. K8s regression clean on every metric (P@5, R@5, KHR, citation, mean tool_calls all within tolerance or unchanged); K8s pilot_005 flipped from refusal to documented-negative-with-citation as designed (KHR 0.67 → 1.00, answer contains both "service mesh" and "ingress controller", cites k8s_network_policies.md). FastAPI regression failed on the iteration-inflation tolerance: mean tool_calls_made 1.111 → 1.556 (delta +0.444, gate +0.30), and two retrieval questions (q024, q025) were pushed from 1 pre-edit tool call to 3 post-edit tool calls (hitting max_iterations=3 cap), violating the pre-committed "no new cap-hits from sub-cap baseline" criterion.

Correctness metrics on FastAPI all held. Citation accuracy stayed at 1.000 / 1.000 across all 27 questions. P@5 delta −0.007, R@5 delta 0.000, KHR delta +0.006. The failure is purely process inflation, not output regression. q024 and q025 produce identical P@5/R@5/KHR/citation numbers pre and post despite the cap-hit — the orchestrator's "max iterations hit → one final complete() without tools" path happened to keep answers correct, but that is observation, not structural protection.
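The gate outcome can be reproduced with a small checker. This is a sketch, not the project's evaluation code: `check_gate` is a hypothetical helper, the post-edit P@5 is derived from the recorded FastAPI pre-edit value (0.585) and the stated delta, and only three of the 27 per-question tool-call counts are shown.

```python
def check_gate(pre: dict, post: dict, max_iterations: int = 3,
               tool_call_slack: float = 0.30) -> dict:
    """Return pass/fail per pre-committed criterion (True = pass)."""
    eps = 1e-9  # guard against float noise at the boundary
    new_cap_hits = [
        q for q, calls in post["per_question_calls"].items()
        if pre["per_question_calls"][q] < max_iterations and calls >= max_iterations
    ]
    return {
        "p_at_5": post["p_at_5"] >= pre["p_at_5"] - 0.02 - eps,
        "r_at_5": post["r_at_5"] >= pre["r_at_5"] - 0.02 - eps,
        "citation": post["citation"] >= pre["citation"],  # hard gate
        "tool_calls": post["tool_calls"] <= pre["tool_calls"] + tool_call_slack + eps,
        "no_new_cap_hits": not new_cap_hits,
    }


fastapi_pre = {
    "p_at_5": 0.585, "r_at_5": 0.679, "citation": 1.000, "tool_calls": 1.111,
    "per_question_calls": {"q023": 3, "q024": 1, "q025": 1},  # subset only
}
fastapi_post = {
    "p_at_5": 0.578, "r_at_5": 0.679, "citation": 1.000, "tool_calls": 1.556,
    "per_question_calls": {"q023": 3, "q024": 3, "q025": 3},  # subset only
}

result = check_gate(fastapi_pre, fastapi_post)
print(result)
```

q023 does not count as a new cap-hit because it already used three tool calls before the edit.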

Failure mode. The clause's trigger condition ("your first search returned documentation about the subject of the question without addressing the specific capability or feature the user is asking about") relies on subjective LLM judgment about whether retrieved content "addresses" a capability. The judgment is fuzzy on compound multi-topic questions where the first search returns partial-topic coverage. q024 asks about "Docker + Gunicorn workers + health checks + Pydantic Settings"; first search returns Docker content, LLM reads "documentation about the subject without addressing the specific capability," fires the follow-up with negative framing, gets nothing useful, does a third normal search to cover the remaining topics, hits the cap. Same pattern on q025. Over-firing on this class of question is an inherent fragility of prompt-level LLM-judged triggers; a wording refinement might narrow the misfire rate but cannot eliminate it as long as the judgment itself is fuzzy.

q023 vs q024/q025 asymmetry is a useful signal for Fix 2. q023 is a pre-existing 3-tool-call compound question ("custom error handling + CORS middleware + structured testing with dependency overrides"). Under the prompt clause, q023 was unchanged — the clause did not fire on it — while q024 and q025, structurally similar compound questions, were pushed into 3-tool-call cap-hit. The difference is not in question structure but in how the LLM interpreted the first-search return for each. That asymmetry is the precise reason a deterministic trigger is the right next step: any Fix 2 / Fix 3 candidate should be unit-testable against (pilot_005, q023, q024, q025) — the right fix must fire on pilot_005 and behave predictably on all three compound questions (either fire on all of them or none of them, but not pick them selectively by LLM whim).

Gate discipline honored. The pre-committed FastAPI tolerances fired for exactly the reason the pre-commitment was designed: catching process-metric regressions before they ship. Post-hoc tolerance relaxation would burn the session's strongest discipline artifact (pre-committed-tolerances + honored-gate) for marginal ship-this-approach EV. The narrow pilot_005 finding does not evaporate with the revert: chunk 63 (d0806d5da91d6026) is real, the negative-framing retrieval is reproducible, and Fix 2 will surface the documented negative the same way via a deterministic path.

Fix 2 deferred to a later session. Deterministic query expansion at the SearchTool layer: when a search_documents call returns no chunk containing a direct answer string, issue a second internal search with negative-framing keywords and merge results before returning to the orchestrator. Offline-testable, corpus-agnostic, no LLM judgment required, no iteration-budget impact (the double-search happens inside a single tool call, not across iterations). Unit-testable against the (pilot_005, q023, q024, q025) asymmetry as an acceptance fixture.

Evidence retained. Four result JSONs in results/ document the regression measurement at the pinned gpt-4o-mini-2024-07-18 snapshot in this session:

  • fastapi_preedit.json — 27 questions, HEAD prompt, 0.02 threshold
  • fastapi_postedit.json — 27 questions, clause prompt, 0.02 threshold (gate-failing run)
  • k8s_preedit_pinned.json — 6 pilots, HEAD prompt, 0.015 threshold
  • k8s_postedit.json — 6 pilots, clause prompt, 0.015 threshold (gate-passing run, pilot_005 strict flip confirmed)

The previously-committed results/k8s_preedit.json (from 125dac0) is also a valid K8s-pinned measurement at the session-equivalent snapshot and remains the canonical threshold-commit evidence.

Held DECISIONS.md drafts stay held. The counterfactual-query finding draft (to be updated when Fix 2 lands) and the threshold-calibration entry already committed at 125dac0 are both correct in scope. The narrowed serving-migration deferral entry (tied to any external reference to the counterfactual-query fix) also stays deferred until Fix 2 lands, since the production/eval-harness prompt divergence is unchanged by this revert.

Fix 2 pre-committed regression gate — SearchTool deterministic query expansion

Pre-committed BEFORE post-edit runs (same discipline pattern that caught Fix 1's iteration inflation cleanly).

Mechanism under test. agent_bench/tools/search.py SearchTool.execute gains a deterministic two-query retrieval path. When the primary retrieval passes the refusal gate, a secondary retrieval is issued against an expanded query (original_query + " not supported limitations cannot"), and the final context returned to the LLM is primary_top_3 ++ secondary_top_5 deduplicated by chunk.id. Both retrievals run inside a single SearchTool.execute call — from the LLM's perspective, the tool schema, name, parameters, and return shape are unchanged, and the iteration budget is untouched.

Why this is architecturally different from Fix 1. Fix 1 placed a behavioral clause in the system prompt that told the agent to issue follow-up searches itself. The trigger was an LLM judgment ("did the first search return content addressing the specific capability?") and the follow-up was a separate tool call, so it counted against max_iterations. Over-firing on compound questions inflated iteration counts and pushed q024/q025 to the cap. Fix 2 replaces this with a deterministic trigger (primary passes gate), a fixed expansion suffix, and a merge that happens entirely inside one tool call. No LLM judgment; no iteration change; corpus-agnostic.

Suffix choice. " not supported limitations cannot". Keyword-dense, ungrammatical on purpose: the suffix exists to shift BM25 and embedding mass toward "what you cannot do" / "limitations" sections, not to read well. The ungrammatical form is also a self-documenting signal in retrieval logs: anyone reading a query trace sees the suffix and immediately knows it is a synthetic expansion, not user input. A one-line comment in search.py preserves the rationale for future readers.

Merge choice. primary_top_3 + secondary_top_5 deduped by chunk.id, producing 5–8 unique chunks per call. Rationale: top-5 primary would make the expansion redundant on high-overlap queries (defeating the mechanism), while primary-top-3 guarantees the expansion always contributes to the final context window. Probe data (/tmp/probe_fix2_v2.py, throwaway) confirms this merge strategy surfaces pilot_005's target chunk (d0806d5da91d6026, chunk_index 63, "Anything TLS related ... use a service mesh or ingress controller for this") at position 6–8 in the merged list.
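The merge described here can be sketched as follows. This is an illustration under assumptions, not the code in agent_bench/tools/search.py: `Chunk`, `merge_dedup`, `expanded_search`, and `fake_retrieve` are hypothetical names, the refusal gate on the primary retrieval is omitted, and the fake retriever returns canned IDs.

```python
from dataclasses import dataclass
from typing import Callable

EXPANSION_SUFFIX = " not supported limitations cannot"


@dataclass(frozen=True)
class Chunk:
    id: str
    text: str


def merge_dedup(primary: list, secondary: list) -> list:
    """Primary top-3 followed by secondary top-5; first occurrence of each id wins."""
    seen = set()
    merged = []
    for chunk in primary[:3] + secondary[:5]:
        if chunk.id not in seen:
            seen.add(chunk.id)
            merged.append(chunk)
    return merged


def expanded_search(query: str, retrieve: Callable[[str], list]) -> list:
    """Two retrievals inside one call; `retrieve` stands in for the real retriever."""
    primary = retrieve(query)
    secondary = retrieve(query + EXPANSION_SUFFIX)
    return merge_dedup(primary, secondary)


def fake_retrieve(query: str) -> list:
    # Canned results: the expanded query surfaces three chunks the primary missed.
    if query.endswith(EXPANSION_SUFFIX):
        ids = ["n1", "a", "b", "n2", "n3"]
    else:
        ids = ["a", "b", "c", "d", "e"]
    return [Chunk(id=i, text=f"text {i}") for i in ids]


merged_ids = [c.id for c in expanded_search("networkpolicy mtls", fake_retrieve)]
print(merged_ids)
```

Because the secondary list always contributes five candidates, the merged context holds at least five and at most eight unique chunks.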

Opt-in flag, defaulting ON. SearchTool accepts negative_framing_expansion: bool = True. Default is the shipping configuration because the regression gate must measure the shipping behavior, not the no-op path. A False default would mean the gate validates an unused parameter, and a subsequent commit flipping the default would have no regression evidence. Kill switch is preserved via explicit False at construction if a future regression requires an A/B comparison.

Baseline reuse. The Fix 1 session's pre-edit JSONs (results/fastapi_preedit.json, results/k8s_preedit_pinned.json, both committed at 213da36) were measured under the currently-committed state of the repo: pinned gpt-4o-mini-2024-07-18, K8s threshold 0.015, FastAPI threshold 0.02, HEAD prompts.py with no clause, HEAD search.py with no expansion. The working tree verification confirms this state is unchanged. These JSONs are therefore reused as the Fix 2 pre-edit baseline and do not need to be re-measured. Only post-edit runs are required for the Fix 2 regression (~$0.02 saved).

Pre-committed tolerances.

| Metric | Pass criterion |
| --- | --- |
| P@5 | post-edit ≥ pre-edit − 0.02 |
| R@5 | post-edit ≥ pre-edit − 0.02 |
| Citation accuracy | post-edit ≥ pre-edit (hard gate: any drop blocks commit) |
| Mean tool_calls_made | post-edit ≤ pre-edit + 0.05 (design-correctness gate, see note) |
| Individual cap-hit | no question that used fewer than max_iterations=3 iterations pre-edit may hit the cap post-edit |

Note on the tool_calls gate. ≤ +0.05 is a design-correctness gate, not a performance gate. Fix 2's invariant is that both retrievals happen inside one SearchTool.execute call, so the LLM's iteration count is unchanged by construction. Any non-trivial movement in mean tool_calls_made indicates the design invariant is broken, e.g., expansion accidentally exposed as a separate tool, or the LLM observing two-call behavior and adapting its strategy. The gate fires on design violation, not on performance regression. The 0.05 absolute threshold absorbs legitimate run-to-run variance from non-determinism in the LLM even at temperature 0, without absorbing real iteration-count movement.
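
The pre-committed tolerances can be expressed as a small check. This is a sketch under assumptions: the dictionary keys, the `regression_gate` name, and the per-question iteration maps are hypothetical, and the result JSONs may be shaped differently. The K8s baseline numbers come from the baseline reference table in this entry; the v1 P@5 of 0.767 comes from the outcome section, with v1 R@5 held at the baseline purely for illustration because it is not reported.

```python
EPS = 1e-9  # guards the comparisons against float rounding


def regression_gate(pre, post, pre_iters=None, post_iters=None, max_iterations=3):
    """Return the list of pre-committed criteria that fired (empty list = pass)."""
    fired = []
    if post["p_at_5"] < pre["p_at_5"] - 0.02 - EPS:
        fired.append("P@5")
    if post["r_at_5"] < pre["r_at_5"] - 0.02 - EPS:
        fired.append("R@5")
    if post["citation_accuracy"] < pre["citation_accuracy"] - EPS:
        fired.append("citation_accuracy")  # hard gate: any drop fires
    if post["mean_tool_calls"] > pre["mean_tool_calls"] + 0.05 + EPS:
        fired.append("mean_tool_calls")  # design-correctness gate
    # Individual cap-hit: a question below the cap pre-edit may not reach it post-edit.
    for qid, before in (pre_iters or {}).items():
        after = (post_iters or {}).get(qid, before)
        if before < max_iterations and after >= max_iterations:
            fired.append(f"cap_hit:{qid}")
    return fired


k8s_pre = {"p_at_5": 0.800, "r_at_5": 1.000, "citation_accuracy": 1.000, "mean_tool_calls": 1.167}
k8s_v2 = dict(k8s_pre)                 # v2: every aggregate unchanged
k8s_v1 = dict(k8s_pre, p_at_5=0.767)   # v1: R@5 assumed at baseline (not reported)

v2_fired = regression_gate(k8s_pre, k8s_v2)
v1_fired = regression_gate(k8s_pre, k8s_v1)
```

Returning the list of fired criteria rather than a single boolean keeps the "surface the specific criterion that fired" rule cheap to follow.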

pilot_005 strict flip criterion (K8s-only, unchanged from Fix 1 gate):

  • keyword_hit_rate ≥ 0.60 against golden keywords ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]
  • Answer cites k8s_network_policies.md
  • Answer contains "service mesh" OR "ingress controller"
  • Answer does NOT begin with refusal phrasing
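
A sketch of how the four strict-flip checks could be evaluated against an answer string. The keyword-hit-rate definition (case-insensitive substring match over the golden keywords) and the `REFUSAL_OPENERS` tuple are assumptions made for this example; the entry does not spell out either, and the names `strict_flip` and `keyword_hit_rate` are hypothetical.

```python
GOLDEN_KEYWORDS = ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"]

# Hypothetical opener list; the real refusal-phrasing check may differ.
REFUSAL_OPENERS = ("documentation does not",)


def keyword_hit_rate(answer, keywords):
    # Assumed definition: share of keywords present as case-insensitive substrings.
    lowered = answer.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def strict_flip(answer):
    lowered = answer.lower()
    first_sentence = answer.split(". ", 1)[0].lower()
    return {
        "keyword_hit_rate": keyword_hit_rate(answer, GOLDEN_KEYWORDS) >= 0.60,
        "cites_source": "[source: k8s_network_policies.md]" in answer,
        "names_alternative": "service mesh" in lowered or "ingress controller" in lowered,
        "no_refusal_opener": not any(p in first_sentence for p in REFUSAL_OPENERS),
    }


hedged = (
    "The Kubernetes documentation does not provide specific instructions on "
    "configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the "
    "same namespace. For mTLS, it is generally recommended to use a service mesh "
    "[source: k8s_network_policies.md]."
)
crisp = (
    "Kubernetes NetworkPolicy does not support enforcing mutual TLS (mTLS) directly. "
    "The documentation states that anything TLS related should be handled using a "
    "service mesh or ingress controller [source: k8s_network_policies.md]."
)
hedged_result = strict_flip(hedged)
crisp_result = strict_flip(crisp)
```

Under these assumed checks an answer can satisfy the keyword, citation, and alternative criteria and still fail on its opening sentence alone.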

Baseline reference for the gate.

| Corpus | Pre-edit source | P@5 | R@5 | Citation | Mean tool_calls |
| --- | --- | --- | --- | --- | --- |
| FastAPI (27) | results/fastapi_preedit.json @ 213da36 | 0.585 | 0.679 | 1.000 | 1.111 |
| K8s (6 pilots) | results/k8s_preedit_pinned.json @ 213da36 | 0.800 | 1.000 | 1.000 | 1.167 |

Post-edit filenames (to be produced).

  • results/fastapi_postedit_fix2.json
  • results/k8s_postedit_fix2.json

If the gate passes: commit Fix 2 with search.py change, unit tests (including the tool-spec snapshot test), the two post-edit result JSONs, and this DECISIONS.md entry extended with the regression outcome.

If the gate fires: revert, document the failure mode, surface the specific criterion that fired. No tolerance relaxation — same discipline pattern as Fix 1 revert.

Fix 2 outcome — mechanism works, response-style criterion fired, reverted

Regression runs produced. Two post-edit runs on K8s (FastAPI not run — K8s findings gated the decision before API spend on the broader set):

| Run | Merge rule | File | Purpose |
| --- | --- | --- | --- |
| Fix 2 v1 | primary[:3] + secondary[:5] | results/k8s_postedit_fix2.json | Initial implementation |
| Fix 2 v2 | primary[:5] + secondary[:5] | results/k8s_postedit_fix2_merge_v2.json | Path A refinement after v1 failed P@5 on a metric-definition mismatch |

v1 findings. Aggregate: P@5 0.800 → 0.767 (Δ −0.033, FAILED the P@5 ≥ −0.02 tolerance). The failure traced to a merge-rule / metric-semantics interaction: retrieval_precision_at_k computes precision on retrieved_sources[:5], and with primary[:3] + secondary[:5] the first 5 entries were primary_top_3 + secondary_top_2. For pilot_005, secondary[1] was k8s_pods.md (chunk_index 40, surfaced because the reranker matched its "localhost communication" content against the expanded query). That single off-source chunk in position 5 dropped P@5 from 1.00 to 0.80 for pilot_005 and similarly for pilot_006. Iteration invariant held (tool_calls 1.167 → 1.167). Citation accuracy held (1.000 → 1.000). Target chunk (d0806d5da91d6026, "Anything TLS related") reached the LLM context for pilot_005 at merged position 7.

Path A refinement (merge v2). Change primary[:3] + secondary[:5] → primary[:5] + secondary[:5]. Rationale: primary_top_5 is preserved in positions 1–5 by construction, so P@5 computed on ranked_sources[:5] is unchanged from the no-expansion baseline. Expansion chunks land in positions 6–10. Target chunk still reaches LLM context (position 9 for pilot_005). This is an implementation refinement, not a tolerance relaxation: the pre-committed gate thresholds stand; only the merge rule was adjusted to respect the metric's window semantics.
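
The window interaction can be reproduced with a toy example. This sketch assumes P@5 is the fraction of the first five retrieved sources that appear in the expected set; `precision_at_k` here is an illustrative stand-in, not the project's retrieval_precision_at_k. The source lists are made up except for the one detail reported above: k8s_pods.md sits at secondary[1]. Deduplication is ignored because the toy lists are treated as distinct chunks.

```python
def precision_at_k(retrieved_sources, expected_sources, k=5):
    # Assumed definition: hits within the first k entries, divided by window size.
    window = retrieved_sources[:k]
    if not window:
        return 0.0
    return sum(1 for s in window if s in expected_sources) / len(window)


expected = {"k8s_network_policies.md"}

# Illustrative source lists, one entry per retrieved chunk.
primary = ["k8s_network_policies.md"] * 5
secondary = ["k8s_network_policies.md", "k8s_pods.md"] + ["k8s_network_policies.md"] * 3

merged_v1 = primary[:3] + secondary[:5]  # first five = primary_top_3 + secondary_top_2
merged_v2 = primary[:5] + secondary[:5]  # first five = primary_top_5

p5_v1 = precision_at_k(merged_v1, expected)
p5_v2 = precision_at_k(merged_v2, expected)
```

In v1 the off-source chunk lands in position 5 and is counted by the metric; in v2 it lands in position 7, outside the window, while still reaching the LLM context.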

v2 findings — perfect metric preservation, but strict-flip fails on response style.

Aggregate:

| Metric | Pre-edit | Fix 2 v2 | Delta |
| --- | --- | --- | --- |
| P@5 | 0.800 | 0.800 | 0.000 |
| R@5 | 1.000 | 1.000 | 0.000 |
| KHR | 0.806 | 0.806 | 0.000 |
| Citation accuracy | 1.000 | 1.000 | 0.000 |
| Mean tool_calls_made | 1.167 | 1.167 | 0.000 |

Every aggregate metric literally unchanged. Per-question deltas: zero on every metric, every question. The design invariant (iteration budget unchanged, tool schema unchanged, refusal gate behavior unchanged) holds perfectly.

But pilot_005 strict flip fails on the refusal-phrasing criterion. Post-edit answer:

"The Kubernetes documentation does not provide specific instructions on configuring a NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace. For mTLS, it is generally recommended to use a service mesh or other proxy solutions, as NetworkPolicy alone does not handle TLS configurations directly [source: k8s_network_policies.md]."

The answer substantively contains the documented negative with citation. But it opens with "The Kubernetes documentation does not provide specific instructions...", the exact refusal-phrasing opener the strict-flip criterion was pre-committed to reject. The criterion exists because the brand is honest evaluation: an answer that opens apologizing that the documentation "does not provide specific instructions" reads, to a technical reviewer, like the system failed to find the answer and is papering over the gap, even though the facts and citation are present. The criterion fired as designed.

Compare to Fix 1 post-edit answer (from 213da36 evidence):

"Kubernetes NetworkPolicy does not support enforcing mutual TLS (mTLS) directly. The documentation states that anything TLS related should be handled using a service mesh or ingress controller, rather than through NetworkPolicy [source: k8s_network_policies.md]."

Fix 1's answer asserts a fact about NetworkPolicy ("does not support"); Fix 2's answer asserts a fact about the documentation ("does not provide instructions"). The first forecloses the capability; the second leaves open whether the capability exists somewhere the system didn't see. That distinction is load-bearing for any grounded-refusal narrative, and it separates a system that handles documented negatives crisply from one that hedges around them.

Diagnosis. Fix 2's mechanism successfully gets the target chunk into the LLM's context window — the retrieval side of the problem is solved. What Fix 2 cannot provide is explicit guidance on how to phrase the documented negative once the chunk is present. Fix 1's prompt clause was doing that guidance work; removing the clause and relying on the LLM's unaided response style produces a hedging answer because the LLM, seeing both NetworkPolicy-spec content and a TLS limitation bullet, defaults to contextual hedging rather than crisp assertion.

Fix 2 is therefore not an alternative to Fix 1's prompt clause — it is a prerequisite. Fix 2 guarantees the chunk reaches context; a future "Fix 2 + targeted prompt clause" stack could resolve both the retrieval gap and the response-style gap without Fix 1's over-firing problem, because the clause would no longer need to direct the agent to do a follow-up search (Fix 2 handled that). The over-firing on compound questions that broke Fix 1 was caused by the agent deciding to do extra search iterations under LLM judgment; if the expansion already happened deterministically inside the first tool call, the clause has less work to do and may not trigger the second-LLM-call pattern at all. Speculative and not for this session. Future work item.

Gate verdict: failed on pilot_005 strict flip criterion. Reverting, same Fix-1 pattern.

What this commit contains.

  • agent_bench/tools/search.py reverted to HEAD (no Fix 2 code changes)
  • tests/test_tools.py retains the MockChunk.id hygiene fix (the real Chunk class has id; mock should match the real API for future test authors)
  • tests/test_tools.py adds TestSearchToolSpecSnapshot: a general-purpose guard that freezes SearchTool's LLM-facing contract (name, description, parameters). The lesson from Fix 2 is that any future refactor exposing internal SearchTool state to the LLM would break iteration-budget invariants — the snapshot test catches that at test time, independent of whether Fix 2 lands.
  • Two regression evidence JSONs: results/k8s_postedit_fix2.json (v1, the P@5 failure) and results/k8s_postedit_fix2_merge_v2.json (v2, the strict-flip failure). Retained as the measurement trail behind the revert decision.
  • This DECISIONS.md entry (pre-committed gate + outcome + revert narrative).
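
A sketch of what a tool-spec snapshot guard can look like. Everything here is hypothetical: the `SearchTool` stand-in, its `spec()` method, and the literal name, description, and parameter schema are invented for illustration and are not copied from tests/test_tools.py or agent_bench/tools/search.py. Only the `negative_framing_expansion: bool = True` constructor argument comes from the Fix 2 design above.

```python
import json


class SearchTool:
    """Hypothetical stand-in for the real tool class."""

    name = "search_documents"
    description = "Search the documentation corpus and return the most relevant chunks."
    parameters = {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    }

    def __init__(self, negative_framing_expansion: bool = True):
        # Internal behavior switch; it must never leak into the LLM-facing spec.
        self._negative_framing_expansion = negative_framing_expansion

    def spec(self) -> dict:
        return {
            "name": self.name,
            "description": self.description,
            "parameters": self.parameters,
        }


# Frozen copy of the LLM-facing contract; any change must be a deliberate edit here.
FROZEN_SPEC = {
    "name": "search_documents",
    "description": "Search the documentation corpus and return the most relevant chunks.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def test_search_tool_spec_snapshot():
    # The spec must be identical regardless of internal construction flags.
    for flag in (True, False):
        assert SearchTool(negative_framing_expansion=flag).spec() == FROZEN_SPEC


test_search_tool_spec_snapshot()
serialized_spec = json.dumps(SearchTool().spec(), sort_keys=True)
```

Comparing against a literal rather than against the class attributes is the point of the pattern: a refactor that changes the class also has to change the frozen copy, which makes the contract change visible in review.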

What this commit does NOT contain. No changes to agent_bench/tools/search.py, agent_bench/core/prompts.py, or configs/default.yaml. Both Fix 1 (prompt clause) and Fix 2 (SearchTool expansion) have been attempted and reverted this session. Three commits of progress nonetheless: 125dac0 (threshold calibration, empirical), 5c1f49f (prep bundle: model pin + fastapi wire + Fix 1 pre-committed tolerances), 213da36 (Fix 1 revert narrative). The threshold calibration and model pin are real, shipped, measurement-grounded infrastructure changes. The two fix attempts are documented learning that shapes the future direction.

grounded_refusal metric reads answer text, not retrieved sources — 2026-04-14

Context. Week 1 step 5 authoring (25-question K8s golden set). Two flavor-A out-of-scope questions (k8s_004 Jaeger sidecar, k8s_024 Envoy xDS ADS) surfaced a pre-existing bug in the grounded_refusal metric during the functional check.

Bug 1 — wrong signal. The metric's docstring said it checks whether the answer correctly refuses AND cites no sources, but the implementation was checking len(response_sources) == 0 where response_sources is the retrieved-sources list. Real agents retrieve candidates on any non-trivial OOS query (the grounded-refusal gate at tool level only catches the thinnest queries), inspect the candidates, find nothing relevant, and refuse in the answer text without citing anything. Checking retrieval emptiness flagged those correct refusals as failures. Fix: inspect the answer text for [source: X.md] citations via regex; drop the response_sources parameter from the signature entirely.

This was a silent false negative on all 5 fastapi out-of-scope questions (q008–q010, q026–q027) which all correctly refuse but were being marked grounded_refusal=False. Aggregate refusal_rate in report.py shifts by the resulting 5-question delta; any historical comparison to pre-fix fastapi numbers needs to acknowledge this.

Bug 2 — metric coverage gap surfaced during 25-question authoring. grounded_refusal_rate recognized "does not contain information" phrasing (in refusal_phrases list) but missed "not in the {corpus_label} documentation" phrasing — the exact shape taught by the system prompt at core/prompts.py:17-18. The LLM produced the canonical form on some questions and the phrase-list form on others; the metric inflation/deflation was non-deterministic. Fix: narrow regex \bnot in the\b[^.]{0,60}\bdocumentation\b added alongside phrase-list matching.

Rejected alternative. Substring "not in the" would produce false positives on valid-answer phrasing — "the rate limit is not in the same scope as the request timeout", "the flag is not in the 1.28 release; it landed in 1.29", "this value is not in the default range" — all of which are legitimate retrieval answers with conditional or scope-limiting language, not refusals. Honest evaluation cannot afford a metric that silently counts these as grounded refusals.

Tests. Two unit tests pin both directions: test_canonical_refusal_phrasing_recognized covers the positive case ("The answer is not in the Kubernetes documentation"), and test_not_in_the_is_not_substring_refusal covers the negative case ("The rate limit is not in the same scope as the request timeout"). The negative test is the load-bearing one — without it, a future refactor could silently widen the matcher back to substring and pass all existing tests. The negative test pins design intent.
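
A sketch of the two fixes together, using the regex quoted above. The function names, the single-entry phrase list, the citation pattern, and the case-insensitive flag are assumptions for illustration; the real metric in metrics.py also takes a category argument that is omitted here.

```python
import re

# Narrow matcher from this entry: "not in the ... documentation" within one sentence.
NOT_IN_DOCS_RE = re.compile(r"\bnot in the\b[^.]{0,60}\bdocumentation\b", re.IGNORECASE)

# Assumed citation shape, based on the [source: X.md] form used in answers.
CITATION_RE = re.compile(r"\[source:\s*[^\]]+\.md\]")

# Only the phrase named in this entry; the real list is presumably longer.
REFUSAL_PHRASES = ("does not contain information",)


def looks_like_refusal(answer):
    lowered = answer.lower()
    return any(p in lowered for p in REFUSAL_PHRASES) or bool(NOT_IN_DOCS_RE.search(answer))


def grounded_refusal_sketch(answer):
    # Bug 1 fix: inspect the answer text for citations, not the retrieved-sources list.
    return looks_like_refusal(answer) and CITATION_RE.search(answer) is None


positive = "The answer is not in the Kubernetes documentation"
negative = "The rate limit is not in the same scope as the request timeout"
positive_result = grounded_refusal_sketch(positive)
negative_result = grounded_refusal_sketch(negative)
```

The `[^.]{0,60}` gap is what keeps the match inside a single sentence, so "not in the" followed much later by an unrelated mention of documentation does not count as a refusal.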

Scope bound. This is a metric correctness fix, not a threshold change. The 0.015 refusal-gate threshold (calibrated in 125dac0 against the 6-question pilot) is unchanged by this commit. Whether the corrected metric shifts the optimal threshold against the full 25-question set is a question for the threshold-sweep session, not this authoring session.

Parallel tracks / deferred items — 2026-04-14

Tracked list of work items that are deferred to parallel sessions. Each item has a reason for deferral and a rough scope boundary so the session that picks it up has the context to pre-commit tolerances and decision criteria before measuring.

  1. routes.py:552 audit-logger semantics unification. The serving layer's audit record field still uses the pre-fix grounded_refusal = not bool(sources) expression, which disagrees with the evaluation metric's answer-text-based definition. Not surfaced to the dashboard (audit log only), but external reviewers who reference audit records for runtime verification would see a different definition than the benchmark claims. Fix: call grounded_refusal(answer, category) from metrics.py directly. When this lands, the "grounded_refusal metric" DECISIONS.md entry above should get a one-line addendum noting the unification.

  2. Full 25Q threshold sweep → production-target refusal_threshold for K8s. The 25Q set exists, the metric is correct. Sweep against the full set, compare to pilot-floor 0.015, pick the production-target value, update configs/default.yaml placeholder comment. Pre-commit before measuring: sweep range, decision criteria, tolerances. Do not entangle with flavor-B response-style work below — those are independent axes.

  3. Flavor-B response-style class (pilot_005 + k8s_022). Two independent reproductions of "LLM refuses when documented negative is in retrieved context". Retrieval is healthy on both; the gap is prompting. Future session: Fix 2 (counterfactual-query expansion in SearchTool) + targeted prompt clause stacked — previously speculative in the Fix 2 revert entry, now addresses a documented reproducible class. Two reproductions, not one-off.

  4. Serving-migration deferral. Tied to external references to the counterfactual-query fix. Unchanged from prior sessions.

  5. agent-bench → refusal-bench rename: CLOSED 2026-04-14. Decision: keep agent-bench, reframe via tagline. The original concern was name collision with AgentBench (Liu et al., ICLR 2024, 1000 citations). Due-diligence at launch time: the name is agent-bench (hyphenated) vs. AgentBench (camelcase), which are distinct identifiers across GitHub, arXiv, and PyPI. The two projects target different audiences (LLM-as-agent capability vs. RAG+refusal benchmark) and any reviewer reaching the repo via LinkedIn or CV sees the scope in the README within seconds. Rename cost is substantial (350 internal references across ~60 files, two external account renames, one HF Space URL break with no redirect) for a naming-precision benefit that isn't supported by the actual scope: the benchmark measures retrieval, grounding, multi-hop, citation accuracy, and refusal as seven axes, not refusal alone. Tagline reframe captures the honest-evaluation positioning without the rename cost:

    "A RAG benchmark built from primitives, with honest evaluation of retrieval, refusal, and grounded citation."

    HF Space rename (Nomearod/agentbench → Nomearod/agent-bench for GitHub-name consistency) is a separate, smaller follow-up deferred approximately one week. Reason: several job applications submitted the preceding week reference the current HF URL (nomearod-agentbench.hf.space); renaming the Space now would break those inbound links with no HF-side redirect. The rename absorbs cleanly once the application wave lands and the reference window expires. Until then the README, dashboard, and DECISIONS.md continue to reference the current agentbench URL; launch-adjacent work (Post #1, screenshots, cold-start measure) uses the current URL and will be updated in a single small follow-up commit when the rename happens.

  6. OpenAI snapshot drift bisection. Mar 25 → Apr 12 P@5 slide; the model pin at 5c1f49f (gpt-4o-mini-2024-07-18) removed the ongoing drift risk, so any future measurement is apples-to-apples. The original bisection is still unresolved but cheap at this point: tractable whenever there is session capacity, low urgency because the pin protects forward runs.

  7. Fix 2 revert commit SHA missing from the Fix 2 outcome entry. The "Fix 2 outcome — mechanism works, response-style criterion fired, reverted" DECISIONS.md entry describes the revert narratively but does not cite the revert commit's SHA (post-rewrite: 27c2e17, "docs(eval): Fix 2 SearchTool query expansion — attempted and reverted"). Add retroactive SHA reference in the next docs pass. Not urgent; noted so the narrative-without-SHA pattern does not spread to other entries. Lesson going forward: prefer explicit SHAs over positional references like "this commit" / "commit above" in DECISIONS.md entries, because positional references do not survive history rewrites as robustly as SHA references do.

K8s refusal_threshold sweep against 25-question golden — 2026-04-14

Override notice. This sweep ran in the same session as the 25-question authoring + grounded_refusal metric fix (4454894), after I explicitly flagged that the parallel-tracks guidance from earlier in the session recommended waiting for a fresh session with pre-commitment discipline. The user issued an explicit override: "proceed on best-judgment sweep range and criteria" — logged here for audit trail. The pre-commitment frame below was drafted BEFORE running any sweep value, not after. The decision criteria were locked before the first data point was observed, not retrofitted.

Sweep grid. 4 threshold values: 0.010, 0.015 (already measured in .cache/eval_k8s_full25_postfix.json, the post-metric-fix run from 4454894), 0.020, 0.025.

  • 0.010: one tick below current calibration; sanity-check floor.
  • 0.015: current calibration (pilot-floor, one tick below pilot_005's 0.01639 max_score).
  • 0.020: matches legacy FastAPI threshold and the original provisional K8s default before the 125dac0 calibration.
  • 0.025: one tick above legacy; exploration of whether aggressive OOS short-circuiting is worth the correctness risk.

Decision criteria (pre-committed).

  1. OOS refusal must hold. Both k8s_004 (Jaeger) and k8s_024 (Envoy xDS) must retain grounded_refusal=True at the chosen threshold — whether the gate fires at the tool level or the LLM refuses after inspecting context doesn't matter, only that the metric reports True.
  2. Retrieval recall must not degrade. Each retrieval-category question's R@5 at the chosen threshold must be ≥ its R@5 at 0.015 (the post-fix-25Q baseline) with a noise tolerance of at most ONE question dropping by at most 0.20. Two or more drops, or any drop > 0.20, disqualifies the value.
  3. Citation accuracy must hold. All questions' citation_accuracy must be ≥ 0.95 at the chosen threshold. One question at 0.80 is noise-tolerated; two or more is a hard stop.
  4. k8s_022 (flavor-B) retrieval must remain at R@5=1.0. The gap is prompting-side, not retrieval-side; any threshold that breaks the already-working retrieval on flavor-B questions is a regression.
  5. Pick the highest threshold that satisfies 1–4. Rationale: a higher threshold short-circuits more OOS queries at the tool level, saving a retrieval round trip and an LLM call — this is a real latency and token-cost win when the correctness is held.
  6. Tie-break. If multiple values all satisfy 1–4, prefer the value closest to a clean round number (0.020 over 0.018) for documentation clarity.
  7. Floor. If no threshold > 0.015 satisfies 1–4, keep 0.015. No threshold < 0.015 will be chosen regardless — sub-0.015 is strictly less protective than the pilot-floor.

Scope bound. K8s only; FastAPI's refusal_threshold: 0.02 is unchanged. The flavor-B response-style gap (parallel track #3) is NOT a sweep variable — changing the threshold does not fix LLM phrasing; that's the Fix 2 + prompt guidance stacked experiment the parallel-tracks list already defers.

Measured results. All four runs use the post-metric-fix pipeline (grounded_refusal metric from 4454894), deterministic mode, gpt-4o-mini-2024-07-18, same retriever config.

| threshold | avg R@5 | OOS refusal | gate fired on | broken retrieval |
| --- | --- | --- | --- | --- |
| 0.010 | 0.957 | 2/2 | | |
| 0.015 | 0.957 | 2/2 | | |
| 0.020 | 0.870 | 2/2 | k8s_006, k8s_007, k8s_024 | k8s_006, k8s_007 (R@5=0.00) |
| 0.025 | 0.913 | 2/2 | k8s_004, k8s_007, k8s_024 | k8s_007 (R@5=0.00) |

Structural finding: LLM query variance makes max_scores non-deterministic. At 0.020, k8s_006 (ConfigMap, simple) gate-fired → empty retrieval → R@5=0.00. At 0.025, k8s_006 did NOT gate-fire → 5 sources → R@5=1.00. A higher threshold producing fewer gate-fires is physically impossible if retrieval is deterministic — the SearchTool receives different queries across runs because the orchestrator issues LLM-generated queries, and the same question can produce different top-k max_scores run-to-run. k8s_006's max_score for the query the LLM chose lives somewhere around the 0.018–0.025 boundary; which side of any given threshold it lands on depends on which query the LLM wrote.

This means any threshold above 0.015 is structurally fragile, not merely "failed on this run." Even if a run at 0.018 passed, a future run could gate-fire on k8s_006 or k8s_007 because the query is non-reproducible. The production threshold needs to sit below all legitimate simple-question max_scores with enough margin to absorb LLM query variance.

Decision: keep refusal_threshold: 0.015.

  • 0.010: meets all criteria, identical measured metrics to 0.015 (avg R@5=0.957, OOS refusal 2/2, no citation fails). Not chosen: lowering strictly weakens the gate's ability to catch low-confidence retrievals without improving any measured metric.
  • 0.015: chosen. Meets all criteria and is the highest value that does not degrade retrieval — which is the definition of the correct refusal-gate threshold. Preserving the gate's signal is the gate's purpose; 0.015 gives maximum gate strength without cost, 0.010 gives the same measurable behavior with less gate signal, so 0.015 dominates.
  • 0.020: breaks TWO retrieval questions (k8s_006, k8s_007); disqualified per criterion 2.
  • 0.025: breaks ONE retrieval question in this run (k8s_007) but the non-determinism finding means a future run could break more. Even ignoring non-determinism, still disqualified by the citation-accuracy-equivalent drop on k8s_007.
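
Criterion 2 and the selection rule (criteria 5 and 7) can be sketched as follows. The function names and data layout are hypothetical. The per-question baseline R@5 of 1.00 for k8s_006 and k8s_007 at 0.015 is an assumption inferred from the statement that both retrieve their expected sources when the gate does not fire; the pass/fail dispositions are the ones listed above. The round-number tie-break (criterion 6) is left out.

```python
EPS = 1e-9  # float-rounding guard


def recall_ok(baseline, candidate, max_drop=0.20, max_dropped_questions=1):
    """Criterion 2: at most one question may drop, and by at most max_drop."""
    drops = [baseline[q] - candidate.get(q, 0.0) for q in baseline]
    drops = [d for d in drops if d > EPS]
    return len(drops) <= max_dropped_questions and all(d <= max_drop + EPS for d in drops)


def pick_threshold(passes, floor=0.015):
    """Criteria 5 and 7: highest passing value, never below the pilot floor."""
    eligible = [t for t, ok in passes.items() if ok and t >= floor]
    return max(eligible, default=floor)


# Assumed per-question baseline at 0.015 (only the two affected questions shown).
baseline_r5 = {"k8s_006": 1.00, "k8s_007": 1.00}
run_0020 = {"k8s_006": 0.00, "k8s_007": 0.00}  # both gate-fired at 0.020
run_0025 = {"k8s_006": 1.00, "k8s_007": 0.00}  # only k8s_007 gate-fired at 0.025

passes = {
    0.010: True,
    0.015: True,
    0.020: recall_ok(baseline_r5, run_0020),
    0.025: recall_ok(baseline_r5, run_0025),
}
chosen = pick_threshold(passes)
```

Note that a single question dropping to 0.00 already exceeds the 0.20 per-question tolerance, so 0.025 fails criterion 2 on this run even with only one broken question.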

Corpus characteristic finding. The 0.020 default inherited from FastAPI breaks on K8s because K8s retrieval score distributions are lower for "easy" questions. k8s_006 ("What is a ConfigMap?") and k8s_007 ("What does a Kubernetes Job do?") are both type: simple with clean single-source expected answers — exactly the cases where BM25+embedding scores should be highest. They land at max_scores in the ~0.018 range, below the FastAPI-calibrated 0.020 default. This is not an authoring bug — both questions retrieve their expected_sources correctly when the gate doesn't fire. It's a corpus characteristic: K8s documentation has more topic-overlap across pages than FastAPI, diluting top-k concentration.

The 25-question set exposed this because the 6-question pilot had no simple questions with low max_scores — the pilot was drawn from retrieval-stressful areas (comparison, multi-hop, flavor-B). The 25-question authoring deliberately added simple questions to hit the CRAG distribution target (6 simple, 5–6 target), and those simple questions revealed the corpus-characteristic floor.

Config change. configs/default.yaml corpora.k8s.refusal_threshold comment updated to reference this sweep. Value unchanged at 0.015.

Not in scope. (a) Adding retry-with-query-variance to the SearchTool to reduce max_score variance — separate session, affects other corpora. (b) Tuning FastAPI's threshold against its golden set — the FastAPI default was empirically fine on its own 30Q set and is not a documented regression. (c) Fixing the k8s_015 R@5=0.50 value observed across all threshold runs — pre-existing authoring state from 4454894, tracked separately if it becomes a concern on future runs.

Narrative summary. Session hypothesis: pilot_005 is a counterfactual-query-expansion problem. Session evidence: the hypothesis is correct on retrieval — the target chunk is reachable via negative-framing queries and Fix 2 surfaces it deterministically with zero iteration-budget impact. Session evidence also shows the hypothesis is incomplete — retrieval-only fixes cannot close the response-style gap, because the LLM under unaided prompting hedges when a documented negative is surrounded by unrelated topical content. A future session exploring Fix 2 + targeted prompt guidance stacked is the natural next experiment; this session's pilot-first discipline has been preserved against two distinct pre-committed gates, both firing for the reasons they were designed to catch.

Credential-exposure incident and history rewrite — 2026-04-14/15

Summary. During Week 1 work on the feat/user-friendly-landing-page-live-dashboard branch, an instruction.txt file containing plaintext OpenAI and Anthropic API keys was accidentally committed at pre-rewrite SHA 2b3150f (style: fix ruff lint — import sorting, line length) and removed from the working tree in a later commit (pre-rewrite SHA 3a2c5ef, security: remove instruction.txt containing plaintext credentials). The removal did not clean git history — the keys remained accessible via git show 2b3150f:instruction.txt in local history.

Discovery. The issue was discovered when GitHub push protection rejected the first push of the branch to the origin remote, flagging the credentials via its secret-scanning system. The branch had never been pushed to any public remote prior to the rewrite; the detection fired on the very first push attempt, which is the correct moment for secret-scanning to act. Honest credit to the tooling: GitHub's push protection did exactly what it was designed to do, and the alternative failure mode (silent push of real credentials to a public repo) did not occur.

Immediate actions, in order.

  1. Key rotation. Rotated both OpenAI and Anthropic keys at the respective provider dashboards, revoking the exposed values immediately. Rotation was confirmed before any git operation ran — the reasoning was that the keys were exposed on the local disk regardless of whether they ever made it to a public remote, so the exposure window needed to be closed first.

  2. Unauthorized-use check. Verified billing/usage dashboards on both OpenAI and Anthropic for the exposure window (from commit 2b3150f landing until rotation). No unauthorized activity observed on either account.

  3. Local .env update and smoke test. Updated local .env with the new keys. Verified both worked via minimal API calls that return only HTTP status codes (never the key values themselves): GET /v1/models for OpenAI (200), POST /v1/messages with a 1-token request for Anthropic (200). Total verification cost: <$0.0001.

  4. Repository backup. Before running any history-rewriting command, backed up the entire repository via rsync -a to /Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts>, excluding only .mypy_cache and .cache (both derivative, regenerable, and explicitly .gitignore'd). The backup preserved .git/, all four worktree state files under .git/worktrees/, the .worktrees/ checkouts themselves, and all tracked source files. The backup is the safety net if the rewrite had gone wrong in any way; this session never needed to consult it.

  5. History rewrite via git filter-repo. Ran git filter-repo --path instruction.txt --invert-paths --force on the main clone. The --force flag was required because filter-repo's default safety check refuses to run on non-fresh clones; the backup step above mitigates the risk that this flag is usually guarding against. 186 commits were parsed and rewritten in ~2.4 seconds; filter-repo's internal repacking completed in an additional ~5 seconds. The origin and hf remotes were automatically unset by filter-repo as its standard safety behavior (and restored from a saved file before the push).

  6. Dropped empty commit. Pre-rewrite commit 3a2c5ef (which removed instruction.txt from the working tree but did not clean history) became empty after filter-repo stripped the file from all prior commits and was dropped automatically. This is correct filter-repo behavior: the commit's only net effect was to remove a file that no longer exists in any predecessor, so post-rewrite it has no content change and is elided from the linear history. The total commit count went from 186 → 185. Pre-rewrite SHA 3a2c5ef maps to 00000...00000 in .git/filter-repo/commit-map, indicating the drop. The dropped SHA was not referenced anywhere in DECISIONS.md, so the drop had zero audit-trail impact.

  7. Multi-layer verification sweep. Ran six checks across every location where the credentials could still be present: (a) git log --all --full-history -- instruction.txt returned empty; (b) git rev-list --all --objects | grep instruction.txt returned 0 matches; (c) git reflog --all was empty after git reflog expire --expire=now --all; (d) git fsck --unreachable returned clean; (e) git stash list was empty; (f) a precise key-value regex scan across all blobs in the rewritten object database (sk-[A-Za-z0-9]{30,}, sk-ant-[A-Za-z0-9]{20,}, and env-var-assignment patterns) found 23 matches, all verified to be non-secret content — specifically: 15 historical README.md blobs containing the documentation placeholder ANTHROPIC_API_KEY=sk-ant-... (with three literal dots), 7 historical docs/provider_comparison.md blobs with the same documentation placeholder pattern, and 1 tests/test_output_validator.py blob containing test fixtures that intentionally use mock key-shaped strings to verify the output-validator's secret-redaction logic. The precise scan is a meaningful check: it demonstrates that the exposure was isolated to instruction.txt and did not spread via copy-paste of the key values into other files before removal.

  8. Worktree walk. All four worktrees (feat-infra-sprint, feature-grounded-refusal, langchain-baseline, security-hardening) were checked for instruction.txt history pollution and for uncommitted changes. All four were clean — no pollution in any branch's history (filter-repo operates on all refs in a shared .git/, so the worktrees were reached through the main clone's object database) and no local dirty state in any working tree. No worktree deletion or recreation was needed.

  9. DECISIONS.md SHA remap. The filter-repo operation rewrote every commit's SHA downstream of the first rewritten commit. This broke every explicit SHA reference in DECISIONS.md because those references pointed to pre-rewrite SHAs that no longer exist. The remap used .git/filter-repo/commit-map as the authoritative SHA-based mapping (not message-based pairing, which would have been vulnerable to duplicate-message ambiguity — 2 pairs of commits in the pre-rewrite history did in fact have identical messages, though neither was in the substitution set). Four unique old SHAs were remapped across 18 substitution sites:

    | OLD (pre-rewrite) | NEW (post-rewrite) | Commit role |
    | --- | --- | --- |
    | bd2b913 | 213da36 | Fix 1 counterfactual prompt clause revert |
    | b97f00f | 125dac0 | K8s refusal_threshold 0.02 → 0.015 calibration |
    | 77017db | 5c1f49f | pin gpt-4o-mini snapshot + wire fastapi golden |
    | 526be18 | 4454894 | Week 1 step 5 — 25Q golden + grounded_refusal fix |

    Every message matched exactly across the old→new pairing; no new SHA prefix collides with any old SHA prefix; post-remap grep confirmed zero remaining references to any old SHA.
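
A sketch of a SHA-based remap along the lines of step 9. It assumes the commit-map file is a header line followed by one `<old-sha> <new-sha>` pair of full 40-character hashes per line, with an all-zero new hash marking a dropped commit, as described in step 6. The function names are hypothetical and the hashes below are synthetic, not the repository's.

```python
import re

ZERO_SHA = "0" * 40  # marks a commit that was dropped by the rewrite


def load_commit_map(text):
    """Parse commit-map text into {old_full_sha: new_full_sha}, skipping the header."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] != "old":
            mapping[parts[0]] = parts[1]
    return mapping


def remap_short_shas(document, commit_map, short_shas, length=7):
    """Replace each short old SHA with the short form of its rewritten SHA.

    Returns (new_document, number_of_substitutions)."""
    replacements = {}
    for short in short_shas:
        matches = [old for old in commit_map if old.startswith(short)]
        if len(matches) != 1:
            raise ValueError(f"{short}: expected exactly one match, found {len(matches)}")
        new = commit_map[matches[0]]
        if new == ZERO_SHA:
            raise ValueError(f"{short}: commit was dropped by the rewrite")
        replacements[short] = new[:length]
    if not replacements:
        return document, 0
    # Single pass over the text, so a new SHA is never re-substituted.
    pattern = re.compile(r"\b(" + "|".join(re.escape(s) for s in replacements) + r")\b")
    return pattern.subn(lambda m: replacements[m.group(1)], document)


# Synthetic example data.
COMMIT_MAP_TEXT = "\n".join([
    "old                                      new",
    "1111111" + "a" * 33 + " " + "2222222" + "b" * 33,
    "3333333" + "c" * 33 + " " + ZERO_SHA,
])
commit_map = load_commit_map(COMMIT_MAP_TEXT)
doc_before = "calibrated in 1111111; see 1111111 for the sweep."
doc_after, sites = remap_short_shas(doc_before, commit_map, ["1111111"])
```

Keying on the hash rather than the commit message avoids the duplicate-message ambiguity mentioned in step 9, and refusing to remap a dropped commit forces that case to be handled by hand.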

Exposure scope assessment. The branch had never been pushed to any public remote prior to the rewrite. The credentials existed in:

  • Local git history at /Users/zenith/Desktop/agent-bench/.git/ (cleaned)
  • Four worktree clones sharing the same .git/ (cleaned via the main repo)
  • The rsync backup at /Users/zenith/Desktop/agent-bench.pre-filter-repo-backup-<ts> (to be deleted after this commit and test suite confirm the rewrite is correct)

No external exposure via GitHub, HF Spaces, or any other shared system occurred. No cached CI artifacts contain the keys because CI only runs on pushed branches and this branch was never pushed. No forks or clones exist outside the local machine. GitHub's push-protection detection itself touched the key strings during the rejected push attempt, but GitHub's secret scanning is trusted infrastructure and the rejection is the good outcome, not an additional exposure event.

Why this entry exists. Credential hygiene failures are worth documenting, not hiding. A reviewer who reads this entry sees a developer who: made a mistake, caught it via automated tooling working as designed, rotated keys before touching git, rewrote history surgically with a backup as the safety net, verified the rewrite across six independent checks, and preserved audit-trail integrity through the SHA remap. The honest-evaluation brand extends to credential-handling incidents. The alternative, pretending this didn't happen or silently unblocking the secret-scanning rejection to push exposed values to a public repo, would be a strictly worse outcome for both security posture and brand credibility.

Procedural lessons for DECISIONS.md going forward. Prefer explicit commit SHAs over positional references like "this commit" or "commit above" — positional references do not survive history rewrites as robustly as explicit SHAs do. The "Fix 2 outcome" entry above was identified during this incident as missing an explicit SHA reference to the Fix 2 revert commit (post-rewrite SHA 27c2e17); this is tracked as parallel-tracks item #7 for a retroactive fix in the next docs pass.

Round 2 — Google API key format in a test fixture

After the round-1 rewrite was complete and the feature branch had been pushed to origin for the first time, GitHub secret scanning raised a second alert (alert #1, secret_type: google_api_key) against tests/test_output_validator.py line 152 at pre-round-2 commit 8ebe3964af7d (security: fail-closed on secret extraction and env var leakage). The alert was on a test fixture inside a @pytest.mark.parametrize list, structurally consistent with the other fake fixtures in the same list (OpenAI sk-test123, Anthropic sk-ant-xyz, AWS AKIAIOSFODNN7EXAMPLE). The Google fixture, however, was 35 chars after the AIza prefix and matched both GitHub's detection pattern and the output validator's own detection regex exactly.

Disambiguation. Asked whether the string was a hand-typed fake or a real-leaked Google API key, the developer confirmed: (1) yes, a Google API key had been created at some point in a GCP or Google AI Studio context unrelated to this project, and (2) no, the string on line 152 was not recognizably hand-typed. Combined with the structural inconsistency against the other clearly-fake fixtures in the same parametrize list, the safe interpretation was to treat it as potentially real and rotate + rewrite rather than dismiss as false positive.

Actions, in order.

  1. Google API key rotation. All Google API keys on the developer's GCP and Google AI Studio accounts rotated at the provider dashboards, regardless of which specific key matched line 152, because the specific match was not known with certainty. Rotation confirmed before any git operation.

  2. Billing/activity check. Verified Google Cloud billing and API activity on every project for the window since commit 8ebe3964af7d landed (2026-04-12 18:18). No unauthorized activity observed.

  3. Why the validator regex and GitHub's detector are identical. The output validator's regex at agent_bench/security/output_validator.py line 23 is \bAIza[0-9A-Za-z_\-]{35}\b — byte-for-byte identical to GitHub's secret-scanning Google API Key detection pattern. This means there is no static test fixture that satisfies the validator's test assertion (the validator must block the input) without also triggering GitHub's push protection. Any replacement with a fixture that matches the validator's regex is immediately re-flagged; any replacement with a fixture that does not match the validator's regex breaks the test assertion. The cleanest resolution is to remove the Google fixture from the static parametrize list entirely and restore Google API key format coverage via a runtime-generated fixture that constructs a 35-char AIza-prefixed string at test time and never lands as a literal in source code. Tracked as a parallel-tracks item. The output validator's regex is NOT weakened; the test loses one of seven parametrize cases but continues to verify OpenAI, Anthropic, AWS, JWT, and env-var-assignment detection.

  4. Round-2 filter-repo. Ran git filter-repo --replace-text <file> --force with the pattern file containing regex:AIza[A-Za-z0-9_\-]{35}==>AIzaFIXTUREREDACTED. This replaced the Google API key format anywhere it appeared in any historical blob across the entire repository. Every commit from 8ebe3964af7d forward was rewritten, which cascaded through the full post-round-1 history including all round-1-remapped SHAs and tonight's 5 commits. Total commits processed: 186. filter-repo's internal commit-map wrote 152 changed entries and 35 unchanged entries (commits before 8ebe3964af7d that never touched the pattern).

  5. Working-tree fixture removal. After the filter-repo rewrite, tests/test_output_validator.py line 152 read "google says AIzaFIXTUREREDACTED" (15 chars after AIza, below the validator's 35-char regex threshold). Removed the line entirely from the parametrize list and added a block comment explaining the removal, the regex-collision reason, the parallel-tracks item to restore via runtime-generated fixture, and an explicit note that the validator's regex remains unchanged. Committed as a separate new commit on top of the rewritten history.

  6. Round-2 verification sweep. Re-ran the same six-check sweep: git log, git rev-list --all --objects, reflog, fsck, stash, and a precise regex scan across all blobs for the \bAIza[0-9A-Za-z_\-]{35}\b pattern. Zero blobs in the post-round-2 object database contain a 35-char AIza pattern. The scrub is complete across all history.

  7. Round-2 DECISIONS.md SHA remap. The round-1 remap table above uses SHAs 213da36, 125dac0, 5c1f49f, 4454894 as the "NEW (post-rewrite)" column. These are the post-round-2 SHAs; they were e6d9675, c1d8163, 740c9d5, 6d177ba after round 1 and got rewritten again by round 2. To avoid a three-column mapping table showing intermediate round-1 SHAs, the table above reads as a direct pre-rewrite → current-state mapping. The round-1-only intermediate SHAs are preserved in this narrative as "round-1 SHAs" for audit completeness but are not the canonical SHAs anyone looking up a commit should use. The canonical SHAs are the post-round-2 values.

    Additional round-2 SHA update: parallel-tracks item #7 (Fix 2 revert commit SHA missing from the Fix 2 outcome entry) was updated from 8c836f5 (post-round-1) to 27c2e17 (post-round-2).
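The scrub-then-verify logic of steps 4 to 6 can be illustrated in a few lines. This is a sketch over an in-memory dictionary of blob texts, not a reimplementation of git filter-repo; the `scrub` and `scan` helpers are hypothetical, and the key-shaped stand-in is assembled at runtime so that no 35-character literal appears in the example itself.

```python
import re

REPLACE_PATTERN = re.compile(r"AIza[A-Za-z0-9_\-]{35}")        # round-2 replace-text pattern
VALIDATOR_PATTERN = re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b")   # validator / scan pattern
REDACTED = "AIzaFIXTUREREDACTED"


def scrub(blobs: dict[str, str]) -> dict[str, str]:
    """Apply the replacement to every blob, as the history rewrite does per blob."""
    return {name: REPLACE_PATTERN.sub(REDACTED, text) for name, text in blobs.items()}


def scan(blobs: dict[str, str]) -> list[str]:
    """Return the names of blobs that still contain a 35-char AIza match."""
    return sorted(name for name, text in blobs.items() if VALIDATOR_PATTERN.search(text))


# Illustrative stand-in built at runtime; not a real key.
fake_key = "AI" + "za" + "x" * 35
blobs = {
    "tests/test_output_validator.py": f'"google says {fake_key}"',
    "README.md": "no keys here",
}
before = scan(blobs)
after = scan(scrub(blobs))
suffix_length = len(REDACTED) - len("AIza")  # characters after the prefix in the placeholder
```

The placeholder carries 15 characters after the prefix, which is why the scrubbed line no longer satisfies the 35-character pattern.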

Exposure scope, round 2. The branch had been pushed to origin exactly once before round-2 was discovered (the first push at the end of round 1, which landed commit 3167b59 at origin). The feature branch was the only affected ref — main was not updated, and no PR had been merged. The round-2 cleanup requires a force-push with --force-with-lease to overwrite the pushed round-1 history with the round-2 history. Force-push is normally a discipline concern, but here it is safe: the branch was published less than one hour before round-2 was discovered, no other work was based on the pushed round-1 history, and the force-push is scoped to this specific branch (not main or any long-lived ref).

Alert dismissal. GitHub alert #1 was dismissed as false_positive via gh api after the force-push, with the resolution comment noting that the pre-round-2 commit SHA the alert referenced (8ebe3964af7d) no longer exists in the rewritten history and the test fixture has been removed from tests/test_output_validator.py pending a runtime-generated replacement.

Round-2 procedural lesson. The validator-regex ↔ detector-regex identity is a structural finding worth noting for future security test design. Any test fixture that verifies detection of a specific secret format will, by construction, match the format it is testing. If the format is one GitHub (or any upstream detector) also scans for, the fixture will trigger an alert on every push where it is introduced. The three durable mitigations are: (a) generate fixtures at runtime so they never land in source, (b) use an isolated regex that is a proper subset of the production detector's regex so fixtures fall below the detector's match threshold, or (c) mark the file explicitly in a .github/secret-scanning.yml allowlist. This project is adopting option (a) as the follow-up, because it preserves the production detector regex without weakening and keeps the test's fidelity to the actual attack surface.
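Option (a) might look like the following. This is a sketch under stated assumptions: `make_google_key_fixture` is a hypothetical helper name, and the alphabet deliberately leaves out the `-` and `_` characters the regex allows, because a trailing `-` would defeat the final `\b` and make the fixture flaky.

```python
import re
import secrets
import string

# Same pattern the entry quotes for the output validator.
GOOGLE_KEY_RE = re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b")


def make_google_key_fixture() -> str:
    """Build a fresh key-shaped string at call time; nothing key-shaped is stored in source."""
    alphabet = string.ascii_letters + string.digits  # alphanumerics only, so both \b anchors hold
    body = "".join(secrets.choice(alphabet) for _ in range(35))
    return "AI" + "za" + body


fixture = make_google_key_fixture()
sentence = f"google says {fixture}"
```

A test would call the helper inside the test body and pass `sentence` to the validator, so the string exists only in memory during the run and never in a committed file.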

Audit-path bug — streams masked a request-crashing failure (2026-04-15)

During v1 deploy smoke testing, every non-stream POST /ask request and every injection-blocked request on the HF Space returned {"detail": "Internal server error"} (HTTP 500) instead of the intended 200 / 403. Normal queries via POST /ask/stream appeared to work correctly from the dashboard, so the bug was invisible until a direct curl hit a non-stream endpoint. Discovered and fixed same-day.

Root cause. agent_bench/security/audit_logger.py:60 called self.path.parent.mkdir(parents=True, exist_ok=True) the first time a request wrote an audit record. The default path logs/audit.jsonl resolved to /home/user/app/logs/audit.jsonl at runtime. The Dockerfile's WORKDIR /home/user/app creates that directory as root:root mode 0755 because Docker's WORKDIR directive does not honor --chown. Subsequent COPY --chown=user lines only change ownership of the copied files, not the directory itself. At runtime under USER user (uid 1000), the process had r-x on WORKDIR — it could read and execute, but not create new subdirectories. First audit write → PermissionError: [Errno 13] Permission denied: 'logs'.

Why .cache/ worked and logs/ didn't. .cache/ is created at build time by the two RUN python scripts/ingest.py ... steps (running as root) and then explicitly chowned by RUN chown -R user:user .cache/. logs/ was created lazily at runtime by the audit logger, as a non-root user, in a directory owned by root. Directory-creation permission wall.

Why the streaming endpoint masked it. /ask/stream calls _write_audit at the end of the event generator (routes.py:438), after all stage events and the final answer have already been yielded over SSE. A failure there is already too late to affect the client's view of the response — the client sees the answer, the stream ends, and the audit entry is silently missing. Non-stream /ask calls _write_audit synchronously before return (line 263), so the failure propagates up through the middleware exception handler and becomes a visible 500. Injection-blocked requests on both endpoints also audit synchronously before returning (lines 193, 302) and produce the same visible 500. Normal dashboard use hits /ask/stream → symptom invisible; smoke testing non-stream /ask → symptom obvious.

Traceback confirmed from the HF Space runtime log:

File "/home/user/app/agent_bench/serving/routes.py", line 302, in ask_stream
    _write_audit(
File "/home/user/app/agent_bench/serving/routes.py", line 558, in _write_audit
    audit_logger.log(record)
File "/home/user/app/agent_bench/security/audit_logger.py", line 60, in log
    self.path.parent.mkdir(parents=True, exist_ok=True)
PermissionError: [Errno 13] Permission denied: 'logs'

Two-fix patch. Both applied in the same deploy cycle:

  1. Dockerfile (proximate). Commit ca34ccb: RUN mkdir -p logs && chown -R user:user .cache/ logs/ at build time, before USER user. Ensures the runtime user owns the audit-log directory from container start.
  2. AuditLogger.log() (architectural, defense-in-depth). Commit 25e0f1b: wrap the entire write body in try/except Exception, log the failure via structlog as audit_write_failed with the exception type and path, return normally.
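The shape of fix 2 can be sketched as follows. This is not the project's `AuditLogger`: the class name `FailSafeAuditLogger` and the `failures` counter are hypothetical, and the standard-library `logging` module stands in for structlog so the example stays self-contained.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("audit")


class FailSafeAuditLogger:
    """Hypothetical stand-in: audit write failures are reported, never raised."""

    def __init__(self, path):
        self.path = Path(path)
        self.failures = 0  # illustrative counter so callers and tests can observe degradation

    def log(self, record: dict) -> None:
        try:
            # Directory creation is inside the guarded body, since that is the call that failed.
            self.path.parent.mkdir(parents=True, exist_ok=True)
            with self.path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
        except Exception as exc:
            self.failures += 1
            log.error("audit_write_failed type=%s path=%s", type(exc).__name__, self.path)
```

One way to exercise the failure path without depending on file ownership is to point the logger at a path whose parent is a regular file, which makes `mkdir` fail even when the test process runs as root.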

Why both, not either alone.

  • Only fix 1: unblocks the Space but leaves the architectural bug — an audit logger that can crash the app is misdesigned regardless of filesystem permissions. The SECURITY.md OWASP LLM10 writeup would be structurally weak.
  • Only fix 2: stops the crashes, but audit writes would still silently fail on HF Spaces because logs/ remains unwritable. The OWASP LLM10 claim would be live-untrue on the demo surface.
  • Both: demo is honest (audit log actually written on HF Space), architecture is sound (audit failures degrade gracefully), and the OWASP LLM10 writeup has live evidence backing both claims.

Verification on the live Space after the patch. GET /health flipped from "degraded" to "healthy" with provider_available: true; POST /ask normal returned 200 with grounded answers citing real sources from both corpora (was 500); POST /ask with the ignore_previous injection pattern returned 403 "Request blocked: potential prompt injection detected" (was 500); the runtime log shows injection_detected pattern=ignore_previous tier=heuristic followed by a 403 status and no audit_write_failed entries, confirming the audit write succeeded end-to-end.

Methodology lesson — streaming endpoints can hide backend failures. Any request-path side effect (audit logging, metrics, cache write, persistence) that runs at the end of a streaming generator will fail silently from the client's perspective, because content is already committed to the wire before the failure fires. Testing such side effects against their failure modes through a streaming happy path is not sufficient — the happy path masks the failure. Every request-path side effect needs at least one test that verifies it fails loudly (or, where the design says graceful, degrades loudly) when its underlying dependency is broken, exercised synchronously, not via the streaming code path.
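The masking effect is easy to reproduce in isolation. The sketch below uses plain functions rather than the FastAPI routes; `ask_sync`, `ask_stream`, `broken_audit`, and `consume` are hypothetical names, and the generator stands in for the SSE event stream.

```python
def broken_audit() -> None:
    """Simulate the unwritable log directory."""
    raise PermissionError(13, "Permission denied", "logs")


def ask_sync(answer: str, write_audit=broken_audit) -> str:
    write_audit()              # side effect runs before any response exists
    return answer


def ask_stream(answer: str, write_audit=broken_audit):
    for token in answer.split():
        yield token            # once yielded, the chunk is already with the client
    write_audit()              # side effect runs after the last chunk


def consume(stream) -> tuple[list[str], bool]:
    """Collect chunks the way a client would, recording whether the stream ended in error."""
    received, failed = [], False
    try:
        for chunk in stream:
            received.append(chunk)
    except PermissionError:
        failed = True
    return received, failed


received, stream_failed = consume(ask_stream("pods are ephemeral"))
try:
    ask_sync("pods are ephemeral")
    sync_failed = False
except PermissionError:
    sync_failed = True
```

Both paths hit the same failure, but the streaming consumer already holds the complete answer when it fires, whereas the synchronous caller gets nothing back.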

For this project, AuditLogger.log() should gain a unit test that asserts it returns normally (not raises) when its target directory is unwritable — i.e., a regression test for the try/except fix above. And the dependency injection of the audit logger in the route handlers should get an integration test that exercises both the streaming and non-streaming paths against the same failure injection, to assert the streaming path does not hide what the non-streaming path surfaces. Tracked as a v1.1 test-coverage follow-up.

Why this wasn't caught locally. Existing tests/test_audit_logger.py tests use pytest's tmp_path fixture, which is always writable by the test process. The failure mode only manifests when the runtime user cannot create the log directory, a deployment-environment condition not reproducible in a unit test against a local filesystem owned by the developer. An integration test that runs the Docker image end-to-end and hits /ask against the actual container filesystem would have caught it pre-deploy. Such a test is out of scope for v1 (adds ~5 min to CI plus Docker build infrastructure) but is the right long-term mitigation for this class of bug.

LLM-judge layer supersession — discrete-anchored 2-judge jury replaces continuous-score single-call

The continuous-score single-call judges in agent_bench/evaluation/metrics.py (answer_faithfulness, answer_correctness, _judge_call) are deleted and replaced by the per-dimension Judge layer at agent_bench/evaluation/judges/. Hard cut, no deprecation cycle.

Design doc: docs/plans/2026-05-04-judge-layer-v1-design.md.

Why this is a supersession, not a refactor. The new layer differs from the old on six axes: discrete-anchored scale (vs continuous 0–1), reasoning-before-score JSON ordering (vs score-first), per-dimension judges (vs combined faithfulness/correctness), full provenance per call (judge_id + rubric_version + system_output_hash + prompt_seed; old had none), composable variance wrappers (rubric_permute, jury — old was single-call), and an intentional abstain-vs-raise discipline (vs silent None from a bare except Exception).
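Two of those axes, reasoning-before-score ordering and per-call provenance, can be pictured with a small record type. This is an illustrative sketch only: the `JudgeVerdict` class, the use of SHA-256 for `system_output_hash`, and the example values are assumptions, not the layout of the actual Judge layer.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeVerdict:
    # Field order is the serialization order, so reasoning precedes score.
    reasoning: str
    score: Optional[int]       # None models an explicit abstain
    judge_id: str
    rubric_version: str
    system_output_hash: str
    prompt_seed: int


def hash_output(system_output: str) -> str:
    """Assumed hashing scheme: SHA-256 over the UTF-8 bytes of the judged output."""
    return hashlib.sha256(system_output.encode("utf-8")).hexdigest()


verdict = JudgeVerdict(
    reasoning="Both required points are covered in paraphrase.",
    score=2,
    judge_id="example-judge",          # hypothetical identifier
    rubric_version="v1",
    system_output_hash=hash_output("A ConfigMap stores non-confidential key-value data."),
    prompt_seed=0,
)
payload = json.dumps(asdict(verdict))
```

Because every record carries its judge, rubric version, output hash, and seed, two verdicts can be compared or de-duplicated without access to the run that produced them.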

Evidence backing the supersession claim — the calibration κ table quantifies the new layer's agreement with hand-labels across 6 ablation rows (baseline + 3 variance ablations + permute + 2-judge jury). The files defending this entry's claim, by file path:

  • measurements/2026-05-04-judge-calibration-labels.jsonl: 30 items × 3 dimensions hand-labeled (UK AISI bio/chem κ ~0.8 cited as the literature ceiling). Lands in Phase 10.
  • results/calibration_v1_judge_baseline.json, _baseline_no_cot.json, _baseline_no_anchors.json, _baseline_no_abstain.json, _permute.json, _jury_kappa_weighted.json: per-row predictions. Land in Phase 11.
  • docs/_generated/kappa_table.md: generated κ ablation table copy-pasted into the writeup. Lands in Phase 11.
  • docs/judge-design.md: interpretive writeup with the closing "when NOT to use LLM-judge" position. Lands in Phase 12.

Config-knob preservation. evaluation.judge_provider is unchanged across all 5 YAML configs; new evaluation.judge_dimensions field defaults to the three v1 dimensions. Zero user-facing config migration.

Out of scope (v1.1+). Mistral self-hosted as the third jury member, Langfuse self-host, dual-pass intra-rater calibration, DSPy/GEPA/MIPROv2 prompt optimization, citation_faithfulness in the default judge_dimensions, AC2 sympy-derived parity tests.

Opus stress-test surfaced groundedness rubric-scope drift before the κ ablation ran — 2026-05-05

The Opus stress-test pass over the 30 calibration items × 3 dimensions disagreed with the single-rater human gold on 22 of 30 groundedness items (8/30 agreement). Relevance and completeness agreed at 28/30 and 25/30 respectively. The groundedness disagreement is consistent in direction — every disagreed-on item is human=1, opus=0 — and has a single root cause.

Root cause: reference-scope drift between rubric author and labeler. agent_bench/evaluation/rubrics/groundedness.md defines the reference scope as the gold snippets attached to each item:

The judge sees only the gold snippets — not the retrieved chunks. A claim that happens to be true in the world but is not entailed by the snippets fails groundedness.

The single-rater notes on the disagreed-on items describe checking against the broader documentation, not against source_snippets: "supported by the corpus", "supported by the docs", "supported by the provided dependency snippet". For items like k8s_006 the gold snippet is one sentence ("A ConfigMap is an API object used to store non-confidential data in key-value pairs"), while the agent's answer correctly synthesizes seven or eight additional claims from the full k8s_configmap.md. Those claims are true in the world and well-supported by the full doc, but not entailed by the one snippet. Opus applied the strict-snippet rubric; the human rater applied a corpus-supported rubric.

Why this blocks make calibrate against the current gold. The κ ablation compares Haiku and GPT-4o-mini judges against the human gold. A judge that correctly applies the strict-snippet rubric will disagree with miscalibrated gold; a judge that's too lenient will agree. The ablation rewards leniency and punishes rigor — the opposite of the intended measurement. This is the same failure mode codified earlier in this document under "Fix 2 outcome" and elsewhere: tuning sweeps tune compensation when the measurement is wrong, not the intended effect.

Why the rubric stays as written, not relaxed to "corpus-supported". Strict-snippet groundedness measures RAG behavior: did the agent synthesize from what it retrieved? Corpus-supported groundedness measures LLM general knowledge passing through a RAG harness: did the agent happen to be correct? The first is what this benchmark is for; the second is what agent_bench/evaluation/metrics.py measured before supersession. Relaxing the rubric to "corpus-supported" would silently re-introduce the failure mode the supersession entry above just removed.

Decision — three-step correction lands before make calibrate runs:

  1. Rubric clarification commit on agent_bench/evaluation/rubrics/groundedness.md. Add an explicit reference-scope line and one anchored example contrasting "supported by the snippet" vs "true in the world but not in the snippet". Audit-trail requirement: the v1.1 writeup will cite "rubric clarified between v1.0 and v1.1", and the git history needs to back that claim.
  2. Re-label the 22 disagreed-on groundedness items in measurements/2026-05-04-judge-calibration-labels.jsonl against the clarified rubric, snippet-only. Do not mechanically copy Opus's labels. The labels remain the human single-rater's; what changes is the rubric being applied. Mechanical copy would turn the κ table into "judge vs Opus", which is not what the writeup claims it measures.
  3. Recompute make calibrate against the corrected gold and emit docs/_generated/kappa_table.md from the v1.1 labels.

Evidence files for the v1.1 writeup section:

  • measurements/2026-05-05-judge-rubric-opus-stress.jsonl — 90 Opus labels (claude-opus-4-7, serialized to stay under the 30K input-tok/min org rate limit, ~$0.20, ~14 min wall, zero infra-abstains).
  • measurements/2026-05-04-judge-calibration-labels.jsonl — original v1.0 single-rater gold; will be diffed against v1.1 corrected gold to quantify the re-label delta.
  • agent_bench/evaluation/rubrics/groundedness.md — pre/post diff is the rubric clarification.

Pre-labeling observations also worth recording for the writeup methodology section:

  • q021 (fastapi · calculation) answered the CORS preflight question correctly (600 / 60 = 10 minutes) with sources: [] and ranked_sources: [] — the agent did the arithmetic without retrieval and emitted an answer consistent with the snippet without having retrieved it. Methodologically interesting for the citation-faithfulness story (Block 2.7) if it ships: an answer can be correct without being grounded-by-citation.
  • q025 (fastapi · multi_hop) answer was truncated mid-token by the orchestrator's max_tokens limit. The labels reflect what the system produced, not a mentally-patched complete version. The completeness rubric does not currently anchor "truncated response" as a level — v1.1 rubric work should add an anchor.
  • Several K8s items embed external knowledge that's correct but not in the snippet phrasing (k8s_017 mentions exit-code-0 for init-container success; k8s_009 describes Roles vs ClusterRoles by their semantics). The clarified groundedness rubric should pick strict on this case (claim must be supportable by the retrieved spans, not just consistent with them) and the anchored example should show that ruling.

Methodology framing for the writeup. The Opus stress-test was added specifically to catch hand-labeled-gold fragility before the κ table is published. It caught it. The writeup's calibration section should disclose the rubric clarification, quantify the re-label delta on groundedness, and report κ against the v1.1 corrected gold — that is a more credible story than a first-try clean κ table would have been.

Outcome — 2026-05-05 calibrate run on v1.1 gold. All 6 ablation rows ran cleanly after three coupled production-code fixes that landed on the same branch as the rubric clarification: (1) markdown fence stripping in agent_bench/evaluation/judges/base.py::_strip_markdown_fence because Haiku 4.5 wraps JSON output in ```json ... ```, (2) max_tokens 512 → 1024 because v1.1 anchored examples elicit longer model reasoning, (3) calibration runner v1.0 omitted item_id from prediction records; fixed in v1.1 with backfill of the 6 already-written row files via hash → item_id map (no re-spend). Probe-one-cell-before-sweep saved a fourth $0.50 wasted run after the fence-strip change — the methodology note in feedback_judge_probe_before_sweep.md was earned by this session's two failed full-row attempts that paid ~$1.15 for unparseable output before the diagnosis converged.
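Fix (1) can be approximated as below. This is a sketch of the idea rather than the body of `_strip_markdown_fence`; the regex and the fall-through behavior for unfenced text are assumptions, and the fence marker is built at runtime only so the example nests cleanly inside a fenced block.

```python
import json
import re

FENCE = "`" * 3  # three backticks, assembled at runtime

_FENCE_RE = re.compile(
    rf"^\s*{FENCE}[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*{FENCE}\s*$",
    re.DOTALL,
)


def strip_markdown_fence(text: str) -> str:
    """Return the fenced body if the whole text is one fenced block, else the text stripped."""
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()


# A judge reply wrapped in a json-tagged fence, as described for Haiku 4.5.
raw = FENCE + 'json\n{"reasoning": "covers both points", "score": 2}\n' + FENCE
parsed = json.loads(strip_markdown_fence(raw))
```

Leaving unfenced input untouched keeps the helper safe to apply to every provider's output, not just the one that adds fences.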

The κ table at docs/_generated/kappa_table.md (regenerated on 2026-05-05 with AC1 for groundedness and relevance, Cohen's κ for completeness — see report.py _DIM_METRIC) shows three findings that the writeup interprets rather than reports verbatim:

v1.1 finding 1 — relevance is not "judges fail" territory. Cohen's κ = 0 across 5/6 rows is a prevalence degeneracy on the 29×score=2 + 1×score=1 gold; raw agreement is 96–100%, AC1 is 0.96–1.00. AC1 is the load-bearing statistic on relevance and groundedness; both metrics agree on completeness where the gold (23×2 / 5×1) is balanced.
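The prevalence degeneracy can be reproduced with a few lines of arithmetic. The sketch below uses the gold distribution stated above (29 items at score 2, one at score 1) and a hypothetical judge that scores every item 2; it is not the actual prediction data. AC1 is computed with Gwet's two-rater formula over the three points of the scale.

```python
from collections import Counter


def cohen_kappa(gold, pred, categories):
    """Cohen's kappa: chance agreement from the product of the two raters' marginals."""
    n = len(gold)
    po = sum(g == p for g, p in zip(gold, pred)) / n
    g_counts, p_counts = Counter(gold), Counter(pred)
    pe = sum((g_counts[k] / n) * (p_counts[k] / n) for k in categories)
    return 0.0 if pe == 1.0 else (po - pe) / (1.0 - pe)


def gwet_ac1(gold, pred, categories):
    """Gwet's AC1: chance agreement from the averaged marginals, scaled by category count."""
    n = len(gold)
    q = len(categories)
    po = sum(g == p for g, p in zip(gold, pred)) / n
    g_counts, p_counts = Counter(gold), Counter(pred)
    pi = [(g_counts[k] + p_counts[k]) / (2 * n) for k in categories]
    pe = sum(p * (1.0 - p) for p in pi) / (q - 1)
    return (po - pe) / (1.0 - pe)


gold = [2] * 29 + [1]      # distribution stated in the finding
pred = [2] * 30            # hypothetical judge that always scores 2
raw_agreement = sum(g == p for g, p in zip(gold, pred)) / len(gold)
kappa = cohen_kappa(gold, pred, categories=[0, 1, 2])
ac1 = gwet_ac1(gold, pred, categories=[0, 1, 2])
```

With a constant judge, Cohen's chance term equals the observed agreement, so κ collapses to zero even though 29 of 30 items agree, while AC1 stays at roughly 0.966.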

v1.1 finding 2 — no_cot completeness agreement is real, not selective abstain. AC1 = κ = 1.000 at n=24. The 2 absent cells (q021, k8s_012) are infrastructure abstains (provider rate-limit retry exhaustion), both gold=2, neither in baseline's disagreement set. On the 24 scored cells, all 4 baseline-with-CoT disagreements (3× gold=2 scored 1 by CoT-judge, 1× gold=1 scored 2) flip to agreement when CoT is removed. The interview-relevant claim is the opposite of the conventional CoT-helps story: CoT-before-score on 3-point completeness lets the judge over-emphasize partial coverage and rationalize 1 when the human gold sides with the holistic "covers the points" reading.

v1.1 finding 3 — jury_kappa_weighted underperformed baseline on completeness, with a precise mechanism. Per-member analysis from results/calibration_v1_judge_jury_kappa_weighted_members.jsonl: Haiku-4.5 alone reaches κ = 0.416 / AC1 = 0.792 / raw 84.6%; gpt-4o-mini-2024-07-18 alone reaches κ = 0.020 / AC1 = 0.006 / raw 26.9% — systematically harsh on the 3-point scale, almost never scoring 2. Jury aggregate κ = 0.014 / AC1 = 0.016 / raw 26.9% — matches gpt-4o-mini alone exactly because the jury verdict reduces to gpt-4o-mini's verdict on every disputed cell.

The mechanism is missing-weight + round-down compounding, not weighted voting in the usual sense. scripts/run_calibration.py::_load_weights_from_baseline is a documented v1 stub that returns weight = 1.0 for every judge_id present in baseline. baseline.json contains only Haiku, so Haiku gets 1.0 from the stub and gpt-4o-mini gets 1.0 from jury.py's missing-key fallback (with a logged jury_missing_weight_fallback_to_one warning per call). Equal weights make disputed (Haiku=2, gpt=1) cells produce a weighted mean of 1.5; the _discretize_mean rule is frac > 0.5 → ceil else floor, and 0.5 > 0.5 is false, so 1.5 floors to 1. gpt-4o-mini's verdict wins every disputed cell. The v1 design doc's risks subsection listed "jury κ worse than the better individual judge — (a) kappa-weighting wrong, or (b) worse judge drags mean" as a tracked risk; v1.1 fired both branches simultaneously: branch (a) because the weighting is a stub returning equal weights, and branch (b) because round-down at exact 0.5 ties hands the verdict to the lower-scoring member.
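The arithmetic of a disputed cell is small enough to run directly. The sketch below restates the rule as described in this entry; `weighted_mean` and `discretize_mean` are illustrative reimplementations with hypothetical member keys, not the code in jury.py.

```python
import math


def weighted_mean(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weight-normalized mean of the members' scores."""
    total = sum(weights[judge] for judge in scores)
    return sum(weights[judge] * score for judge, score in scores.items()) / total


def discretize_mean(mean: float) -> int:
    """Rule as described: round up only when the fractional part strictly exceeds 0.5."""
    frac = mean - math.floor(mean)
    return math.ceil(mean) if frac > 0.5 else math.floor(mean)


disputed = {"haiku": 2, "gpt-4o-mini": 1}
equal_weights = {"haiku": 1.0, "gpt-4o-mini": 1.0}   # stub weight plus missing-key fallback

mean = weighted_mean(disputed, equal_weights)          # 1.5
verdict = discretize_mean(mean)                        # fractional part 0.5 is not > 0.5
```

In this two-member sketch the tie rule decides the verdict only when the weights are exactly equal; any difference in weights moves the mean off the 1.5 boundary toward the heavier member.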

The deeper structural point is that weighting alone cannot rescue a systematically miscalibrated member. Even held-out validation that correctly assigned gpt-4o-mini's true low weight on completeness would still let it dominate disputed ties unless its weight were driven near zero — and at that point exclusion is more honest than near-zero inclusion. The conservative-on-binary "ties to lower" rule also doesn't transfer cleanly to ordinal scales: on completeness, "conservative" means scoring toward incomplete, which is precisely the direction of gpt-4o-mini's bias.

v1.2 fix list (four items, expanding the earlier two-item list):

  1. Held-out jury weights. Replace the _load_weights_from_baseline stub with a real κ-derived computation, evaluated on a held-out validation set — not the same calibration row whose κ is being measured against the gold. Closes the circular-weighting hole.
  2. Symmetric member coverage in the weights source. Missing-member fallback to weight = 1.0 amplifies an unweighted member rather than suppressing it. Either every jury member must have a weight in the source file or the run must abort. The jury_missing_weight_fallback_to_one warning fired loudly on every call this run; in v1.2 it should be a hard error.
  3. Per-dimension member exclusion when individual κ falls below a threshold. gpt-4o-mini at κ = 0.020 on completeness should not be in the completeness jury at all. Weights below a floor (suggested κ < 0.2) should be treated as exclusion, not as small-weight inclusion. Held-out validation fixes circular weighting; it does not fix systematic member bias.
  4. Per-dimension tie-break rule. v1's _discretize_mean rule (ties to lower) was selected for conservative behavior on binary scales, where "conservative" means scoring 0 on uncertainty. On 3-point completeness, "conservative" means scoring toward incomplete, which interacts badly with member miscalibration. v1.2 should select the tie-break rule per-dimension based on the rubric's conservative direction, not globally.
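Items 2 and 3 could be combined in one weight-resolution step, sketched below. The function and exception names are hypothetical, the per-member κ values are the ones reported in finding 3, and the 0.2 floor is the suggested value from item 3 rather than a settled choice.

```python
class MissingJuryWeightError(KeyError):
    """Raised when a jury member has no entry in the weights source."""


def resolve_jury_weights(members, kappa_by_member, floor=0.2):
    """Return weights for the members that stay in the jury for one dimension."""
    missing = [m for m in members if m not in kappa_by_member]
    if missing:
        # Item 2: a missing weight aborts the run instead of falling back to 1.0.
        raise MissingJuryWeightError(f"no weight for jury members: {missing}")
    # Item 3: members below the floor are excluded, not kept at a small weight.
    kept = {m: kappa_by_member[m] for m in members if kappa_by_member[m] >= floor}
    if not kept:
        raise ValueError("every jury member fell below the exclusion floor")
    return kept


completeness_kappa = {"haiku": 0.416, "gpt-4o-mini": 0.020}
completeness_jury = resolve_jury_weights(["haiku", "gpt-4o-mini"], completeness_kappa)
```

Calling the same function once per dimension would let a member stay in the jury for one dimension while being excluded from another.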

Evidence files: docs/_generated/kappa_table.md (regenerated with AC1 for groundedness/relevance, κ for completeness); results/calibration_v1_judge_jury_kappa_weighted_members.jsonl (per-member sidecar where the gpt-4o-mini completeness bias is visible per item); results/calibration_v1_judge_baseline.json (weights source — note the absence of any gpt-4o-mini-2024-07-18 entries, which is why the missing-weight fallback fires).

v1.1 jury rescue — sharpened diagnostic + pre-committed A+B success criteria

Date: 2026-05-06. Status: in-flight; this entry is the pre-experiment contract that pins down what counts as success before the re-aggregation runs, so the outcome can't be negotiated post-hoc.

Sharpened diagnostic — extraction-vs-reasoning split, not just "model is biased". Re-reading the per-member sidecar (item-level, not aggregate) on the gpt-4o-mini completeness disputes shows a more specific failure mode than "harsh on 3-point". On the three representative gold=2 / Haiku=2 / gpt=1 cases (q006, k8s_002, k8s_018), gpt-4o-mini's evidence_quotes field correctly extracts the paraphrased coverage from the agent answer — and then its reasoning field denies that those very quotes constitute coverage. k8s_002 is the cleanest instance: the model quotes the strings "declarative updates" and "sticky identity" into evidence, then writes "the answer does not explicitly mention 'declarative updates' and 'sticky identity'". The score follows the reasoning, not the evidence. The mechanism is that the model's post-extraction reasoning step applies a literal-string-match standard to the answer text while the rubric requires "paraphrase allowed" — i.e., the structured-output discipline forced an extraction step that the reasoning step then contradicted on autopilot. This is a known failure mode in chain-of-thought judges and shows up more in smaller models because the reasoning step has less capacity to integrate the rubric's instruction with the literal-text comparison the model is running by default. The artifact for the writeup is measurements/2026-05-06-gpt4o-extraction-reasoning-split.md (three side-by-side reasoning + evidence_quotes excerpts).
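A rough consistency check for this split might look like the following. It is a heuristic sketch, not a tool from this repository: the `denied_quotes` helper and its list of negation markers are assumptions, and the inputs are the k8s_002 strings quoted above.

```python
import re

# Illustrative negation phrases; a real check would need a broader list.
NEGATION_MARKERS = ("does not", "doesn't", "no mention", "not mention", "fails to")


def denied_quotes(evidence_quotes: list[str], reasoning: str) -> list[str]:
    """Return quotes the judge extracted as evidence and then denied in its reasoning."""
    denied = []
    for sentence in re.split(r"(?<=[.!?])\s+", reasoning):
        lowered = sentence.lower()
        if not any(marker in lowered for marker in NEGATION_MARKERS):
            continue
        for quote in evidence_quotes:
            if quote.lower() in lowered and quote not in denied:
                denied.append(quote)
    return denied


evidence = ["declarative updates", "sticky identity"]
reasoning = "the answer does not explicitly mention 'declarative updates' and 'sticky identity'"
contradictions = denied_quotes(evidence, reasoning)
```

A non-empty result flags a record where the score is likely to follow the reasoning field rather than the extracted evidence, which is the pattern worth surfacing item by item in the sidecar.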

Pragmatic v1.1 weights-source decision. The v1.2 fix-list above specifies a held-out validation set for jury weights, which is methodologically clean but requires either splitting N=30 (loses statistical power on both halves) or labeling more items (eats interview prep time). v1.1 takes the pragmatic route: weights computed from the same calibration set used for κ reporting, with the circularity flagged in the writeup. Reason: (a) the alternative is splitting N=30, (b) the per-member κ values used as weights are internally consistent, (c) v1.2 will use a held-out 20-item set. The writeup will contain a sentence acknowledging the circularity rather than hiding it.

v1.1 elevated fix-list (subset of the v1.2 list above). Item 2 (symmetric coverage / hard-error) is elevated unconditionally. Item 1 (real κ-derived weights) is elevated in pragmatic form (same set with circularity caveat). Items 3 (per-dimension exclusion) and 4 (per-dimension tie-break) remain v1.2 unless B's outcome forces them up.

Pre-committed B success criteria. Plan B is "re-aggregate the existing 164 member-rows in calibration_v1_judge_jury_kappa_weighted_members.jsonl with corrected κ-derived weights, no new API spend." The outcome maps deterministically to one of three predefined responses, picked before B runs:

  • Outcome 1 — jury κ on completeness exceeds Haiku-baseline κ by ≥ 0.05 (i.e., new jury κ ≥ 0.466, vs Haiku-alone 0.416). Writeup story: "v1's weights-source bug masked correct aggregation; once both bugs (asymmetric coverage + missing-weight fallback) are fixed, the jury improves on baseline. Per-dimension exclusion remains a v1.2 design pattern but is not needed at v1.1." This is the strong story.
  • Outcome 2 — jury κ within ±0.05 of Haiku-baseline (i.e., 0.366 ≤ jury κ ≤ 0.466). Writeup story: "weights-source fix recovers parity but the jury isn't doing meaningful work on completeness; gpt-4o-mini's near-zero weight makes it effectively excluded by aggregation. This is soft exclusion via weighting; v1.2 will make exclusion explicit." Defensible but less clean.
  • Outcome 3 — jury κ falls below Haiku-baseline κ by >0.05 (i.e., jury κ < 0.366). Writeup story: "weights-source fix is necessary but not sufficient; even at near-zero weight gpt-4o-mini's verdict tips disputed (1, 2) ties due to the round-down rule. v1.1 escalates to per-dimension exclusion." Item 3 of the v1.2 fix-list moves into v1.1.
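The three-way mapping above can be sketched as a small function, which is one way to keep the pre-commitment mechanical. `classify_b_outcome` and its parameter names are hypothetical. The sketch also assumes that a jury κ exactly 0.05 above baseline counts as Outcome 1, because the two stated ranges both include that boundary.

```python
def classify_b_outcome(jury_kappa: float,
                       baseline_kappa: float = 0.416,
                       margin: float = 0.05) -> int:
    """Map a jury kappa on completeness to pre-committed outcome 1, 2 or 3."""
    delta = jury_kappa - baseline_kappa
    if delta >= margin:
        return 1  # jury beats Haiku-alone by at least the margin
    if delta >= -margin:
        return 2  # parity band: soft exclusion via weighting
    return 3      # jury is worse than baseline: escalate per-dimension exclusion


outcome = classify_b_outcome(0.416)  # jury kappa equal to the baseline
```

A jury κ equal to the Haiku baseline lands in the parity band, so `outcome` is 2.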

Why the predefined-criteria framing matters. "I ran B, looked at the number, decided it was good enough" is the same data with a weaker frame than "I predefined the success criteria before running the experiment, B landed at outcome X, which mapped to predefined response Y". The latter demonstrates evaluation maturity in the writeup; the former invites post-hoc reading of the outcome.

B outcome — 2026-05-06. Plan B re-aggregated the existing 164 sidecar rows with κ-derived weights (Haiku=0.416, gpt-4o-mini=0.020 on completeness; clipped at 0 from raw κ values). Result: jury κ on completeness = 0.416, exactly matching Haiku-baseline. Δ = 0.000; maps to Outcome 2 (soft exclusion via weighting). Per the pre-committed response, v1.1 stops here and writes up; per-dimension member exclusion (item C / v1.2 fix #3) is not escalated to v1.1.

Mechanism, validated empirically. A disputed cell (Haiku=2, gpt=1) with corrected weights aggregates as (2 × 0.416 + 1 × 0.020) / 0.436 = 1.954. The fractional part (0.954) exceeds 0.5, so the round-up rule ceils to 2, giving the correct verdict. v1's two compounding bugs (asymmetric source returning weight=1.0 for Haiku and the missing-key fallback returning 1.0 for gpt-4o-mini) jointly forced equal weights, and equal weights under the same round-up rule produced (2 × 1 + 1 × 1) / 2 = 1.5, whose fractional part is exactly 0.5 (not > 0.5), so the score floored to 1 and gpt's verdict won every disputed cell. The bug fixes recover the right verdict purely mechanically; no judge model behavior changes.
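The arithmetic can be checked with a minimal sketch of the aggregation rule as described here: a weighted mean that rounds up only when the fractional part is strictly greater than 0.5. `aggregate_weighted` is a hypothetical name, not the function in jury.py, and raising on an all-zero weight vector is an assumption of this sketch.

```python
import math


def aggregate_weighted(scores, weights):
    """Weighted mean of member scores; rounds up only when frac > 0.5."""
    total = sum(weights)
    if total <= 0:
        # Assumption of this sketch: no usable member is an error, not a default.
        raise ValueError("all member weights are zero")
    mean = sum(s * w for s, w in zip(scores, weights)) / total
    floor = math.floor(mean)
    return floor + 1 if (mean - floor) > 0.5 else floor


# Disputed cell: Haiku=2, gpt-4o-mini=1.
fixed = aggregate_weighted([2, 1], [0.416, 0.020])  # mean is about 1.954
buggy = aggregate_weighted([2, 1], [1.0, 1.0])      # mean is exactly 1.5
```

With the corrected weights the cell resolves to 2; with the v1 equal weights it resolves to 1, which is the behavior the two bugs produced.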

The empirical reading: the weighting is not doing meaningful work — gpt-4o-mini's near-zero weight effectively excludes it on completeness, and the jury's κ matches Haiku-alone exactly because Haiku's verdict wins every disputed cell. This is "soft exclusion via weighting"; v1.2's explicit per-dimension exclusion (item 3 of the v1.2 fix-list) makes the exclusion visible in the jury config rather than emergent from κ-derived weight collapse.

v1.1 code changes (this commit):

  • agent_bench/evaluation/variance/jury.py — silent missing-weight fallback to 1.0 → hard ValueError. Two existing tests that asserted the old contract (test_kappa_weighted_reasoning_reports_applied_weights_not_dict, test_kappa_weighted_logs_warning_on_missing_weight) updated to assert the new contract.
  • scripts/run_calibration.py::_load_weights_from_baseline → _compute_kappa_weights — replaces the v1 stub with real per-judge Cohen's κ on the dimension; hard-errors when any expected member is missing from the source. Clips κ < 0 to weight = 0 (soft exclusion).
  • configs/calibration/rows/jury_kappa_weighted.yaml — weights_source re-pointed from calibration_v1_judge_baseline.json (Haiku-only, asymmetric coverage) to calibration_v1_judge_jury_kappa_weighted_members.jsonl (both judges, same calibration set with documented circularity).
  • tests/scripts/test_run_calibration_dispatch.py — two new tests cover _compute_kappa_weights: (a) computes real κ (high-agreement judge → weight=1.0, chance-agreement judge → 0); (b) hard-errors on asymmetric source coverage.
  • results/calibration_v1_judge_jury_kappa_weighted_v1_1.json — new predictions row produced by re-aggregating the existing sidecar offline (no API spend; via scripts/_dev/reaggregate_jury_v1_1.py). docs/_generated/kappa_table.md regenerated with this row alongside the broken v1 row, giving the writeup a clean before/after diff (completeness: 0.014 → 0.416, n=26).
  • measurements/2026-05-06-gpt4o-extraction-reasoning-split.md — the three side-by-side reasoning + evidence_quotes excerpts (q006 / k8s_002 / k8s_018) demonstrating the extraction-vs-reasoning split diagnostic finding.
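The weight contract described in this list (real per-judge κ, negative κ clipped to zero, hard error on a missing member) can be illustrated with a minimal sketch. It is not the code in scripts/run_calibration.py: the function signatures and the dict-of-lists input shape are assumptions, and treating an undefined κ (P_e = 1) as zero is a choice made only for this example.

```python
from collections import Counter


def cohen_kappa(gold, pred):
    """Cohen's kappa for two equal-length label sequences."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equal length")
    n = len(gold)
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[k] * pred_counts[k] for k in gold_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # kappa undefined; treated as zero here (assumption)
    return (p_o - p_e) / (1.0 - p_e)


def compute_kappa_weights(gold, preds_by_member, expected_members):
    """Per-member weights = max(0, kappa); missing members are a hard error."""
    missing = sorted(set(expected_members) - set(preds_by_member))
    if missing:
        raise ValueError(f"weights source is missing members: {missing}")
    return {m: max(0.0, cohen_kappa(gold, preds_by_member[m]))
            for m in expected_members}


gold = [0, 0, 1, 1]
weights = compute_kappa_weights(
    gold,
    {"strong": [0, 0, 1, 1], "chance": [0, 1, 0, 1], "inverse": [1, 1, 0, 0]},
    ["strong", "chance", "inverse"],
)
```

On this toy data the perfectly agreeing member gets weight 1.0, the chance-level member gets 0.0, and the inverted member (κ = -1) is clipped to 0.0.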

The v1.2 fix-list above is unchanged in scope; v1.1 elevates items 1 (pragmatic form) and 2 (full form). Items 3 and 4 remain v1.2.

Plan 3A — recency-positioned paraphrase instruction (pre-committed criteria)

Date: 2026-05-06. Status: in-flight; this entry pins down the hypothesis and success criteria before the experiment runs.

Hypothesis sharpened by the 1A direction-of-bias finding. GPT-4o-mini's completeness disagreements are 17/19 gold=2/pred=1 with zero up-mistakes across 26 items spanning two corpora: direction-aware noise, not balanced random labeling. The model is consistently applying some rule stricter than the rubric requires. The hypothesis under test: that stricter rule is "literal-string match required, paraphrase doesn't count," and the bias is fixable by recency-positioning the rubric's "paraphrase allowed" instruction adjacent to the commit-to-score decision instead of leaving it 500+ tokens upstream in the rubric body.

The intervention is positional, not lexical. The current CompletenessJudge prompt (agent_bench/evaluation/judges/completeness.py) sends the rubric body, then the gold reference, then the system answer, then a one-line "Score this answer..." instruction immediately followed by the JSON schema clause. The rubric body's "paraphrase allowed" clause appears in the introductory paragraphs, hundreds of tokens before the score decision. The intervention adds one sentence between the system answer and the score instruction:

"Note: a paraphrase that captures the same meaning as a gold-answer point counts as covered. Score on content equivalence, not surface form."

This is the recency-positioning hypothesis: the model loses the paraphrase conditioning across the rubric anchors and the reasoning step. Restating the instruction adjacent to the score decision tests whether the bias is positionally correctable.
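A minimal sketch of the prompt ordering under test, assuming the prompt is assembled by joining sections in order. `PARAPHRASE_RECENCY_CLAUSE` is the constant name used in the v1.1.1 change list; `build_completeness_prompt`, `SCORE_INSTRUCTION`, and the argument names are hypothetical, and the score-instruction text is left elided exactly as it is quoted above.

```python
PARAPHRASE_RECENCY_CLAUSE = (
    "Note: a paraphrase that captures the same meaning as a gold-answer "
    "point counts as covered. Score on content equivalence, not surface form."
)
# Placeholder: the real one-line instruction is elided in this log.
SCORE_INSTRUCTION = "Score this answer..."


def build_completeness_prompt(rubric, reference, answer, schema_clause):
    """Assemble sections so the paraphrase clause sits right before scoring."""
    sections = [
        rubric,                     # rubric body (paraphrase note far upstream)
        reference,                  # gold reference
        answer,                     # system answer under evaluation
        PARAPHRASE_RECENCY_CLAUSE,  # the positional intervention
        f"{SCORE_INSTRUCTION} {schema_clause}",
    ]
    return "\n\n".join(sections)


prompt = build_completeness_prompt("RUBRIC", "GOLD", "ANSWER", "Return JSON.")
```

The property that matters is ordering, not wording: the clause must appear after the answer and before the score instruction.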

Selected 5 disputed items (representative of the gold=2 / Haiku=2 / gpt=1 pattern across both corpora): q006, q011, k8s_002, k8s_006, k8s_018. All five are pure paraphrase-coverage cases (the system answer paraphrases the gold's points; Haiku scored 2; GPT-4o-mini scored 1 with the extraction-vs-reasoning split documented in measurements/2026-05-06-gpt4o-extraction-reasoning-split.md).

Pre-committed 3A success criteria.

  • Fixed (≥3/5 shift from 1 → 2): Recency-positioning is sufficient. Re-run GPT-4o-mini on the full 26 disputed items with the corrected prompt, recompute κ, update the writeup table. Story: "rubric-engineering matters more than judge model choice for ordinal scales: recency-positioning the paraphrase instruction recovered N% of disputed items." The completeness story becomes actionable, not diagnostic-only.
  • Partially fixed (1–2/5 shift): Inconclusive at N=5 (binomial-significance line is 3+). Re-run on the full 26 disputed items ($0.20) to get a clean number; write up whatever the full-26 says.
  • Not fixed (0/5 shift): The instruction is being received and ignored; the model can't act on it under reasoning load. Escalate to 4A (GPT-4o full on the same 5 items) to verify the small-model-specific claim. Story: "repositioning the paraphrase instruction adjacent to the score decision did not shift any of 5 disputed items; GPT-4o handled the same prompts. The bias is small-model-specific, not prompt-fixable."

The 3/5 threshold is the binomial-significance line at this N: random shifting under the null produces 0 or 1 changes most of the time. Pre-committing avoids the "2 shifted, that's kind of a fix" negotiation.
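How strong the 3/5 line is depends on how often an item flips between runs with no intervention, which this log does not measure. The sketch below computes the binomial tail for two assumed per-item flip rates; `prob_at_least` is a hypothetical helper and both rates are illustrative assumptions, not observed values.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed null flip rates (not measured in this log).
tail_at_10pct = prob_at_least(3, 5, 0.10)  # about 0.0086
tail_at_20pct = prob_at_least(3, 5, 0.20)  # about 0.058
```

Under a 10% assumed flip rate, seeing 3 or more shifts by chance is under 1%; at 20% it is close to 6%, so the threshold is only as strong as the judge is stable between runs.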

On the 1A relevance finding — confirmed. Both judges essentially correct on every relevance item (Haiku 29/30, GPT-4o-mini 30/30); κ degeneracy is structural under 29/30 prevalence at class-2; AC1 + raw agreement is the right reporting. No further investigation on relevance. Writeup paragraph is one short sentence: prevalence-induced degeneracy → AC1 is load-bearing.

Plan 3A — outcome on the 5-item probe + full-26 re-run (v1.1.1)

Date: 2026-05-06. Status: complete; the v1.1.1 prompt is now permanent in agent_bench/evaluation/judges/completeness.py.

3A 5-item probe: 3/5 disputed items shifted 1 → 2 (q006, q011, k8s_002), 2/5 unchanged (k8s_006, k8s_018). Cost $0.0013. This meets the pre-committed threshold (≥3/5 → "fixed"), so the protocol triggered the full-26 re-run on gpt-4o-mini only (Haiku held as control to make the v1.1 → v1.1.1 delta cleanly attributable to the intervention's effect on the affected judge).

Full-26 re-run (gpt-4o-mini completeness, v1.1.1 prompt):

| Run | n | raw agreement | κ | AC1 |
|---|---|---|---|---|
| v1.1 gpt-4o-mini | 26 | 26.9% | +0.020 | +0.006 |
| v1.1.1 gpt-4o-mini | 28 | 42.9% | +0.000 | +0.232 |
| v1.1 Haiku (control) | 26 | 84.6% | +0.416 | +0.792 |

Per-item delta (v1.1 → v1.1.1): 7 items shifted up (all 1 → 2), 0 shifted down, 19 unchanged. Of the 7 up-shifts, 6 are correct (gold=2 items moving from pred=1 to pred=2: k8s_002, k8s_013, k8s_015, k8s_016, k8s_017, q006) and 1 is a regression (k8s_025: gold=1, was correctly pred=1 in v1.1, now over-credited at pred=2). Net per-item correctness delta: +5 items.
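The tally can be reproduced from the seven shifted items alone. This is a sketch using only the scores stated in this entry; `correctness_delta` and the tuple layout are hypothetical, and the 19 unchanged items are omitted because they contribute zero to the delta.

```python
# item -> (gold, pred under v1.1, pred under v1.1.1), as listed in this entry.
SHIFTED = {
    "k8s_002": (2, 1, 2),
    "k8s_013": (2, 1, 2),
    "k8s_015": (2, 1, 2),
    "k8s_016": (2, 1, 2),
    "k8s_017": (2, 1, 2),
    "q006": (2, 1, 2),
    "k8s_025": (1, 1, 2),
}


def correctness_delta(rows):
    """Return (newly correct items, newly wrong items, net change)."""
    fixed = sorted(i for i, (g, before, after) in rows.items()
                   if before != g and after == g)
    regressed = sorted(i for i, (g, before, after) in rows.items()
                       if before == g and after != g)
    return fixed, regressed, len(fixed) - len(regressed)


fixed, regressed, net = correctness_delta(SHIFTED)
```

Six items become correct, k8s_025 becomes wrong, and the net change is +5.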

Cohen's κ is misleading on this comparison. v1.1.1 raw agreement rose from 26.9% to 42.9% (+16 percentage points), and AC1 rose from 0.006 to 0.232 (38× improvement). But Cohen's κ stayed at ~0, slightly lower than v1.1's 0.020. The mechanism is prevalence-rebalancing in the marginals: gpt-4o-mini's pred distribution shifted from {0:2, 1:19, 2:5} (concentrated at 1) to {0:4, 1:12, 2:12} (more balanced, closer to gold's {1:5, 2:23} over n=28). Cohen's κ = (P_o - P_e)/(1 - P_e); when marginals become more diverse, P_e (chance agreement) rises in lockstep with P_o (observed agreement), and κ deflates. AC1 uses a prevalence-robust chance correction, P_e = (1/(q-1)) Σ_k π_k(1 - π_k), where q is the number of score categories and π_k is the average of the gold and predicted proportions for category k, and reads the actual signal.
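Both statistics can be recomputed from the marginals quoted above plus the observed agreement. The sketch assumes 12 of 28 items agree (the count implied by the reported 42.9%) and a 3-category scale; `kappa_and_ac1` is a hypothetical helper, not part of the calibration code.

```python
def kappa_and_ac1(gold_counts, pred_counts, p_o, q):
    """Cohen's kappa and Gwet's AC1 from marginal counts and observed agreement."""
    n = sum(gold_counts.values())
    labels = set(gold_counts) | set(pred_counts)
    g = {k: gold_counts.get(k, 0) / n for k in labels}
    p = {k: pred_counts.get(k, 0) / n for k in labels}
    # Cohen: chance agreement is the product of the two marginals per class.
    pe_kappa = sum(g[k] * p[k] for k in labels)
    # Gwet: chance agreement from the averaged marginal pi_k per class.
    pi = {k: (g[k] + p[k]) / 2 for k in labels}
    pe_ac1 = sum(pi[k] * (1 - pi[k]) for k in labels) / (q - 1)
    kappa = (p_o - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


# v1.1.1 gpt-4o-mini completeness: gold vs pred marginals over n=28.
kappa, ac1 = kappa_and_ac1({1: 5, 2: 23}, {0: 4, 1: 12, 2: 12}, p_o=12 / 28, q=3)
```

With these inputs Cohen's P_e equals P_o (336/784 = 12/28), so κ is zero, while AC1 comes out at about 0.232, matching the table.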

This is the same trap that motivated AC1 over κ on the relevance and groundedness rows of the original κ table, surfacing here at a different distribution boundary. The κ table footer already explains why per-dimension metric selection matters; v1.1.1's outcome demonstrates the trap induced by the intervention itself.

Effect on the jury aggregate. With κ-derived weights and gpt-4o-mini's v1.1.1 κ at 0 (clipped from +0.000 to weight=0), the jury verdict on completeness is now mathematically equivalent to Haiku-alone on every item (gpt's contribution is multiplied by zero). Jury κ stays at 0.416, identical to v1.1's corrected aggregate. The intervention's per-member improvement is invisible at the jury level under this weighting scheme.

Methodological consequence: v1.2 fix-list addition. The v1.2 fix-list now expands by one item:

  1. Prevalence-robust weights for prevalence-skewed dimensions. v1.1's _compute_kappa_weights uses Cohen's κ for every dimension, which has a self-defeating property on prevalence-skewed gold: improving a member can lower its weight even as it gets more accurate.

    Mechanism. Cohen's κ = (P_o - P_e) / (1 - P_e), where P_e = Σ_k P(gold=k) × P(pred=k) is the chance-agreement term computed from the marginal distributions. P_e is not invariant to the predictor's marginal distribution: when a member's predictions become more diverse (less concentrated at one class), P_e rises as the marginals approach gold's marginals. Concretely: when an intervention moves a member's pred distribution from concentrated-at-one-class toward gold's distribution, P_o and P_e rise together in lockstep. The numerator P_o - P_e stays small, and κ deflates even as raw accuracy improves. This is the same prevalence-induced degeneracy that motivated AC1 over κ on relevance/groundedness rows in the κ table; it surfaces in jury weighting at any distribution-shifting intervention's boundary.

    Empirically observed in v1.1.1. The recency-positioning intervention shifted gpt-4o-mini completeness pred dist from {0:2, 1:19, 2:5} to {0:4, 1:12, 2:12}, closer to gold's {1:5, 2:23} over n=28. Per-cell raw agreement 26.9% → 42.9%. AC1 (Gwet 2008) reads the change correctly: 0.006 → 0.232 (38×). Cohen's κ stays at ~0 (0.020 → 0.000) because P_e is now ≈ P_o ≈ 0.43. v1.1's _compute_kappa_weights clips the new κ at zero, producing weight = 0, and the jury aggregate loses access to a member that was empirically improved. The intervention's per-member improvement is invisible at the jury level under κ-weighting.

    Architectural decomposition for v1.2. The right separation:

    • Per-dimension metric for κ table reporting (already in v1.1 via agent_bench/evaluation/calibration/report.py::_DIM_METRIC).
    • Per-dimension weight metric for jury aggregation (new in v1.2, reuses _DIM_METRIC). Use κ where the gold's prevalence supports it, AC1 where κ degenerates. Same lookup, same per-dimension policy at both reporting and weighting layers.
    • Per-dimension membership as explicit configuration override for members that are structurally inappropriate (v1.2 fix #3, unchanged) — distinct from "low score on the chosen metric," which is handled by the weight floor.

    Why this is non-obvious. A reader's first instinct is that "weight by κ" is a sensible default, since κ is the standard inter-rater statistic. The self-defeating property is invisible until you observe a real intervention that shifts marginals; in static conditions (no intervention, fixed prompts), the κ-weight choice is benign. The v1.1.1 outcome is the first time the agent-bench calibration set has produced an intervention-induced marginal shift on the same gold; the failure mode wouldn't have been visible in v1.0's static calibration sweep.
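A minimal sketch of the per-dimension weight-metric lookup proposed for v1.2. The actual contents of `_DIM_METRIC` are not shown in this log, so the mappings passed in below are illustrative assumptions, and `member_weight` is a hypothetical name. The member metrics are the v1.1.1 gpt-4o-mini completeness values from the table above.

```python
def member_weight(dimension, member_metrics, dim_metric, floor=0.0):
    """Pick the member's weight from the metric configured for this dimension."""
    # An unconfigured dimension raises KeyError rather than silently defaulting.
    metric = dim_metric[dimension]
    return max(floor, member_metrics[metric])


# v1.1.1 gpt-4o-mini on completeness (values from the re-run table).
gpt_mini_completeness = {"kappa": 0.000, "ac1": 0.232}

# Illustrative policies only; the real _DIM_METRIC mapping may differ.
weight_under_kappa = member_weight(
    "completeness", gpt_mini_completeness, {"completeness": "kappa"})
weight_under_ac1 = member_weight(
    "completeness", gpt_mini_completeness, {"completeness": "ac1"})
```

Under a κ policy the improved member still gets weight 0.0; under an AC1 policy it gets 0.232 and contributes to the jury aggregate again.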

v1.1.1 code changes (this commit):

  • agent_bench/evaluation/judges/completeness.py — adds PARAPHRASE_RECENCY_CLAUSE constant, inserted between the system answer and the score instruction. Comment cites the 3A probe.
  • tests/evaluation/test_judges.py::TestCompletenessJudge::test_reference_answer_in_prompt — extends to assert the recency clause appears AND is positioned between the answer and the score instruction (position is load-bearing, not just lexical inclusion).
  • results/calibration_v1_judge_jury_kappa_weighted_v1_1_1_members.jsonl — merged sidecar: v1.1 groundedness/relevance rows (unchanged judges) + fresh v1.1.1 gpt-4o-mini completeness rows + v1.1 Haiku completeness rows.
  • measurements/2026-05-06-3a-paraphrase-recency-probe.jsonl — the 5-item probe artifact with reasoning + evidence_quotes for each.
  • scripts/_dev/probe_3a_paraphrase_recency.py, scripts/_dev/rerun_completeness_v1_1_1.py — reproducers; not part of the production calibration runner.

No changes to the κ table. The jury aggregate κ on completeness is unchanged (0.416 → 0.416) because of the κ-as-weight degeneracy described above; adding a jury_kappa_weighted_v1_1_1 row with identical numbers would be visual noise. The v1.1.1 finding lives in the writeup body, not the table — the per-member AC1 improvement (0.006 → 0.232) is the headline number, surfaced as a separate paragraph next to the κ table rather than inside it.

Total spend through Plan 3A: $0.0013 (3A probe) + $0.0075 (full-26 re-run) = $0.0088.

Plan 4A — GPT-4o (full) on the v1.1.1 residual

Date: 2026-05-06. Status: complete. Run after the writeup-framing review surfaced that v1.1.1's "fixed" verdict was overclaim-prone: 5/19 items were recovered, 14 remained unchanged and uncharacterized. 4A was originally scoped as conditional on 3A not being fixed (per the predefined sequencing rule), but became valuable as a post-3A diagnostic to characterize the residual: is it small-model-specific or rubric-under-specified?

Scope. GPT-4o (gpt-4o-2024-08-06) on 5 of the 14 v1.1.1-unchanged items: k8s_006, k8s_018, q011, q012, k8s_001. Same v1.1.1 production prompt (paraphrase recency clause active). The first two (k8s_006, k8s_018) are the items that didn't shift in the original 3A 5-item probe — we have gpt-4o-mini's reasoning on those items with the v1.1.1 intervention, so 4A gives a clean A/B at fixed prompt varying only the model. q011, q012, k8s_001 cover the broader fastapi/k8s residual surface (k8s_001 also a Haiku miscall — 4A checks whether GPT-4o agrees with gold or with Haiku).

Result: 5/5 correct. All 5 items scored 2 by GPT-4o, matching gold exactly. Cost: $0.0011 reported (caveat: pricing config falls back to gpt-4o-mini rates for unlisted models, so actual cost is closer to $0.005–0.01 — the reported number under-reports by ~5–10×).

Sharpened mechanism — criteria-invention, not just literal-match. The original 3-example artifact (q006, k8s_002, k8s_018) was framed as gpt-4o-mini "applying a literal-string-match standard" while correctly extracting paraphrased coverage into evidence_quotes. 4A's side-by-side reasoning on k8s_018 shows a distinct second mechanism:

  • gpt-4o-mini (v1.1.1, score 1): "It mentions some key points from the reference... but does not explicitly state that the new fields in autoscaling/v2 are preserved as annotations when using autoscaling/v1, nor does it mention the need to use autoscaling/v2 directly for memory or custom metric scaling for a Deployment or StatefulSet."
  • gpt-4o (4A, score 2): "The answer covers all the key points from the reference. It mentions that the current stable version is autoscaling/v2, which supports scaling on memory and custom metrics, similar to the reference. It also notes that autoscaling/v1 only supports CPU-based scaling, aligning with the reference's points."

The reference for k8s_018 specifies three points: (1) autoscaling/v2 is the current stable API, (2) it adds memory metrics support beyond v1's CPU-only, (3) it adds custom metrics support. gpt-4o-mini's reasoning step invents additional criteria the reference does not require ("preserved as annotations when using autoscaling/v1," "use autoscaling/v2 directly for ... a Deployment or StatefulSet") and then deducts against them, scoring 1. GPT-4o reads the reference's three points and scores against exactly those, scoring 2.

This is a capacity finding distinct from the paraphrase-recency finding: gpt-4o-mini's reasoning, even with the v1.1.1 prompt directing it toward paraphrase semantics, manufactures additional gold criteria during scoring that aren't in the reference. Recency-positioning the "paraphrase allowed" clause doesn't address this — the bias isn't "missed paraphrase," it's "invented extra requirements." Two failure modes were stacked; v1.1.1 fixed one; the second is what 4A surfaces.

Implication for v1.2. With 5/5 confirmed, v1.2 fix #3 (per-dimension membership) gets clean empirical support: gpt-4o-mini is the wrong tool for 3-point completeness with paraphrase semantics, and further prompt engineering on this rubric is unlikely to bridge the capacity gap. The right v1.2 path is one of:

  • Exclude gpt-4o-mini from completeness scoring (per-dim membership; jury reduces to single-judge Haiku on completeness; explicit and visible in config).
  • Replace gpt-4o-mini with GPT-4o on completeness (per-dim judge selection; jury keeps two members but the second is a frontier-class model on the dimension that needs it).

Both are defensible v1.2 designs. The choice depends on cost budget: gpt-4o costs roughly an order of magnitude more per call than gpt-4o-mini. For agent-bench's calibration set scale (30 items per row), even gpt-4o is trivially cheap; for production deployment evaluating thousands of agent outputs, the cost trade-off matters more.

4A artifact: measurements/2026-05-06-4a-gpt4o-full-probe.jsonl (per-item reasoning + evidence_quotes for the 5 GPT-4o calls; pairs with the v1.1 sidecar's gpt-4o-mini reasoning on the same items for the side-by-side analysis above).

Updated honest framing for the writeup. "v1.1.1 addressed one identified failure mode (paraphrase-instruction-loss across reasoning, recovered 5/19 disputed items via positional change). 4A confirmed the residual 14 are a distinct failure mode (capacity-limited criteria invention during the reasoning step); GPT-4o handles all 5 sampled residuals at the same v1.1.1 prompt, so the failure is small-model-specific rather than rubric-limited. v1.2 fix #3 (per-dimension judge membership / model selection) is the right escalation; the rubric itself doesn't need changes."

Total session spend: $0.0099 reported (~$0.014–0.019 actual after gpt-4o pricing correction, i.e. $0.0088 through Plan 3A plus the $0.005–0.01 estimate for 4A).